## House Price Prediction
This project predicts house sale prices with a random forest regressor. The model uses six features (LotArea, 1stFlrSF, 2ndFlrSF, BedroomAbvGr, KitchenAbvGr, and FullBath) and is trained on a dataset with 1460 rows and 81 columns. It is validated on a held-out split using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The final model, refit on all training rows, predicts sale prices for the test dataset; the predictions are log-transformed and converted back to the original price scale before being written to the submission file.

The dataset comes from Kaggle.

In [27]:
# loading libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

In [8]:
# Forward slashes avoid the invalid "\h" escape and work on every OS
train_data = pd.read_csv("./house_data/train.csv")

train_data.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [9]:
train_data.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0
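
The `count` row above is below 1460 for `LotFrontage` (1201) and `MasVnrArea` (1452), which indicates missing values in those columns. A minimal sketch of how missing values could be counted per column, using a small made-up frame (`toy` and its values are illustrative stand-ins, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Running the same two lines on `train_data[features]` and `test_data[features]` would show whether the selected model inputs need imputation before fitting.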


In [11]:
# Forward slashes avoid the invalid "\h" escape and work on every OS
test_data = pd.read_csv("./house_data/test.csv")

test_data.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [14]:
print("The size of train data: ", train_data.shape)
print("The size of test data: ", test_data.shape)

The size of train data:  (1460, 81)
The size of test data:  (1459, 80)


In [16]:
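y = train_data.SalePrice  # target

features = ['LotArea', '1stFlrSF', '2ndFlrSF',
            'BedroomAbvGr', 'KitchenAbvGr', 'FullBath']

X = train_data[features]  # features

# Hold out 25% of the rows (the default) for validation.
# No random_state is passed, so the split and the metrics below vary between runs.
train_X, val_X, train_y, val_y = train_test_split(X, y)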
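
`train_test_split` shuffles the rows and holds out 25% of them by default, so repeated runs produce different splits unless a seed is given. A minimal sketch on synthetic data showing that passing `random_state` makes the split repeatable (the seed value 0 and the `X_demo`/`y_demo` arrays are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same seed -> same shuffle -> identical splits.
split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)

train_X_a, val_X_a, train_y_a, val_y_a = split_a
print(train_X_a.shape, val_X_a.shape)       # 75 training rows, 25 validation rows
print(np.array_equal(val_y_a, split_b[3]))  # same validation targets both times
```

Fixing the seed makes validation scores comparable across runs, at the cost of tying them to one particular split.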

In [18]:
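# Random forest with a fixed seed so the fitted trees are reproducible
house_model = RandomForestRegressor(random_state=1)

# Fit on the training split only
house_model.fit(train_X, train_y)

# Predict the held-out validation rows
predictions = house_model.predict(val_X)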

In [20]:
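# Take the square root explicitly: the `squared` argument was deprecated
# and later removed from scikit-learn's mean_squared_error
rmse = np.sqrt(mean_squared_error(val_y, predictions))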
print("Root Mean Squared Error:", rmse)

Root Mean Squared Error: 45063.421031650294


In [22]:
mae = mean_absolute_error(val_y, predictions)

print("Mean Absolute Error: ", mae)

Mean Absolute Error:  29077.32853255846
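
For reference, RMSE is the square root of the mean squared residual and MAE is the mean absolute residual. RMSE weights large errors more heavily and is never smaller than MAE, which is consistent with the two values above. A small sketch with made-up prices (the `actual` and `predicted` arrays are illustrative) that computes both by hand and compares them with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up sale prices and predictions, for illustration only.
actual = np.array([200000.0, 150000.0, 300000.0, 250000.0])
predicted = np.array([210000.0, 140000.0, 260000.0, 250000.0])

errors = predicted - actual  # 10000, -10000, -40000, 0

# Manual definitions of the two metrics.
rmse_manual = np.sqrt(np.mean(errors ** 2))
mae_manual = np.mean(np.abs(errors))

# The same quantities via scikit-learn.
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(rmse_manual, mae_manual)
```

Here the single 40000 error pulls RMSE well above MAE, which is the kind of gap a few badly mispredicted houses would produce.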


In [24]:
# Refit on all training rows so the final model sees every labelled example
full_house_model = RandomForestRegressor(random_state=1)

full_house_model.fit(X, y)

final_predictions = full_house_model.predict(test_data[features])

# Log-transform the predictions with log1p, i.e. log(1 + x)
final_predictions_log = np.log1p(final_predictions)

In [25]:
submission = pd.DataFrame({
    'Id': test_data.Id,
    # np.expm1 is the exact inverse of np.log1p; np.exp would leave every price off by 1
    'SalePrice': np.expm1(final_predictions_log)
})

submission.to_csv('submission.csv', index=False)
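
`np.log1p(x)` computes log(1 + x), and its exact inverse is `np.expm1`; applying `np.exp` to the result returns x + 1 instead of x. A quick check with made-up prices (the `prices` array is illustrative):

```python
import numpy as np

# Made-up prices, for illustration only.
prices = np.array([100000.0, 181500.0, 250000.0])
logged = np.log1p(prices)

restored = np.expm1(logged)   # inverse of log1p: recovers the prices
off_by_one = np.exp(logged)   # exp(log(1 + x)) = x + 1

print(np.allclose(restored, prices))
print(np.allclose(off_by_one, prices + 1))
```

Because this round trip returns the original values, the submitted `SalePrice` column is on the same scale as the model's raw predictions.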