## House Prices Prediction

<p>This notebook implements a solution to the House Prices prediction competition hosted on <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview">Kaggle</a>.</p>

<h5>Datasets used</h5>
<ul style="margin-bottom:20px">
    <li>train.csv - to train and test the regression model</li>
</ul>

<div style="height:20px;text-align:center"><hr/></div>

### Solution

In [3]:
import numpy as np
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [9]:
def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(rmsle, precision)

def standardize_column(col_names, data):
    """
    Standardize the given columns to zero mean and unit variance.

    :param col_names: list of column names to standardize
    :param data: DataFrame containing those columns
    :return: copy of data with the listed columns standardized
    """
    # Work on a copy so the caller's DataFrame (often a slice) is not modified in place
    data = data.copy()
    for col in col_names:
        mean = np.mean(data[col], axis=0)
        std = np.std(data[col], axis=0)
        data[col] = (data[col] - mean) / std

    return data


def split_train_test_data(X, y, split_factor):
    """
    Function to split data in train and test

    :param X: dataset containing feature columns
    :param y: dataset containing target column
    :param split_factor: fraction of the data held out as the test set
    :return: X_train, X_test, y_train, y_test
    """
    return train_test_split(X, y, test_size=split_factor, random_state=0)
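
For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`. The sketch below computes it with NumPy only. The function name `rmsle_by_hand` and the prices are made up for illustration and are not part of this notebook.

```python
import numpy as np


def rmsle_by_hand(y_true, y_pred):
    """Root mean squared logarithmic error, written out step by step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which stays defined when a value is 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Illustrative (made-up) sale prices and predictions
y_true_demo = [100000.0, 200000.0, 300000.0]
y_pred_demo = [110000.0, 190000.0, 330000.0]
print(rmsle_by_hand(y_true_demo, y_pred_demo))
```

Because the error is measured on the log scale, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.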

## 1) Model Training

<p style="margin-top:20px"><strong>Loading the data</strong></p>

In [14]:
dataset = pd.read_csv('../data/train.csv',index_col= 'Id')
dataset.sample(20)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
13,20,RL,,12968,Pave,,IR2,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,144000
195,20,RL,60.0,7180,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,5,2008,WD,Normal,127000
1449,50,RL,70.0,11767,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdWo,,0,5,2007,WD,Normal,112000
320,80,RL,,14115,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,6,2009,WD,Normal,187500
527,20,RL,70.0,13300,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,6,2007,WD,Normal,132000
988,20,RL,83.0,10159,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,4,2010,New,Partial,395192
922,90,RL,67.0,8777,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,9,2008,WD,Normal,145900
1049,20,RL,100.0,21750,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,11,2009,WD,Normal,115000
531,80,RL,85.0,10200,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2008,WD,Abnorml,175000
1400,50,RL,51.0,6171,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,10,2009,WD,Normal,137450


<p style="margin-top:20px"><strong>Extracting features and target variables</strong></p>

In [15]:
X = dataset[['TotalBsmtSF','GrLivArea','GarageCars','GarageArea','HouseStyle','BldgType']]
y = dataset['SalePrice']

<p style="margin-top:20px"><strong>Splitting train-test data</strong></p>

In [20]:
X_train, X_test, y_train, y_test = split_train_test_data(X, y, 0.25)
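
As a quick sanity check on the 25% hold-out, the sketch below runs `train_test_split` on placeholder arrays. It assumes 1460 rows, the total implied by the 1095-row and 365-row outputs printed in this notebook. The `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the outputs in this notebook imply
X_demo = np.arange(1460).reshape(-1, 1)
y_demo = np.arange(1460)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr_demo.shape, X_te_demo.shape)  # (1095, 1) (365, 1)
```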

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [21]:
print(X_train)
X_train = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_train)
print(X_train)

      TotalBsmtSF  GrLivArea  GarageCars  GarageArea HouseStyle BldgType
Id                                                                      
1293          994       2372           1         432     2Story     1Fam
1019          384       1472           2         402       SLvl     1Fam
1214          648        960           1         364       SLvl     1Fam
1431          732       1838           2         372     2Story     1Fam
811          1040       1309           2         484     1Story     1Fam
...           ...        ...         ...         ...        ...      ...
764          1252       2365           3         856     2Story     1Fam
836          1067       1067           2         436     1Story     1Fam
1217            0       1902           2         539     1.5Fin   Duplex
560          1374       1557           2         420     1Story   TwnhsE
685          1195       1839           2         486     2Story     1Fam

[1095 rows x 6 columns]
      TotalBsmtSF  GrLivAr

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [22]:
X_train = pd.get_dummies(X_train, columns=['HouseStyle','BldgType'], prefix=['HouseStyle_is','BldgType_is'])

<p style="margin-top:20px"><strong>Fitting the model</strong></p>

In [23]:
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)

LinearRegression()

## 2) Model Evaluation

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [24]:
print(X_test)
X_test = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_test)
print(X_test)

      TotalBsmtSF  GrLivArea  GarageCars  GarageArea HouseStyle BldgType
Id                                                                      
530          2035       2515           2         484     1Story     1Fam
492           806       1578           1         240     1.5Fin     1Fam
460           709       1203           1         352     1.5Fin     1Fam
280          1160       2022           2         505     2Story     1Fam
656           525       1092           1         264     2Story    Twnhs
...           ...        ...         ...         ...        ...      ...
584          1237       2775           2         880     2.5Unf     1Fam
1246          585       1868           2         477       SLvl     1Fam
1391         1525       1525           2         541     1Story     1Fam
1376         1571       1571           3         722     1Story     1Fam
639           796        796           0           0     1Story     1Fam

[365 rows x 6 columns]
      TotalBsmtSF  GrLivAre
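
The cell above standardizes the test split with its own mean and standard deviation. A common alternative is to reuse the statistics computed on the training split, so that both splits share the scale the model was fitted on. The sketch below shows that approach. The helper names `fit_standardizer` and `apply_standardizer` and the small frames are hypothetical and not part of this notebook.

```python
import pandas as pd


def fit_standardizer(train, col_names):
    """Return {column: (mean, std)} computed on the training data only."""
    # ddof=0 gives the population standard deviation, matching np.std's default
    return {col: (train[col].mean(), train[col].std(ddof=0)) for col in col_names}


def apply_standardizer(data, stats):
    """Standardize a copy of `data` using previously fitted statistics."""
    data = data.copy()
    for col, (mean, std) in stats.items():
        data[col] = (data[col] - mean) / std
    return data


# Made-up values, for illustration only
train_demo = pd.DataFrame({"GrLivArea": [1000, 2000, 3000]})
test_demo = pd.DataFrame({"GrLivArea": [2000, 4000]})

stats = fit_standardizer(train_demo, ["GrLivArea"])
train_scaled = apply_standardizer(train_demo, stats)
test_scaled = apply_standardizer(test_demo, stats)  # uses the training mean and std
print(test_scaled)
```

With this approach a test value equal to the training mean maps to 0, whatever the other test rows contain.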

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [25]:
X_test = pd.get_dummies(X_test, columns=['HouseStyle','BldgType'], prefix=['HouseStyle_is','BldgType_is'])
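
Encoding the train and test splits separately can produce different dummy columns when a category appears in only one of them. One way to align them is `DataFrame.reindex`, as sketched below on small made-up frames rather than the competition data.

```python
import pandas as pd

# Made-up frames: "SLvl" and "2Story" occur only in train, "2.5Unf" only in test
train_demo = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "SLvl"]})
test_demo = pd.DataFrame({"HouseStyle": ["1Story", "2.5Unf"]})

train_enc = pd.get_dummies(train_demo, columns=["HouseStyle"], prefix=["HouseStyle_is"])
test_enc = pd.get_dummies(test_demo, columns=["HouseStyle"], prefix=["HouseStyle_is"])

# Keep exactly the training columns, in the same order:
# columns missing from the test split are filled with 0, unseen ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

After alignment, the test frame has the same columns in the same order as the frame the model was fitted on.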

<p style="margin-top:20px"><strong>Evaluating the model performance</strong></p>

In [26]:
y_pred = reg_multiple.predict(X_test)
compute_rmsle(y_test,y_pred)

0.22