## House Prices Prediction

<p>This notebook implements a solution to the House Prices prediction competition hosted on <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview">Kaggle</a>.</p>

<h5>Datasets used</h5>
<ul style="margin-bottom:20px">
    <li>train.csv - to train and test the regression model</li>
</ul>

<div style="height:20px;text-align:center"><hr/></div>

### Solution

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [2]:
def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    """Return the root mean squared logarithmic error, rounded to `precision` decimals."""
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(rmsle, precision)

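def standardize_column(col_names, data):
    """
    Standardize the given columns to zero mean and unit variance.

    The mean and standard deviation are computed from the data passed in.

    :param col_names: list of column names to standardize
    :param data: dataframe containing those columns
    :return: copy of the dataframe with the listed columns standardized
    """
    data = data.copy()  # work on a copy so the caller's dataframe is not modified
    for col in col_names:
        mean = np.mean(data[col], axis=0)
        std = np.std(data[col], axis=0)
        data[col] = (data[col] - mean) / std

    return data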

# Function to encode categorical variables
def encode_categories(col_names, data, encoder):
    """
    One-hot encode categorical variables.

    Note that the encoder is refitted on the data passed in.

    :param col_names: list of column names to encode
    :param data: dataframe containing those columns
    :param encoder: encoder instance, e.g. OneHotEncoder
    :return: dataframe with the listed columns replaced by their encoded columns
    """
    categorical_encoded = encoder.fit_transform(data[col_names])
    feature_names = encoder.get_feature_names_out(input_features=col_names)
    categorical_encoded = pd.DataFrame(categorical_encoded, columns=feature_names, index=data.index)
    # Append the encoded columns, then drop the original categorical columns.
    encoded_data = data.join(categorical_encoded).drop(columns=col_names)

    return encoded_data

# Function to return the train and test data split
def split_train_test_data(X, y, split_factor):
    """
    Split the data into train and test sets.

    :param X: dataset containing feature columns
    :param y: dataset containing target column
    :param split_factor: fraction of the data held out as the test set
    :return: X_train, X_test, y_train, y_test
    """
    return train_test_split(X, y, test_size=split_factor, random_state=0)



## 1) Model Training

<p style="margin-top:20px"><strong>Loading the data</strong></p>

In [3]:
dataset = pd.read_csv('../data/train.csv',index_col= 'Id')
dataset.sample(20)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1162,20,RL,,14778,Pave,,IR1,Low,AllPub,CulDSac,...,0,,,,0,11,2008,WD,Normal,224000
20,20,RL,70.0,7560,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,5,2009,COD,Abnorml,139000
313,190,RM,65.0,7800,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,5,2006,WD,Normal,119900
1300,20,RL,75.0,7500,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,,0,5,2010,WD,Normal,154000
53,90,RM,110.0,8472,Grvl,,IR2,Bnk,AllPub,Corner,...,0,,,,0,5,2010,WD,Normal,110000
710,20,RL,,7162,Pave,,IR1,Lvl,AllPub,Inside,...,0,,MnPrv,,0,12,2008,WD,Abnorml,109900
353,50,RL,60.0,9084,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,3,2008,ConLw,Normal,95000
316,60,RL,71.0,7795,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,7,2009,WD,Normal,188500
608,20,RL,78.0,7800,Pave,,Reg,Bnk,AllPub,Inside,...,0,,,,0,8,2006,WD,Normal,225000
620,60,RL,85.0,12244,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2008,WD,Normal,305000


<p style="margin-top:20px"><strong>Extracting features and target variables</strong></p>

In [4]:
X = dataset[['TotalBsmtSF','GrLivArea','GarageCars','GarageArea','HouseStyle','BldgType']]
y = dataset['SalePrice']

<p style="margin-top:20px"><strong>Splitting train-test data</strong></p>

In [5]:
X_train, X_test, y_train, y_test = split_train_test_data(X, y, 0.25)
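
As an illustration of what the split does, the following minimal sketch applies `train_test_split` to a small toy frame (the values are hypothetical and stand in for the real features and target). With `test_size=0.25`, a quarter of the rows is held out, and the feature and target indices stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target (hypothetical values).
X_toy = pd.DataFrame({"GrLivArea": np.arange(8) * 100 + 1000})
y_toy = pd.Series(np.arange(8) * 10_000 + 100_000, name="SalePrice")

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
# Features and target keep matching indices after the split.
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
```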

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [6]:
X_train = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_train)
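
To make the transformation concrete, here is a minimal sketch of the same z-score computation on a toy column (hypothetical values). It uses the population standard deviation, which is what `np.std` computes by default, and checks that the result has mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical living-area values.
toy = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0]})

values = toy["GrLivArea"].to_numpy()
mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# z-score: subtract the mean, divide by the standard deviation.
z = (values - mean) / std

print(z.round(3))
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))
```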

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [7]:
# scikit-learn >= 1.2 uses sparse_output; older releases use sparse=False instead.
encoder = OneHotEncoder(sparse_output=False)

In [8]:
X_train = encode_categories(['HouseStyle','BldgType'],X_train,encoder)

In [9]:
X_train.sample(5)

Id,TotalBsmtSF,GrLivArea,GarageCars,GarageArea,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
1437,-0.466869,-1.272496,2,0.266019,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1035,-0.330581,-1.098226,1,-1.103061,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
209,0.246206,1.047221,2,-0.085759,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
879,0.129389,-0.709025,2,0.494199,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1375,-0.072609,0.845843,3,2.205549,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


<p style="margin-top:20px"><strong>Fitting the model</strong></p>

In [10]:
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)

LinearRegression()
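
A fitted `LinearRegression` exposes one coefficient per feature column in `coef_` and the constant term in `intercept_`. As a minimal sketch, the toy data below is generated from a known, noise-free line (an assumption made purely for illustration), so the fit should recover its slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from price = 50 * area + 10000, with no noise.
area = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 50.0 * area.ravel() + 10_000.0

toy_model = LinearRegression().fit(area, price)

print(toy_model.coef_)       # one slope per feature column
print(toy_model.intercept_)  # constant term
```

The same attributes are available on `reg_multiple`, where the order of `coef_` follows the column order of `X_train`.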

## 2) Model Evaluation

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [11]:
X_test = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_test)
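
The cell above standardizes `X_test` with its own mean and standard deviation. A common alternative is to reuse the statistics learned from the training set, so that both sets are on the scale the model was fitted on. The following is a minimal sketch of that alternative using scikit-learn's `StandardScaler` on toy frames (hypothetical values); it is not what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train and test frames with hypothetical values.
train_toy = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0]})
test_toy = pd.DataFrame({"GrLivArea": [1750.0, 3000.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from the training data
test_scaled = scaler.transform(test_toy)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

Here the first test value equals the training mean, so it maps to 0 regardless of the other test rows.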

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [12]:
X_test = encode_categories(['HouseStyle','BldgType'],X_test,encoder)

In [13]:
X_test.sample(5)

Id,TotalBsmtSF,GrLivArea,GarageCars,GarageArea,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
1215,-0.267669,-0.906798,1,-0.786517,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
6,-0.519315,-0.247596,2,0.018791,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
142,1.310485,0.42795,2,0.824098,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1341,-0.398369,-1.137428,4,0.018791,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1390,-0.638311,-0.509097,2,-0.160167,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


<p style="margin-top:20px"><strong>Model predictions</strong></p>

In [14]:
y_pred = reg_multiple.predict(X_test)
# Error = actual - predicted, so a positive error means the model under-predicted.
comparison_frame = pd.DataFrame(
    {'Actual Price': y_test, 'Predicted Price': np.round(y_pred, 2), 'Error': np.round(y_test - y_pred, 2)},
    index=y_test.index,
)
comparison_frame.sample(20)

Id,Actual Price,Predicted Price,Error
1071,135000,132004.14,2995.86
123,136000,135776.19,223.81
547,210000,178769.04,31230.96
57,172500,194794.9,-22294.9
1082,133000,131764.88,1235.12
252,235000,227511.18,7488.82
1098,170000,164479.92,5520.08
1244,465000,305002.22,159997.78
391,119000,130595.77,-11595.77
1201,116050,110486.89,5563.11


<p style="margin-top:20px"><strong>Evaluating the model performance</strong></p>

In [15]:
compute_rmsle(y_test,y_pred)

0.22
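
For reference, RMSLE is the square root of the mean squared difference between `log(1 + predicted)` and `log(1 + actual)`, so it measures relative rather than absolute error. The sketch below computes it by hand on toy values (hypothetical prices) and compares the result with the scikit-learn function used in `compute_rmsle`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical actual and predicted prices.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_hat = np.array([110_000.0, 180_000.0, 330_000.0])

# Manual RMSLE: root of the mean squared difference of log(1 + value).
manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))
# Same quantity via scikit-learn.
library = np.sqrt(mean_squared_log_error(y_true, y_hat))

print(round(manual, 4), round(library, 4))
```

Note that `mean_squared_log_error` raises an error for negative inputs, which a linear model can in principle produce.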