## House Prices Prediction

<p>This notebook implements a solution to the House Prices prediction competition hosted on <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview">Kaggle</a>.</p>

<h5>Datasets used</h5>
<ul style="margin-bottom:20px">
    <li>train.csv - split into train and test sets to fit and evaluate the regression model</li>
    <li>test.csv - unlabeled data used in the inference step</li>
</ul>

<div style="height:20px;text-align:center"><hr/></div>

### Solution

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import joblib



In [2]:
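def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    """
    Compute the root mean squared logarithmic error (RMSLE), rounded to `precision` decimals.
    """
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(float(rmsle), precision)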

def standardize_column(col_names, data, scaler_path):
    """
    Standardize the given columns of a dataframe with a previously fitted scaler.

    :param col_names: list of column names to standardize
    :param data: dataframe containing those columns
    :param scaler_path: path to the fitted scaler saved with joblib
    :return: copy of the dataframe with the listed columns standardized
    """
    scaler = joblib.load(scaler_path)
    # Work on a copy so the caller's frame (often a slice of another frame) is not modified in place.
    data = data.copy()
    data[col_names] = scaler.transform(data[col_names])
    return data

# Function to encode categorical variables
def encode_categories(col_names, data, encoder_path):
    """
    One-hot encode the given categorical columns with a previously fitted encoder.

    :param col_names: list of column names to encode
    :param data: dataframe containing those columns
    :param encoder_path: path to the fitted encoder saved with joblib
    :return: dataframe with the listed columns replaced by their encoded columns
    """
    encoder = joblib.load(encoder_path)

    categorical_encoded = encoder.transform(data[col_names])
    feature_names = encoder.get_feature_names_out(input_features=col_names)
    categorical_encoded = pd.DataFrame(categorical_encoded, columns=feature_names, index=data.index)
    # Drop the raw categorical columns and append the encoded ones.
    return data.drop(columns=col_names).join(categorical_encoded)

# Function to return train and test data split
def split_train_test_data(X, y, split_factor):
    """
    Split the data into train and test sets.

    :param X: dataset containing feature columns
    :param y: dataset containing target column
    :param split_factor: fraction of the data held out for testing
    :return: X_train, X_test, y_train, y_test
    """
    return train_test_split(X, y, test_size=split_factor, random_state=0)



## 1) Model Training

<p style="margin-top:20px"><strong>Loading the data</strong></p>

In [3]:
dataset = pd.read_csv('../data/train.csv',index_col= 'Id')
dataset.sample(20)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
716,20,RL,78.0,10140,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,8,2009,WD,Normal,165000
874,40,RL,60.0,12144,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,Othr,0,9,2009,WD,Normal,133000
506,90,RM,60.0,7596,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,0,,,,0,7,2009,COD,Normal,124500
45,20,RL,70.0,7945,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,5,2006,WD,Normal,141000
1195,60,RL,80.0,9364,Pave,,Reg,Lvl,AllPub,Corner,...,0,,MnPrv,,0,3,2010,WD,Normal,158000
1112,60,RL,80.0,10480,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,205000
1041,20,RL,88.0,13125,Pave,,Reg,Lvl,AllPub,Corner,...,0,,GdPrv,,0,1,2006,WD,Normal,155000
301,190,RL,90.0,15750,Pave,,Reg,Lvl,AllPub,Corner,...,0,,,,0,6,2006,WD,Normal,157000
1444,30,RL,,8854,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,5,2009,WD,Normal,121000
1417,190,RM,60.0,11340,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,122500


<p style="margin-top:20px"><strong>Extracting features and target variables</strong></p>

In [4]:
X = dataset[['TotalBsmtSF','GrLivArea','GarageCars','GarageArea','HouseStyle','BldgType']]
y = dataset['SalePrice']

<p style="margin-top:20px"><strong>Splitting train-test data</strong></p>

In [5]:
X_train, X_test, y_train, y_test = split_train_test_data(X, y, 0.25)

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [6]:
scaler = StandardScaler()
scaler.fit(X_train[['TotalBsmtSF','GrLivArea','GarageArea']])
joblib.dump(scaler,'../models/scaler.joblib')
X_train = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_train,'../models/scaler.joblib')
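
The scaler above is fitted on the training split only and later reused on the test and inference data. As a minimal sketch of what that transform computes, the toy example below (made-up values, not taken from the dataset) compares `StandardScaler` with the same calculation done by hand using the training mean and population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for a training column and a held-out value (made-up numbers).
train = np.array([[800.0], [1000.0], [1200.0]])
held_out = np.array([[1100.0]])

# The scaler learns the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train)
scaled = scaler.transform(held_out)

# Same computation by hand; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses.
manual = (held_out - train.mean(axis=0)) / train.std(axis=0)

print(scaled, manual)
```

Because the statistics come from the training split, held-out values are expressed relative to the training distribution rather than their own.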

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [7]:
# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
enc = OneHotEncoder(sparse_output=False)
enc.fit(X_train[['HouseStyle','BldgType']])
joblib.dump(enc,'../models/encoder.joblib')
X_train = encode_categories(['HouseStyle','BldgType'],X_train,'../models/encoder.joblib')

In [8]:
X_train.sample(5)

Id,TotalBsmtSF,GrLivArea,GarageCars,GarageArea,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
843,0.173195,-0.689662,2,0.085376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
449,-0.6713,-0.281098,1,-1.302718,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
506,-0.233233,0.849716,2,-0.342461,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
179,2.823497,1.380268,3,3.298911,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
163,1.180748,0.038396,2,0.285034,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


<p style="margin-top:20px"><strong>Fitting the model</strong></p>

In [9]:
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)
joblib.dump(reg_multiple,'../models/model.joblib')

['../models/model.joblib']
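
The scaler, encoder, and model are saved here as three separate artifacts. As an alternative sketch, not the approach this notebook takes, scikit-learn's `ColumnTransformer` and `Pipeline` can bundle the same three steps into one object. The column subset and toy values below are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["GrLivArea"]
categorical = ["BldgType"]

# Toy data constructed so that price is exactly linear in the features.
toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0],
    "BldgType": ["1Fam", "Duplex", "1Fam", "Duplex"],
})
price = np.array([100000.0, 140000.0, 200000.0, 240000.0])

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

model.fit(toy, price)
pred = model.predict(toy)
print(pred)
```

A fitted pipeline like this can be saved with a single `joblib.dump` call, and `predict` then applies the preprocessing automatically.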

## 2) Model Evaluation

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [10]:
X_test = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_test,'../models/scaler.joblib')

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [11]:
X_test = encode_categories(['HouseStyle','BldgType'],X_test,'../models/encoder.joblib')

In [12]:
X_test.sample(5)

Id,TotalBsmtSF,GrLivArea,GarageCars,GarageArea,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
1274,-0.089645,-0.317888,1,-0.760791,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
1040,-1.036355,-1.725595,1,-0.884388,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
550,-0.350051,1.109183,2,0.679595,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1390,-0.780816,-0.587037,2,-0.152311,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
40,-2.569588,-0.714834,0,-2.243961,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


<p style="margin-top:20px"><strong>Model predictions</strong></p>

In [13]:
y_pred = reg_multiple.predict(X_test)
comparison_frame = pd.DataFrame(
    {
        'Actual Price': y_test,
        'Predicted Price': np.round(y_pred, 2),
        'Error': np.round(y_test - y_pred, 2),
    },
    index=y_test.index,
)
comparison_frame.sample(20)

Id,Actual Price,Predicted Price,Error
1106,325000,303953.06,21046.94
1164,108959,146604.62,-37645.62
990,197000,182143.15,14856.85
827,109500,67078.92,42421.08
796,171000,170211.38,788.62
770,538000,387536.8,150463.2
482,374000,281455.04,92544.96
311,165600,165950.13,-350.13
1383,157000,186084.1,-29084.1
309,82500,129924.42,-47424.42


<p style="margin-top:20px"><strong>Evaluating the model performance</strong></p>

In [14]:
compute_rmsle(y_test,y_pred)

0.28
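
For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`, so it penalizes relative rather than absolute errors. The sketch below, with made-up prices, checks a hand-written version against scikit-learn's `mean_squared_log_error`. Clipping predictions at zero is shown as one possible guard, since a linear model can in principle predict negative prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up actual and predicted prices for illustration.
y_true = np.array([100000.0, 200000.0, 300000.0])
y_hat = np.array([110000.0, 180000.0, 330000.0])

# RMSLE by hand: root of the mean squared difference of log(1 + x).
manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))

# The same value via scikit-learn.
library = np.sqrt(mean_squared_log_error(y_true, y_hat))

# Optional guard (an assumption, not used in the notebook): clip negatives to zero.
clipped = np.clip(np.array([-5000.0, 120000.0]), 0, None)

print(manual, library, clipped)
```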

## 3) Model Inference

In [15]:
dataset_inf = pd.read_csv('../data/test.csv',index_col= 'Id')
linear_model_loaded = joblib.load('../models/model.joblib')

# Keep only the model's features and drop rows with missing values,
# so predictions are produced for complete rows only.
X_inf = dataset_inf[['TotalBsmtSF','GrLivArea','GarageCars','GarageArea','HouseStyle','BldgType']].dropna(how='any')

X_inf = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_inf,'../models/scaler.joblib')
X_inf = encode_categories(['HouseStyle','BldgType'],X_inf,'../models/encoder.joblib')

y_pred_inf = linear_model_loaded.predict(X_inf)
y_pred_inf

array([122027.0400057 , 165764.86844561, 190327.56121512, ...,
       178840.13126803, 107045.45330138, 246557.32127283])
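
Because rows with missing values are dropped before prediction, the prediction array can be shorter than the original file. The sketch below uses toy data with made-up Ids and placeholder predictions to show one way to reattach predictions to their `Id` index and expose the rows that were skipped.

```python
import numpy as np
import pandas as pd

# Toy feature frame with one incomplete row (Ids and values are made up).
features = pd.DataFrame(
    {"GrLivArea": [900.0, np.nan, 1500.0]},
    index=pd.Index([1461, 1462, 1463], name="Id"),
)
complete = features.dropna(how="any")

# Placeholder standing in for model.predict(complete).
preds = np.array([120000.0, 180000.0])

# Attach predictions to the surviving Ids, then reindex to the full Id list
# so that dropped rows show up explicitly as NaN.
submission = (
    pd.Series(preds, index=complete.index, name="SalePrice")
    .reindex(features.index)
)
print(submission)
```

How the missing rows are then filled, for example by imputing the features beforehand, is a separate modelling decision.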