## House Prices Prediction

<p>This notebook implements a solution to the house price prediction competition hosted on <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview">Kaggle</a>.</p>

<h5>Datasets used</h5>
<ul style="margin-bottom:20px">
    <li>train.csv - to train and test the regression model</li>
</ul>

<div style="height:20px;text-align:center"><hr/></div>

### Solution

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [2]:
def compute_rmsle(y_test: np.ndarray, y_pred: np.ndarray, precision: int = 2) -> float:
    """Return the root mean squared logarithmic error, rounded to `precision` decimals."""
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return round(rmsle, precision)

def standardize_column(col_names, data):
    """
    Standardize the given dataframe columns to zero mean and unit variance.

    :param col_names: list of names of the columns to standardize
    :param data: dataframe containing those columns
    :return: copy of the dataframe with the listed columns standardized
    """
    # Work on a copy so the caller's dataframe (often a slice) is left untouched
    data = data.copy()
    for col in col_names:
        mean = np.mean(data[col], axis=0)
        std = np.std(data[col], axis=0)
        data[col] = (data[col] - mean) / std

    return data

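# Function to encode categorical variables
def encode_categories(col_names, data, encoder):
    """
    One-hot encode categorical variables.

    :param col_names: list of names of the columns to encode
    :param data: dataframe containing those columns
    :param encoder: OneHotEncoder configured for dense output; it is fitted on the
        first call and reused as-is on later calls
    :return: encoded dataframe
    """
    # Fit only once (on the training data) so that later calls, e.g. on the test
    # set, reuse the same categories instead of refitting
    if not hasattr(encoder, "categories_"):
        encoder.fit(data[col_names])
    categorical_encoded = encoder.transform(data[col_names])
    feature_names = encoder.get_feature_names_out(input_features=col_names)
    categorical_encoded = pd.DataFrame(categorical_encoded, columns=feature_names, index=data.index)
    encoded_data = data.join(categorical_encoded)
    encoded_data = encoded_data.drop(columns=col_names)

    return encoded_data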

# Function to return train and test data split
def split_train_test_data(X, y, split_factor):
    """
    Split data into train and test sets.

    :param X: dataset containing feature columns
    :param y: dataset containing target column
    :param split_factor: fraction of the data held out for testing
    :return: X_train, X_test, y_train, y_test
    """
    return train_test_split(X, y, test_size=split_factor, random_state=0)
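
For reference, the RMSLE reported by `compute_rmsle` is the square root of the mean squared difference between `log(1 + y_pred)` and `log(1 + y_true)`. A minimal NumPy-only sketch of that formula follows; the function name `rmsle_by_hand` is hypothetical and the prices are made up purely for illustration.

```python
import numpy as np


def rmsle_by_hand(y_true, y_pred):
    """Root mean squared logarithmic error, written out with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which is only defined for x > -1
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Made-up prices, for illustration only
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])
score = rmsle_by_hand(y_true, y_pred)
print(score)
```

Because the error is computed on logarithms, it penalizes relative rather than absolute deviations: a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.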



## 1) Model Training

<p style="margin-top:20px"><strong>Loading the data</strong></p>

In [3]:
dataset = pd.read_csv('../data/train.csv',index_col= 'Id')
dataset.sample(20)

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173 | 160 | RL | 44.0 | 5306 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal | 239000 |
| 339 | 20 | RL | 91.0 | 14145 | Pave | NaN | Reg | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | Shed | 400 | 5 | 2006 | WD | Normal | 202500 |
| 463 | 20 | RL | 60.0 | 8281 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | GdWo | NaN | 0 | 12 | 2009 | WD | Normal | 62383 |
| 1202 | 60 | RL | 80.0 | 10400 | Pave | NaN | Reg | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 3 | 2009 | WD | Normal | 197900 |
| 891 | 50 | RL | 60.0 | 8064 | Pave | NaN | Reg | Lvl | AllPub | Corner | ... | 0 | NaN | MnPrv | Shed | 2000 | 7 | 2007 | WD | Normal | 122900 |
| 827 | 45 | RM | 50.0 | 6130 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | WD | Normal | 109500 |
| 1333 | 20 | RL | 67.0 | 8877 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 100000 |
| 640 | 120 | RL | 53.0 | 3982 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 10 | 2006 | New | Partial | 264561 |
| 669 | 20 | RL | NaN | 14175 | Pave | NaN | Reg | Bnk | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2006 | WD | Normal | 168000 |
| 1271 | 40 | RL | NaN | 23595 | Pave | NaN | Reg | Low | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 260000 |


<p style="margin-top:20px"><strong>Extracting features and target variables</strong></p>

In [4]:
X = dataset[['TotalBsmtSF','GrLivArea','GarageCars','GarageArea','HouseStyle','BldgType']]
y = dataset['SalePrice']

<p style="margin-top:20px"><strong>Splitting train-test data</strong></p>

In [5]:
X_train, X_test, y_train, y_test = split_train_test_data(X, y, 0.25)
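
With a split factor of 0.25, a quarter of the rows are held out for testing, and features and targets stay aligned through their shared index. The following is a small sketch on a synthetic frame (the values are made up; only the column names echo the dataset) that makes both properties visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; values are made up
toy = pd.DataFrame(
    {"GrLivArea": np.arange(100, 120), "SalePrice": np.arange(1000, 1020)}
)
X_toy, y_toy = toy[["GrLivArea"]], toy["SalePrice"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)         # (15, 1) (5, 1)
print(X_tr.index.equals(y_tr.index))  # True: features and targets stay aligned
```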

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [6]:
X_train = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_train)
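
`standardize_column` applies a z-score: it subtracts the column mean and divides by the standard deviation, which for `np.std` is the population value (`ddof=0`) by default. A tiny worked sketch with made-up living areas shows the effect.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"GrLivArea": [800.0, 1200.0, 1600.0, 2000.0]})

mean = toy["GrLivArea"].mean()
std = toy["GrLivArea"].std(ddof=0)  # population standard deviation
z = (toy["GrLivArea"] - mean) / std

print(z.round(4).tolist())
print(z.mean(), z.std(ddof=0))  # approximately 0 and 1
```

After the transformation each column is expressed in units of standard deviations from its mean, which puts features measured on very different scales on a comparable footing.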

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [7]:
# scikit-learn >= 1.2; older releases call this parameter `sparse`
encoder = OneHotEncoder(sparse_output=False)

In [8]:
X_train = encode_categories(['HouseStyle','BldgType'],X_train,encoder)

In [9]:
X_train.sample(5)

| Id | TotalBsmtSF | GrLivArea | GarageCars | GarageArea | HouseStyle_1.5Fin | HouseStyle_1.5Unf | HouseStyle_1Story | HouseStyle_2.5Fin | HouseStyle_2.5Unf | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl | BldgType_1Fam | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1128 | 1.153977 | 0.017096 | 3 | 0.750901 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 319 | 0.70861 | 2.178034 | 3 | 0.874499 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 572 | -0.466869 | -1.272496 | 1 | -0.874881 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 771 | -0.481471 | -1.284114 | 2 | 0.494199 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 736 | -0.427929 | 0.477941 | 2 | -0.722761 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


<p style="margin-top:20px"><strong>Fitting the model</strong></p>

In [10]:
reg_multiple = LinearRegression()
reg_multiple.fit(X_train, y_train)

LinearRegression()
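
Once fitted, a `LinearRegression` exposes the learned weights through `coef_` and `intercept_`. As a sanity check of what the fit does, here is a sketch on synthetic, noise-free data (not the housing data) in which the known coefficients are recovered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
# Noise-free target with known coefficients 3 and -2 and intercept 5
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

model = LinearRegression().fit(X_toy, y_toy)
print(model.coef_, model.intercept_)  # close to [3, -2] and 5
```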

## 2) Model Evaluation

<p style="margin-top:20px"><strong>Scaling continuous features</strong></p>

In [11]:
X_test = standardize_column(['TotalBsmtSF','GrLivArea','GarageArea'],X_test)
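
Note that `standardize_column` recomputes the mean and standard deviation from whatever dataframe it receives, so the call above scales `X_test` with its own statistics rather than those of `X_train`. A common alternative is to learn the statistics on the training set and reuse them for the test set. The following is a minimal sketch of that idea, assuming scikit-learn's `StandardScaler` (which this notebook does not use) and made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["GrLivArea"]
# Made-up values, for illustration only
train = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0]})
test = pd.DataFrame({"GrLivArea": [1500.0, 2500.0]})

scaler = StandardScaler().fit(train[cols])   # statistics come from train only
train_scaled = scaler.transform(train[cols])
test_scaled = scaler.transform(test[cols])   # same mean and std reused

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

With this approach a test value equal to the training mean maps to 0, regardless of how the test set itself is distributed, so the model sees test features on the same scale it was trained on.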

<p style="margin-top:20px"><strong>Encoding categorical features</strong></p>

In [12]:
X_test = encode_categories(['HouseStyle','BldgType'],X_test,encoder)

In [13]:
X_test.sample(5)

| Id | TotalBsmtSF | GrLivArea | GarageCars | GarageArea | HouseStyle_1.5Fin | HouseStyle_1.5Unf | HouseStyle_1Story | HouseStyle_2.5Fin | HouseStyle_2.5Unf | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl | BldgType_1Fam | BldgType_2fmCon | BldgType_Duplex | BldgType_Twnhs | BldgType_TwnhsE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1349 | 0.818897 | -0.007886 | 2 | 0.170904 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 868 | -0.043333 | -0.687064 | 2 | 0.305122 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1174 | 1.045183 | 2.88316 | 0 | -2.128697 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 28 | 1.251962 | 0.373471 | 3 | 1.325179 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1110 | 1.509461 | 0.613181 | 3 | 1.727833 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


<p style="margin-top:20px"><strong>Evaluating the model performance</strong></p>

In [14]:
y_pred = reg_multiple.predict(X_test)
compute_rmsle(y_test,y_pred)

0.22
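
The scale, encode, fit and evaluate steps of this notebook can also be bundled into a single scikit-learn `Pipeline`, which guarantees that every preprocessing step is fitted on the training split only. The sketch below is one possible arrangement under stated assumptions, not the notebook's method: the data is synthetic, and `ColumnTransformer`, `StandardScaler` and the clipping of negative predictions are additions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv; every value here is made up
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "GrLivArea": rng.uniform(600, 3000, n),
        "GarageCars": rng.integers(0, 4, n),
        "BldgType": rng.choice(["1Fam", "Duplex", "Twnhs"], n),
    }
)
toy["SalePrice"] = (
    50_000
    + 80 * toy["GrLivArea"]
    + 10_000 * toy["GarageCars"]
    + rng.normal(0, 5_000, n)
)

X_toy = toy.drop(columns="SalePrice")
y_toy = toy["SalePrice"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["GrLivArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["BldgType"]),
    ],
    remainder="passthrough",  # GarageCars is passed through unchanged
)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_tr, y_tr)  # scaler and encoder are fitted on the training split only

# A linear model can predict negative prices, which the log-based metric
# cannot handle, so clip predictions at 0 before scoring
y_hat = np.clip(model.predict(X_te), 0, None)
rmsle = float(np.sqrt(mean_squared_log_error(y_te, y_hat)))
print(round(rmsle, 2))
```

Calling `model.predict` on new rows then applies the stored scaling and encoding automatically, so the test set never needs to be transformed by hand.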