# House Price Prediction with Scikit-Learn

The goal of this project is to predict house prices from a set of variables describing each home. This is a well-known machine-learning challenge hosted on Kaggle, and it is well suited to testing ML concepts on real-world data. More information can be found on the competition's [Kaggle page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [1]:
import pandas as pd
import numpy as np

import os

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

## Dataset

In [2]:
# get train and test data set
data_loc = './data'

train_data_base = pd.read_csv(os.path.join(data_loc,'train.csv'), index_col='Id')
test_data_base = pd.read_csv(os.path.join(data_loc,'test.csv'), index_col='Id')

In [3]:
# split a validation set from the full train set
train_set, val_set = train_test_split(train_data_base, test_size=0.2, random_state=42)

# copy the test set for preprocessing
test_set = test_data_base.copy()

In [4]:
train_data_base.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

## Preprocessing

The same preprocessing steps are applied separately to the training, validation and test sets.

The data set contains numerical and categorical values. A few categorical features use a verbal grading scale that can easily be ranked, for example from "poor" to "excellent". These can be mapped to numerical values.

In [5]:
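# mapping categorical to numerical features
# note: pd.read_csv parses the string 'NA' as a missing value by default,
# so the 'NA' keys below only match if the files are read with keep_default_na=False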
map1 = {'Reg':0, 'IR1': 1, 'IR2':2, 'IR3':3}
map2 = {'Po':0, 'Fa':1, 'TA':2, 'Gd': 3, 'Ex':4}
map3 = {'Gtl':0, 'Mod':1, 'Sev':2}
map4 = {'NA':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd': 4, 'Ex':5}
map5 = {'NA':0, 'No':1, 'Mn':2, 'Av':3, 'Gd': 4}
map6 = {'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ': 4, 'ALQ':5, 'GLQ':6}
map7 = {'NA':0, 'Unf':1, 'RFn':2, 'Fin':3}
map8 = {'N':0, 'P':1, 'Y':2}
map9 = {'NA':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}

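def MapNumerical(feature, mapping):
    # replace the column in every split; plain column assignment lets pandas
    # change the dtype from object to numeric (newer pandas versions try to
    # keep the old dtype when assigning via .loc[:, feature])
    for df in (train_set, val_set, test_set):
        df[feature] = df[feature].map(mapping)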

In [18]:
# assign the maps to the according categorical features
map_feature_lst = [('LotShape', map1), ('HeatingQC', map2), ('KitchenQual', map2), ('LandSlope', map3),
                   ('ExterQual', map2), ('ExterCond', map2), ('BsmtQual', map4), ('BsmtCond', map4), 
                   ('BsmtExposure', map5), ('BsmtFinType1', map6), ('BsmtFinType2', map6), ('FireplaceQu', map4),
                   ('GarageFinish', map7), ('GarageQual', map4), ('GarageCond', map4), ('PavedDrive', map8),
                   ('PoolQC', map9)]

for feature, mapping in map_feature_lst:
    MapNumerical(feature, mapping)
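How `Series.map` treats missing entries matters for the maps that contain an `'NA'` key. The following is a minimal sketch on a made-up toy column (not taken from the competition data). It assumes the column was read with the pandas defaults, so that "NA" arrives as `NaN`. Filling the gaps with 0 is one possible choice and not what this notebook does; here the remaining `NaN` values are left to the imputer further down.

```python
import numpy as np
import pandas as pd

# toy column; the values are invented for illustration
quality = pd.Series(['Gd', 'TA', np.nan, 'Ex', 'Po'])
quality_map = {'NA': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# NaN is not a key of the mapping, so the missing entry stays NaN
mapped = quality.map(quality_map)

# optional: treat "not present" explicitly as grade 0
filled = mapped.fillna(0).astype(int)

print(mapped.tolist())
print(filled.tolist())
```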

Binary values can also be encoded directly as 0/1.

In [7]:
binmap1 = {'Grvl':0, 'Pave':1}
binmap2 = {'N':0, 'Y':1}

MapNumerical('Street', binmap1)
MapNumerical('CentralAir', binmap2)

## ML Model
We can now extract the numerical features and build a model with them. For the sake of simplicity, the remaining categorical features are ignored. More sophisticated feature engineering is also left out of this notebook.

In [8]:
# get a list of all features which are now numerical
features_num = train_set.select_dtypes(include=['float64', 'int64']).columns.drop(['SalePrice']).tolist()

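These features are now used to predict the sale price, or rather its natural logarithm. This matches the evaluation metric of the competition, the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, and it compresses the wide range of house prices.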
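To illustrate why the log scale is a sensible target, the sketch below compares absolute and logarithmic errors for two invented prices. Both predictions are off by the same relative amount, so their log errors are identical even though the absolute errors differ by a factor of five.

```python
import numpy as np

# made-up prices for a cheap and an expensive home
y_true = np.array([100_000.0, 500_000.0])
y_pred = y_true * 1.10  # both predictions are 10 % too high

abs_err = np.abs(y_pred - y_true)                   # grows with the price
log_err = np.abs(np.log(y_pred) - np.log(y_true))   # depends only on the ratio

print(abs_err)
print(log_err)
```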

In [9]:
# define train, val and test data
X_train = train_set[features_num].to_numpy()
y_train = np.log(train_set['SalePrice']).to_numpy()

X_val = val_set[features_num].to_numpy()
y_val = np.log(val_set['SalePrice']).to_numpy()

X_test = test_set[features_num].to_numpy()

In [10]:
# set up the ML pipeline

# imputing strategy for missing values
imputer = SimpleImputer(missing_values=np.nan)

# scale values
scaler = StandardScaler()

# regression
regressor = Ridge()

# pipeline
pipe = Pipeline([('imputer', imputer), ('scaler', scaler), ('regressor', regressor)])

In [11]:
# tune hyperparameters with cross validation
param_grid = {'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 50, 100, 500, 1000], 
              'imputer__strategy':['mean', 'median']}

search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', n_jobs=-1)
search.fit(X_train,y_train)

print(search.best_params_)

{'imputer__strategy': 'mean', 'regressor__alpha': 100}


In [12]:
# apply the best hyperparameters to the model and fit it to the train data
pipe.set_params(regressor__alpha=search.best_params_['regressor__alpha'],
               imputer__strategy=search.best_params_['imputer__strategy'])
pipe.fit(X_train,y_train)

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('regressor', Ridge(alpha=100))])
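With its default `refit=True`, `GridSearchCV` already refits the best parameter combination on the full training data, so the tuned pipeline is also available as `search.best_estimator_`. A minimal sketch on synthetic data follows; the arrays, the coefficients and the smaller grid are made up for illustration and do not reproduce the numbers above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression problem with a few missing entries
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5 % of the values

pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('regressor', Ridge())])
param_grid = {'regressor__alpha': [0.1, 1, 10],
              'imputer__strategy': ['mean', 'median']}

search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error')
search.fit(X, y)

# already fitted on all of X, y with the best parameters
best_pipe = search.best_estimator_

# best_score_ is a negative MSE, so flip the sign before taking the root
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, cv_rmse)
```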

## Evaluate the Model

Now we use the validation set to evaluate the model and create a prediction for the test set, which can then be uploaded to the Kaggle competition page.

In [13]:
y_val_pred = pipe.predict(X_val)

# root mean squared error for the log scaled prices
np.sqrt(mean_squared_error(y_val, y_val_pred))

0.14762297810024838

In [14]:
# the root mean squared error for the unscaled prices
np.sqrt(mean_squared_error(np.exp(y_val), np.exp(y_val_pred)))

27914.038546602194

In [15]:
# predict the prices for the test set
y_test_pred = np.exp(pipe.predict(X_test))

In [16]:
# create the submission file for the kaggle competition
submission = pd.DataFrame(y_test_pred, columns=['SalePrice'])
submission['Id'] = test_data_base.index
# astype returns a new Series, so the result has to be assigned back
submission['Id'] = submission['Id'].astype('int')

submission.to_csv(os.path.join(data_loc,'submission.csv'), index=False)
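An alternative way to assemble the file is to build the frame from a dictionary, which also fixes the column order to `Id`, `SalePrice`. This is only a sketch: the ids and predictions below are hypothetical stand-ins for `test_data_base.index` and `y_test_pred`, and the CSV is written to a string instead of a file.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the test index and the predicted prices
test_ids = pd.Index([1461, 1462, 1463], name='Id')
preds = np.array([120000.0, 155000.0, 180000.0])

submission = pd.DataFrame({'Id': test_ids.astype('int'), 'SalePrice': preds})

# without a path, to_csv returns the CSV content as a string
csv_text = submission.to_csv(index=False)
print(csv_text)
```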

The submission scored 0.14672 on the public leaderboard, which is close to the score of 0.14762 estimated on the validation set.

## Conclusion

This notebook showcases a very simple regression model built with scikit-learn. The result can be used as a baseline to compare other models against. A few possibilities to improve the score:

- Include the remaining categorical features.
- Design some 'hand-crafted' features.
- Evaluate feature importance and select only the most important few.
- Use a more sophisticated regression model (e.g. kernel regression, random forest, AdaBoost).
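As a starting point for the first item, the remaining categorical features could be one-hot encoded inside the pipeline. The sketch below shows one way to do this with `ColumnTransformer` and `OneHotEncoder`. The toy frame, its values and the fixed `alpha` are invented for illustration; none of this has been evaluated on the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with one numerical and one categorical feature (made-up values)
toy = pd.DataFrame({
    'LotArea': [8000, 9500, 11000, 9000, 14000, 12000],
    'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'C'],
    'SalePrice': [200000, 180000, 220000, 140000, 250000, 150000],
})

numeric = Pipeline([('imputer', SimpleImputer()),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        # categories unseen during fit are encoded as all zeros
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer([('num', numeric, ['LotArea']),
                                ('cat', categorical, ['Neighborhood'])])
model = Pipeline([('preprocess', preprocess),
                  ('regressor', Ridge(alpha=1.0))])

model.fit(toy[['LotArea', 'Neighborhood']], np.log(toy['SalePrice']))

# a home from a neighborhood that did not occur in the training data
new_home = pd.DataFrame({'LotArea': [10000], 'Neighborhood': ['D']})
pred = np.exp(model.predict(new_home))
print(pred)
```

Keeping the encoder inside the pipeline means its categories are learned from the training folds only, in the same way as the imputer and the scaler above.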
