# House price prediction

## Project summary

In this project I use a dataset containing data about houses. We will use the different house features to predict the price of a house. We will first explore the data to spot any challenges we might encounter, then pre-process it so that it is ready for model building. Finally, we will try several different regressors and choose the best one.

I. Exploring the data

II. Pre-processing

III. Building our regressor

In [1]:
# python libraries
import pandas as pd
import numpy as np

# regressors
import xgboost
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# pre-processing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


## I. Exploring the data

In [2]:
data = pd.read_csv("train.csv")
data.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000


In [3]:
print(data.isnull().sum())

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64


As we can see, some columns contain null values that we will need to handle (LotFrontage alone has 259). The dataset also contains non-numerical values, which we will need to encode before our models can use them.
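The truncated output above hides most of the 81 columns, so it can help to list only the columns that actually have gaps, along with the non-numerical ones. The following is a minimal sketch on a tiny made-up frame (`toy` and its values are hypothetical stand-ins for `train.csv`, not real rows):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [None, "Grvl", None, None],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Columns that are not numeric will need encoding before modelling.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(non_numeric_cols)
```

Running the same two steps on `data` would give the full list of columns to impute and to encode.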

## II. Pre-processing

In the following cell, I take care of missing data in the dataset. I iterate over the columns: if a column contains numerical data, I replace its missing values with the mean of that column. Otherwise, I replace the missing values with the string "Missing".

In [4]:
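# DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names
for key in data.columns:
    if pd.api.types.is_numeric_dtype(data[key]):
        # numerical column: fill gaps with the column mean
        data[key] = data[key].fillna(data[key].mean())
    else:
        # non-numerical column: fill gaps with a placeholder category
        data[key] = data[key].fillna("Missing")

# sanity check: every count should now be 0
missing_data = data.isnull().sum()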
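The same two imputation rules can also be written with `SimpleImputer`, which is imported at the top of this notebook but not used. This is only a sketch of an alternative, on a made-up frame (`toy`, `mean_imputer` and `constant_imputer` are hypothetical names); it is not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one numerical and one non-numerical column with gaps.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Alley": pd.Series(["Grvl", np.nan, np.nan], dtype=object),
})

numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

# Mean for numerical columns, the constant "Missing" for everything else.
mean_imputer = SimpleImputer(strategy="mean")
constant_imputer = SimpleImputer(strategy="constant", fill_value="Missing")

toy[numeric_cols] = mean_imputer.fit_transform(toy[numeric_cols])
toy[other_cols] = constant_imputer.fit_transform(toy[other_cols])

print(toy)
```

One advantage of the imputer objects is that they remember the fitted means, so the same values can be reused on new data with `transform`.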


Here I use a LabelEncoder to deal with non-numerical data. I also split the data into X and y.

In [5]:
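# LabelEncoder maps each distinct string to an integer code (this is not one-hot encoding)
label_enc = LabelEncoder()
for key in data.columns:
    if pd.api.types.is_string_dtype(data[key]):
        data[key] = label_enc.fit_transform(data[key])


scaler = StandardScaler()
scaler.fit(data)
# note: the transformed array is not assigned back, so `data` stays unscaled
scaler.transform(data)

# the positional axis argument of drop() was removed in pandas 2.0
X = data.drop(columns='SalePrice')
y = data.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)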
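The cell above calls `scaler.transform` without keeping the result, and it fits the scaler on the full table, including SalePrice and the rows that later become the test set. One common way to avoid both issues is to put the scaler in a scikit-learn `Pipeline`, so that it is fitted on the training features only. The sketch below uses synthetic data (`X_toy`, `y_toy` and the coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house features and prices (not real data).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=25.0, size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training features only,
# then applies the same scaling when predicting on the test features.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
r2_test = model.score(X_te, y_te)
print(r2_test)
```

A pipeline like this can be passed directly to `GridSearchCV` or `cross_val_score`, in which case the scaler is refitted inside each fold.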

## III. Building our regressor

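Now that we have split our data, let us find the best regressor. Note that the output of scaler.transform in the previous cell is never assigned, so the models below are trained on unscaled features. We will use grid search to tune each model. The notebook at https://github.com/codebasics/py/blob/master/ML/15_gridsearch/15_grid_search.ipynb is a useful reference.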

I didn't know this when starting this project, but grid search seems ill-suited to XGBoost: it takes too much time or never even finishes. I decided to use random search instead, which worked for me.
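To see why an exhaustive search is so slow here, we can count the candidates. The sketch below copies the XGBoost search space used later in this notebook and counts its combinations with scikit-learn's `ParameterGrid`; the variable names for the totals are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the XGBoost cell in this notebook.
xgb_param_grid = {
    'n_estimators': [100, 500, 900, 1100, 1500],
    'max_depth': [2, 3, 5, 10, 15],
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'min_child_weight': [1, 2, 3, 4],
    'booster': ['gbtree', 'gblinear'],
    'base_score': [0.25, 0.5, 0.75, 1],
}

cv_folds = 5
n_combinations = len(ParameterGrid(xgb_param_grid))

# Exhaustive grid search fits every combination once per fold.
grid_search_fits = n_combinations * cv_folds
# Random search with n_iter=50 fits only 50 sampled combinations per fold.
random_search_fits = 50 * cv_folds

print(n_combinations, grid_search_fits, random_search_fits)
```

With 3200 combinations and 5 folds, a full grid search would need 16000 fits, against 250 for the randomized search with `n_iter=50`.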

In [6]:
# linear regression grid search
# ('normalize' is left out of the grid: the parameter was removed in scikit-learn 1.2)
grid_search_linear_reg = GridSearchCV(LinearRegression(), { 'fit_intercept': [True, False],
                                                        'copy_X': [True, False]
                                                        }, cv=5)
grid_search_linear_reg.fit(X_train, y_train)
print("linear reg score :", grid_search_linear_reg.best_score_)


linear reg score : 0.792177289049633


In [7]:
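# decision tree grid search
# ('mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0)
decision_tree_param_grid = {'criterion': ['squared_error', 'absolute_error'],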
              'min_samples_split': [10, 20, 40],
              'max_depth': [2, 6, 8],
              'min_samples_leaf': [20, 40, 100],
              'max_leaf_nodes': [5, 20, 100],
              }

grid_search_decision_trees = GridSearchCV(DecisionTreeRegressor(), decision_tree_param_grid, cv=5)
grid_search_decision_trees.fit(X_train, y_train)
print("decision trees score :", grid_search_decision_trees.best_score_)


decision trees score : 0.7420121889339863
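The `best_score_` values printed above are the mean cross-validated score of the best candidate. Since no `scoring` argument is given, that score is R² for regressors. A fitted search also exposes the winning parameters through `best_params_`. The sketch below shows both on synthetic data (`X_toy`, `y_toy` and `toy_grid` are made up and much smaller than the grids used in this notebook):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the house data.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Deliberately small, hypothetical grid: 2 x 2 = 4 candidates.
toy_grid = {"max_depth": [2, 6], "min_samples_leaf": [5, 20]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0), toy_grid, cv=5)
search.fit(X_toy, y_toy)

best_params = search.best_params_                     # winning parameter combination
best_cv_r2 = search.best_score_                       # its mean R^2 across the 5 folds
mean_scores = search.cv_results_["mean_test_score"]   # one mean score per candidate

print(best_params, best_cv_r2)
```

Printing `grid_search_decision_trees.best_params_` in the same way would show which tree settings produced the score above.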


In [8]:
# xgboost randomized search
xgb_param_grid = {
    'n_estimators': [100, 500, 900, 1100, 1500],
    'max_depth': [2,3,5,10,15],
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'min_child_weight': [1,2,3,4],
    'booster': ['gbtree','gblinear'],
    'base_score': [0.25, 0.5, 0.75, 1]
}
grid_search_xgb = RandomizedSearchCV(xgboost.XGBRegressor(), param_distributions = xgb_param_grid,
                              cv=5, n_iter=50,
                              scoring = 'neg_mean_absolute_error', n_jobs = 4,
                              verbose = 5,
                              return_train_score = True,
                              random_state = 42)
grid_search_xgb.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    7.8s
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:  2.5min
[Parallel(n_jobs=4)]: Done 250 out of 250 | elapsed:  3.8min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_...
                   iid='deprecated', n_iter=50, n_jobs=4,
                   param_distributions={'base_score': [0.25, 0.5, 0.75, 1],
                                        'booster': ['gbt

In [9]:
xgb_best_estimator = grid_search_xgb.best_estimator_
print("xgb score :",cross_val_score(xgb_best_estimator, X_train, y_train, cv=5).mean())

xgb score : 0.8513626688170917


It seems that XGBoost outperforms our other regressors on this dataset. We shall now make predictions with it on the test set to see if it generalizes.

In [10]:
xgb_best_estimator.fit(X_train, y_train)
xgb_best_estimator.score(X_test, y_test)


0.8877980255398377
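The value returned by `score` is R², while the randomized search above optimised mean absolute error. Reporting errors in price units alongside R² can make the result easier to interpret. The sketch below computes the three metrics on made-up numbers (`y_true` and `y_pred` are hypothetical, not predictions from this notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted sale prices (not taken from the dataset).
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 260000.0])

r2 = r2_score(y_true, y_pred)                        # same metric as .score()
mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalises large errors more

print(r2, mae, rmse)
```

In the notebook, `y_true` would be `y_test` and `y_pred` would be `xgb_best_estimator.predict(X_test)`.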