<a href="https://colab.research.google.com/github/quintonmills/HousePricePredict/blob/main/HPFeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Predicting Sale Price of Houses

The aim of the project is to build a machine learning model that predicts the sale price of homes from explanatory variables describing different aspects of residential houses.



###Reproducibility: Setting the seed

To ensure reproducibility, it is important to set the seed.

In [2]:
# to handle datasets
import pandas as pd
import numpy as np

#for plotting 
import matplotlib.pyplot as plt

#For the yeo-johnson transformation
import scipy.stats as stats

#to divide train and test set
from sklearn.model_selection import train_test_split

#Feature scaling
from sklearn.preprocessing import MinMaxScaler

#to save the trained scaler class
import joblib

#to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [3]:
#load dataset
data = pd.read_csv('train.csv')

#rows and columns of the data
print(data.shape)

#visualise the dataset
data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


###Separate dataset into train and test

Our feature engineering techniques will learn the following parameters from the train set only:

*   Mean
*   Mode
*   Exponents for the Yeo-Johnson transformation
*   Category frequency
*   Category-to-number mapping

Separating the data into train and test sets involves randomness, so we need to set the seed.

In [4]:
#let's separate into train and test set
x_train, x_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),  #predictive variables
    data['SalePrice'],  #target
    test_size=0.1,  #portion of dataset to allocate to test set
    random_state=0,  #we are setting the seed here
)
x_train.shape, x_test.shape

((1314, 79), (146, 79))
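
As a quick illustration of why the seed matters, the sketch below repeats a split twice with the same `random_state` and checks that the same rows end up in the test set. The toy dataframe and the helper name `split` are made up for this example and are not part of the house price data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the house price dataset
toy = pd.DataFrame({"feature": np.arange(20), "target": np.arange(20) * 10})

def split(seed):
    # Same call pattern as in the notebook, with the seed passed as random_state
    return train_test_split(
        toy[["feature"]], toy["target"], test_size=0.1, random_state=seed
    )

a_train, a_test, _, _ = split(0)
b_train, b_test, _, _ = split(0)

# With the same seed, the test set contains the same rows in the same order
same_seed_matches = a_test.index.equals(b_test.index)
print(same_seed_matches)
```

Without a fixed `random_state`, each run could allocate different rows to the test set, and the parameters learned from the train set would change with it.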

###Feature Engineering

In the following cells, we will engineer the variables of the house price dataset so that we tackle:
1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Putting the variables on a similar scale

###Target

We apply the natural logarithm to the target, SalePrice.

In [5]:
y_train = np.log(y_train)
y_test = np.log(y_test)
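
One consequence of this step is that anything fitted on the transformed target works in log units, so values need to be mapped back with `np.exp` to read them as prices. A minimal sketch, using the first three SalePrice values from the preview above:

```python
import numpy as np

# First three SalePrice values from the data preview
prices = np.array([208500.0, 181500.0, 223500.0])

# Forward transformation, as applied to y_train and y_test
log_prices = np.log(prices)

# Inverse transformation: np.exp undoes np.log
recovered = np.exp(log_prices)

print(np.allclose(recovered, prices))
```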

###Missing values

##Categorical variables

We will replace missing values with the string "Missing" in those variables with a lot of missing data (more than 10% of the observations).

Alternatively, we will replace missing data with the most frequent category in those variables where only a few observations lack a value.
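
The two rules can be sketched as a single helper that learns the fill value for each column from the train set and then applies it to both sets. The function name `fit_categorical_imputer`, the `threshold` argument, and the toy dataframes are assumptions made for this illustration; the notebook itself does the same thing step by step in the cells that follow.

```python
import pandas as pd

def fit_categorical_imputer(train, threshold=0.1):
    """Return {column: fill value}, learned from the train set only."""
    fill_values = {}
    for col in train.columns:
        na_share = train[col].isnull().mean()
        if na_share == 0:
            continue  # nothing to impute
        if na_share > threshold:
            # lots of missing data: treat "Missing" as its own category
            fill_values[col] = "Missing"
        else:
            # few missing values: use the most frequent category
            fill_values[col] = train[col].mode()[0]
    return fill_values

# Hypothetical toy data: Alley is mostly missing, Electrical rarely
train = pd.DataFrame({
    "Alley": ["Grvl"] * 2 + [None] * 18,
    "Electrical": ["SBrkr"] * 15 + ["FuseA"] * 4 + [None],
})
test = pd.DataFrame({"Alley": [None, "Pave"], "Electrical": [None, "FuseA"]})

fill_values = fit_categorical_imputer(train)
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)  # same values, learned from train

print(fill_values)
```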

In [6]:
#Let's identify the categorical variables; we will capture those of type object
cat_vars = [var for var in data.columns if data[var].dtype == "O"]

#MSSubClass is also categorical by definition, even though it is stored as a number,
#so let's add it to the list of categorical variables
cat_vars = cat_vars + ['MSSubClass']

#Cast all variables as categorical
x_train[cat_vars] = x_train[cat_vars].astype('O')
x_test[cat_vars] = x_test[cat_vars].astype('O')

#number of categorical variables
len(cat_vars)

44

In [7]:
#Make a list of the categorical variables that contain missing values
cat_vars_with_na = [
    var for var in cat_vars
    if x_train[var].isnull().sum() > 0
]

#Print the fraction of missing values per variable
x_train[cat_vars_with_na].isnull().mean().sort_values(ascending=False)

PoolQC          0.995434
MiscFeature     0.961187
Alley           0.938356
Fence           0.814307
FireplaceQu     0.472603
GarageType      0.056317
GarageFinish    0.056317
GarageQual      0.056317
GarageCond      0.056317
BsmtExposure    0.025114
BsmtFinType2    0.025114
BsmtQual        0.024353
BsmtCond        0.024353
BsmtFinType1    0.024353
MasVnrType      0.004566
Electrical      0.000761
dtype: float64

In [8]:
#variables to impute with the string "Missing" (more than 10% missing)
with_string_missing = [
    var for var in cat_vars_with_na if x_train[var].isnull().mean() > 0.1
]

#variables to impute with the most frequent category (10% missing or less)
with_frequent_category = [
    var for var in cat_vars_with_na if x_train[var].isnull().mean() <= 0.1
]


In [9]:
with_string_missing

['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

In [10]:
#replace missing values with new label: "Missing"

x_train[with_string_missing] = x_train[with_string_missing].fillna('Missing')
x_test[with_string_missing] = x_test[with_string_missing].fillna('Missing')


In [11]:
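for var in with_frequent_category:

  #there can be more than 1 mode in a variable
  #we take the first one with [0]
  mode = x_train[var].mode()[0]
  print(var, mode)

  #assign the result back: calling fillna(..., inplace=True) on a selected
  #column may not update the dataframe in recent pandas versions
  x_train[var] = x_train[var].fillna(mode)
  x_test[var] = x_test[var].fillna(mode)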
  

MasVnrType None
BsmtQual TA
BsmtCond TA
BsmtExposure No
BsmtFinType1 Unf
BsmtFinType2 Unf
Electrical SBrkr
GarageType Attchd
GarageFinish Unf
GarageQual TA
GarageCond TA


In [12]:
#Check that we have no missing information in the engineered variables
x_train[cat_vars_with_na].isnull().sum()

Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64

In [13]:
#Check that test set does not contain null values in the engineered variables
[var for var in cat_vars_with_na if x_test[var].isnull().sum() > 0]


[]

###Numerical variables

To engineer missing values in numerical variables, we will:
* add a binary missing indicator variable
* and then replace the missing values in the original variable with the mean
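
A minimal sketch of these two steps on made-up values is shown below. The toy dataframes are assumptions for illustration; the key point is that the mean is computed on the train set and reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each set
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})

# The mean is learned from the train set only (pandas skips NaN by default)
mean_val = train["LotFrontage"].mean()

for df in (train, test):
    # 1 where the original value was missing, 0 otherwise
    df["LotFrontage_na"] = df["LotFrontage"].isnull().astype(int)
    # then fill the gap with the train mean
    df["LotFrontage"] = df["LotFrontage"].fillna(mean_val)

print(test)
```

The indicator has to be created before the imputation, otherwise the information about which rows were missing is lost.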

In [14]:
#now let's identify the numerical variables
num_vars = [
    var for var in x_train.columns
    if var not in cat_vars and var != 'SalePrice'
]

In [15]:
#number of numerical variables
len(num_vars)

35

In [16]:
#Make a list of the numerical variables that contain missing values
vars_with_na = [
    var for var in num_vars
    if x_train[var].isnull().sum() > 0
]

#Print the fraction of missing values per variable
x_train[vars_with_na].isnull().mean()

LotFrontage    0.177321
MasVnrArea     0.004566
GarageYrBlt    0.056317
dtype: float64

In [17]:
#Replace missing values as we described above

for var in vars_with_na:

  #calculate the mean using the train set
  mean_val = x_train[var].mean()

  print(var, mean_val)

  #add binary missing indicator (in train and test)
  x_train[var + '_na'] = np.where(x_train[var].isnull(), 1, 0)
  x_test[var + '_na'] = np.where(x_test[var].isnull(), 1, 0)

  #replace missing values by the train mean (in train and test);
  #assign the result back instead of using inplace=True on a selected column
  x_train[var] = x_train[var].fillna(mean_val)
  x_test[var] = x_test[var].fillna(mean_val)

#Check that we have no more missing values in the engineered variables
x_train[vars_with_na].isnull().sum()

LotFrontage 69.87974098057354
MasVnrArea 103.7974006116208
GarageYrBlt 1978.2959677419356


LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [19]:
#Check that test set does not contain null values in the engineered variables
[var for var in vars_with_na if x_test[var].isnull().sum() > 0]

[]

In [20]:
#Check the binary missing indicator variables

x_train[['LotFrontage_na', 'MasVnrArea_na', 'GarageYrBlt_na']].head()

      LotFrontage_na  MasVnrArea_na  GarageYrBlt_na
930                0              0               0
656                0              0               0
45                 0              0               0
1348               1              0               0
55                 0              0               0


###Temporal variables

##Capture elapsed time

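There are 4 temporal variables: three refer to the year in which the house or garage was built or remodeled (YearBuilt, YearRemodAdd, GarageYrBlt), and the fourth, YrSold, is the year in which the house was sold.

We will capture the time elapsed between each of the first three variables and the year in which the house was sold.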

In [21]:
def elapsed_years(df, var):
  #replace the year in var with the number of years between var and the sale (YrSold)
  df[var] = df['YrSold'] - df[var]
  return df

In [22]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
  x_train = elapsed_years(x_train, var)
  x_test = elapsed_years(x_test, var)
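
Note that `elapsed_years` modifies the dataframe it receives. If that side effect is unwanted, a copy-based variant is one option. The sketch below uses a hypothetical name, `elapsed_years_copy`, and the YearBuilt and YrSold values of the first two rows from the data preview.

```python
import pandas as pd

def elapsed_years_copy(df, var, ref="YrSold"):
    """Return a copy of df where var holds the years elapsed until ref."""
    out = df.copy()
    out[var] = out[ref] - out[var]
    return out

# YearBuilt and YrSold of the first two houses in the preview
toy = pd.DataFrame({"YearBuilt": [2003, 1976], "YrSold": [2008, 2007]})
result = elapsed_years_copy(toy, "YearBuilt")

print(result)
```

Here the first house was 5 years old at sale and the second 31, while `toy` keeps the original years.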

In [27]:
#Now we drop YrSold
x_train.drop(['YrSold'], axis = 1, inplace = True)
x_test.drop(['YrSold'], axis = 1, inplace = True)

###Numerical variable transformation

##Logarithmic transformation
The numerical variables are not normally distributed.

We will apply the logarithm to the positive numerical variables in order to get a more Gaussian-like distribution.

In [28]:
for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:
  x_train[var] = np.log(x_train[var])
  x_test[var] = np.log(x_test[var])

In [31]:
#Check that test set does not contain null values on the engineered variables
[var for var in ["LotFrontage", "1stFlrSF", "GrLivArea"] if x_test[var].isnull().sum() > 0]

[]

In [32]:
#Same for train set
[var for var in ["LotFrontage", "1stFlrSF", "GrLivArea"] if x_train[var].isnull().sum() > 0]

[]
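
One caveat about the null checks above: `np.log(0)` returns `-inf` rather than NaN, and `isnull()` does not flag infinite values by default, so a zero in the input would pass the check unnoticed. The sketch below shows this and adds a guard. The helper name `safe_log` and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def safe_log(series):
    """Log-transform a series, refusing zeros and negative values."""
    if (series <= 0).any():
        raise ValueError(f"{series.name} contains non-positive values")
    return np.log(series)

# First three 1stFlrSF values from the data preview
toy = pd.Series([856.0, 1262.0, 920.0], name="1stFlrSF")
logged = safe_log(toy)

# A zero becomes -inf, which isnull() does not count as missing
with np.errstate(divide="ignore"):
    bad = np.log(pd.Series([0.0, 10.0]))
print(bad.isnull().sum(), np.isinf(bad).sum())
```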

###Yeo-Johnson transformation

We will apply the Yeo-Johnson transformation to LotArea

In [34]:
#The Yeo-Johnson transformation learns the best exponent to transform the variable;
#it needs to learn it from the train set

x_train['LotArea'], param = stats.yeojohnson(x_train['LotArea'])

#and then apply the transformation to the test set with the same
#parameter: note how this time we pass param as an argument to yeojohnson
x_test['LotArea'] = stats.yeojohnson(x_test['LotArea'], lmbda=param)

print(param)

1.7496163569053917




In [35]:
#Check absence of na in the train set
[var for var in x_train.columns if x_train[var].isnull().sum() > 0]

[]
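
The fit-on-train, apply-to-test pattern used for the Yeo-Johnson transformation can be sketched in isolation. The data below is randomly generated, right-skewed, and purely hypothetical; only the two call forms of `scipy.stats.yeojohnson` mirror the notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data standing in for a variable such as LotArea
rng = np.random.default_rng(0)
train_vals = rng.lognormal(mean=2.0, sigma=0.5, size=200)
test_vals = rng.lognormal(mean=2.0, sigma=0.5, size=50)

# Without lmbda, yeojohnson estimates the exponent and returns it
train_t, lmbda = stats.yeojohnson(train_vals)

# With lmbda, it only applies the transformation and returns the array
test_t = stats.yeojohnson(test_vals, lmbda=lmbda)

print(lmbda)
```

Calling `stats.yeojohnson(test_vals)` without `lmbda` would instead estimate a new exponent from the test set, which is what the train/test separation is meant to avoid.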

###Binarize skewed variables
There are a few variables that are very skewed. We will transform these into binary variables that indicate whether the value is zero or not.


In [36]:
skewed = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]

for var in skewed:

  #map the variable values into 0 and 1
  x_train[var] = np.where(x_train[var] == 0, 0, 1)
  x_test[var] = np.where(x_test[var] == 0, 0, 1)
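
To see what the mapping does, here is a small sketch on made-up values. It also computes the share of zeros first, which is one simple way to judge whether a variable is dominated by zeros before binarizing it. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: most houses have no screen porch
toy = pd.DataFrame({"ScreenPorch": [0, 0, 120, 0, 35]})

# Share of observations that are exactly zero
zero_share = (toy["ScreenPorch"] == 0).mean()

# 0 stays 0, any other value becomes 1
toy["ScreenPorch"] = np.where(toy["ScreenPorch"] == 0, 0, 1)

print(zero_share)
print(toy["ScreenPorch"].tolist())
```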