# Feature Engineering

In the following cells, we will engineer / pre-process the variables of the House Price Dataset from Kaggle. We will engineer the variables so that we tackle:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Standardise the values of the variables to the same range

## 1. Importing the Necessary Libraries

In [1]:
# importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import scipy.stats as stats
import joblib

In [2]:
# Load and visualize the dataset
df = pd.read_csv("C:\\Users\\yozil\\Desktop\\My projects\\9.0 End_to_End_House_Price_Prediction\\Data\\raw_data\\train.csv")
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# shape of the data frame
df.shape

(1460, 81)

## 2. Separate dataset into train and test

It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. These parameters must be learned only from the train set, so that no information from the test set leaks into the transformations. This is to avoid over-fitting.

Our feature engineering techniques will learn:

- mean
- mode
- exponents for the yeo-johnson
- category frequency
- and category to number mappings

from the train set.

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**

In [4]:
x_train,x_test, y_train, y_test = train_test_split( df.drop(["Id", "SalePrice"], axis=1),
                                                   df["SalePrice"],
                                                   test_size = 0.1,
                                                   random_state = 0)

In [5]:
# shape of each split
[x_train.shape, x_test.shape, y_train.shape, y_test.shape]

[(1314, 79), (146, 79), (1314,), (146,)]
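
The role of the seed can be checked on a small example. The sketch below uses a hypothetical toy frame (`toy_X`, `toy_y` are not part of the dataset) and shows that two calls with the same `random_state` select the same rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target (hypothetical values).
toy_X = pd.DataFrame({"a": np.arange(20), "b": np.arange(20) * 2})
toy_y = pd.Series(np.arange(20), name="target")

# Two splits with the same seed select exactly the same rows.
split_1 = train_test_split(toy_X, toy_y, test_size=0.1, random_state=0)
split_2 = train_test_split(toy_X, toy_y, test_size=0.1, random_state=0)

same_train_rows = split_1[0].index.equals(split_2[0].index)
same_test_rows = split_1[1].index.equals(split_2[1].index)
print(same_train_rows, same_test_rows)
```

Without a fixed `random_state`, each run would generally produce a different split, and every statistic learned from the train set would change with it.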


#### Target

Since the target variable is not normally distributed, we apply the logarithmic transformation so that its distribution more closely resembles a normal distribution.

In [6]:
y_train = np.log(y_train)
y_test = np.log(y_test)
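
Because the target is now on a log scale, any model trained on it predicts log prices, and those predictions need `np.exp` to return to the original units. A minimal sketch of the round trip, using hypothetical prices rather than model output:

```python
import numpy as np

# Hypothetical sale prices, used only to illustrate the round trip.
prices = np.array([140000.0, 181500.0, 208500.0])

log_prices = np.log(prices)     # the scale the model is trained on
recovered = np.exp(log_prices)  # the inverse step to apply to predictions

print(np.allclose(recovered, prices))
```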

### 3. Missing values

#### Categorical variables

* We will replace missing values with the string **"Missing"** in variables where more than 10% of the observations are missing.

* Otherwise, we will replace missing values with the **most frequent** category of the variable.

In [7]:
# first let's collect the categorical features in a list
cat_vars = [var for var in x_train.select_dtypes("object").columns]

# according to the dataset description, 'MSSubClass' is also categorical, so we add it.
cat_vars = cat_vars + ['MSSubClass']

# now let's cast all the categorical variables to object
x_train[cat_vars] = x_train[cat_vars].astype("object")
x_test[cat_vars] = x_test[cat_vars].astype("object")


cat_vars

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SaleCondition',
 'MSSubClass']

In [8]:
# number of categorical features
len(cat_vars)

44

In [9]:
# categorical variables with missing values
cat_vars_with_na = [var for var in cat_vars if x_train[var].isnull().sum() > 0]
cat_vars_with_na

['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

In [10]:
# number of categorical variables with missing values
len(cat_vars_with_na)

16

In [11]:
# let's see the fraction of missing values in each of these features
x_train[cat_vars_with_na].isnull().mean().sort_values(ascending = False)

PoolQC          0.995434
MiscFeature     0.961187
Alley           0.938356
Fence           0.814307
FireplaceQu     0.472603
GarageType      0.056317
GarageFinish    0.056317
GarageQual      0.056317
GarageCond      0.056317
BsmtExposure    0.025114
BsmtFinType2    0.025114
BsmtQual        0.024353
BsmtCond        0.024353
BsmtFinType1    0.024353
MasVnrType      0.004566
Electrical      0.000761
dtype: float64

In [12]:
# features with more than 10% missing values: impute with the string "Missing"
cat_vars_with_missing = [var for var in cat_vars_with_na if x_train[var].isnull().mean() > 0.1]
cat_vars_with_missing


['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

In [13]:
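# features with at most 10% missing values: impute with the most frequent category
# (<= so that a feature with exactly 10% missing values is not skipped by both lists)
cat_vars_with_frequent = [var for var in cat_vars_with_na if x_train[var].isnull().mean() <= 0.1]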
cat_vars_with_frequent

['MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

In [14]:
# let's replace missing values with the string "Missing"
for var in cat_vars_with_missing:
    x_train[var] = x_train[var].fillna("Missing")
    x_test[var] = x_test[var].fillna("Missing")

In [15]:
# now let's replace missing values with the most frequent category of each feature (learned from the train set)
for var in cat_vars_with_frequent:
    mode = x_train[var].mode()[0]
    
    x_train[var] = x_train[var].fillna(mode)
    x_test[var] = x_test[var].fillna(mode)
    
    print(var, mode)
    print()

MasVnrType None

BsmtQual TA

BsmtCond TA

BsmtExposure No

BsmtFinType1 Unf

BsmtFinType2 Unf

Electrical SBrkr

GarageType Attchd

GarageFinish Unf

GarageQual TA

GarageCond TA



In [16]:
# Now let's check there is no missing value in the categorical features in the training set
x_train[cat_vars].isnull().sum().sum()

0

In [17]:
# Now let's check there is no missing value in the categorical features in the test set
x_test[cat_vars].isnull().sum().sum()

0

There are no missing categorical values in either the training or the testing set.
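
The two imputation rules for categorical variables can also be written as one helper. This is only a sketch on toy data: the function name `impute_categorical`, the `threshold` argument and the toy frames are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_categorical(train, test, var, threshold=0.1):
    """Fill NaN with "Missing" if the train NaN share exceeds `threshold`, else with the train mode."""
    train, test = train.copy(), test.copy()
    if train[var].isnull().mean() > threshold:
        fill_value = "Missing"
    else:
        fill_value = train[var].mode()[0]  # learned from the train set only
    train[var] = train[var].fillna(fill_value)
    test[var] = test[var].fillna(fill_value)
    return train, test, fill_value

# Toy frames: 90% of "Alley" and 5% of "Electrical" are missing in train.
toy_train = pd.DataFrame({
    "Alley": ["Grvl", "Pave"] + [np.nan] * 18,
    "Electrical": ["SBrkr"] * 17 + ["FuseA"] * 2 + [np.nan],
})
toy_test = pd.DataFrame({"Alley": [np.nan, "Grvl"], "Electrical": [np.nan, "FuseA"]})

fills = {}
for var in ["Alley", "Electrical"]:
    toy_train, toy_test, fills[var] = impute_categorical(toy_train, toy_test, var)

print(fills)
```

The fill value is always derived from the train frame and then reused for the test frame, which mirrors the rule stated at the start of the notebook.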

#### Numerical variables

To engineer missing values in numerical variables, we will:

- add a binary missing indicator variable
- and then replace the missing values in the original variable with the mean

In [18]:
# first let's collect all temporal variables
year_vars = [var for var in x_train.select_dtypes("number").columns if "Year" in var or "Yr" in var]
year_vars

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

In [19]:
# now let's collect the remaining numerical features
num_vars = [var for var in x_train.select_dtypes("number").columns if var not in year_vars]
num_vars

['LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold']

In [20]:
# number of numerical variables
len(num_vars)

31

In [21]:
# now let's mask numerical variables with missing values
num_vars_with_na = [var for var in num_vars if x_train[var].isnull().sum() > 0 ]
num_vars_with_na

['LotFrontage', 'MasVnrArea']

In [22]:
# now let's fill those columns with the mean
for var in num_vars_with_na:
    mean = x_train[var].mean()
    
    # first let's add missing value indicators
    x_train[var + "_na"] = np.where(x_train[var].isnull(), 1, 0 )
    x_test[var + "_na"] = np.where(x_test[var].isnull(), 1, 0 )
    
    x_train[var] = x_train[var].fillna(mean)
    x_test[var] = x_test[var].fillna(mean)
    
    print(var, mean)
    print()
    

LotFrontage 69.87974098057354

MasVnrArea 103.7974006116208



In [23]:
# now let's confirm that there are no missing values in the numerical variables of the training set
x_train[num_vars].isnull().sum().sum()

0

In [24]:
# let's confirm that there are no missing values in the numerical variables of the test set
x_test[num_vars].isnull().sum().sum()

0

In [25]:
# now let's check the binary missing value indicators in the numerical variables
x_train[["LotFrontage_na", "MasVnrArea_na"]].head()

Unnamed: 0,LotFrontage_na,MasVnrArea_na
930,0,0
656,0,0
45,0,0
1348,1,0
55,0,0


Now we can confirm that there are no missing values in the numerical variables
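
The indicator-plus-mean approach can be illustrated on a single toy column. The values below are hypothetical; the point is that the indicator is computed before the fill, and that the mean comes from the train frame only:

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

# The mean is learned from the train set only: (60 + 80 + 70) / 3 = 70
train_mean = toy_train["LotFrontage"].mean()

for frame in (toy_train, toy_test):
    # The indicator must be created before the NaNs are overwritten.
    frame["LotFrontage_na"] = frame["LotFrontage"].isnull().astype(int)
    frame["LotFrontage"] = frame["LotFrontage"].fillna(train_mean)

print(toy_test)
```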

#### Temporal Variables

In [26]:
# first let's mask all temporal variables with missing values
year_vars_with_na = [var for var in year_vars if x_train[var].isnull().sum() > 0]
year_vars_with_na

['GarageYrBlt']

In [27]:
# we fill missing values in temporal features with the most frequent value of the feature
for var in year_vars_with_na:
    mode = x_train[var].mode()[0]
    
    print(var, mode)
    
    # first let's add missing value indicators
    x_train[var + "_na"] = np.where(x_train[var].isnull(), 1, 0 )
    x_test[var + "_na"] = np.where(x_test[var].isnull(), 1, 0 )
    
    x_train[var] = x_train[var].fillna(mode)
    x_test[var] = x_test[var].fillna(mode)

GarageYrBlt 2005.0


In [28]:
# now let's confirm there are no missing values in the training set of temporal variables
x_train[year_vars].isnull().sum().sum()

0

In [29]:
# now let's confirm there are no missing values in the test set of temporal variables
x_test[year_vars].isnull().sum().sum()

0

In [30]:
# now let's check the binary missing value indicators in the temporal variables
x_train[["GarageYrBlt_na"]].head()

Unnamed: 0,GarageYrBlt_na
930,0
656,0
45,0
1348,0
55,0


### 4. Temporal Variables (Capture elapsed time)

There are 4 temporal variables: three refer to the years in which the house or the garage was built or remodeled, and the fourth, **YrSold**, is the year in which the house was sold.

We will capture the time elapsed between those variables and the year in which the house was sold:

In [31]:
# the temporal variables are
year_vars

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

In [32]:
# let's define a function that captures the elapsed time
def capture_elapsed_time(data, var):
    data[var] = data["YrSold"] - data[var]
    return data

# let's apply the capture elapsed time function in the data
for var in year_vars:
    if var != "YrSold":
        x_train = capture_elapsed_time(x_train, var)
        x_test = capture_elapsed_time(x_test, var)

In [33]:
# Now let's check the training temporal variables dataframe
x_train[year_vars].head()

Unnamed: 0,YearBuilt,YearRemodAdd,GarageYrBlt,YrSold
930,2,2,2.0,2009
656,49,2,49.0,2008
45,5,5,5.0,2010
1348,9,9,9.0,2007
55,44,44,44.0,2008


In [34]:
# now let's check the test temporal variables dataframe
x_test[year_vars].head()

Unnamed: 0,YearBuilt,YearRemodAdd,GarageYrBlt,YrSold
529,50,32,32.0,2007
491,65,56,65.0,2006
459,59,59,59.0,2009
279,31,31,31.0,2008
655,39,39,39.0,2010


Since all the elapsed-time variables have been captured, the **YrSold** variable is no longer needed, so we drop it.

In [35]:
# dropping the "YrSold" feature from both training and testing sets.
x_train = x_train.drop("YrSold", axis= 1)
x_test = x_test.drop("YrSold", axis = 1)
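
The elapsed-time step and the drop of **YrSold** can be combined in one helper that works on a copy instead of modifying its input. This is a sketch only; the name `elapsed_years` and the toy frame are assumptions for illustration:

```python
import pandas as pd

def elapsed_years(data, year_vars, sold_var="YrSold"):
    """Return a copy where each year column holds the years elapsed until the sale."""
    data = data.copy()
    for var in year_vars:
        data[var] = data[sold_var] - data[var]
    # The sale year is only needed to compute the differences.
    return data.drop(columns=sold_var)

toy = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1976],
    "YrSold": [2008, 2007],
})
toy_elapsed = elapsed_years(toy, ["YearBuilt", "YearRemodAdd"])
print(toy_elapsed)
```

Working on a copy keeps the original frame intact, which makes the cell safe to re-run.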

### 5. Numerical Variable Transformation
* Logarithmic transformation

Some continuous numerical variables are not normally distributed.

Here I transform those numerical variables that are not normally distributed and take only strictly positive values with the logarithmic transformation, in order to get a more Gaussian-like distribution.

In [36]:
# first let's capture all continuous variables
cont_vars = [var for var in num_vars if x_train[var].nunique() > 20]
cont_vars

['LotFrontage',
 'LotArea',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 'ScreenPorch',
 'MiscVal']

In [37]:
# number of continuous variables
len(cont_vars)

17

In [38]:
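# first let's capture the continuous variables that take only strictly positive values.
# note: all(x_train[var]) > 0 only tests for non-zero values, so we compare element-wise instead
trans_cont_vars = [var for var in cont_vars if (x_train[var] > 0).all()]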
trans_cont_vars

['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea']

* Here we apply
1. Logarithmic transformation on ['LotFrontage','1stFlrSF', 'GrLivArea']
2. Yeo-Johnson transformation on ['LotArea']

#### 5.1 Logarithmic Transformation

In [39]:
# let's apply a logarithmic transformation
for var in ['LotFrontage','1stFlrSF', 'GrLivArea']:
    x_train[var] = np.log(x_train[var])
    x_test[var] = np.log(x_test[var])
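
One way to check whether a logarithmic transformation helps is to compare the skewness before and after. The sketch below uses synthetic log-normal data (an assumption, not the real columns) and `scipy.stats.skew`:

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed, strictly positive values (not taken from the dataset).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=9.0, sigma=0.5, size=1000)

skew_before = stats.skew(values)
skew_after = stats.skew(np.log(values))

print(round(skew_before, 2), round(skew_after, 2))
```

A skewness closer to 0 after the transformation suggests a more symmetric, Gaussian-like distribution.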

#### 5.2 Yeo-Johnson transformation

In [40]:
# Let's apply the Yeo-Johnson transformation to the train set of LotArea.
x_train["LotArea"], param = stats.yeojohnson(x_train["LotArea"])
print(param)
x_test["LotArea"] = stats.yeojohnson(x_test["LotArea"], lmbda = param)

0.01775557036572992


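The fitted exponent (λ ≈ 0.018) is close to 0. For non-negative values, the Yeo-Johnson transformation with λ = 0 is log(x + 1), so LotArea is transformed almost logarithmically, which changes its distribution substantially.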
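
The fit-on-train, reuse-on-test pattern for Yeo-Johnson can be sketched on synthetic data (the arrays below are assumptions, not the real `LotArea` column). The last line also checks the special case λ = 0, where the transformation of non-negative values equals `log(x + 1)`:

```python
import numpy as np
import scipy.stats as stats

# Synthetic positive, right-skewed values standing in for a train and a test column.
rng = np.random.default_rng(0)
train_values = rng.lognormal(mean=9.0, sigma=0.5, size=500)
test_values = rng.lognormal(mean=9.0, sigma=0.5, size=50)

# Without lmbda, scipy estimates the exponent and returns it with the transformed data.
train_transformed, fitted_lambda = stats.yeojohnson(train_values)

# With lmbda given, only the transformed data is returned.
test_transformed = stats.yeojohnson(test_values, lmbda=fitted_lambda)

# For non-negative inputs, lambda = 0 reduces to log(x + 1).
matches_log1p = np.allclose(stats.yeojohnson(test_values, lmbda=0), np.log1p(test_values))
print(fitted_lambda, matches_log1p)
```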

#### 5.3 Binarize Skewed Variables

A few variables are very skewed. We will transform those into binary variables.

In [41]:
# the variables that are too skewed are
skewed_vars = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]

In [42]:
# now let's binarize these variables
for var in skewed_vars:
    x_train[var] = np.where(x_train[var] == 0, 0, 1)
    x_test[var] = np.where(x_test[var] == 0, 0, 1)
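
On a toy column (hypothetical values), the binarization simply records whether the feature is present at all:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ScreenPorch": [0, 120, 0, 35]})

# 0 stays 0; any non-zero area becomes 1.
toy["ScreenPorch"] = np.where(toy["ScreenPorch"] == 0, 0, 1)

# The mean of a 0/1 column is the share of houses that have the feature.
share_nonzero = toy["ScreenPorch"].mean()
print(toy["ScreenPorch"].tolist(), share_nonzero)
```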

### 6. Categorical Variables Transformation

#### 6.1 Apply mappings

These are variables whose values have an assigned order, related to quality.

In [43]:
# re-map strings to numbers, which determine quality

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond',
            ]

for var in qual_vars:
    x_train[var] = x_train[var].map(qual_mappings)
    x_test[var] = x_test[var].map(qual_mappings)

In [44]:
# Exposure Mapping
exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

var = 'BsmtExposure'

x_train[var] = x_train[var].map(exposure_mappings)
x_test[var] = x_test[var].map(exposure_mappings)

In [45]:
# Finish Mappings
finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

finish_vars = ['BsmtFinType1', 'BsmtFinType2']

for var in finish_vars:
    x_train[var] = x_train[var].map(finish_mappings)
    x_test[var] = x_test[var].map(finish_mappings)

In [46]:
# Garage Mappings
garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

var = 'GarageFinish'

x_train[var] = x_train[var].map(garage_mappings)
x_test[var] = x_test[var].map(garage_mappings)

In [47]:
# Fence Mappings
fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

var = 'Fence'

x_train[var] = x_train[var].map(fence_mappings)
x_test[var] = x_test[var].map(fence_mappings)

In [48]:
# check that there are no missing values in the training set
[var for var in x_train.columns if x_train[var].isnull().sum() > 0]

[]
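
The check above matters because `Series.map` with a dictionary returns NaN for any label that is not a key. A small sketch with a deliberately unmapped label ("Excellent" is a made-up value) shows how such labels can be listed before mapping:

```python
import pandas as pd

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

# "Excellent" is a hypothetical label that is absent from the mapping.
toy = pd.Series(["Gd", "TA", "Excellent", "Missing"], name="ExterQual")

# Labels that would silently become NaN after mapping.
unmapped = sorted(set(toy.dropna()) - set(qual_mappings))

mapped = toy.map(qual_mappings)
print(unmapped, mapped.isnull().sum())
```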

#### 6.2 Removing Rare Labels

For the remaining categorical variables, we will group those categories that are present in less than 1% of the observations. That is, all values of categorical variables that are shared by less than 1% of houses will be replaced by the string "Rare".
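
The idea can be sketched on a toy column. The helper name `frequent_labels` and the toy frames are assumptions for illustration. Note that a label unseen in the train set ("Dirt" here) also ends up as "Rare" in the test set:

```python
import numpy as np
import pandas as pd

def frequent_labels(train, var, rare_percent):
    """Labels whose share of the train observations exceeds `rare_percent`."""
    freq = train[var].value_counts(normalize=True)
    return freq[freq > rare_percent].index

# "Grvl" appears in 1 of 200 train rows (0.5%), below the 1% threshold.
toy_train = pd.DataFrame({"Street": ["Pave"] * 199 + ["Grvl"]})
toy_test = pd.DataFrame({"Street": ["Pave", "Grvl", "Dirt"]})

keep = frequent_labels(toy_train, "Street", rare_percent=0.01)
toy_train["Street"] = np.where(toy_train["Street"].isin(keep), toy_train["Street"], "Rare")
toy_test["Street"] = np.where(toy_test["Street"].isin(keep), toy_test["Street"], "Rare")

print(list(keep), toy_test["Street"].tolist())
```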

In [49]:
# first let's mask the quality variables
quality_vars = qual_vars + finish_vars + ['BsmtExposure','GarageFinish','Fence']

In [50]:
# number of quality variables
len(quality_vars)

14

In [51]:
# now let's collect all the remaining categorical variables which are not quality variables in a list
cat_others = [var for var in cat_vars if var not in quality_vars]
cat_others

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'Foundation',
 'Heating',
 'CentralAir',
 'Electrical',
 'Functional',
 'GarageType',
 'PavedDrive',
 'PoolQC',
 'MiscFeature',
 'SaleType',
 'SaleCondition',
 'MSSubClass']

In [52]:
# number of the remaining variables
len(cat_others)

30

In [53]:
# now let's replace every category in cat_others that appears in less than 1% of the observations with the string "Rare"
def analyse_rare(data, var, rare_percent):
    temp = data.groupby(var)[var].count()/ len(data)
    return temp[temp > rare_percent].index

for var in cat_others:
    frequent_ls = analyse_rare(x_train, var, rare_percent=0.01)
    print(var, frequent_ls)
    print()
    
    x_train[var] = np.where(x_train[var].isin(frequent_ls) , x_train[var], "Rare")
    x_test[var] = np.where(x_test[var].isin(frequent_ls), x_test[var], "Rare")

MSZoning Index(['FV', 'RH', 'RL', 'RM'], dtype='object', name='MSZoning')

Street Index(['Pave'], dtype='object', name='Street')

Alley Index(['Grvl', 'Missing', 'Pave'], dtype='object', name='Alley')

LotShape Index(['IR1', 'IR2', 'Reg'], dtype='object', name='LotShape')

LandContour Index(['Bnk', 'HLS', 'Low', 'Lvl'], dtype='object', name='LandContour')

Utilities Index(['AllPub'], dtype='object', name='Utilities')

LotConfig Index(['Corner', 'CulDSac', 'FR2', 'Inside'], dtype='object', name='LotConfig')

LandSlope Index(['Gtl', 'Mod'], dtype='object', name='LandSlope')

Neighborhood Index(['Blmngtn', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
       'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel', 'NAmes', 'NWAmes',
       'NoRidge', 'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW',
       'Somerst', 'StoneBr', 'Timber'],
      dtype='object', name='Neighborhood')

Condition1 Index(['Artery', 'Feedr', 'Norm', 'PosN', 'RRAn'], dtype='object', name='Condition1')

Con

#### 6.3 Encoding of categorical variables

Next, we need to transform the strings of the categorical variables into numbers. 

We will do it so that we capture the monotonic relationship between the label and the target.
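
A toy example of this target-guided ordinal encoding, with hypothetical log prices: categories are ranked by their mean target value in the train set, and the rank becomes the code.

```python
import pandas as pd

toy_X = pd.DataFrame({"MSZoning": ["RM", "RL", "RL", "FV", "RM", "FV"]})
toy_y = pd.Series([11.0, 12.0, 12.2, 12.6, 11.2, 12.4], name="SalePrice")

# Mean target per category, computed on the training data only.
means = pd.concat([toy_X, toy_y], axis=1).groupby("MSZoning")["SalePrice"].mean()

# Rank the categories from lowest to highest mean target.
ordered = means.sort_values().index
ordinal_dict = {label: rank for rank, label in enumerate(ordered)}

encoded = toy_X["MSZoning"].map(ordinal_dict)
print(ordinal_dict)
```

A category that appears only in the test set has no entry in the dictionary and would map to NaN, which is one reason to check for missing values after encoding.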

In [59]:
# let's define a function that replaces the categories of a variable with numbers
def replace_categories(x_train, y_train, x_test, var , target):
    
    # first concatenate the training features and the training target
    temp = pd.concat([x_train, y_train], axis = 1)


    # now let's group this data frame by the selected feature
    ordered_labels = temp.groupby(var)[target].mean().sort_values().index
    
    # let's create a dictionary
    ordinal_dict = { k : i for i , k in enumerate(ordered_labels)}
    print(var, ordinal_dict)
    print()
    
    x_train[var] = x_train[var].map(ordinal_dict)
    x_test[var] = x_test[var].map(ordinal_dict)

for var in cat_others:
    replace_categories(x_train, y_train, x_test, var, "SalePrice")

MSZoning {'Rare': 0, 'RM': 1, 'RH': 2, 'RL': 3, 'FV': 4}

Street {'Rare': 0, 'Pave': 1}

Alley {'Grvl': 0, 'Pave': 1, 'Missing': 2}

LotShape {'Reg': 0, 'IR1': 1, 'Rare': 2, 'IR2': 3}

LandContour {'Bnk': 0, 'Lvl': 1, 'Low': 2, 'HLS': 3}

Utilities {'Rare': 0, 'AllPub': 1}

LotConfig {'Inside': 0, 'FR2': 1, 'Corner': 2, 'Rare': 3, 'CulDSac': 4}

LandSlope {'Gtl': 0, 'Mod': 1, 'Rare': 2}

Neighborhood {'IDOTRR': 0, 'MeadowV': 1, 'BrDale': 2, 'Edwards': 3, 'BrkSide': 4, 'OldTown': 5, 'Sawyer': 6, 'SWISU': 7, 'NAmes': 8, 'Mitchel': 9, 'SawyerW': 10, 'Rare': 11, 'NWAmes': 12, 'Gilbert': 13, 'Blmngtn': 14, 'CollgCr': 15, 'Crawfor': 16, 'ClearCr': 17, 'Somerst': 18, 'Timber': 19, 'StoneBr': 20, 'NridgHt': 21, 'NoRidge': 22}

Condition1 {'Artery': 0, 'Feedr': 1, 'Norm': 2, 'RRAn': 3, 'Rare': 4, 'PosN': 5}

Condition2 {'Rare': 0, 'Norm': 1}

BldgType {'2fmCon': 0, 'Duplex': 1, 'Twnhs': 2, '1Fam': 3, 'TwnhsE': 4}

HouseStyle {'SFoyer': 0, '1.5Fin': 1, 'Rare': 2, '1Story': 3, 'SLvl': 4, '2Story'

In [63]:
# first let's check the existence of nan values in the training set
x_train.isnull().sum().sum()

0

In [64]:
# now let's check the existence of nan values in the test set
x_test.isnull().sum().sum()

0

#### Feature Scaling

For use in linear models, features need to be scaled. We will scale each feature to the [0, 1] range using its minimum and maximum values in the training set:

In [65]:
# first let's initialize the minmax scaler
scaler = MinMaxScaler()

In [66]:
# now let's fit the scaler to the training set
scaler.fit(x_train)

In [67]:
# now let's transform the training and test sets.
x_train = pd.DataFrame(scaler.transform(x_train), columns = x_train.columns)
x_test = pd.DataFrame(scaler.transform(x_test), columns = x_train.columns)
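
Two details of this step are worth seeing on a toy example (hypothetical values). First, because the scaler is fitted on the train set only, scaled test values can fall outside [0, 1]. Second, building a new `pd.DataFrame` without `index=` resets the row labels to 0, 1, 2, ...; passing the original index, as in this sketch, keeps the rows aligned with `y_train` and `y_test` by label:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0]}, index=[930, 656, 45])
toy_test = pd.DataFrame({"GrLivArea": [2500.0, 1250.0]}, index=[529, 491])

# Minimum and maximum are learned from the train set only.
scaler = MinMaxScaler().fit(toy_train)

scaled_train = pd.DataFrame(scaler.transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
scaled_test = pd.DataFrame(scaler.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

# 2500 lies above the train maximum, so its scaled value exceeds 1.
print(scaled_test)
```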


In [68]:
# finally let's visualize the training set
x_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
0,0.75,0.75,0.461171,0.366365,1.0,1.0,0.333333,1.0,1.0,0.0,0.0,0.863636,0.4,1.0,0.75,0.6,0.777778,0.5,0.014706,0.04918,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,0.5,1.0,0.666667,0.666667,0.666667,1.0,0.002835,0.0,0.0,0.673479,0.239935,1.0,1.0,1.0,1.0,0.55976,0.0,0.0,0.52325,0.0,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.0,0.0,0.75,0.018692,1.0,0.75,0.430183,0.5,0.5,1.0,0.116686,0.032907,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.545455,0.666667,0.75,0.0,0.0,0.0
1,0.75,0.75,0.456066,0.388528,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.363636,0.4,1.0,0.75,0.6,0.444444,0.75,0.360294,0.04918,0.0,0.0,0.6,0.6,0.666667,0.03375,0.666667,0.5,0.5,0.333333,0.666667,0.0,0.8,0.142807,0.0,0.0,0.114724,0.17234,1.0,1.0,1.0,1.0,0.434539,0.0,0.0,0.406196,0.333333,0.0,0.333333,0.5,0.375,0.333333,0.666667,0.25,1.0,0.0,0.0,0.75,0.457944,0.5,0.25,0.220028,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.75,1.0,0.0,0.636364,0.666667,0.75,0.0,0.0,0.0
2,0.916667,0.75,0.394699,0.336782,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.954545,0.4,1.0,1.0,0.6,0.888889,0.5,0.036765,0.098361,1.0,0.0,0.3,0.2,0.666667,0.2575,1.0,0.5,1.0,1.0,0.666667,0.0,1.0,0.080794,0.0,0.0,0.601951,0.286743,1.0,1.0,1.0,1.0,0.627205,0.0,0.0,0.586296,0.333333,0.0,0.666667,0.0,0.25,0.333333,1.0,0.333333,1.0,0.333333,0.8,0.75,0.046729,0.5,0.5,0.406206,0.5,0.5,1.0,0.228705,0.149909,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.090909,0.666667,0.75,0.0,0.0,0.0
3,0.75,0.75,0.445002,0.48228,1.0,1.0,0.666667,0.666667,1.0,0.0,0.0,0.454545,0.4,1.0,0.75,0.6,0.666667,0.5,0.066176,0.163934,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,0.5,1.0,0.666667,0.666667,1.0,1.0,0.25567,0.0,0.0,0.018114,0.242553,1.0,1.0,1.0,1.0,0.56692,0.0,0.0,0.529943,0.333333,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.25,1.0,0.333333,0.4,0.75,0.084112,0.5,0.5,0.362482,0.5,0.5,1.0,0.469078,0.045704,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.636364,0.666667,0.75,1.0,0.0,0.0
4,0.75,0.75,0.577658,0.391756,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.363636,0.4,1.0,0.75,0.6,0.555556,0.5,0.323529,0.737705,0.0,0.0,0.6,0.7,0.666667,0.17,0.333333,0.5,0.5,0.333333,0.666667,0.0,0.6,0.086818,0.0,0.0,0.434278,0.233224,1.0,0.75,1.0,1.0,0.549026,0.0,0.0,0.513216,0.0,0.0,0.666667,0.0,0.375,0.333333,0.333333,0.416667,1.0,0.333333,0.8,0.75,0.411215,0.5,0.5,0.406206,0.5,0.5,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.545455,0.666667,0.75,0.0,0.0,0.0


In [69]:
# let's visualize the test set
x_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
0,0.75,0.75,0.445002,0.620348,1.0,1.0,0.333333,0.333333,1.0,1.0,0.0,0.727273,0.4,1.0,0.75,0.6,0.555556,0.25,0.367647,0.540984,1.0,0.0,0.1,0.5,0.333333,0.064873,0.666667,0.5,1.0,0.333333,0.666667,0.0,0.4,0.215982,0.0,0.0,0.379006,0.333061,1.0,0.5,1.0,1.0,0.764014,0.0,0.0,0.714182,0.333333,0.0,1.0,0.0,0.5,0.666667,0.333333,0.583333,0.0,0.666667,0.6,0.75,0.299065,0.5,0.5,0.341326,0.5,0.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.181818,0.666667,0.0,1.0,1.0,0.0
1,0.416667,0.75,0.490408,0.378248,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.363636,0.0,1.0,0.75,0.2,0.555556,0.75,0.477941,0.934426,0.0,0.0,0.1,0.1,0.333333,0.0,0.333333,0.5,0.5,0.333333,0.666667,0.0,0.6,0.071403,0.4,1.0,0.110543,0.131915,1.0,0.5,1.0,0.666667,0.398758,0.331197,0.0,0.549294,0.333333,0.0,0.333333,0.0,0.375,0.333333,0.0,0.25,1.0,0.666667,0.6,0.75,0.607477,0.0,0.25,0.169252,0.5,0.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.75,1.0,0.0,0.636364,0.666667,0.75,0.0,0.0,0.0
2,0.416667,0.75,0.445002,0.319872,1.0,1.0,0.333333,0.0,1.0,0.5,0.0,0.181818,0.4,1.0,0.75,0.2,0.444444,0.375,0.433824,0.983607,0.0,0.0,0.3,0.2,0.0,0.100625,0.333333,0.5,0.5,0.333333,0.666667,0.0,0.2,0.032778,0.0,0.0,0.243381,0.116039,1.0,0.5,1.0,1.0,0.406964,0.119658,0.0,0.453307,0.333333,0.0,0.333333,0.0,0.375,0.333333,0.666667,0.25,1.0,0.333333,0.6,0.25,0.551402,0.0,0.25,0.248237,0.5,0.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.545455,0.666667,0.75,1.0,0.0,0.0
3,1.0,0.75,0.50869,0.388489,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.772727,0.4,1.0,0.75,1.0,0.666667,0.5,0.227941,0.52459,1.0,0.0,0.7,0.7,0.666667,0.186875,0.333333,0.5,0.5,0.666667,0.666667,0.0,0.6,0.069454,0.0,0.0,0.356712,0.189853,1.0,1.0,1.0,1.0,0.469855,0.462607,0.0,0.636999,0.0,0.0,0.666667,0.5,0.5,0.333333,0.333333,0.5,1.0,0.333333,0.6,0.75,0.28972,1.0,0.5,0.356135,0.5,0.5,1.0,0.336056,0.213894,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.181818,0.666667,0.75,0.0,0.0,0.0
4,0.333333,0.25,0.0,0.04803,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.090909,0.4,1.0,0.5,1.0,0.555556,0.5,0.286765,0.655738,0.0,0.0,0.6,0.5,0.666667,0.238125,0.333333,0.5,0.5,0.333333,0.666667,0.0,0.0,0.0,0.0,0.0,0.243846,0.085925,1.0,0.5,1.0,1.0,0.171149,0.302885,0.0,0.419061,0.0,0.0,0.333333,0.5,0.375,0.333333,0.333333,0.333333,1.0,0.0,0.0,0.25,0.364486,0.0,0.25,0.186178,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.181818,0.666667,0.5,0.0,0.0,0.0


### 7. Save The dataset
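
The notebook does not show the saving code. A minimal sketch of how the processed data and the fitted scaler could be persisted, assuming CSV for the frames and `joblib` (imported at the top) for the scaler, is given below. The file names, the temporary output directory and the toy frame are hypothetical:

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the processed train set and the fitted scaler.
toy_train = pd.DataFrame({"GrLivArea": [0.0, 0.5, 1.0]})
scaler = MinMaxScaler().fit(toy_train)

# Hypothetical output location; a real project would use its own data folder.
out_dir = Path(tempfile.mkdtemp())

toy_train.to_csv(out_dir / "xtrain.csv", index=False)
joblib.dump(scaler, out_dir / "minmax_scaler.joblib")

# Reload to confirm that both artifacts can be read back.
reloaded_train = pd.read_csv(out_dir / "xtrain.csv")
reloaded_scaler = joblib.load(out_dir / "minmax_scaler.joblib")
print(reloaded_train.shape)
```

Saving the scaler alongside the data allows the same transformation to be applied to new observations later.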