# <font color='blue'>**House Prices**</font><font color='orange'> *using* **Advanced Regression Techniques</font>**

<font color='green'>Predict sales prices through **Feature Engineering** and **Advance Regression Analysis**.

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

![Image Source here](http://www.agarwalestates.com/images/houses-for-buy-sell-rent.jpg)

**<font color='blue'>Flow of Work</font>**
- Import Data (from local drive/ **Google drive**/ other sources)
- <font color='green'>Get preview on data </font>
  - **Shape** of dataset - Records and Features
  - **Summary statistics** - Unique values, Minimum & Maximum values, Missing Values
  - Distinguish between Continuous and Categorical variables
  - Compare Train and Test dataset (**Missing values**)
- <font color='green'>Vizualize data</font> through Bivariate analysis on Continuous and Categorical variables 

- <font color='green'>Feature Engineering</font>
  - Missing Value Imputation
  - Outlier Treatment
  - One-Hot Encoding (Categorical variables)

- <font color='green'>Advance Regression Models</font>
  - Evaluate each models using 
    - Entire training dataset
    - Train-test split
    - K-fold cross validation
- <font color='green'>Combine results</font> from all/selective regression and ensemble models to achieve improved prediction


# Step 0: Import Libraries

In [0]:
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.graph_objs as go
import warnings
warnings.filterwarnings("ignore")

# Step 1: Get Data from Google Drive

In [0]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [144]:
link_train="https://drive.google.com/file/d/12avQTyRXhXpgMrdMoXcrQKTzGbuCNScX/view?usp=sharing"
link_test="https://drive.google.com/file/d/1mpDqOIUVb3ZiV9GdWe9VUBKKOxWzxpLC/view?usp=sharing"

train = drive.CreateFile({'id':'1mpDqOIUVb3ZiV9GdWe9VUBKKOxWzxpLC'}) 
train.GetContentFile('train.csv')  
train=pd.read_csv('train.csv')

test = drive.CreateFile({'id':'12avQTyRXhXpgMrdMoXcrQKTzGbuCNScX'}) 
test.GetContentFile('test.csv')  
test=pd.read_csv('test.csv',encoding='ISO-8859-1')

print("Training set = ",train.shape)
print("Testing set = ",test.shape)

Training set =  (1460, 81)
Testing set =  (1459, 80)


# Step 2: Initial Assesment of Data
- Summary Statistics
- Continuous and Categorical Variables
- Note Missing Values

In [145]:
# Detailed Summary
s=train.describe(include='all')
s

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,...,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
count,1460.0,1460.0,1460,1201.0,1460.0,1460,91,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460.0,1460.0,1460.0,1460.0,1460,1460,1460,1460,1452.0,1452.0,1460,1460,1460,1423,1423,1422,1423,1460.0,1422,1460.0,1460.0,1460.0,1460,...,1460,1459,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460,1460.0,1460,1460.0,770,1379,1379.0,1379,1460.0,1460.0,1379,1379,1460,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,7,281,54,1460.0,1460.0,1460.0,1460,1460,1460.0
unique,,,5,,,2,2,4,4,2,5,3,25,9,8,5,8,,,,,6,8,15,16,4.0,,4,5,6,4,4,4,6,,6,,,,6,...,2,5,,,,,,,,,,,4,,7,,5,6,,3,,,5,5,3,,,,,,,3,4,4,,,,9,6,
top,,,RL,,,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,,,,,Gable,CompShg,VinylSd,VinylSd,,,TA,TA,PConc,TA,TA,No,Unf,,Unf,,,,GasA,...,Y,SBrkr,,,,,,,,,,,TA,,Typ,,Gd,Attchd,,Unf,,,TA,TA,Y,,,,,,,Gd,MnPrv,Shed,,,,WD,Normal,
freq,,,1151,,,1454,50,925,1311,1459,1052,1382,225,1260,1445,1220,726,,,,,1141,1434,515,504,864.0,,906,1282,647,649,1311,953,430,,1256,,,,1428,...,1365,1334,,,,,,,,,,,735,,1360,,380,870,,605,,,1311,1326,1340,,,,,,,3,157,49,,,,1267,1198,
mean,730.5,56.89726,,70.049958,10516.828082,,,,,,,,,,,,,6.099315,5.575342,1971.267808,1984.865753,,,,,,103.685262,,,,,,,,443.639726,,46.549315,567.240411,1057.429452,,...,,,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,,6.517808,,0.613014,,,1978.506164,,1.767123,472.980137,,,,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,,,,43.489041,6.321918,2007.815753,,,180921.19589
std,421.610009,42.300571,,24.284752,9981.264932,,,,,,,,,,,,,1.382997,1.112799,30.202904,20.645407,,,,,,181.066207,,,,,,,,456.098091,,161.319273,441.866955,438.705324,,...,,,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,,1.625393,,0.644666,,,24.689725,,0.747315,213.804841,,,,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,,,,496.123024,2.703626,1.328095,,,79442.502883
min,1.0,20.0,,21.0,1300.0,,,,,,,,,,,,,1.0,1.0,1872.0,1950.0,,,,,,0.0,,,,,,,,0.0,,0.0,0.0,0.0,,...,,,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,,2.0,,0.0,,,1900.0,,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,1.0,2006.0,,,34900.0
25%,365.75,20.0,,59.0,7553.5,,,,,,,,,,,,,5.0,5.0,1954.0,1967.0,,,,,,0.0,,,,,,,,0.0,,0.0,223.0,795.75,,...,,,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,,5.0,,0.0,,,1961.0,,1.0,334.5,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,5.0,2007.0,,,129975.0
50%,730.5,50.0,,69.0,9478.5,,,,,,,,,,,,,6.0,5.0,1973.0,1994.0,,,,,,0.0,,,,,,,,383.5,,0.0,477.5,991.5,,...,,,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,,6.0,,1.0,,,1980.0,,2.0,480.0,,,,0.0,25.0,0.0,0.0,0.0,0.0,,,,0.0,6.0,2008.0,,,163000.0
75%,1095.25,70.0,,80.0,11601.5,,,,,,,,,,,,,7.0,6.0,2000.0,2004.0,,,,,,166.0,,,,,,,,712.25,,0.0,808.0,1298.25,,...,,,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,,7.0,,1.0,,,2002.0,,2.0,576.0,,,,168.0,68.0,0.0,0.0,0.0,0.0,,,,0.0,8.0,2009.0,,,214000.0


In [146]:
continuous_variables = ['MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt', 
                        'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1','BsmtFinSF2', 
                        'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
                        'LowQualFinSF', 'GrLivArea', 'GarageYrBlt','GarageArea', 
                        'WoodDeckSF','OpenPorchSF', 'EnclosedPorch','3SsnPorch', 
                        'ScreenPorch', 'PoolArea','MiscVal']

categorical_variables = ['MSZoning', 'Street', 'Alley','LotShape','LandContour', 
                         'Utilities','LotConfig', 'LandSlope', 'Neighborhood', 
                         'Condition1', 'Condition2','BldgType', 'HouseStyle', 
                         'RoofStyle', 'RoofMatl', 'Exterior1st','Exterior2nd', 
                         'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
                         'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
                         'BsmtFinType2','Heating', 'HeatingQC', 'CentralAir', 
                         'Electrical', 'KitchenQual','Functional','FireplaceQu', 
                         'GarageType', 'GarageFinish','GarageQual','GarageCond', 
                         'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
                         'SaleType','SaleCondition','OverallQual','OverallCond',
                         'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath',
                         'BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd',
                         'Fireplaces','GarageCars','YrSold','MoSold']

print("\nContinuous Variables:\n",continuous_variables)
print("\nCategorical Variables:\n",categorical_variables)


Continuous Variables:
 ['MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageYrBlt', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

Categorical Variables:
 ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtH

## Step 2.1: Preview Data (for Outliers)
---
*Not all data influence House Price, so need to find variables that actually influences House Price*

- Bi-variate Analysis
  - Categorical : Scatter plot
  - Continuous  : Scatter plot


In [147]:
# Bivariate Analysis - Categorical Values
train_temp = train
train_temp.sort_values('SalePrice', axis = 0, ascending = True, inplace = True)  

nrow=1
ncol=1
fig = make_subplots(rows=12, cols=5,subplot_titles=categorical_variables,horizontal_spacing = 0.05,vertical_spacing = 0.03)                                                
for i in categorical_variables:
  fig.add_trace(go.Scatter(x=train_temp[i], y=train_temp.SalePrice,mode="markers+text"),row=nrow, col=ncol)              
  ncol=ncol+1
  if ncol==6:
    ncol=1
    nrow=nrow+1              

fig.update_layout(height=3000, width=1200,title_text="Categorical Variables: Influence on SalePrice")
fig.update_yaxes(matches=None)
fig.update_xaxes(matches=None)
fig.show()

In [148]:
# Univariate Analysis - Continuous Values
nrow=1
ncol=1
fig = make_subplots(rows=5, cols=5,subplot_titles=continuous_variables,horizontal_spacing = 0.05,vertical_spacing = 0.04)                                                
for i in continuous_variables:
  fig.add_trace(go.Scatter(x=train_temp[i], y=train_temp.SalePrice,mode="markers+text"),row=nrow, col=ncol)              
  ncol=ncol+1
  if ncol==6:
    ncol=1
    nrow=nrow+1              

fig.update_layout(height=1200, width=1200,title_text="Continuous Variables: Influence on SalePrice")
fig.update_yaxes(matches=None)
fig.update_xaxes(matches=None)
fig.show()

From scatter plot drawn from continuous and categorical variables, it is obvious that outliers influence the decision variables. Further, heteroskedascity is also visible across multiple variable (which is crucial for Linear Regression). 

## Step 2.2: Preview Data (for Missing Values)
- Training Data
- Testing Data


In [149]:
# Missing Value Count - Remove 5 missing value in one columns as ignore
train_missing = train.isna().sum()[train.isna().sum()>0]
train_col = train.columns[train.isna().sum()>0]
test_missing  = test.isna().sum()[test.isna().sum()>0]
test_col = test.columns[test.isna().sum()>0]

#print("\nTraining Set: Columns with missing values:\n",train_missing)
#print("\nTesting  Set: Columns with missing values:\n",test_missing)

# Plot Graph
temp_df = pd.DataFrame()

temp_df['Variable'] = train_col
temp_df['Missing'] = train_missing.values
temp_df['Data'] = 'Training'

temp_df1 = pd.DataFrame()
temp_df1['Variable'] = test_col
temp_df1['Missing'] = test_missing.values
temp_df1['Data'] = 'Testing'

temp_df = temp_df.append(temp_df1)
fig = px.bar(temp_df,x='Variable',y='Missing',color='Missing',facet_col='Data',height=350)
fig.show()

# <font color='red'>Step 3: Feature Engineering</font>

## Step 3.1: Missing Values Treatment

In [150]:
def replace_na(dataframe,columns,value):
  for i in columns:
    dataframe[i]=dataframe[i].fillna(value)
  return dataframe

def remove_na_rows(dataframe,n):
  for i in dataframe.columns:
    index = dataframe[i].isna()    
    if index.sum()>0 and index.sum()<=n:
      dataframe = dataframe[-index]      
  return dataframe

print("before missing imputing",train.shape,test.shape)

# Remove and Impute Train Dataset
columns=['LotFrontage','MasVnrArea']
train = replace_na(train,columns,0)
columns=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','MasVnrType','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']
train = replace_na(train,columns,'None')
train              = train[-train.Electrical.isna()] # Remove 1 missing value
train.GarageYrBlt  = train.GarageYrBlt.fillna(train.GarageYrBlt.max())

print("na replace imputing",train.shape,test.shape)

# Remove and Impute Test Dataset 
test = remove_na_rows(test,5) 
columns=['LotFrontage','MasVnrArea']
test = replace_na(test,columns,0)
columns=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','MasVnrType','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']
test = replace_na(test,columns,'None')
test.GarageYrBlt  = test.GarageYrBlt.fillna(test.GarageYrBlt.max())
print("remove na rows imputing",train.shape,test.shape)

# Check modifications
print(" Missing value in Training Dataset = ",
      train.isna().sum().sum(),
      "\n Missing value in Testing Dataset  = ",
      test.isna().sum().sum())
print(train.shape,test.shape)

before missing imputing (1460, 81) (1459, 80)
na replace imputing (1459, 81) (1459, 80)
remove na rows imputing (1459, 81) (1447, 80)
 Missing value in Training Dataset =  0 
 Missing value in Testing Dataset  =  0
(1459, 81) (1447, 80)


## Step 3.2: Outlier Treatment

- Outliers can be spotted clearly in all scatter plots. Thus, it would be better to remove the outliers in order to prevent influence on SalePrice.

- Further, it helps in variable selection (variable with least participation)

In [151]:
# Remove SalePrice Outlier (Changes to Test Dataset : Not Required)
train = train[train.SalePrice < 700000]

# Drop Columns:  Redundant or Insuficient Data in Training and Testing Dataset
train = train.drop(['TotalBsmtSF'],axis=1)  # Redundant Column: May lead to multi-colinearity
train = train.drop(['TotRmsAbvGrd'],axis=1) # Redundant Column: May lead to multi-colinearity
train = train.drop(['PoolArea'],axis=1)     # Not sufficient data
test  = test.drop(['TotalBsmtSF'],axis=1)  # Redundant Column: May lead to multi-colinearity
test  = test.drop(['TotRmsAbvGrd'],axis=1) # Redundant Column: May lead to multi-colinearity
test  = test.drop(['PoolArea'],axis=1)     # Not sufficient data

# Impute Outlier with Mean - Training Dataset
train.LotArea[train.LotArea>25000]    = train.LotArea.mean()
train.MasVnrArea[train.MasVnrArea>2000]  = train.MasVnrArea.mean() 
train.BsmtFinSF1[train.BsmtFinSF1>2500]  = train.BsmtFinSF1.mean()
train.BsmtFinSF2[train.BsmtFinSF2>2000]  = train.BsmtFinSF2.mean()
train.BsmtUnfSF[train.BsmtUnfSF>3000]   = train.BsmtUnfSF.mean()
train[train['1stFlrSF']>2500]['1stFlrSF'] = train['1stFlrSF'].mean()
train.LowQualFinSF[train.LowQualFinSF>50]  = train.LowQualFinSF.mean()
train.GrLivArea[train.GrLivArea>3500]   = train.GrLivArea.mean()
train.OpenPorchSF[train.OpenPorchSF>350]  = train.OpenPorchSF.mean()
train.EnclosedPorch[train.EnclosedPorch>350]= train.EnclosedPorch.mean()
train[train['3SsnPorch']>350]['3SsnPorch'] = train['3SsnPorch'].mean()
train.ScreenPorch[train.ScreenPorch>350]  = train.ScreenPorch.mean()
train.MiscVal[train.MiscVal>1000]     = train.MiscVal.mean()

# Impute Outlier with Mean - Testing Dataset
test.LotArea[test.LotArea>25000]    = test.LotArea.mean()
test.MasVnrArea[test.MasVnrArea>2000]  = test.MasVnrArea.mean() 
test.BsmtFinSF1[test.BsmtFinSF1>2500]  = test.BsmtFinSF1.mean()
test.BsmtFinSF2[test.BsmtFinSF2>2000]  = test.BsmtFinSF2.mean()
test.BsmtUnfSF[test.BsmtUnfSF>3000]   = test.BsmtUnfSF.mean()
test[test['1stFlrSF']>2500]['1stFlrSF'] = test['1stFlrSF'].mean()
test.LowQualFinSF[test.LowQualFinSF>50]  = test.LowQualFinSF.mean()
test.GrLivArea[test.GrLivArea>3500]   = test.GrLivArea.mean()
test.OpenPorchSF[test.OpenPorchSF>350]  = test.OpenPorchSF.mean()
test.EnclosedPorch[test.EnclosedPorch>350]= test.EnclosedPorch.mean()
test[test['3SsnPorch']>350]['3SsnPorch'] = test['3SsnPorch'].mean()
test.ScreenPorch[test.ScreenPorch>350]  = test.ScreenPorch.mean()
test.MiscVal[test.MiscVal>1000]     = test.MiscVal.mean()

print(train.shape,test.shape)

(1457, 78) (1447, 77)


## Step 3.3: One-Hot Encoding
- Most categorical variables are ordinal. Thus, value need not to be altered. 
- For remaining variables, new columns should be created. (Note: For varaibles with two levels, only one value is to be imputed by 1. )
- Further, if a variable level appears very few times, it can be removed as it do not have much influence over target variable. This improves simplicity.

In [152]:
print("Before Outlier and One Hot Encoding: ",
      "\n Train Dataset = ",train.shape,
      "\n Test Dataset = ",test.shape)

# One-Hot Encoding of selective categorical variables
OHE_cols = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 
            'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 
            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 
            'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
            'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 
            'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 
            'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
print("before get dummy",train.shape,test.shape)

train = pd.get_dummies(train, columns = OHE_cols, drop_first = True)
test  = pd.get_dummies(test , columns = OHE_cols, drop_first = True)

print("after get dummy",train.shape,test.shape)


test['Utilities_NoSeWa']=0
test['Condition2_RRAe']=0
test['Condition2_RRAn']=0
test['Condition2_RRNn']=0
test['HouseStyle_2.5Fin']=0
test['RoofMatl_CompShg']=0
test['RoofMatl_Membran']=0
test['RoofMatl_Metal']=0
test['RoofMatl_Roll']=0
test['Exterior1st_CBlock']=0
test['Exterior1st_ImStucc']=0
test['Exterior1st_Stone']=0
test['Exterior2nd_Other']=0
test['Heating_GasA']=0
test['Heating_OthW']=0
test['Electrical_Mix']=0
test['GarageQual_Fa']=0
test['PoolQC_Fa']=0
test['MiscFeature_TenC']=0




print("After Outlier and One Hot Encoding: ",
      "\n Train Dataset = ",train.shape,
      "\n Test Dataset = ",test.shape)

print(train.shape,test.shape)

Before Outlier and One Hot Encoding:  
 Train Dataset =  (1457, 78) 
 Test Dataset =  (1447, 77)
before get dummy (1457, 78) (1447, 77)
after get dummy (1457, 258) (1447, 238)
After Outlier and One Hot Encoding:  
 Train Dataset =  (1457, 258) 
 Test Dataset =  (1447, 257)
(1457, 258) (1447, 257)


#<font color='red'> Step 4: Model Building</font>

In [0]:
y_train = train.SalePrice
X_train = train.drop(['Id','SalePrice'],axis=1)
test    = test.drop(['Id'],axis=1)

from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score,train_test_split

def model_details(model,X,y,model_name):
      print(model_name,":")

      # Single Training Set (All training data is used) ------------------------
      y_pred = model.fit(X,y).predict(X)      
      print("R2 Score (entire training set) =",r2_score(y,y_pred))
      y_pred_temp = y_pred

      # Train-Test Split -------------------------------------------------------
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
      y_pred = model.fit(X_train,y_train).predict(X_test)
      print("R2 Score from Train-Test Split =",r2_score(y_test,y_pred))


      # K-fold cross-validation ------------------------------------------------
      scores = -cross_val_score(model, X,y,cv=KFold(n_splits=5, shuffle=False))
      print("R2 Score from K-Fold Cross-Validation =",scores.mean(),"\n")
      # More shuffling requires more folds (to train more samples)
        
      return y_pred_temp

## Step 4.1: Different Regression Models
Regression Techniques used are as follows:
- K-Nearest Neighbour Regression
- Linear Regression
- Lasso Regression
- Ridge Regression
- Elastic Net Regression
- Bayesian Ridge Regression
- Decision Tree Regression
- Random Forest Regression
- Ada Boost Regression
- Gradient Boost Regression
- XGB Regression
- LGB Regression

In [154]:
# Import Libraries
result = pd.DataFrame()

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, p=2,weights='uniform')
y_pred = model_details(knn,X_train.values,y_train.values,"kNN Regression")
result['knn']=y_pred

from sklearn.linear_model import LinearRegression
linreg = LinearRegression(fit_intercept=True,normalize=False)
y_pred = model_details(linreg,X_train,y_train,"OLS Regression")
result['linreg']=y_pred

from sklearn.linear_model import Ridge
rd = Ridge(alpha=0.5,tol=0.001,normalize=True,random_state=100)
y_pred = model_details(rd,X_train,y_train,"Ridge Regression")
result['ridge']=y_pred

from sklearn.linear_model import Lasso
ls = Lasso(alpha=10,max_iter=500,tol=0.0001,random_state=100)
y_pred = model_details(ls,X_train.values,y_train.values,"Lasso Regression")
result['lasso']=y_pred

from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=0.5,l1_ratio=0.9, tol=0.0001,random_state=100)
y_pred = model_details(enet,X_train.values,y_train.values,"Elastic Net Regression")
result['enet']=y_pred

from sklearn.linear_model import BayesianRidge
br = BayesianRidge(tol=0.0001,n_iter=20)
y_pred = model_details(br,X_train,y_train,"Bayesian-Ridge Regression")
result['bayridge']=y_pred

from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(criterion='mse', splitter='best',max_depth=5,
                                        max_features=0.75, random_state=100)
y_pred = model_details(DT,X_train.values,y_train.values,"Decision Tree Regression")
result['dt']=y_pred

from sklearn.ensemble import RandomForestRegressor
RandForest = RandomForestRegressor(bootstrap=True, criterion='mse',max_depth=5, 
                                   max_features=0.75,min_impurity_split=None,
                                   n_estimators=20,oob_score=False,random_state=100)
y_pred = model_details(RandForest,X_train,y_train,"Random Forest Regression")
result['randforest']=y_pred

from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor(base_estimator=DT,learning_rate=0.5, loss='linear',n_estimators=20, random_state=100)
y_pred = model_details(ada,X_train.values,y_train.values,"Ada-Boost Regression")
result['ada']=y_pred

from sklearn.ensemble import GradientBoostingRegressor
GradBoost = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse',learning_rate=0.5, 
                                      loss='huber', max_depth=3,max_features=0.75, 
                                      n_estimators=20,random_state=100,tol=0.0001)
y_pred = model_details(GradBoost,X_train,y_train,"Gradient-Boosting Regression")
result['gradboost']=y_pred

from xgboost import XGBRegressor
XGB = XGBRegressor(n_estimators=20,importance_type='gain', objective='reg:squarederror',
                   learning_rate=0.5,max_depth=5,random_state=100)
y_pred = model_details(XGB,X_train.values,y_train.values,"XGB Regression")
result['xgb']=y_pred

from lightgbm import LGBMRegressor
LGB = LGBMRegressor(n_estimators=20,learning_rate=0.5,random_state=100)
y_pred = model_details(LGB,X_train.values,y_train.values,"LGB Regression")
result['lgb']=y_pred

kNN Regression :
R2 Score (entire training set) = 0.7888258239489785
R2 Score from Train-Test Split = 0.6492698475464447
R2 Score from K-Fold Cross-Validation = 11.151003224027848 

OLS Regression :
R2 Score (entire training set) = 0.9312341711673864
R2 Score from Train-Test Split = 0.8347003589472259
R2 Score from K-Fold Cross-Validation = 9.458785701080414 

Ridge Regression :
R2 Score (entire training set) = 0.8950113864939081
R2 Score from Train-Test Split = 0.8341026198475264
R2 Score from K-Fold Cross-Validation = 3.7529503695600064 

Lasso Regression :
R2 Score (entire training set) = 0.9241697777762411
R2 Score from Train-Test Split = 0.8572852120143538
R2 Score from K-Fold Cross-Validation = 5.510238014643535 

Elastic Net Regression :
R2 Score (entire training set) = 0.8983934574622836
R2 Score from Train-Test Split = 0.8686715766787486
R2 Score from K-Fold Cross-Validation = 3.5583064078232582 

Bayesian-Ridge Regression :
R2 Score (entire training set) = 0.9132569445332004


## Step 4.2: Result from all Regression models

In [155]:
# Preview of predicted values 
print("Results obtained from Multiple Regression Models:")
result.head()

Results obtained from Multiple Regression Models:


Unnamed: 0,knn,linreg,ridge,lasso,enet,bayridge,dt,randforest,ada,gradboost,xgb,lgb
0,101180.0,38838.145018,65313.776124,49950.916352,60907.553503,59357.682396,93859.521277,79562.911367,82604.671429,75995.632465,41322.410156,47058.451675
1,83242.2,16612.72925,49524.79425,21962.173293,26043.27901,28266.958214,93859.521277,79980.926854,75362.109091,43841.780464,39645.441406,45414.987433
2,81360.0,53654.276324,64202.494774,57452.849693,53625.871498,55007.14254,93859.521277,83634.088912,83014.0,49075.425579,39260.484375,48509.790222
3,77030.0,44825.707286,40987.713924,52606.924897,32040.266221,49940.949372,93859.521277,74139.291997,75159.202703,24238.022214,42325.0,39562.238978
4,119284.4,72118.637044,88843.075879,76174.094431,77386.457007,81597.630792,93859.521277,82568.120036,82604.671429,67498.169026,50687.519531,60684.077124


## Step 4.3: Combine Models to improve result

In [156]:
print("Considering all models:\n")
print("Training Accuracy (taking average) = ",r2_score(y_train,result.mean(axis=1)))

#lr1 = LinearRegression()
lr1 = Lasso(alpha=10,max_iter=500,tol=0.0001,random_state=100)
lr1.fit(result,y_train)  
print("Training Accuracy (doing regression) = ",r2_score(y_train,lr1.predict(result)))


print("Considering selected models:\n")
result = result.drop(['linreg','ridge','lasso'],axis=1) 
print("Training Accuracy (taking average) = ",r2_score(y_train,result.mean(axis=1)))

#lr2 = LinearRegression()
lr2 = Lasso(alpha=10,max_iter=500,tol=0.0001,random_state=100)
lr2.fit(result,y_train)        
print("Training Accuracy (doing regression) = ",r2_score(y_train,lr2.predict(result)))

Considering all models:

Training Accuracy (taking average) =  0.9481617395625325
Training Accuracy (doing regression) =  0.988783050187466
Considering selected models:

Training Accuracy (taking average) =  0.9491892254849217
Training Accuracy (doing regression) =  0.9887590165126873


## Step 4.4: Predict sales price from Test data

In [0]:
# Predicting test dataset using all regression models
test_results = pd.DataFrame()
test_results['knn'] = knn.predict(test)
test_results['linreg'] = linreg.predict(test)
test_results['ridge'] = rd.predict(test)
test_results['lasso'] = ls.predict(test)
test_results['enet'] = enet.predict(test)
test_results['bayridge'] = br.predict(test)
test_results['dt'] = DT.predict(test)
test_results['randforest'] = RandForest.predict(test)
test_results['ada'] = ada.predict(test)
test_results['gradboost'] = GradBoost.predict(test)
test_results['xgb'] = XGB.predict(test.values)
test_results['lgb'] = LGB.predict(test)

## Step 4.5: Compare Combination Approach

In [158]:
# AVERAGE APPROACH & REGRESSION APPROACH -------------------------------------------------------

# Results from all models
print("\nConsidering all features:")
test_pred_mean_all = test_results.mean(axis=1)
test_pred_std_all  = test_results.std(axis=1)
test_pred_reg_all  = lr1.predict(test_results)
test_pred_avg_reg_diff_all = abs(test_pred_mean_all - test_pred_reg_all)
print(" Deviation : ",test_pred_std_all.mean(),"\n",
      "From Regression : ",test_pred_avg_reg_diff_all.mean())
      

# Results from selected models
print("\nConsidering selected features:")
test_results = test_results.drop(['linreg','ridge','lasso'],axis=1) 
test_pred_mean_sel = test_results.mean(axis=1)
test_pred_std_sel  = test_results.std(axis=1)
test_pred_reg_sel  = lr2.predict(test_results)
test_pred_avg_reg_diff_sel = abs(test_pred_mean_sel - test_pred_reg_sel)
print(" Deviation : ",test_pred_std_sel.mean(),"\n",
      "From Regression : ",test_pred_avg_reg_diff_sel.mean())


Considering all features:
 Deviation :  216874.9977683683 
 From Regression :  49556.23539402407

Considering selected features:
 Deviation :  23006.952232639127 
 From Regression :  13932.331692655642


# Final Verdict

Lasso regressor chooses regression results selectively, which improves prediction which can be seen as decrease in standard deviation. Therefore, instead of simply taking the average of multiple regressor model, it is better to perform Lasso regression on resutls obtained. 

- Due to heteroskedascity in data, Regression Model such as Linear Regression do not represent the data properly. This is captured as low prediction performance, reflected as an increase in prediction ability when Linear Regression is removed when combining model results. 
- As Elastic net encapsulates Lasso and Ridge regression, it would be redundant to include prediction from it, and lead to biased result. Thus, along with Linear Regression, Lasso and Ridge Regression is not considered when combining models. 