# Group Name: Suicide Squad

Our team joined a machine learning (House Prices: Advanced Regression Techniques) competition in Kaggle. Participants are competing with each other to find the most accurate model for predicting house prices using the data provided by the website. We created a model which gave us a score of 0.11599 that made us the Champions within our cohort (12th Cohort) and put our group in the top 9% on Kaggle's public leaderboard.
***
# Members:

**Sal Lascano**<br>
**Ansel Santos**<br>
**Moon Kang**<br>
**Yicong Xu**

# Preparing the Packages to be used

Calling the packages we will use for the project, along with the plotly username and api key

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
from scipy.stats import norm, skew
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
import seaborn as sns
import json
import requests
import warnings
from requests.auth import HTTPBasicAuth

np.random.seed(0)

def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

username = 'moonkang' # Replace with YOUR USERNAME
api_key = 'nKkigC26m95bqBAK52af' # Replace with YOUR API KEY

auth = HTTPBasicAuth(username, api_key)
headers = {'Plotly-Client-Platform': 'python'}

plotly.tools.set_credentials_file(username=username, api_key=api_key)

pd.set_option('display.max_columns', None)

# Loading the data

Load the data from .csv and looking at the head of the data and dimensions

In [3]:
## load training data
train = pd.read_csv("train.csv", header = 0, index_col=None)
## load test data
test = pd.read_csv("test.csv", header = 0, index_col=None)

print(train.shape)
print(test.shape)
train.head()

(1460, 81)
(1459, 80)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


# Exploratory Data Analysis

We did an initial analysis of the data by using Python’s Pandas and Plotly. Looking at the graphs of the train data, we saw that there are outliers that needed to be removed. Also, a look at the distribution of the SalePrice variable revealed that it is skewed and required a Log Transformation. We proceeded by dropping the outliers whose SalePrice was below 300,000 and GrLivArea above 4,000. Then the SalePrice was Log Transformed to reduce its skewness.

In [4]:
train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


Dropping the ID column

In [5]:
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

Added the sum of Basement and Livable Area as a column for the scatterplot matrix

In [6]:
train['LivBsmtArea']= train['TotalBsmtSF'] + train['GrLivArea']
test['LivBsmtArea']= test['TotalBsmtSF'] + test['GrLivArea'] #also applied it to the test set

Building the Scatterplot Matrix, just for us to see the relationship of the target variable with variables we think will have a big impact to the price

In [7]:
dataframe = train[['SalePrice', 'GrLivArea', 'TotalBsmtSF', 'LivBsmtArea']].copy()

fig = ff.create_scatterplotmatrix(dataframe, height=1000, width=1000, diag='histogram', 
                                  size=3, title="House Train Variables")
py.iplot(fig)

Remove the outliers and log transform the sale price

In [8]:
#Outlier removal
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Log transform
train["LogSalePrice"] = np.log1p(train["SalePrice"])

Plot the data without the outliers

In [9]:
dataframe = train[['LogSalePrice', 'LivBsmtArea']].copy()

fig = ff.create_scatterplotmatrix(dataframe, height=500, width=500, diag='histogram', 
                                  size=3, title="LogSalePrice vs LivBsmtArea")
py.iplot(fig)

In [10]:
SalePrice = go.Histogram(x=train['SalePrice'], cumulative=dict(enabled=True))
LogSalePrice = go.Histogram(x=train['LogSalePrice'], cumulative=dict(enabled=True))
fig = tools.make_subplots(rows=1, cols=2)

fig.append_trace(SalePrice, 1, 1)
fig.append_trace(LogSalePrice, 1, 2)

fig['layout'].update(height=350, width=750)

py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



# Correlation Matrix Heatmap

The correlation heat map was useful in giving us an overview of which numerical features or variables are important, and which variables are highly correlated with each other and can be combined.  The variables which have a more yellow color are the ones which have a higher correlation to the target variable while the variables which are more green are less (negatively) correlated.

In [11]:
corrmat = train.corr()
y = list(corrmat.columns.values)
x = list(corrmat.index.values)
z = [list(corrmat[x]) for x in corrmat]

In [12]:
trace = go.Heatmap(z=z, x=x, y=y, colorscale='Viridis')
data=[trace]
fig['layout'].update(height=900, width=900)
py.iplot(data)

# Data Cleaning process

Combining Train and Test Data sets. Ideally, we should apply the same process on a separate place to avoid data leakage, but for this project, we combined them and did the cleaning process to make our jupyter notebook shorter.

In [13]:
ntrain = train.shape[0]
ntest = test.shape[0]

saleprice_train = train.SalePrice.values
priceperft_train = train.SalePrice.values/train.LotArea.values
logsaleprice_train = train.LogSalePrice.values
lotarea_train = train.LotArea.values
lotarea_test = test.LotArea.values

all_df = pd.concat((train, test)).reset_index(drop=True)
all_df.drop(['SalePrice'], axis=1, inplace=True)
all_df.drop(['LogSalePrice'], axis=1, inplace=True)

print("Dimensions - {}".format(all_df.shape))

Dimensions - (2917, 80)


Counting the missingness on each column

In [14]:
all_df_na = (all_df.isnull().sum() / len(all_df)) * 100
all_df_na = all_df_na.drop(all_df_na[all_df_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_df_na})
missing_data.head(10)

data = [go.Bar(x=all_df_na.index, y=all_df_na, marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5
        )))]

layout = go.Layout(
    xaxis=dict(tickangle=-90), title="% of Missingness"
)

fig = go.Figure(data=data, layout=layout)
fig['layout'].update(height=400, width=800)

py.iplot(fig)

# Filling the Missingness (Part 1)

We counted missingness from each column to see which feature had the most missingness. We handled missingness in two ways. The first way was by using the description.txt provided by Kaggle. The description.txt contains information on what empty data points meant for some of the columns, which helped us on the imputation.

In [17]:
#Handling features guided by data description
all_df.loc[:, "Alley"] = all_df.loc[:, "Alley"].fillna("None")
all_df.loc[:, "BsmtQual"] = all_df.loc[:, "BsmtQual"].fillna("NoBasmt")
all_df.loc[:, "BsmtCond"] = all_df.loc[:, "BsmtCond"].fillna("NoBasmt")
all_df.loc[:, "BsmtExposure"] = all_df.loc[:, "BsmtExposure"].fillna("NoBasmt")
all_df.loc[:, "BsmtFinType1"] = all_df.loc[:, "BsmtFinType1"].fillna("NoBasmt")
all_df.loc[:, "BsmtFinType2"] = all_df.loc[:, "BsmtFinType2"].fillna("NoBasmt")
all_df.loc[:, "BsmtFullBath"] = all_df.loc[:, "BsmtFullBath"].fillna(0)
all_df.loc[:, "BsmtHalfBath"] = all_df.loc[:, "BsmtHalfBath"].fillna(0)
all_df.loc[:, "BsmtUnfSF"] = all_df.loc[:, "BsmtUnfSF"].fillna(0)
all_df.loc[:, "Fence"] = all_df.loc[:, "Fence"].fillna("NoFnc")
all_df.loc[:, "FireplaceQu"] = all_df.loc[:, "FireplaceQu"].fillna("NoFrplc")
all_df.loc[:, "Fireplaces"] = all_df.loc[:, "Fireplaces"].fillna(0)
all_df.loc[:, "Functional"] = all_df.loc[:, "Functional"].fillna("Typ")
all_df.loc[:, "GarageType"] = all_df.loc[:, "GarageType"].fillna("NoGrg")
all_df.loc[:, "GarageFinish"] = all_df.loc[:, "GarageFinish"].fillna("NoGrg")
all_df.loc[:, "GarageQual"] = all_df.loc[:, "GarageQual"].fillna("NoGrg")
all_df.loc[:, "GarageCond"] = all_df.loc[:, "GarageCond"].fillna("NoGrg")
all_df.loc[:, "GarageArea"] = all_df.loc[:, "GarageArea"].fillna(0)
all_df.loc[:, "GarageCars"] = all_df.loc[:, "GarageCars"].fillna(0)
all_df.loc[:, "MiscFeature"] = all_df.loc[:, "MiscFeature"].fillna("NoMscFtr")
all_df.loc[:, "MiscVal"] = all_df.loc[:, "MiscVal"].fillna(0)
all_df.loc[:, "PoolQC"] = all_df.loc[:, "PoolQC"].fillna("NoPool")
all_df.loc[:, "PoolArea"] = all_df.loc[:, "PoolArea"].fillna(0)
all_df.loc[:, "GarageYrBlt"] = all_df.loc[:, "GarageYrBlt"].fillna(all_df['YearBuilt'])
all_df.loc[:, "BsmtFinSF1"] = all_df.loc[:, "BsmtFinSF1"].fillna(0)
all_df.loc[:, "BsmtFinSF2"] = all_df.loc[:, "BsmtFinSF2"].fillna(0)
all_df.loc[:, "TotalBsmtSF"] = all_df.loc[:, "TotalBsmtSF"].fillna(0)
all_df.loc[:, "LivBsmtArea"] = all_df.loc[:, "LivBsmtArea"].fillna(all_df['TotalBsmtSF'] + all_df['GrLivArea'])


# Filling the Missingness (Part 2)

The second way we addressed missingness was by determining what kind of missingness occurred and then deciding how to impute. There are three kinds of missingness, namely, missing at random, missing not at random, and missing completely at random. Based on this classification, we decided on the imputation method to use.

In [18]:
# By missing at random, decided to fill NA with 0 for BedroomAbvGr as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "BedroomAbvGr"] = all_df.loc[:, "BedroomAbvGr"].fillna(0)

# By missing at random, decided to fill NA with N for CentralAir as missingness can be 
# interpreted to be directly related to lack of Central Air
all_df.loc[:, "CentralAir"] = all_df.loc[:, "CentralAir"].fillna("N")

# By missing at random, decided to fill NA with Norm for Condition1 & Condition2 as missingness 
# can be interpreted to be directly related to lack of proximity to any conditions
all_df.loc[:, "Condition1"] = all_df.loc[:, "Condition1"].fillna("Norm")
all_df.loc[:, "Condition2"] = all_df.loc[:, "Condition2"].fillna("Norm")

# By missing at random, decided to fill NA with 0 for EnclosedPorch as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "EnclosedPorch"] = all_df.loc[:, "EnclosedPorch"].fillna(0)

# By missing completely at random, decided to fill NA with TA (averaging) for ExterCond & ExterQual
# since we assume it will have minumum impact
all_df.loc[:, "ExterCond"] = all_df.loc[:, "ExterCond"].fillna("TA")
all_df.loc[:, "ExterQual"] = all_df.loc[:, "ExterQual"].fillna("TA")

# By missing at random, decided to fill NA with 0 for HalfBath as missingness can be 
# interpreted to be directly related to lack of value
all_df.loc[:, "HalfBath"] = all_df.loc[:, "HalfBath"].fillna(0)

# By missing completely at random, decided to fill NA with TA (averaging) for HeatingQC 
# since we assume it will have minumum impact
all_df.loc[:, "HeatingQC"] = all_df.loc[:, "HeatingQC"].fillna("TA")

# By missing at random, decided to fill NA with 0 for KitchenAbvGr as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "KitchenAbvGr"] = all_df.loc[:, "KitchenAbvGr"].fillna(0)

# By missing completely at random, decided to fill NA with TA (averaging) for KitchenQual 
# since we assume it will have minumum impact
all_df.loc[:, "KitchenQual"] = all_df.loc[:, "KitchenQual"].fillna("TA")

# *****LotFrontage : NA most likely means no lot frontage - missing at random*****
all_df.loc[:, "LotFrontage"] = all_df.loc[:, "LotFrontage"].fillna(0)

# By missing completely at random, decided to fill NA with Reg (averaging) for LotShape 
# since we assume it will have minumum impact
all_df.loc[:, "LotShape"] = all_df.loc[:, "LotShape"].fillna("Reg")

# By missing at random, decided to fill NA with None and 0 for MasVnrType & MasVnrArea 
# as missingness can be interpreted to be directly 
# related to lack of vaue
all_df.loc[:, "MasVnrType"] = all_df.loc[:, "MasVnrType"].fillna("None")
all_df.loc[:, "MasVnrArea"] = all_df.loc[:, "MasVnrArea"].fillna(0)

# By missing at random, decided to fill NA with 0 for OpenPorchSF as missingness can be  
# interpreted to be directly related to lack of vaue
all_df.loc[:, "OpenPorchSF"] = all_df.loc[:, "OpenPorchSF"].fillna(0)

# By missing at random, decided to fill NA with N for PavedDrive as missingness can be  
# interpreted to be directly related to No Paved Drive
all_df.loc[:, "PavedDrive"] = all_df.loc[:, "PavedDrive"].fillna("N")

# By missing completely at random, decided to fill NA with Normal (averaging) for SaleCondition 
# since we assume it will have minumum impact
all_df.loc[:, "SaleCondition"] = all_df.loc[:, "SaleCondition"].fillna("Normal")

# By missing at random, decided to fill NA with 0 for ScreenPorch as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "ScreenPorch"] = all_df.loc[:, "ScreenPorch"].fillna(0)

# By missing at random, decided to fill NA with 0 for TotRmsAbvGrd as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "TotRmsAbvGrd"] = all_df.loc[:, "TotRmsAbvGrd"].fillna(0)

# By missing at random, decided to fill NA with AllPub for Utilities as missingness can be 
# interpreted to be directly related to All public Utilities 
all_df.loc[:, "Utilities"] = all_df.loc[:, "Utilities"].fillna("AllPub")

# By missing at random, decided to fill NA with 0 for WoodDeckSF as missingness can be  
# interpreted to be directly related to lack of value
all_df.loc[:, "WoodDeckSF"] = all_df.loc[:, "WoodDeckSF"].fillna(0)


# Transformations

This section is divided into two parts, specifically, numeric variables that needed to be transformed into categories and categorical variables that needed to be transformed into numeric values.

# Part 1

The MSSubclass, MoSold, and YrSold features were numeric, though once they are analyzed they should be categorical. Assuming a linear model is to be used, a house with a subclass of 180 is not nine times more valuable than a house of class 20. Therefore, this variable should be categorical. The same concept can be applied to the MoSold and YrSold features because housing market prices do not go in only one direction.

In [19]:
# There are variables with numerical vaues that after investigating in the Description we 
# find that are actually best described as categories
# Changing the month sold and Year sold variables into categorical is a great example of 
# using dummy variables to adjust for seasonality

# adding the MS SubClass2
#all_df['MSSubClass2'] = [tmp1.get(key) for key in all_df.MSSubClass.values]

all_df = all_df.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 
                                       50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 
                                       80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 
                                       150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"},
                       "MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun",
                                   7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}
                      })

all_df['YrSold'] = ["Year" + str(x) for x in all_df['YrSold']]

#all_df['GarageCars'] = [str(x) + " Cars" for x in all_df['GarageCars']]


# (Part 2)

The team manually did label encoding by looking for categorical features that can be simplified by converting into integers. An example is the basement condition variable whose categories were transformed as follows: No basement into 0, poor into 1, fair into 2, typical into 3, good into 4, and excellent into 5.

In [20]:
# Encoding categorical features as ordered numbers after gathering 
# inside of order from description

all_df = all_df.replace({"Alley" : {"None": 0, "Grvl" : 1, "Pave" : 2},
                       "BsmtCond" : {"NoBasmt" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "BsmtExposure" : {"NoBasmt" : 0, "Mn" : 1, "Av": 2, "Gd" : 3},
                       "BsmtFinType1" : {"NoBasmt" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6},
                       "BsmtFinType2" : {"NoBasmt" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, 
                                         "ALQ" : 5, "GLQ" : 6},
                       "BsmtQual" : {"NoBasmt" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5},
                       "ExterCond" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
                       "ExterQual" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
                       "FireplaceQu" : {"NoFrplc" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "Functional" : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod": 5, 
                                       "Min2" : 6, "Min1" : 7, "Typ" : 8},
                       "GarageCond" : {"NoGrg" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "GarageQual" : {"NoGrg" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "HeatingQC" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "KitchenQual" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                       "LandSlope" : {"Sev" : 1, "Mod" : 2, "Gtl" : 3},
                       "LotShape" : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4},
                       "PavedDrive" : {"N" : 0, "P" : 1, "Y" : 2},
                       "PoolQC" : {"NoPool" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4},
                       "Street" : {"Grvl" : 1, "Pave" : 2},
                       "Utilities" : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4}}
                     )

# Adding Simplified version of existing columns

With most of the data set cleaned and transformed we noticed that you can combine variables. This was the case for overall quality and condition. Since these features are similar to each other, multiplying them together will allow our models to interpret them as one, which can increase the accuracy of our predictions.

In [21]:
# Create new features
# Simplifications of existing features
all_df["SimplOverallQual"] = all_df.OverallQual.replace({1 : 1, 2 : 1, 3 : 1, # bad
                                                       4 : 2, 5 : 2, 6 : 2, # average
                                                       7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                      })
all_df["SimplOverallCond"] = all_df.OverallCond.replace({1 : 1, 2 : 1, 3 : 1, # bad
                                                       4 : 2, 5 : 2, 6 : 2, # average
                                                       7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                      })
all_df["SimplPoolQC"] = all_df.PoolQC.replace({1 : 1, 2 : 1, # average
                                             3 : 2, 4 : 2 # good
                                            })
all_df["SimplGarageCond"] = all_df.GarageCond.replace({1 : 1, # bad
                                                     2 : 1, 3 : 1, # average
                                                     4 : 2, 5 : 2 # good
                                                    })
all_df["SimplGarageQual"] = all_df.GarageQual.replace({1 : 1, # bad
                                                     2 : 1, 3 : 1, # average
                                                     4 : 2, 5 : 2 # good
                                                    })
all_df["SimplFireplaceQu"] = all_df.FireplaceQu.replace({1 : 1, # bad
                                                       2 : 1, 3 : 1, # average
                                                       4 : 2, 5 : 2 # good
                                                      })
all_df["SimplFireplaceQu"] = all_df.FireplaceQu.replace({1 : 1, # bad
                                                       2 : 1, 3 : 1, # average
                                                       4 : 2, 5 : 2 # good
                                                      })
all_df["SimplFunctional"] = all_df.Functional.replace({1 : 1, 2 : 1, # bad
                                                     3 : 2, 4 : 2, # major
                                                     5 : 3, 6 : 3, 7 : 3, # minor
                                                     8 : 4 # typical
                                                    })
all_df["SimplKitchenQual"] = all_df.KitchenQual.replace({1 : 1, # bad
                                                       2 : 1, 3 : 1, # average
                                                       4 : 2, 5 : 2 # good
                                                      })
all_df["SimplHeatingQC"] = all_df.HeatingQC.replace({1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  })
all_df["SimplBsmtFinType1"] = all_df.BsmtFinType1.replace({1 : 1, # unfinished
                                                         2 : 1, 3 : 1, # rec room
                                                         4 : 2, 5 : 2, 6 : 2 # living quarters
                                                        })
all_df["SimplBsmtFinType2"] = all_df.BsmtFinType2.replace({1 : 1, # unfinished
                                                         2 : 1, 3 : 1, # rec room
                                                         4 : 2, 5 : 2, 6 : 2 # living quarters
                                                        })
all_df["SimplBsmtCond"] = all_df.BsmtCond.replace({1 : 1, # bad
                                                 2 : 1, 3 : 1, # average
                                                 4 : 2, 5 : 2 # good
                                                })
all_df["SimplBsmtQual"] = all_df.BsmtQual.replace({1 : 1, # bad
                                                 2 : 1, 3 : 1, # average
                                                 4 : 2, 5 : 2 # good
                                                })
all_df["SimplExterCond"] = all_df.ExterCond.replace({1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  })
all_df["SimplExterQual"] = all_df.ExterQual.replace({1 : 1, # bad
                                                   2 : 1, 3 : 1, # average
                                                   4 : 2, 5 : 2 # good
                                                  })


# Combination of variables (additional Columns)

In [22]:
# 2* Combinations of existing features
# Overall quality of the house
all_df["OverallGrade"] = all_df["OverallQual"] * all_df["OverallCond"]
# Overall quality of the garage
all_df["GarageGrade"] = all_df["GarageQual"] * all_df["GarageCond"]
# Overall quality of the exterior
all_df["ExterGrade"] = all_df["ExterQual"] * all_df["ExterCond"]
# Overall kitchen score
all_df["KitchenScore"] = all_df["KitchenAbvGr"] * all_df["KitchenQual"]
# Overall fireplace score
all_df["FireplaceScore"] = all_df["Fireplaces"] * all_df["FireplaceQu"]
# Overall garage score
all_df["GarageScore"] = all_df["GarageArea"] * all_df["GarageQual"]
# Overall pool score
all_df["PoolScore"] = all_df["PoolArea"] * all_df["PoolQC"]
# Simplified overall quality of the house
all_df["SimplOverallGrade"] = all_df["SimplOverallQual"] * all_df["SimplOverallCond"]
# Simplified overall quality of the exterior
all_df["SimplExterGrade"] = all_df["SimplExterQual"] * all_df["SimplExterCond"]
# Simplified overall pool score
all_df["SimplPoolScore"] = all_df["PoolArea"] * all_df["SimplPoolQC"]
# Simplified overall garage score
all_df["SimplGarageScore"] = all_df["GarageArea"] * all_df["SimplGarageQual"]
# Simplified overall fireplace score
all_df["SimplFireplaceScore"] = all_df["Fireplaces"] * all_df["SimplFireplaceQu"]
# Simplified overall kitchen score
all_df["SimplKitchenScore"] = all_df["KitchenAbvGr"] * all_df["SimplKitchenQual"]
# Total number of bathrooms
all_df["TotalBath"] = all_df["BsmtFullBath"] + (0.5 * all_df["BsmtHalfBath"]) + \
all_df["FullBath"] + (0.5 * all_df["HalfBath"])

# Total SF for 1st + 2nd floors
all_df["AllFlrsSF"] = all_df["1stFlrSF"] + all_df["2ndFlrSF"]
# Total SF for porch
all_df["AllPorchSF"] = all_df["OpenPorchSF"] + all_df["EnclosedPorch"] + \
all_df["3SsnPorch"] + all_df["ScreenPorch"]
# Has masonry veneer or not
all_df["HasMasVnr"] = all_df.MasVnrType.replace({"BrkCmn" : 1, "BrkFace" : 1, "CBlock" : 1, 
                                               "Stone" : 1, "None" : 0})
# House completed before sale or not
all_df["BoughtOffPlan"] = all_df.SaleCondition.replace({"Abnorml" : 0, "Alloca" : 0, "AdjLand" : 0, 
                                                      "Family" : 0, "Normal" : 0, "Partial" : 1})


# Plots checking the relationship with SalesPrice

In [23]:
plot_data = all_df[:ntrain]

In [24]:
trace1 = go.Scatter(x=list(plot_data["WoodDeckSF"]), y=list(logsaleprice_train), 
                               mode='markers', marker = dict( size = 2))
trace2 = go.Scatter(x=list(plot_data["MasVnrArea"]), y=list(logsaleprice_train), 
                          mode='markers', marker = dict( size = 2))
trace3 = go.Scatter(x=list(plot_data["GarageScore"]), y=list(logsaleprice_train), 
                         mode='markers', marker = dict( size = 2))
trace4 = go.Scatter(x=list(plot_data["PoolScore"]), y=list(logsaleprice_train), 
                       mode='markers', marker = dict( size = 2))
trace5 = go.Scatter(x=list(plot_data["LotArea"]), y=list(logsaleprice_train), 
                       mode='markers', marker = dict( size = 2))
trace6 = go.Scatter(x=list(plot_data["LotFrontage"]), y=list(logsaleprice_train), 
                       mode='markers', marker = dict( size = 2))

fig = tools.make_subplots(rows=3, cols=2, subplot_titles=('WoodDeckSF vs. LogSalePrice', 
                                                          'MasVnrArea vs. LogSalePrice',
                                                          'GarageScore vs. LogSalePrice', 
                                                          'PoolScore vs. LogSalePrice',
                                                          'LotArea vs. LogSalePrice',
                                                          'LotFrontage vs. LogSalePrice'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 2, 2)
fig.append_trace(trace5, 3, 1)
fig.append_trace(trace6, 3, 2)

fig['layout'].update(showlegend=False, height=1200, width=1000)

py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]



In [25]:
def boxplot_col(col_name, height, width):
    x_data = [x for x in plot_data[col_name].unique()]
    x_data.sort()
    y_data = [[x for x,y in zip(logsaleprice_train, plot_data[col_name]) if y==i] for i in x_data]

    traces = []

    for xd, yd in zip(x_data, y_data):
            traces.append(go.Box(
                y=yd,
                name=xd,
                boxpoints='all',
                marker=dict(
                    size=1,
                ),
                line=dict(width=1),
            ))

    fig = go.Figure(data=traces)
    fig['layout'].update(showlegend=False, height=height, width=width, title=col_name + " vs. LogSalePrice")
    return fig

def distplot_col(col_name, height, width):
    x_data = [x for x in plot_data[col_name].unique()]
    x_data.sort()
    y_data = [[x for x,y in zip(logsaleprice_train, plot_data[col_name]) if y==i] for i in x_data]


    # Group data together
    hist_data = y_data
    group_labels = x_data
    colors = ['#835AF1', '#7FA6EE', '#B8F7D4', '#393E46', '#2BCDC1', '#F66095']

    fig = ff.create_distplot(hist_data, group_labels, bin_size=0.02, show_curve=True, colors=colors)
    fig['layout'].update(showlegend=False, height=height, width=width, title="LogSalePrice by " + col_name)

    # Plot!
    return fig

In [26]:
py.iplot(boxplot_col('SimplOverallQual', 400, 400))

In [27]:
py.iplot(boxplot_col('GarageCars', 400, 400))

In [28]:
py.iplot(boxplot_col('OverallQual', 400, 800))

In [29]:
py.iplot(boxplot_col('MSSubClass', 400, 1000))

In [31]:
py.iplot(boxplot_col('Neighborhood', 400, 1000))

In [32]:
py.iplot(distplot_col('Neighborhood', 700, 700))

# Box Cox Transformation for skewed features

After doing all of the transformations, the numeric variables whose distribution have high skewness were transformed using a box-cox transformation, while categorical variables that were not label encoded were dummified.

In [33]:
numeric_feats = all_df.dtypes[all_df.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_df[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

Unnamed: 0,Skew
PoolScore,22.357445
SimplPoolScore,22.332942
MiscVal,21.939672
PoolQC,20.341424
SimplPoolQC,18.985317
PoolArea,17.688664
LotArea,13.109495
LowQualFinSF,12.084539
3SsnPorch,11.37208
KitchenAbvGr,4.30055


In [34]:
skewness = skewness[abs(skewness) > 0.5]
print("{} features to Box Cox Transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
from scipy.stats import boxcox
skewed_features = skewness.index
lam = 0.12
for feat in skewed_features:
    all_df[feat] = boxcox1p(all_df[feat], lam)

86 features to Box Cox Transform


# Dummifying

In [35]:
all_df = pd.get_dummies(all_df, drop_first=False)
print(all_df.shape)

(2917, 296)


# Re-assigning to Test and Train

In [37]:
print (ntrain)

1458


In [38]:
train = all_df[:ntrain]
test = all_df[ntrain:]

train.to_csv('train_cleaned.csv')
test.to_csv('test_cleaned.csv')

In [39]:
print (train.shape)
train.head()

(1458, 296)


Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BsmtCond,BsmtFinSF1,BsmtFinSF2,BsmtFinType1,BsmtFinType2,BsmtFullBath,BsmtHalfBath,BsmtQual,BsmtUnfSF,EnclosedPorch,ExterCond,ExterQual,FireplaceQu,Fireplaces,FullBath,Functional,GarageArea,GarageCars,GarageCond,GarageQual,GarageYrBlt,GrLivArea,HalfBath,HeatingQC,KitchenAbvGr,KitchenQual,LandSlope,LivBsmtArea,LotArea,LotFrontage,LotShape,LowQualFinSF,MasVnrArea,MiscVal,OpenPorchSF,OverallCond,OverallQual,PavedDrive,PoolArea,PoolQC,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,SimplOverallQual,SimplOverallCond,SimplPoolQC,SimplGarageCond,SimplGarageQual,SimplFireplaceQu,SimplFunctional,SimplKitchenQual,SimplHeatingQC,SimplBsmtFinType1,SimplBsmtFinType2,SimplBsmtCond,SimplBsmtQual,SimplExterCond,SimplExterQual,OverallGrade,GarageGrade,ExterGrade,KitchenScore,FireplaceScore,GarageScore,PoolScore,SimplOverallGrade,SimplExterGrade,SimplPoolScore,SimplGarageScore,SimplFireplaceScore,SimplKitchenScore,TotalBath,AllFlrsSF,AllPorchSF,HasMasVnr,BoughtOffPlan,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,BsmtExposure_0,BsmtExposure_1,BsmtExposure_2,BsmtExposure_3,BsmtExposure_No,CentralAir_N,CentralAir_Y,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_NoFnc,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,GarageFinish_Fin,GarageFinish_NoGrg,GarageFinish_RFn,GarageFinish_Unf,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_NoGrg,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,MSSubClass_SC120,MSSubClass_SC150,MSSubClass_SC160,MSSubClass_SC180,MSSubClass_SC190,MSSubClass_SC20,MSSubClass_SC30,MSSubClass_SC40,MSSubClass_SC45,MSSubClass_SC50,MSSubClass_SC60,MSSubClass_SC70,MSSubClass_SC75,MSSubClass_SC80,MSSubClass_SC85,MSSubClass_SC90,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,MiscFeature_Gar2,MiscFeature_NoMscFtr,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,MoSold_Apr,MoSold_Aug,MoSold_Dec,MoSold_Feb,MoSold_Jan,MoSold_Jul,MoSold_Jun,MoSold_Mar,MoSold_May,MoSold_Nov,MoSold_Oct,MoSold_Sep,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,YrSold_Year2006,YrSold_Year2007,YrSold_Year2008,YrSold_Year2009,YrSold_Year2010
0,10.406963,10.401709,0.0,0.0,1.508272,1.508272,9.979228,0.0,2.191871,0.722791,0.722791,0.0,1.775363,6.882509,0.0,1.508272,1.775363,0.0,0.0,1.174319,2.514122,9.431758,1.174319,1.508272,1.508272,12.418023,12.028119,0.722791,1.998964,0.722791,1.775363,1.508272,13.043819,16.32977,5.443964,1.775363,0.0,7.37589,0.0,5.340987,1.998964,2.361882,1.174319,0.0,0.0,0.0,1.174319,2.514122,10.406963,1.775363,0.0,12.418023,12.418023,1.508272,1.174319,0.0,0.722791,0.722791,0.0,1.775363,1.174319,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,1.174319,4.477431,2.652139,3.003505,1.775363,0.0,11.932229,0.0,2.191871,1.174319,0.0,9.431758,0.0,1.174319,1.648361,12.028119,5.340987,0.722791,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
1,11.299689,0.0,0.0,0.0,1.508272,1.508272,10.708672,0.0,1.998964,0.722791,0.0,0.722791,1.775363,8.087687,0.0,1.508272,1.508272,1.508272,0.722791,1.174319,2.514122,9.063206,1.174319,1.508272,1.508272,12.384272,11.299689,0.0,1.998964,0.722791,1.508272,1.508272,13.001543,16.710265,5.78674,1.775363,0.0,0.0,0.0,0.0,2.514122,2.191871,1.174319,0.0,0.0,0.0,1.174319,2.191871,11.299689,1.775363,8.182455,12.384272,12.384272,1.174319,1.508272,0.0,0.722791,0.722791,0.722791,1.775363,0.722791,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,0.722791,4.960257,2.652139,2.652139,1.508272,1.508272,11.51125,0.0,2.191871,0.722791,0.0,9.063206,0.722791,0.722791,1.351829,11.299689,0.0,0.0,0.0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
2,10.569631,10.43307,0.0,0.0,1.508272,1.508272,9.178122,0.0,2.191871,0.722791,0.722791,0.0,1.775363,8.942439,0.0,1.508272,1.775363,1.508272,0.722791,1.174319,2.514122,9.65425,1.174319,1.508272,1.508272,12.415536,12.134586,0.722791,1.998964,0.722791,1.775363,1.508272,13.180477,17.191428,5.517651,1.508272,0.0,7.022779,0.0,4.753512,1.998964,2.361882,1.174319,0.0,0.0,0.0,1.174319,2.191871,10.569631,1.775363,0.0,12.415536,12.41678,1.508272,1.174319,0.0,0.722791,0.722791,0.722791,1.775363,1.174319,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,1.174319,4.477431,2.652139,3.003505,1.775363,1.508272,12.186333,0.0,2.191871,1.174319,0.0,9.65425,0.722791,1.174319,1.648361,12.134586,4.753512,0.722791,0.0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
3,10.668686,10.130006,0.0,0.0,1.508272,1.775363,7.559229,0.0,1.998964,0.722791,0.722791,0.0,1.508272,9.400492,8.003139,1.508272,1.508272,1.775363,0.722791,0.722791,2.514122,9.771898,1.508272,1.508272,1.508272,12.411803,12.038097,0.0,1.775363,0.722791,1.775363,1.508272,12.949366,16.694578,5.314331,1.508272,0.0,0.0,0.0,4.477431,1.998964,2.361882,1.174319,0.0,0.0,0.0,1.174319,2.361882,10.130006,1.775363,0.0,12.306501,12.376717,1.508272,1.174319,0.0,0.722791,0.722791,1.174319,1.775363,1.174319,1.174319,1.174319,0.722791,1.174319,0.722791,0.722791,0.722791,4.477431,2.652139,2.652139,1.775363,1.775363,12.320685,0.0,2.191871,0.722791,0.0,9.771898,1.174319,1.174319,1.174319,12.038097,8.241335,0.0,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
4,11.071991,10.878094,0.0,0.0,1.775363,1.508272,9.815437,0.0,2.191871,0.722791,0.722791,0.0,1.775363,9.195319,0.0,1.508272,1.775363,1.508272,0.722791,1.174319,2.514122,10.353934,1.508272,1.508272,1.508272,12.414292,12.650546,0.722791,1.998964,0.722791,1.775363,1.508272,13.733027,17.927999,5.868651,1.508272,0.0,8.503314,0.0,5.868651,1.998964,2.514122,1.174319,0.0,0.0,0.0,1.174319,2.652139,11.071991,1.775363,7.337267,12.414292,12.414292,1.508272,1.174319,0.0,0.722791,0.722791,0.722791,1.775363,1.174319,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,1.174319,4.678929,2.652139,3.003505,1.775363,1.508272,12.985274,0.0,2.191871,1.174319,0.0,10.353934,0.722791,1.174319,1.648361,12.650546,5.868651,0.722791,0.0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0


In [40]:
print (test.shape)
test.head()

(1459, 296)


Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BsmtCond,BsmtFinSF1,BsmtFinSF2,BsmtFinType1,BsmtFinType2,BsmtFullBath,BsmtHalfBath,BsmtQual,BsmtUnfSF,EnclosedPorch,ExterCond,ExterQual,FireplaceQu,Fireplaces,FullBath,Functional,GarageArea,GarageCars,GarageCond,GarageQual,GarageYrBlt,GrLivArea,HalfBath,HeatingQC,KitchenAbvGr,KitchenQual,LandSlope,LivBsmtArea,LotArea,LotFrontage,LotShape,LowQualFinSF,MasVnrArea,MiscVal,OpenPorchSF,OverallCond,OverallQual,PavedDrive,PoolArea,PoolQC,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,SimplOverallQual,SimplOverallCond,SimplPoolQC,SimplGarageCond,SimplGarageQual,SimplFireplaceQu,SimplFunctional,SimplKitchenQual,SimplHeatingQC,SimplBsmtFinType1,SimplBsmtFinType2,SimplBsmtCond,SimplBsmtQual,SimplExterCond,SimplExterQual,OverallGrade,GarageGrade,ExterGrade,KitchenScore,FireplaceScore,GarageScore,PoolScore,SimplOverallGrade,SimplExterGrade,SimplPoolScore,SimplGarageScore,SimplFireplaceScore,SimplKitchenScore,TotalBath,AllFlrsSF,AllPorchSF,HasMasVnr,BoughtOffPlan,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,BsmtExposure_0,BsmtExposure_1,BsmtExposure_2,BsmtExposure_3,BsmtExposure_No,CentralAir_N,CentralAir_Y,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_NoFnc,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,GarageFinish_Fin,GarageFinish_NoGrg,GarageFinish_RFn,GarageFinish_Unf,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_NoGrg,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,MSSubClass_SC120,MSSubClass_SC150,MSSubClass_SC160,MSSubClass_SC180,MSSubClass_SC190,MSSubClass_SC20,MSSubClass_SC30,MSSubClass_SC40,MSSubClass_SC45,MSSubClass_SC50,MSSubClass_SC60,MSSubClass_SC70,MSSubClass_SC75,MSSubClass_SC80,MSSubClass_SC85,MSSubClass_SC90,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,MiscFeature_Gar2,MiscFeature_NoMscFtr,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,MoSold_Apr,MoSold_Aug,MoSold_Dec,MoSold_Feb,MoSold_Jan,MoSold_Jul,MoSold_Jun,MoSold_Mar,MoSold_May,MoSold_Nov,MoSold_Oct,MoSold_Sep,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,YrSold_Year2006,YrSold_Year2007,YrSold_Year2008,YrSold_Year2009,YrSold_Year2010
1458,10.509831,0.0,0.0,0.0,1.174319,1.508272,9.099159,6.808655,1.508272,1.174319,0.0,0.0,1.508272,7.98873,0.0,1.508272,1.508272,0.0,0.0,0.722791,2.514122,10.052734,0.722791,1.508272,1.508272,12.365346,10.509831,0.0,1.508272,0.722791,1.508272,1.508272,12.123569,17.291258,5.78674,1.775363,0.0,0.0,0.0,0.0,2.191871,1.998964,1.174319,0.0,0.0,6.483418,1.174319,1.998964,10.474295,1.775363,6.757911,12.365346,12.365346,1.174319,1.174319,0.0,0.722791,0.722791,0.0,1.775363,0.722791,0.722791,0.722791,0.722791,0.722791,0.722791,0.722791,0.722791,4.249608,2.652139,2.652139,1.508272,0.0,12.641371,0.0,1.775363,0.722791,0.0,10.052734,0.0,0.722791,0.722791,10.509831,6.483418,0.0,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1459,11.421845,0.0,0.0,0.0,1.508272,1.508272,10.577009,0.0,1.998964,0.722791,0.0,0.0,1.508272,8.805059,0.0,1.508272,1.508272,0.0,0.0,0.722791,2.514122,8.273395,0.722791,1.508272,1.508272,12.361545,11.421845,0.722791,1.508272,0.722791,1.775363,1.508272,13.134339,17.929545,5.807546,1.508272,0.0,6.298877,17.516165,4.519621,2.191871,2.191871,1.174319,0.0,0.0,0.0,1.174319,2.191871,11.421845,1.775363,8.738427,12.361545,12.361545,1.174319,1.174319,0.0,0.722791,0.722791,0.0,1.775363,1.174319,0.722791,1.174319,0.722791,0.722791,0.722791,0.722791,0.722791,4.519621,2.652139,2.652139,1.775363,0.0,10.60874,0.0,1.775363,0.722791,0.0,8.273395,0.0,1.174319,0.968564,11.421845,4.519621,0.722791,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1460,10.589259,9.963638,0.0,0.0,1.508272,1.508272,10.230419,0.0,2.191871,0.722791,0.0,0.0,1.775363,6.719014,0.0,1.508272,1.508272,1.508272,0.722791,1.174319,2.514122,9.160799,1.174319,1.508272,1.508272,12.410557,11.909965,0.722791,1.775363,0.722791,1.508272,1.508272,13.034812,17.831693,5.656937,1.508272,0.0,0.0,0.0,4.434198,1.998964,1.998964,1.174319,0.0,0.0,0.0,1.174319,2.191871,10.589259,1.775363,7.523786,12.410557,12.411803,1.174319,1.174319,0.0,0.722791,0.722791,0.722791,1.775363,0.722791,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,0.722791,3.986804,2.652139,2.652139,1.508272,1.508272,11.622735,0.0,1.775363,0.722791,0.0,9.160799,0.722791,0.722791,1.351829,11.909965,4.434198,0.0,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1461,10.584366,9.890642,0.0,0.0,1.508272,1.508272,9.632892,0.0,2.191871,0.722791,0.0,0.0,1.508272,8.348538,0.0,1.508272,1.508272,1.775363,0.722791,1.174319,2.514122,9.108063,1.174319,1.508272,1.508272,12.411803,11.872453,0.722791,1.998964,0.722791,1.775363,1.508272,13.00762,16.826583,5.744441,1.508272,0.0,3.675065,0.0,4.519621,2.191871,2.191871,1.174319,0.0,0.0,0.0,1.174319,2.361882,10.584366,1.775363,8.560166,12.411803,12.411803,1.174319,1.174319,0.0,0.722791,0.722791,1.174319,1.775363,1.174319,1.174319,1.174319,0.722791,0.722791,0.722791,0.722791,0.722791,4.519621,2.652139,2.652139,1.775363,1.775363,11.562494,0.0,1.775363,0.722791,0.0,9.108063,1.174319,1.174319,1.351829,11.872453,4.519621,0.722791,0.0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1462,11.333057,0.0,0.0,0.0,1.174319,1.508272,7.937554,0.0,1.998964,0.722791,0.0,0.0,1.775363,10.798143,0.0,1.508272,1.775363,0.0,0.0,1.174319,2.514122,9.2629,1.174319,1.508272,1.508272,12.404321,11.333057,0.0,1.998964,0.722791,1.775363,1.508272,13.037817,14.827685,4.789665,1.508272,0.0,0.0,0.0,5.82813,1.998964,2.514122,1.174319,0.0,0.0,6.808655,1.174319,1.998964,11.333057,1.775363,0.0,12.404321,12.404321,1.508272,1.174319,0.0,0.722791,0.722791,0.0,1.775363,1.174319,1.174319,1.174319,0.722791,0.722791,1.174319,0.722791,1.174319,4.678929,2.652139,3.003505,1.775363,0.0,11.739362,0.0,2.191871,1.174319,0.0,9.2629,0.0,1.174319,1.174319,11.333057,7.645382,0.0,0.0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1


# Modeling

Now that the data is ready, we can start creating the model for predicting the LogSalePrice! The group tested various models but ended up with two models that were stacked. We found that the Lasso Regression and Gradient Boosting models, when stacked, made the best prediction of the target variable.

In [41]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn import ensemble
import xgboost as xgb
import lightgbm as lgb

# Checking for any leftover Na's

In [42]:
index = all_df.index[np.sum(np.isnan(all_df), axis=1)>0]
all_df_na = all_df.loc[index, :]
all_df_na.head(10)

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BsmtCond,BsmtFinSF1,BsmtFinSF2,BsmtFinType1,BsmtFinType2,BsmtFullBath,BsmtHalfBath,BsmtQual,BsmtUnfSF,EnclosedPorch,ExterCond,ExterQual,FireplaceQu,Fireplaces,FullBath,Functional,GarageArea,GarageCars,GarageCond,GarageQual,GarageYrBlt,GrLivArea,HalfBath,HeatingQC,KitchenAbvGr,KitchenQual,LandSlope,LivBsmtArea,LotArea,LotFrontage,LotShape,LowQualFinSF,MasVnrArea,MiscVal,OpenPorchSF,OverallCond,OverallQual,PavedDrive,PoolArea,PoolQC,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,SimplOverallQual,SimplOverallCond,SimplPoolQC,SimplGarageCond,SimplGarageQual,SimplFireplaceQu,SimplFunctional,SimplKitchenQual,SimplHeatingQC,SimplBsmtFinType1,SimplBsmtFinType2,SimplBsmtCond,SimplBsmtQual,SimplExterCond,SimplExterQual,OverallGrade,GarageGrade,ExterGrade,KitchenScore,FireplaceScore,GarageScore,PoolScore,SimplOverallGrade,SimplExterGrade,SimplPoolScore,SimplGarageScore,SimplFireplaceScore,SimplKitchenScore,TotalBath,AllFlrsSF,AllPorchSF,HasMasVnr,BoughtOffPlan,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,BsmtExposure_0,BsmtExposure_1,BsmtExposure_2,BsmtExposure_3,BsmtExposure_No,CentralAir_N,CentralAir_Y,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_NoFnc,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,GarageFinish_Fin,GarageFinish_NoGrg,GarageFinish_RFn,GarageFinish_Unf,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_NoGrg,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,MSSubClass_SC120,MSSubClass_SC150,MSSubClass_SC160,MSSubClass_SC180,MSSubClass_SC190,MSSubClass_SC20,MSSubClass_SC30,MSSubClass_SC40,MSSubClass_SC45,MSSubClass_SC50,MSSubClass_SC60,MSSubClass_SC70,MSSubClass_SC75,MSSubClass_SC80,MSSubClass_SC85,MSSubClass_SC90,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,MiscFeature_Gar2,MiscFeature_NoMscFtr,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,MoSold_Apr,MoSold_Aug,MoSold_Dec,MoSold_Feb,MoSold_Jan,MoSold_Jul,MoSold_Jun,MoSold_Mar,MoSold_May,MoSold_Nov,MoSold_Oct,MoSold_Sep,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,YrSold_Year2006,YrSold_Year2007,YrSold_Year2008,YrSold_Year2009,YrSold_Year2010


## Cross Validation Function

Cross-validation scores were computed to help us decide which model to use. The team ran Lasso Regression, Ridge Regression, Elastic Net, Extreme Gradient Boosting, Gradient Boosting, Light Gradient Boosting, and Random Forest. The lowest cross-validation score came from Lasso Regression, a linear model. We were not surprised upon viewing the results as the exploratory data analysis we did showed that features had a noticeable linear relationship with the target variable.

The Gradient Boosting model was selected not because it provided the best cross-validation score, but because it improved our model when it was stacked with Lasso. Gradient Boosting, being a tree-based model, complemented Lasso regression on features which did not have a clear linear relationship with the target. We believe that this is the reason why stacking it with Lasso increased prediction accuracy.

In [43]:
n_folds_all = 5

def rmsle_cv(model):
    kf = KFold(n_folds_all, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, logsaleprice_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

## Elastic Net

In [48]:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=0, random_state=3))

## Elastic Net Grid Search

In [26]:
#could not get this to work
#has infinity problem
# Grid Search for Algorithm Tuning
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
l1_ratio = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = ElasticNet()
grid = GridSearchCV(estimator=model, param_grid=dict(l1_ratio=l1_ratio))
grid.fit(train, logsaleprice_train)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.l1_ratio)


GridSearchCV(cv=None, error_score='raise',
       estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'l1_ratio': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,   1.00000e-03,
         1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.82885193396
0.0


## Lasso Regression

In [44]:
#lasso original settings which has the best output
#lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

## Lasso Regression Gridsearch

We further improved the models by tuning the parameters using the GridSearchCV and RandomizedSearchCV function in python's sklearn package. The grid search for Lasso’s alpha variable gave a value of 0.0001 and Gradient Boosting’s learning_rate and min_samples_leaf variables a value of 0.11 and 13 respectively. We adjusted the variables to 0.0005, 0.05 and 13 respectively as the ones that grid search gave overfit within the training sample and needed to be adjusted manually for the test data.

In [49]:
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alpha = np.linspace(0,0.0001,10)
# create and fit a ridge regression model, testing each alpha
model = Lasso()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alpha))
grid.fit(train, logsaleprice_train)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)


GridSearchCV(cv=None, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': array([  0.00000e+00,   1.11111e-05,   2.22222e-05,   3.33333e-05,
         4.44444e-05,   5.55556e-05,   6.66667e-05,   7.77778e-05,
         8.88889e-05,   1.00000e-04])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.909449175182
0.0001


## Gradient Boosting

In [46]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=13, min_samples_split=13, 
                                   max_leaf_nodes=None,
                                   loss='huber', random_state =5)

#max leaf nodes = None always!!
# original settings that has the best outcome:
# n_estimators=3000, learning_rate=0.05,
#  max_depth=4, max_features='sqrt',
#  min_samples_leaf=13, min_samples_split=13, 
#  loss='huber', random_state =5)

## Randomized Gridsearch for GBoost

In [40]:
# Randomized Search for Algorithm Tuning
from scipy.stats import uniform as sp_rand
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# load the diabetes datasets
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'learning_rate': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = GradientBoostingRegressor()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(train, logsaleprice_train)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.learning_rate)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10dffed30>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)
0.903231150732
0.110375141164


## Gridsearch for GBoost

In [131]:
#could not get this to work
#has infinity problem

# Grid Search for Algorithm Tuning
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
#np.linspace(0,1,100)
max_depth = [int(x) for x in np.linspace(3,9,7)]
# create and fit a ridge regression model, testing each alpha
model = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=13, min_samples_split=10, 
                                   loss='huber', random_state =5)
grid = GridSearchCV(estimator=model, param_grid=dict(max_depth=max_depth))
grid.fit(train, logsaleprice_train)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.max_depth)

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.05, loss='huber', max_depth=4,
             max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=13, min_samples_split=10,
             min_weight_fraction_leaf=0.0, n_estimators=3000,
             presort='auto', random_state=5, subsample=1.0, verbose=0,
             warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.915730926315
3


# Model Scores

## Checking which variables work based on the Lasso model

In [47]:
# for filtering columns which are important based on lasso
lasso2 = Lasso(alpha =0.0005, random_state=1)
lasso2.fit(train, logsaleprice_train)
colnames = list(train)
significant_col = lasso2.coef_==0
colnames = [x for x,y in zip(colnames, significant_col) if y==True]
print (len(colnames))
print (colnames)

184
['Alley', 'BsmtCond', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'ExterCond', 'FireplaceQu', 'Fireplaces', 'GarageCond', 'GarageQual', 'GarageYrBlt', 'LandSlope', 'LotFrontage', 'LotShape', 'OverallCond', 'PoolArea', 'PoolQC', 'Street', 'TotRmsAbvGrd', 'Utilities', 'YearBuilt', 'YearRemodAdd', 'SimplOverallQual', 'SimplOverallCond', 'SimplPoolQC', 'SimplGarageCond', 'SimplGarageQual', 'SimplFireplaceQu', 'SimplFunctional', 'SimplKitchenQual', 'SimplHeatingQC', 'SimplBsmtFinType1', 'SimplBsmtFinType2', 'SimplBsmtCond', 'SimplBsmtQual', 'GarageGrade', 'ExterGrade', 'KitchenScore', 'GarageScore', 'SimplExterGrade', 'SimplPoolScore', 'SimplFireplaceScore', 'AllFlrsSF', 'HasMasVnr', 'BoughtOffPlan', 'BldgType_2fmCon', 'BldgType_Twnhs', 'BsmtExposure_0', 'BsmtExposure_2', 'Condition1_Feedr', 'Condition1_PosA', 'Condition1_RRAn', 'Condition1_RRNe', 'Condition1_RRNn', 'Condition2_Artery', 'Condition2_Feedr', 'Condition2_Norm', 'Condition2_PosA', 'Condition2_PosN', 'Condition2_RRAe', '

## Log Sale Price as target

In [49]:
print (lasso)
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('lasso', Lasso(alpha=0.0005, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=1,
   selection='cyclic', tol=0.0001, warm_start=False))])

Lasso score: 0.1119 (0.0047)



In [50]:
print (ENet)
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('elasticnet', ElasticNet(alpha=0.0005, copy_X=True, fit_intercept=True, l1_ratio=0,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=3, selection='cyclic', tol=0.0001, warm_start=False))])
ElasticNet score: 0.1229 (0.0068)



In [51]:
print (GBoost)
score = rmsle_cv(GBoost)
print("\nGradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.05, loss='huber', max_depth=4,
             max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=13, min_samples_split=13,
             min_weight_fraction_leaf=0.0, n_estimators=3000,
             presort='auto', random_state=5, subsample=1.0, verbose=0,
             warm_start=False)

Gradient Boosting score: 0.1160 (0.0081)



# Stacking

With the models chosen and their parameters set, we ran a stacking code with Lasso and Gradient Boosting as the base models and Lasso as our meta model. Stacking is a type of ensembling which improves model accuracy by combining a list of base models using a meta model. For the predictions made by the base models, since we are using Lasso as our meta model, a beta will be multiplied to each of these predictions which are calculated by running a Lasso regression.

Stacking enables the combination of models which has the ability to improve the score further. As in the modeling that we did, the cross-validation score of 0.1119 using a plain Lasso model improved to 0.1069 when we used stacking.

In [52]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=n_folds_all):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

# Running the CV function to check the scores

In [53]:
print(lasso)
print(GBoost)
stacked_averaged_models = StackingAveragedModels(base_models = (lasso, GBoost),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("\nStacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('lasso', Lasso(alpha=0.0005, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=1,
   selection='cyclic', tol=0.0001, warm_start=False))])
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.05, loss='huber', max_depth=4,
             max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=13, min_samples_split=13,
             min_weight_fraction_leaf=0.0, n_estimators=3000,
             presort='auto', random_state=5, subsample=1.0, verbose=0,
             warm_start=False)

Stacking Averaged models score: 0.1069 (0.0054)


# Applying the models to test set

In [54]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

## Stacked RMSLE

In [55]:
stacked_averaged_models.fit(train.values, logsaleprice_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
stacked_pred = stacked_pred.round(2)
print(rmsle(logsaleprice_train, stacked_train_pred))

0.075112049819


## Pure Lasso RMSLE (Showing that Stacked is better than pure Lasso)

In [56]:
lasso.fit(train.values, logsaleprice_train)
lasso_train_pred = lasso.predict(train.values)
lasso_pred = np.expm1(lasso.predict(test.values))
lasso_pred = lasso_pred.round(2)
print(rmsle(logsaleprice_train, lasso_train_pred))

0.100601270717


In [57]:
stacked_pred

array([ 117985.01,  159535.19,  186829.37, ...,  163233.58,  117843.19,
        217341.69])

# Submission - LogSalePrice

In [68]:
Submission = pd.DataFrame({'Id':list(range(1461,2920,1)),'SalePrice':stacked_pred})
Submission.to_csv('submission_mar_6_2018_7h36m.csv', index=False)

# The End

Thank you everyone for listening to Suicide Squad's presentation! :)