# Advanced House Pricing Prediction : EDA & Modelling

![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/logos/front_page.png)


I have recently started working on the Ames Housing Prices regression dataset. This notebook showcases the exploratory analysis, data visualization, data processing, missing value treatment, tree-based models and model blending that I have applied to the data. Please provide your feedback in the comments.

# Table of Contents
* [Exploratory Analysis & Data Transformation](#DataTransformations)
* [Feature Engineering](#feature)
* [Model Development](#model)



# Exploratory Analysis


## Importing the Raw Datasets<a name="DataTransformations"></a>

The competition provides two datasets, a training set and a test set. The data is split roughly 50-50 between them: the training set contains 1460 records and the test set 1459. Each record corresponds to a single house sale. The dependent variable is **SalePrice**, the price at which the house was sold. There are 79 independent variables capturing different pieces of information about the house being sold: area, location, number of rooms etc.

Let's start off by importing the datasets and having a quick glance at the data. 



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#Viz libraries used in the notebook
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib


#Importing sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor #Random forest libraries
from sklearn.model_selection import cross_validate #cross validation
from sklearn.impute import SimpleImputer           #Treatment of missing values
from sklearn.preprocessing import OrdinalEncoder   #Ordinal Encoder package
from sklearn.preprocessing import LabelEncoder     #For Label Encoding
from sklearn.metrics import mean_squared_log_error #Mean Squared Log Error metric from sklearn
import xgboost as xgb                              #XGboost package
from sklearn.model_selection import GridSearchCV   #Grid search for finding the best hyperparameters


#AV = AutoViz_Class()
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

train=pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

train

In [None]:
train.describe()

Some things immediately stand out from the outputs above - 

* The dataset is very rich in terms of the number of properties available for a house, everything from construction material, housetype, garage, basement, neighborhood properties etc. are covered in the dataset
* The **Id** column is the primary key of the dataset; it uniquely identifies every house. There are 1460 records in the training dataset, each corresponding to a single house
* **LotArea** corresponds to the area of the property. There is a huge difference between its minimum and maximum values, which range from 1300 sq. ft. to 215245 sq. ft.
* There are several quality variables rating the overall house, exterior, kitchen, garage etc. Of these, the numeric **OverallQual** and **OverallCond** are on a 1-10 scale
* **YearBuilt** has values from 1872 to 2010, showing that there is a lot of variation in the age of the house at the time of sale
* **YrSold** refers to the year in which the house was sold. It shows that the sales took place between **2006 and 2010**, a fairly short time period. With such a short time range we can largely discount the impact of inflation on house prices
* There are several categorical variables as well, like Neighborhood, LotShape, Street, Condition etc. We will analyze these variables separately
* **SalePrice** shows the price of the house sold, in dollars. The value of SalePrice ranges from 34,900 to 755,000. This is the dependent variable in this competition. We will soon have a more detailed look at this variable.
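
The ranges quoted above can also be checked directly instead of being read off the `describe()` table. Here is a minimal sketch, assuming a small made-up frame `toy` standing in for `train`; the column names match the dataset but the rows are illustrative only.

```python
import pandas as pd

# Toy stand-in for `train`; the rows are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [1300, 8450, 215245],
    "YearBuilt": [1872, 1976, 2010],
    "SalePrice": [34900, 163000, 755000],
})

# Minimum and maximum of every column in one small table.
ranges = toy.agg(["min", "max"])
print(ranges)
```

On the real data the same call would be `train[['LotArea','YearBuilt','YrSold','SalePrice']].agg(['min','max'])`.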



## Dependent Variable - SalePrice

In [None]:
train['logSalePrice']=np.log(train.SalePrice) 

fig = px.histogram(train, x="SalePrice",title='Distribution of SalePrice',height=400)
fig.show()

fig1 = px.violin(train, x="SalePrice",title='Violin Plot for SalePrice',height=300)
fig1.update_traces(box_visible=True, meanline_visible=True)
fig1.show()


fig2 = px.violin(train, x="logSalePrice",title='Violin Plot for Log(SalePrice)',height=300)
fig2.update_traces(box_visible=True, meanline_visible=True)
fig2.show()

As the plots above show, the SalePrice values are not symmetrically distributed around the mean; there is a **right skew** in the distribution. The violin plot for SalePrice shows that the mean is around 181k and the third quartile is at about 214k, but there are quite a few values in the long tail with SalePrice in excess of 300k. These values also pull the mean above the median (163k).

We will be training linear regression and tree-based models for predicting SalePrice later in the notebook, and these outlier values can cause issues in model training and add variance to the model predictions. One of the common ways to deal with this is to log transform the SalePrice values to reduce the skew of the distribution. We did this and plotted the distribution of log(SalePrice); it is roughly symmetric around the mean and looks approximately normal.
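
To make the effect of the log transform concrete, the sketch below measures skewness before and after taking the log. It uses a synthetic log-normal sample (an assumption for illustration, not the competition data), so the exact numbers will differ from those for `train.SalePrice`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# This is an illustrative sample, not the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skew_raw = prices.skew()          # clearly positive for a right-skewed sample
skew_log = np.log(prices).skew()  # close to zero after the log transform

print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The same two calls on `train['SalePrice']` and `train['logSalePrice']` would give the corresponding figures for the real target.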

If the target variable is skewed, high values will affect the variances and push the split points towards higher values, forcing the decision tree to make less balanced splits while trying to "isolate" the tail from the rest of the points. The link below provides more details on the impact of outlier values on tree-based models - 

https://stats.stackexchange.com/questions/447863/log-transforming-target-var-for-training-a-random-forest-regressor
 

## Univariate Analysis - Relationship between Independent Variables and SalePrice

Having understood the basic structure of the training dataset and the target variable SalePrice, let's now look at the relationship between individual variables and the target. We will break this analysis into two parts: first we will look at the relationship between continuous variables and the target variable, and then we will perform the same analysis for categorical variables. Let's start off by looking at the count of missing values for all the columns in the training dataset, and then we will look at the distributions.


In [None]:
#Count of missing values for every column that has at least one
k=train.isnull().sum()
k[k>0].sort_values(ascending=False)


In [None]:
# Scatter plots of SalePrice against the numeric columns
train_num=train.select_dtypes(include='number')
cols=train_num.columns
n_plots=35   # columns 1..35 are plotted; column 0 is Id and is skipped

fig = make_subplots(
    rows=9, cols=4,shared_yaxes=True,subplot_titles=tuple(cols[1:n_plots+1]))

for i in range(1,n_plots+1):
    # Fill the 9x4 grid row by row
    row,col=divmod(i-1,4)
    fig.add_trace(
        go.Scatter(y=train_num['SalePrice'], x=train_num.iloc[:,i],mode='markers',name=cols[i]),
        row=row+1, col=col+1
    )

fig.update_layout(height=1800, width=800,
                  title_text="Dependency between SalePrice & Continuous Variables",showlegend=False)
fig.show()

First thoughts: that's a lot of scatter plots to look at. Let's look at them in detail and summarise our findings - 
* The y axis on each of the scatter plots above is the SalePrice variable and the x axis represents the variable being plotted. As is to be expected, not all variables are correlated with SalePrice; for quite a few variables there seems to be no dependency at all
* **OverallQual** seems to be much more positively correlated with SalePrice than **OverallCond**
* **LotArea** does not seem to have a strong correlation with SalePrice; SalePrice varies widely for similar values of LotArea
* **GrLivArea,TotalBsmtSF & 1stFlrSF** all look to have a positive linear correlation with SalePrice
* Houses built more recently sell for higher values, which is intuitive because newly built houses would be expected to fetch higher values than older houses
* SalePrice tends to generally increase with increase in **TotRmsAbvGrd**
* **Avg. SalePrice** increases with increase in number of bathrooms
* SalePrice seems to increase with increase in **GarageArea**
* Majority of houses don't have a swimming pool
* The **month and year** of purchase does not seem to have any impact on SalePrice

A first glance at these plots also gives several ideas for feature engineering: a lot of individual variables don't have a high correlation with SalePrice, but could generate meaningful features when combined with other variables. Here are some examples - 
* Combining the bathroom count variables for different floors & basement to derive total bathrooms
* Total area of a house, combining the area of different floors and the basement
* Age of a house at the time of sale, using the YrSold and YearBuilt values
* Combining the area of the different porch variables

We will focus on this once we've completed the exploratory analysis and are getting ready for model building. Let's now get back to the exploratory analysis. 

In [None]:
# Categorical (object) columns plus the target; .copy() avoids a SettingWithCopyWarning
train_str_data=train.select_dtypes(include='object').copy()
train_str_data['SalePrice']=train['SalePrice']

cols=train_str_data.columns
tup=tuple(cols[1:])
fig = make_subplots(
    rows=10, cols=4,shared_yaxes=True,subplot_titles=tup)

row=1
col=1

for i in range(1,41):
    uniqvals=train_str_data.iloc[:,i].unique() 
    lenvals=len(uniqvals)
    for j in range(lenvals):
        k=train_str_data.iloc[:,i]==uniqvals[j]
        df=train_str_data.loc[k]
        fig.add_trace(go.Violin(x=df.iloc[:,i],
                        y=df['SalePrice']),row=row,col=col)
    
    col=col%4
    col=col+1
    if(i%4==0):
        row=row+1
    #print(row,col)
fig.update_traces(box_visible=True, meanline_visible=True)
fig.update_layout(height=2800, width=800,
                  title_text="Dependency between SalePrice & Categorical Variables",showlegend=False)
fig.show()

As with the continuous variables, we've now plotted the categorical variables against SalePrice. For each categorical variable we plotted the distribution of SalePrice for each of its distinct values, to compare the mean SalePrice across those values. Below are some of the key findings - 
* As with the continuous variables, there are a lot of categorical variables where the mean SalePrice barely differs between distinct values. In fact an initial look suggests that the majority of categorical variables don't have a clear relationship with SalePrice
* The average SalePrice varies drastically with **Neighborhood**. This is intuitive and expected, since high income neighborhoods in a city generally command a higher price
* Variables with **Qual and Cond** suffixes show variation in SalePrice across their values. The distinct values for these variables range from Excellent to Poor. These variables can also be classified as ordinal categorical variables, since there is an intrinsic order between their values
* SalePrice seems to vary to an extent with **RoofStyle and RoofMatl** as well, however certain values of **RoofMatl** have very small violins (a small number of records)
* **Built in Garages** command higher price than other types

Similar to the scatter plots for continuous variables, these violin plots also give us an indication of how to perform feature engineering for categorical variables. Here are some ideas - 
* Ordinal categorical variables for various quality measures can be converted to numeric variables
* For variables with lot of distinct levels, we can combine them to reduce number of levels
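
The second idea, combining levels, can be sketched as follows. The helper `collapse_rare_levels`, its `min_count` threshold and the `"Other"` label are hypothetical names introduced for illustration; they are not used elsewhere in this notebook.

```python
import pandas as pd

def collapse_rare_levels(series, min_count, other_label="Other"):
    """Replace levels seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the value where it is not rare, otherwise use the catch-all label.
    return series.where(~series.isin(rare), other_label)

# Toy roof-material column; the values are illustrative.
roof = pd.Series(["CompShg"] * 6 + ["Tar&Grv"] * 3 + ["Metal", "Roll"])
collapsed = collapse_rare_levels(roof, min_count=2)
print(collapsed.value_counts())
```

Levels with very few records, like the small violins noted for RoofMatl, end up in a single bucket, which reduces the number of dummy variables needed later.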

Let's now have a deeper look at distribution of prices for Neighborhood variable. 

### Variation of SalePrice by Neighborhood

In [None]:
fig = px.violin(train, y="SalePrice", x="Neighborhood", box=True, color='Neighborhood',title='Distribution of SalePrice by Neighborhood')
fig.update_layout(showlegend=False)
fig.show()

neighborhood_md=train.groupby('Neighborhood').agg({'SalePrice':'median'}).reset_index().sort_values(by='SalePrice',ascending=False)
neighborhood_md

fig1 = px.bar(neighborhood_md, y="SalePrice", x="Neighborhood",color='SalePrice',title='Median SalePrice by Neighborhood')
fig1.update_layout(showlegend=False)
fig1.show()

* There is a huge variation in the median SalePrice across neighborhoods; it ranges from 300k+ for NridgHt to less than 100k for MeadowV
* Even within a single neighborhood the SalePrice values can sometimes be spread across a big range
* There are 25 distinct values for the Neighborhood variable. This can be a problem while training the model: creating dummy variables from Neighborhood would add about 25 additional variables, and with only 1460 records in the training data this adds a lot of dimensions relative to the sample size (the curse of dimensionality). Therefore we can group the Neighborhood values into a few groups based on the median SalePrice in the neighborhood
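
The grouping idea in the last bullet can be sketched with `pd.cut`. The toy frame and the cut points below are assumptions chosen for illustration; the notebook's own grouping is done later in the feature engineering section.

```python
import numpy as np
import pandas as pd

# Toy data: three neighborhoods with made-up prices.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "SalePrice": [90000, 110000, 160000, 180000, 300000, 320000],
})

# Median price per neighborhood, then bin the medians into ordered groups.
medians = toy.groupby("Neighborhood")["SalePrice"].median()
bins = [0, 150000, 200000, 250000, np.inf]   # illustrative cut points
groups = pd.cut(medians, bins=bins, labels=False, right=False)

# Attach the group code back to every row.
toy["neighborhood_group"] = toy["Neighborhood"].map(groups)
print(toy)
```

This replaces 25 dummy columns with a single ordered code, at the cost of treating all neighborhoods in one price band as identical.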


In [None]:

# zoning=train.groupby('MSZoning').agg({'SalePrice':'median'}).reset_index().sort_values(by='SalePrice',ascending=False)
# neighborhood_md

# fig1 = px.bar(zoning, y="SalePrice", x="MSZoning",color='SalePrice',title='Median SalePrice by MSZoning')
# fig1.update_layout(showlegend=False)
# fig1.show()

In [None]:
# exterior=train.groupby('Exterior1st').agg({'SalePrice':'median'}).reset_index().sort_values(by='SalePrice',ascending=False)

# fig = px.bar(exterior, y="SalePrice", x="Exterior1st",color='SalePrice',title='Median SalePrice by Exterior1st')
# fig.update_layout(showlegend=False)
# fig.show()

# house=train.groupby('HouseStyle').agg({'SalePrice':'median'}).reset_index().sort_values(by='SalePrice',ascending=False)

# fig1 = px.bar(house, y="SalePrice", x="HouseStyle",color='SalePrice',title='Median SalePrice by HouseStyle')
# fig1.update_layout(showlegend=False)
# fig1.show()

### Condition1 and Condition2: Impact of Proximity to Various Conditions on SalePrice

In [None]:
condition=train.groupby(['Condition1','Condition2']).agg({'SalePrice':'median','Utilities':'count'}).reset_index().sort_values(by='SalePrice',ascending=False)
trans=condition.pivot(index='Condition1',columns='Condition2',values='SalePrice')
trans=trans.fillna(0)
cm = sns.light_palette("green", as_cmap=True)
s = trans.style.background_gradient(cmap=cm)
s




In [None]:
# condition.loc[condition['SalePrice']<100000,'condition_flag']=0
# condition.loc[condition['SalePrice']>=100000,'condition_flag']=1
# condition.loc[condition['SalePrice']>=150000,'condition_flag']=2
# condition.loc[condition['SalePrice']>=200000,'condition_flag']=3
# del condition['SalePrice']
#condition

condition=condition.rename(columns={"SalePrice": "avg_sale_price_cond","Utilities": "total_records"})
condition.sort_values(by='total_records',ascending=False,inplace=True)
condition

As seen in the table above, the price of a house varies a great deal based on its proximity to different features - 
* Houses adjacent to feeder streets have low prices
* Houses near or adjacent to positive off-site features like parks etc. have a higher price
* Houses close to the East-West railroad have low selling prices

We also looked at the count of records for each combination of Condition1 and Condition2 to check whether the distinct values carry enough information about pricing. As it turns out, close to 90% of records in the train dataset have both Condition1 and Condition2 as Norm, which tells us that it does not make sense to create features from the distinct values of these variables.
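
The share of records per combination can be computed directly. This is a minimal sketch on a made-up frame `toy`; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the two proximity columns; values are illustrative.
toy = pd.DataFrame({
    "Condition1": ["Norm"] * 8 + ["Feedr", "PosN"],
    "Condition2": ["Norm"] * 10,
})

# Fraction of records in each (Condition1, Condition2) combination.
share = (
    toy.groupby(["Condition1", "Condition2"])
       .size()
       .div(len(toy))
       .sort_values(ascending=False)
)
print(share)
```

When one combination dominates like this, the remaining combinations have too few records each to support reliable features.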

In [None]:
# exterior.loc[exterior['SalePrice']>=110000,'ExteriorFlag']=0
# exterior.loc[exterior['SalePrice']>=125000,'ExteriorFlag']=1
# exterior.loc[exterior['SalePrice']>=130000,'ExteriorFlag']=2
# exterior.loc[exterior['SalePrice']>=150000,'ExteriorFlag']=3
# exterior.loc[exterior['SalePrice']>=200000,'ExteriorFlag']=4

# #exterior['ExteriorFlag']=int(exterior['ExteriorFlag'])
# del exterior['SalePrice']
# #exterior

In [None]:
# house.loc[house['SalePrice']>=110000,'StyleFlag']=1
# house.loc[house['SalePrice']>=150000,'StyleFlag']=2
# house.loc[house['SalePrice']>=180000,'StyleFlag']=3
# #exterior['ExteriorFlag']=int(exterior['ExteriorFlag'])
# del house['SalePrice']
# #exterior

### Month & Year of Sale : Checking for Seasonality in House Sales




In [None]:
gp=train.groupby(['YrSold','MoSold']).agg({'SalePrice':'median','LotArea':'count'}).reset_index()
gp['MoSold'] = gp['MoSold'].apply(str)
gp['YrSold'] = gp['YrSold'].apply(str)

gp['month_year']=gp['YrSold']+"-"+gp['MoSold']
gp.columns=['YrSold','MoSold','SalePrice','SaleCount','month_year']

# fig=px.line(gp,x='month_year',y='SalePrice')
# fig.show()


fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=gp['month_year'], y=gp['SalePrice'], name="SalePrice"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=gp['month_year'],y=gp['SaleCount'], name="SaleCount"),
    secondary_y=True,
)

fig.update_layout(title_text="Median SalePrice and SaleCount over Time")
fig.show()

We looked at the median SalePrice for different months and years in the dataset to check whether houses sell for a higher price in a specific month or year. 2008 was also the year of the global financial crisis, so we wanted to check whether house prices collapsed during that year. 

Even though the count of sales transactions goes up and down within each year, there is hardly enough variation in the median SalePrice across months to conclude that prices rise or fall at any specific time of the year. 
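
One simple way to put a number on "hardly enough variation" is to compare the spread of the monthly medians with the overall median. The sketch below does this on made-up sales; the `relative_spread` measure is an assumption chosen for illustration, not a formal seasonality test.

```python
import pandas as pd

# Toy sales; values are illustrative.
toy = pd.DataFrame({
    "MoSold": [1, 1, 6, 6, 6, 12],
    "SalePrice": [150000, 170000, 155000, 165000, 160000, 162000],
})

# Median price and number of sales for each month.
by_month = toy.groupby("MoSold")["SalePrice"].agg(median="median", count="count")

# Spread of the monthly medians relative to the overall median.
relative_spread = (
    by_month["median"].max() - by_month["median"].min()
) / toy["SalePrice"].median()

print(by_month)
print(f"relative spread of monthly medians: {relative_spread:.1%}")
```

Here the sale counts differ by month while the medians stay within a couple of percent of each other, which is the pattern described above.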

In [None]:
# gp.MoSold=pd.to_numeric(gp.MoSold, errors='coerce')
# gp.YrSold=pd.to_numeric(gp.YrSold, errors='coerce')
# gp=gp[['YrSold','MoSold','SaleCount']]

#train=pd.merge(train,gp,left_on=['MoSold','YrSold'],right_on=['MoSold','YrSold'])

# Feature Engineering<a name="feature"></a>

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. The features in your data will directly influence the predictive models you use and the results you can achieve.

Having had a look at all the different variables present in the dataset, we will now start to look at creating new features out of existing features to extract more meaningful information from the variables and improve model performance. We will use two basic methods to create new features - 
1. Apply a mathematical operation to combine different continuous variables
2. Assign an intrinsic order or combine different levels for categorical variables


In [None]:
#Feature Engineering
#Grouping neighborhoods based on value of median SalePrice for the neighborhood
neighborhood_md.loc[neighborhood_md['SalePrice']>=5000,'neighborhood_flag']=0
neighborhood_md.loc[neighborhood_md['SalePrice']>=150000,'neighborhood_flag']=1
neighborhood_md.loc[neighborhood_md['SalePrice']>=200000,'neighborhood_flag']=2
neighborhood_md.loc[neighborhood_md['SalePrice']>=250000,'neighborhood_flag']=3
del neighborhood_md['SalePrice']

In [None]:
neighborhood_md.columns=['Neighborhood','neighborhood_flag']
train=pd.merge(train,neighborhood_md,on='Neighborhood',how='left') #left join keeps every row and the original row order

#train=pd.merge(train,exterior,left_on='Exterior1st',right_on='Exterior1st')
#train=pd.merge(train,house,left_on='HouseStyle',right_on='HouseStyle')
#train=pd.merge(train,condition,left_on=['Condition1','Condition2'],right_on=['Condition1','Condition2'])

test=pd.merge(test,neighborhood_md,on='Neighborhood',how='left') #left join keeps every row and the original row order

#test=pd.merge(test,exterior,left_on='Exterior1st',right_on='Exterior1st')
#test=pd.merge(test,house,left_on='HouseStyle',right_on='HouseStyle')
#test=pd.merge(test,condition,left_on=['Condition1','Condition2'],right_on=['Condition1','Condition2'])

In [None]:
#Feature Engineering for Continuous Variables
train['tot_bath']=train['BsmtFullBath']+0.5*train['BsmtHalfBath']+train['FullBath']+0.5*train['HalfBath']
train['bsmt_bath']=train['BsmtFullBath']+0.5*train['BsmtHalfBath']

train['bed_bath_kitch']=train['tot_bath']+train['BedroomAbvGr']+train['KitchenAbvGr']
train['area_floors']=train['1stFlrSF']+train['2ndFlrSF']+train['BsmtFinSF1']+train['BsmtFinSF2']
train['bsmt_by_total']=(train['BsmtFinSF1']+train['BsmtFinSF2'])/(train['area_floors'])
train['unf_bsmt']=(train['BsmtUnfSF']/train['TotalBsmtSF']).fillna(0)
train["unf_bsmt"]=train["unf_bsmt"].replace([np.inf, -np.inf], 0) #assign the result back; replace() is not in-place

train['porch_area_tot']=train['OpenPorchSF']+train['EnclosedPorch']+train['3SsnPorch']+train['ScreenPorch']
train['wood_deck_porch']=(train['WoodDeckSF']/train['porch_area_tot']).replace(np.inf, 0)
train['sale_built_yr']=train['YrSold']-train['YearBuilt']
train['remod_built_yr']=train['YearRemodAdd']-train['YearBuilt']
train['new_flag']=0
train.loc[train['YrSold']-train['YearBuilt']==0,'new_flag']=1 #house sold in the year it was built
train['remod_flag']=0
train.loc[train['remod_built_yr']>=2,'remod_flag']=1
train['floor_by_lot']=train['area_floors']/train['LotArea']

train['garage_area_per_car']=(train['GarageArea']/train['GarageCars']).fillna(0)
train["garage_area_per_car"]=train["garage_area_per_car"].replace([np.inf, -np.inf], 0)


test['tot_bath']=test['BsmtFullBath']+0.5*test['BsmtHalfBath']+test['FullBath']+0.5*test['HalfBath']
test['bsmt_bath']=test['BsmtFullBath']+0.5*test['BsmtHalfBath']
test['bed_bath_kitch']=test['tot_bath']+test['BedroomAbvGr']+test['KitchenAbvGr']
test['area_floors']=test['1stFlrSF']+test['2ndFlrSF']+test['BsmtFinSF1']+test['BsmtFinSF2']
test['bsmt_by_total']=(test['BsmtFinSF1']+test['BsmtFinSF2'])/(test['area_floors'])
test['unf_bsmt']=(test['BsmtUnfSF']/test['TotalBsmtSF']).fillna(0)
test["unf_bsmt"]=test["unf_bsmt"].replace([np.inf, -np.inf], 0) #assign the result back; replace() is not in-place

test['porch_area_tot']=test['OpenPorchSF']+test['EnclosedPorch']+test['3SsnPorch']+test['ScreenPorch']
test['wood_deck_porch']=(test['WoodDeckSF']/test['porch_area_tot']).replace(np.inf, 0)
test['sale_built_yr']=test['YrSold']-test['YearBuilt']
test['remod_built_yr']=test['YearRemodAdd']-test['YearBuilt']
test['new_flag']=0
test.loc[test['YrSold']-test['YearBuilt']==0,'new_flag']=1
test['remod_flag']=0
test.loc[test['remod_built_yr']>=2,'remod_flag']=1
test['floor_by_lot']=test['area_floors']/test['LotArea']

test['garage_area_per_car']=(test['GarageArea']/test['GarageCars']).fillna(0)
test["garage_area_per_car"]=test["garage_area_per_car"].replace([np.inf, -np.inf], 0)

In [None]:
mapping = { "NA" : 0,"Po" : 1,"Fa":2,"TA" : 3,"Gd":4,"Ex":5 }
mapping_shape = { "Reg" : 0,"IR1" : 1,"IR2":2,"IR3" : 3}
mapping_contour = { "Bnk" : 0,"Lvl" : 1,"Low":2,"HLS" : 3}
mapping_ms_zoning={"FV":4,"RL":3,"RH":2,"RM":1,"C (all)":0,"NA":-1}
mapping_paved_Drive={"Y":2,"P":1,"N":0}
mapping_utilities={"AllPub":1,"NoSeWa":0,"NoSeWr":0,"ELO":0,"NA":0}
mapping_functional={"Typ":4,"Min1":3,"Min2":3,"Mod":2,"Maj1":1,"Maj2":1,"Sev":0,"Sal":0,"NA":0}

#mapping_cond={"RRNn":2,"PosN":1,"PosA":0,"Artery":,""}

l_col=['BsmtCond','BsmtQual','ExterQual','ExterCond','HeatingQC','GarageQual','GarageCond','FireplaceQu','KitchenQual']
l_col_zon=['MSZoning']
l_col_pav=['PavedDrive']
l_col_shape=['LotShape']
l_col_util=['Utilities']
l_col_con=['LandContour']
l_col_fun=['Functional']

#l_col_cond=['Condition1']
def mapping_var(mapping,df,varlist):
    """Add an ordinal '<name>_o' column for each variable in varlist, using mapping."""
    for varname in varlist:
        # Missing values become the explicit 'NA' level before mapping
        df[varname]=df[varname].fillna('NA')
        df[varname+'_o']=df[varname].map(mapping)
    return df

train=mapping_var(mapping,train,l_col)
train=mapping_var(mapping_ms_zoning,train,l_col_zon)
train=mapping_var(mapping_paved_Drive,train,l_col_pav)
train=mapping_var(mapping_shape,train,l_col_shape)
train=mapping_var(mapping_utilities,train,l_col_util)
train=mapping_var(mapping_contour,train,l_col_con)
train=mapping_var(mapping_functional,train,l_col_fun)


test=mapping_var(mapping,test,l_col)
test=mapping_var(mapping_ms_zoning,test,l_col_zon)
test=mapping_var(mapping_paved_Drive,test,l_col_pav)
test=mapping_var(mapping_shape,test,l_col_shape)
test=mapping_var(mapping_utilities,test,l_col_util)
test=mapping_var(mapping_contour,test,l_col_con)
test=mapping_var(mapping_functional,test,l_col_fun)

# Fit each encoder on the training data and reuse it for the test data,
# so that both datasets share the same category-to-integer mapping
for col in ['Street','CentralAir']:
    labelencoder = LabelEncoder()
    train[col] = labelencoder.fit_transform(train[col])
    test[col] = labelencoder.transform(test[col])

In [None]:
train['tot_quality']=train['BsmtQual_o']+train['ExterQual_o']+train['KitchenQual_o']+train['GarageQual_o']+train['FireplaceQu_o']
test['tot_quality']=test['BsmtQual_o']+test['ExterQual_o']+test['KitchenQual_o']+test['GarageQual_o']+test['FireplaceQu_o']

Here is a summary of all the new variables created above - 
* Neighborhood values are grouped into different groups based on value of SalePrice
* Total bathrooms column based on total number of full and half bathrooms in different parts of house
* Total number of bedrooms, bathrooms and kitchens in the house
* Total house area combining total area of different floors and basement
* Portion of basement that is unfinished
* Whether a house is newly built or old
* Age of the house at the time of Sale
* Converting categorical variables to numeric by assigning a numeric value to the different labels, following their intrinsic order
* Combining different quality variables to create a total quality variable

Let's now look at the correlation matrix of all the numeric variables, old and new, to check how strongly each is correlated with the dependent variable SalePrice and how strongly the independent variables are correlated with each other.

In [None]:
#Correlation matrix for Variables
numcols=train._get_numeric_data().columns
train_num=train.loc[:,numcols]

corr = train_num.corr(method='pearson')
# corr
f, ax = plt.subplots(figsize=(25, 25))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool)) # np.bool was removed from recent NumPy versions; use the builtin bool

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.8, cbar_kws={"shrink": .5})

#ax = sns.heatmap(corr,linewidths=0.8,cmap=cmap)



* SalePrice seems to be highly correlated with neighborhood_flag, tot_quality and OverallQual
* OverallQual is mildly correlated with the individual quality variables for basement, exterior, garage etc.
* As expected, MoSold and YrSold aren't correlated with SalePrice
* LotArea is weakly correlated with SalePrice
* Pool and porch related variables are not correlated with SalePrice

### Multicollinearity
We will also check for multicollinearity among the independent variables in the dataset. According to Wikipedia, multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Multicollinearity can be problematic for linear regression because it can give unreliable coefficients for the correlated variables.
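
One common way to quantify this, which is not used elsewhere in this notebook, is the variance inflation factor (VIF). The sketch below relies on the fact that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. The predictors `x1`, `x2`, `x3` are synthetic and introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=500),
    "x3": rng.normal(size=500),
})

# The VIF of each predictor equals the matching diagonal entry of the
# inverse of the predictors' correlation matrix.
corr_matrix = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr_matrix)), index=X.columns)
print(vif.round(1))
```

A VIF near 1 means a predictor is not explained by the others. Values above roughly 5 to 10 are often treated, as a rule of thumb, as a sign of problematic collinearity.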




In [None]:
#Check for variable combinations where correlation is greater than 80%
k=corr.unstack().sort_values().drop_duplicates()
k[(k>0.8) | (k<-0.8)]

Some of the variables are highly correlated with each other. For example, bsmt_by_total, created during feature engineering, has more than 80% positive correlation with its base variable BsmtFinSF1, and GarageCars has an 88% correlation with GarageArea. During model training, we should take only one variable out of each highly correlated pair, since both carry largely the same information. 
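
Picking one variable out of each correlated pair can be automated. The sketch below is one possible approach on a toy frame; the helper `correlated_pairs` and the rule of always dropping the second member of a pair are assumptions made for illustration, and in practice the choice of which variable to keep deserves some thought.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for pairs with |corr| above `threshold`."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and self-correlations are excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]

# Toy frame: garage_area is nearly proportional to garage_cars.
toy = pd.DataFrame({
    "garage_cars": [1, 2, 2, 3, 1, 2],
    "garage_area": [250, 480, 500, 760, 240, 510],
    "lot_area": [9000, 7000, 12000, 8000, 11000, 6000],
})
pairs = correlated_pairs(toy)
to_drop = sorted({b for _, b, _ in pairs})   # drop the second member of each pair
reduced = toy.drop(columns=to_drop)
print(pairs, list(reduced.columns))
```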

As a next step we will combine the train and test datasets and identify the input variables with a skew. We will then apply a log transformation to those variables. 

In [None]:
#pd.get_dummies[train,['Neighborhood']]
#train.columns
#train=pd.get_dummies(data=train, columns=['Neighborhood'])
#test=pd.get_dummies(data=test, columns=['Neighborhood'])

In [None]:
#train=train.loc[train['SalePrice']<500000]

In [None]:
#Restore the original Id order in both datasets
test.sort_values(by='Id',inplace=True)
train.sort_values(by='Id',inplace=True)

salePrice=train['SalePrice']
del train['logSalePrice']
del train['SalePrice']

In [None]:
train['flag']='train'
test['flag']='test'
all_data = pd.concat([train, test], ignore_index=True, sort=False)
#all_data

In [None]:
from scipy.stats import skew

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
#numeric_feats

skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.5]
skewed_feats = skewed_feats.index
#skewed_feats=skewed_feats.drop('SalePrice')
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
#all_data

In [None]:
skewed_feats

In [None]:
#df['feature'] = df['feature'].replace(-np.inf, np.nan)

#type(skewed_feats)
#
numcols=all_data._get_numeric_data().columns
all_data_num=all_data.loc[:,numcols]

k=all_data_num.columns.to_series()[np.isinf(all_data_num).any()]
#test['remod_built_yr'].unique()
all_data[k] = all_data[k].replace(-np.inf, np.nan)

# Model Building <a name="model"></a>
In this section we will build different models for predicting house prices. We will train three different models (RandomForest, XGBoost and Lasso) and analyze how they perform on the training dataset. 

First let's start with missing value treatment. For the sake of simplicity, we will use the SimpleImputer in scikit-learn to substitute missing values in each feature with the mean of the column. 
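
A minimal sketch of mean imputation with `SimpleImputer` on made-up arrays is shown below. It follows one common convention, learning the column means on the training rows only and reusing them for the test rows; the arrays `X_train` and `X_test` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing entries; values are illustrative.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 40.0], [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns the column means 3.0 and 20.0
X_test_imp = imputer.transform(X_test)        # reuses the training means

print(X_test_imp)
```

Reusing the training means keeps the two datasets on the same footing and avoids letting test-set statistics influence preprocessing.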

Next we will create a basic Random Forest model and plot feature importance graph to understand which of the variables are important in model prediction. 


In [None]:
print(all_data.columns.values)

In [None]:
all_data_mod=all_data[['LotArea','GrLivArea','TotRmsAbvGrd','OverallQual','OverallCond','YearBuilt','YearRemodAdd','TotalBsmtSF','BsmtQual_o','ExterQual_o','GarageCars','1stFlrSF','2ndFlrSF','FullBath','FireplaceQu_o','KitchenQual_o','MSZoning_o','PavedDrive_o','LotShape_o', 'Utilities_o', 'LandContour_o', 'Functional_o','tot_bath','bed_bath_kitch','area_floors','GarageArea','Fireplaces','BedroomAbvGr','tot_quality','sale_built_yr','remod_built_yr','MSSubClass','neighborhood_flag','MasVnrArea','garage_area_per_car','floor_by_lot','remod_flag','new_flag','unf_bsmt','bsmt_by_total','bsmt_bath','flag','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF']]
all_data_mod

In [None]:
# train_mod=train[['LotArea','GrLivArea','TotRmsAbvGrd','OverallQual','OverallCond','YearBuilt','YearRemodAdd','TotalBsmtSF','BsmtQual_o','ExterQual_o','GarageCars','1stFlrSF','2ndFlrSF','FullBath','FireplaceQu_o','KitchenQual_o','MSZoning_o','tot_bath','bed_bath_kitch','area_floors','GarageArea','Fireplaces','BedroomAbvGr','tot_quality','sale_built_yr','remod_built_yr','MSSubClass','neighborhood_flag','MasVnrArea','garage_area_per_car','floor_by_lot','remod_flag','new_flag','unf_bsmt','bsmt_by_total','bsmt_bath']]
# test_mod=test[['LotArea','GrLivArea','TotRmsAbvGrd','OverallQual','OverallCond','YearBuilt','YearRemodAdd','TotalBsmtSF','BsmtQual_o','ExterQual_o','GarageCars','1stFlrSF','2ndFlrSF','FullBath','FireplaceQu_o','KitchenQual_o','MSZoning_o','tot_bath','bed_bath_kitch','area_floors','GarageArea','Fireplaces','BedroomAbvGr','tot_quality','sale_built_yr','remod_built_yr','MSSubClass','neighborhood_flag','MasVnrArea','garage_area_per_car','floor_by_lot','remod_flag','new_flag','unf_bsmt','bsmt_by_total','bsmt_bath']]
all_data_mod=all_data[['LotArea','GrLivArea','TotRmsAbvGrd','OverallQual','OverallCond','YearBuilt','YearRemodAdd','TotalBsmtSF','BsmtQual_o','ExterQual_o','GarageCars','1stFlrSF','2ndFlrSF','FullBath','FireplaceQu_o','KitchenQual_o','MSZoning_o','PavedDrive_o','LotShape_o', 'Utilities_o', 'LandContour_o', 'Functional_o','tot_bath','bed_bath_kitch','area_floors','GarageArea','Fireplaces','BedroomAbvGr','tot_quality','sale_built_yr','remod_built_yr','MSSubClass','neighborhood_flag','MasVnrArea','floor_by_lot','remod_flag','new_flag','unf_bsmt','bsmt_by_total','bsmt_bath','flag','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF']]
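# Missing Value Treatment using Simple Imputer
from sklearn.impute import SimpleImputer
# Split back into train and test using the 'flag' column, then drop it
train_mod=all_data_mod.loc[all_data_mod['flag']=='train',:].drop(columns='flag')
test_mod=all_data_mod.loc[all_data_mod['flag']=='test',:].drop(columns='flag')
my_imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
# Fit the imputer on the training data only and reuse its column means on the test data
train_imp = my_imputer.fit_transform(train_mod)
test_imp = my_imputer.transform(test_mod)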

In [None]:
# train_mod=train[['LotArea','tot_bath','sale_built_yr','TotalBsmtSF','YearBuilt','1stFlrSF','GarageArea','KitchenQual_o','GarageCars','neighborhood_flag','GrLivArea','ExterQual_o','tot_quality','area_floors','OverallQual']]
# test_mod=test[['LotArea','tot_bath','sale_built_yr','TotalBsmtSF','YearBuilt','1stFlrSF','GarageArea','KitchenQual_o','GarageCars','neighborhood_flag','GrLivArea','ExterQual_o','tot_quality','area_floors','OverallQual']]

# from sklearn.impute import SimpleImputer
# my_imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
# train_imp = my_imputer.fit_transform(train_mod)
# test_imp = my_imputer.fit_transform(test_mod)

In [None]:
X=train_imp
y=np.log1p(salePrice)#log of dependent variable
#y=train['SalePrice']
#Split train data into train and test datasets
X_train,X_test,Y_train,Y_Test=train_test_split(X, y,test_size=0.3,random_state=1)

#Defining a basic Random Forest model with 100 trees 
regressor = RandomForestRegressor(n_estimators=100, random_state=1,max_depth=10,max_features=10,min_samples_leaf=5)  
feature_list=train_mod.columns

# # fit the regressor with X and Y data 
model=regressor.fit(X_train,Y_train) 

#Make predictions on test dataset
pred_rf=model.predict(X_test)
#X_test['pred']=pred
#X_test['actual']=Y_Test

#Calculate Cross Validation results
cv_results = cross_validate(regressor, X_train, Y_train, cv=5,scoring='r2')
sorted(cv_results.keys())

scoreOfModel = model.score(X_test, Y_Test)
print("RSquared value for Model",scoreOfModel)

#Calculate Feature Importances
feat_importances = pd.Series(model.feature_importances_, index=feature_list)
feat_importances=feat_importances.sort_values()
feat_importances.plot(kind='barh',figsize=(12,9))

#x.style.background_gradient(cmap = 'Wistia')
#Print Root Mean Squared Log Error (expm1 inverts the log1p applied to SalePrice)
print("RMSLE for Model",np.sqrt(mean_squared_log_error(np.expm1(Y_Test),np.expm1(pred_rf) )))
#print(np.sqrt(mean_squared_log_error(Y_Test,pred_rf )))

In [None]:
print("Cross Validation Results")
print(cv_results)

Our initial model has an RMSLE of 0.159 and an R-squared of 0.86 on the hold-out split, which is a reasonable baseline. We also ran 5-fold cross validation to check for overfitting: the scores across folds are quite close to each other, which suggests that the model is not overfitting and does well on records it has not seen before. The most important part of the output above is the Feature Importance graph - 
* **OverallQual** has the highest feature importance, followed by **area_floors**
* The variables created through Feature Engineering, **neighborhood_flag** and **tot_quality**, also did quite well and rank next in feature importance
* Some variables, like **TotRmsAbvGrd, bsmt_bath, Fireplaces**, have little importance in the model. 
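
The overfitting check mentioned above can be made more explicit by comparing training and validation scores fold by fold. The following is a minimal sketch on synthetic data, assuming a toy regression problem from `make_regression` rather than the housing features; the names `X_toy`, `y_toy` and `rf_toy` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the housing features (assumption).
X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)

rf_toy = RandomForestRegressor(n_estimators=50, max_depth=10,
                               min_samples_leaf=5, random_state=1)

# return_train_score=True adds the score on each training fold,
# so the gap between training and validation R-squared is visible.
cv_toy = cross_validate(rf_toy, X_toy, y_toy, cv=5, scoring="r2",
                        return_train_score=True)

train_r2 = cv_toy["train_score"].mean()
valid_r2 = cv_toy["test_score"].mean()
print("Mean train R2:", train_r2)
print("Mean validation R2:", valid_r2)
print("Validation R2 spread across folds:", cv_toy["test_score"].std())
```

A small spread across folds says the model is stable, while a large gap between the training and validation scores is the more direct sign of overfitting.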

With this initial understanding of the variables, let's now use Grid Search to identify the set of hyperparameters that gives the best cross-validated results. 

In [None]:
pred=model.predict(test_imp)
#pred=best_grid_rf.predict(test_imp)
print(len(pred))
#test.shape
#test_mod.shape
Id=pd.Series(range(1461,2920))
#pred
#Id
d = {'Id': Id, 'SalePrice': pred}
df = pd.DataFrame(data=d)
#Convert predictions back to the price scale (expm1 inverts the log1p applied to SalePrice)
df['SalePrice']=np.expm1(df['SalePrice'])
df

df.to_csv('submission.csv',index=False)

In [None]:

# Create the parameter grid to explore with Grid Search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 3,7,10,15],
    'max_features': [5, 12,10,8],
  #  'max_leaf_nodes': [4, 8, 16,32],
    'min_samples_split': [5, 10, 8,3],
    'n_estimators': [100, 200,150,300]
}
rf = RandomForestRegressor()
grid_search_rf = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2)
# Fit the grid search to the data
grid_search_rf.fit(X_train,Y_train)
grid_search_rf.best_params_

In [None]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} (in log SalePrice units).'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy

In [None]:
best_grid_rf = grid_search_rf.best_estimator_
# grid_accuracy = evaluate(best_grid_rf, X_test, Y_Test)
# grid_accuracy

#Predict on the hold-out split with the tuned Random Forest
pred_gs_rf=best_grid_rf.predict(X_test)

scoreOfModel = best_grid_rf.score(X_test, Y_Test)
print("RSquared for the Model",scoreOfModel)

#RMSLE on the price scale (expm1 inverts the log1p applied to SalePrice)
print("RMSLE for the Model",np.sqrt(mean_squared_log_error(np.expm1(Y_Test),np.expm1(pred_gs_rf) )))

In [None]:
#Convert actuals and predictions back to the price scale before computing residuals
actuals=np.expm1(Y_Test)
predictions=np.expm1(pred_gs_rf)
residuals=predictions-actuals


plt.scatter(predictions, residuals)
plt.xlabel("Predicted SalePrice")
plt.ylabel("Residuals")
plt.show()

In [None]:
plt.scatter(actuals,predictions)
plt.ylabel("Predicted SalePrice")
plt.xlabel("Actual SalePrice")
plt.show()

The Grid Search model does only slightly better than the basic Random Forest model, which indicates that further hyperparameter tuning is needed to improve model performance. We also plotted residuals against fitted values and predicted against actual values to see how the model performs across the range of the dependent variable. The variance of the residuals is quite low for fitted values below 250k but increases substantially above 250k, indicating that the model does not perform very well on higher values of SalePrice. 
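
The visual impression from the residual plot can be backed by a number: the standard deviation of the residuals below and above the 250k mark. The sketch below uses simulated predictions and residuals (the arrays `pred_toy` and `resid_toy` are made up, with noise that grows with price by construction), so it only illustrates the calculation, not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predicted prices and residuals (assumption: noise grows with price).
pred_toy = rng.uniform(50_000, 500_000, size=1000)
resid_toy = rng.normal(0.0, 0.05 * pred_toy)

# Split at the 250k threshold mentioned in the text and compare the spread.
low_band = pred_toy <= 250_000
spread_low = resid_toy[low_band].std()
spread_high = resid_toy[~low_band].std()

print("Residual std for predictions <= 250k:", spread_low)
print("Residual std for predictions  > 250k:", spread_high)
```

With the real data, the same two lines can be applied to `predictions` and `residuals` from the cells above to quantify how much wider the errors are for expensive houses.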

### XGBoost 

In [None]:
import xgboost
xgb = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
                           colsample_bytree=1, max_depth=7)
xgb.fit(X_train,Y_train)

In [None]:
param_grid = {
    'max_depth': [8,10,5],
    'learning_rate': [0.05, 0.1, 0.3],
    'colsample_bytree': [0.5, 0.7],
    'lambda':[0.75,1],
  #  'max_leaves':[4,8,16],
    'min_child_weight':[2,4,6],
    'subsample': [0.5, 0.7, 0.8],
    'n_estimators': [150, 200,250,300]
}
xg = xgboost.XGBRegressor()
grid_search_xg = GridSearchCV(estimator = xg, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search_xg.fit(X_train,Y_train)
best_grid_xg=grid_search_xg.best_params_

In [None]:
best_grid_xg=grid_search_xg.best_params_
#xgboost.plot_importance(best_grid_xg,importance_type='gain')

In [None]:
from sklearn.metrics import explained_variance_score

best_grid_xg = grid_search_xg.best_estimator_

pred_gs_xg = best_grid_xg.predict(X_test)
#RMSLE on the price scale (expm1 inverts the log1p applied to SalePrice)
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_gs_xg) )))
#explained_variance_score expects (y_true, y_pred)
print(explained_variance_score(Y_Test,pred_gs_xg))

In [None]:
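import matplotlib.pyplot as plt

actuals=np.expm1(Y_Test)

#Residual plot for the tuned Random Forest model, shown for comparison
predictions=np.expm1(pred_gs_rf)
residuals=predictions-actuals

plt.scatter(predictions, residuals)
plt.title("Random Forest")
plt.xlabel("Predicted SalePrice")
plt.ylabel("Residuals")
plt.show()

#Residual plot for the tuned XGBoost model
predictions=np.expm1(pred_gs_xg)
residuals=predictions-actuals

plt.scatter(predictions, residuals)
plt.title("XGBoost")
plt.xlabel("Predicted SalePrice")
plt.ylabel("Residuals")
plt.show()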

## Linear Regression
After fitting XGBoost & Random Forest, both of which are tree-based models, we fit two regularized linear models: Lasso & Ridge Regression. Both take a regularization parameter, alpha, which controls how strongly large coefficients are penalized. As with the tree models above, we identify the best value of this parameter through Grid Search. We then look at residual plots to see whether these models behave differently from the tree-based models. 
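
For reference, the role of alpha can be written out explicitly. These are the objective functions as documented for scikit-learn's `Lasso` and `Ridge` estimators, where $X$ is the feature matrix, $y$ the log-transformed target, $w$ the coefficient vector and $n$ the number of training rows:

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2
$$

The $\ell_1$ penalty in Lasso can push coefficients exactly to zero, while the squared $\ell_2$ penalty in Ridge only shrinks them. Note that the squared-error term is scaled by $1/(2n)$ in one objective and not in the other, so the same numeric value of alpha does not necessarily imply the same amount of shrinkage in both models.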


[Here](https://www.google.com/search?q=analytics+vidhya+lasso+regression&rlz=1C1CHBF_enIN875IN875&oq=analytics+vidhya+lasso+regression&aqs=chrome..69i57j35i39j0l3j69i60l3.6383j0j7&sourceid=chrome&ie=UTF-8) is a good guide on Lasso & Ridge Regression and how they prevent overfitting. 


### Lasso Regression


In [None]:
#Lasso Regression - Select the best regularization parameter (alpha) with Grid Search
from sklearn.linear_model import Lasso,LassoCV
lasso=Lasso(max_iter=10000)
parameters={'alpha': [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5]}
lasso_regressor=GridSearchCV(lasso,parameters,scoring='r2',cv=5)
lasso_regressor.fit(X_train,Y_train)

#model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
best_grid_lasso=lasso_regressor.best_params_

print(lasso_regressor.best_params_)
print("R Squared value for the model",lasso_regressor.best_score_)

In [None]:
lasso_model=Lasso(alpha=0.0001,max_iter=10000).fit(X_train,Y_train)
pred_gs_lasso=lasso_model.predict(X_test)
print("RMSLE for the model",np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_gs_lasso) )))

In [None]:
coef = pd.Series(lasso_model.coef_, index = train_mod.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

In [None]:
import matplotlib
imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])

matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")

In [None]:
#Convert actuals and predictions back to the price scale (expm1 inverts log1p)
actuals=np.expm1(Y_Test)

predictions=np.expm1(pred_gs_lasso)
residuals=predictions-actuals

plt.scatter(predictions, residuals)
plt.xlabel("Predicted SalePrice")
plt.ylabel("Residuals")
plt.show()

plt.scatter(predictions, actuals)
plt.xlabel("Predicted SalePrice")
plt.ylabel("Actual SalePrice")
plt.show()

### Ridge Regression

In [None]:
#Ridge Regression - Select the best regularization parameter (alpha) with Grid Search
from sklearn.linear_model import Ridge,RidgeCV
ridge=Ridge(max_iter=10000)
parameters={'alpha': [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5]}
ridge_regressor=GridSearchCV(ridge,parameters,scoring='r2',cv=5)
ridge_regressor.fit(X_train,Y_train)

#model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
best_grid_ridge=ridge_regressor.best_params_

print(ridge_regressor.best_params_)
print("R Squared value for the model",ridge_regressor.best_score_)

ridge_model=Ridge(alpha=0.3,max_iter=10000).fit(X_train,Y_train)
pred_gs_ridge=ridge_model.predict(X_test)
#RMSLE for the Ridge predictions (expm1 inverts the log1p applied to SalePrice)
print("RMSLE for the model",np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_gs_ridge) )))


coef = pd.Series(ridge_model.coef_, index = train_mod.columns)
print(coef)

In [None]:
#Convert actuals and predictions back to the price scale (expm1 inverts log1p)
actuals=np.expm1(Y_Test)

predictions=np.expm1(pred_gs_ridge)
residuals=predictions-actuals

plt.scatter(predictions, residuals)
plt.xlabel("Predicted SalePrice")
plt.ylabel("Residuals")
plt.show()

Here are some observations on the Lasso & Ridge models above - 
* Both models are quite simple compared to ensemble models like Random Forest & XGBoost, where hundreds of trees are fitted to the training dataset. Even though these models are simpler, the **RMSLE for Lasso & Ridge Regression is comparable to that of the tree-based models**
* Lasso Regression prevents overfitting by setting the coefficients of irrelevant variables to zero; in this instance **Lasso picked 40 variables and eliminated the other 3 variables**
* Ridge Regression, on the other hand, does not set the coefficients of irrelevant variables exactly to zero, but shrinks them towards zero. Based on the grid search results, the selected value of **alpha was quite high for Ridge Regression compared to Lasso Regression**, so the Ridge model applies a stronger penalty and brings the coefficients of more variables close to zero
* Similar to the tree-based models, both of these models have **high residual values for higher values of SalePrice**
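
The difference between the two penalties noted above can be seen on a small synthetic problem. This is a sketch under stated assumptions: the data comes from `make_regression` with only 5 informative features out of 20, and `alpha=1.0` is an arbitrary illustrative value, not the value selected by the grid search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which carry signal (assumption).
X_toy, y_toy = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=5.0, random_state=0)

lasso_toy = Lasso(alpha=1.0, max_iter=10000).fit(X_toy, y_toy)
ridge_toy = Ridge(alpha=1.0).fit(X_toy, y_toy)

# Count the coefficients that are exactly zero in each model.
n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))

print("Coefficients set to zero by Lasso:", n_zero_lasso)
print("Coefficients set to zero by Ridge:", n_zero_ridge)
print("Smallest absolute Ridge coefficient:", np.abs(ridge_toy.coef_).min())
```

Lasso drops features outright, which makes it usable as a feature selector, whereas Ridge keeps every feature with a smaller weight.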


### Model Blending
Finally, we combine the results from different models and check whether the RMSLE of the combined predictions is better than that of the individual models. 
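
The blending idea can be sketched with NumPy alone. The helpers `rmsle` and `blend_log_predictions` below are hypothetical names introduced for illustration, and the prices and errors are made up. The sketch assumes, as in this notebook, that the models predict `log1p(SalePrice)`, so the simple average is taken in log space and converted back with `expm1`.

```python
import numpy as np

def rmsle(actual_price, predicted_price):
    """Root mean squared log error between two arrays of prices."""
    diff = np.log1p(predicted_price) - np.log1p(actual_price)
    return float(np.sqrt(np.mean(diff ** 2)))

def blend_log_predictions(*log_preds):
    """Simple average of several models' log-scale predictions."""
    return np.mean(np.vstack(log_preds), axis=0)

# Made-up actual prices and two sets of log-scale predictions with known errors.
actual = np.array([100_000.0, 200_000.0, 300_000.0])
y_log = np.log1p(actual)
pred_a = y_log + np.array([0.10, -0.20, 0.05])
pred_b = y_log + np.array([-0.10, 0.10, 0.05])

blended = blend_log_predictions(pred_a, pred_b)

score_a = rmsle(actual, np.expm1(pred_a))
score_b = rmsle(actual, np.expm1(pred_b))
score_blend = rmsle(actual, np.expm1(blended))
print(score_a, score_b, score_blend)
```

In this toy case the two models err in opposite directions on some rows, so the average cancels part of the error. Blending is not guaranteed to help when the models make similar mistakes, which is why the comparison on the hold-out split matters.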


In [None]:
pred_rf=best_grid_rf.predict(X_test)
pred_xg=best_grid_xg.predict(X_test)
pred_ls=lasso_model.predict(X_test)
pred_rg=ridge_model.predict(X_test)

pred_xg_ls_rg=(pred_xg+pred_ls+pred_rg)/3

#RMSLE on the price scale (expm1 inverts the log1p applied to SalePrice)
print('RMSLE for Random Forest')
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_rf) )))


print('RMSLE for XG Boost')
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_xg) )))


print('RMSLE for Lasso')
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_ls) )))

print('RMSLE for Ridge')
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_rg) )))

print('RMSLE for Xgboost & Lasso & Ridge Blended')
print(np.sqrt(mean_squared_log_error(np.expm1(Y_Test), np.expm1(pred_xg_ls_rg) )))

The output above suggests that XGBoost, Lasso & Ridge Regression perform much better than the Random Forest model, and that the combined results of XGBoost, Lasso & Ridge Regression perform even better than the individual models. For our final submission, we submit the blended XGBoost, Lasso & Ridge predictions. 

In [None]:
pred_test_rf=best_grid_rf.predict(test_imp)
pred_test_xg=best_grid_xg.predict(test_imp)
pred_test_ls=lasso_model.predict(test_imp)
pred_test_rg=ridge_model.predict(test_imp)

pred=(pred_test_ls+pred_test_xg+pred_test_rg)/3
#pred=pred_test_xg
print(len(pred))
#test.shape
#test_mod.shape
Id=pd.Series(range(1461,2920))
#pred
#Id
d = {'Id': Id, 'SalePrice': pred}
df = pd.DataFrame(data=d)
#Convert predictions back to the price scale (expm1 inverts the log1p applied to SalePrice)
df['SalePrice']=np.expm1(df['SalePrice'])
df

df.to_csv('submission.csv',index=False)

### Next Steps 

Dive deeper into some aspects of Feature Engineering and model performance and improve score on the public LB.