<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Ames Housing Data and Kaggle Challenge

---
## Problem Statement

Keller Williams Ames wants to gain a competitive advantage over other brokerages in the area. They want to predict the sale price of new homes on the market, to help both their seller agents and buyer agents provide unmatched service to their clients.

### Contents:
- [Model 1: Scaled Benchmarked Model](#Model-1:-Scaled-Benchmarked-Model)
- [Model 2: All Positively Correlated Data to Sale Price](#Model-2:-All-Positively-Correlated-Data-to-Sale-Price)
- [Model 3: Add Categorical Features](#Model-3:-Add-Categorical-Features)
- [Model 4: Remove Features](#Model-4:-Remove-Features)

---

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder
%matplotlib inline

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [2]:
ames = pd.read_csv('../datasets/train_clean.csv', dtype={'ms_subclass': str, 'pid': str})
ames.head()

Unnamed: 0,id,lot_frontage,lot_area,overall_qual,overall_cond,year_built,year_remod/add,mas_vnr_area,bsmtfin_sf_1,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,totrms_abvgrd,fireplaces,garage_yr_blt,garage_cars,garage_area,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,misc_val,mo_sold,yr_sold,saleprice,pid,ms_subclass,ms_zoning,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_type_2,heating,heating_qc,central_air,electrical,kitchen_qual,functional,fireplace_qu,garage_type,garage_finish,garage_qual,garage_cond,paved_drive,pool_qc,fence,misc_feature,sale_type
0,109,69.0552,13517.0,6.0,8.0,1976.0,2005.0,289.0,533.0,0.0,192.0,725.0,725.0,754.0,0.0,1479.0,0.0,0.0,2.0,1.0,3.0,1.0,6.0,0.0,1976.0,2.0,475.0,0.0,44.0,0.0,0.0,0.0,0.0,0.0,3.0,2010.0,130500.0,533352170,60,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,Gable,CompShg,HdBoard,Plywood,BrkFace,Gd,TA,CBlock,TA,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD
1,544,43.0,11492.0,7.0,5.0,1996.0,1997.0,132.0,637.0,0.0,276.0,913.0,913.0,1209.0,0.0,2122.0,1.0,0.0,2.0,1.0,4.0,1.0,8.0,1.0,1997.0,2.0,559.0,0.0,74.0,0.0,0.0,0.0,0.0,0.0,4.0,2009.0,220000.0,531379050,60,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD
2,153,68.0,7922.0,5.0,7.0,1953.0,2007.0,0.0,731.0,0.0,326.0,1057.0,1057.0,0.0,0.0,1057.0,1.0,0.0,1.0,0.0,3.0,1.0,5.0,0.0,1953.0,1.0,246.0,0.0,52.0,0.0,0.0,0.0,0.0,0.0,1.0,2010.0,109000.0,535304180,20,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,,TA,Gd,CBlock,TA,TA,No,GLQ,Unf,GasA,TA,Y,SBrkr,Gd,Typ,,Detchd,Unf,TA,TA,Y,,,,WD
3,318,73.0,9802.0,5.0,5.0,2006.0,2007.0,0.0,0.0,0.0,384.0,384.0,744.0,700.0,0.0,1444.0,0.0,0.0,2.0,1.0,3.0,1.0,7.0,0.0,2007.0,2.0,400.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2010.0,174000.0,916386060,60,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,,TA,TA,PConc,Gd,TA,No,Unf,Unf,GasA,Gd,Y,SBrkr,TA,Typ,,BuiltIn,Fin,TA,TA,Y,,,,WD
4,255,82.0,14235.0,6.0,8.0,1900.0,1993.0,0.0,0.0,0.0,676.0,676.0,831.0,614.0,0.0,1445.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,0.0,1957.0,2.0,484.0,0.0,59.0,0.0,0.0,0.0,0.0,0.0,3.0,2010.0,138500.0,906425045,50,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,Gable,CompShg,Wd Sdng,Plywood,,TA,TA,PConc,Fa,Gd,No,Unf,Unf,GasA,TA,Y,SBrkr,TA,Typ,,Detchd,Unf,TA,TA,N,,,,WD


## Model 1: Scaled Benchmarked Model

### Model Preparation

In [3]:
# Features
features = ['overall_qual', 'year_built', 'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1',
            'total_bsmt_sf', '1st_flr_sf', 'gr_liv_area', 'full_bath', 'totrms_abvgrd',
            'garage_yr_blt', 'garage_cars', 'garage_area', 'bedroom_abvgr', 'fireplaces']
X = ames[features]

# Target
y = ames['saleprice']

In [4]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

In [5]:
# instantiate Linear Regression model
model = LinearRegression()

### Preprocessing

In [6]:
# initialize standard scaler
ss = StandardScaler()

# Fit and transform train data
X_train_sc = ss.fit_transform(X_train)

# Transform test data
X_test_sc = ss.transform(X_test)

### Model Fitting and Evaluation

In [7]:
# cross validate scaled model (R-squared)
model_scores = cross_val_score(model, X_train_sc, y_train, cv = 5)
model_scores.mean()

0.7561820214222157
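
In the cell above the scaler was fit on the whole training set before cross-validation, so every validation fold had already contributed to the scaling. One way to keep the folds separate is to put the scaler and the model in a single pipeline, so the scaler is refit inside each fold. The sketch below uses synthetic data from `make_regression` as a stand-in for the Ames features; the names `pipe`, `X_demo` and `y_demo` are illustrative only. For ordinary least squares the scores should come out the same either way, since rescaling features does not change the fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the Ames set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Scaler and model are refit together on the training part of every fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

# Same model on data scaled once up front, as in the cell above
X_demo_sc = StandardScaler().fit_transform(X_demo)
plain_scores = cross_val_score(LinearRegression(), X_demo_sc, y_demo, cv=5)

print(pipe_scores.mean(), plain_scores.mean())
```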

In [8]:
# Fit model on scaled train data
model.fit(X_train_sc, y_train)

LinearRegression()

In [9]:
# model coefficients
pd.Series(model.coef_, index = features)

overall_qual      27932.535065
year_built         5830.971023
year_remod/add     7282.586992
mas_vnr_area       5725.586993
bsmtfin_sf_1       7328.775974
total_bsmt_sf      3320.748567
1st_flr_sf         3248.427982
gr_liv_area       15572.708811
full_bath          -864.927537
totrms_abvgrd      6454.119014
garage_yr_blt     -4482.308859
garage_cars        7235.522549
garage_area        6047.326694
bedroom_abvgr     -2975.634085
fireplaces         6410.760084
dtype: float64

Full baths, bedrooms, and the year the garage was built still appear to have a negative effect on sale price.

In [10]:
# standard deviation of each feature in the train data (as used by the scaler)
pd.Series(ss.scale_, index = features)

overall_qual        1.434843
year_built         30.108185
year_remod/add     20.956561
mas_vnr_area      175.892090
bsmtfin_sf_1      473.070415
total_bsmt_sf     458.680880
1st_flr_sf        405.430268
gr_liv_area       509.667481
full_bath           0.555886
totrms_abvgrd       1.569697
garage_yr_blt     449.288916
garage_cars         0.761413
garage_area       217.581331
bedroom_abvgr       0.819512
fireplaces          0.634872
dtype: float64
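
Because the features were standardized, each coefficient is the change in predicted price for a one-standard-deviation increase in that feature. Dividing a coefficient by the matching value from `ss.scale_` gives the change per original unit; for `overall_qual` that is 27932.5 / 1.4348, or roughly $19,500 per quality point. The sketch below checks this relationship on made-up data (the feature values and prices are assumptions, not Ames rows) by comparing the rescaled coefficients with a fit on the unscaled data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales (a 1-10 rating and square feet)
X_demo = np.column_stack([rng.integers(1, 11, 300), rng.normal(1500, 500, 300)])
y_demo = 20000 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 5000, 300)

scaler = StandardScaler()
X_demo_sc = scaler.fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_demo_sc, y_demo)

# Effect of one original unit = effect of one standard deviation / standard deviation
per_unit = scaled_model.coef_ / scaler.scale_

# Fitting on the unscaled data gives the same per-unit coefficients
raw_model = LinearRegression().fit(X_demo, y_demo)
print(per_unit)
print(raw_model.coef_)
```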

In [11]:
# score train data (R-Squared)
model.score(X_train_sc, y_train)

0.791288070190293

In [12]:
# score test data (R-Squared)
model.score(X_test_sc, y_test)

0.8570186287354395

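Same R-squared as the benchmark model. This is expected: standardizing the features does not change the fit of an ordinary least squares regression.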

In [47]:
# MAE, MSE and RMSE for Train data
y_pred = model.predict(X_train_sc)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  22759.015179009173
MSE:  1337090347.1448555
RMSE:  36566.24600837302


In [13]:
# MAE, MSE and RMSE for Test data
y_pred = model.predict(X_test_sc)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  21484.83223565343
MSE:  854660706.9730763
RMSE:  29234.58067038206


Same results as benchmark model
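
The MAE, MSE and RMSE cells are repeated for every model in this notebook. A small helper could compute all three in one call. The function name `report_errors` and the toy prices below are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_errors(y_true, y_pred, label):
    """Print and return MAE, MSE and RMSE for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same units as the target (dollars)
    print(f"{label} Evaluation")
    print("MAE: ", mae)
    print("MSE: ", mse)
    print("RMSE: ", rmse)
    return {"mae": mae, "mse": mse, "rmse": rmse}

# Toy prices, not taken from the Ames data
errors = report_errors([100000, 200000, 300000], [110000, 190000, 330000], "Toy")
```

With a fitted model this would be called as `report_errors(y_test, model.predict(X_test_sc), "Test")`.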

## Model 2: All Positively Correlated Data to Sale Price

### Model Preparation

In [15]:
# Features from previous model plus all other positively correlated columns to sale price.
features = ['overall_qual', 'year_built', 'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1',
            'total_bsmt_sf', '1st_flr_sf', 'gr_liv_area', 'full_bath', 'totrms_abvgrd',
            'garage_yr_blt', 'garage_cars', 'garage_area', 'bedroom_abvgr', 'fireplaces','lot_frontage',
            'lot_area', 'bsmtfin_sf_2', 'bsmt_unf_sf', '2nd_flr_sf', 'bsmt_full_bath', 'half_bath', 'wood_deck_sf',
            'open_porch_sf', '3ssn_porch', 'screen_porch', 'pool_area', 'mo_sold']
X = ames[features]

# Target
y = ames['saleprice']

In [16]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

In [17]:
# instantiate Linear Regression model
model = LinearRegression()

### Preprocessing

In [19]:
# initialize standard scaler
ss = StandardScaler()

# Fit and transform train data
X_train_sc = ss.fit_transform(X_train)

# Transform test data
X_test_sc = ss.transform(X_test)

### Model Fitting and Evaluation

In [21]:
# Cross validate model (R-squared)
model_scores = cross_val_score(model, X_train_sc, y_train, cv=5)
model_scores.mean()

0.7551496148975383

In [22]:
# Fit model on training data
model.fit(X_train_sc, y_train)

LinearRegression()

In [23]:
# model coefficients
pd.Series(model.coef_, index = features)

overall_qual      28449.667717
year_built         7508.896111
year_remod/add     7024.058828
mas_vnr_area       5652.661029
bsmtfin_sf_1       3668.170379
total_bsmt_sf      3784.384854
1st_flr_sf         5587.343069
gr_liv_area        8661.303343
full_bath          -244.709952
totrms_abvgrd      7346.154043
garage_yr_blt     -4040.248603
garage_cars        5781.192385
garage_area        5193.214429
bedroom_abvgr     -3553.080404
fireplaces         5518.495231
lot_frontage        370.254572
lot_area           5775.007450
bsmtfin_sf_2       1153.068485
bsmt_unf_sf        -447.300067
2nd_flr_sf         5748.356030
bsmt_full_bath     4627.127805
half_bath         -1949.935074
wood_deck_sf       2240.823247
open_porch_sf      -317.616282
3ssn_porch         1064.928312
screen_porch       5945.450355
pool_area         -5764.348645
mo_sold             629.470965
dtype: float64

Half baths and pool area now also appear to have a negative effect on sale price, along with smaller negative coefficients for unfinished basement area and open porch area.

In [25]:
# model intercept
model.intercept_

180717.9693379791

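Because the features are standardized, setting every feature to 0 corresponds to a home with average values for all features, so the predicted price of an average home is about $180,718.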
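
Since the model was fit on standardized features, a value of 0 is each feature's training mean, which makes the intercept equal to the mean sale price in the training data. A minimal check of that property, using made-up features and prices rather than the Ames data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Made-up features and prices, standing in for the training data
X_demo = rng.normal(loc=[1500, 6], scale=[400, 1.5], size=(250, 2))
y_demo = 100 * X_demo[:, 0] + 15000 * X_demo[:, 1] + rng.normal(0, 10000, 250)

X_demo_sc = StandardScaler().fit_transform(X_demo)
demo_model = LinearRegression().fit(X_demo_sc, y_demo)

# Every scaled column has mean 0, so the intercept is the mean of the target
print(demo_model.intercept_, y_demo.mean())
```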

In [27]:
# score model on training data (R-squared)
model.score(X_train_sc, y_train)

0.8077915384043867

In [28]:
# Score model on test data (R-squared)
model.score(X_test_sc, y_test)

0.8508710690004339

The model scored higher on the test data than on the train data (0.851 vs 0.808), so there is no sign of overfitting. The new model with all positively correlated features did better than the benchmark on the train data (0.808 vs 0.791), but slightly worse than the benchmark on the test data (0.851 vs 0.857).

In [30]:
# MAE, MSE and RMSE for Train data
y_pred = model.predict(X_train_sc)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  22009.44838292095
MSE:  1231362667.5455346
RMSE:  35090.77752836968


In [32]:
# MAE, MSE and RMSE for Test data
y_pred = model.predict(X_test_sc)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  21589.027304627678
MSE:  891407296.4260302
RMSE:  29856.44480553621


Errors are lower on the test data than on the train data (RMSE of about 29,856 vs 35,091), so there is no sign of overfitting. Compared with the benchmark, the new model has a lower train RMSE (35,091 vs 36,566) but a slightly higher test RMSE (29,856 vs 29,235).

## Model 3: Add Categorical Features

### Model Preparation & Preprocessing

In [33]:
# Features from previous model plus all categorical features
features = ['overall_qual', 'year_built', 'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1',
            'total_bsmt_sf', '1st_flr_sf', 'gr_liv_area', 'full_bath', 'totrms_abvgrd',
            'garage_yr_blt', 'garage_cars', 'garage_area', 'bedroom_abvgr', 'fireplaces','lot_frontage',
            'lot_area', 'bsmtfin_sf_2', 'bsmt_unf_sf', '2nd_flr_sf', 'bsmt_full_bath', 'half_bath', 'wood_deck_sf',
            'open_porch_sf', '3ssn_porch', 'screen_porch', 'pool_area', 'mo_sold', 'ms_subclass', 'ms_zoning', 'utilities', 'lot_config',
            'neighborhood', 'bldg_type', 'house_style', 'roof_style', 'roof_matl', 'exter_qual', 'exter_cond', 'foundation',
            'bsmt_qual', 'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1', 'heating', 'heating_qc', 'central_air', 'electrical',
            'kitchen_qual', 'fireplace_qu', 'garage_type', 'garage_finish', 'garage_qual', 'garage_cond', 'pool_qc']
X = ames[features]

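# One Hot Encode Categorical Features
# Note: the encoder is applied to every column in X, numeric ones included
# (in scikit-learn 1.2+ the `sparse` argument is named `sparse_output`)
enc = OneHotEncoder(drop='first', sparse=False)
X = enc.fit_transform(X)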

# Target
y = ames['saleprice']
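
`OneHotEncoder` treats every column it receives as categorical, so calling it on all of `X` also turns each distinct numeric value into its own dummy column. One alternative, sketched here on a toy frame with made-up values, is a `ColumnTransformer` that encodes only the listed categorical columns and passes the numeric ones through. This is not what the notebook ran; the scores reported for this model come from the cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with made-up values; column names borrowed from the Ames data
toy = pd.DataFrame({
    "gr_liv_area": [1479, 2122, 1057, 1444],
    "overall_qual": [6, 7, 5, 5],
    "neighborhood": ["Sawyer", "SawyerW", "NAmes", "Timber"],
    "central_air": ["Y", "Y", "N", "Y"],
})
categorical = ["neighborhood", "central_air"]

# Encode only the categorical columns; numeric columns pass through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), categorical)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_toy = ct.fit_transform(toy)
print(X_toy.shape)  # 3 neighborhood dummies + 1 central_air dummy + 2 numeric
```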

In [34]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

In [35]:
# instantiate Linear Regression model
model = LinearRegression()

In [36]:
# initialize standard scaler
ss = StandardScaler()

# Fit and transform train data
X_train_sc = ss.fit_transform(X_train)

# Transform test data
X_test_sc = ss.transform(X_test)

### Model Fitting and Evaluation

In [37]:
# Cross validate model (R-squared)
model_scores = cross_val_score(model, X_train_sc, y_train, cv = 5)
model_scores.mean()

0.6314266820607954

Does worse than models 1 and 2.

In [38]:
# fit model on scaled data
model.fit(X_train_sc, y_train)

LinearRegression()

In [39]:
# Score model on train data (R-squared)
model.score(X_train_sc, y_train)

1.0

In [40]:
# Score model on test data (R-squared)
model.score(X_test_sc, y_test)

0.7366648102320235

Overfitting is occurring: the train R-squared is 1.0 while the test R-squared drops to 0.737, noticeably worse than models 1 and 2 (0.857 and 0.851).
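
A train R-squared of exactly 1.0 is what one-hot encoding continuous columns tends to produce: nearly every row gets its own dummy column, so the regression can memorize each training price. The sketch below reproduces the effect on synthetic data (one made-up feature, not the Ames set); the names `area`, `price` and `memorizer` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
# One continuous made-up feature (think square feet) with all-distinct values
area = rng.normal(1500, 400, size=(200, 1))
price = 100 * area[:, 0] + rng.normal(0, 20000, 200)

# One-hot encoding a continuous column gives roughly one dummy per row
encoded = OneHotEncoder(drop="first").fit_transform(area).toarray()

X_tr, X_te, y_tr, y_te = train_test_split(encoded, price, train_size=0.70, random_state=42)
memorizer = LinearRegression().fit(X_tr, y_tr)

print(encoded.shape)                # 199 dummy columns for 200 rows
print(memorizer.score(X_tr, y_tr))  # essentially 1.0 on the training rows
print(memorizer.score(X_te, y_te))  # much lower on unseen rows
```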

In [41]:
# MAE, MSE and RMSE for Train data
y_pred = model.predict(X_train_sc)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  2.9934354399035616e-10
MSE:  2.369937684064157e-19
RMSE:  4.86820057522711e-10


In [42]:
# MAE, MSE and RMSE for Test data
y_pred = model.predict(X_test_sc)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  25397.919115760804
MSE:  1574066869.46338
RMSE:  39674.5115844339


Model is clearly overfitting and performs worse than the previous two models on the test data (RMSE of about 39,675 vs 29,235 and 29,856).

## Model 4: Remove Features

### Model Preparation & Preprocessing

In [43]:
# Features from previous model minus features deemed less important to sale price
# Decided based on previous experience as a realtor
features = ['overall_qual', 'year_built', 'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1',
            'total_bsmt_sf', '1st_flr_sf', 'gr_liv_area', 'full_bath', 'totrms_abvgrd', 'garage_cars', 'garage_area', 'bedroom_abvgr', 
            'fireplaces','lot_frontage', 'lot_area', 'bsmtfin_sf_2', 'bsmt_unf_sf', '2nd_flr_sf', 'bsmt_full_bath', 'half_bath', 
            'wood_deck_sf', 'open_porch_sf', 'utilities', 'lot_config', 'neighborhood', 'bldg_type', 'exter_qual', 'exter_cond', 'foundation',
            'bsmt_qual', 'bsmt_cond', 'bsmtfin_type_1', 'heating', 'heating_qc', 'central_air', 'electrical',
            'kitchen_qual', 'fireplace_qu', 'garage_type', 'garage_finish', 'garage_qual', 'garage_cond']
X = ames[features]

# One Hot Encode Categorical Features
# Note: as in model 3, the encoder is applied to every column in X, numeric ones included
enc = OneHotEncoder(drop='first', sparse=False)
X = enc.fit_transform(X)

# Target
y = ames['saleprice']

In [44]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

In [45]:
# instantiate Linear Regression model
model = LinearRegression()

In [46]:
# initialize standard scaler
ss = StandardScaler()

# Fit and transform train data
X_train_sc = ss.fit_transform(X_train)

# Transform test data
X_test_sc = ss.transform(X_test)

### Model Fitting and Evaluation

In [47]:
# Cross validate model (R-squared)
model_scores = cross_val_score(model, X_train_sc, y_train, cv = 5)
model_scores.mean()

0.6192485206709067

In [48]:
# fit model on scaled data
model.fit(X_train_sc, y_train)

LinearRegression()

In [49]:
# Score model on train data (R-squared)
model.score(X_train_sc, y_train)

1.0

In [50]:
# Score model on test data (R-squared)
model.score(X_test_sc, y_test)

0.705546720703113

Overfitting is occurring. Model is significantly worse than the benchmark and the other models.

In [51]:
# MAE, MSE and RMSE for Train data
y_pred = model.predict(X_train_sc)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  3.0753723563113694e-10
MSE:  1.9185960305529876e-19
RMSE:  4.3801781134481134e-10


In [52]:
# MAE, MSE and RMSE for Test data
y_pred = model.predict(X_test_sc)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  26654.33227938566
MSE:  1760072977.5403566
RMSE:  41953.22368472245


Model is still overfitting. Removing features added bias but did not help: the model does slightly worse than model 3 on the test data (RMSE of about 41,953 vs 39,675).