<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Ames Housing Data and Kaggle Challenge

---
## Problem Statement

Keller Williams Ames wants to gain a competitive advantage over other brokerages in the area. They want to predict the sale price of new homes on the market so that both their seller agents and buyer agents can provide unmatched service to their clients.

### Contents:
- [Model Preparation & Preprocessing](#Model-Preparation-&-Preprocessing)
- [Ridge Regression Model](#Ridge-Regression-Model)
- [Lasso Regression Model](#Lasso-Regression-Model)
- [Elastic Net Model](#Elastic-Net-Model)

---

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [2]:
ames = pd.read_csv('../datasets/train_clean.csv', dtype={'ms_subclass': str, 'pid': str})
ames.head()

Unnamed: 0,id,lot_frontage,lot_area,overall_qual,overall_cond,year_built,year_remod/add,mas_vnr_area,bsmtfin_sf_1,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,totrms_abvgrd,fireplaces,garage_yr_blt,garage_cars,garage_area,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,misc_val,mo_sold,yr_sold,saleprice,pid,ms_subclass,ms_zoning,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_type_2,heating,heating_qc,central_air,electrical,kitchen_qual,functional,fireplace_qu,garage_type,garage_finish,garage_qual,garage_cond,paved_drive,pool_qc,fence,misc_feature,sale_type
0,109,69.0552,13517.0,6.0,8.0,1976.0,2005.0,289.0,533.0,0.0,192.0,725.0,725.0,754.0,0.0,1479.0,0.0,0.0,2.0,1.0,3.0,1.0,6.0,0.0,1976.0,2.0,475.0,0.0,44.0,0.0,0.0,0.0,0.0,0.0,3.0,2010.0,130500.0,533352170,60,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,Gable,CompShg,HdBoard,Plywood,BrkFace,Gd,TA,CBlock,TA,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD
1,544,43.0,11492.0,7.0,5.0,1996.0,1997.0,132.0,637.0,0.0,276.0,913.0,913.0,1209.0,0.0,2122.0,1.0,0.0,2.0,1.0,4.0,1.0,8.0,1.0,1997.0,2.0,559.0,0.0,74.0,0.0,0.0,0.0,0.0,0.0,4.0,2009.0,220000.0,531379050,60,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD
2,153,68.0,7922.0,5.0,7.0,1953.0,2007.0,0.0,731.0,0.0,326.0,1057.0,1057.0,0.0,0.0,1057.0,1.0,0.0,1.0,0.0,3.0,1.0,5.0,0.0,1953.0,1.0,246.0,0.0,52.0,0.0,0.0,0.0,0.0,0.0,1.0,2010.0,109000.0,535304180,20,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,,TA,Gd,CBlock,TA,TA,No,GLQ,Unf,GasA,TA,Y,SBrkr,Gd,Typ,,Detchd,Unf,TA,TA,Y,,,,WD
3,318,73.0,9802.0,5.0,5.0,2006.0,2007.0,0.0,0.0,0.0,384.0,384.0,744.0,700.0,0.0,1444.0,0.0,0.0,2.0,1.0,3.0,1.0,7.0,0.0,2007.0,2.0,400.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2010.0,174000.0,916386060,60,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,,TA,TA,PConc,Gd,TA,No,Unf,Unf,GasA,Gd,Y,SBrkr,TA,Typ,,BuiltIn,Fin,TA,TA,Y,,,,WD
4,255,82.0,14235.0,6.0,8.0,1900.0,1993.0,0.0,0.0,0.0,676.0,676.0,831.0,614.0,0.0,1445.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,0.0,1957.0,2.0,484.0,0.0,59.0,0.0,0.0,0.0,0.0,0.0,3.0,2010.0,138500.0,906425045,50,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,Gable,CompShg,Wd Sdng,Plywood,,TA,TA,PConc,Fa,Gd,No,Unf,Unf,GasA,TA,Y,SBrkr,TA,Typ,,Detchd,Unf,TA,TA,N,,,,WD


## Model Preparation & Preprocessing

This notebook uses Model 3 from the previous notebook, which was the better-performing model and includes both numerical and categorical features.

In [3]:
# Features
features = ['overall_qual', 'year_built', 'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1',
            'total_bsmt_sf', '1st_flr_sf', 'gr_liv_area', 'full_bath', 'totrms_abvgrd',
            'garage_yr_blt', 'garage_cars', 'garage_area', 'bedroom_abvgr', 'fireplaces','lot_frontage',
            'lot_area', 'bsmtfin_sf_2', 'bsmt_unf_sf', '2nd_flr_sf', 'bsmt_full_bath', 'half_bath', 'wood_deck_sf',
            'open_porch_sf', '3ssn_porch', 'screen_porch', 'pool_area', 'mo_sold', 'ms_subclass', 'ms_zoning', 'utilities', 'lot_config',
            'neighborhood', 'bldg_type', 'house_style', 'roof_style', 'roof_matl', 'exter_qual', 'exter_cond', 'foundation',
            'bsmt_qual', 'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1', 'heating', 'heating_qc', 'central_air', 'electrical',
            'kitchen_qual', 'fireplace_qu', 'garage_type', 'garage_finish', 'garage_qual', 'garage_cond', 'pool_qc']
X = ames[features]

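# One-hot encode the feature matrix
# NOTE: the encoder is applied to every column in X, so the numeric columns
# are also treated as categories; the results below reflect that choice.
# (scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`)
enc = OneHotEncoder(drop='first', sparse=False)
X = enc.fit_transform(X)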

# Target
y = ames['saleprice']
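
The encoder in the cell above is fit on every column in `X`, numeric ones included. A minimal sketch of one alternative, assuming the intent is to one-hot encode only the categorical columns, uses `ColumnTransformer` so the numeric columns pass through unchanged. The toy frame, its values, and the names `toy`, `cat_cols`, `ct` and `X_toy` are hypothetical and only illustrate the shape of the result; this is not the preprocessing used for the results in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Ames frame (illustrative values only)
toy = pd.DataFrame({
    "gr_liv_area": [1479.0, 2122.0, 1057.0, 1444.0],
    "neighborhood": ["Sawyer", "SawyerW", "NAmes", "Sawyer"],
})

# Assumption: the categorical columns are listed explicitly
cat_cols = ["neighborhood"]

# Encode only the categorical columns; the remaining (numeric) columns
# pass through unchanged. sparse_threshold=0 requests a dense array.
ct = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), cat_cols)],
    remainder="passthrough",
    sparse_threshold=0,
)
X_toy = ct.fit_transform(toy)

# 3 neighborhoods minus the dropped first level = 2 dummy columns,
# plus 1 numeric column passed through
print(X_toy.shape)
```

With this layout the dummy columns come first and the passed-through numeric columns come last, so `gr_liv_area` keeps its original values in the final column.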

In [4]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=42)

In [5]:
# initialize standard scaler
ss = StandardScaler()

# Fit and transform train data
Z_train = ss.fit_transform(X_train)

# Transform test data
Z_test = ss.transform(X_test)

## Ridge Regression Model

In [6]:
# alpha values
r_alphas = np.logspace(0, 5, 100)

# cross validate list of ridge alphas
ridge_cv = RidgeCV(r_alphas, scoring='r2', cv = 5)

# fit model using best ridge alpha
ridge_cv.fit(Z_train, y_train);

In [7]:
# optimal value for alpha
ridge_cv.alpha_

23.101297000831593

In [8]:
print(f"Ridge CV Training R-Squared: {ridge_cv.score(Z_train, y_train)}")
print(f"Ridge CV Testing R-Squared: {ridge_cv.score(Z_test, y_test)}")

Ridge CV Training R-Squared: 0.9999926572575689
Ridge CV Testing R-Squared: 0.7381710570446709


In [9]:
# MAE, MSE and RMSE for Train data
y_pred = ridge_cv.predict(Z_train)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  96.24196224853495
MSE:  47040.483191756444
RMSE:  216.88818130953203


In [10]:
# MAE, MSE and RMSE for Test data
y_pred = ridge_cv.predict(Z_test)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  25163.691793489455
MSE:  1565063389.1191387
RMSE:  39560.88205688971


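The ridge model does slightly better than the original Model 3 on test data. However, the gap between the training R-squared (about 0.99999) and the testing R-squared (about 0.738) shows that the model is heavily overfit to the training data.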
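
The MAE, MSE and RMSE cells in this notebook repeat the same three metric calls for each model and each split. A minimal sketch of a helper that wraps them is shown below; the name `regression_report` and the demo values are hypothetical and are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_report(y_true, y_pred, label):
    """Print and return MAE, MSE and RMSE for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
    print(f"{label} Evaluation")
    print("MAE: ", mae)
    print("MSE: ", mse)
    print("RMSE: ", rmse)
    return {"mae": mae, "mse": mse, "rmse": rmse}

# Demo on made-up numbers: the absolute errors are 10, 10 and 30
demo = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 330.0], "Demo")
```

In the notebook this could be called as, for example, `regression_report(y_test, ridge_cv.predict(Z_test), "Test")`.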

## Lasso Regression Model

In [11]:
# lasso alpha values
l_alphas = np.logspace(-3, 1, 100)

# cross-validate over list of lasso alphas
lasso_cv = LassoCV(alphas=l_alphas, cv = 5, max_iter=50000, n_jobs=-1)

# fit model using best lasso alpha
lasso_cv.fit(Z_train, y_train);

In [12]:
lasso_cv.alpha_

10.0

In [13]:
print(f"LASSO Training R-Squared: {lasso_cv.score(Z_train, y_train)} ")
print(f"LASSO Testing R-Squared: {lasso_cv.score(Z_test, y_test)} ")

LASSO Training R-Squared: 0.9999765836660316 
LASSO Testing R-Squared: 0.8289021934733836 


In [54]:
# MAE, MSE and RMSE for Train data
y_pred = lasso_cv.predict(Z_train)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  291.3211571507726
MSE:  150014.2044745065
RMSE:  387.3166720843637


In [14]:
# MAE, MSE and RMSE for Test data
y_pred = lasso_cv.predict(Z_test)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  21573.637834455578
MSE:  1022724645.8352126
RMSE:  31980.066382595465


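The lasso model performed better than the ridge model and the original Model 3: testing R-squared rose from about 0.738 to about 0.829, and test RMSE fell from about 39,561 to about 31,980. The selected alpha of 10.0 is the largest value in `l_alphas`, so a wider grid might find a better value, and the near-perfect training R-squared indicates the model is still overfit.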
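
One way to check whether a cross-validated alpha was limited by its search grid is to compare it against the grid's endpoints, and to count how many coefficients the lasso kept. The sketch below does this on synthetic data from `make_regression`, which stands in for `Z_train` and `y_train`; the names `X_demo`, `y_demo`, `grid`, `lasso_demo`, `at_edge` and `n_kept` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=42
)

# Same grid bounds as l_alphas, with fewer points to keep the demo fast
grid = np.logspace(-3, 1, 20)
lasso_demo = LassoCV(alphas=grid, cv=5, max_iter=50000).fit(X_demo, y_demo)

# True when the chosen alpha sits on either end of the grid,
# which suggests the grid should be extended in that direction
at_edge = bool(
    np.isclose(lasso_demo.alpha_, grid.min())
    or np.isclose(lasso_demo.alpha_, grid.max())
)

# Number of features with a non-zero coefficient
n_kept = int(np.sum(lasso_demo.coef_ != 0))

print("alpha:", lasso_demo.alpha_, "| at grid edge:", at_edge)
print("features kept:", n_kept, "of", X_demo.shape[1])
```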

## Elastic Net Model

In [16]:
# elastic net alphas and l1 ratio (commented out; not used below)
# e_alphas = np.linspace(0.01, 1, 100)
# enet_ratio = 0.05

# instantiate model with default settings
enet_model = ElasticNetCV(n_jobs=-1)

# fit model using optimal alpha
enet_model.fit(Z_train, y_train)

# evaluate model
print(f"Elastic Net Training R-squared: {enet_model.score(Z_train, y_train)}")
print(f"Elastic Net Testing R-squared: {enet_model.score(Z_test, y_test)}")

Elastic Net Training R-squared: 0.4184460825444334
Elastic Net Testing R-squared: 0.2861325646541183


In [17]:
enet_model.alpha_

99.11510063117203

In [18]:
# MAE, MSE and RMSE for Train data
y_pred = enet_model.predict(Z_train)

print("Train Evaluation")
print("MAE: ", mean_absolute_error(y_train, y_pred))
print("MSE: ", mean_squared_error(y_train, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_train, y_pred)))

Train Evaluation
MAE:  43387.59376871251
MSE:  3725662112.7650995
RMSE:  61038.20207677401


In [19]:
# MAE, MSE and RMSE for Test data
y_pred = enet_model.predict(Z_test)

print("Test Evaluation")
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

Test Evaluation
MAE:  47137.32470908066
MSE:  4267090471.868987
RMSE:  65322.970476464


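The elastic net model performed worse than the ridge model, the lasso model and the benchmark model, with a testing R-squared of about 0.286. It was fit with the default `ElasticNetCV` settings (an `l1_ratio` of 0.5 and an automatically generated alpha grid), which selected an alpha of about 99.1.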
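
The elastic net above was fit with default settings only. A minimal sketch of a wider search, assuming it is worth tuning the L1/L2 mix as well as alpha, passes a list of candidate values to the `l1_ratio` argument of `ElasticNetCV`. The synthetic data and the names `X_demo`, `y_demo`, `ratios` and `enet_demo` are hypothetical, and the candidate ratios are arbitrary illustrative choices, not values tested in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=42
)

# Candidate L1/L2 mixes; 1.0 is equivalent to a pure lasso penalty
ratios = [0.1, 0.5, 0.9, 1.0]

# Cross-validate over every (l1_ratio, alpha) combination
enet_demo = ElasticNetCV(l1_ratio=ratios, cv=5, max_iter=50000)
enet_demo.fit(X_demo, y_demo)

print("best l1_ratio:", enet_demo.l1_ratio_)
print("best alpha:", enet_demo.alpha_)
```

If the search settles on an `l1_ratio` close to 1.0, the elastic net is behaving much like the lasso model.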