# Modeling

This section builds a model to predict housing prices in Ames, IA. We start by establishing a baseline score using all the variables, then take steps to improve the model, using the R2 score as the metric. If a model scores poorly on both the training and testing data, it is underfit: high bias, low variance, and not complex enough. If the training score is noticeably higher than the testing score, the model is overfit: high variance and low bias. This means it is fitting to the training data and may not generalize well to unseen data. The goal is to get the model's training and testing scores as high and as close together as possible.
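
The comparison of training and testing scores can be sketched as a small helper. This is only an illustration: the function name `diagnose_fit` and the thresholds `gap_tol` and `low_score` are assumptions chosen for the example, not standard values.

```python
def diagnose_fit(train_r2, test_r2, gap_tol=0.05, low_score=0.5):
    """Label a model from its train/test R2 scores.

    gap_tol and low_score are illustrative thresholds, not standard values.
    """
    # A training score well above the testing score points to overfitting.
    if train_r2 - test_r2 > gap_tol:
        return "overfit (high variance)"
    # Low scores on both sets point to underfitting.
    if train_r2 < low_score and test_r2 < low_score:
        return "underfit (high bias)"
    return "balanced"


print(diagnose_fit(0.744, 0.483))  # overfit (high variance)
print(diagnose_fit(0.907, 0.905))  # balanced
```

The two calls use rounded train/test scores that appear in this section (Model 1 and the final model).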

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score

pd.options.display.max_columns = 100
pd.options.display.max_rows = 3000

In [2]:
%store -r ames
%store -r features
%store -r cat

In [3]:
ames.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,68.0,13517,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,138500


In [4]:
ames.shape

(2051, 81)

## One Hot-Encoding

In [5]:
# finds the first element in each category when sorted, because these will be dropped when dummied
comp = []
for x in cat: # cat is a list of categorical columns 
    comp.append(sorted(ames[x].unique())[0])

In [6]:
comp

[20,
 'A (agr)',
 'Grvl',
 'Grvl',
 'IR1',
 'Bnk',
 'AllPub',
 'Corner',
 'Gtl',
 'Blmngtn',
 'Artery',
 'Artery',
 '1Fam',
 '1.5Fin',
 'Flat',
 'ClyTile',
 'AsbShng',
 'AsbShng',
 'BrkCmn',
 'Ex',
 'Ex',
 'BrkTil',
 'Ex',
 'Ex',
 'Av',
 'ALQ',
 'ALQ',
 'GasA',
 'Ex',
 'N',
 'FuseA',
 'Ex',
 'Maj1',
 'Ex',
 '2Types',
 'Fin',
 'Ex',
 'Ex',
 'N',
 'Ex',
 'GdPrv',
 'Elev',
 'COD']

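Above are the reference categories, one per categorical column, that will be dropped when the data is one-hot encoded with drop_first = True. When making inferences from the linear model's coefficients, each dummy coefficient is interpreted relative to its dropped reference category, with all other variables held constant.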
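
A minimal sketch of the dropped reference category, using a toy one-column frame rather than the real `ames` data (the values here are illustrative):

```python
import pandas as pd

# Toy frame standing in for one categorical column of `ames` (values are illustrative).
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# The reference level is the first category in sorted order.
reference = sorted(toy["Street"].unique())[0]

# drop_first=True removes the dummy column for that reference level.
dummies = pd.get_dummies(toy, columns=["Street"], drop_first=True)

print(reference)              # Grvl
print(list(dummies.columns))  # ['Street_Pave']
```

The remaining `Street_Pave` column then measures the difference between a paved street and the gravel reference level.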

In [7]:
features

['Lot Frontage',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'Bsmt Unf SF',
 'BsmtFin SF 2',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'Total Bsmt SF']

In [8]:
cat

['MS SubClass',
 'MS Zoning',
 'Street',
 'Alley',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Fireplace Qu',
 'Garage Type',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Paved Drive',
 'Pool QC',
 'Fence',
 'Misc Feature',
 'Sale Type']

In [9]:
cate = pd.get_dummies(ames, columns = cat, drop_first = True)

In [10]:
cate

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,0,0,0,0,0,0,0,0,1
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,0,0,0,1
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,0,0,0,0,0,0,0,0,1


In [11]:
cate.shape

(2051, 279)

In [16]:
cate['SalePrice']

0       130500
1       220000
2       109000
3       174000
4       138500
5       190000
6       140000
7       142000
8       112500
9       135000
10       85400
11      183600
12      131000
13      200000
14      193000
15      173500
16       98000
17      139000
18      143500
19      215200
20      129000
21      278000
22      344133
23      185000
24      145000
25      187500
26      138500
27      198000
28      119600
29      122900
30      278000
31      230000
32      270000
33      125000
34      297000
35      113500
36      127000
37      190000
38      175500
39      146000
40      147500
41      465000
42      165500
43      131500
44      129500
45      257076
46      117000
47      149000
48      128000
49      155000
50      166000
51      135000
52      250000
53       76000
54      155000
55      158000
56      149500
57      121000
58      136000
59      173000
60      290000
61      303477
62      270000
63      122250
64      153000
65      147000
66      14

In [104]:
df = pd.merge(left = ames[features], right = cate, left_index = True, right_index = True )

## Baseline Model Score

In [17]:
X = cate.drop(columns ='SalePrice')
y = cate['SalePrice']

In [18]:
lr = LinearRegression()

In [19]:
cross_val_score(lr,X, y).mean() # this is the baseline Score

0.8037853406030404

We need to improve on this baseline: later models should score higher than 0.804.
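
For reference, `cross_val_score` with a regressor scores each fold with the estimator's default metric (R2 for `LinearRegression`) and, in recent scikit-learn versions, uses 5 folds when `cv` is not given. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration) that spells both defaults out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, not the Ames data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One R2 score per fold; the mean is the single number used as a baseline.
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(scores.shape, scores.mean())
```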

### Model 1

In [28]:
X = cate[['Mas Vnr Area','Total Bsmt SF','1st Flr SF','Gr Liv Area',
         'Full Bath','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Garage Area','Open Porch SF','Wood Deck SF','Lot Area']]
y = cate['SalePrice']

Train Test Split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)

Scaling

In [30]:
ss = StandardScaler()

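Xs_train = ss.fit_transform(X_train) # fit the scaler on the training data only; fitting on test data would leak information
Xs_test = ss.transform(X_test) # transform the test data with the training mean and std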

Methods

fit(X[, y]) - Compute the mean and std to be used for later scaling.
fit_transform(X[, y]) - Fit to data, then transform it.
get_params([deep]) - Get parameters for this estimator.
inverse_transform(X[, copy]) - Scale back the data to the original representation.
partial_fit(X[, y]) - Online computation of mean and std on X for later scaling.
set_params(**params) - Set the parameters of this estimator.
transform(X[, copy]) - Perform standardization by centering and scaling.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html 
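
A short sketch of how these methods fit together, on tiny made-up arrays (`train` and `test` are illustrative, not the Ames split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays: three training rows, one test row.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)

print(scaler.mean_)
print(test_scaled)
print(restored)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row does not, because it is scaled with the training statistics.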

Instantiate, Fit, Score Model

In [31]:
lr = LinearRegression()

In [32]:
lr.fit(X_train, y_train)

LinearRegression()

In [33]:
lr.score(X_train, y_train), lr.score(X_test, y_test) # this model did worse than the baseline score

(0.744137646447512, 0.4826040517475163)

Low bias, high variance: acceptable for a first model. The model was fit on the unscaled features (X_train), so these are R2 scores without scaling.
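
For an ordinary least squares model with an intercept, standardizing the features is not expected to change R2, because the fitted predictions are the same either way. Scaling matters for regularized models such as Lasso. A minimal sketch on synthetic data (all arrays are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, not the Ames data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([5.0, 0.5, 0.05, 0.005]) + rng.normal(size=200)

# R2 of plain linear regression on the raw features.
raw_score = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

# R2 of the same model on standardized features.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_score = LinearRegression().fit(X_scaled, y_demo).score(X_scaled, y_demo)

print(np.isclose(raw_score, scaled_score))
```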

In [34]:
cross_val_score(lr, X, y) # 5-fold cross-validation: checks the score on 5 different splits

array([0.71976524, 0.76441344, 0.6246985 , 0.77556205, 0.67102882])

In [35]:
lr.coef_

array([ 7.13594752e+01,  4.33727634e+01,  1.79911666e+00,  4.88400107e+01,
        1.16703312e+04, -2.80447427e+03,  1.16949173e+04, -1.10291005e+01,
        2.19457656e+04,  3.31737760e+01,  7.54069216e+01,  4.99668678e+01,
        1.61075317e-02])

### Model 2 - Numeric Features

In [36]:
X = cate[features]
y = cate['SalePrice']

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)
ss = StandardScaler()

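Xs_train = ss.fit_transform(X_train) # fit the scaler on the training data only; fitting on test data would leak information
Xs_test = ss.transform(X_test) # transform the test data with the training mean and std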

In [38]:
lr = LinearRegression()
lr.fit(Xs_train, y_train)

LinearRegression()

In [39]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test) 

(0.8457134063613226, 0.6009567332946075)

The model is still overfitting: the training score is well above the testing score. Linear regression models usually tend toward high bias and low variance, so a gap this large is worth investigating.

### Model 4 - Numeric and Categorical columns

This model uses both the numeric features and the one-hot encoded categorical features.

In [41]:
X= cate.drop(columns = ['SalePrice'])
y = cate['SalePrice']

X_train,X_test,y_train, y_test = train_test_split(X, y, random_state = 4, test_size = .1, train_size = .9 )


In [42]:
X.columns

Index(['Id', 'PID', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond',
       'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1',
       ...
       'Misc Feature_Shed', 'Misc Feature_TenC', 'Sale Type_CWD',
       'Sale Type_Con', 'Sale Type_ConLD', 'Sale Type_ConLI',
       'Sale Type_ConLw', 'Sale Type_New', 'Sale Type_Oth', 'Sale Type_WD '],
      dtype='object', length=278)

In [43]:
X_train.shape

(1845, 278)

In [44]:
y_train.shape

(1845,)

In [46]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train), lr.score(X_test, y_test) # the model is currently overfit

(0.9483620478434723, 0.5895197575406008)

This model is overfitting: it fits the training data much better than it fits the unseen test data.

In [47]:
lr.coef_

array([-1.89709052e-01,  4.90866056e-06,  1.04537815e+02,  8.90917086e-01,
        6.06097154e+03,  5.23048791e+03,  3.81390183e+02,  7.24571639e+01,
        2.66594887e+01,  1.63882677e+01,  1.15383230e+01, -2.34451308e+00,
        2.55821583e+01,  1.61340443e+01,  2.47488706e+01, -1.03724403e+01,
        3.05105827e+01,  2.64486683e+03, -1.78859701e+02,  2.84568291e+03,
        2.60797718e+03, -3.44484986e+03, -1.08477601e+04,  1.23883236e+03,
        3.85022376e+03,  1.39501781e+01,  2.49214423e+03,  1.75164603e+01,
        1.25790870e+01,  5.32035808e+00,  1.93157733e+01,  2.76649592e+00,
        5.19885924e+01, -2.07898062e+02,  8.77958382e-01, -1.28490989e+02,
       -6.83562247e+02,  6.03062913e+03,  1.36371592e+04,  2.96818009e+04,
        5.63822571e+03,  5.68160214e+03,  9.40658044e+03,  5.66935399e+03,
       -1.07532862e+04, -1.37627034e+03, -6.89478262e+03, -2.13984585e+04,
       -7.57704703e+04, -1.60016154e+04, -2.10862214e+04, -5.86799987e+03,
        1.02928558e+04,  

### Model 5 - Lasso

For this step, we will use Lasso, a regularized linear model, for feature selection.
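
As a rough illustration of why Lasso works for feature selection, the sketch below fits `LassoCV` to synthetic data in which only the first two of eight columns drive the target. The data, seed, and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two columns drive the target; the rest are noise.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Toy example with no train/test split, so the scaler sees every row.
X_scaled = StandardScaler().fit_transform(X_demo)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y_demo)

# Indices of the features whose coefficients were not shrunk to zero.
kept = np.flatnonzero(lasso.coef_)
print(lasso.alpha_, kept)
```

The L1 penalty shrinks the coefficients of uninformative columns toward exactly zero, while the two informative columns keep large weights.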

In [48]:
X= cate.drop(columns = ['SalePrice'])
y = cate['SalePrice']

X_train,X_test,y_train, y_test = train_test_split(X, y, random_state = 4, test_size = .1, train_size = .9 )

ss = StandardScaler()
Xss_train = ss.fit_transform(X_train)
Xss_test = ss.transform(X_test)

In [49]:
# 5000 candidate alphas, log-spaced between 10^-3 and 10^10
lass_alpha = np.logspace(-3,10,5000)

# the algorithm repeatedly adjusts all the weights, and the default number of
# iterations is sometimes not enough for them to converge, so max_iter is raised.
lasso_cv= LassoCV(alphas= lass_alpha, cv = 5, max_iter = 50000)

# Fit the model; LassoCV picks the best lasso alpha by cross-validation
lasso_cv.fit(Xss_train, y_train);

# best alpha
lasso_cv.alpha_


718.408189208704

In [51]:
print('Training score:', lasso_cv.score(Xss_train, y_train)) 
print('Test score:', lasso_cv.score(Xss_test, y_test))

lasso_cv.coef_  # sets most features to 0, good for feature selection

Training score: 0.9157780028791798
Test score: 0.7024341163933128


array([ 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,  3.09055051e+03,
        1.48126907e+04,  3.41582754e+03,  8.08337146e+03,  2.84572550e+03,
        4.85777966e+03,  3.94615720e+03,  2.66351375e+02, -0.00000000e+00,
        4.81344142e+03,  2.51997943e+03,  0.00000000e+00, -8.07619149e+02,
        1.98810523e+04,  2.52721971e+03, -0.00000000e+00,  1.12311110e+03,
        6.00192921e+02, -1.27915891e+02, -2.84309841e+03,  1.23948586e+03,
        2.38549023e+03, -0.00000000e+00,  3.37860881e+03,  2.39008257e+03,
        2.35355451e+03,  1.24292784e+02,  0.00000000e+00, -0.00000000e+00,
        2.71130531e+03,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -6.79936772e+02, -3.55203949e+02,
       -7.97190297e+02, -9.11537292e+02, -0.00000000e+00, -5.83511773e+02,
       -2.72384564e+02,  

In [52]:
lasso_feat= list(zip(X_train.columns, lasso_cv.coef_))
lasso_feat

[('Id', 0.0),
 ('PID', -0.0),
 ('Lot Frontage', -0.0),
 ('Lot Area', 3090.550508830929),
 ('Overall Qual', 14812.690660757604),
 ('Overall Cond', 3415.827535772207),
 ('Year Built', 8083.3714569551685),
 ('Year Remod/Add', 2845.7255008723973),
 ('Mas Vnr Area', 4857.779658251992),
 ('BsmtFin SF 1', 3946.1571968685694),
 ('BsmtFin SF 2', 266.3513746530556),
 ('Bsmt Unf SF', -0.0),
 ('Total Bsmt SF', 4813.4414236269295),
 ('1st Flr SF', 2519.979431934801),
 ('2nd Flr SF', 0.0),
 ('Low Qual Fin SF', -807.6191487212054),
 ('Gr Liv Area', 19881.052296930342),
 ('Bsmt Full Bath', 2527.2197144537276),
 ('Bsmt Half Bath', -0.0),
 ('Full Bath', 1123.1110964268219),
 ('Half Bath', 600.1929207753828),
 ('Bedroom AbvGr', -127.91589149406904),
 ('Kitchen AbvGr', -2843.0984129655667),
 ('TotRms AbvGrd', 1239.4858606653997),
 ('Fireplaces', 2385.490227041253),
 ('Garage Yr Blt', -0.0),
 ('Garage Cars', 3378.6088076001656),
 ('Garage Area', 2390.082568792196),
 ('Wood Deck SF', 2353.5545090567575),
 (

Lasso was used to select the features that had an effect on the home price. The next step is to keep only those features and fit a new linear regression model on them.

In [53]:
new_feat =[]
for x in lasso_feat:
    if abs(x[1])>0:
        new_feat.append(x[0])
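
The same selection can also be written with a boolean mask. This is an alternative sketch, not the code that produced the output in this section; the four names and rounded coefficients are copied from the `lasso_feat` output for illustration.

```python
import numpy as np

# A few (name, coefficient) pairs taken from the lasso_feat output, rounded.
columns = np.array(["Id", "Lot Area", "Overall Qual", "Bsmt Unf SF"])
coefs = np.array([0.0, 3090.55, 14812.69, -0.0])

# Keep only the features whose coefficient is nonzero (-0.0 counts as zero).
selected = columns[coefs != 0].tolist()
print(selected)  # ['Lot Area', 'Overall Qual']
```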

In [54]:
# new features selected 
new_feat

['Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'BsmtFin SF 2',
 'Total Bsmt SF',
 '1st Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Screen Porch',
 'MS SubClass_90',
 'MS SubClass_120',
 'MS SubClass_150',
 'MS SubClass_160',
 'MS SubClass_190',
 'MS Zoning_C (all)',
 'MS Zoning_RM',
 'Lot Shape_IR3',
 'Land Contour_HLS',
 'Utilities_NoSeWa',
 'Lot Config_CulDSac',
 'Land Slope_Mod',
 'Neighborhood_Crawfor',
 'Neighborhood_Edwards',
 'Neighborhood_GrnHill',
 'Neighborhood_NAmes',
 'Neighborhood_NWAmes',
 'Neighborhood_NoRidge',
 'Neighborhood_NridgHt',
 'Neighborhood_OldTown',
 'Neighborhood_Somerst',
 'Neighborhood_StoneBr',
 'Neighborhood_Veenker',
 'Condition 1_Norm',
 'Condition 1_PosA',
 'Condition 1_PosN',
 'Condition 1_RRAe',
 'C

In [64]:
X = cate[new_feat]
y = cate['SalePrice']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state = 42, test_size = .1, train_size = .9)

#Linear Regression with the selected features
lr = LinearRegression()
lr.fit(Xtrain, ytrain)

lr.score(Xtrain,ytrain), lr.score(Xtest,ytest)

(0.9072923660831265, 0.9049822968983856)

In [65]:
cross_val_score(lr, Xtrain, ytrain, cv = 5).mean()

0.8562363146362953

In [66]:
cross_val_score(lr, Xtest, ytest, cv = 5).mean()

0.7638155521327572

In [67]:
list(zip(X.columns, lr.coef_))

[('Lot Area', 0.3351138892031752),
 ('Overall Qual', 7766.604814464586),
 ('Overall Cond', 4738.772496394451),
 ('Year Built', 219.91401449350397),
 ('Year Remod/Add', 37.65786082851196),
 ('Mas Vnr Area', 12.442218107730696),
 ('BsmtFin SF 1', 2.5667143767716993),
 ('BsmtFin SF 2', 3.2847416311845024),
 ('Total Bsmt SF', 11.130447967337858),
 ('1st Flr SF', -6.122886939067939),
 ('Low Qual Fin SF', -15.37712586253052),
 ('Gr Liv Area', 36.350307427151165),
 ('Bsmt Full Bath', 8457.621949510902),
 ('Full Bath', 8168.527282567738),
 ('Half Bath', 3027.963597428024),
 ('Bedroom AbvGr', 76.11472735471173),
 ('Kitchen AbvGr', -17795.069859352418),
 ('TotRms AbvGrd', 1783.3301466325076),
 ('Fireplaces', 6740.87297973777),
 ('Garage Cars', 9104.072049621895),
 ('Garage Area', 2.413289394224421),
 ('Wood Deck SF', 8.229827555529141),
 ('Open Porch SF', -3.9928889974480626),
 ('Screen Porch', 64.53273865050915),
 ('MS SubClass_90', -5878.256386309659),
 ('MS SubClass_120', -12536.850095457357)

In [None]:
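The list above shows the effect each selected feature has on the sale price of the home, holding the other features constant. Because this model was fit on unscaled features, each coefficient is in dollars per unit of its feature, so magnitudes are not directly comparable across features.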
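
To compare the relative importance of features, one option is to refit on standardized features so that each coefficient is in dollars per standard deviation. The sketch below uses synthetic data; `living_area`, `garage_cars`, and the effect sizes are hypothetical and not estimated from the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
living_area = rng.normal(1500, 500, size=500)             # hypothetical square feet
garage_cars = rng.integers(0, 4, size=500).astype(float)  # hypothetical car count
X_demo = np.column_stack([living_area, garage_cars])
y_demo = 40.0 * living_area + 8000.0 * garage_cars + rng.normal(scale=1000, size=500)

# Coefficients in dollars per raw unit (per square foot, per car).
raw = LinearRegression().fit(X_demo, y_demo).coef_

# Coefficients in dollars per standard deviation of each feature.
standardized = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo).coef_

print(raw)
print(standardized)
```

In this toy setup the raw coefficient for `garage_cars` is far larger, yet `living_area` has the larger standardized coefficient, which is why raw magnitudes alone can mislead.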