# Modeling

This section builds a model to predict housing prices in Ames, IA.  We start by creating a baseline score using all the variables, then take steps to improve the model as measured by the R2 score.  If a model scores poorly on both the training and the testing data, it is underfit: high bias and low variance.  This means the model is not complex enough.  If the training score is much higher than the testing score, the model is overfit: high variance and low bias.  This means the model is fitting to the training data and may not fit well to unseen data.  It is important to get the model's training and testing scores as close together as possible.
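
As a rough illustration of the train/test comparison described above, the sketch below computes R2 by hand and labels a model from its pair of scores. The helper names `r2` and `diagnose_fit`, and the thresholds `gap=0.05` and `low=0.6`, are hypothetical choices made for this example, not rules used elsewhere in this notebook.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def diagnose_fit(train_score, test_score, gap=0.05, low=0.6):
    """Hypothetical rule of thumb for reading a train/test R2 pair."""
    if train_score - test_score > gap:
        return "overfit (high variance)"
    if train_score < low and test_score < low:
        return "underfit (high bias)"
    return "balanced"


# Perfect predictions give R2 = 1; predicting the mean gives R2 = 0.
print(r2([1, 2, 3], [1, 2, 3]), r2([1, 2, 3], [2, 2, 2]))

# A large gap between training and testing scores points to overfitting.
print(diagnose_fit(0.744, 0.483))
print(diagnose_fit(0.907, 0.905))
```

The thresholds are arbitrary; in practice the size of an acceptable gap depends on the problem.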

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score

pd.options.display.max_columns = 100
pd.options.display.max_rows = 3000

In [2]:
%store -r ames
%store -r features
%store -r cat

In [3]:
ames.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,68.0,13517,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,138500


In [4]:
ames.shape

(2051, 81)

## One Hot-Encoding

In [5]:
# finds the first element in each category when sorted, because these will be dropped when dummied
comp = []
for x in cat: # cat is a list of categorical columns 
    comp.append(sorted(ames[x].unique())[0])

In [6]:
comp

[20,
 'A (agr)',
 'Grvl',
 'Grvl',
 'IR1',
 'Bnk',
 'AllPub',
 'Corner',
 'Gtl',
 'Blmngtn',
 'Artery',
 'Artery',
 '1Fam',
 '1.5Fin',
 'Flat',
 'ClyTile',
 'AsbShng',
 'AsbShng',
 'BrkCmn',
 'Ex',
 'Ex',
 'BrkTil',
 'Ex',
 'Ex',
 'Av',
 'ALQ',
 'ALQ',
 'GasA',
 'Ex',
 'N',
 'FuseA',
 'Ex',
 'Maj1',
 'Ex',
 '2Types',
 'Fin',
 'Ex',
 'Ex',
 'N',
 'Ex',
 'GdPrv',
 'Elev',
 'COD']

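Above are the categories that will be dropped when the columns are one-hot encoded with `drop_first = True`.  Each dropped category becomes the reference level for its column, so a dummy variable's coefficient in the linear model is interpreted relative to that reference level, with all other variables held constant.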
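
A minimal sketch of this behavior on a toy column (the `toy` DataFrame is made up for illustration; only the `Street` levels come from the data):

```python
import pandas as pd

# Toy column using the two 'Street' levels seen in the data.
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# First level in sorted order, the same rule used to build `comp`.
reference_level = sorted(toy["Street"].unique())[0]

# drop_first=True removes the dummy column for the first level.
dummies = pd.get_dummies(toy, columns=["Street"], drop_first=True)

print(reference_level)        # the level that gets no dummy column
print(list(dummies.columns))  # the remaining dummy columns
```

Here `Grvl` is the reference level, so the `Street_Pave` coefficient would describe the difference between paved and gravel streets.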

In [7]:
features

['Lot Frontage',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'Bsmt Unf SF',
 'BsmtFin SF 2',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'Total Bsmt SF']

In [8]:
cat

['MS SubClass',
 'MS Zoning',
 'Street',
 'Alley',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Fireplace Qu',
 'Garage Type',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Paved Drive',
 'Pool QC',
 'Fence',
 'Misc Feature',
 'Sale Type']

In [9]:
cate = pd.get_dummies(ames, columns = cat, drop_first = True)

In [10]:
cate

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,0,0,0,0,0,0,0,0,1
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,0,0,0,1
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,0,0,0,0,0,0,0,0,1


In [11]:
cate.shape

(2051, 279)

In [16]:
cate['SalePrice']

0       130500
1       220000
2       109000
3       174000
4       138500
5       190000
6       140000
7       142000
8       112500
9       135000
10       85400
11      183600
12      131000
13      200000
14      193000
15      173500
16       98000
17      139000
18      143500
19      215200
20      129000
21      278000
22      344133
23      185000
24      145000
25      187500
26      138500
27      198000
28      119600
29      122900
30      278000
31      230000
32      270000
33      125000
34      297000
35      113500
36      127000
37      190000
38      175500
39      146000
40      147500
41      465000
42      165500
43      131500
44      129500
45      257076
46      117000
47      149000
48      128000
49      155000
50      166000
51      135000
52      250000
53       76000
54      155000
55      158000
56      149500
57      121000
58      136000
59      173000
60      290000
61      303477
62      270000
63      122250
64      153000
65      147000
66      14

In [104]:
df = pd.merge(left = ames[features], right = cate, left_index = True, right_index = True )

## Baseline Model Score

In [17]:
X = cate.drop(columns ='SalePrice')
y = cate['SalePrice']

In [18]:
lr = LinearRegression()

In [19]:
cross_val_score(lr,X, y).mean() # this is the baseline Score

0.8037853406030404
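
For a regressor, `cross_val_score` returns one R2 score per fold, and the baseline is the mean of those scores. The sketch below shows this on synthetic data (the arrays `X_demo` and `y_demo` are made up, and the shuffled `KFold` is an optional variation, not what the cell above uses):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, not the Ames data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Five shuffled folds; each fold is held out once and scored with R2.
folds = KFold(n_splits=5, shuffle=True, random_state=4)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds)

print(scores)         # one R2 score per fold
print(scores.mean())  # the single number used as a baseline
```

Looking at the spread of the individual fold scores, not only the mean, gives a sense of how stable the model is.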

We need to improve on this baseline: later models should score higher than .804.

### Model 1

In [110]:
X = ames[['Mas Vnr Area','Total Bsmt SF','1st Flr SF','Gr Liv Area',
         'Full Bath','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Garage Area','Open Porch SF','Wood Deck SF','Lot Area']]
y = ames['SalePrice']

Train Test Split

In [111]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)
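
A quick sketch of what this split produces for a dataset with 2051 rows, the same row count as `ames`. The arrays here are placeholders, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as `ames` (2051).
X_demo = np.arange(2051).reshape(-1, 1)
y_demo = np.arange(2051)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=4, test_size=0.1, train_size=0.9
)

# 90% of the rows go to training and the remaining 10% to testing.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so every model that uses the same value is scored on the same test rows.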

Scaling

In [112]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would leak information
Xs_test = ss.transform(X_test) # transform the test data using the training mean and std

Methods

fit(X[, y]) - Compute the mean and std to be used for later scaling.
fit_transform(X[, y]) - Fit to data, then transform it.
get_params([deep]) - Get parameters for this estimator.
inverse_transform(X[, copy]) - Scale back the data to the original representation.
partial_fit(X[, y]) - Online computation of mean and std on X for later scaling.
set_params(**params) - Set the parameters of this estimator.
transform(X[, copy]) - Perform standardization by centering and scaling.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html 
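
A short sketch of how `fit_transform`, `transform`, and `inverse_transform` fit together, using synthetic stand-ins for `X_train` and `X_test` (the `_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train / X_test (not the Ames data).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=1500, scale=400, size=(100, 2))
X_test_demo = rng.normal(loc=1500, scale=400, size=(20, 2))

scaler = StandardScaler()
Xs_train_demo = scaler.fit_transform(X_train_demo)  # learns mean/std from train only
Xs_test_demo = scaler.transform(X_test_demo)        # reuses the training mean/std

# After scaling, each training column has mean 0 and std 1.
print(Xs_train_demo.mean(axis=0).round(6), Xs_train_demo.std(axis=0).round(6))

# inverse_transform maps the scaled values back to the original units.
restored = scaler.inverse_transform(Xs_train_demo)
print(np.allclose(restored, X_train_demo))
```

The scaled test columns will generally not have a mean of exactly 0, because the test data is scaled with statistics learned from the training data.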

Instantiate, Fit, Score Model

In [113]:
lr = LinearRegression()

In [114]:
lr.fit(X_train, y_train)

LinearRegression()

In [115]:
lr.score(X_train, y_train), lr.score(X_test, y_test) # this model did worse than the baseline score.

(0.744137646447512, 0.4826040517475163)

Low bias, high variance, which is acceptable for a first model.  These are the R2 scores without scaling.

In [116]:
cross_val_score(lr, X, y) # 5-fold cross-validated R2 scores

array([0.71976524, 0.76441344, 0.6246985 , 0.77556205, 0.67102882])

In [117]:
lr.coef_

array([ 7.13594752e+01,  4.33727634e+01,  1.79911666e+00,  4.88400107e+01,
        1.16703312e+04, -2.80447427e+03,  1.16949173e+04, -1.10291005e+01,
        2.19457656e+04,  3.31737760e+01,  7.54069216e+01,  4.99668678e+01,
        1.61075317e-02])

### Model 2 - includes Scaling

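Scaling puts all our features into the same "measured unit".  Sklearn's StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.  Scaling does not change the R2 score of an ordinary linear regression, but it makes the coefficients comparable across features and is critical for regularized models such as Lasso.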
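
The sketch below checks this on synthetic data with two features on very different scales (the data and the `_demo` names are made up for illustration). The R2 score of a plain linear regression is the same with and without scaling; only the coefficients change.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two features on very different scales (think square feet vs. a small count).
X_demo = np.column_stack([rng.normal(1500, 400, 200), rng.integers(0, 4, 200)])
y_demo = 100 * X_demo[:, 0] + 10000 * X_demo[:, 1] + rng.normal(0, 20000, 200)

Xs_demo = StandardScaler().fit_transform(X_demo)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(Xs_demo, y_demo)

# Same R2 on both versions of the data.
print(raw.score(X_demo, y_demo), scaled.score(Xs_demo, y_demo))

# Different coefficients: the scaled ones are per standard deviation.
print(raw.coef_, scaled.coef_)
```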

In [118]:
X = ames[['Mas Vnr Area','Total Bsmt SF','1st Flr SF','Gr Liv Area',
         'Full Bath','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Garage Area','Open Porch SF','Wood Deck SF','Lot Area']]
y = ames['SalePrice']

In [119]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would leak information
Xs_test = ss.transform(X_test) # transform the test data using the training mean and std

Instantiate, Fit, Score Model

In [120]:
lr = LinearRegression()

In [121]:
lr.fit(Xs_train, y_train)

LinearRegression()

In [122]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test) # this model did worse than the baseline score.

(0.7441376464475121, 0.4826040517475141)

The model is currently overfit (low bias, high variance).  At this point we want to see which features make a difference in the model.  Lasso is good for feature selection.

### Model 3 - Numeric Features

In [123]:
X = ames[features]
y = ames['SalePrice']

In [124]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would leak information
Xs_test = ss.transform(X_test) # transform the test data using the training mean and std

In [125]:
lr = LinearRegression()
lr.fit(Xs_train, y_train)

LinearRegression()

In [126]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test) 

(0.8457134063613226, 0.6009567332946075)

The model is still currently overfitting.  Linear Regression models tend to be high bias and low variance.

### Model 4 - Numeric and Categorical columns

Steps in this process:
    - add numeric and categorical columns
    - create linear regression model

In [127]:
# one-hot-encode the categorical columns; the numeric columns are kept as-is
comb_ames = pd.get_dummies(data = ames, columns = cat, drop_first = True)

In [128]:
comb_ames

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,0,0,0,0,0,0,0,0,1
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,0,0,0,1
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,0,0,0,0,0,0,0,0,1


In [129]:
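# comb_ames already contains the numeric columns, so this merge duplicates them with _x / _y suffixes
comb_ames = pd.merge(left = comb_ames, right = ames[features], left_index = True, right_index = True)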
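
A small sketch of why the `_x` and `_y` suffixes appear after this merge. The `toy` DataFrame is made up, with one numeric and one categorical column:

```python
import pandas as pd

toy = pd.DataFrame({"Lot Area": [13517, 11492], "Street": ["Pave", "Grvl"]})

# get_dummies keeps the numeric column and only replaces the categorical one.
encoded = pd.get_dummies(toy, columns=["Street"], drop_first=True)
print(list(encoded.columns))

# Merging the numeric column back in duplicates it; pandas disambiguates
# the two copies with its default suffixes '_x' and '_y'.
merged = pd.merge(encoded, toy[["Lot Area"]], left_index=True, right_index=True)
print(list(merged.columns))
```

Because the encoded frame already holds every numeric column, using it directly would avoid the duplicated features.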

In [130]:
ames[features]

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,Bsmt Unf SF,BsmtFin SF 2,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Total Bsmt SF
0,68.0,13517,6,8,1976,2005,289.0,533.0,192.0,0.0,...,0,44,0,0,0,0,0,3,2010,725.0
1,43.0,11492,7,5,1996,1997,132.0,637.0,276.0,0.0,...,0,74,0,0,0,0,0,4,2009,913.0
2,68.0,7922,5,7,1953,2007,0.0,731.0,326.0,0.0,...,0,52,0,0,0,0,0,1,2010,1057.0
3,73.0,9802,5,5,2006,2007,0.0,0.0,384.0,0.0,...,100,0,0,0,0,0,0,4,2010,384.0
4,82.0,14235,6,8,1900,1993,0.0,0.0,676.0,0.0,...,0,59,0,0,0,0,0,3,2010,676.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,79.0,11449,8,5,2007,2007,0.0,1011.0,873.0,0.0,...,0,276,0,0,0,0,0,1,2008,1884.0
2047,68.0,12342,4,5,1940,1950,0.0,262.0,599.0,0.0,...,158,0,0,0,0,0,0,3,2009,861.0
2048,57.0,7558,6,6,1928,1950,0.0,0.0,896.0,0.0,...,0,0,0,0,0,0,0,3,2009,896.0
2049,80.0,10400,4,5,1956,1956,0.0,155.0,295.0,750.0,...,0,189,140,0,0,0,0,11,2009,1200.0


In [131]:
34 + 279 # 34 numeric features + 279 columns after one-hot encoding

313

In [132]:
comb_ames # need to add 'SalePrice' column

Unnamed: 0,Id,PID,Lot Frontage_x,Lot Area_x,Overall Qual_x,Overall Cond_x,Year Built_x,Year Remod/Add_x,Mas Vnr Area_x,BsmtFin SF 1_x,...,Wood Deck SF_y,Open Porch SF_y,Enclosed Porch_y,3Ssn Porch_y,Screen Porch_y,Pool Area_y,Misc Val_y,Mo Sold_y,Yr Sold_y,Total Bsmt SF_y
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,44,0,0,0,0,0,3,2010,725.0
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,74,0,0,0,0,0,4,2009,913.0
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,52,0,0,0,0,0,1,2010,1057.0
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,100,0,0,0,0,0,0,4,2010,384.0
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,59,0,0,0,0,0,3,2010,676.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,276,0,0,0,0,0,1,2008,1884.0
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,158,0,0,0,0,0,0,3,2009,861.0
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,3,2009,896.0
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,189,140,0,0,0,0,11,2009,1200.0


In [133]:
comb_ames = pd.merge(left = comb_ames, right = ames['SalePrice'], left_index = True, right_index = True)

In [134]:
comb_ames

Unnamed: 0,Id,PID,Lot Frontage_x,Lot Area_x,Overall Qual_x,Overall Cond_x,Year Built_x,Year Remod/Add_x,Mas Vnr Area_x,BsmtFin SF 1_x,...,Open Porch SF_y,Enclosed Porch_y,3Ssn Porch_y,Screen Porch_y,Pool Area_y,Misc Val_y,Mo Sold_y,Yr Sold_y,Total Bsmt SF_y,SalePrice_y
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,44,0,0,0,0,0,3,2010,725.0,130500
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,74,0,0,0,0,0,4,2009,913.0,220000
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,52,0,0,0,0,0,1,2010,1057.0,109000
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,4,2010,384.0,174000
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,59,0,0,0,0,0,3,2010,676.0,138500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,276,0,0,0,0,0,1,2008,1884.0,298751
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,3,2009,861.0,82500
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,3,2009,896.0,177000
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,189,140,0,0,0,0,11,2009,1200.0,144000


In [153]:
comb_ames.drop(columns = ['SalePrice_x'], inplace = True) # raises KeyError when re-run after the column is already dropped

KeyError: "['SalePrice_x'] not found in axis"

In [154]:
comb_ames.drop(columns =['PID','Id'], inplace = True) # raises KeyError when re-run after the columns are already dropped

KeyError: "['PID' 'Id'] not found in axis"

In [155]:
X= comb_ames.drop(columns = ['SalePrice_y'])
y = comb_ames['SalePrice_y']

X_train,X_test,y_train, y_test = train_test_split(X, y, random_state = 4, test_size = .1, train_size = .9 )


In [138]:
X.columns

Index(['Lot Frontage_x', 'Lot Area_x', 'Overall Qual_x', 'Overall Cond_x',
       'Year Built_x', 'Year Remod/Add_x', 'Mas Vnr Area_x', 'BsmtFin SF 1_x',
       'BsmtFin SF 2_x', 'Bsmt Unf SF_x',
       ...
       'Wood Deck SF_y', 'Open Porch SF_y', 'Enclosed Porch_y', '3Ssn Porch_y',
       'Screen Porch_y', 'Pool Area_y', 'Misc Val_y', 'Mo Sold_y', 'Yr Sold_y',
       'Total Bsmt SF_y'],
      dtype='object', length=310)

In [139]:
X_train.shape

(1845, 310)

In [140]:
y_train.shape

(1845,)

In [156]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train) # training R2 only; compare with the test score to judge overfitting

0.948361516487874

A training R2 of 0.948 suggests this model may be overfit, meaning it fits the data provided more closely than it would fit unseen data.  The test score is needed to confirm this.

In [157]:
lr.coef_

array([-1.50862386e+07,  1.22600394e+13, -8.75734487e+11, -4.46072097e+11,
       -2.18017759e+11,  3.12642581e+11, -1.10594567e+12,  1.96268260e+12,
        2.10214266e+12,  1.80362873e+12, -2.13312945e+12,  5.51654511e+11,
        1.16010700e+12,  5.85595393e+11, -5.78811432e+11,  5.74369075e+10,
       -9.91075412e+10,  7.42297756e+09, -4.00451382e+10, -3.22711687e+10,
        3.47980918e+10,  1.50298712e+09, -2.51196373e+09,  1.37598877e+01,
       -2.58761496e+10, -1.95917044e+11,  1.83124765e+10, -9.39516048e+10,
       -7.71305412e+10,  3.32095513e+10,  4.38196920e+10, -1.78815038e+11,
        4.36841552e+10,  3.72978632e+08, -4.58304715e+09,  6.12650364e+03,
        1.39415597e+04,  2.99470462e+04,  5.75572769e+03,  5.66928889e+03,
        9.53577408e+03,  5.77361807e+03, -1.05383056e+04, -1.15262035e+03,
        3.59954325e+08, -2.16655930e+04, -7.55023095e+04, -1.62501892e+04,
       -2.13514238e+04, -5.76891529e+03,  1.06485453e+04,  2.87963969e+04,
        1.22384185e+08,  
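
Coefficients on the order of 1e12, like those printed above, are a typical symptom of perfectly collinear columns, such as the duplicated `_x` / `_y` features. As a hedged sketch on made-up data, a duplicated column makes the design matrix rank deficient, so least squares has no unique solution for those coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
area = rng.normal(1500, 400, size=50)
rooms = rng.integers(3, 10, size=50).astype(float)

X_unique = np.column_stack([area, rooms])
X_duplicated = np.column_stack([area, rooms, area])  # `area` appears twice

# Full rank: the rank equals the number of columns.
print(np.linalg.matrix_rank(X_unique), X_unique.shape[1])

# Rank deficient: three columns, but only two independent directions.
print(np.linalg.matrix_rank(X_duplicated), X_duplicated.shape[1])
```

Comparing the rank with the number of columns is one quick way to spot this before interpreting coefficients.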

### Model 5 - Lasso

For this step, we will use Lasso, an L1 regularizer that shrinks the coefficients and sets some of them to exactly zero.

In [158]:
ss = StandardScaler()
Xss_train = ss.fit_transform(X_train)
Xss_test = ss.transform(X_test)

In [159]:
# 5000 alphas, log-spaced between 10^-3 and 10^10
lass_alpha = np.logspace(-3,10,5000)

# coordinate descent repeatedly adjusts all the weights, and
# sometimes the default max_iter isn't enough for the weights to converge.
lasso_cv= LassoCV(alphas= lass_alpha, cv = 5, max_iter = 50000)

# Fit the model; LassoCV selects the best lasso alpha by cross-validation
lasso_cv.fit(Xss_train, y_train);

# best alpha
lasso_cv.alpha_


718.408189208704
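
A smaller version of the same idea on synthetic data, where only the first two features actually drive the target. The data, the narrower alpha grid, and the `_demo` names are assumptions made for this sketch:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 6))
# Only features 0 and 1 influence the target; the rest are noise.
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

Xs_demo = StandardScaler().fit_transform(X_demo)
alphas = np.logspace(-3, 1, 30)  # 30 candidate penalties from 0.001 to 10

demo_cv = LassoCV(alphas=alphas, cv=5, max_iter=50000).fit(Xs_demo, y_demo)

print(demo_cv.alpha_)  # the penalty chosen by 5-fold cross-validation
print(demo_cv.coef_)   # coefficients fitted with that penalty
```

The chosen `alpha_` is always one of the candidate values, so the grid should be wide enough that the best value is not at either end.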

In [160]:
print('Training score:', lasso_cv.score(Xss_train, y_train)) 
print('Test score:', lasso_cv.score(Xss_test, y_test))

lasso_cv.coef_  # sets most features to 0, good for feature selection

Training score: 0.9157785355286931
Test score: 0.7024309399468989


array([-0.00000000e+00,  5.32093562e+02,  1.29598790e+04,  3.11793305e+02,
        7.58983392e+03,  2.73351583e+03,  4.78445769e+03,  3.54306737e+03,
        0.00000000e+00, -0.00000000e+00,  4.31949094e+03,  2.52752001e+03,
        0.00000000e+00, -4.87527235e+02,  1.97758658e+04,  2.52723423e+03,
       -0.00000000e+00,  1.12340100e+03,  5.93640940e+02, -1.12097204e+02,
       -2.84228912e+03,  1.23964495e+03,  2.38549387e+03, -0.00000000e+00,
        3.36300254e+03,  2.39122324e+03,  2.35370701e+03,  1.24135583e+02,
        0.00000000e+00, -0.00000000e+00,  2.71143663e+03,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -8.03098750e+02, -3.54709161e+02, -7.97196991e+02, -9.11203022e+02,
       -0.00000000e+00, -5.83524578e+02, -2.72063355e+02,  0.00000000e+00,
        0.00000000e+00,  

In [161]:
lasso_feat= list(zip(X_train.columns, lasso_cv.coef_))
lasso_feat

[('Lot Frontage_x', -0.0),
 ('Lot Area_x', 532.0935624532837),
 ('Overall Qual_x', 12959.879020709745),
 ('Overall Cond_x', 311.7933054209024),
 ('Year Built_x', 7589.833923365611),
 ('Year Remod/Add_x', 2733.5158331156977),
 ('Mas Vnr Area_x', 4784.457694312031),
 ('BsmtFin SF 1_x', 3543.067366544197),
 ('BsmtFin SF 2_x', 0.0),
 ('Bsmt Unf SF_x', -0.0),
 ('Total Bsmt SF_x', 4319.490939471468),
 ('1st Flr SF_x', 2527.520007690452),
 ('2nd Flr SF_x', 0.0),
 ('Low Qual Fin SF_x', -487.52723503629716),
 ('Gr Liv Area_x', 19775.86580753073),
 ('Bsmt Full Bath_x', 2527.2342331715067),
 ('Bsmt Half Bath_x', -0.0),
 ('Full Bath_x', 1123.400998599674),
 ('Half Bath_x', 593.6409401272372),
 ('Bedroom AbvGr_x', -112.09720370204384),
 ('Kitchen AbvGr_x', -2842.2891179363564),
 ('TotRms AbvGrd_x', 1239.6449547734242),
 ('Fireplaces_x', 2385.4938687059266),
 ('Garage Yr Blt', -0.0),
 ('Garage Cars_x', 3363.0025411056354),
 ('Garage Area_x', 2391.2232374313817),
 ('Wood Deck SF_x', 2353.707012879217

In [147]:
new_feat =[]
for x in lasso_feat:
    if abs(x[1])>0:
        new_feat.append(x[0])
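
The same selection can be sketched as a list comprehension. The four pairs below are copied from the `lasso_feat` output above; the names `lasso_feat_demo` and `kept` are only for this example:

```python
# A few (feature, coefficient) pairs taken from the Lasso output.
lasso_feat_demo = [
    ("Lot Frontage_x", -0.0),
    ("Lot Area_x", 532.0935624532837),
    ("Overall Qual_x", 12959.879020709745),
    ("BsmtFin SF 2_x", 0.0),
]

# Keep only the features whose coefficient is not zero.
# Note that -0.0 compares equal to 0, so it is dropped as well.
kept = [name for name, coef in lasso_feat_demo if coef != 0]
print(kept)
```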

In [148]:
# new features selected 
new_feat

['Lot Area_x',
 'Overall Qual_x',
 'Overall Cond_x',
 'Year Built_x',
 'Year Remod/Add_x',
 'Mas Vnr Area_x',
 'BsmtFin SF 1_x',
 'Total Bsmt SF_x',
 '1st Flr SF_x',
 'Low Qual Fin SF_x',
 'Gr Liv Area_x',
 'Bsmt Full Bath_x',
 'Full Bath_x',
 'Half Bath_x',
 'Bedroom AbvGr_x',
 'Kitchen AbvGr_x',
 'TotRms AbvGrd_x',
 'Fireplaces_x',
 'Garage Cars_x',
 'Garage Area_x',
 'Wood Deck SF_x',
 'Open Porch SF_x',
 'Screen Porch_x',
 'MS SubClass_90',
 'MS SubClass_120',
 'MS SubClass_150',
 'MS SubClass_160',
 'MS SubClass_190',
 'MS Zoning_C (all)',
 'MS Zoning_RM',
 'Lot Shape_IR3',
 'Land Contour_HLS',
 'Utilities_NoSeWa',
 'Lot Config_CulDSac',
 'Land Slope_Mod',
 'Neighborhood_Crawfor',
 'Neighborhood_Edwards',
 'Neighborhood_GrnHill',
 'Neighborhood_NAmes',
 'Neighborhood_NWAmes',
 'Neighborhood_NoRidge',
 'Neighborhood_NridgHt',
 'Neighborhood_OldTown',
 'Neighborhood_Somerst',
 'Neighborhood_StoneBr',
 'Neighborhood_Veenker',
 'Condition 1_Norm',
 'Condition 1_PosA',
 'Condition 1_Po

In [149]:
sorted(comb_ames.columns)

['1st Flr SF_x',
 '1st Flr SF_y',
 '2nd Flr SF_x',
 '2nd Flr SF_y',
 '3Ssn Porch_x',
 '3Ssn Porch_y',
 'Alley_No Alley',
 'Alley_Pave',
 'Bedroom AbvGr_x',
 'Bedroom AbvGr_y',
 'Bldg Type_2fmCon',
 'Bldg Type_Duplex',
 'Bldg Type_Twnhs',
 'Bldg Type_TwnhsE',
 'Bsmt Cond_Fa',
 'Bsmt Cond_Gd',
 'Bsmt Cond_No Basement',
 'Bsmt Cond_Po',
 'Bsmt Cond_TA',
 'Bsmt Exposure_Gd',
 'Bsmt Exposure_Mn',
 'Bsmt Exposure_No',
 'Bsmt Exposure_No Basement',
 'Bsmt Full Bath_x',
 'Bsmt Full Bath_y',
 'Bsmt Half Bath_x',
 'Bsmt Half Bath_y',
 'Bsmt Qual_Fa',
 'Bsmt Qual_Gd',
 'Bsmt Qual_No Basement',
 'Bsmt Qual_Po',
 'Bsmt Qual_TA',
 'Bsmt Unf SF_x',
 'Bsmt Unf SF_y',
 'BsmtFin SF 1_x',
 'BsmtFin SF 1_y',
 'BsmtFin SF 2_x',
 'BsmtFin SF 2_y',
 'BsmtFin Type 1_BLQ',
 'BsmtFin Type 1_GLQ',
 'BsmtFin Type 1_LwQ',
 'BsmtFin Type 1_No Basement',
 'BsmtFin Type 1_Rec',
 'BsmtFin Type 1_Unf',
 'BsmtFin Type 2_BLQ',
 'BsmtFin Type 2_GLQ',
 'BsmtFin Type 2_LwQ',
 'BsmtFin Type 2_No Basement',
 'BsmtFin Type 2_R

In [171]:
X = comb_ames[new_feat]
y = comb_ames['SalePrice_y']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state = 40, test_size = .1, train_size = .9)
ss = StandardScaler()
Xstrain = ss.fit_transform(Xtrain)
Xstest = ss.transform(Xtest)

# Linear Regression with the selected features (fit on the unscaled data)
lr = LinearRegression()
lr.fit(Xtrain, ytrain)

lr.score(Xtrain,ytrain), lr.score(Xtest,ytest)

(0.9072465626032411, 0.9053909003952798)

In [167]:
Xstrain = ss.inverse_transform(Xstrain) # undoes the scaling, so Xstrain is back in the original units

In [168]:
cross_val_score(lr, Xstrain, ytrain, cv = 5).mean()

-571705.1698374378

In [152]:
cross_val_score(lr, Xstest, ytest, cv = 5).mean()

-3.8297166974828796e+23
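
One way to combine scaling with cross-validation is to wrap both steps in a pipeline, so that the scaler is refit on the training portion of every fold. This is a sketch on synthetic data, offered as an alternative and not as what the cells above do; `make_pipeline` is from `sklearn.pipeline`, and the `_demo` data is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# The scaler and the regression are fit together inside each fold,
# so no information from the held-out fold leaks into the scaling.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(scores)
print(scores.mean())
```

Cross-validating on the training data this way leaves the test set untouched for a final check.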

In [172]:
list(zip(X.columns, lr.coef_))

[('Lot Area_x', 159465649358361.8),
 ('Overall Qual_x', 3839.25),
 ('Overall Cond_x', 2568.236328125),
 ('Year Built_x', 110.593017578125),
 ('Year Remod/Add_x', 22.1962890625),
 ('Mas Vnr Area_x', 8.225830078125),
 ('BsmtFin SF 1_x', 2.0686492919921875),
 ('Total Bsmt SF_x', 5.336639404296875),
 ('1st Flr SF_x', -5.470855712890625),
 ('Low Qual Fin SF_x', -8.257915496826172),
 ('Gr Liv Area_x', 19.134552001953125),
 ('Bsmt Full Bath_x', 8705.471888184547),
 ('Full Bath_x', 7630.395411610603),
 ('Half Bath_x', 1319.94278049469),
 ('Bedroom AbvGr_x', 241.35335898399353),
 ('Kitchen AbvGr_x', -14625.64563691616),
 ('TotRms AbvGrd_x', 1130.5154175758362),
 ('Fireplaces_x', 6361.19496178627),
 ('Garage Cars_x', 4664.387651860714),
 ('Garage Area_x', 2.2162399291992188),
 ('Wood Deck SF_x', 10.257209777832031),
 ('Open Porch SF_x', -4.390105068683624),
 ('Screen Porch_x', 71.50680685043335),
 ('MS SubClass_90', -6682.036041796207),
 ('MS SubClass_120', -10725.478141069412),
 ('MS SubClass_1