# Modeling

This section builds a model to predict housing prices in Ames, IA.  We start with a baseline score using all the variables, then take steps to improve the model, using the R2 score to compare models.  If both the training and testing scores are low, the model is underfit: it has high bias and low variance, which means it is not complex enough.  If the training score is noticeably higher than the testing score, the model is overfit: it has high variance and low bias, which means it is fitting the training data closely and may not generalize to unseen data.  The goal is to get the model's training and testing scores as close together as possible while keeping both high.
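As a rough illustration of comparing training and testing scores, the sketch below fits a linear model on synthetic data and reports the gap between the two R2 values. The data, the helper name `score_gap`, and the variable names are assumptions made for this example; none of it comes from the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def score_gap(model, X_train, X_test, y_train, y_test):
    """Return (train R2, test R2, train minus test) for a fitted model."""
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    return train_r2, test_r2, train_r2 - test_r2

# Synthetic data (hypothetical): 200 rows, 3 informative features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.25
)
demo_lr = LinearRegression().fit(Xd_train, yd_train)

train_r2, test_r2, gap = score_gap(demo_lr, Xd_train, Xd_test, yd_train, yd_test)
print(f"train R2={train_r2:.3f}, test R2={test_r2:.3f}, gap={gap:.3f}")
```

A large positive gap suggests overfitting, while two low scores that sit close together suggest underfitting.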

In [337]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score

In [338]:
%store -r ames
%store -r features
%store -r cat

In [339]:
ames.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,68.0,13517,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,No Alley,Reg,Lvl,...,0,0,No Pool,No Fence,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,No Alley,IR1,Lvl,...,0,0,No Pool,No Fence,,0,3,2010,WD,138500


In [340]:
ames.shape

(2051, 81)

## One-Hot Encoding

In [341]:
# find the first level of each categorical column in sorted order; these levels are dropped when the column is dummied with drop_first = True
comp = []
for x in cat:
    comp.append(sorted(ames[x].unique())[0])

In [342]:
comp

[20,
 'A (agr)',
 'Grvl',
 'Grvl',
 'IR1',
 'Bnk',
 'AllPub',
 'Corner',
 'Gtl',
 'Blmngtn',
 'Artery',
 'Artery',
 '1Fam',
 '1.5Fin',
 'Flat',
 'ClyTile',
 'AsbShng',
 'AsbShng',
 'BrkCmn',
 'Ex',
 'Ex',
 'BrkTil',
 'Ex',
 'Ex',
 'Av',
 'ALQ',
 'ALQ',
 'GasA',
 'Ex',
 'N',
 'FuseA',
 'Ex',
 'Maj1',
 'Ex',
 '2Types',
 'Fin',
 'Ex',
 'Ex',
 'N',
 'Ex',
 'GdPrv',
 'Elev',
 'COD']

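The values above are the levels that will be dropped when each categorical column is one-hot encoded.  Each dropped level acts as the reference category, so when making inferences from the linear model's coefficients, a dummy coefficient is read as the difference in sale price relative to that reference level, with all other variables held constant.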
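To make the dropped level concrete, here is a small sketch on a toy frame. The frame `toy` and its values are made up for illustration; only the `pd.get_dummies(..., drop_first = True)` call mirrors what this notebook does.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [200000, 150000, 180000],
})

# drop_first=True drops the first level in sorted order ('Grvl'),
# which becomes the reference level for the remaining dummy column
toy_dummies = pd.get_dummies(toy, columns=["Street"], drop_first=True)
print(list(toy_dummies.columns))
```

Only `Street_Pave` remains, so its coefficient in a linear model would be interpreted relative to a gravel street.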

In [343]:
features

['Lot Frontage',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'Bsmt Unf SF',
 'BsmtFin SF 2',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'Total Bsmt SF']

In [344]:
cat

['MS SubClass',
 'MS Zoning',
 'Street',
 'Alley',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Fireplace Qu',
 'Garage Type',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Paved Drive',
 'Pool QC',
 'Fence',
 'Misc Feature',
 'Sale Type']

In [345]:
cate = pd.get_dummies(ames, columns = cat, drop_first = True)

In [346]:
cate

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,0,0,0,0,0,0,0,0,1
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,0,0,0,1
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,0,0,0,0,0,0,0,0,1


In [347]:
cate.shape

(2051, 279)

In [348]:
df = pd.merge(left = ames[features], right = cate, left_index = True, right_index = True )

In [349]:
ames[features].shape

(2051, 34)

In [350]:
df.shape

(2051, 313)

## Baseline Model Score

In [351]:
X = df.drop(columns ='SalePrice')
y = ames['SalePrice']

In [352]:
lr = LinearRegression()

In [353]:
cross_val_score(lr, X, y).mean() # baseline score: mean R2 across the cross-validation folds

0.8038096976011048

We need to improve on this: later models should score higher than 0.804.

### Model 1

In [354]:
X = ames[['Mas Vnr Area','Total Bsmt SF','1st Flr SF','Gr Liv Area',
         'Full Bath','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Garage Area','Open Porch SF','Wood Deck SF','Lot Area']]
y = ames['SalePrice']

Train Test Split

In [355]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)

Scaling

In [356]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would cause data leakage
Xs_test = ss.transform(X_test) # apply the training mean and std to the test data

Methods

- fit(X[, y]) - Compute the mean and std to be used for later scaling.
- fit_transform(X[, y]) - Fit to data, then transform it.
- get_params([deep]) - Get parameters for this estimator.
- inverse_transform(X[, copy]) - Scale back the data to the original representation.
- partial_fit(X[, y]) - Online computation of mean and std on X for later scaling.
- set_params(**params) - Set the parameters of this estimator.
- transform(X[, copy]) - Perform standardization by centering and scaling.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
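A minimal sketch of how the `StandardScaler` methods fit together, assuming two made-up arrays (`demo_train`, `demo_test`) in place of the real `X_train` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices standing in for X_train and X_test
rng = np.random.default_rng(1)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

demo_ss = StandardScaler()
demo_train_s = demo_ss.fit_transform(demo_train)  # learns the mean and std from the training rows only
demo_test_s = demo_ss.transform(demo_test)        # reuses the training mean and std

# After scaling, each training column has mean ~0 and std ~1
print(demo_train_s.mean(axis=0).round(6), demo_train_s.std(axis=0).round(6))

# inverse_transform maps the scaled values back to the original units
restored = demo_ss.inverse_transform(demo_train_s)
```

The test rows are scaled with the training statistics, so their column means will generally be close to 0 but not exactly 0.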

Instantiate, Fit, Score Model

In [357]:
lr = LinearRegression()

In [358]:
lr.fit(X_train, y_train)

LinearRegression()

In [359]:
lr.score(X_train, y_train), lr.score(X_test, y_test) # this model did worse than the baseline score.

(0.744137646447512, 0.4826040517475163)

The gap between the training score (0.744) and the test score (0.483) points to high variance, which is ok for a first model.  These are the R2 scores without scaling.

In [360]:
cross_val_score(lr, X, y) # checking the score across 5 cross-validation folds

array([0.71976524, 0.76441344, 0.6246985 , 0.77556205, 0.67102882])

### Model 2 - includes Scaling

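Scaling puts all our features into the same "measured unit".  StandardScaler rescales each feature to a mean of 0 and a standard deviation of 1, using the mean and standard deviation learned from the training data.  Scaling does not change the R2 score of an ordinary linear regression, but it matters for regularized models such as Lasso, where the penalty depends on the size of the coefficients.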
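As a quick check of what scaling does and does not change, the sketch below fits an ordinary linear regression on raw and on standardized versions of the same synthetic features. The feature values and names (`raw`, `target`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two hypothetical features on very different scales (square feet vs. a small count)
raw = np.column_stack([rng.normal(1500, 400, 300), rng.integers(1, 6, 300)])
target = 100 * raw[:, 0] + 5000 * raw[:, 1] + rng.normal(0, 20000, 300)

scaled = StandardScaler().fit_transform(raw)

# Same model, same rows: only the units of the features differ
r2_raw = LinearRegression().fit(raw, target).score(raw, target)
r2_scaled = LinearRegression().fit(scaled, target).score(scaled, target)
print(r2_raw, r2_scaled)
```

The two R2 values agree up to floating-point rounding; the coefficients change, but the fitted predictions do not.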

In [361]:
X = ames[['Mas Vnr Area','Total Bsmt SF','1st Flr SF','Gr Liv Area',
         'Full Bath','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Garage Area','Open Porch SF','Wood Deck SF','Lot Area']]
y = ames['SalePrice']

In [362]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would cause data leakage
Xs_test = ss.transform(X_test) # apply the training mean and std to the test data

Instantiate, Fit, Score Model

In [363]:
lr = LinearRegression()

In [364]:
lr.fit(Xs_train, y_train)

LinearRegression()

In [365]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test) # this model did worse than the baseline score.

(0.7441376464475121, 0.4826040517475141)

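The scores match Model 1 up to floating-point rounding, so scaling alone did not help.  The model is still high variance: the training score is well above the test score.  At this point we want to see which features make a difference in the model.  Lasso is good for feature selection.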

### Model 3 - Numeric Features

In [366]:
X = ames[features]
y = ames['SalePrice']

In [367]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 4, test_size = .1, train_size = .9)
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train) # fit on the training data only; fitting on test data would cause data leakage
Xs_test = ss.transform(X_test) # apply the training mean and std to the test data

In [368]:
lr = LinearRegression()
lr.fit(Xs_train, y_train)

LinearRegression()

In [369]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test) 

(0.8457134063613226, 0.6009567332946075)

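Both scores improved, but the gap between the training score (0.846) and the test score (0.601) shows the model still does not generalize well.  Linear Regression models tend to be high bias and low variance, so the numeric features alone may not be capturing enough of the variation in price.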

### Model 4 - Numeric and Categorical columns

Steps in this process:

- add numeric and categorical columns
- create linear regression model

In [370]:
# one-hot-encode the categorical columns
comb_ames = pd.get_dummies(data = ames, columns = cat, drop_first = True)

In [371]:
comb_ames

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,0,0,0,0,0,0,0,0,1
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,0,0,0,1
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,0,0,0,0,0,0,0,0,1


In [372]:
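# comb_ames already contains the numeric columns, so this merge duplicates them with _x and _y suffixes
comb_ames = pd.merge(left = comb_ames, right = ames[features], left_index = True, right_index = True)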

In [373]:
ames[features]

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,Bsmt Unf SF,BsmtFin SF 2,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Total Bsmt SF
0,68.0,13517,6,8,1976,2005,289.0,533.0,192.0,0.0,...,0,44,0,0,0,0,0,3,2010,725.0
1,43.0,11492,7,5,1996,1997,132.0,637.0,276.0,0.0,...,0,74,0,0,0,0,0,4,2009,913.0
2,68.0,7922,5,7,1953,2007,0.0,731.0,326.0,0.0,...,0,52,0,0,0,0,0,1,2010,1057.0
3,73.0,9802,5,5,2006,2007,0.0,0.0,384.0,0.0,...,100,0,0,0,0,0,0,4,2010,384.0
4,82.0,14235,6,8,1900,1993,0.0,0.0,676.0,0.0,...,0,59,0,0,0,0,0,3,2010,676.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,79.0,11449,8,5,2007,2007,0.0,1011.0,873.0,0.0,...,0,276,0,0,0,0,0,1,2008,1884.0
2047,68.0,12342,4,5,1940,1950,0.0,262.0,599.0,0.0,...,158,0,0,0,0,0,0,3,2009,861.0
2048,57.0,7558,6,6,1928,1950,0.0,0.0,896.0,0.0,...,0,0,0,0,0,0,0,3,2009,896.0
2049,80.0,10400,4,5,1956,1956,0.0,155.0,295.0,750.0,...,0,189,140,0,0,0,0,11,2009,1200.0


In [374]:
34 + 279 # 34 numeric features + 279 columns in the one-hot-encoded frame

313

In [375]:
comb_ames # need to add 'SalePrice' column

Unnamed: 0,Id,PID,Lot Frontage_x,Lot Area_x,Overall Qual_x,Overall Cond_x,Year Built_x,Year Remod/Add_x,Mas Vnr Area_x,BsmtFin SF 1_x,...,Wood Deck SF_y,Open Porch SF_y,Enclosed Porch_y,3Ssn Porch_y,Screen Porch_y,Pool Area_y,Misc Val_y,Mo Sold_y,Yr Sold_y,Total Bsmt SF_y
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,44,0,0,0,0,0,3,2010,725.0
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,74,0,0,0,0,0,4,2009,913.0
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,52,0,0,0,0,0,1,2010,1057.0
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,100,0,0,0,0,0,0,4,2010,384.0
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,59,0,0,0,0,0,3,2010,676.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,0,276,0,0,0,0,0,1,2008,1884.0
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,158,0,0,0,0,0,0,3,2009,861.0
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,0,3,2009,896.0
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,0,189,140,0,0,0,0,11,2009,1200.0


In [376]:
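# comb_ames already has a SalePrice column, so the merged frame ends up with SalePrice_x and SalePrice_y
comb_ames = pd.merge(left = comb_ames, right = ames['SalePrice'], left_index = True, right_index = True)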

In [377]:
comb_ames

Unnamed: 0,Id,PID,Lot Frontage_x,Lot Area_x,Overall Qual_x,Overall Cond_x,Year Built_x,Year Remod/Add_x,Mas Vnr Area_x,BsmtFin SF 1_x,...,Open Porch SF_y,Enclosed Porch_y,3Ssn Porch_y,Screen Porch_y,Pool Area_y,Misc Val_y,Mo Sold_y,Yr Sold_y,Total Bsmt SF_y,SalePrice_y
0,109,533352170,68.0,13517,6,8,1976,2005,289.0,533.0,...,44,0,0,0,0,0,3,2010,725.0,130500
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,74,0,0,0,0,0,4,2009,913.0,220000
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,52,0,0,0,0,0,1,2010,1057.0,109000
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,4,2010,384.0,174000
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,59,0,0,0,0,0,3,2010,676.0,138500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,1587,921126030,79.0,11449,8,5,2007,2007,0.0,1011.0,...,276,0,0,0,0,0,1,2008,1884.0,298751
2047,785,905377130,68.0,12342,4,5,1940,1950,0.0,262.0,...,0,0,0,0,0,0,3,2009,861.0,82500
2048,916,909253010,57.0,7558,6,6,1928,1950,0.0,0.0,...,0,0,0,0,0,0,3,2009,896.0,177000
2049,639,535179160,80.0,10400,4,5,1956,1956,0.0,155.0,...,189,140,0,0,0,0,11,2009,1200.0,144000


In [378]:
comb_ames.drop(columns = ['SalePrice_x'], inplace = True)

In [379]:
X= comb_ames.drop(columns = ['SalePrice_y'])
y = comb_ames['SalePrice_y']

X_train,X_test,y_train, y_test = train_test_split(X, y, random_state = 42, test_size = .1, train_size = .9 )


In [380]:
X_train.shape

(1845, 312)

In [381]:
y_train.shape

(1845,)

In [382]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train), lr.score(X_test, y_test) # the model is currently slightly overfit

(0.9453435706473343, 0.9215803187408077)

This model is currently slightly overfit, meaning it fits the data provided more closely than it fits unseen data.

### Model 5 - Lasso

For this step, we will use a regularizer: Lasso (L1) regression, with the penalty strength alpha chosen by cross-validation.
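Before applying it to the Ames features, here is a small sketch of how cross-validated Lasso can zero out uninformative columns. The synthetic data, in which only the first three of ten columns drive the target, and the names `X_syn`, `y_syn` and `demo_lasso` are assumptions for illustration. The alpha grid is left at the library default rather than the grid used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_syn = rng.normal(size=(300, 10))
# Only the first three columns influence the target (by construction)
y_syn = (5.0 * X_syn[:, 0] - 4.0 * X_syn[:, 1] + 3.0 * X_syn[:, 2]
         + rng.normal(scale=1.0, size=300))

# Scale first so the L1 penalty treats every feature on the same footing
X_syn_s = StandardScaler().fit_transform(X_syn)
demo_lasso = LassoCV(cv=5, random_state=0).fit(X_syn_s, y_syn)

# Column positions whose coefficients were not shrunk to zero
kept = np.flatnonzero(demo_lasso.coef_)
print("alpha:", demo_lasso.alpha_, "non-zero coefficients at columns:", kept)
```

The informative columns keep non-zero coefficients. Whether every noise column is driven to exactly zero depends on the alpha that cross-validation selects.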

In [383]:
ss = StandardScaler()
Xss_train = ss.fit_transform(X_train)
Xss_test = ss.transform(X_test)

In [384]:
# 5000 candidate alphas, log-spaced between 10**-3 and 10**10
lass_alpha = np.logspace(-3,10,5000)

# the algorithm repeatedly adjusts all our weights, and
# sometimes the default number of iterations isn't enough for all those weights to converge.
lasso_cv= LassoCV(alphas= lass_alpha, cv = 5, max_iter = 50000)

# Fit the model; LassoCV picks the best lasso alpha by cross-validation
lasso_cv.fit(Xss_train, y_train);

# best alpha
lasso_cv.alpha_


397.11819327091695

In [385]:
print('Training score:', lasso_cv.score(Xss_train, y_train)) 
print('Test score:', lasso_cv.score(Xss_test, y_test))

lasso_cv.coef_  # sets most features to 0, good for feature selection

Training score: 0.9221904659062675
Test score: 0.918030932365283


array([-0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        1.13767656e+04,  2.80965210e+02,  7.31164529e+03,  1.68783089e+03,
        3.19377790e+03,  3.66533872e+03,  0.00000000e+00, -0.00000000e+00,
        5.14865063e+03,  3.02533923e+02,  0.00000000e+00, -2.35656353e+02,
        1.97417162e+04,  2.81332753e+03, -0.00000000e+00,  2.30132134e+03,
        1.01447129e+03, -2.11028065e+02, -3.07686225e+03,  2.04815856e+03,
        2.97949167e+03, -0.00000000e+00,  4.47777843e+03,  1.50525822e+03,
        1.28003686e+03,  3.18027799e+02,  0.00000000e+00,  2.51070851e+02,
        3.47571529e+03,  2.34323583e+02, -4.92455043e+03, -0.00000000e+00,
        0.00000000e+00,  3.78021220e+02, -0.00000000e+00,  2.26570354e+02,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -1.72749876e+02,  0.00000000e+00, -1.17849171e+03, -2.24467612e+03,
       -1.19243433e+03, -2.23155160e+03, -0.00000000e+00, -8.34883678e+02,
       -4.46588891e+02,  

In [390]:
lasso_feat= list(zip(X_train.columns, lasso_cv.coef_))
lasso_feat

[('Id', -0.0),
 ('PID', 0.0),
 ('Lot Frontage_x', -0.0),
 ('Lot Area_x', 0.0),
 ('Overall Qual_x', 11376.765579213861),
 ('Overall Cond_x', 280.9652097707689),
 ('Year Built_x', 7311.6452918350615),
 ('Year Remod/Add_x', 1687.8308928160122),
 ('Mas Vnr Area_x', 3193.777901816222),
 ('BsmtFin SF 1_x', 3665.3387214168865),
 ('BsmtFin SF 2_x', 0.0),
 ('Bsmt Unf SF_x', -0.0),
 ('Total Bsmt SF_x', 5148.650627597736),
 ('1st Flr SF_x', 302.5339225095322),
 ('2nd Flr SF_x', 0.0),
 ('Low Qual Fin SF_x', -235.65635250775114),
 ('Gr Liv Area_x', 19741.716159155287),
 ('Bsmt Full Bath_x', 2813.3275330654515),
 ('Bsmt Half Bath_x', -0.0),
 ('Full Bath_x', 2301.321337227393),
 ('Half Bath_x', 1014.4712922390709),
 ('Bedroom AbvGr_x', -211.02806528300812),
 ('Kitchen AbvGr_x', -3076.8622502809626),
 ('TotRms AbvGrd_x', 2048.158563503313),
 ('Fireplaces_x', 2979.491665971431),
 ('Garage Yr Blt', -0.0),
 ('Garage Cars_x', 4477.778429221768),
 ('Garage Area_x', 1505.2582182769675),
 ('Wood Deck SF_x', 

In [391]:
# keep only the features whose lasso coefficient is non-zero
new_feat = [name for name, coef in lasso_feat if abs(coef) > 0]

In [392]:
# new features selected 
new_feat

['Overall Qual_x',
 'Overall Cond_x',
 'Year Built_x',
 'Year Remod/Add_x',
 'Mas Vnr Area_x',
 'BsmtFin SF 1_x',
 'Total Bsmt SF_x',
 '1st Flr SF_x',
 'Low Qual Fin SF_x',
 'Gr Liv Area_x',
 'Bsmt Full Bath_x',
 'Full Bath_x',
 'Half Bath_x',
 'Bedroom AbvGr_x',
 'Kitchen AbvGr_x',
 'TotRms AbvGrd_x',
 'Fireplaces_x',
 'Garage Cars_x',
 'Garage Area_x',
 'Wood Deck SF_x',
 'Open Porch SF_x',
 '3Ssn Porch_x',
 'Screen Porch_x',
 'Pool Area_x',
 'Misc Val_x',
 'MS SubClass_30',
 'MS SubClass_45',
 'MS SubClass_80',
 'MS SubClass_90',
 'MS SubClass_120',
 'MS SubClass_150',
 'MS SubClass_160',
 'MS SubClass_190',
 'MS Zoning_C (all)',
 'MS Zoning_RM',
 'Street_Pave',
 'Alley_Pave',
 'Lot Shape_IR2',
 'Lot Shape_IR3',
 'Land Contour_HLS',
 'Land Contour_Lvl',
 'Lot Config_CulDSac',
 'Lot Config_FR3',
 'Land Slope_Mod',
 'Land Slope_Sev',
 'Neighborhood_BrDale',
 'Neighborhood_BrkSide',
 'Neighborhood_Crawfor',
 'Neighborhood_Edwards',
 'Neighborhood_Gilbert',
 'Neighborhood_GrnHill',
 'Ne

In [393]:
X = comb_ames[new_feat]
y = comb_ames['SalePrice_y']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state = 42, test_size = .1, train_size = .9)
Xstrain = ss.fit_transform(Xtrain)
Xstest = ss.transform(Xtest)

#Linear Regression with the selected features
lr = LinearRegression()
lr.fit(Xstrain, ytrain)

lr.score(Xstrain,ytrain), lr.score(Xstest,ytest)

(0.9356974214033755, 0.9173149562393891)

In [394]:
# 5-fold cross-validation on the unscaled training features
cross_val_score(lr, Xtrain, ytrain, cv = 5).mean()

-19865510.912556555

In [395]:
cross_val_score(lr, Xtrain, ytrain, cv = 5)

array([ 9.17287556e-01,  7.23758114e-01,  9.11245941e-01,  8.43714526e-01,
       -9.93275580e+07])
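The cross-validation above runs on the unscaled `Xtrain`. One way to keep scaling inside each fold is to wrap the scaler and the model in a pipeline, sketched below on synthetic data. The arrays `X_cv` and `y_cv` are made up, and this sketch is not meant to diagnose the large negative fold score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Hypothetical features on very different scales
X_cv = rng.normal(size=(250, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_cv = X_cv @ np.array([2.0, 0.5, 0.03, 0.004]) + rng.normal(scale=1.0, size=250)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling statistics
pipe = make_pipeline(StandardScaler(), LinearRegression())
fold_scores = cross_val_score(pipe, X_cv, y_cv, cv=5)
print(fold_scores.round(3), fold_scores.mean().round(3))
```

Looking at the individual fold scores as well as the mean helps, because a single bad fold can dominate the average.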

In [396]:
from sklearn.metrics import mean_squared_error

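# lr was fit on the scaled Xstrain, so predicting on the unscaled Xtest inflates the error below;
# lr.predict(Xstest) would keep the scaling consistent
y_pred = lr.predict(Xtest)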
mse = mean_squared_error(ytest, y_pred)
rmse = np.sqrt(mse)
rmse

110153481.39877217
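An RMSE this far above any sale price is what we would expect when a model fit on scaled features predicts on unscaled ones, as `lr` (fit on `Xstrain`) does with `Xtest` here. The sketch below reproduces that effect on synthetic data. The `area` and `price` arrays and all variable names are assumptions for illustration, not Ames values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Hypothetical living-area feature (square feet) and a price driven by it
area = rng.normal(1500, 400, size=(400, 1))
price = 100 * area[:, 0] + rng.normal(0, 10000, size=400)

a_train, a_test, p_train, p_test = train_test_split(
    area, price, random_state=0, test_size=0.1
)
sc = StandardScaler()
a_train_s = sc.fit_transform(a_train)
a_test_s = sc.transform(a_test)

m = LinearRegression().fit(a_train_s, p_train)

# Consistent: the model sees scaled features at both fit and predict time
rmse_consistent = np.sqrt(mean_squared_error(p_test, m.predict(a_test_s)))
# Mismatched: scaled at fit time, raw square feet at predict time
rmse_mismatched = np.sqrt(mean_squared_error(p_test, m.predict(a_test)))
print(rmse_consistent, rmse_mismatched)
```

The mismatched RMSE is orders of magnitude larger, because coefficients learned per standard deviation are being multiplied by raw feature values.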