# Ames, Iowa Housing Data: Preprocessing and Model Fitting (Automated Selection, Lasso Regression)

We now fit a Lasso regression on every available term and let the L1 penalty decide which ones to keep.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
import patsy
from sklearn.metrics import mean_squared_error
from functools import reduce

import pickle

In [2]:
# Load the preprocessed housing DataFrame; the context manager closes the file for us
with open('./pickles/housingDF.pkl', 'rb') as fileObj:
    housing = pickle.load(fileObj)

The predictors of this model are simply every column in the data, with each categorical column expanded into dummy variables.

In [3]:
# Join every predictor column name into a single patsy formula; '- 1' drops the intercept column
terms = ' + '.join(housing.drop('SalePrice', axis=1).columns)
formula = f'SalePrice ~ {terms} - 1'
y, x = patsy.dmatrices(formula, housing)
x = pd.DataFrame(x, columns=x.design_info.column_names)
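To make the dummy expansion concrete, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its column names are hypothetical and only illustrate how `patsy.dmatrices` names the columns it generates from a formula of the same shape as the one above.

```python
import pandas as pd
import patsy

# Hypothetical toy data, not the housing set
toy = pd.DataFrame({
    'price': [10.0, 12.0, 15.0, 11.0],
    'area':  [1.0, 2.0, 3.0, 1.5],
    'color': ['red', 'blue', 'red', 'blue'],
    'style': ['flat', 'ranch', 'ranch', 'flat'],
})

# Same formula shape as above: every predictor, no intercept
y_toy, x_toy = patsy.dmatrices('price ~ area + color + style - 1', toy)

# Names of the generated design-matrix columns
toy_columns = list(x_toy.design_info.column_names)
print(toy_columns)
```

With the intercept removed, the first categorical term keeps all of its levels, while later categorical terms drop a reference level and get the `[T.level]` naming that shows up in the coefficient listing further down.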

In [16]:
model = LassoCV(n_alphas=100, cv=10)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, shuffle=True)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)
ytrain = ytrain.ravel()
ytest = ytest.ravel()

In [17]:
model.fit(xtrain, ytrain)
model.score(xtrain, ytrain), model.score(xtest, ytest)

(0.9403174798810787, 0.9172900489195338)

In [18]:
mean_squared_error(ytest, model.predict(xtest))**0.5

22905.593339767503
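Taking the square root of the mean squared error puts the error back into the units of `SalePrice`, so the figure above reads as a typical miss in dollars. A minimal sketch of the same calculation in plain NumPy, with a hypothetical helper `rmse` and made-up prices:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two made-up sales, each predicted 10,000 off in opposite directions
example_rmse = rmse([200000, 150000], [210000, 140000])
print(example_rmse)
```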

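With access to every column, this model performs much better than the earlier, more restricted ones. Because the features were standardized before fitting, the coefficient magnitudes are comparable across terms, so the largest ones point to the most influential predictors.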

In [19]:
pd.Series(model.coef_, index=x.columns).abs().sort_values(ascending=False)[:20]

GrLivArea                  26353.427553
OverallQual                10292.930145
KitchenQual[T.TA]           9485.737516
TotalBsmtSF                 8929.059912
KitchenQual[T.Gd]           8820.803876
ExterQual[T.TA]             8308.980146
YearBuilt                   8168.921848
BsmtFinSF1                  7814.613651
ExterQual[T.Gd]             6990.235406
Neighborhood[T.StoneBr]     6169.821477
LotArea                     6038.204565
OverallCond                 6012.439953
BsmtQual[T.Ex]              5700.814160
Neighborhood[T.NridgHt]     5464.955813
MasVnrArea                  5297.016205
GarageArea                  5179.195799
BsmtExposure[T.Gd]          4008.593197
Neighborhood[T.NoRidge]     3709.762650
SaleType[T.New]             3586.626501
Functional[T.Typ]           3445.637161
dtype: float64
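Alongside the largest coefficients, it can be useful to see how many terms the Lasso discarded entirely and which penalty strength cross-validation settled on. The sketch below assumes synthetic data from `make_regression` rather than the housing set; all of the `demo` names are hypothetical. Applied to the fitted model above, the same two lines would be `model.alpha_` and `np.count_nonzero(model.coef_)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 features, only 5 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

demo_model = LassoCV(cv=5).fit(X_demo, y_demo)

# Coefficients shrunk exactly to zero are the terms the Lasso dropped
n_kept = int(np.count_nonzero(demo_model.coef_))
n_dropped = demo_model.coef_.size - n_kept
print(f'alpha={demo_model.alpha_:.4f}, kept={n_kept}, dropped={n_dropped}')
```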

In [20]:
scores = cross_val_score(model, xtrain, ytrain.ravel(), cv=10)
np.mean(scores), np.std(scores)

(0.9193652215978899, 0.017492518367827052)
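One caveat on the cross-validation above: `xtrain` was scaled once using all training rows, so each validation fold has already influenced the scaler. The effect is usually small, but a common way to avoid it is to wrap the scaler and the model in a pipeline so that scaling is refit inside every fold. A minimal sketch on synthetic data (the `demo` names are hypothetical, and this is not the setup used for the numbers reported here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(pipe_scores.mean(), pipe_scores.std())
```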

Next we check whether adding an interaction that seems relevant, living area by overall quality, improves the score.

In [21]:
interactions = 'GrLivArea:OverallQual'
formula = f'SalePrice ~ {terms} + {interactions} - 1'
y, x = patsy.dmatrices(formula, housing)
x = pd.DataFrame(x, columns=x.design_info.column_names)

In [25]:
model = LassoCV(n_alphas=100, cv=10)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, shuffle=True)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)

In [29]:
model.fit(xtrain, ytrain.ravel())
model.score(xtrain, ytrain), model.score(xtest, ytest)

(0.9483806818554631, 0.921052054879546)

In [27]:
scores = cross_val_score(model, xtrain, ytrain.ravel(), cv=10)
np.mean(scores), np.std(scores)

(0.922529021226018, 0.01649240846022595)

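The mean cross-validated R² moves from about 0.919 to about 0.923, well within one standard deviation of the fold scores (about 0.017), and the test R² goes from about 0.917 to about 0.921. That is no change on a scale we would care about, especially since the two models were fit on different random train/test splits.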
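Because the two fits above use different random splits, part of any difference between them is split noise. One way to tighten the comparison is to score both design matrices on identical folds and look at the per-fold differences. The sketch below assumes synthetic data in which the interaction is real by construction, so unlike the housing result it does show a gain; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
area = rng.normal(size=n)
qual = rng.normal(size=n)
# Made-up target with a genuine area-by-quality interaction
price = 3.0 * area + 2.0 * qual + 1.5 * area * qual + rng.normal(scale=0.5, size=n)

X_base = np.column_stack([area, qual])
X_inter = np.column_stack([area, qual, area * qual])

# A fixed random_state gives both models exactly the same folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_r2(X):
    pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
    return cross_val_score(pipe, X, price, cv=folds)

base_scores = cv_r2(X_base)
inter_scores = cv_r2(X_inter)
fold_diff = inter_scores - base_scores
print(fold_diff.mean(), fold_diff.std())
```

If the per-fold differences hover around zero, as the housing scores suggest they would, the extra term is not earning its place.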