# Ames, Iowa Housing Data: Preprocessing and Model Fitting (Intuitive Interactions, Linear Regression)

In this notebook, we process the data further before fitting a model to it.  The data is split into training and testing sets, scaled, and fit.  Predictors are chosen based on the previous EDA, with additional area variables included; these choices are admittedly educated guesses.  The distributions were skewed, but we use a linear regression here anyway to interpret large-scale trends.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import patsy
from sklearn.metrics import mean_squared_error

import pickle

np.random.seed(1)

In [2]:
with open('./pickles/housingDF.pkl', 'rb') as fileObj:
    housing = pickle.load(fileObj)

Here we explore modelling with a minimal set of interactions.  Lot frontage might interact more obviously with lot configuration, but lot frontage was discarded earlier because of missing information, and imputing it might have biased the results.
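
To make the idea of a numeric-by-categorical interaction concrete, here is a minimal sketch of roughly what a formula term such as `LotArea:LotConfig` expands to when the formula has no intercept or main effects. It is built by hand with pandas rather than patsy, the values are made up, and only the column naming is taken from the coefficient table in this notebook.

```python
import pandas as pd

# Toy data: the column names mirror the notebook, the values are invented.
toy = pd.DataFrame({
    'LotArea': [8000.0, 9500.0, 12000.0, 7000.0],
    'LotConfig': ['Inside', 'Corner', 'Inside', 'CulDSac'],
})

# One indicator column per LotConfig level (no level is dropped).
indicators = pd.get_dummies(toy['LotConfig'], dtype=float)

# Multiply each indicator by the numeric variable: every level gets its own
# LotArea column, which is zero for rows that belong to a different level.
interaction = indicators.mul(toy['LotArea'], axis=0)
interaction.columns = [f'LotArea:LotConfig[{level}]' for level in indicators.columns]

print(interaction)
```

Each row has exactly one non-zero entry, so the regression effectively fits a separate lot-area slope for each lot configuration.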

In [3]:
interactions = 'TotalBsmtSF:BsmtExposure + LotArea:LotConfig + GarageArea:GarageFinish + OverallQual:GrLivArea'
formula = f'SalePrice ~ {interactions} - 1'
y, x = patsy.dmatrices(formula, housing)
x = pd.DataFrame(x, columns=x.design_info.column_names)

In [4]:
model = LinearRegression()
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.7, shuffle=True)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)
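
Note that `test_size` is the fraction of rows held out, so `test_size=0.7` fits the model on roughly 30% of the data and evaluates on the remaining 70%. The sketch below illustrates this on made-up arrays; the `random_state` argument is an assumption added for reproducibility (the notebook relies on `np.random.seed(1)` instead).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 made-up rows with two features; only the row counts matter here.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.7, shuffle=True, random_state=1
)

# The test set is the larger of the two pieces.
print(len(X_tr), len(X_te))
```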

In [5]:
model.fit(xtrain, ytrain)
print(f'The model\'s R^2 score is {model.score(xtrain, ytrain)} on the training data, {model.score(xtest, ytest)} on the test data.')

The model's R^2 score is 0.896104532472778 on the training data, 0.8530487255962007 on the test data.


In [6]:
mean_squared_error(ytest, model.predict(xtest))**0.5

30125.558788029295
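
The value above is the root mean squared error (RMSE) on the test set, expressed in the same units as `SalePrice`. As a small worked example on made-up prices, the sketch below checks that taking the square root of `mean_squared_error` matches the formula computed by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented sale prices and predictions; the errors are 10000, -20000 and 30000.
y_true = np.array([200000.0, 150000.0, 310000.0])
y_pred = np.array([190000.0, 170000.0, 280000.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity computed the way the notebook does it.
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_manual, rmse_sklearn)
```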

This model scores fairly consistently in cross-validation, and the largest weight belongs to the quality-area interaction (OverallQual:GrLivArea), as expected.  The garage-area coefficient for garages of type 0 (houses listed as having no garage) is effectively zero (about -9e-13): this is numerical noise rather than a real effect, so it makes no sense to infer a relationship from it.

In [7]:
pd.Series(model.coef_[0], index=x.columns).sort_values(ascending=False)

OverallQual:GrLivArea           4.674827e+04
GarageArea:GarageFinish[Fin]    2.352433e+04
TotalBsmtSF:BsmtExposure[Gd]    2.302346e+04
GarageArea:GarageFinish[RFn]    2.038898e+04
TotalBsmtSF:BsmtExposure[Av]    1.326832e+04
TotalBsmtSF:BsmtExposure[No]    1.168230e+04
GarageArea:GarageFinish[Unf]    8.867471e+03
TotalBsmtSF:BsmtExposure[Mn]    7.189679e+03
LotArea:LotConfig[Inside]       4.635108e+03
LotArea:LotConfig[CulDSac]      2.550493e+03
LotArea:LotConfig[Corner]       1.698476e+03
LotArea:LotConfig[FR2]          9.548840e+02
TotalBsmtSF:BsmtExposure[0]     0.000000e+00
GarageArea:GarageFinish[0]     -9.094947e-13
LotArea:LotConfig[FR3]         -1.684410e+03
dtype: float64
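
The near-zero coefficients for the `[0]` levels are presumably there because those columns are zero in every row: a house with no garage or basement contributes zero area, so the product with the indicator never differs from zero. The sketch below reproduces the effect on made-up data; the names `informative` and `all_zero` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up data: one informative feature and one column that is zero in every
# row, standing in for GarageArea multiplied by the "no garage" indicator.
informative = rng.normal(size=50)
all_zero = np.zeros(50)
X = np.column_stack([informative, all_zero])
y = 3.0 * informative + rng.normal(scale=0.1, size=50)

# StandardScaler leaves a zero-variance column at zero instead of dividing by zero.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

print(X_scaled[:, 1].min(), X_scaled[:, 1].max())
print(model.coef_)  # the second coefficient carries no information
```

Because the column never varies, any weight assigned to it leaves the predictions unchanged, so its value should not be interpreted.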

In [8]:
scores = cross_val_score(model, xtrain, ytrain.ravel(), cv=10)
print(f'With 10 folds, the R^2 score is {np.mean(scores)} +- {np.std(scores)}')

With 10 folds, the R^2 score is 0.8768389605082882 +- 0.037195135717457986
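
One caveat: the cell above cross-validates on data that was already scaled with statistics from the whole training set, so each validation fold has influenced the scaler. For plain least squares this is unlikely to change the scores much, but a possible alternative is to wrap the scaler and model in a pipeline so that scaling is refit within every fold. The following is a sketch on synthetic data, not the notebook's housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the design matrix and target.
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# The scaler is refit on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=10)

print(f'With 10 folds, the R^2 score is {np.mean(scores):.3f} +- {np.std(scores):.3f}')
```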


This may simply reflect a poor choice of features: in fact, we get similar results by including overall quality and living area as separate additive terms rather than as their product.

In [9]:
interactions = 'TotalBsmtSF:BsmtExposure + LotArea:LotConfig + GarageArea:GarageFinish + OverallQual + GrLivArea'
formula = f'SalePrice ~ {interactions} - 1'
y, x = patsy.dmatrices(formula, housing)
x = pd.DataFrame(x, columns=x.design_info.column_names)

In [10]:
model = LinearRegression()
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.7, shuffle=True)

scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)

In [11]:
model.fit(xtrain, ytrain)
print(f'The model\'s R^2 score is {model.score(xtrain, ytrain)} on the training data, {model.score(xtest, ytest)} on the test data.')

The model's R^2 score is 0.8686897681690234 on the training data, 0.8430454758047368 on the test data.


In [12]:
scores = cross_val_score(model, xtrain, ytrain.ravel(), cv=10)
print(f'With 10 folds, the R^2 score is {np.mean(scores)} +- {np.std(scores)}')

With 10 folds, the R^2 score is 0.8518301924838829 +- 0.030771909533508765


Rather than guessing at features, we should automate the selection process.  We first try this using a Lasso regression model.