## Summary of this notebook

In the [last notebook](./4_modeling.ipynb), we did some modeling of `SalePrice` (or its natural logarithm) as a linear function of various numeric features plus various categorical features that had been `above_below_mid` (abm) encoded.  We did so in an attempt to avoid overfitting: just one-hot encoding all categorical variables would lead to an extremely large number of parameters in our model, so we encoded these categorical variables as -1 ("below average"), 0 ("approximately average"), or 1 ("above average").

In this notebook, we take a different approach: we one-hot encode all categorical variables and then use more advanced methods to reduce the effective number of features in our linear model.  Specifically, we use regularized regressions (LASSO, Ridge, and ElasticNet) to eliminate features from our linear models or shrink their coefficients, so as to avoid overfitting.

Finally, we compare the results of these "advanced" models with our more basic models from the last notebook and come to our conclusions.
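
To make the "eliminate features / shrink coefficients" distinction concrete, here is a minimal sketch on synthetic data (not the Ames data).  The data-generating process, the `alpha` values, and all variable names are assumptions chosen purely for illustration: LASSO tends to set the coefficients of uninformative features to exactly zero, while Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 30 features, only 5 matter
rng = np.random.default_rng(0)
n_samples, n_features = 200, 30
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize so the penalty treats every feature on the same scale
X_std = StandardScaler().fit_transform(X)

# The alpha values are arbitrary here; the notebook tunes them by grid search
lasso = Lasso(alpha=0.2, max_iter=10_000).fit(X_std, y)
ridge = Ridge(alpha=10.0).fit(X_std, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Coefficients set exactly to zero by LASSO: {n_zero_lasso} of {n_features}")
print(f"Coefficients set exactly to zero by Ridge: {n_zero_ridge} of {n_features}")
```

This is why LASSO can double as a feature-selection step, whereas Ridge keeps every one-hot column in the model with a (possibly tiny) coefficient.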

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#sklearn
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.pipeline import Pipeline

import itertools

## Import training data (but not reserved training data)

In [2]:
#Data import
df = pd.read_csv('../datasets/train_processed.csv')
len(df)

1734

In [3]:
#Extract the numeric and categorical features
numerics = []
categoricals = []

for feature in df.dtypes.index:
    if (df.dtypes[feature] == 'int64') or (df.dtypes[feature] == 'float64'):
        numerics.append(feature)
    else:
        categoricals.append(feature)
        
print(f"Numeric variables include {numerics[:5]}")
print(f"Categorical variables incl {categoricals[:5]}")

Numeric variables include ['Id', 'Lot Frontage', 'Lot Area', 'Lot Shape', 'Utilities']
Categorical variables incl ['MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour']
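
As an aside, the same split can be sketched with `DataFrame.select_dtypes` instead of an explicit loop.  The toy dataframe below is hypothetical (it only borrows a few column names from the real data), and the `_alt` names are ours:

```python
import pandas as pd

# Hypothetical miniature of the training data, for illustration only
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'Lot Area': [8450.0, 9600.0, 11250.0],
    'MS Zoning': ['RL', 'RM', 'RL'],
})

# Columns stored as int64 / float64 are treated as numeric;
# everything else is treated as categorical, mirroring the loop above
numerics_alt = toy.select_dtypes(include=['int64', 'float64']).columns.tolist()
categoricals_alt = toy.select_dtypes(exclude=['int64', 'float64']).columns.tolist()

print(numerics_alt)
print(categoricals_alt)
```

Either way, the split depends entirely on how each column is stored: a numerically coded category would land in the numeric list unless it was converted to strings during preprocessing.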


In [4]:
#Remove the ID number since it's not predictive, and remove our target variable
numerics.remove('Id')
numerics.remove('SalePrice')

## Read in the reserved data set

In [5]:
tests = pd.read_csv('../datasets/train_processed_reserved.csv')

# Advanced Regression Modeling

In [6]:
y = df['SalePrice']

## LASSO Pipeline

Here, we'll create a pipeline to perform the following tasks:
1. Dummify the categorical variables we want to include.
2. Standardize all variables.
3. Perform a LASSO regularized regression with a particular hyperparameter "alpha".

Then we'll use `GridSearchCV` to search for the hyperparameter value that maximizes the cross-validation score (by default, $R^2$) of the resulting linear model.

In [7]:
ohe = OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse=False)

In [8]:
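#Write a column transformer to dummify only the categorical variables.
#Note: remainder='passthrough' leaves every other column untouched, which
#includes 'Id' if it is still present in the dataframe passed to .fit()
dummify_cats = ColumnTransformer(
    transformers=[
        ('ohe', ohe, categoricals)
    ],
    remainder='passthrough'
)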

In [9]:
#Write a pipeline to dummify categorical variables,
#then standardize all variables,
#then perform a LASSO regularized regression
lasso_pipe = Pipeline([
    ('dcats', dummify_cats),
    ('ss', StandardScaler()),
    ('lasso', Lasso(max_iter=10_000))
])

In [10]:
lasso_params = {
    'lasso__alpha': np.logspace(0, 4, 200)
}

In [11]:
#Perform a gridsearch to find the optimal LASSO hyperparameter
lasso_gridsearch = GridSearchCV(lasso_pipe,     
                              lasso_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [12]:
lasso_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [13]:
#Find the best parameter value
lasso_gridsearch.best_params_

{'lasso__alpha': 283.3096101839324}

Let's record the results of this regression.  To do so, we'll write a function:

In [14]:
def results(df, tests, target, model, model_name, results_dict):
    '''
    Inputs:
    df: A training data set (dataframe)
    tests: A test data set (dataframe)
    target: The variable name to be predicted
    model: A fitted GridSearchCV model with .predict methods
    model_name: What you want this model to be called in the
        outputted pair (a string)
    results_dict: A dictionary of the results that have been recorded
        so far (using this function on other models)
        
    Outputs:
    Returns the inputted results_dict but adds the item
        model_name : [RESULTS]
        where [RESULTS] is a dictionary of the results
        of that model.
    ''' 
    
    X_train = df.drop(columns=target)
    X_test = tests.drop(columns=target)
    y_train = df[target]
    y_test = tests[target]
    
    bp_dict = model.best_params_
    best_hyperparam = " ".join([f"{key.split('__')[-1]}: {str( round(value,5) )}" for key, value in bp_dict.items()])
    
    results_dict[model_name] = {
        'best_hyperparam' : best_hyperparam,
        'r2_train' : model.score(X_train, y_train),
        'r2_test' : model.score(X_test, y_test),
        'mae_train' : metrics.mean_absolute_error(y_train, model.predict(X_train)),
        'mae_test' : metrics.mean_absolute_error(y_test, model.predict(X_test)),
        'rmse_train' : metrics.mean_squared_error(y_train, model.predict(X_train), squared=False),
        'rmse_test' : metrics.mean_squared_error(y_test, model.predict(X_test), squared=False)
                }
    
    return results_dict
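
One caveat about the RMSE lines in this function: as far as we are aware, the `squared` argument of `mean_squared_error` is deprecated in newer scikit-learn releases.  A version-independent sketch is to take the square root explicitly.  The prices below are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

# RMSE: square root of the mean squared error
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

# MAE: mean of the absolute errors
mae = metrics.mean_absolute_error(y_true, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```

Here the RMSE (about 14,142) exceeds the MAE (about 13,333) because squaring gives the single 20,000 error more weight than the two 10,000 errors.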

In [15]:
results_dict = {}

In [16]:
model = lasso_gridsearch
model_name = 'LASSO gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [17]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


## LASSO log-transform

In [18]:
#Make a log transformer for a LASSO 
lasso_logreg = TransformedTargetRegressor(
    regressor=Lasso(max_iter=10_000),
    func = np.log,
    inverse_func=np.exp
) 

In [19]:
#Write a pipeline to dummify categorical variables,
#then standardize all variables,
#then perform a LASSO regularized regression on log(SalePrice)
lasso_log_pipe = Pipeline([
    ('dcats', dummify_cats),
    ('ss', StandardScaler()),
    ('ll', lasso_logreg)
])

In [20]:
lasso_log_params = {
    'll__regressor__alpha': np.logspace(-4, 0, 200)
}

In [21]:
#Perform a gridsearch to find the optimal LASSO hyperparameter
lasso_log_gridsearch = GridSearchCV(lasso_log_pipe,     
                              lasso_log_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [22]:
lasso_log_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [23]:
model = lasso_log_gridsearch
model_name = 'LASSO Log(y) gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [24]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.sort_values('r2_test', ascending=False)

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO Log(y) gridsearch | alpha: 0.00233 | 0.952689 | 0.928585 | 12033.71688 | 13463.558197 | 17198.997617 | 21596.822248 |
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


## Summary of results so far

The above dataframe contains the results of the optimal (from gridsearch) LASSO regularized regressions.  They are sorted by $R^2$ scores on the reserved (test) data set.

As before, it seems that performing a log transform on `SalePrice` before running any regressions yields slightly better results.  Thus, for the remainder of our regularized regressions, we will employ this transformation.
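
For readers unfamiliar with `TransformedTargetRegressor`, the sketch below checks on synthetic data (an assumption; not the Ames data) that it behaves like fitting on `np.log(y)` by hand and exponentiating the predictions.  All predictions, and therefore all scores in our results table, stay on the original dollar scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso

# Synthetic, strictly positive target whose log is linear in the features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.exp(12 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=100))

# Wrapped version: fits on log(y), returns predictions mapped back through exp
ttr = TransformedTargetRegressor(
    regressor=Lasso(alpha=0.001, max_iter=10_000),
    func=np.log,
    inverse_func=np.exp,
).fit(X, y)
ttr_preds = ttr.predict(X)

# Manual version: transform the target ourselves, then invert the predictions
manual = Lasso(alpha=0.001, max_iter=10_000).fit(X, np.log(y))
manual_preds = np.exp(manual.predict(X))

print(np.allclose(ttr_preds, manual_preds))
```

Because the regression now sees a target on the log scale, sensible `alpha` values are orders of magnitude smaller, which is why the grid for the log models spans `np.logspace(-4, 0, 200)` rather than `np.logspace(0, 4, 200)`.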

So far, these results are fairly comparable to the best results from the `above_below_mid` encoded regressions in the last notebook.  The $R^2$ and mean absolute error (MAE) scores on the test data set are slightly better with the LASSO Log(y) regression than with the best regression from the last notebook, but it's not clear that this slight increase in expected performance is worth the loss of interpretability that accompanies these sorts of regularized regressions.

Below, we will try some Ridge and ElasticNet regularized regressions.

## Ridge Log-Transform

In [25]:
#Make a log transformer for a Ridge regularized regression
ridge_logreg = TransformedTargetRegressor(
    regressor=Ridge(max_iter=10_000),
    func = np.log,
    inverse_func=np.exp
) 

In [26]:
#Write a pipeline to dummify categorical variables,
#then standardize all variables,
#then perform a Ridge regularized regression on log(SalePrice)
ridge_log_pipe = Pipeline([
    ('dcats', dummify_cats),
    ('ss', StandardScaler()),
    ('rl', ridge_logreg)
])

In [27]:
ridge_log_params = {
    'rl__regressor__alpha': np.logspace(-2, 3, 400)
}

In [28]:
#Perform a gridsearch to find the optimal Ridge hyperparameter
ridge_log_gridsearch = GridSearchCV(ridge_log_pipe,     
                              ridge_log_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [29]:
ridge_log_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 400 candidates, totalling 2000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [30]:
model = ridge_log_gridsearch
model_name = 'Ridge Log(y) gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [31]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.sort_values('r2_test', ascending=False)

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO Log(y) gridsearch | alpha: 0.00233 | 0.952689 | 0.928585 | 12033.71688 | 13463.558197 | 17198.997617 | 21596.822248 |
| Ridge Log(y) gridsearch | alpha: 182.24332 | 0.956421 | 0.920388 | 11648.14588 | 14644.86084 | 16506.690785 | 22802.593371 |
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


## ElasticNet Log-transform gridsearch

In [32]:
#Make a log transformer for an ElasticNet regularized regression
en_logreg = TransformedTargetRegressor(
    regressor=ElasticNet(max_iter=10_000),
    func = np.log,
    inverse_func=np.exp
) 

In [33]:
#Write a pipeline to dummify categorical variables,
#then standardize all variables,
#then perform an ElasticNet regularized regression on log(SalePrice)
en_log_pipe = Pipeline([
    ('dcats', dummify_cats),
    ('ss', StandardScaler()),
    ('enl', en_logreg)
])

In [34]:
en_log_params = {
    'enl__regressor__alpha': np.logspace(-3, 3, 200),
    'enl__regressor__l1_ratio': np.linspace(0,1,21)
}

In [35]:
#Perform a gridsearch to find the optimal ElasticNet hyperparameters
en_log_gridsearch = GridSearchCV(en_log_pipe,     
                              en_log_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [36]:
en_log_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 4200 candidates, totalling 21000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [37]:
en_log_gridsearch.best_params_

{'enl__regressor__alpha': 0.04552935074866948,
 'enl__regressor__l1_ratio': 0.05}

In [38]:
model = en_log_gridsearch
model_name = 'ElasticNet Log(y) gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [39]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.sort_values('r2_test', ascending=False)

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO Log(y) gridsearch | alpha: 0.00233 | 0.952689 | 0.928585 | 12033.71688 | 13463.558197 | 17198.997617 | 21596.822248 |
| ElasticNet Log(y) gridsearch | alpha: 0.04553 l1_ratio: 0.05 | 0.952876 | 0.927212 | 12042.889862 | 13677.910275 | 17164.911949 | 21803.539834 |
| Ridge Log(y) gridsearch | alpha: 182.24332 | 0.956421 | 0.920388 | 11648.14588 | 14644.86084 | 16506.690785 | 22802.593371 |
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


As we can see, this ElasticNet regularized regression performs best when its `l1_ratio` is 0.05: that is, when its penalty is weighted much more heavily toward the $L^2$ (Ridge) term than toward the $L^1$ (LASSO) term.
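
To see what that `l1_ratio` means in terms of penalty strength, recall that scikit-learn documents the ElasticNet objective as $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \lVert w\rVert_1 + \frac{1}{2}\alpha(1 - \text{l1\_ratio})\lVert w\rVert_2^2$.  The sketch below splits the best `alpha` into its two weights; the helper name is ours and is not part of scikit-learn.

```python
def elastic_net_penalty_weights(alpha, l1_ratio):
    """Split alpha into the weights on the L1 and squared-L2 penalty terms
    (hypothetical helper following the documented ElasticNet objective)."""
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_weight, l2_weight

# Best hyperparameters reported by the ElasticNet grid search above
l1_w, l2_w = elastic_net_penalty_weights(0.04552935074866948, 0.05)
print(f"L1 weight: {l1_w:.5f}")
print(f"L2 weight: {l2_w:.5f}")
```

The resulting $L^1$ weight (about 0.00228) is close to the best `alpha` of the LASSO Log(y) model (about 0.00233).  One tentative reading is that this ElasticNet behaves like the tuned LASSO with an additional $L^2$ term, which may help explain why the two models score so similarly.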

Its performance is slightly worse than our original LASSO Log(y) regression, which makes sense since we used a rather sparse space of parameters for our gridsearch (so it didn't take too much time).  Let's try one more grid search with a much more refined space of hyperparameters, based on our last attempt:

In [40]:
refined_enlog_params = {
    'enl__regressor__alpha': np.linspace(.035, .055, 101),
    'enl__regressor__l1_ratio': np.linspace(.04,.06,21)
}
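
The refined grid above is a linearly spaced window around the previous best values.  A small sketch of how such a window could be generated is shown below; `refine_grid` is a hypothetical helper, and the centers and half-widths are simply read off the grid above.

```python
import numpy as np

def refine_grid(center, half_width, num):
    """Evenly spaced grid of `num` points covering center +/- half_width
    (hypothetical helper for illustration)."""
    return np.linspace(center - half_width, center + half_width, num)

# Same ranges as the refined ElasticNet grid: alpha in [.035, .055], l1_ratio in [.04, .06]
alpha_grid = refine_grid(0.045, 0.010, 101)
l1_grid = refine_grid(0.05, 0.01, 21)

# Every (alpha, l1_ratio) pair is one grid-search candidate
n_candidates = len(alpha_grid) * len(l1_grid)
print(f"alpha step: {alpha_grid[1] - alpha_grid[0]:.4f}, candidates: {n_candidates}")
```

With 101 values of `alpha` and 21 values of `l1_ratio`, the search evaluates 2121 candidates, each fit once per cross-validation fold.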

In [41]:
#Perform a gridsearch to find the optimal ElasticNet hyperparameters
refined_enlog_gridsearch = GridSearchCV(en_log_pipe,     
                              refined_enlog_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [42]:
refined_enlog_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 2121 candidates, totalling 10605 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [43]:
model = refined_enlog_gridsearch
model_name = 'Refined ElasticNet Log(y) gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [44]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.sort_values('r2_test', ascending=False)

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO Log(y) gridsearch | alpha: 0.00233 | 0.952689 | 0.928585 | 12033.71688 | 13463.558197 | 17198.997617 | 21596.822248 |
| Refined ElasticNet Log(y) gridsearch | alpha: 0.0448 l1_ratio: 0.05 | 0.952969 | 0.927259 | 12033.739171 | 13678.306841 | 17147.883627 | 21796.487489 |
| ElasticNet Log(y) gridsearch | alpha: 0.04553 l1_ratio: 0.05 | 0.952876 | 0.927212 | 12042.889862 | 13677.910275 | 17164.911949 | 21803.539834 |
| Ridge Log(y) gridsearch | alpha: 182.24332 | 0.956421 | 0.920388 | 11648.14588 | 14644.86084 | 16506.690785 | 22802.593371 |
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


## A more refined LASSO regression

In [45]:
refined_lassolog_params = {
    'll__regressor__alpha': np.linspace(.0005, .0035, 400)
}

In [46]:
#Perform a gridsearch to find the optimal LASSO hyperparameter
refined_lassolog_gridsearch = GridSearchCV(lasso_log_pipe,     
                              refined_lassolog_params,
                              cv=5,        
                              verbose=1,
                              n_jobs=-2)

In [47]:
refined_lassolog_gridsearch.fit(df.drop(columns='SalePrice'), y)

Fitting 5 folds for each of 400 candidates, totalling 2000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [48]:
model = refined_lassolog_gridsearch
model_name = 'Refined LASSO Log(y) gridsearch'

results_dict = results(df, tests, 'SalePrice', model, model_name, results_dict)

In [49]:
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
results_df.sort_values('r2_test', ascending=False)

| | best_hyperparam | r2_train | r2_test | mae_train | mae_test | rmse_train | rmse_test |
|---|---|---|---|---|---|---|---|
| LASSO Log(y) gridsearch | alpha: 0.00233 | 0.952689 | 0.928585 | 12033.71688 | 13463.558197 | 17198.997617 | 21596.822248 |
| Refined LASSO Log(y) gridsearch | alpha: 0.00233 | 0.95267 | 0.928582 | 12035.498267 | 13463.161311 | 17202.303149 | 21597.362742 |
| Refined ElasticNet Log(y) gridsearch | alpha: 0.0448 l1_ratio: 0.05 | 0.952969 | 0.927259 | 12033.739171 | 13678.306841 | 17147.883627 | 21796.487489 |
| ElasticNet Log(y) gridsearch | alpha: 0.04553 l1_ratio: 0.05 | 0.952876 | 0.927212 | 12042.889862 | 13677.910275 | 17164.911949 | 21803.539834 |
| Ridge Log(y) gridsearch | alpha: 182.24332 | 0.956421 | 0.920388 | 11648.14588 | 14644.86084 | 16506.690785 | 22802.593371 |
| LASSO gridsearch | alpha: 283.30961 | 0.937949 | 0.911396 | 13794.44189 | 15884.919076 | 19696.764324 | 24055.961883 |


## Discussion of Results

As we can see, our refined ElasticNet model does ever-so-slightly better than our best LASSO model in terms of $R^2$ on the training data set, but it does ever-so-slightly worse in terms of $R^2$ on the test data set.  Thus, the best LASSO and ElasticNet models seem to have roughly comparable explanatory power, even though the ElasticNet model's optimal `l1_ratio` hyperparameter makes it pretty close to a Ridge regression.

## Aside: Making a Kaggle submission using a pipeline

In [50]:
kaggle = pd.read_csv('../datasets/test_processed.csv')

In [51]:
#Make a new dataset using both our training data and our reserved data
#(The model will be tested on the Kaggle data)

X = pd.concat([df.drop(columns='SalePrice'), tests.drop(columns='SalePrice')])
y = pd.concat([df['SalePrice'], tests['SalePrice']])
len(X)

2040

In [52]:
#Choose the model
model = refined_lassolog_gridsearch

model.fit(X, y)

Fitting 5 folds for each of 400 candidates, totalling 2000 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('dcats',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe',
                                                                         OneHotEncoder(drop='if_binary',
                                                                                       handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         ['MS '
                                                                          'SubClass',
                                                                          'MS '
                                                                          'Zoning',
                                                                          'Street',
                                    

In [53]:
kaggle_preds = model.predict(kaggle)

In [54]:
kaggle['SalePrice'] = kaggle_preds
output = kaggle[['Id', 'SalePrice']]
output.head()

| | Id | SalePrice |
|---|---|---|
| 0 | 2658 | 111880.089205 |
| 1 | 2718 | 160007.543064 |
| 2 | 2414 | 209796.919544 |
| 3 | 1989 | 103680.093415 |
| 4 | 625 | 171396.924955 |


In [55]:
#Change this cell with the name of the csv to be exported
output_filename = 'lasso_predictions'

output.to_csv(f'../datasets/kaggle_submissions/{output_filename}.csv', index=False)

# Conclusions

We have learned that it is certainly possible to make relatively accurate predictions of the sale prices of homes in Ames, Iowa using the features in our data set.  To get the best predictions, we one-hot encoded the categorical variables in the data set and then ran hyperparameter grid searches over LASSO and ElasticNet regularized regressions.  The best-performing models selected in this way achieved $R^2$ scores of about .9286 on the reserved test data, meaning that they explain about 92.86% of the variability of homes' sale prices using the features they include.  They achieved a mean absolute error of about 13,463, meaning their predictions were off by about \$13,460 on average.
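
As a reminder of what these two numbers measure, here is a small worked sketch with made-up prices (not drawn from the Ames data): $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares, and MAE is the average absolute miss in dollars.

```python
import numpy as np
from sklearn import metrics

# Made-up sale prices and predictions, for illustration only
y_true = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_pred = np.array([110_000.0, 140_000.0, 210_000.0, 240_000.0])

# Residual sum of squares: variability the predictions fail to capture
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: variability of prices around their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
mae_manual = np.mean(np.abs(y_true - y_pred))

print(f"R^2: {r2_manual:.3f} (sklearn: {metrics.r2_score(y_true, y_pred):.3f})")
print(f"MAE: {mae_manual:.0f}")
```

In this toy case every prediction misses by exactly 10,000, so the MAE is 10,000, and the predictions account for 96.8% of the variability in the four prices.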

However, when compared to the simpler regression models from [the last notebook](./4_modeling.ipynb), the added performance of these regularized models may not be worth their lack of interpretability.

The models in the previous notebook were created using `above_below_mid` encoding on the categorical features.  While such an encoding makes us unable to see *exactly which values* of a categorical variable contribute most strongly to `SalePrice`, it allows us to include a much larger number of categorical variables in our model without overfitting.  Thus, it gives us much more power to see *which categorical variables contribute to `SalePrice`* while maintaining confidence that our predictions will be reasonably accurate outside of the data on which our model was trained.

Altogether, the best models of `SalePrice` predicted the log of `SalePrice`.  This was true both for our simple linear regressions and our regularized regressions.  The best simple model was found using a gridsearch over the minimum-correlation-with-`SalePrice` thresholds for categorical and numeric variables; it included 32 numeric variables and 12 categorical variables and achieved an $R^2$ score of .915 and a mean absolute error of 15,312 on the test data.  These scores are certainly not as good as the corresponding $R^2 = .9286$ and MAE = 13,463 scores achieved by the regularized models, but they are not very far off.  Since its performance scores are quite high and it is interpretable, **our production model is the one explored at the end of the previous notebook**.