# Project 2: Ames Housing Prices

## Model Benchmarks

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

In [2]:
# Read processed datasets, keeping the 'NA' values filled in previously
X_train = pd.read_csv('../datasets/model_tuning/X_train_processed.csv', keep_default_na=False, index_col='Id')
X_test = pd.read_csv('../datasets/model_tuning/X_test_processed.csv', keep_default_na=False, index_col='Id')
y_train = pd.read_csv('../datasets/model_tuning/y_train.csv', keep_default_na=False, index_col='Id')
y_test = pd.read_csv('../datasets/model_tuning/y_test.csv', keep_default_na=False, index_col='Id')

## Benchmark (Feature Set 1)

We use the mean sale price of y_train as a naive benchmark: a useful model should achieve a lower RMSE on y_test than always predicting this mean.

In [3]:
# Create a df for y_train_mean with the same size as y_test
y_train_mean = y_test.copy()

# Set y_train_mean SalePrice to the mean SalePrice of y_train
y_train_mean['SalePrice'] = y_train['SalePrice'].mean()

In [4]:
# Compute RMSE of y_train_mean to y_test
# (np.sqrt of the MSE; the `squared` argument is removed in newer scikit-learn releases)
np.sqrt(mean_squared_error(y_test, y_train_mean))

71351.54360417312
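
The same mean-prediction benchmark can also be expressed with scikit-learn's `DummyRegressor`, which keeps the baseline behind the usual `fit`/`predict` interface. The sketch below is not part of the original analysis: it uses small made-up data (`X_tr`, `y_tr`, `X_te`, `y_te` are hypothetical stand-ins for the processed splits) and checks that the result agrees with the manual calculation.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy data standing in for the processed train/test splits
X_tr = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
y_tr = pd.Series([100.0, 200.0, 300.0, 400.0], name='SalePrice')
X_te = pd.DataFrame({'x': [5.0, 6.0]})
y_te = pd.Series([150.0, 350.0], name='SalePrice')

# DummyRegressor(strategy='mean') always predicts the training-set mean
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)

# RMSE of the constant prediction against the held-out targets
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))

# Same quantity computed by hand from the training mean
manual_rmse = np.sqrt(((y_te - y_tr.mean()) ** 2).mean())

print(baseline_rmse, manual_rmse)
```

Because the baseline is an estimator, it could also be passed to `cross_val_score` in the same way as the models below.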

## Linear Regression Modelling

### Ordinary Linear Regression

In [5]:
# Create and fit Linear Regression model to training data of Feature Set 1
lr = LinearRegression()
lr.fit(X_train, y_train['SalePrice'])

LinearRegression()

In [6]:
# Check cross val score (mean RMSE over 5 folds; the scorer returns negated RMSE, hence abs)
abs(cross_val_score(lr, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')).mean()

26270070147013.473

In [7]:
# Compare model predictions to test data (RMSE)
np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))

846252266585126.9

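The RMSE values for both the cross-validation scores (about 2.6e13) and the test predictions (about 8.5e14) are many orders of magnitude larger than the benchmark RMSE of 71351, so the ordinary Linear Regression model is unusable as fitted. Errors of this size suggest numerically unstable coefficients rather than a merely weak fit.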
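
One possible explanation for errors of this magnitude, which this notebook does not verify, is perfect collinearity among one-hot encoded columns: when every level of a categorical feature gets its own dummy column, those columns sum to the intercept and the least-squares solution is no longer unique. The sketch below illustrates the idea on a made-up `quality` column (hypothetical, not taken from the dataset) by comparing the rank of the design matrix with its number of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column, one-hot encoded without dropping a level
df = pd.DataFrame({'quality': ['Ex', 'Gd', 'TA', 'Gd', 'Ex', 'TA']})
dummies = pd.get_dummies(df['quality'], dtype=float)

# Add an explicit intercept column; it equals the sum of the dummy columns
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)

# A rank below the column count means the columns are linearly dependent
print(n_columns, rank)
```

Running the same rank check on `X_train` would show whether this applies to Feature Set 1.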

### Ridge Regression

In [8]:
# Create Ridge Regression CV of training data of Feature Set 1
ridge_alphas = np.logspace(0, 5, 200)

optimal_ridge = RidgeCV(alphas=ridge_alphas, cv=5)
optimal_ridge.fit(X_train, y_train)

print(optimal_ridge.alpha_)

2.1214517849106302


In [9]:
# Create Ridge Regression model with optimal alpha
ridge = Ridge(alpha=optimal_ridge.alpha_)
ridge.fit(X_train, y_train)

Ridge(alpha=2.1214517849106302)

In [10]:
# Check cross val score (RMSE)
abs(cross_val_score(ridge, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')).mean()

19950.43548232915

In [11]:
# Compare model predictions to test data (RMSE)
np.sqrt(mean_squared_error(y_test, ridge.predict(X_test)))

58395.657586695364

Based on the RMSE scores, the Ridge Regression model performs much better than the ordinary Linear Regression model.

However, the model overfits heavily: the test RMSE (58395) is almost three times the cross-validation RMSE (19950), so the model does not generalize well to unseen data. This is to be expected as over 200 columns are used.

The test RMSE is only about 18% lower than the benchmark RMSE of 71351.

| Score | RMSE|
|-------|-----|
|Cross validation|19950|
|Test predictions|58395|
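
The cross-validation and test RMSE pair is computed the same way for every model in this notebook, so the two steps can be wrapped in a small helper. The function `rmse_report` below is a hypothetical convenience, not part of the original notebook, and it is demonstrated on synthetic data from `make_regression` rather than the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split


def rmse_report(model, X_tr, y_tr, X_te, y_te, cv=5):
    """Return mean cross-validated RMSE and held-out test RMSE (hypothetical helper)."""
    # The scorer returns negated RMSE values, so flip the sign of the mean
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv,
                                scoring='neg_root_mean_squared_error')
    # Refit on the full training split before scoring the held-out data
    model.fit(X_tr, y_tr)
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return {'cv_rmse': -cv_scores.mean(), 'test_rmse': test_rmse}


# Synthetic data used only to show the call pattern
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

report = rmse_report(Ridge(alpha=1.0), X_tr, y_tr, X_te, y_te)
print(report)
```

A large gap between `cv_rmse` and `test_rmse`, as seen for Ridge above, is the signal of poor generalization.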



### Lasso Regression

In [12]:
# Create Lasso Regression CV of training data of Feature Set 1
optimal_lasso = LassoCV(n_alphas=500, cv=5)
optimal_lasso.fit(X_train, np.ravel(y_train))

print(optimal_lasso.alpha_)

202.37683637506322


In [13]:
# Create Lasso Regression model with optimal alpha
lasso = Lasso(alpha=optimal_lasso.alpha_)
lasso.fit(X_train, y_train)

Lasso(alpha=202.37683637506322)

In [14]:
# Check cross val score (RMSE)
abs(cross_val_score(lasso, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')).mean()

20837.39073596619

In [15]:
# Compare model predictions to test data (RMSE)
np.sqrt(mean_squared_error(y_test, lasso.predict(X_test)))

50134.245321592105

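The Lasso Regression model performs better than the Ridge Regression model on unseen test data (test RMSE of 50134 vs 58395), although its cross-validation RMSE is slightly higher (20837 vs 19950).

This is plausible, as Lasso Regression can shrink the coefficients of irrelevant features to exactly zero, which reduces overfitting.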

| Score | RMSE|
|-------|-----|
|Cross validation|20837|
|Test predictions|50134|
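
The feature-elimination behaviour of Lasso mentioned above can be checked by counting non-zero coefficients. The sketch below uses synthetic data in which only 5 of 50 features are informative; the data, the variable names and the `alpha` value are illustrative assumptions and do not come from the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: 50 features, of which only 5 influence the target
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=5.0).fit(X, y)
ridge_demo = Ridge(alpha=5.0).fit(X, y)

# Lasso sets some coefficients to exactly zero; Ridge only shrinks them
n_lasso = np.count_nonzero(lasso_demo.coef_)
n_ridge = np.count_nonzero(ridge_demo.coef_)

print('non-zero Lasso coefficients:', n_lasso)
print('non-zero Ridge coefficients:', n_ridge)
```

Applied to the fitted model in this notebook, `np.count_nonzero(lasso.coef_)` would report how many of the columns the Lasso actually keeps.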

### Elastic Net Regression

In [16]:
# Create Enet Regression CV of training data of Feature Set 1
l1_ratios = np.linspace(0.01, 1.0, 25)
optimal_enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=100, cv=5)
optimal_enet.fit(X_train, np.ravel(y_train))

ElasticNetCV(cv=5,
             l1_ratio=array([0.01   , 0.05125, 0.0925 , 0.13375, 0.175  , 0.21625, 0.2575 ,
       0.29875, 0.34   , 0.38125, 0.4225 , 0.46375, 0.505  , 0.54625,
       0.5875 , 0.62875, 0.67   , 0.71125, 0.7525 , 0.79375, 0.835  ,
       0.87625, 0.9175 , 0.95875, 1.     ]))

In [17]:
print(optimal_enet.alpha_, optimal_enet.l1_ratio_)

202.37683637506322 1.0


In [18]:
# Enet Regression should perform identically to Lasso, given optimal l1_ratio = 1.0
enet = ElasticNet(alpha=optimal_enet.alpha_, l1_ratio=optimal_enet.l1_ratio_)
enet.fit(X_train, y_train)

ElasticNet(alpha=202.37683637506322, l1_ratio=1.0)

In [19]:
# Check cross val score (RMSE)
abs(cross_val_score(enet, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')).mean()

20837.39073596619

In [20]:
# Compare model predictions to test data (RMSE)
np.sqrt(mean_squared_error(y_test, enet.predict(X_test)))

50134.245321592105

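The Enet Regression model produces exactly the same scores as the Lasso Regression model. With the optimal l1_ratio of 1.0 the penalty is pure L1, so the Elastic Net reduces to the Lasso, and the selected alpha (202.38) is also identical.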

| Score |RMSE |
|-------|-----|
|Cross validation|20837|
|Test predictions|50134|
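
The equivalence between Elastic Net with `l1_ratio=1.0` and Lasso can be confirmed directly by comparing fitted coefficients. The sketch below does this on synthetic data; the data and the `alpha` value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data used only to compare the two estimators
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(X, y)
enet_demo = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)

# With l1_ratio=1.0 the L2 part of the penalty vanishes, so both fits coincide
same_coefs = np.allclose(lasso_demo.coef_, enet_demo.coef_)
same_intercept = np.isclose(lasso_demo.intercept_, enet_demo.intercept_)

print(same_coefs, same_intercept)
```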

## Summary (Feature Set 1)

The 3 tuned Linear Regression models using Feature Set 1 performed as follows:

| Model | Cross validation RMSE | Test predictions RMSE |
|-------|-----|-----|
|Ridge|19950|58395|
|Lasso|20837|50134|
|Enet|20837|50134|

We can examine the top features of the Lasso Regression model to tune the feature set further.

In [21]:
# Examine top 50 Lasso Regression coefficients
lasso_coefs = pd.DataFrame({'feature':X_train.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})
lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)
lasso_coefs.head(50)

| | feature | coef | abs_coef |
|---|---------|------|----------|
|8|Gr Liv Area|24095.70094|24095.70094|
|185|Kitchen Qual_Ex|23847.440798|23847.440798|
|150|Bsmt Qual_Ex|18516.320628|18516.320628|
|47|Neighborhood_NridgHt|17397.100149|17397.100149|
|162|Bsmt Exposure_Gd|15177.173105|15177.173105|
|141|Exter Qual_Ex|14577.094016|14577.094016|
|46|Neighborhood_NoRidge|13415.796923|13415.796923|
|35|Neighborhood_Crawfor|11487.809499|11487.809499|
|11|Overall Qual|9065.137909|9065.137909|
|6|Total Bsmt SF|8556.796445|8556.796445|


In [22]:
# Save top 50 Lasso Regression coefficients to csv to develop Feature Set 2 in 03: Model Tuning.
lasso_coefs.head(50).to_csv('../datasets/model_tuning/top_lasso_coef.csv')

In 03: Model Tuning and Production, we will build Linear Regression models on these top 50 features and refine the feature selection further. The best-scoring model will be submitted to Kaggle as the production model.
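
How the saved coefficients are used in the next notebook is not shown here. As a minimal sketch, assuming the feature set is reduced by simply keeping the columns with the largest absolute coefficients, the selection could look like the following. The frames `coefs` and `X_demo` are hypothetical stand-ins for `lasso_coefs` and `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient table in the same layout as lasso_coefs
coefs = pd.DataFrame({'feature': ['a', 'b', 'c', 'd'],
                      'coef': [3.0, -5.0, 0.0, 1.0]})
coefs['abs_coef'] = coefs['coef'].abs()

# Hypothetical feature matrix with one column per feature name
X_demo = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

# Keep the k features with the largest absolute coefficients
k = 2
top_features = (coefs.sort_values('abs_coef', ascending=False)
                     .head(k)['feature']
                     .tolist())
X_top = X_demo[top_features]

print(top_features)
```

The same column list would need to be applied to the test features so that both matrices stay aligned.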

## First Kaggle Submission (Trial)

As a trial submission to Kaggle, we fit the Lasso model on the full training data and make predictions on the Kaggle test data.

In [23]:
# Read processed datasets, keeping the 'NA' values filled in previously
train_X = pd.read_csv('../datasets/kaggle_submission/train_X_processed.csv', keep_default_na=False, index_col='Id')
train_y = pd.read_csv('../datasets/kaggle_submission/train_y.csv', keep_default_na=False, index_col='Id')
test_X = pd.read_csv('../datasets/kaggle_submission/test_X_processed.csv', keep_default_na=False, index_col='Id')

In [24]:
# Re-perform Lasso CV on full train data
optimal_lasso.fit(train_X, np.ravel(train_y))

LassoCV(cv=5, n_alphas=500)

In [25]:
# Create and fit Lasso model on train_X
lasso = Lasso(alpha=optimal_lasso.alpha_)
lasso.fit(train_X, train_y)

Lasso(alpha=241.8116719349783)

In [26]:
submission = pd.DataFrame(test_X.index, columns=['Id'])

In [27]:
submission['SalePrice'] = lasso.predict(test_X)

In [28]:
submission.to_csv('../datasets/kaggle_submission/submission_1.csv', index=False)

Submission 1, using 220 columns, achieved a Kaggle private score of 39529 and a public score of 39084.
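
Before uploading, it can be worth running a few sanity checks on the submission frame, since a linear model can produce missing or non-positive price predictions. The helper `check_submission` below is a hypothetical sketch, not part of the original workflow; it assumes only the `Id` and `SalePrice` columns built in the cells above.

```python
import pandas as pd


def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    # Stop early if the expected columns are missing, since later checks rely on them
    if list(sub.columns) != ['Id', 'SalePrice']:
        return ['unexpected columns']
    problems = []
    if sub['Id'].duplicated().any():
        problems.append('duplicate Id values')
    if set(sub['Id']) != set(expected_ids):
        problems.append('Id values do not match the test set')
    if sub['SalePrice'].isna().any():
        problems.append('missing predictions')
    if (sub['SalePrice'] <= 0).any():
        problems.append('non-positive predictions')
    return problems


# Made-up examples: one clean frame and one with a negative prediction
good = pd.DataFrame({'Id': [1, 2, 3], 'SalePrice': [150000.0, 200000.0, 99000.0]})
bad = pd.DataFrame({'Id': [1, 2, 3], 'SalePrice': [150000.0, -500.0, 99000.0]})

good_problems = check_submission(good, expected_ids=[1, 2, 3])
bad_problems = check_submission(bad, expected_ids=[1, 2, 3])
print(good_problems, bad_problems)
```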