# Blending Regression on the Kaggle Housing Data

Richard Corrado richcorrado.github.io

The data files are stored in the same GitHub directory as this notebook.  For the purposes of this exercise, we won't go into much detail about the datasets.  They were all generated from the same messy data set, but the choice and definition of new features evolved during EDA and feature engineering in R, so many tidy data files were generated and saved during the learning process.  They have many features in common, but also important differences.

In this notebook, we will pick one dataset and use 6 models to make predictions on the test set.  We will use a weighted average of the individual predictions to produce the blended prediction.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.precision',5)
from scipy import stats
from scipy import optimize

from sklearn import linear_model, svm
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split, cross_val_predict, KFold, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest, SelectFromModel
from sklearn.pipeline import Pipeline
import xgboost as xgb
from xgboost.training import train
from xgboost.sklearn import XGBClassifier

import matplotlib
# set the non-interactive Agg backend as a default; the %matplotlib magic below switches to an interactive one
matplotlib.use('Agg')
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from matplotlib import pyplot
rcParams['figure.figsize'] = 12, 4
# allow interactive plots
%matplotlib notebook

In [2]:
# helper to compare goodness of fit on the training set: root mean squared error
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
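
The `rmse` helper wraps scikit-learn's `mean_squared_error`. As a minimal sketch of what it computes, the same quantity can be written directly with NumPy. The name `rmse_numpy` and the toy values are illustrative assumptions, not part of the notebook. Since the response used below is `LogSalePrice`, the errors are measured on the log scale.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up log-price targets and predictions, for illustration only.
toy_true = [12.0, 11.5, 12.5]
toy_pred = [12.1, 11.4, 12.5]
print("toy RMSE: %f" % rmse_numpy(toy_true, toy_pred))
```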

In [4]:
train_df = pd.read_csv("train-1-9-delout.csv")
test_df = pd.read_csv("test-1-9-delout.csv")

# Set up predictors and response
y_train = train_df['LogSalePrice'].values
x_train = train_df.drop(['Id','LogSalePrice'],axis=1).values
x_test = test_df.drop(['Id'],axis=1).values

print("Training set size:", x_train.shape)
print("Test set size:", x_test.shape)

('Training set size:', (1451, 390))
('Test set size:', (1459, 390))


We will fit each model on the training data and record its training RMSE.  These training errors will then be used to define the weights for the averaging.

### Ridge regression

In [15]:
ridge_regr = linear_model.Ridge(alpha = 55)
ridge_regr.fit(x_train, y_train)

y_ridge_pred = ridge_regr.predict(x_train)
rmse_ridge = rmse(y_train, y_ridge_pred)
print("Ridge score on training set: %f" % rmse_ridge)

Ridge score on training set: 0.090990


### Lasso regression

In [16]:
lasso_regr = linear_model.Lasso(alpha=0.0006, max_iter=50000)
lasso_regr.fit(x_train, y_train)

y_lasso_pred = lasso_regr.predict(x_train)
rmse_lasso = rmse(y_train, y_lasso_pred)
print("Lasso score on training set: %f" % rmse_lasso)

Lasso score on training set: 0.092526


### Elastic Net

In [17]:
elnet_regr = linear_model.ElasticNet(alpha = 0.0011, l1_ratio=0.5, max_iter=15000, random_state=7)
elnet_regr.fit(x_train, y_train)

y_elnet_pred = elnet_regr.predict(x_train)
rmse_elnet = rmse(y_train, y_elnet_pred)
print("Elastic Net score on training set: %f" % rmse_elnet)

Elastic Net score on training set: 0.092101


### Random Forest

In [18]:
rf_regr = RandomForestRegressor(n_estimators = 700, max_depth = 25, random_state = 7)
rf_regr.fit(x_train, y_train)

y_rf_pred = rf_regr.predict(x_train)
rmse_rf = rmse(y_train, y_rf_pred)
print("Random Forest score on training set: %f" % rmse_rf)

Random Forest score on training set: 0.046494


### Support Vector Regressor

In [19]:
svm_regr = svm.SVR(C=5, cache_size=200, coef0=0.0, degree=3, epsilon=0.034, gamma=0.0004,
                        kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svm_regr.fit(x_train, y_train)

y_svm_pred = svm_regr.predict(x_train)
rmse_svm = rmse(y_train, y_svm_pred)
print("Support Vector Regressor score on training set: %f" % rmse_svm)

Support Vector Regressor score on training set: 0.080296


### XGBoost Trees

In [20]:
xgb_regr = xgb.XGBRegressor(
    max_depth = 1,
    min_child_weight = 0.5,
    gamma = 0,
    subsample = 1,
    colsample_bytree = 0.6,
    reg_alpha = 0.45,
    reg_lambda = 0.2,
    learning_rate = 0.05,
    n_estimators = 6100,
    seed = 42,
    nthread = -1,
    silent = 1)

xgb_regr.fit(x_train, y_train)

y_xgb_pred = xgb_regr.predict(x_train)
rmse_xgb = rmse(y_train, y_xgb_pred)
print("XGBoost Regressor score on training set: %f" % rmse_xgb)

XGBoost Regressor score on training set: 0.084155


## Test predictions and blending

First, the individual model predictions:

In [21]:
y_ridge_pred = ridge_regr.predict(x_test)
y_lasso_pred = lasso_regr.predict(x_test)
y_elnet_pred = elnet_regr.predict(x_test)
y_rf_pred = rf_regr.predict(x_test)
y_svm_pred = svm_regr.predict(x_test)
y_xgb_pred = xgb_regr.predict(x_test)

Next, compute the weighted average, weighting each model's predictions by the inverse of its training RMSE and normalizing so that the weights sum to one:

In [22]:
norm = 1 / rmse_ridge + 1 / rmse_lasso + 1 / rmse_elnet + 1 / rmse_rf + 1 / rmse_svm + 1 / rmse_xgb 

y_pred_blend = (y_ridge_pred / rmse_ridge + y_lasso_pred / rmse_lasso + y_elnet_pred / rmse_elnet + 
                y_rf_pred / rmse_rf + y_svm_pred / rmse_svm + y_xgb_pred / rmse_xgb ) / norm

y_pred_blend = np.exp(y_pred_blend) # response was log of SalePrice
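
The blend above can also be sketched in vectorized form, which makes the individual weights visible. This is a minimal sketch under stated assumptions: the RMSE values are the training scores printed earlier in this notebook, while the helper names `inverse_error_weights` and `blend_log_predictions` and the toy predictions are hypothetical and only for illustration.

```python
import numpy as np

# Training RMSEs as printed by the model cells in this notebook.
train_rmse = {
    "ridge": 0.090990,
    "lasso": 0.092526,
    "elnet": 0.092101,
    "rf": 0.046494,
    "svm": 0.080296,
    "xgb": 0.084155,
}

def inverse_error_weights(errors):
    """Return weights proportional to 1/error, normalized to sum to one."""
    inv = 1.0 / np.asarray(list(errors), dtype=float)
    return inv / inv.sum()

def blend_log_predictions(log_preds, weights):
    """Weighted average of log-scale predictions, mapped back with exp.

    log_preds has shape (n_models, n_samples); weights has length n_models.
    """
    log_preds = np.asarray(log_preds, dtype=float)
    blended_log = np.average(log_preds, axis=0, weights=weights)
    return np.exp(blended_log)

weights = inverse_error_weights(train_rmse.values())
for name, w in zip(train_rmse, weights):
    print("%-5s weight: %.4f" % (name, w))

# Made-up log-price predictions for two houses, one row per model.
toy_log_preds = np.array([
    [12.00, 11.50],
    [12.02, 11.48],
    [12.01, 11.49],
    [11.95, 11.60],
    [12.05, 11.52],
    [11.98, 11.55],
])
toy_prices = blend_log_predictions(toy_log_preds, weights)
print(toy_prices)
```

With the errors reported above, the random forest receives roughly 27% of the total weight, about twice that of any of the linear models, because its training RMSE is about half of theirs. Because the averaging is done on the log scale, the final prices are a weighted geometric mean of the individual models' price predictions.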

Write the predictions to a file, e.g. for a Kaggle submission.

In [23]:
pred_blend_df = pd.DataFrame(y_pred_blend, index=test_df["Id"], columns=["SalePrice"])
pred_blend_df.to_csv('blending_output.csv', header=True, index_label='Id')
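
Before uploading, it can help to confirm that the file has the expected two columns and no missing or non-positive prices. The following is a minimal sketch of such a check on made-up data; the `Id` values and prices are hypothetical, and an in-memory buffer stands in for `blending_output.csv`.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical Ids and blended prices, standing in for test_df["Id"] and y_pred_blend.
toy_ids = pd.Series([1461, 1462, 1463], name="Id")
toy_blend = np.array([120000.0, 155000.5, 189999.9])

# Build and write the frame the same way as the cell above.
toy_df = pd.DataFrame(toy_blend, index=toy_ids, columns=["SalePrice"])
buffer = io.StringIO()
toy_df.to_csv(buffer, header=True, index_label="Id")

# Read the file back and run basic sanity checks.
buffer.seek(0)
check_df = pd.read_csv(buffer)
print(check_df)
print("columns ok:", list(check_df.columns) == ["Id", "SalePrice"])
print("no missing values:", not check_df.isnull().values.any())
print("all prices positive:", bool((check_df["SalePrice"] > 0).all()))
```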