## Overview of Notebook

In this notebook, I will build **eight regression models**, including a simple linear regression, ensemble models, and a neural network, to predict consumer spending. I will select the best model using nested cross-validation with grid search.

The data used for these regression models contains information about whether or not various consumers made a purchase in response to a test mailing of a certain catalog and, in case of a purchase, how much money each consumer spent. This dataset has two possible outcome variables: Purchase (0/1 value: whether or not the purchase was made) and Spending (numeric value: amount spent). I will only be building regression models to predict the Spending outcome.

## Read in Data

In [2]:
import pandas as pd
import numpy as np

# plotting libraries
import matplotlib.pyplot as plt
import scikitplot as skplt

from sklearn.preprocessing import MinMaxScaler #normalizer

from sklearn.metrics import mean_squared_error

#modeling libraries
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsRegressor # KNN
from sklearn import linear_model #linear regression
from sklearn import svm # SVM
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor # Import Decision Tree regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from sklearn.ensemble import StackingRegressor

from sklearn.neural_network import MLPRegressor

import keras
import tensorflow
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from tensorflow.keras.optimizers import Adam

In [3]:
data = pd.read_csv('spending.csv', index_col = 'sequence_number')

In [4]:
data

sequence_number,US,source_a,source_c,source_b,source_d,source_e,source_m,source_o,source_h,source_r,...,source_x,source_w,Freq,last_update_days_ago,1st_update_days_ago,Web order,Gender=male,Address_is_res,Purchase,Spending
1,1,0,0,1,0,0,0,0,0,0,...,0,0,2,3662,3662,1,0,1,1,127.87
2,1,0,0,0,0,1,0,0,0,0,...,0,0,0,2900,2900,1,1,0,0,0.00
3,1,0,0,0,0,0,0,0,0,0,...,0,0,2,3883,3914,0,0,0,1,127.48
4,1,0,1,0,0,0,0,0,0,0,...,0,0,1,829,829,0,1,0,0,0.00
5,1,0,1,0,0,0,0,0,0,0,...,0,0,1,869,869,0,0,0,0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1996,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1701,1701,1,0,0,1,29.50
1997,1,0,0,0,0,0,0,0,0,0,...,0,0,1,2633,2633,1,1,0,1,10.00
1998,1,0,0,0,0,0,0,0,0,0,...,0,0,0,3394,3394,0,0,0,0,0.00
1999,1,0,0,0,0,0,0,0,0,0,...,0,1,1,253,253,0,1,1,0,0.06


In [6]:
# split the data into predictors and the response variable
X = data[data.columns.difference(['Purchase','Spending'])]
y = data['Spending']
y  # display the response variable

sequence_number
1       127.87
2         0.00
3       127.48
4         0.00
5         0.00
         ...  
1996     29.50
1997     10.00
1998      0.00
1999      0.06
2000      0.14
Name: Spending, Length: 2000, dtype: float64
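
One detail of the split above is easy to miss: `columns.difference` returns the remaining column names in sorted order, not in their original order. The sketch below illustrates this on a toy frame. The frame `toy` and the names `X_toy` and `X_toy_ordered` are made up for illustration; `drop(columns=...)` is shown only as an alternative that keeps the original column order.

```python
import pandas as pd

# Toy frame (hypothetical) with the same kind of columns as the real data.
toy = pd.DataFrame({
    "US": [1, 1],
    "Freq": [2, 0],
    "Purchase": [1, 0],
    "Spending": [127.87, 0.0],
})

# Same idiom as the notebook: keep every column except the two outcomes.
X_toy = toy[toy.columns.difference(["Purchase", "Spending"])]

# Alternative that preserves the original column order.
X_toy_ordered = toy.drop(columns=["Purchase", "Spending"])

print(list(X_toy.columns))          # sorted alphabetically
print(list(X_toy_ordered.columns))  # original order
```

Column order does not change what the models in this notebook learn, but it matters when reading feature importances or coefficients by position.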

## Defining the Success Metric

I will use the **root mean squared error** (RMSE) as my success metric because I want to avoid large errors in my models. Because RMSE squares each error before averaging, a single large error contributes far more to the score than several small ones. As a result, RMSE tends to favor a model that makes many small errors over a model that makes occasional large errors.
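
To make the effect of squaring concrete, here is a minimal sketch with made-up numbers. The arrays `many_small` and `one_large` are hypothetical predictions, not results from this dataset. Both have the same mean absolute error, but RMSE scores the one with a single large miss much worse.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and two hypothetical sets of predictions.
y_true = np.zeros(10)
many_small = np.full(10, 10.0)               # every prediction is off by 10
one_large = np.array([100.0] + [0.0] * 9)    # one prediction is off by 100

# The mean absolute error is identical for both sets of predictions...
mae_small = float(np.mean(np.abs(y_true - many_small)))
mae_large = float(np.mean(np.abs(y_true - one_large)))

# ...but RMSE penalizes the single large error much more heavily.
rmse_small = rmse(y_true, many_small)
rmse_large = rmse(y_true, one_large)

print(mae_small, mae_large)
print(rmse_small, rmse_large)
```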


## Determine Best Model Using Nested Cross Validation

I will perform nested cross-validation to determine the best model, using negative root mean squared error as the scoring metric. The inner loop tunes the hyper-parameters with grid search, and the outer loop scores the tuned model on held-out folds; each model's score is the mean of the negative RMSE across the outer folds. I will build a decision tree, linear regression, KNN, support vector machine, random forest, gradient boosting, XGBoost, and neural network model, tuning the hyper-parameters as outlined below.

* For the regression tree, I will tune max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf because these parameters limit how far the tree can grow and so reduce overfitting.
* For linear regression, I will not tune any parameters.
* For KNN, I will look at 1 to 29 neighbors, weighting the neighbors both uniformly and by distance.
* For SVM, I will tune C (the regularization parameter), the kernel, and gamma. I looked at the linear and rbf kernels because these are the most popular. Gamma is only used by the rbf kernel, and a large gamma is more likely to overfit the model.
* For Random Forest, I will tune max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf.
* For Gradient Boosting, I will tune max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf.
* For XGBoost, I will tune max_depth and gamma because these parameters can control overfitting.
* For the neural network, I will tune the hidden layer sizes, the maximum number of iterations, alpha (the L2 regularization strength), and the learning rate schedule.


After performing nested cross-validation, I compared the mean score of each model and found that **the best-performing model was the gradient boosting model.**
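
The nested cross-validation pattern used for every model in this section can be sketched on synthetic data as follows. Everything here is illustrative: the data comes from `make_regression`, the grid is a small made-up one, and the `demo_` names are mine rather than the notebook's. The inner `GridSearchCV` picks hyper-parameters on each outer training fold, and `cross_val_score` reports the tuned model's negative RMSE on the matching outer test fold.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real predictors and response (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Inner folds tune hyper-parameters; outer folds estimate generalization error.
demo_inner_cv = KFold(n_splits=4, shuffle=True, random_state=2)
demo_outer_cv = KFold(n_splits=4, shuffle=True, random_state=2)

# Inner loop: grid search over a small illustrative grid.
demo_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="neg_root_mean_squared_error",
    cv=demo_inner_cv,
)

# Outer loop: each outer fold refits the whole search on its training part.
# With no scoring argument, cross_val_score uses the search object's own
# score method, which applies the scoring metric given to GridSearchCV.
demo_scores = cross_val_score(demo_search, X=X_demo, y=y_demo, cv=demo_outer_cv)

print(f"Demo nested CV: {demo_scores.mean():.3f} +/- {demo_scores.std():.3f}")
```

Because the scores are negative RMSE values, a score closer to zero is better.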

In [17]:
dt_grid = {'max_depth': range(2,20, 5), 
           'max_leaf_nodes': range(2,20, 5),
           'min_samples_split': range(2,20, 5),
           'min_samples_leaf': range(2,20, 5)}
dt = tree.DecisionTreeRegressor(min_impurity_decrease=0.01)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=2)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=2)

dt_clf = GridSearchCV(estimator=dt, param_grid=dt_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

dt_score = cross_val_score(dt_clf, X=X, y=y, cv=outer_cv)

print(f"Regression Tree: {dt_score.mean():.3f} +/- {dt_score.std():.3f}")

Regression Tree: -134.204 +/- 16.279
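
Nested grid search is expensive, so it can help to count the model fits before running a cell. The sketch below counts them for the decision tree grid above, assuming the default `refit=True` in `GridSearchCV` (one extra fit per outer fold). The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the decision tree cell above.
tree_grid = {'max_depth': range(2, 20, 5),
             'max_leaf_nodes': range(2, 20, 5),
             'min_samples_split': range(2, 20, 5),
             'min_samples_leaf': range(2, 20, 5)}

n_candidates = len(ParameterGrid(tree_grid))  # 4 values per parameter
inner_splits = 4
outer_splits = 4

# Per outer fold: every candidate is fit on every inner fold, then the
# best candidate is refit once on the full outer training fold.
fits_per_outer_fold = n_candidates * inner_splits + 1
total_fits = outer_splits * fits_per_outer_fold

print(n_candidates, fits_per_outer_fold, total_fits)
```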


In [60]:
scaler = MinMaxScaler()
lr = linear_model.LinearRegression()
lr_pipe = Pipeline(steps=[('mms', scaler), ('linear', lr)])
lr_grid = {}

lr_clf = GridSearchCV(estimator=lr_pipe, param_grid=lr_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

lr_score = cross_val_score(lr_clf, X=X, y=y, cv=outer_cv)

print(f"Linear Regression: {lr_score.mean():.3f} +/- {lr_score.std():.3f}")

Linear Regression: -126.593 +/- 15.656


In [21]:
knn = KNeighborsRegressor()
knn_pipe = Pipeline(steps=[('mms', scaler), ('knn', knn)])
knn_grid = {'knn__n_neighbors': range(1, 30), 
            'knn__weights': ['distance', 'uniform']}

knn_clf = GridSearchCV(estimator=knn_pipe, param_grid=knn_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

knn_score = cross_val_score(knn_clf, X=X, y=y, cv=outer_cv)

print(f"KNN: {knn_score.mean():.3f} +/- {knn_score.std():.3f}")

KNN: -165.288 +/- 20.142


In [22]:
svr = svm.SVR()
svr_pipe = Pipeline(steps=[('mms', scaler), ('svr', svr)])
svm_grid = {'svr__C': [0.000001, 0.00001, 0.0001, .001, .01, 0.1, 1, 10, 100],
            'svr__kernel': ['linear', 'rbf'],
            'svr__gamma': [0.1, 0.5, 1, 3, 5]}

svr_clf = GridSearchCV(estimator=svr_pipe, param_grid=svm_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

svr_score = cross_val_score(svr_clf, X=X, y=y, cv=outer_cv)

print(f"SVR: {svr_score.mean():.3f} +/- {svr_score.std():.3f}")

SVR: -137.784 +/- 16.889
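
Because gamma only affects the rbf kernel, the grid above fits the linear kernel once per gamma value without changing its result. One possible alternative, shown here as a sketch rather than what the notebook ran, is to pass `GridSearchCV` a list of grids so that gamma is only varied for rbf. The names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.000001, 0.00001, 0.0001, .001, .01, 0.1, 1, 10, 100]
gamma_values = [0.1, 0.5, 1, 3, 5]

# Grid as written in the notebook: gamma is crossed with both kernels.
full_grid = {'svr__C': C_values,
             'svr__kernel': ['linear', 'rbf'],
             'svr__gamma': gamma_values}

# Alternative: one sub-grid per kernel, with gamma only for rbf.
split_grid = [
    {'svr__kernel': ['linear'], 'svr__C': C_values},
    {'svr__kernel': ['rbf'], 'svr__C': C_values, 'svr__gamma': gamma_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

The split grid covers the same distinct models with fewer candidates, so the selected model should not change, only the run time.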


In [23]:
rf_grid = {'max_depth': range(2,20, 5), 
           'max_leaf_nodes': range(2,20, 5),
           'min_samples_split': range(2,20, 5),
           'min_samples_leaf': range(2,20, 5)}
rf = RandomForestRegressor(min_impurity_decrease=0.01)

rf_clf = GridSearchCV(estimator=rf, param_grid=rf_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

rf_score = cross_val_score(rf_clf, X=X, y=y, cv=outer_cv)

print(f"Random Forest: {rf_score.mean():.3f} +/- {rf_score.std():.3f}")

Random Forest: -128.294 +/- 15.301


In [24]:
gb_grid = {'max_depth': range(2,20, 5), 
           'max_leaf_nodes': range(2,20, 5),
           'min_samples_split': range(2,20, 5),
           'min_samples_leaf': range(2,20, 5)}
gb = GradientBoostingRegressor(min_impurity_decrease=0.01)

gb_clf = GridSearchCV(estimator=gb, param_grid=gb_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

gb_score = cross_val_score(gb_clf, X=X, y=y, cv=outer_cv)

print(f"Gradient Boosting: {gb_score.mean():.3f} +/- {gb_score.std():.3f}")

Gradient Boosting: -123.515 +/- 13.762


In [66]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

xgb_grid = {'max_depth': range(2,20, 5),
            #'max_leaf_nodes': range(2,20, 5), # might not need this, will ignore max_depth if this is set
            'gamma': [i/10.0 for i in range(0,5)]}
xg = xgb.XGBRegressor()

xg_clf = GridSearchCV(estimator=xg, param_grid=xgb_grid, scoring='neg_root_mean_squared_error', cv=inner_cv)

xg_score = cross_val_score(xg_clf, X=X, y=y, cv=outer_cv)

print(f"XGBoost: {xg_score.mean():.3f} +/- {xg_score.std():.3f}")

XGBoost: -124.351 +/- 17.099


In [51]:
# Neural Network with sklearn

# Used this site to understand how to tune hyper-parameters
# https://michael-fuchs-python.netlify.app/2021/02/10/nn-multi-layer-perceptron-regressor-mlpregressor/

scaler = MinMaxScaler()
inner_cv = KFold(n_splits=4, shuffle=True, random_state=2)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=2)

# Create the model
nn_model = MLPRegressor(activation='relu', solver='adam')

#pipe
nn_pipe = Pipeline(steps=[('mms', scaler), ('nn', nn_model)])

# grid
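nn_grid = {'nn__hidden_layer_sizes': [(150,100,50), (120,80,40), (100,50,30)], 
          'nn__max_iter': [50, 100],
          'nn__alpha': [0.0001, 0.05],
          # note: scikit-learn only uses learning_rate when solver='sgd',
          # so with solver='adam' these two settings behave the same
          'nn__learning_rate': ['constant','adaptive']}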


# Build the GridSearchCV
nn_clf = GridSearchCV(estimator=nn_pipe, param_grid=nn_grid, scoring='neg_root_mean_squared_error', cv =inner_cv)

nn_score = cross_val_score(nn_clf, X=X, y=y, cv=outer_cv)

print(f"Neural Network: {nn_score.mean():.3f} +/- {nn_score.std():.3f}")

Neural Network: -126.038 +/- 15.541
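
To compare the models side by side, the scores printed above can be collected into one table and sorted. The numbers below are copied from the cell outputs in this section; the names `reported`, `ranked`, and `best_model` are illustrative.

```python
import pandas as pd

# Mean and standard deviation of negative RMSE, copied from the outputs above.
reported = pd.DataFrame(
    {
        "mean": [-134.204, -126.593, -165.288, -137.784,
                 -128.294, -123.515, -124.351, -126.038],
        "std": [16.279, 15.656, 20.142, 16.889,
                15.301, 13.762, 17.099, 15.541],
    },
    index=["Regression Tree", "Linear Regression", "KNN", "SVR",
           "Random Forest", "Gradient Boosting", "XGBoost", "Neural Network"],
)

# Scores are negative RMSE, so the highest (closest to zero) is best.
ranked = reported.sort_values("mean", ascending=False)
best_model = ranked.index[0]

print(ranked)
print("Best:", best_model)
```

The gaps between the top few models are much smaller than the fold-to-fold standard deviations, so the ranking among them should be read with some caution.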




## Tune the Best Model: Gradient Boosting (XGBoost)

In [10]:
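# hold out 20% of the data for testing; no random_state is set, so the split
# (and the test RMSE below) will vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)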
xg_grid = {'max_depth': range(2,20, 5),
            #'max_leaf_nodes': range(2,20, 5), # might not need this, will ignore max_depth if this is set
            'gamma': [i/10.0 for i in range(0,5)]}
xg = xgb.XGBRegressor()

xg_clf = GridSearchCV(estimator=xg, param_grid=xg_grid, scoring='neg_root_mean_squared_error', cv=5)
xg_clf.fit(X_train, y_train)

xg_clf.best_params_

{'gamma': 0.0, 'max_depth': 2}

In [11]:
xg_predictions = xg_clf.predict(X_test)

# print the test-set RMSE (squared=False returns the root of the MSE)
print(mean_squared_error(y_test, xg_predictions, squared=False))

142.84372944976545
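
A test RMSE is easier to interpret next to a naive baseline. The sketch below, on synthetic data, compares a fitted model with a `DummyRegressor` that always predicts the training mean. It computes RMSE as the square root of `mean_squared_error`, which avoids relying on the `squared` argument (newer scikit-learn releases deprecate it, as far as I know). The data, the `LinearRegression` stand-in model, and all variable names are assumptions for illustration, not part of the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    """RMSE computed as the square root of the mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Synthetic stand-in data (assumption), with a fixed seed for reproducibility.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
# Stand-in for the tuned model.
model = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = rmse(y_te, baseline.predict(X_te))
model_rmse = rmse(y_te, model.predict(X_te))

print(f"Baseline RMSE: {baseline_rmse:.3f}")
print(f"Model RMSE:    {model_rmse:.3f}")
```

Applying the same comparison to the real `X_train`/`X_test` split would show how much the tuned model improves on simply predicting the average spending.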
