# Exercises

The Auto MPG dataset concerns city-cycle fuel consumption in miles per gallon, to be predicted from 3 multi-valued discrete and 5 continuous attributes. There are 398 instances.

Attribute information:
1. mpg:           continuous
2. cylinders:     multi-valued discrete
3. displacement:  continuous
4. horsepower:    continuous
5. weight:        continuous
6. acceleration:  continuous
7. model year:    multi-valued discrete
8. origin:        multi-valued discrete
9. car name:      string (unique for each instance)


There are 6 missing entries for horsepower, indicated in the file as `?`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Neural network regression model
from sklearn.neural_network import MLPRegressor

# for linear regression
import statsmodels.api as sm

# Model validation 
from sklearn.model_selection import KFold,RepeatedKFold,GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.base import clone

# for generating combinations
from itertools import product

%matplotlib inline

In [2]:
plt.rcParams['figure.dpi'] = 150

In [3]:
auto = pd.read_csv(
    '../data/auto-mpg.csv',
    na_values = '?'
)

# drop rows with missing values
# axis=0 drops rows (axis=1 would drop columns)
auto = auto.dropna(axis=0)

# car name is unique for each instance
# deleting the column
del auto['car name']

# print a few rows of data frame
auto

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130.0,3504,12.0,70,1
1,15.0,8,350.0,165.0,3693,11.5,70,1
2,18.0,8,318.0,150.0,3436,11.0,70,1
3,16.0,8,304.0,150.0,3433,12.0,70,1
4,17.0,8,302.0,140.0,3449,10.5,70,1
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,1
394,44.0,4,97.0,52.0,2130,24.6,82,2
395,32.0,4,135.0,84.0,2295,11.6,82,1
396,28.0,4,120.0,79.0,2625,18.6,82,1


In [4]:
# standardize predictors
X = auto.drop('mpg',axis=1).values # extract as numpy array
X_mean,X_std = X.mean(axis=0),X.std(axis=0)
X = (X-X_mean)/X_std

# standardize response
y = auto['mpg'].values
y_mean,y_std = y.mean(),y.std()
y = (y-y_mean)/y_std
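
Because the response is standardized, predictions and errors from the models below are on a unitless scale. The following is a minimal sketch of how they could be mapped back to miles per gallon. The `mpg` values and the names `z`, `z_pred`, `mpg_mean`, and `mpg_std` are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical mpg values, used only to illustrate the arithmetic
mpg = np.array([18.0, 15.0, 27.0, 44.0, 32.0])

# Standardize the response the same way as above
mpg_mean, mpg_std = mpg.mean(), mpg.std()
z = (mpg - mpg_mean) / mpg_std

# Stand-in predictions on the standardized scale (truth plus a small offset)
z_pred = z + 0.1

# Map predictions back to miles per gallon
mpg_pred = z_pred * mpg_std + mpg_mean

# An MSE on the standardized scale converts to mpg^2 when multiplied by the variance
mse_std = np.mean((z - z_pred) ** 2)
mse_mpg = np.mean((mpg - mpg_pred) ** 2)
```

Under these assumptions, `mse_mpg` equals `mse_std * mpg_std**2`, so R^2 values are unaffected by the rescaling.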

## Exercise 1

Train and tune a neural network model to predict `mpg` as a function of the other variables. Use 5-fold CV.

In [5]:
# Hyper-parameter tuning via cross-validation for scikit-learn estimators
# create grid of hyperparameters
alphas = [0.01,0.1,1]
sizes = [5,10,20,30]

# initial model constructor
nn_init = MLPRegressor(
    hidden_layer_sizes=10, # scalar -> single hidden layer with 10 units
    activation='logistic', # activation function for the hidden layer
    solver='lbfgs', # deterministic optimizer (not the default)
    alpha=0.01,# regularization parameter
    max_iter = 1000 # maximum number of L-BFGS iterations
)

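np.random.seed(456)

# random_state is left unset, so the shuffle draws from NumPy's global RNG
# (seeded above) each time kf.split() is called
kf = KFold(n_splits=5,shuffle=True)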
# grid search CV
nn_tuned = GridSearchCV(
    estimator=nn_init,
    param_grid={ # dictionary containing possible values of each hyperparameter
        'alpha':alphas,
        'hidden_layer_sizes':sizes,
    },
    scoring='neg_mean_squared_error', # score (higher is better) - negative loss 
    refit=True, # whether to fit the model with the best configuration on the entire training set.
    cv=kf, # cross-validation split generator
    n_jobs=1, # argument for utilizing parallel cores
).fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

In [6]:
nn_tuned.best_params_

{'alpha': 0.1, 'hidden_layer_sizes': 10}

In [7]:
# best 5-fold CV R2
print('Best CV R^2: %6.3f'%(1-(-nn_tuned.best_score_)/y.var()))

Best CV R^2:  0.889


In [8]:
# computing predictions on training set
y_pred = nn_tuned.predict(X)
# residuals
e = y-y_pred

# training metrics (y is standardized, so sum(y**2) is the total sum of squares)
training_r2 = 1 - np.sum(e**2)/np.sum(y**2)
print('Training R^2: %6.3f'%training_r2)

Training R^2:  0.926
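
The R^2 shortcuts used above (`1 - MSE/y.var()` and `1 - sum(e**2)/sum(y**2)`) rely on the response having mean zero and unit variance. The sketch below checks this against the general definition of R^2 on synthetic data. The arrays `z` and `z_hat` are made up for illustration and are not outputs of the fitted models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic response and predictions, for illustration only
raw = rng.normal(loc=23.0, scale=8.0, size=200)
z = (raw - raw.mean()) / raw.std()             # mean 0, variance 1
z_hat = z + rng.normal(scale=0.3, size=200)    # noisy stand-in predictions

resid = z - z_hat
mse = np.mean(resid ** 2)

# General definition of R^2
r2_general = 1 - np.sum(resid ** 2) / np.sum((z - z.mean()) ** 2)
# Shortcuts that are valid only because z is centered with unit variance
r2_sum_sq = 1 - np.sum(resid ** 2) / np.sum(z ** 2)
r2_from_mse = 1 - mse / z.var()
```

All three quantities agree here. With an unstandardized response, only `r2_general` would remain correct.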


## Exercise 2

Train and tune a support vector machine model (with radial basis kernel) to predict `mpg` as a function of the other variables. Use the same 5-fold partition as earlier.
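
One caveat about reusing a `KFold` object: when `shuffle=True` and `random_state` is left unset, the shuffle draws from NumPy's global random state, so separate calls to `split()` are not guaranteed to return the same folds. The sketch below shows one way to pin the partition by passing an integer `random_state`. The array `X_demo` and the seed value are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # placeholder data with 10 rows

# A fixed integer seed makes every call to split() return the same folds
kf_fixed = KFold(n_splits=5, shuffle=True, random_state=456)
folds_a = [test.tolist() for _, test in kf_fixed.split(X_demo)]
folds_b = [test.tolist() for _, test in kf_fixed.split(X_demo)]

same_partition = folds_a == folds_b
```

Passing such an object as `cv` to several `GridSearchCV` calls would then evaluate every model on identical train/test splits.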


`SVR` from the module `sklearn.svm` implements support vector machine regression. For the default radial basis kernel, it has two hyperparameters: `C` and `gamma`. Test the following values:

1. `C`: `[1e-2,0.1,1,10,100]`
2. `gamma`: `[1e-3,1e-2,0.1,1,10]`

There are 25 combinations of C and gamma values.
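
The count of 25 can be checked by enumerating the grid directly with `itertools.product`. This is only a sketch of what the grid search iterates over. The names `C_values`, `gamma_values`, `svr_grid`, and `n_cv_fits` are introduced here for illustration.

```python
from itertools import product

C_values = [1e-2, 0.1, 1, 10, 100]
gamma_values = [1e-3, 1e-2, 0.1, 1, 10]

# Every (C, gamma) pair in the grid
svr_grid = [{'C': C, 'gamma': g} for C, g in product(C_values, gamma_values)]

# With 5-fold CV, each candidate is fitted once per fold
n_cv_fits = len(svr_grid) * 5
```

With `refit=True`, one additional fit on the full data follows the cross-validation fits.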

In [9]:
from sklearn.svm import SVR

In [10]:
np.random.seed(123)
svr_tuned = GridSearchCV(
    estimator=SVR(),
    param_grid={ # dictionary containing possible values of each hyperparameter
        'C':[1e-2,0.1,1,10,100],
        'gamma':[1e-3,1e-2,0.1,1,10],
    },
    scoring='neg_mean_squared_error', # score - negative loss
    refit=True, # whether to fit the model with the best configuration on the entire training set.
    cv=kf,
    return_train_score=True,
    n_jobs=1, # argument for utilizing parallel cores 
).fit(X,y)

# best 5-fold CV R2
print('Best CV R^2: %6.3f'%(1-(-svr_tuned.best_score_)/y.var()))

Best CV R^2:  0.883


In [11]:
svr_tuned.best_params_

{'C': 10, 'gamma': 0.1}

## Exercise 3

Compare both models using (a) a single K-fold partition and (b) multiple K-fold partitions.
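
For part (a), one possible approach is to score both tuned models on a single shared partition and compare the per-fold errors pairwise. The sketch below uses synthetic data (`X_demo`, `y_demo`) as a stand-in for the standardized auto data, plugs in the hyperparameters selected in Exercises 1 and 2, and fixes `random_state` values that are assumptions made here for reproducibility.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic standardized data standing in for X and y (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)
y_demo = (y_demo - y_demo.mean()) / y_demo.std()

# One fixed partition shared by both models
kf_single = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'nnet': MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                         solver='lbfgs', alpha=0.1, max_iter=1000, random_state=0),
    'svm': SVR(C=10, gamma=0.1),
}

# Per-fold test MSE for each model (the scorer returns negative MSE)
fold_mse = {
    name: -cross_val_score(model, X_demo, y_demo, cv=kf_single,
                           scoring='neg_mean_squared_error')
    for name, model in models.items()
}

# Paired per-fold differences: both models see the same test folds
mse_diff = fold_mse['nnet'] - fold_mse['svm']
```

Because both models are evaluated on the same folds, `mse_diff` removes fold-to-fold variation from the comparison. A single partition still gives only five paired values.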

In [12]:
# cross-validation measure for comparison with other models
# 5 replicates of 5-fold cross-validation
np.random.seed(234)

n_folds = 5
n_repeats = 5
rkf = RepeatedKFold(n_splits=n_folds,n_repeats=n_repeats)
mses_nnet_rkf = np.empty(n_folds*n_repeats)
mses_svm_rkf = np.empty(n_folds*n_repeats)

for i,(train_index,test_index) in enumerate(rkf.split(X)):
    ############ neural network model ############
    # create new model with the specified hyperparameters
    nn_fold = clone(nn_init).set_params(**nn_tuned.best_params_)

    # fit model on training split
    _ = nn_fold.fit(X[train_index,:],y[train_index])

    # compute test predictions
    y_pred_nnet = nn_fold.predict(X[test_index,:])
    # compute metric: mean squared error
    mses_nnet_rkf[i] = mean_squared_error(y[test_index],y_pred_nnet)
    
    ############ SVM model ############
    svm_fold = SVR().set_params(**svr_tuned.best_params_).fit(X[train_index,:],y[train_index])
    y_pred_svm = svm_fold.predict(X[test_index,:])
    mses_svm_rkf[i] = mean_squared_error(y[test_index],y_pred_svm)    

In [13]:
mse_nnet_cv = mses_nnet_rkf.mean()
mse_svm_cv = mses_svm_rkf.mean()

r2_nnet_cv = 1-mse_nnet_cv/y.var()
r2_svm_cv = 1-mse_svm_cv/y.var()

print('Replicated 5-fold CV R2 for neural network.....: %5.3f'%r2_nnet_cv)
print('Replicated 5-fold CV R2 for SVM................: %5.3f'%r2_svm_cv)

Replicated 5-fold CV R2 for neural network.....: 0.879
Replicated 5-fold CV R2 for SVM................: 0.883
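
The overall means above hide how much the comparison varies from one partition to the next. The following is a sketch of a per-repeat summary. It assumes the per-fold MSE arrays are ordered repeat by repeat, which is how `RepeatedKFold` yields its splits. The arrays `mses_a` and `mses_b` hold random placeholder numbers, not results from the models above.

```python
import numpy as np

n_folds, n_repeats = 5, 5

# Placeholder per-fold MSEs (random numbers, NOT results from the models above)
rng = np.random.default_rng(1)
mses_a = rng.uniform(0.08, 0.16, size=n_folds * n_repeats)
mses_b = rng.uniform(0.08, 0.16, size=n_folds * n_repeats)

# Each row of the reshaped array holds one complete 5-fold partition,
# so the row means give one paired difference per repeat
diff_by_repeat = (mses_a - mses_b).reshape(n_repeats, n_folds).mean(axis=1)

# Average difference and its spread across partitions
mean_diff = diff_by_repeat.mean()
sd_diff = diff_by_repeat.std(ddof=1)
```

Applied to `mses_nnet_rkf` and `mses_svm_rkf`, a `mean_diff` that is small relative to `sd_diff` would suggest the two models are hard to distinguish on this dataset.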
