# Hold-out set for final evaluation
- How well does the model perform on never-before-seen data?
- Using ALL data for cross-validation *is not* ideal: the best cross-validation score comes from the same data used to choose the hyperparameters, so it tends to be optimistic
- Split data into training and hold-out set at the beginning
- Perform grid search cross-validation on the training set
- Choose the best hyperparameters and **evaluate on the hold-out set**

In [5]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('datasets/diabetes.csv')

X = df.drop('diabetes', axis=1)
y = df['diabetes']

## Hold-out set in practice: Classification

In [11]:
# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty' : ['l1','l2']}

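# liblinear supports both the 'l1' and 'l2' penalties in the grid
# (the current default solver, lbfgs, does not support 'l1')
logreg = LogisticRegression(solver='liblinear')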

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.4, random_state=42)

# Instantiate the GridSearchCV
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

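# best_score_ is the mean cross-validated accuracy on the training set,
# not the accuracy on the hold-out set
print('Tuned Logistic Regression Parameter: {}'.format(logreg_cv.best_params_))
print('Tuned Logistic Regression Accuracy: {}'.format(logreg_cv.best_score_))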

Tuned Logistic Regression Parameter: {'C': 31.622776601683793, 'penalty': 'l2'}
Tuned Logistic Regression Accuracy: 0.7673913043478261
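
The accuracy printed above is the cross-validated score on the training set; the hold-out set has not been used yet. The final evaluation step could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the diabetes data, and every name ending in `_demo` is hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Set the hold-out set aside before any tuning happens
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42)

# Grid search with 5-fold cross-validation on the training portion only
grid_demo = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}
cv_demo = GridSearchCV(LogisticRegression(solver='liblinear'), grid_demo, cv=5)
cv_demo.fit(X_train_demo, y_train_demo)

# Mean cross-validated accuracy of the best parameter combination
cv_accuracy = cv_demo.best_score_

# Accuracy of the refit best model on data never seen during tuning
holdout_accuracy = cv_demo.score(X_test_demo, y_test_demo)

y_pred_demo = cv_demo.predict(X_test_demo)
cm_demo = confusion_matrix(y_test_demo, y_pred_demo)

print('Best parameters: {}'.format(cv_demo.best_params_))
print('Cross-validated accuracy: {}'.format(cv_accuracy))
print('Hold-out accuracy: {}'.format(holdout_accuracy))
print(cm_demo)
print(classification_report(y_test_demo, y_pred_demo))
```

With the notebook's own objects, the equivalent call would be `logreg_cv.score(X_test, y_test)`, since `GridSearchCV` refits the best model on the whole training set by default.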


## Hold-out set in practice: Regression
- Lasso uses the L1 penalty to regularize
- Ridge uses the L2 penalty to regularize
- Elastic net regularization: the penalty term is a linear combination of the L1 and L2 penalties:
$$a \cdot L1 + b \cdot L2$$

In scikit-learn, the balance between the two penalties is controlled by the 'l1_ratio' parameter: an 'l1_ratio' of 1 corresponds to a pure L1 penalty, 0 to a pure L2 penalty, and anything in between is a combination of L1 and L2.
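
As a rough illustration of how 'l1_ratio' maps onto the two penalty weights, the sketch below follows the objective given in the scikit-learn `ElasticNet` documentation, where the L1 weight is `alpha * l1_ratio` and the weight on the squared L2 norm is `0.5 * alpha * (1 - l1_ratio)`. The helper `penalty_weights` and the synthetic data are hypothetical and only there for the example. It also checks that an elastic net with 'l1_ratio' of 1 fits the same coefficients as `Lasso`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso


def penalty_weights(alpha, l1_ratio):
    """Hypothetical helper: weights on the L1 and squared-L2 penalty terms."""
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_weight, l2_weight


# Synthetic regression data (assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# With l1_ratio=1 the L2 weight is zero, so the model reduces to the lasso
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
same_as_lasso = np.allclose(enet_l1.coef_, lasso.coef_)

print('Weights at l1_ratio=1.0: {}'.format(penalty_weights(1.0, 1.0)))
print('Weights at l1_ratio=0.5: {}'.format(penalty_weights(1.0, 0.5)))
print('ElasticNet(l1_ratio=1) matches Lasso: {}'.format(same_as_lasso))
```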

In this exercise, GridSearchCV is used to tune the *l1_ratio* of an elastic net model trained on the Gapminder data.

In [15]:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

In [12]:
df = pd.read_csv('datasets/gm_2008_region.csv')

X = df.drop(['Region','life'], axis=1).values
y = df['life'].values

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor
elastic_net = ElasticNet()

# Setup the GridSearchCV object
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit to the training data
gm_cv.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise',
       estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'l1_ratio': array([0.     , 0.03448, 0.06897, 0.10345, 0.13793, 0.17241, 0.2069 ,
       0.24138, 0.27586, 0.31034, 0.34483, 0.37931, 0.41379, 0.44828,
       0.48276, 0.51724, 0.55172, 0.58621, 0.62069, 0.65517, 0.68966,
       0.72414, 0.75862, 0.7931 , 0.82759, 0.86207, 0.89655, 0.93103,
       0.96552, 1.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [25]:
# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print('Tuned ElasticNet l1 ratio: {}'.format(gm_cv.best_params_))
print('Tuned ElasticNet R squared: {}'.format(r2))
print('Tuned ElasticNet MSE: {}'.format(mse))

Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}
Tuned ElasticNet R squared: 0.8668305372460283
Tuned ElasticNet MSE: 10.05791413339844
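
The MSE is in squared units of the target, which makes it hard to read directly. A small sketch of converting it to a root mean squared error follows, using the hold-out MSE printed above. It assumes the 'life' column is life expectancy in years, which the notebook does not state.

```python
import math

# Hold-out MSE copied from the output above
mse = 10.05791413339844

# RMSE is in the same units as the target (assumed here to be years)
rmse = math.sqrt(mse)
print('Tuned ElasticNet RMSE: {:.2f}'.format(rmse))
```

Under that assumption, the tuned model's predictions on the hold-out set are off by roughly 3.2 years in the root-mean-square sense.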
