# Hyperparameter optimization using Keras and the scikit-learn API

Keras models can be wrapped as scikit-learn estimators, which lets us run a grid search over hyperparameters with scikit-learn's `GridSearchCV`.

We will demonstrate this using the original regression example.

#### Important hyperparameters for training

- optimization algorithm
- learning rate
- dropout
- regularization
- batch size
- number of training epochs

As examples of grid search, we will explore varying optimizers, number of epochs, learning rate, and regularization.
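A grid search evaluates every combination of the candidate values. The sketch below enumerates such a grid using only the standard library. The grid values mirror ones used later in this notebook, and `n_folds` is an assumed fold count, included only to show how quickly the number of model fits grows.

```python
from itertools import product

# Candidate values per hyperparameter (illustrative)
param_grid = {
    "optimizer": ["SGD", "RMSprop", "Adam"],
    "epochs": [50, 100, 150],
}
n_folds = 3  # assumed number of cross-validation folds

# Cartesian product of all candidate values: one dict per combination
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Every combination is trained once per fold
n_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", n_fits, "model fits")
```

Each added hyperparameter multiplies the number of fits, which is why the later cells drop from 10 folds to 3.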

In [1]:
# default settings
num_epochs = 50
learning_rate = 0.01

In [2]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

from keras.optimizers import Adam, SGD, RMSprop
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from keras import regularizers

Using TensorFlow backend.


In [3]:
boston = load_boston()
X = boston.data
y = boston.target
y = y.reshape(-1,1)

In [4]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

In [5]:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [6]:
# some housekeeping
input_dim = X_train.shape[1]
output_dim = 1 # for regression

#### First, let's do a grid search over optimization algorithms.

For some optimizers, dependent parameters (such as the learning rate) can or should be tuned; we'll explore that later by example. For most optimizers, however, the recommendation is to leave the defaults unchanged (e.g., RMSprop, Adagrad).

For this run, the defaults will be used, e.g.:
- SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
- Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
- RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

In [7]:
# build_fn for keras.wrappers.scikit_learn.KerasRegressor(build_fn=None, **sk_params)
def create_model(optimizer = "Adam"):    
    model = Sequential()
    model.add(Dense(output_dim , input_dim = input_dim, kernel_initializer='normal')) # activation = None for regression
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    return model

In [8]:
# epochs is forwarded by the scikit-learn wrapper to the Keras model's fit()
model = KerasRegressor(build_fn=create_model, epochs = num_epochs, verbose=0)
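
The wrapper has to decide which keyword arguments belong to the model-building function and which belong to training, which is why the names in a parameter grid must match argument names such as `optimizer` or `epochs`. The following is a rough, hypothetical illustration of routing by function signature. It is not the wrapper's actual code, and `select_kwargs`, `build`, and `train` are names invented here.

```python
import inspect

def select_kwargs(fn, params):
    """Keep only the entries of `params` that `fn` accepts as named arguments."""
    accepted = inspect.signature(fn).parameters
    return {name: value for name, value in params.items() if name in accepted}

# Hypothetical stand-ins for a model-building function and a training function
def build(optimizer="Adam"):
    return "model compiled with " + optimizer

def train(model, epochs=1, verbose=1):
    return "{} trained for {} epochs".format(model, epochs)

params = {"optimizer": "SGD", "epochs": 50, "verbose": 0}

# Each function receives only the arguments it declares
model_stub = build(**select_kwargs(build, params))
print(train(model_stub, **select_kwargs(train, params)))
```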

In [9]:
# define the grid search parameters
optimizers = ['RMSprop', 'Adam', 'SGD']
grid = GridSearchCV(estimator=model, cv=10, param_grid=dict(optimizer = optimizers))

# do the grid search
fit = grid.fit(X_train, y_train)

In [10]:
def report_cv_results(fit):
    means = fit.cv_results_['mean_test_score']
    sdevs = fit.cv_results_['std_test_score']
    params = fit.cv_results_['params']
    for mean, sd, param in zip(means, sdevs, params):
        print("Mean score: {:.2f}    Std. dev.: {:.2f}    Param: {}".format(mean, sd, param))

In [11]:
report_cv_results(fit)

Mean score: 540.07    Std. dev.: 60.65    Param: {'optimizer': 'RMSprop'}
Mean score: 540.54    Std. dev.: 60.64    Param: {'optimizer': 'Adam'}
Mean score: 21.34    Std. dev.: 7.55    Param: {'optimizer': 'SGD'}
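
The scores printed above appear to be mean squared errors (the model's loss), so lower is better. `GridSearchCV` treats larger scores as better when it ranks candidates, so in this setup attributes such as `best_params_` may not point at the lowest-error setting. A minimal sketch of picking the best row by hand, using a plain dictionary (`cv_results`, a stand-in for `fit.cv_results_`) filled with the numbers above:

```python
# Scores and parameters copied from the output above
cv_results = {
    "mean_test_score": [540.07, 540.54, 21.34],
    "params": [{"optimizer": "RMSprop"}, {"optimizer": "Adam"}, {"optimizer": "SGD"}],
}

# Lower is better for a loss, so select the minimum explicitly
scores = cv_results["mean_test_score"]
best_index = min(range(len(scores)), key=scores.__getitem__)
best_params = cv_results["params"][best_index]

print(best_params, scores[best_index])
```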


#### Interestingly, with 50 epochs of training, SGD is _much_ better than the other algorithms! Let's check if Adam and RMSprop catch up with more epochs.

In [12]:
# `epochs` is not used inside this function; the wrapper forwards it to the model's fit()
def create_model(optimizer = "SGD", epochs = num_epochs):
    
    model = Sequential()
    model.add(Dense(output_dim , input_dim = input_dim, kernel_initializer='normal')) # activation = None for regression
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    return model

In [13]:
model = KerasRegressor(build_fn=create_model, epochs = num_epochs, verbose=0)

In [14]:
# define the grid search parameters
optimizers = ['SGD', 'RMSprop', 'Adam']
epochs = [50,100,150]
# Let's choose 3 cv splits to speed this up for the demo
# grid = GridSearchCV(estimator=model, cv=10, param_grid=dict(optimizer = optimizers, epochs = epochs))
grid = GridSearchCV(estimator=model, cv=3, param_grid=dict(optimizer = optimizers, epochs = epochs))

# do the grid search
fit = grid.fit(X_train, y_train)
report_cv_results(fit)

Mean score: 23.45    Std. dev.: 4.75    Param: {'epochs': 50, 'optimizer': 'SGD'}
Mean score: 550.94    Std. dev.: 40.98    Param: {'epochs': 50, 'optimizer': 'RMSprop'}
Mean score: 551.35    Std. dev.: 41.25    Param: {'epochs': 50, 'optimizer': 'Adam'}
Mean score: 23.17    Std. dev.: 4.25    Param: {'epochs': 100, 'optimizer': 'SGD'}
Mean score: 518.86    Std. dev.: 38.12    Param: {'epochs': 100, 'optimizer': 'RMSprop'}
Mean score: 521.32    Std. dev.: 38.83    Param: {'epochs': 100, 'optimizer': 'Adam'}
Mean score: 23.70    Std. dev.: 4.42    Param: {'epochs': 150, 'optimizer': 'SGD'}
Mean score: 494.37    Std. dev.: 35.18    Param: {'epochs': 150, 'optimizer': 'RMSprop'}
Mean score: 496.13    Std. dev.: 35.96    Param: {'epochs': 150, 'optimizer': 'Adam'}


#### More epochs help only slowly: Adam and RMSprop improve, but remain far behind SGD. How about varying the learning rate for Adam (default is 0.001)?
(We will just pick Adam for this run.)
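
One plausible reason the learning rate matters so much for Adam: its normalized update moves a parameter by roughly the learning rate per step, regardless of how large the gradient is. The toy sketch below is not the notebook's model. It fits a single bias term to a target of 22.5 (an assumed value, roughly the average house price in this dataset) for 600 steps (an assumed figure, about 12 mini-batches per epoch for 50 epochs), using textbook SGD and Adam updates with the default settings listed earlier.

```python
import math

target = 22.5   # assumed: roughly the value the bias term must reach
n_steps = 600   # assumed: about 12 mini-batches per epoch for 50 epochs

def grad(b):
    # Gradient of the toy loss (b - target)**2
    return 2.0 * (b - target)

# Plain SGD with lr = 0.01
b_sgd = 0.0
for _ in range(n_steps):
    b_sgd -= 0.01 * grad(b_sgd)

# Textbook Adam with lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8
lr, beta_1, beta_2, eps = 0.001, 0.9, 0.999, 1e-8
b_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, n_steps + 1):
    g = grad(b_adam)
    m = beta_1 * m + (1 - beta_1) * g          # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta_2 ** t)
    b_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

print("SGD :", round(b_sgd, 3))
print("Adam:", round(b_adam, 3))
```

In this toy problem SGD essentially reaches the target, while Adam cannot travel further than about `n_steps * lr` from its starting point, so a larger learning rate is a natural thing to try.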

In [15]:
def create_model(learn_rate = learning_rate):
    
    model = Sequential()
    model.add(Dense(output_dim , input_dim = input_dim, kernel_initializer='normal')) # activation = None for regression
    optimizer = Adam(lr=learn_rate)
    # accuracy is a classification metric and is not meaningful for regression, so no extra metrics are requested
    model.compile(loss='mean_squared_error', optimizer=optimizer)
    return model

In [16]:
model = KerasRegressor(build_fn=create_model, epochs = num_epochs, verbose=0)

In [17]:
learning_rates = [0.001, 0.01, 0.1, 0.5, 0.8]

# grid = GridSearchCV(estimator=model, cv=10, param_grid=dict(learn_rate = learning_rates))
grid = GridSearchCV(estimator=model, cv=3, param_grid=dict(learn_rate = learning_rates))

# do the grid search
fit = grid.fit(X_train, y_train)
report_cv_results(fit)

Mean score: 550.64    Std. dev.: 40.73    Param: {'learn_rate': 0.001}
Mean score: 369.74    Std. dev.: 25.60    Param: {'learn_rate': 0.01}
Mean score: 24.08    Std. dev.: 4.94    Param: {'learn_rate': 0.1}
Mean score: 26.12    Std. dev.: 4.44    Param: {'learn_rate': 0.5}
Mean score: 28.57    Std. dev.: 7.17    Param: {'learn_rate': 0.8}


#### Finally, let's see an example of grid search for different types and degrees of regularization.
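
As a reminder of what is being tuned: an L1 regularizer adds `strength * sum(|w|)` to the loss, and an L2 regularizer adds `strength * sum(w**2)`. A minimal pure-Python sketch with made-up weights (the function names are hypothetical and only illustrate the formulas):

```python
def l1_penalty(weights, strength):
    # L1: strength times the sum of absolute weights
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    # L2: strength times the sum of squared weights
    return strength * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]  # made-up weights for illustration

# Same strengths as the grid explored in this section
for strength in (0.001, 0.01, 0.1):
    print(strength, l1_penalty(weights, strength), l2_penalty(weights, strength))
```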

In [18]:
def create_model(regularizer = regularizers.l2(0.)):
    
    model = Sequential()
    model.add(Dense(output_dim , input_dim = input_dim, kernel_initializer='normal',
                   kernel_regularizer = regularizer))
    model.compile(loss='mean_squared_error', optimizer="SGD")
    return model

In [19]:
model = KerasRegressor(build_fn=create_model, epochs = num_epochs, verbose=0)

In [20]:
regularizer_list = [
    regularizers.l1(0.001), regularizers.l1(0.01), regularizers.l1(0.1),
    regularizers.l2(0.001), regularizers.l2(0.01), regularizers.l2(0.1),
]
#grid = GridSearchCV(estimator=model, cv=10, param_grid=dict(regularizer = regularizer_list))
grid = GridSearchCV(estimator=model, cv=3, param_grid=dict(regularizer = regularizer_list))

# do the grid search
fit = grid.fit(X_train, y_train)
#report_cv_results(fit)

In [21]:
report_cv_results(fit)

Mean score: 23.59    Std. dev.: 4.68    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede572668>}
Mean score: 23.83    Std. dev.: 4.69    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede572208>}
Mean score: 25.21    Std. dev.: 4.86    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede5722e8>}
Mean score: 23.71    Std. dev.: 4.78    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede572940>}
Mean score: 23.99    Std. dev.: 4.98    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede4d4048>}
Mean score: 26.26    Std. dev.: 5.16    Param: {'regularizer': <keras.regularizers.L1L2 object at 0x7f5ede4d4080>}
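
The regularizer objects print as memory addresses, which makes the report hard to read. Assuming the rows appear in the same order as `regularizer_list`, the sketch below pairs the scores above with hand-written labels (`labels` and `means` are names introduced here, and the numbers are copied from the output).

```python
# Labels written by hand in the order of regularizer_list (assumed to match the row order)
labels = ["l1(0.001)", "l1(0.01)", "l1(0.1)", "l2(0.001)", "l2(0.01)", "l2(0.1)"]
# Mean scores copied from the output above
means = [23.59, 23.83, 25.21, 23.71, 23.99, 26.26]

for label, mean in zip(labels, means):
    print("{:<10} {:.2f}".format(label, mean))

# The score is a loss here, so the smallest value is the best one
best_label, best_mean = min(zip(labels, means), key=lambda pair: pair[1])
print("lowest mean score:", best_label, best_mean)
```

Read this way, the weakest regularization strengths give the lowest scores for both L1 and L2, and the differences are small compared with the reported standard deviations.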
