<a href="https://colab.research.google.com/github/rahiakela/deep-learning--from-basics-to-practice/blob/24-keras-part-2/cross_validate_and_grid_search_using_scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-validate and Grid-search using Scikit-Learn

So far we’ve been searching our hyperparameters by hand. It’s been
illuminating, but it also required a lot of manual effort.

The scikit-learn library offers us routines to
cross-validate our model (to estimate how good it is), and grid-search
its hyperparameters (to find the best-performing combination).

Keras doesn’t offer either of these tools directly. Instead, it offers a way
to use the ones already in scikit-learn.

Searching can be slow, because every combination of hyperparameters
means another training run. A popular approach is to extract a tiny
piece of the dataset, carefully selected to be representative of the
whole, and search on that. Then each training run will be much faster.

By cross-validating and grid-searching one or more of these little proxy
datasets, we can get some guidance for what models and hyperparameters
are worth exploring on a larger scale. Then we can take that
knowledge and work with larger and larger pieces of the dataset, tuning
the hyperparameters at each step.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras.datasets import mnist
from tensorflow.keras import backend as keras_backend
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

keras_backend.set_image_data_format('channels_last')

TensorFlow 2.x selected.


## Load and process the MNIST data

In [2]:
# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()

image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width


# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)


# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0


# save the original y_train and y_test
original_y_train = y_train
original_y_test = y_test

# replace label data with one-hot encoded versions
number_of_classes = 1 + max(np.append(y_train, y_test)).astype(np.int32)

# encode each list into one-hot arrays of the size we just found
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)

# reshape samples to 2D grid, one line per image
X_train = X_train.reshape([X_train.shape[0], number_of_pixels])
X_test = X_test.reshape([X_test.shape[0], number_of_pixels])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


## Keras Wrappers

It would be nice to use scikit-learn’s cross-validation and grid-search
tools directly on our Keras models. But Keras and scikit-learn are
separate libraries, so scikit-learn doesn’t know anything about Keras
and its models. Keras, on the other hand, provides wrappers that know
how to talk to scikit-learn.

These wrappers let us place a Keras model into scikit-learn,
and then do cross-validation, grid search, or any other operation we
like. From scikit-learn’s perspective, this object is just some custom
estimator that we wrote and gave to it. It doesn’t know that there’s a
deep network hiding inside.

We pull off this trick by embedding our Keras model in an object of
type KerasClassifier or KerasRegressor, depending on the job it
does. These objects are called wrappers, since they “wrap” our Keras
model in a disguise that makes it look and act like a scikit-learn estimator.

Since both wrappers work identically, we’ll choose KerasClassifier
as an example so we can stick with the MNIST classifiers we’ve been
discussing so far.

Let’s dig in, starting with the model-making function.

The wrapper’s first argument is named build_fn, short for “build function.”
Its value is a function that we’ve written which will construct, compile,
and return a Keras model.

This model-making function will be called automatically
by the wrapper whenever scikit-learn needs the model. When we’re grid
searching, the model will usually be built over and over again at the
start of each new step of the search.

So we have our model-making function that takes arguments, and a
wrapper, and scikit-learn which is going to call our function. How do
we get scikit-learn to include the arguments we want when it calls the
model-making function?

Happily, the mechanism is easy. The trick is in the naming of our
arguments. When we create a search using
scikit-learn, we provide it with a dictionary that names each parameter
we want it to search on as a key, with the values to be tried as the value.
The keys match the argument names of our model-making function.
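
As a rough illustration of that name-matching idea, here is a minimal sketch in plain Python. The `describe_model` function is a hypothetical stand-in for a real model-making function (it only reports the arguments it received), and `param_grid` is an assumed example dictionary. Neither comes from scikit-learn or Keras, and this is not how scikit-learn implements its search internally.

```python
from itertools import product

# Hypothetical stand-in for a model-making function: it just reports
# the arguments it received instead of building a network.
def describe_model(number_of_layers=2, neurons_per_layer=32):
    return {'number_of_layers': number_of_layers,
            'neurons_per_layer': neurons_per_layer}

# Keys name the arguments to search on; values list the settings to try.
param_grid = {
    'number_of_layers': [2, 3],
    'neurons_per_layer': [32, 64],
}

# Expand the dictionary into every combination of values, then pass each
# combination by keyword so that names are matched up automatically.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]
built = [describe_model(**combo) for combo in combinations]

for settings in built:
    print(settings)
```

Because the combinations are passed with `**combo`, a key that does not match an argument name would raise a `TypeError`, which is why the naming matters.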

To see this in action, let’s start with a model-making function that
takes parameters.

In [0]:
# Build a network of any number of (dense + dropout) layers of the given size
def make_model(number_of_layers=2, neurons_per_layer=32, dropout_ratio=0.2, optimizer='adam'):
  model = Sequential()

  # first layer is special, because it sets input_shape
  model.add(Dense(neurons_per_layer, activation='relu', input_shape=[number_of_pixels], kernel_constraint=max_norm(3)))
  model.add(Dropout(dropout_ratio))

  # now add in all the rest of the dense-dropout layers
  for layer in range(number_of_layers - 1):
    model.add(Dense(neurons_per_layer, activation='relu', kernel_constraint=max_norm(3)))
    model.add(Dropout(dropout_ratio))

  # finish up with a softmax layer with 10 outputs
  model.add(Dense(number_of_classes, activation='softmax'))

  # compile the model and return it
  model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

  return model

## Wrap up make_model() function in KerasClassifier.

Wrap up our model-making function into a KerasClassifier, which
will make it behave like a standard scikit-learn estimator.  We'll
give all the arguments defaults which we can override later when
we build the model as part of cross-validation or grid search.

For instance, we provide a value for number_of_layers in the wrapper, and
that value will be passed to the argument of the same name when
make_model() is called.

In [0]:
kc_model = KerasClassifier(build_fn=make_model, 
                           number_of_layers=2, neurons_per_layer=32, optimizer='adam',  # parameters for the model-making function 
                           epochs=100, batch_size=256, verbose=0  # parameters for training, passed on to fit()
                           )

In effect, the wrapper just takes the values we provide to it and passes
them to the model-making function arguments of the same name. The
syntax is a little confusing because it looks like KerasClassifier is
taking these arguments for itself. In fact, it accepts arbitrary keyword
arguments, stores them, and hands each one on by name.

If we hand kc_model to scikit-learn for cross-validation, make_model()
will be called and passed the value 2 for number_of_layers, the value
32 for the argument neurons_per_layer, and the string 'adam' for
optimizer.

The last set of three arguments to KerasClassifier() (epochs,
batch_size, and verbose) are not for our model-making function. The
wrapper holds onto them and passes them to the Keras model’s fit()
routine, which scikit-learn triggers through the wrapper, to control
the training process.

The key thing to remember is that the wrapper is basically remembering
what values should be used for the arguments in the model-making
function, and it will use those by default. It also remembers a few values
that control training. As long as the names we’re assigning
to in the wrapper match the names in the model-making routine,
everything will be automatically matched up.
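
To make the forwarding idea concrete, here is a toy sketch of a wrapper that stores keyword arguments and hands each one to whichever function accepts that name. `TinyWrapper`, `build`, and `train` are hypothetical names invented for this illustration. This is not the actual KerasClassifier implementation, only a simplified picture of name-based routing.

```python
import inspect

# Hypothetical stand-ins: one function builds a "model", the other trains it.
def build(number_of_layers=2, optimizer='adam'):
    return {'layers': number_of_layers, 'optimizer': optimizer}

def train(model, epochs=1, batch_size=32):
    return f"{model['layers']} layers, {epochs} epochs, batch {batch_size}"

class TinyWrapper:
    """Toy wrapper: stores keyword arguments and forwards each one by name."""

    def __init__(self, build_fn, **params):
        self.build_fn = build_fn
        self.params = params

    def _params_for(self, fn):
        # Keep only the stored values whose names fn accepts.
        accepted = inspect.signature(fn).parameters
        return {k: v for k, v in self.params.items() if k in accepted}

    def fit(self):
        # Build with the build-related values, train with the training ones.
        model = self.build_fn(**self._params_for(self.build_fn))
        return train(model, **self._params_for(train))

wrapper = TinyWrapper(build, number_of_layers=3, epochs=5, batch_size=256)
summary = wrapper.fit()
print(summary)
```

Here `number_of_layers` reaches `build`, while `epochs` and `batch_size` reach `train`, even though all three were given to the wrapper in one call.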

## Cross-Validation

Applying cross-validation may seem pointless. After all, we already
have an excellent, large testing set. What more are we going to learn
from cross-validation that we haven’t already seen by using our validation
data?

In this case, not much. We should expect the results of cross-validation
to be very close to what we saw above.

Cross-validation becomes more valuable when we’re working with
a small version of an original dataset. If the dataset is small,
cross-validation is a great way to evaluate the model without making the
training set even smaller by setting aside a dedicated validation set.

Cross-validation requires training and then validating our entire
model over and over again with slightly different data. We’ll be using
10 folds, so each session of the cross-validator will take about 10 times
longer than a single training run.

One issue is that scikit-learn’s cross-validation function
cross_val_score() doesn’t want the one-hot encoded version of our
label data. It wants the original versions that contain lists of integers.
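
In this notebook we saved the integer labels as original_y_train before encoding, so nothing needs to be recovered. If only the one-hot version were available, the integers could be rebuilt by finding the position of the largest entry in each row. The sketch below does this in plain Python with a hypothetical helper named `one_hot_to_labels`; with NumPy arrays, `np.argmax(y, axis=1)` does the same job.

```python
def one_hot_to_labels(one_hot_rows):
    """Return the index of the largest entry in each row."""
    return [list(row).index(max(row)) for row in one_hot_rows]

# Three one-hot rows for a 4-class problem.
one_hot = [
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

labels = one_hot_to_labels(one_hot)
print(labels)
```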

The other issue has to do with what data we pass to the cross-validation
system. As we’ve done before, we’ll simply pretend that we don’t
have a validation set, and treat the training data as if it was our entire
dataset. We’ll let the cross-validator manage the train-validation split
for us.

Now let’s get this cross-validation going. There are two tasks to perform.
* First, we’ll make the object that drives the cross-validation
process. We’ll use StratifiedKFold() with 10 splits.
* Then we’ll hand that object to cross_val_score(), along with our
model and data, to run the cross-validation.

We’ll shuffle the data, and we’ll set the optional
random_state argument to the value of random_seed that we define
alongside it. That makes the folds repeatable, which is useful for debugging.

In [0]:
from sklearn.model_selection import StratifiedKFold

random_seed = 42
np.random.seed(random_seed)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)
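
To show what "stratified" means here, the following is a simplified sketch in plain Python. It is not scikit-learn's algorithm. The helper `stratified_folds` is a hypothetical name, and it simply deals each class's shuffled samples out to the folds in turn so that every fold keeps roughly the same class proportions as the whole dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed):
    """Simplified stratified split: returns one list of indices per fold."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Toy labels: 20 samples of class 0 and 10 samples of class 1.
labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, n_splits=5, seed=42)

# Count how many samples of each class landed in each fold.
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for counts in fold_counts:
    print(dict(counts))
```

Every fold ends up with four samples of class 0 and two of class 1, matching the 2:1 ratio of the full toy dataset. Using the same seed reproduces the same folds.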

We just tell scikit-learn to run the cross-validator and track
the scores by calling cross_val_score() with our model, our training
data and original labels, and our folding object.

In [0]:
from sklearn.model_selection import cross_val_score

results = cross_val_score(kc_model, X_train, original_y_train, cv=kfold, verbose=0)

In [10]:
print(f'results = {str(results)} \n results.mean = {str(results.mean())}')

results = [0.95820844 0.95968682 0.96250623 0.96083331 0.96633333 0.9666611
 0.9599933  0.9613269  0.96815073 0.96114075] 
 results.mean = 0.9624840915203094


Putting it all together.

In [0]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

random_seed = 42
np.random.seed(random_seed)

# Build a network of any number of (dense + dropout) layers of the given size
def make_model(number_of_layers=2, neurons_per_layer=32, dropout_ratio=0.2, optimizer='adam'):
  model = Sequential()

  # first layer is special, because it sets input_shape
  model.add(Dense(neurons_per_layer, activation='relu', input_shape=[number_of_pixels], kernel_constraint=max_norm(3)))
  model.add(Dropout(dropout_ratio))

  # now add in all the rest of the dense-dropout layers
  for layer in range(number_of_layers - 1):
    model.add(Dense(neurons_per_layer, activation='relu', kernel_constraint=max_norm(3)))
    model.add(Dropout(dropout_ratio))

  # finish up with a softmax layer with 10 outputs
  model.add(Dense(number_of_classes, activation='softmax'))

  # compile the model and return it
  model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

  return model

# make the model and wrap it up for scikit-learn
kc_model = KerasClassifier(build_fn=make_model, 
                           number_of_layers=2, neurons_per_layer=32, optimizer='adam',  # parameters for the model-making function 
                           epochs=100, batch_size=256, verbose=0  # parameters for training, passed on to fit()
                           )

# create cross-validator
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)
results = cross_val_score(kc_model, X_train, original_y_train, cv=kfold, verbose=0)
print(f'results = {str(results)} \n results.mean = {str(results.mean())}')

So the cross-validation run is telling us that on the original dataset of
60,000 images, we got a performance of a bit more than 96% accuracy.
That’s just about the same as what we saw graphically for this
model way back in Figure 24.13, where the validation accuracy was
just a smidge better than 96%.

That’s reassuring. It says that this whole wrapping and cross-validating
scheme is producing the same results that we got when we trained
and tested the model ourselves.
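
The ten fold scores printed earlier can also be summarized by their spread, not just their mean. The sketch below copies those scores by hand into a list named `fold_scores` (an assumed name) and uses Python's standard `statistics` module. A small spread suggests the estimate does not depend much on which samples landed in which fold.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score() output shown earlier.
fold_scores = [0.95820844, 0.95968682, 0.96250623, 0.96083331, 0.96633333,
               0.9666611, 0.9599933, 0.9613269, 0.96815073, 0.96114075]

average = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the folds

print(f'mean accuracy = {average:.4f}, std = {spread:.4f}')
```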

## Cross-Validation with Normalization

We already normalized the training data to the range [0,1] when we divided it by 255. So when cross-validation grabs a random 90% of these samples and trains on them, it’s likely to get samples that run from 0 to 1.

In general, the data that’s going to get chosen from our dataset and used for cross-validation won’t be normalized to the range [0,1]. It’s up to us to get that normalization in there, and then apply that same transform to the part of the data that was set aside for testing in that run.
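
The idea of "learn the transform on the training part, then reuse it on the held-out part" can be sketched in plain Python. The helpers `fit_min_max` and `apply_min_max` are hypothetical names for this illustration, and the numbers are made up. This only mimics the general behavior of a min-max scaler and is not scikit-learn's code.

```python
def fit_min_max(rows):
    """Learn each column's minimum and range from the training rows only."""
    columns = list(zip(*rows))
    minimums = [min(column) for column in columns]
    # Use a range of 1.0 for constant columns to avoid dividing by zero.
    ranges = [(max(column) - low) or 1.0
              for column, low in zip(columns, minimums)]
    return minimums, ranges

def apply_min_max(rows, minimums, ranges):
    """Scale rows using previously learned minimums and ranges."""
    return [[(value - low) / span
             for value, low, span in zip(row, minimums, ranges)]
            for row in rows]

# Made-up data: three training rows and one held-out row, two features each.
train_rows = [[0.0, 10.0], [50.0, 30.0], [200.0, 20.0]]
held_out_rows = [[100.0, 40.0]]

# Fit on the training rows, then apply the same transform to both parts.
minimums, ranges = fit_min_max(train_rows)
scaled_train = apply_min_max(train_rows, minimums, ranges)
scaled_held_out = apply_min_max(held_out_rows, minimums, ranges)

print(scaled_train)
print(scaled_held_out)
```

Note that the held-out row can land outside [0,1] (its second feature scales to 1.5), because the transform was learned only from the training rows. That is the expected result of treating the held-out data as unseen.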

We can normalize the particular piece of training
data that’s built for each pass through cross-validation by building
a Pipeline object composed of two steps: a normalizer followed by
our model.

Let’s do this by first making our objects, and then assembling them
into a Pipeline object. For demonstration purposes our pipeline will
contain a MinMaxScaler from scikit-learn, followed by our model.

In [0]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

estimators = []
estimators.append(('normalize_step', MinMaxScaler()))
estimators.append(('model_step', kc_model))
pipeline = Pipeline(estimators)

Constructing a pipeline this way is useful when we want to later refer
to the individual steps. We’ll need to do that soon when we use grid
searching.

But for this cross-validation step, we don’t need that kind of access.
We’ll often see code that builds the pipeline in one line, using the shortcut
make_pipeline() function.

In [0]:
pipeline = make_pipeline(MinMaxScaler(), kc_model)

These two pipeline objects behave the same. The only difference is that
we’ve given our own names to the steps in the first version, while
make_pipeline() generates the step names automatically.

To use our pipeline object, we just give it to cross_val_score() in
place of a model (or wrapped model). Scikit-learn will recognize that
it’s a pipeline and take care of all the rest.

In [0]:
results = cross_val_score(pipeline, X_train, original_y_train, cv=kfold, verbose=0)

Putting these new lines together.

In [16]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# make the model and wrap it up for scikit-learn
kc_model = KerasClassifier(build_fn=make_model, 
                           number_of_layers=2, neurons_per_layer=32, optimizer='adam',  # parameters for the model-making function 
                           epochs=100, batch_size=256, verbose=0  # parameters for training, passed on to fit()
                           )

# create pipeline
pipeline = make_pipeline(MinMaxScaler(), kc_model)

# create cross-validator
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)

# execute cross validation using pipeline and k-fold cross-validator
results2 = cross_val_score(pipeline, X_train, original_y_train, cv=kfold, verbose=0)
print(f'results = {str(results2)} \n results.mean = {str(results2.mean())}')

results = [0.96120548 0.96002001 0.96400601 0.96216667 0.96216667 0.96682781
 0.96266043 0.96449411 0.96631652 0.95980656] 
 results.mean = 0.9629670262336731


Cross-validation is a great way to get a handle on the quality of our
model. It’s not so great when training times start to push our patience,
since every fold is essentially a brand-new full-length training and
testing process. Using 10 folds requires training and then testing our
model 10 times in a row.