# Cross Validation and Choosing Hyper-Parameters

We've started down a bad path, and it's time to address it and fix it.  We've got a lot of hyper-parameters that we choose, and if we just keep playing with them until we get good test accuracy, we're actually using the test set for training purposes, which means our test accuracy isn't really "fair" anymore -- it's almost more like training set accuracy.

If you do this, *performance on the test set does not necessarily indicate good performance on **new data**,* which is the actual point!

On the other hand, we can't just guess the hyper-parameters.  They interact in funny ways that are hard to predict, and what was a good parameter on one dataset can be wildly off base on another.  We have to choose them somehow!

## The Solution

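The solution is **cross-validation**.  (Strictly speaking, the single-split version described here is usually called *hold-out validation*; full $k$-fold cross-validation repeats the same idea over several different splits.)  It works like this: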

1. Split off some portion of the (original) training set and call it your "validation set."  Call the remaining portion your (new) training set.  This should be done randomly to avoid bias.
2. For each choice of hyper-parameters...
    1. Train a network on the training set with those parameters.
    2. Evaluate the performance of the network on the validation set.
3. Pick the hyper-parameters that give the best results on the validation set.
4. Train a network on the whole training set (training+validation) using those hyper-parameters.
5. Evaluate its performance on the test set, and this is your actual performance.

So in a sense we're still just training the network on the training set; we're just doing cross-validation within the training set to choose our hyper-parameters.
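
The five steps above can be sketched end to end on a toy problem.  This is only an illustration, not the network code from these notes: it assumes synthetic regression data and a closed-form ridge regression standing in for "train a network," and the helper names `fit_ridge` and `mse` are made up for the example.  The single hyper-parameter being chosen is the L2 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (purely illustrative): y = X @ w_true + noise.
n_samples, n_features = 300, 20
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + rng.normal(scale=0.5, size=n_samples)

# Hold the test set aside first; it is not touched until the very end.
X_trainval, y_trainval = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

def fit_ridge(X, y, l2):
    """Hypothetical stand-in for training: solve (X^T X + l2 * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Step 1: randomly split the original training set into train / validation.
order = rng.permutation(len(X_trainval))
cut = int(0.7 * len(X_trainval))
tr, va = order[:cut], order[cut:]

# Steps 2-3: train once per candidate, score on validation, keep the best.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
valid_errors = {}
for l2 in candidates:
    w = fit_ridge(X_trainval[tr], y_trainval[tr], l2)
    valid_errors[l2] = mse(X_trainval[va], y_trainval[va], w)
best_l2 = min(valid_errors, key=valid_errors.get)

# Step 4: retrain on the whole training set with the chosen value.
w_final = fit_ridge(X_trainval, y_trainval, best_l2)

# Step 5: one single evaluation on the test set.
test_error = mse(X_test, y_test, w_final)
print("best l2:", best_l2, "test MSE:", round(test_error, 4))
```

The point to notice is that `X_test` appears exactly once, on the last step, after every choice has already been made.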

## Hyper-Parameters So Far

Just to review, here are the hyper-parameters (read: choices that the algorithm doesn't make for you) we already have:

1. Depth of the network
2. Number of neurons at each level
3. Activation function at each level
4. Cost function
5. Initialization method
6. Learning Rate
7. Momentum
8. L1 Penalties
9. L2 Penalties
10. Number of epochs
11. Size of mini-batches

And we're going to get a lot more.  To be fair, some of these have a "right" answer and we don't have to search over them.  For example:

1. We'll always use the appropriate Xavier initialization, rather than a uniform method.
2. We'll always use the appropriate cost function; no need to compare CE and MSE for classification problems.
3. We'll typically use as many epochs as we can stand with our hardware and time constraints.
4. Batch size doesn't seem to matter very much, so we can just stick with a moderately sized number.

We might even decide we always like hyperbolic tangent neurons (or rectifier neurons, whatever) and so on, but even if you have a favorite architecture that you just adore, you still need to pick lots of finicky little numbers, and what works on one dataset won't work on another.  Hence, cross validation.

## Cross-Validation: Code

Cross validation is super easy to implement.  There's actually already code in `sklearn` for this, but it's so easy to do by hand that pulling in a library doesn't seem worth it.  We scramble up the data (like in the code for batching), then split it into two pieces.  This has the side effect of scrambling both sets, which doesn't really affect anything since we're scrambling anyway.

In [1]:
import numpy as np

def validation_split(X, Y, train_set_proportion):
    scrambled_indices = np.random.permutation(len(X))
    train_size = int(np.floor(len(X) * train_set_proportion))
    
    train_X = X[scrambled_indices[0:train_size]]
    train_Y = Y[scrambled_indices[0:train_size]]
    
    valid_X = X[scrambled_indices[train_size:]]
    valid_Y = Y[scrambled_indices[train_size:]]
    
    return train_X, train_Y, valid_X, valid_Y
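
A quick sanity check of the split on toy data may be helpful.  This sketch repeats the same logic as `validation_split` so it runs on its own; the toy arrays are assumptions chosen so that row `i` of `X` is `[i, i]` and its label is `i`, which makes it easy to confirm the rows and labels stay paired after scrambling.

```python
import numpy as np

def validation_split(X, Y, train_set_proportion):
    # Same logic as the function above, repeated so this snippet runs alone.
    scrambled_indices = np.random.permutation(len(X))
    train_size = int(np.floor(len(X) * train_set_proportion))
    train_idx = scrambled_indices[:train_size]
    valid_idx = scrambled_indices[train_size:]
    return X[train_idx], Y[train_idx], X[valid_idx], Y[valid_idx]

# Toy data: row i of X is [i, i] and its label is i.
X = np.repeat(np.arange(20)[:, None], 2, axis=1)
Y = np.arange(20)

train_X, train_Y, valid_X, valid_Y = validation_split(X, Y, 0.75)
print(train_X.shape, valid_X.shape)   # (15, 2) (5, 2)

# Rows and labels were shuffled together, so they still match up.
print(np.array_equal(train_X[:, 0], train_Y))   # True

# Every example lands in exactly one of the two pieces.
print(sorted(np.concatenate([train_Y, valid_Y]).tolist()) == list(range(20)))   # True
```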

# Choosing Hyper-Parameters

There are a lot of ways to choose hyper-parameters.  If the number of parameters is not too large, and the number of options is not too vast, you can just try every combination.  This is called **grid search**.  This is fine if you're just trying to choose (say) learning rate, momentum, and regularization (L1 and L2).  With three options for each of those four hyper-parameters, that means $3^4=81$ simulations, which is a lot, but not *that* many.  The word "grid" comes from imagining only two parameters, then drawing out an "integer" grid across the search space, and visiting each point on the grid.
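
To make the counting concrete, here is a small sketch that enumerates such a grid with `itertools.product`.  The candidate values are the ones used in the experiment later in this notebook; the variable names are just for illustration.

```python
import itertools

learning_rates = [0.5, 1, 2]
l1_costs = [0.25, 0.5, 1]
l2_costs = [0.25, 0.5, 1]
momenta = [0.8, 0.9, 0.95]

# Every combination of one value from each list; the last list varies fastest.
grid = list(itertools.product(learning_rates, l1_costs, l2_costs, momenta))

print(len(grid))   # 81
print(grid[0])     # (0.5, 0.25, 0.25, 0.8)
print(grid[1])     # (0.5, 0.25, 0.25, 0.9)
```

Adding a fifth hyper-parameter with three options would triple this to 243 runs, which is why grid search stops being practical so quickly.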

On the other hand, if you've got more parameters, or more options per parameter, it's simply not possible to search through all of them.  Worse, with so many options, there's a good chance that picking the best one among them would just overfit the validation set and give bad test error anyway.
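
Here is a little simulation of that overfitting effect.  It is a sketch with made-up numbers, not an experiment from these notes: it assumes every configuration has exactly the same true accuracy (90%), so any difference in measured validation accuracy is pure noise from the finite validation set.

```python
import numpy as np

rng = np.random.default_rng(0)

true_accuracy = 0.90   # assumed: every configuration is equally good
valid_size = 1000      # validation examples per evaluation
num_configs = 1000     # number of configurations we try

# Each measured validation accuracy is a noisy estimate of the true one.
correct = rng.binomial(valid_size, true_accuracy, size=num_configs)
measured = correct / valid_size

# Best measured score after trying the first 1, 10, 100, 1000 configurations.
running_best = np.maximum.accumulate(measured)
for tried in (1, 10, 100, 1000):
    print(tried, "configs tried -> best validation accuracy", running_best[tried - 1])
```

The best measured score can only go up as more configurations are tried, even though nothing has actually improved.  The "winner" is just the luckiest draw, and that luck won't carry over to the test set.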

Bengio talks a lot about this topic (see <a href="http://arxiv.org/pdf/1206.5533v2.pdf">his paper</a>, starting around page 16) and it's important enough that we will revisit it several times throughout these notes.  Choosing hyper-parameters is nearly as complicated, and just as important, as training the networks appropriately!

But for now let's just implement a basic grid search and see where that gets us.

One helpful note from that paper that we will use is that you should examine parameters uniformly on the "log-domain;" that is, don't look at 1, 2, 3, 4, but rather look at 1, 2, 4, 8.  This is because (for example) you're not going to see much change between 7 and 8, or at least not as much as you would between 1 and 2.  So it makes sense to fix a uniform ratio (a "hyper-hyper-parameter") and then spread out evenly along that.
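
As a small illustration of the difference, the sketch below builds a linearly spaced grid and a log-spaced one with NumPy and prints the ratio between neighboring values.  The endpoints and the ratio of 2 are just the ones from the example above.

```python
import numpy as np

# Linear spacing: 1, 2, 3, 4.  Ratios between neighbors shrink as we go.
linear = np.linspace(1, 4, num=4)

# Log-domain (geometric) spacing: roughly 1, 2, 4, 8.  The ratio stays fixed.
geometric = np.geomspace(1, 8, num=4)

print("linear ratios:   ", linear[1:] / linear[:-1])
print("geometric ratios:", geometric[1:] / geometric[:-1])

# The same idea with an explicit ratio (the "hyper-hyper-parameter"):
ratio = 2.0
lr_grid = 0.5 * ratio ** np.arange(3)
print(lr_grid)   # [0.5 1.  2. ]
```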

Also, it makes sense to look only at values of the parameters that could plausibly do well.  We know a learning rate of 38 is not appropriate; better to stick to plausible values.

## Grid Search: Code

First, let's write the code for the "inner" experiment; this is essentially the same as the old experiment code, except that it takes the training set as a parameter (instead of just using the global one) and doesn't have so many `print` statements, since that's going to get exhausting.

In [2]:
from basic_nn import *
import time
import numpy as np

def optimize(act_fn, cost_fn, init_fn, learning_rate,
             train_X, train_Y,
             neuron_sizes, num_epochs, batch_size,
             l1_cost=0, l2_cost=0,
             momentum=0
            ):
    np.random.seed(313) # for determinism
    
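    # Step 2: initialize (input/output sizes come from the data passed in,
    # rather than from global variables)
    n, k = train_X.shape[1], train_Y.shape[1]
    weights, biases = init_fn(n, k, neuron_sizes)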
    acts = [act_fn for _ in range(0, len(weights))]
    acts[-1] = act_sigmoid # last one is always sigmoid
    
    # Step 3: train
    t1 = time.time()
    
    weight_velocities = [0 for _ in range(0, len(weights))]
    biases_velocities = [0 for _ in range(0, len(biases))]

    for epoch in range(0, num_epochs):
        # we'll keep track of the cost as we go
        total_cost = 0
        num_batches = 0

        for X_mb, Y_mb in get_mini_batches(batch_size, train_X, train_Y):
            x, z, y = forward_prop(weights, biases, acts, X_mb)

            bp_grad_w, bp_grad_b = back_prop(weights, biases, acts, cost_fn, X_mb, Y_mb, x, y, z)
            l1_grad_w = lasso_cost(l1_cost, weights, biases, diff=True)
            l2_grad_w = ridge_cost(l2_cost, weights, biases, diff=True)

            for i in range(0, len(weights)):
                weight_grad = bp_grad_w[i] / len(X_mb)
                weight_grad += l1_grad_w[i]
                weight_grad += l2_grad_w[i]
                
                weight_velocities[i] = weight_velocities[i] * momentum + weight_grad
                weights[i] -= weight_velocities[i] * learning_rate
                
                biases_grad = bp_grad_b[i] / len(X_mb)
                
                biases_velocities[i] = biases_velocities[i] * momentum + biases_grad
                biases[i] -= biases_velocities[i] * learning_rate

            total_cost += cost_fn(y[-1], Y_mb, aggregate=True)
            num_batches += 1

        cost = total_cost / num_batches # average cost
    
    return weights, biases, acts

Now for the "outer experiment."  For now we'll just talk about optimizing learning rate, L1 and L2 costs, and momentum, assuming the other values can be picked by hand.

In [3]:
import itertools as it

def run_exp(act_fn, cost_fn, init_fn, learning_rate_range,
            train_X, train_Y,
            neuron_sizes, num_epochs, batch_size,
            l1_cost_range, l2_cost_range,
            momentum_range
           ):
    
    t1 = time.time()
    
    # Combine all the search into one loop...
    options = it.product(learning_rate_range, l1_cost_range, l2_cost_range, momentum_range)
    
    # Split the data into train/validation
    train_X, train_Y, valid_X, valid_Y = validation_split(train_X, train_Y, 0.7)
    best_validation_success = -1
    
    # Loop through all options ...
    sims_accomplished = 0
    for option in options:
        learning_rate_unscaled = option[0]
        l1_cost_unscaled = option[1]
        l2_cost_unscaled = option[2]
        momentum_unscaled = option[3]
        
        # Standard fixes
        learning_rate = learning_rate_unscaled * (1-momentum_unscaled)
        l1_cost = l1_cost_unscaled / len(train_X)
        l2_cost = l2_cost_unscaled / len(train_X)
        momentum = momentum_unscaled
        
        # Train them up
        weights, biases, acts = optimize(act_fn, cost_fn, init_fn,
                                         learning_rate, train_X, train_Y,
                                         neuron_sizes, num_epochs, batch_size,
                                         l1_cost=l1_cost, l2_cost=l2_cost,
                                         momentum=momentum)
        
        # Evaluate the results
        _, _, train_Y_hat = forward_prop(weights, biases, acts, train_X)
        train_success = classification_success_rate(train_Y_hat[-1], train_Y)
        
        _, _, valid_Y_hat = forward_prop(weights, biases, acts, valid_X)
        valid_success = classification_success_rate(valid_Y_hat[-1], valid_Y)
        
        args = (100*valid_success, 100*train_success, learning_rate_unscaled, l1_cost_unscaled, l2_cost_unscaled, momentum)
        
        # Output for each simulation ...
        print("Got {0:0.3f}% validation, {1:0.3f}% train with LR={2:0.3f}, L1={3:0.3f}, L2={4:0.3f}, Mm={5:0.3f}".format(*args))
        
        # Keep track of the best results
        if valid_success > best_validation_success:
            best_validation_success = valid_success
            best_weights, best_biases, best_acts = weights, biases, acts
            print("New record!")
        
        sims_accomplished += 1
        print("Finished {0} simulations after {1:0.3f} seconds.".format(sims_accomplished, time.time()-t1))
        print()
    
    return best_weights, best_biases, best_acts
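
One of the "standard fixes" above multiplies the learning rate by `(1 - momentum)`.  The notes don't spell out the reason, so take this as one reading of it: with the update rule used in `optimize`, a steady gradient builds the velocity up to about `1 / (1 - momentum)` times the gradient, so rescaling keeps the effective step size comparable across momentum values.  The sketch below checks this with a constant gradient of 1; `steady_state_step` is a hypothetical helper, not part of `basic_nn`.

```python
def steady_state_step(learning_rate, momentum, num_steps=500, grad=1.0):
    """Size of one weight update after the velocity has settled,
    using the same rule as optimize(): v = v * momentum + grad."""
    velocity = 0.0
    for _ in range(num_steps):
        velocity = velocity * momentum + grad
    return velocity * learning_rate

base_lr = 0.5
results = {}
for momentum in (0.0, 0.8, 0.9, 0.95):
    raw = steady_state_step(base_lr, momentum)                      # no rescaling
    scaled = steady_state_step(base_lr * (1 - momentum), momentum)  # with the fix
    results[momentum] = (raw, scaled)
    print("momentum", momentum, "raw step", raw, "scaled step", scaled)
```

Without the rescaling, the step grows like `base_lr / (1 - momentum)`; with it, the step stays close to `base_lr` no matter which momentum the grid search is trying.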

## Experiments

We'll still use the MNIST dataset:

In [4]:
from mnist_import import get_mnist_nice

train_X, train_Y, test_X, test_Y = get_mnist_nice()

n = train_X.shape[1]
k = train_Y.shape[1]

In [5]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate_range = [0.5, 1, 2]

neuron_sizes = [100, 60, 40]
num_epochs = 20
batch_size = 50

l1_cost_range = [0.25, 0.5, 1]
l2_cost_range = [0.25, 0.5, 1]

momentum_range = [0.8, 0.9, 0.95]

print("There are {0} experiments to run.".format(3*3*3*3))
weights, biases, acts = run_exp(act_fn, cost_fn, init_fn, learning_rate_range,
                                train_X, train_Y,
                                neuron_sizes, num_epochs, batch_size,
                                l1_cost_range, l2_cost_range,
                                momentum_range
                               )

There are 81 experiments to run.
Got 97.389% validation, 99.307% train with LR=0.500, L1=0.250, L2=0.250, Mm=0.800
New record!
Finished 1 simulations after 120.553 seconds.

Got 97.450% validation, 99.374% train with LR=0.500, L1=0.250, L2=0.250, Mm=0.900
New record!
Finished 2 simulations after 237.014 seconds.

Got 97.239% validation, 99.236% train with LR=0.500, L1=0.250, L2=0.250, Mm=0.950
Finished 3 simulations after 345.457 seconds.

Got 97.539% validation, 99.310% train with LR=0.500, L1=0.250, L2=0.500, Mm=0.800
New record!
Finished 4 simulations after 453.345 seconds.

Got 97.333% validation, 99.298% train with LR=0.500, L1=0.250, L2=0.500, Mm=0.900
Finished 5 simulations after 563.659 seconds.

Got 97.078% validation, 99.036% train with LR=0.500, L1=0.250, L2=0.500, Mm=0.950
Finished 6 simulations after 685.234 seconds.

Got 97.356% validation, 99.164% train with LR=0.500, L1=0.250, L2=1.000, Mm=0.800
Finished 7 simulations after 795.295 seconds.

Got 97.489% validation, 99.2

So, that took an incredibly long time (in comparison to the other experiments).  But the best parameters (from the options given) were a learning rate of `0.5`, L1 cost of `0.25`, L2 cost of `0.5`, and momentum of `0.8`.  Let's try those parameters with more epochs and see how well they do on the test set:

In [8]:
act_fn = act_tanh
cost_fn = cost_CE
init_fn = initialize_xavier_tanh

learning_rate = 0.5

neuron_sizes = [100, 60, 40]
num_epochs = 40
batch_size = 50

l1_cost = 0.25
l2_cost = 0.5

momentum = 0.8

# Rescale the parameters the same way run_exp did during the search
# (note that len(train_X) is now the full training set)
l1_cost /= len(train_X)
l2_cost /= len(train_X)
learning_rate *= 1 - momentum

# train it up; note this is the whole dataset now
weights, biases, acts = optimize(act_fn, cost_fn, init_fn, learning_rate,
                                 train_X, train_Y,
                                 neuron_sizes, num_epochs, batch_size,
                                 l1_cost=l1_cost, l2_cost=l2_cost,
                                 momentum=momentum
                                )

_, _, y = forward_prop(weights, biases, acts, train_X)
train_Y_hat = y[-1]

train_success = classification_success_rate(train_Y_hat, train_Y)
print("Got {0:0.3f}% success rate on the training data.".format(100 * train_success))

_, _, y = forward_prop(weights, biases, acts, test_X)
test_Y_hat = y[-1]

test_success = classification_success_rate(test_Y_hat, test_Y)
print("Got {0:0.3f}% success rate on the test data.".format(100 * test_success))

Got 99.620% success rate on the training data.
Got 98.000% success rate on the test data.


Excellent!  Our first 98% success rate on the test data. It makes you wonder if this was really the ideal set of parameters, though.  Maybe more regularization or more momentum (or both) would have done better on the validation phase if they had been given more epochs, but there just wasn't the computing time to do so.

Also, we could have experimented with different network architectures (depth, number of neurons, activation functions) and gotten an even better result, but again, time limitations bit us. In a later notebook, we may talk about how to use AWS (or similar services) to parallelize this search and get this done in a tiny fraction of the time.  However, for right now, we stop here.

## Wrapup and Summary

We introduced **cross validation** as a way to avoid training on the test data.  We also introduced the **grid search** as a primitive way of exploring possible settings of the hyper-parameters.  We did a large-scale experiment which took hours to run, but found a decent set of parameters that did well on the test set.

## Homework

Try this yourself!  Find other ranges to search, or consider different variables to optimize on.  If your computer runs better than mine, try a longer simulation and see how it goes.  We'll talk about more advanced parameter searches later on, too.