# Systematically Tuning Hyperparameters

* Many hyperparameters to optimize
    * learning rate / decay rate
    * momentum
    * regularization
    * hidden layer size
    * number of hidden layers
* Need ways to automate this
    * Cross Validation, [sklearn cross validation source code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py)
    * Grid/Random Search, [sklearn grid search source code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py)
* In deep learning we usually have very large datasets, so we just split the data once into one training set and one validation set. This is occasionally referred to as cross-validation as well.

## Cross Validation

* General way to choose hyperparameters
    * e.g., we want to know whether 4 or 5 hidden units is better
        * Do cross-validation on both, choose the one with the best accuracy
        * "Best" could be defined as "statistically significantly better" if you are into statistics
* We do not want to get "perfect" accuracy on training data
    * Data = signal + noise/outliers
    * We want to fit to signal, not noise


**Split up your data**
* Train - train on this data
* Validation - validate on this data
* Test - try not to touch this data until the very end
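
The three-way split above can be sketched with NumPy alone. The helper name `train_val_test_split`, the 70/15/15 proportions, and the toy data are illustrative assumptions, not something these notes prescribe.

```python
import numpy as np

def train_val_test_split(X, Y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut the data into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order of row indices
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]       # everything that is left
    return ((X[train_idx], Y[train_idx]),
            (X[val_idx], Y[val_idx]),
            (X[test_idx], Y[test_idx]))

# toy data (made up): 100 samples, 2 features, binary labels
X = np.arange(200, dtype=float).reshape(100, 2)
Y = np.arange(100) % 2

(Xtrain, Ytrain), (Xval, Yval), (Xtest, Ytest) = train_val_test_split(X, Y)
print(len(Xtrain), len(Xval), len(Xtest))
```

Hyperparameters would then be compared on `(Xval, Yval)` only, with `(Xtest, Ytest)` kept aside until the final evaluation.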

**K-Fold Cross Validation**
* Split the data into K parts and do K iterations: validate on one part, train on the other K - 1 parts
* Average the K scores, choose the parameters with the highest average score
* Keep the test set out of here, "test" on the validation set
* You can use <b style='color:red'>sklearn.model_selection.KFold</b> (formerly `sklearn.cross_validation.KFold`, which was removed in scikit-learn 0.20)
* In deep learning the data is usually so big that we just split it once into one training set and one validation set.
* In code

```python
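import numpy as np
from sklearn.utils import shuffle

def crossValidation(model, X, Y, K=5):
    # shuffle so that every fold is a random sample of the data
    X, Y = shuffle(X, Y)
    size = len(Y) // K  # integer division: the fold size is used as a slice index
    scores = []
    for k in range(K):
        # train on everything except fold k
        xtrain = np.concatenate([ X[:k*size, :], X[(k*size + size):, :] ])
        ytrain = np.concatenate([ Y[:k*size], Y[(k*size + size):] ])
        # validate on fold k
        xtest = X[k*size:(k*size + size), :]
        ytest = Y[k*size:(k*size + size)]

        model.fit(xtrain, ytrain)
        score = model.score(xtest, ytest)
        scores.append(score)
    return np.mean(scores), np.std(scores)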
```
* You can run a statistical test (e.g., a t-test) to compare the scores of two settings; SciPy has functions for this
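
As a rough illustration of such a comparison, the sketch below runs a paired t-test with `scipy.stats.ttest_rel` on per-fold scores. The accuracy values are made up, and the 0.05 threshold is only the usual convention. Scores from overlapping training folds are not fully independent, so the p-value should be read as approximate.

```python
import numpy as np
from scipy import stats

# hypothetical per-fold validation accuracies for two settings,
# measured on the same K = 5 folds (made-up numbers)
scores_4_units = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
scores_5_units = np.array([0.84, 0.83, 0.85, 0.82, 0.86])

# paired test: both settings were scored on the same folds,
# so the scores are compared fold by fold
t_stat, p_value = stats.ttest_rel(scores_5_units, scores_4_units)
print("t = %.3f, p = %.4f" % (t_stat, p_value))

# 0.05 is a conventional significance level, not a rule from these notes
if p_value < 0.05:
    print("5 hidden units is significantly better at the 5% level")
else:
    print("no significant difference at the 5% level")
```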


## Grid Search

* <b style='color:red'>Exhaustive / try every combination</b>
* May be very slow
    * Each combination can be evaluated independently
    * Good opportunity for parallelization
    * Hadoop / Spark


```python
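params = {"learning_rate": [0.1, 0.001, 0.0001, 0.00001],
          "momentums": [1, 0.1, 0.01, 0.001],
          "regularizations": [1, 0.1, 0.01]}

def grid_search(cross_validation, data, params):
    # cross_validation(lr, mu, reg, data) is assumed to return a score (higher is better)
    best_score, best_setting = -float("inf"), None
    for lr in params["learning_rate"]:
        for mu in params["momentums"]:
            for reg in params["regularizations"]:
                score = cross_validation(lr, mu, reg, data)
                # remember the best combination seen so far
                if score > best_score:
                    best_score, best_setting = score, (lr, mu, reg)
    return best_score, best_setting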
```
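
Because each combination is scored independently, the grid can be flattened and farmed out to workers. The sketch below uses only the standard library. The `evaluate` function is a stand-in with a made-up score, and the parameter values are illustrative. Real training code that is CPU-bound in Python would more likely use `ProcessPoolExecutor`, or a cluster framework such as the ones named above.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

params = {"learning_rate": [0.1, 0.001],
          "momentum": [0.9, 0.99],
          "regularization": [0.1, 0.01]}

def evaluate(setting):
    """Stand-in for cross-validation: returns a made-up score for one setting."""
    lr, mu, reg = setting
    return -abs(lr - 0.001) - abs(mu - 0.99) - abs(reg - 0.01)

# every combination of the three lists (2 * 2 * 2 = 8 settings)
settings = list(itertools.product(params["learning_rate"],
                                  params["momentum"],
                                  params["regularization"]))

# no setting depends on another, so they can be scored concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, settings))

# pair each score with its setting and keep the highest score
best_score, best_setting = max(zip(scores, settings))
print("best setting:", best_setting)
```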


<img src="images/model_complexity.png" alt="Drawing" style="width:60%;height:60%"/>

## Random Search
* Instead of looking at every possibility, move in random directions and keep a move whenever the score improves
    * Fine to coarse strategy
    * Coarse to fine strategy
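* In code

```python
import numpy as np

def random_search(cross_validation, data, theta, max_iterations=30, radius=1.0):
    # theta: starting position in hyperparameter space (1-D NumPy array)
    # radius: step length, i.e. the radius of the hypersphere around theta
    best_score = cross_validation(theta, data)
    for i in range(max_iterations):
        # pick a random direction and step onto the hypersphere around theta
        direction = np.random.randn(len(theta))
        next_theta = theta + radius * direction / np.linalg.norm(direction)
        score = cross_validation(next_theta, data)
        if score > best_score:  # keep the move only if the score improved
            theta, best_score = next_theta, score
    return theta, best_score
```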



## Sample Logarithmically
* The difference between learning rates 0.001 and 0.0011 is not that significant
* We would rather try numbers on a log scale
    * e.g., $10^{-2}, 10^{-3}, 10^{-4}$, etc. for learning rate
        * Sample the exponent uniformly from (-4, -2) (or whatever limits you want)
        * $R \sim U(-4, -2)$
        * learning rate = $10^{R}$
    * e.g., 0.9, 0.99, 0.999, etc. for decay
        * $R \sim U(-3, -1)$
        * decay rate = $1 - 10^{R}$
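
A minimal NumPy sketch of this sampling scheme follows. The seed and the sample count of 5 are arbitrary choices for illustration. `uniform` takes its bounds as (low, high), so the exponent range is written from the smaller value to the larger one.

```python
import numpy as np

rng = np.random.default_rng(0)

# learning rate: draw the exponent R uniformly from [-4, -2),
# so 10**R is spread evenly on a log scale between 1e-4 and 1e-2
R = rng.uniform(-4, -2, size=5)
learning_rates = 10.0 ** R

# decay rate: draw R uniformly from [-3, -1),
# so 1 - 10**R falls between 0.9 and 0.999
R = rng.uniform(-3, -1, size=5)
decay_rates = 1 - 10.0 ** R

print(learning_rates)
print(decay_rates)
```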

## Grid/Random Search in Code

### Get Data 

In [1]:
def get_spiral():
    # Idea: radius -> low...high
    #           (don't start at 0, otherwise points will be "mushed" at origin)
    #       angle = low...high proportional to radius
    #               [0, 2pi/6, 4pi/6, ..., 10pi/6] --> [pi/2, pi/3 + pi/2, ..., ]
    # x = rcos(theta), y = rsin(theta) as usual

    radius = np.linspace(1, 10, 100)
    thetas = np.empty((6, 100))
    for i in range(6):
        start_angle = np.pi*i / 3.0
        end_angle = start_angle + np.pi / 2
        points = np.linspace(start_angle, end_angle, 100)
        thetas[i] = points

    # convert into cartesian coordinates
    x1 = np.empty((6, 100))
    x2 = np.empty((6, 100))
    for i in range(6):
        x1[i] = radius * np.cos(thetas[i])
        x2[i] = radius * np.sin(thetas[i])

    # inputs
    X = np.empty((600, 2))
    X[:,0] = x1.flatten()
    X[:,1] = x2.flatten()

    # add noise
    X += np.random.randn(600, 2)*0.5

    # targets
    Y = np.array([0]*100 + [1]*100 + [0]*100 + [1]*100 + [0]*100 + [1]*100)
    return X, Y

In [2]:
from theano_ann import ANN
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt


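Note: this import fails with `ImportError: No module named 'theano_ann'` when no `theano_ann` module providing the `ANN` class is on the Python path.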

### Grid Search

In [None]:
X, Y = get_spiral()
X, Y = shuffle(X, Y)
Ntrain = int(0.7 * len(X))

Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
Xtest, Ytest = X[Ntrain:], Y[Ntrain:]

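# candidate values to search over; left empty here, fill these in before running
# (with empty lists the loops below never execute and the best_* values stay None)
hidden_layer_sizes = []
learning_rate = []
l2_penalties = []

best_validation_rate = 0
best_hls = None
best_lr = None
best_l2 = None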
for hls in hidden_layer_sizes:
    for lr in learning_rate:
        for l2 in l2_penalties:
            model = ANN(hls)
            model.fit(Xtrain, Ytrain, learning_rate=lr, reg=l2, mu=0.99, epochs=3000, show_fig=False)
            validation_accuracy = model.score(Xtest, Ytest)
            if validation_accuracy > best_validation_rate:
                best_validation_rate = validation_accuracy
                best_hls = hls
                best_lr = lr
                best_l2 = l2
                
print("Best validation_accuracy:", best_validation_rate)
print("Best settings:")
print("hidden_layer_sizes:", best_hls)
print("learning_rate:", best_lr)
print("l2:", best_l2)


### Random Search

In [None]:
X, Y = get_spiral()
X, Y = shuffle(X, Y)
Ntrain = int(0.7 * len(X))

Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
Xtest, Ytest = X[Ntrain:], Y[Ntrain:]

# number of hidden units per layer
M = 20
# number of hidden layers
nHidden = 2
log_lr = -4
log_l2 = -2
max_tries = 30

best_validation_rate = 0
best_M = None
best_nHidden = None
best_log_lr = None
best_log_l2 = None

for itr in range(max_tries):
    # one entry of M units per hidden layer
    model = ANN([M]*nHidden)
    # use a float base: 10**log_lr raises an error once log_lr is a negative NumPy integer
    model.fit(Xtrain, Ytrain, learning_rate=10.0**log_lr, reg=10.0**log_l2, mu=0.99, epochs=3000, show_fig=False)
    validation_accuracy = model.score(Xtest, Ytest)
    if validation_accuracy > best_validation_rate:
        best_validation_rate = validation_accuracy
        best_M = M
        best_nHidden = nHidden
        best_log_lr = log_lr
        best_log_l2 = log_l2

    # get random number -1, 0 or 1
    # Note that each hyperparameter draws its random number independently
    # of the others, since the search for each hyperparameter is independent
    nHidden = best_nHidden + np.random.randint(-1,2)
    nHidden = max(1, nHidden)
    M = best_M + np.random.randint(-1,2)*10
    M = max(10, M)
    log_lr = best_log_lr + np.random.randint(-1,2)
    log_l2 = best_log_l2 + np.random.randint(-1,2)

print("Best validation_accuracy:", best_validation_rate)
print("Best settings:")
print("hidden_M:", best_M)
print("nHidden:", best_nHidden)
print("learning_rate:", 10.0**best_log_lr)
print("l2:", 10.0**best_log_l2)
