<a href="https://colab.research.google.com/github/skhabiri/DS-Unit-4-Sprint-2-Neural-Networks/blob/main/module3-Tune/LS_DS17_423_Tune_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Unit 4, Sprint 2, Module 3*

---

# Neural Networks & GPUs (Prepare)
*aka Hyperparameter Tuning*

*aka Big Servers for Big Problems*

## Learning Objectives
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
* <a href="#p3">Part 3</a>: Search the hyperparameter space using RandomSearch (Optional)

# Hyperparameter Options (Learn)
<a id="p1"></a>

## Overview

Hyperparameter tuning is much more important with neural networks than it has been with any other models that we have considered up to this point. Other supervised learning models might have a couple of hyperparameters, but neural networks can have dozens. These can substantially affect the accuracy of our models, and although tuning them can be time consuming, it is a necessary step when working with neural networks.

Hyperparameter tuning comes with a challenge. How can we compare models specified with different hyperparameters if our model's final error metric can vary somewhat erratically? How do we avoid just getting unlucky and selecting the wrong hyperparameter? This is a problem that to a certain degree we just have to live with as we test and test again. However, we can minimize it somewhat by pairing our experiments with Cross Validation to reduce the variance of our final accuracy values.

### Load MNIST Dataset

In [2]:
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train.shape, X_test.shape

((60000, 28, 28), (10000, 28, 28))

### Normalizing Input Data

Normalizing or scaling your input data before feeding it to a neural network isn't strictly necessary: the network can learn appropriate weights for any data as long as it is numerically represented. It is still recommended, because it can help **make training faster** and **reduce the chances that gradient descent gets stuck in a local optimum**.

<https://stackoverflow.com/questions/4674623/why-do-we-have-to-normalize-the-input-for-an-artificial-neural-network>

In [3]:
import numpy as np
maximum = np.concatenate([X_train, X_test]).max()
maximum

255

In [4]:
X_train = X_train / maximum
X_test = X_test / maximum

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

### Hyperparameter Tuning Approaches:

#### 1) Babysitting AKA "Grad Student Descent".

If you fiddled with any hyperparameters previously, this is basically what you did. This approach is 100% manual and is pretty common among researchers where finding that 1 exact specification that jumps your model to a level of accuracy never seen before is the difference between publishing and not publishing a paper. Of course the professors don't do this themselves, that's grunt work. This is also known as the fiddle with hyperparameters until you run out of time method.

#### 2) Grid Search

Grid Search is the Grad Student galaxy brain realization of: why don't I just specify all the experiments I want to run and let the computer try every possible combination of them while I go and grab lunch. This has a specific downside: if I specify 5 hyperparameters with 5 options each, then I've just created 5^5 combinations of hyperparameters to check, which means that I have to train 3125 different versions of my model. If I then use 5-fold Cross Validation on that, my model has to run 15,625 times. This is the brute-force method of hyperparameter tuning, but it can be very profitable if done wisely. 

> *When using Grid Search here's what I suggest: don't use it to test combinations of different hyperparameters, only use it to test different specifications of **a single** hyperparameter. It's rare that combinations between different hyperparameters lead to big performance gains. You'll get 90-95% of the way there if you just Grid Search one parameter and take the best result, then retain that best result while you test another, and then retain the best specification from that while you train another. This at least makes the situation much more manageable and leads to pretty good results.*
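
To make the cost difference concrete, here is a small sketch that counts model fits for a full grid versus the one-hyperparameter-at-a-time approach suggested above. The search space is hypothetical (the names and values are illustrative, not taken from this notebook); only the counts matter.

```python
from itertools import product

# Hypothetical search space: 5 hyperparameters with 5 options each.
space = {
    "batch_size": [32, 64, 128, 256, 512],
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "units": [16, 32, 64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4],
    "momentum": [0.0, 0.5, 0.8, 0.9, 0.99],
}
cv_folds = 5

# Full grid: every combination of every option.
full_grid = list(product(*space.values()))
full_grid_fits = len(full_grid) * cv_folds

# One at a time: sweep a single hyperparameter, keep the winner, move on.
one_at_a_time = sum(len(options) for options in space.values())
one_at_a_time_fits = one_at_a_time * cv_folds

print(len(full_grid), full_grid_fits)      # 3125 15625
print(one_at_a_time, one_at_a_time_fits)   # 25 125
```

The sequential sweep trades exhaustiveness for a search that is over a hundred times cheaper in this example, at the risk of missing interactions between hyperparameters.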

#### 3) Random Search

Do Grid Search for a couple of hours and you'll say to yourself - "There's got to be a better way." Enter Random Search. For Random search you specify a hyperparameter space and it picks specifications from that randomly, tries them out, gives you the best results and says - That's going to have to be good enough, go home and spend time with your family. 

> *Grid Search treats every parameter as if it were equally important, but this just isn't the case: some are known to move the needle a lot more than others (we'll talk about that in a minute). Random Search tries a new value of every hyperparameter on every trial, so it explores many more distinct values of the important hyperparameters instead of spending trials on repeated combinations of unimportant ones. The downside of Random Search is that it isn't guaranteed to find the absolute best hyperparameters, but it is much less costly to perform than Grid Search.*
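
As a rough illustration of the idea, the sketch below draws random configurations from a hypothetical search space using only the standard library. The parameter names, ranges, and the log-uniform choice for the learning rate are assumptions made for the example; in practice each sampled configuration would be passed to a training function and scored.

```python
import random

rng = random.Random(42)  # fixed seed so the draws are repeatable

def sample_config(rng):
    """Draw one configuration from a hypothetical search space."""
    return {
        # Log-uniform between 1e-4 and 1e-1, so each order of
        # magnitude is equally likely to be sampled.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "batch_size": rng.choice([32, 64, 128, 256, 512]),
        "units": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

n_trials = 10
trials = [sample_config(rng) for _ in range(n_trials)]

for config in trials:
    print(config)
```

Ten trials here give ten different learning rates, whereas a grid with the same budget would revisit the same few values repeatedly.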

#### 4) Bayesian Methods

One thing that can make more manual methods like babysitting and gridsearch effective is that as the experimenter sees results he can then make updates to his future searches taking into account the results of past specifications. If only we could hyperparameter tune our hyperparameter tuning. Well, we kind of can. Enter Bayesian Optimization. Neural Networks are like an optimization problem within an optimization problem, and **Bayesian Optimization is a search strategy that tries to take into account the results of past searches in order to improve future ones.** Check out the new library `keras-tuner` for easy implementations of Bayesian methods. 


## What Hyperparameters are there to test?

- batch_size
- training epochs
- optimization algorithms
- learning rate
- momentum
- activation functions
- dropout regularization
- number of neurons in the hidden layer
- number of the layers

There are more, but these are the most important.

## Follow Along

## Batch Size

Batch size determines how many observations the model is shown before it calculates loss/error and updates the model weights via gradient descent. You're looking for a sweet spot here where you're showing it enough observations that you have enough information to update the weights, but not such a large batch size that you don't get a lot of weight update iterations performed in a given epoch. **Feed-forward Neural Networks aren't as sensitive to batch_size as other networks**, but it is still an important hyperparameter to tune. Smaller batch sizes will also take longer to train per epoch, because each epoch involves more weight updates. 
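
A quick sketch of that trade-off: the number of weight updates in one epoch is the number of training samples divided by the batch size, rounded up. The batch sizes below are just example values.

```python
import math

n_train = 60_000  # size of the MNIST training set loaded above

# Number of gradient-descent updates performed in a single epoch.
updates = {bs: math.ceil(n_train / bs) for bs in [32, 64, 512]}

for batch_size, n_updates in updates.items():
    print(f"batch_size={batch_size:>3}: {n_updates} weight updates per epoch")
```

Going from a batch size of 32 to 512 cuts the updates per epoch from 1875 to 118, which is why large batches finish an epoch faster but adjust the weights far less often.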

Traditionally, batch size is set in powers of 2 starting at 32 up to 512. Keras defaults to a batch size of 32 if you do not specify it. Yann LeCun famously tweeted: 

> Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends dont let friends use minibatches larger than 32.

Check out this paper for more background on his tweet: https://arxiv.org/abs/1804.07612. It reports that increasing the minibatch size narrows the range of learning rates that provide stable convergence.

Check out this SO question on why batch size is typically set in powers of two: https://datascience.stackexchange.com/questions/20179/what-is-the-advantage-of-keeping-batch-size-a-power-of-2



### HP Tuning with GridSearchCV through Keras sklearn wrapper

To use grid search with cross validation, we use the scikit-learn wrapper for Keras. GridSearchCV handles the parameter grid and the cross-validation folds, and the KerasClassifier trains the NN for each parameter set for the specified number of epochs.

Assume $P_{j}$ is one parameter set in the grid search space and $X_{i}$ is one of the cross-validation folds. For each pair ($P_{j}$, $X_{i}$), KerasClassifier trains the NN on the remaining folds, and the accuracy of the model after the **last epoch**, measured on the held-out fold $X_{i}$, is recorded as $S(P_{j},X_{i})$. The average score over all the folds for parameter set $j$ is: 

$$S(P_{j}) = \frac{1}{5}\sum_{i=1}^{5} S(P_{j},X_{i})$$
where 
$X_{i}$ is each fold, $P_{j}$ is a given parameter set, $S(P_{j},X_{i})$ is the held-out score for each fold of a given parameter set, taken from the model after the last Keras training epoch, and $S(P_{j})$ is the mean score over all the folds for parameter set $j$. Because `refit=True`, the parameter set $P_{j_{max}}$ where $S(P_{j_{max}}) = \max_{j} S(P_{j})$ is retrained with KerasClassifier on the entire training dataset, and that training log is printed out.
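
The selection rule can be checked by hand. The sketch below averages the per-fold accuracies that GridSearchCV reports in `cv_results_` further down in this notebook (values copied from that output, rounded to six decimals) and picks the parameter set with the highest mean.

```python
from statistics import mean

# Per-fold accuracies S(P_j, X_i), keyed by batch_size,
# copied from the cv_results_ output shown in this notebook.
fold_scores = {
    32:  [0.965417, 0.960417, 0.960083, 0.958167, 0.964417],
    64:  [0.963667, 0.959917, 0.962583, 0.959667, 0.965750],
    512: [0.958167, 0.950583, 0.950167, 0.952833, 0.958500],
}

# S(P_j): average the fold scores for each parameter set.
mean_scores = {bs: mean(scores) for bs, scores in fold_scores.items()}

# P_jmax: the parameter set with the highest mean score.
best_batch_size = max(mean_scores, key=mean_scores.get)

for batch_size, score in mean_scores.items():
    print(f"batch_size={batch_size}: mean accuracy {score:.4f}")
print("best batch_size:", best_batch_size)
```

The result should agree with `best_params_` and `best_score_` as reported by the fitted grid.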

In [18]:
# WARNING - may take a few minutes before any output is visible

import numpy
import pandas as pd
from sklearn.model_selection import GridSearchCV
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
# relu is fast for computation. 0 or y=x, compared to sigmoid
def create_model(units=32):
    # create model
    model = Sequential()
    # units are the number of hidden neurons
    model.add(Dense(units, input_dim=784, activation='relu'))
    # 10 output labels
    model.add(Dense(10, activation='softmax'))
    # Compile model
    #categorical_crossentropy requires 1hotencoding at the output
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# create model that can accept the fit method
model = KerasClassifier(build_fn=create_model, verbose=1)

# define the grid search parameters. These are model parameter names not variable names
param_grid = {'batch_size': [32,64,512],
              'epochs': [20],
              'units': [32],
              # TODO add more params
              }

# Create Grid Search
# Having default cv=None is the same as cv=5 fold
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=None, refit=True)
"""
In scikit-learn, estimator.fit(X, y) returns the object (self) aka fitted estimator. 
This pattern is useful to be able to implement quick one liners in an IPython session such as:
y_predicted = SVC(C=100).fit(X_train, y_train).predict(X_test)
"""

# grid and grid_fitted refer to the same object in memory
grid_fitted = grid.fit(X_train, y_train)
print("type:\n", type(grid_fitted))
print("ids:\n", id(grid), id(grid_fitted))

# Report Best Estimator
print(f"Best: {grid_fitted.best_score_} using {grid_fitted.best_params_}")

# Cross validation parameters
means = grid_fitted.cv_results_['mean_test_score']
stds = grid_fitted.cv_results_['std_test_score']
params = grid_fitted.cv_results_['params']
print("type(means):\n", type(means))

for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
type:
 <class 'sklearn.model_selection._search.GridSearchCV'>
ids:
 140442006738760 140442006738760
Best: 0.9623166680335998 using {'batch_size': 64, 'epochs': 20, 'units': 32}
type(means):
 <class 'numpy.ndarray'>
Means: 0.9616999983787536, Stdev: 0.002754600635031296 with: {'batch_size': 32, 'epochs': 20, 'units': 32}
Means: 0.9623166680335998, Stdev: 0.002300602075149159 with: {'batch_size': 64, 'epochs': 20, 'units': 32}
Means: 0.9540500044822693, Stdev: 0.003614638496342309 with: {'batch_size': 512, 'epochs': 20, 'units': 32}


In [25]:
[m for m in dir(grid) if not m.startswith("_")]

['best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'iid',
 'inverse_transform',
 'multimetric_',
 'n_jobs',
 'n_splits_',
 'param_grid',
 'pre_dispatch',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'refit',
 'refit_time_',
 'return_train_score',
 'score',
 'scorer_',
 'scoring',
 'set_params',
 'transform',
 'verbose']

In [21]:
# get_params() returns the constructor params not the fitted ones
grid.get_params()

{'cv': None,
 'error_score': nan,
 'estimator__verbose': 1,
 'estimator__build_fn': <function __main__.create_model(units=32)>,
 'estimator': <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier at 0x7fbb33ec56a0>,
 'iid': 'deprecated',
 'n_jobs': -1,
 'param_grid': {'batch_size': [32, 64, 512], 'epochs': [20], 'units': [32]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}

In [23]:
# Returns the best params after the gridsearch
grid.best_params_

{'batch_size': 64, 'epochs': 20, 'units': 32}

In [29]:
grid.cv_results_

{'mean_fit_time': array([83.80310297, 44.38833485,  8.75606074]),
 'std_fit_time': array([0.40393744, 3.92215416, 1.470197  ]),
 'mean_score_time': array([2.21824265, 1.38392358, 0.46798816]),
 'std_score_time': array([0.17727028, 0.42345645, 0.20383882]),
 'param_batch_size': masked_array(data=[32, 64, 512],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'param_epochs': masked_array(data=[20, 20, 20],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'param_units': masked_array(data=[32, 32, 32],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'batch_size': 32, 'epochs': 20, 'units': 32},
  {'batch_size': 64, 'epochs': 20, 'units': 32},
  {'batch_size': 512, 'epochs': 20, 'units': 32}],
 'split0_test_score': array([0.96541667, 0.96366668, 0.95816666]),
 'split1_test_score': array([0.96041667, 0.95991665, 0.95058334]),
 'split2_test_sco

In [36]:
[str(d) for d in grid.cv_results_["params"]]

["{'batch_size': 32, 'epochs': 20, 'units': 32}",
 "{'batch_size': 64, 'epochs': 20, 'units': 32}",
 "{'batch_size': 512, 'epochs': 20, 'units': 32}"]

In [38]:
# for each of 5 cv folds and each parameter set there is a score
df_score = pd.DataFrame(data=[grid.cv_results_[f"split{cv}_test_score"] for cv in range(5)], 
                        columns=[str(d) for d in grid.cv_results_["params"]])
df_score

| fold | {'batch_size': 32, 'epochs': 20, 'units': 32} | {'batch_size': 64, 'epochs': 20, 'units': 32} | {'batch_size': 512, 'epochs': 20, 'units': 32} |
|---|---|---|---|
| 0 | 0.965417 | 0.963667 | 0.958167 |
| 1 | 0.960417 | 0.959917 | 0.950583 |
| 2 | 0.960083 | 0.962583 | 0.950167 |
| 3 | 0.958167 | 0.959667 | 0.952833 |
| 4 | 0.964417 | 0.96575 | 0.9585 |


In [41]:
# Mean score for all cv's for each param set
df_score.mean(axis=0)

{'batch_size': 32, 'epochs': 20, 'units': 32}     0.961700
{'batch_size': 64, 'epochs': 20, 'units': 32}     0.962317
{'batch_size': 512, 'epochs': 20, 'units': 32}    0.954050
dtype: float64

In [24]:
# This is the best score based on average of all folds for the best param set
grid.best_score_

0.9623166680335998

**best_estimator_.model** is the final trained NN with the entire training dataset and contains the history and weight info.

In [63]:
best_NN = grid.best_estimator_.model

In [64]:
# Training accuracy after the last epoch of the refit on the entire training set: 0.9847999811172485
best_NN.history.history

{'loss': [0.41822561621665955,
  0.22465011477470398,
  0.1852334439754486,
  0.1596328169107437,
  0.14032332599163055,
  0.12511606514453888,
  0.11301782727241516,
  0.10305821150541306,
  0.09549396485090256,
  0.08755308389663696,
  0.08226276189088821,
  0.07650288194417953,
  0.0717812180519104,
  0.0676991194486618,
  0.06402521580457687,
  0.060607343912124634,
  0.05703025683760643,
  0.05422088876366615,
  0.05153549090027809,
  0.04924921691417694],
 'accuracy': [0.883650004863739,
  0.935616672039032,
  0.9474833607673645,
  0.9543499946594238,
  0.9591500163078308,
  0.9631166458129883,
  0.9669666886329651,
  0.9690499901771545,
  0.9715333580970764,
  0.9738166928291321,
  0.975683331489563,
  0.9772999882698059,
  0.9788833260536194,
  0.9797999858856201,
  0.9807833433151245,
  0.9820833206176758,
  0.9831166863441467,
  0.9840166568756104,
  0.9848166704177856,
  0.9847999811172485]}

## Optimizer

Remember that there are several different [optimizers](https://keras.io/optimizers/). At some point, take some time to read up on them a little bit. "adam" usually gives the best results. The thing to know about choosing an optimizer is that different optimizers have different hyperparameters like learning rate, momentum, etc. So based on the optimizer you choose, you might also have to tune its learning rate and momentum afterwards. 

## Learning Rate

Remember that the Learning Rate is a hyperparameter that is specific to your gradient-descent based optimizer selection. A learning rate that is too high will cause divergent behavior, while a learning rate that is too low will fail to converge in the epochs you give it. Again, you're looking for the sweet spot. I would start out tuning learning rates over a coarse range that spans several orders of magnitude: [.001, .01, .1, .2, .3, .5] etc. I wouldn't go above .5, but you can try it and see what the behavior is like. 

Once you have narrowed it down, make the window even smaller and try it again. If after running the above specification your model reports that .1 is the best learning rate, then you should probably try things like [.05, .08, .1, .12, .15] to try and narrow it down. 

It can also be good to tune the number of epochs in combination with the learning rate, since the number of epochs determines whether training at a given learning rate has run long enough to converge to the minimum. 

In [70]:
learning_rates = [.001, .01, .1, .2, .3, .5]
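
To see the three regimes without training a network, here is a toy sketch: plain gradient descent on the one-dimensional function f(x) = x², whose minimum is at 0. The function, the step count, and the three learning rates are illustrative assumptions, not values tuned for MNIST.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x                  # derivative of x**2
        x = x - learning_rate * grad  # standard gradient-descent update
    return x

for lr in [0.001, 0.1, 1.1]:
    final = gradient_descent(lr)
    print(f"lr={lr}: final |x| = {abs(final):.3g}")
```

With the smallest rate, x has barely moved after 50 steps; with the middle rate it is very close to 0; with the largest rate each step overshoots further than the last and the value blows up.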

## Momentum

Momentum is a hyperparameter that is more commonly associated with Stochastic Gradient Descent. SGD is a common optimizer because it's what people know and understand, but I doubt it will get you the best results. You can try tuning its hyperparameters and see if you can beat the performance from adam. Momentum controls how much of the previous update carries over into the next one, and therefore how willing the optimizer is to overshoot the minimum. Imagine a ball rolling down one side of a bowl and then up the opposite side a little bit before settling back to the bottom. The purpose of momentum is to help the optimizer roll through shallow local minima instead of getting stuck in them.
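
The ball-in-a-bowl picture can be reproduced with a few lines. This is a minimal sketch of the classical momentum update on the toy function f(x) = x²; the learning rate, momentum value, and step count are arbitrary choices for illustration.

```python
def sgd_path(learning_rate=0.1, momentum=0.0, steps=10, start=1.0):
    """Return the positions visited while minimizing f(x) = x**2."""
    x, velocity = start, 0.0
    path = [x]
    for _ in range(steps):
        grad = 2 * x
        # Classical momentum: keep a fraction of the previous step.
        velocity = momentum * velocity - learning_rate * grad
        x = x + velocity
        path.append(x)
    return path

plain = sgd_path(momentum=0.0)
heavy = sgd_path(momentum=0.9)

print("overshoots the minimum without momentum:", any(x < 0 for x in plain))
print("overshoots the minimum with momentum:   ", any(x < 0 for x in heavy))
```

Without momentum the position shrinks toward 0 from one side only; with momentum 0.9 it rolls past 0 and up the other side before coming back.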

## Activation Functions

We've talked about this a little bit. Typically you'll want to use ReLU for hidden layers, and Sigmoid or Softmax for the output layer of binary and multi-class classification models respectively, but try other activation functions in the hidden layers, such as sigmoid or tanh, and see if you can get any better results. There are a lot of activation functions that we haven't really talked about. Maybe you'll get good results with them. Maybe you won't. :) <https://keras.io/activations/>
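
For reference, here is a plain-Python sketch of the three activations named above. These are the standard textbook definitions written out by hand, not the Keras implementations.

```python
import math

def relu(x):
    """Pass positive values through unchanged; clamp negatives to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into (0, 1); suits a single binary output."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(softmax([2.0, 1.0, 0.1]))
```

Softmax keeps the ordering of the scores, so the largest logit always gets the largest probability.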

## Network Weight Initialization

You saw how big of an effect the way that we initialize our network's weights can have on our results. There are **a lot** of what are called initialization modes. I don't understand all of them, but they can have a big effect on your model's initial accuracy. Your model will get further with fewer epochs if you initialize it with weights that are well suited to the problem you're trying to solve.

`init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']`
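
Most of these modes differ only in how the scale of the random weights depends on the layer's size. As a small sketch, the snippet below computes the scale for two of them using the formulas given in the Keras initializer documentation (`glorot_uniform` draws from a uniform range of ±sqrt(6 / (fan_in + fan_out)); `he_normal` uses a standard deviation of sqrt(2 / fan_in)). The layer sizes are assumed to match the first Dense layer of the MNIST model above.

```python
import math

fan_in, fan_out = 784, 32  # inputs and units of the first Dense layer

# Glorot (Xavier) uniform: weights drawn from U(-limit, +limit).
glorot_limit = math.sqrt(6.0 / (fan_in + fan_out))

# He normal: weights drawn from a truncated normal with this stddev.
he_stddev = math.sqrt(2.0 / fan_in)

print(f"glorot_uniform limit: {glorot_limit:.4f}")
print(f"he_normal stddev:     {he_stddev:.4f}")
```

Both scales shrink as the layer gets wider, which keeps the size of the activations roughly stable from one layer to the next.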

## Dropout Regularization and the Weight Constraint

The Dropout Regularization value is the fraction of neurons (for example 0.2 for 20%) that you want to be randomly deactivated during training. The weight constraint is a second regularization parameter that works in tandem with dropout regularization. You should tune these two values at the same time. 

Using dropout on visible vs hidden layers might have a different effect. Dropout on one kind of layer might not have any effect while dropout on the other might have a substantial effect, so it is worth trying both. You don't necessarily need to use dropout unless you see that your model has overfitting and generalizability problems.
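
Here is a minimal standard-library sketch of what a dropout layer does to its inputs during training. It uses the "inverted dropout" convention, in which surviving values are scaled up by 1 / (1 - rate) so the expected total stays the same; the rate of 0.2 and the all-ones input are assumptions chosen to make the effect easy to read.

```python
import random

def dropout(activations, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed for repeatability
activations = [1.0] * 10_000
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"mean activation: {mean_activation:.3f}")
```

Roughly a fifth of the values are zeroed, yet the mean stays close to 1, so no rescaling is needed when dropout is switched off at prediction time.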

## Neurons in Hidden Layer 

Remember that when we only had a single perceptron our model was only able to fit to linearly separable data, but as we have added layers and nodes to those layers our network has become a powerhouse of fitting nonlinearity in data. The larger the network and the more nodes generally the stronger the network's capacity to fit nonlinear patterns in data. The more nodes and layers the longer it will take to train a network, and higher the probability of overfitting. The larger your network gets the more you'll need dropout regularization or other regularization techniques to keep it in check. 
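
One way to quantify network size is to count trainable parameters. The helper below is a hypothetical utility (not part of Keras) that counts weights and biases for a stack of Dense layers; the wider and deeper variants are made-up comparisons against the 784-32-10 model used earlier.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the input size followed by the units in each Dense layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(dense_param_count([784, 32, 10]))          # 25450, the model used above
print(dense_param_count([784, 512, 10]))         # 407050, a much wider hidden layer
print(dense_param_count([784, 32, 32, 32, 10]))  # 27562, two extra hidden layers
```

For this input size, widening the first hidden layer multiplies the parameter count many times over, while stacking additional narrow layers adds comparatively few parameters.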

**Typically depth (more layers) is more important than width (more nodes) for neural networks.** This is part of why Deep Learning is so highly touted. Certain deep learning architectures have truly been huge breakthroughs for certain machine learning tasks. 

You might borrow ideas from other network architectures. For example, if I were doing image recognition and wasn't taking cues from state-of-the-art architectures like ResNet, AlexNet, GoogLeNet, etc., then I'd probably have to do a lot more experimentation on my own before finding something that works.

There are some heuristics, but I am highly skeptical of them. I think you're better off experimenting on your own and forming your own intuition for these kinds of problems. 

- https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

## Challenge
Other parameter searchers (including `RandomizedSearchCV`) - https://scikit-learn.org/stable/modules/classes.html#hyper-parameter-optimizers

# Experiment Tracking Framework (Learn)
<a id="p2"></a>

## Overview

You will notice quickly that managing the results of all the experiments you are running becomes challenging. Which set of parameters did the best? Are my results today different than my results yesterday? Although we use IPython Notebooks to work, the format is not well suited to logging experimental results. Enter **experiment tracking frameworks like [Comet.ml](https://comet.ml), [Weights and Biases](https://wandb.ai/), and TensorBoard's Hyperparameter Dashboard.** 

Those tools will help you track your experiments, store the results, and the code associated with those experiments. Experimental results can also be readily visualized to see changes in performance across any metric you care about. Data is sent to the tool as each epoch is completed, so you can also see if your model is converging. Let's check out TensorBoard today. 

### Hyperparameter Tuning with the HParams Dashboard in TensorBoard

In [4]:
# Load an ipython extension
%load_ext tensorboard

In [5]:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

import os
import datetime

### 1. Create Experiment Configuration
We are going to experiment with: 
* Number of units in the first dense layer
* Learning Rate
* Optimizer

In [77]:
# Define hyperparameters and score metrics
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16,32]))
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001,.01))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
HP_ADDLAYER = hp.HParam('hplayer', hp.Discrete([False, True]))
HP_DROPOUT = hp.HParam('hpdropout', hp.RealInterval(0.1, 0.2))
HP_EPOCH = hp.HParam('hpepoch', hp.Discrete([5]))
HP_BATCH = hp.HParam('hpbatch', hp.Discrete([128]))

METRIC_ACCURACY = 'hpaccuracy'
METRIC_LOSS = 'hploss'
METRIC_MSE = 'hpmse'

with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
  hp.hparams_config(
      hparams=[HP_NUM_UNITS, HP_LEARNING_RATE, HP_OPTIMIZER, HP_EPOCH, HP_BATCH, HP_ADDLAYER, HP_DROPOUT],
      metrics=[hp.Metric(METRIC_ACCURACY, display_name='HPaccuracy'), 
               hp.Metric(METRIC_LOSS, display_name='HPloss'),
               hp.Metric(METRIC_MSE, display_name='HPmse')
                ]
  )

In [78]:
dir(hp)

['Discrete',
 'Domain',
 'HParam',
 'IntInterval',
 'KerasCallback',
 'Metric',
 'RealInterval',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'hparams',
 'hparams_config',
 'hparams_config_pb',
 'hparams_pb']

### 2. Adapt Model Function with HParams

In [80]:
def train_test_model(hparams):
    """
    hparams: a dictionary whose keys are hp.HParam objects and whose
    values are the single value chosen for each hyperparameter in this trial
    """
  
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation='relu'),
        tf.keras.layers.Dropout(hparams[HP_DROPOUT])
    ])
    
    if hparams[HP_ADDLAYER] == True:
        model.add(tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation='relu'))

    model.add(tf.keras.layers.Dense(10, activation='softmax'))      
  

    # The optimizer needs the learning rate
    opt_name = hparams[HP_OPTIMIZER]
    lr = hparams[HP_LEARNING_RATE]

    if opt_name == 'adam':
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
    elif opt_name == 'sgd':
        opt = tf.keras.optimizers.SGD(learning_rate=lr)
    else:
        raise ValueError(f'Unexpected optimizer: {opt_name}')

    # Compile defines optimizer, loss function and metric
    model.compile(
        optimizer=opt,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy', 'mse']
      )

    model.fit(X_train, y_train, epochs=hparams[HP_EPOCH], batch_size=hparams[HP_BATCH])
    
    print("Metrics:", model.metrics_names)
    [loss, accuracy, mse] = model.evaluate(X_test, y_test)

    # Python convention: if a variable doesn't need a name, give it _
    # ten_ones = [1 for _ in range(10)]

    return loss, accuracy, mse

HParam works with `tf.summary` files, which TensorBoard can read. We use HParam to record each parameter set in the file.

* For each run, log an hparams summary with the hyperparameters and final accuracy. 

In [81]:
def run(run_dir, hparams):
    """trains and evaluate the NN for the hparams and store the HParams into a summary file 
    for the given log directory, along with the loss, accuracy and MSE as scalar summaries.
    TensorBoard reads the summary file for visualization and record keeping
    """
    with tf.summary.create_file_writer(run_dir).as_default():
        # hp.hparams() is a function in the hp module
        # This appears to be the only place where HParam adds value over a plain variable
        hp.hparams(hparams)  # record the parameter values used in this trial
        
        [loss, accuracy, mse] = train_test_model(hparams)
        # The summary tag used for TensorBoard will be METRIC_ACCURACY='hpaccuracy', prefixed by any active name scopes.
        tf.summary.scalar(name=METRIC_ACCURACY, data=accuracy, step=1)        
        tf.summary.scalar(name=METRIC_LOSS, data=loss, step=1)
        tf.summary.scalar(name=METRIC_MSE, data=mse, step=1)


In [82]:
# For debugging
for num_units in HP_NUM_UNITS.domain.values:
    for learning_rate in (HP_LEARNING_RATE.domain.min_value, HP_LEARNING_RATE.domain.max_value):
        for optimizer in HP_OPTIMIZER.domain.values:
            for batch in HP_BATCH.domain.values:
                for epoch in HP_EPOCH.domain.values:
                    for layer in HP_ADDLAYER.domain.values:
                        for dropout_rate in (HP_DROPOUT.domain.min_value, HP_DROPOUT.domain.max_value):
                            hparams = {
                                        HP_NUM_UNITS: num_units,
                                        HP_LEARNING_RATE: learning_rate,
                                        HP_OPTIMIZER: optimizer,
                                        HP_BATCH: batch,
                                        HP_EPOCH: epoch,
                                        HP_ADDLAYER: layer,
                                        HP_DROPOUT: dropout_rate
                                        }
                            print(hparams)
                            print("*"*50)
                            break
                        break
                    break   
                break
            break
        break
    break

{HParam(name='num_units', domain=Discrete([16, 32]), display_name=None, description=None): 16, HParam(name='learning_rate', domain=RealInterval(0.001, 0.01), display_name=None, description=None): 0.001, HParam(name='optimizer', domain=Discrete(['adam', 'sgd']), display_name=None, description=None): 'adam', HParam(name='hpbatch', domain=Discrete([128]), display_name=None, description=None): 128, HParam(name='hpepoch', domain=Discrete([5]), display_name=None, description=None): 5, HParam(name='hplayer', domain=Discrete([False, True]), display_name=None, description=None): False, HParam(name='hpdropout', domain=RealInterval(0.1, 0.2), display_name=None, description=None): 0.1}
**************************************************


In [83]:
HP_NUM_UNITS.domain.values

[16, 32]

Let's create a series of HParam values and save the trained results in separate log file directories

In [84]:
session_num = 0

# Basically a grid search
for num_units in HP_NUM_UNITS.domain.values:
  for learning_rate in (HP_LEARNING_RATE.domain.min_value,
                        HP_LEARNING_RATE.domain.max_value):
    for optimizer in HP_OPTIMIZER.domain.values:
        for batch in HP_BATCH.domain.values:
            for epoch in HP_EPOCH.domain.values:
                for layer in HP_ADDLAYER.domain.values:
                    for dropout_rate in (HP_DROPOUT.domain.min_value, HP_DROPOUT.domain.max_value):
                        # Build one parameter set as a dict keyed by hp.HParam objects
                          hparams = {
                              HP_NUM_UNITS: num_units,
                              HP_LEARNING_RATE: learning_rate,
                              HP_OPTIMIZER: optimizer,
                              HP_BATCH: batch,
                              HP_EPOCH: epoch,
                              HP_ADDLAYER: layer,
                              HP_DROPOUT: dropout_rate
                          }

                          run_name = f'run-{session_num}'
                          print(f'--- Starting trial: {run_name}')
                          # type(param): <class 'tensorboard.plugins.hparams.summary_v2.HParam'>
                          # param: <HParam 'num_units': {16, 32}>
                          #  param.name: num_units
                          print({param.name: hparams[param] for param in hparams})
                          run('logs/hparam_tuning/' + run_name, hparams)
                          session_num += 1

--- Starting trial: run-0
{'num_units': 16, 'learning_rate': 0.001, 'optimizer': 'adam', 'hpbatch': 128, 'hpepoch': 5, 'hplayer': False, 'hpdropout': 0.1}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Metrics: ['loss', 'accuracy', 'mse']
--- Starting trial: run-1
{'num_units': 16, 'learning_rate': 0.001, 'optimizer': 'adam', 'hpbatch': 128, 'hpepoch': 5, 'hplayer': False, 'hpdropout': 0.2}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Metrics: ['loss', 'accuracy', 'mse']
--- Starting trial: run-2
{'num_units': 16, 'learning_rate': 0.001, 'optimizer': 'adam', 'hpbatch': 128, 'hpepoch': 5, 'hplayer': True, 'hpdropout': 0.1}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Metrics: ['loss', 'accuracy', 'mse']
--- Starting trial: run-3
{'num_units': 16, 'learning_rate': 0.001, 'optimizer': 'adam', 'hpbatch': 128, 'hpepoch': 5, 'hplayer': True, 'hpdropout': 0.2}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Metrics: ['loss', 'accuracy', 'mse']
--- Starting trial: run-4
{'num_un
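
The seven nested loops above enumerate a Cartesian product, which `itertools.product` can express in one line. The sketch below uses plain string keys as stand-ins for the `HParam` objects so that it runs without TensorBoard; in the real loop each dictionary would be keyed by the `HP_*` objects and passed to `run()`.

```python
from itertools import product

# Plain-string stand-ins for the HParam objects defined earlier.
search_space = {
    "num_units": [16, 32],
    "learning_rate": [0.001, 0.01],  # interval endpoints only
    "optimizer": ["adam", "sgd"],
    "hpbatch": [128],
    "hpepoch": [5],
    "hplayer": [False, True],
    "hpdropout": [0.1, 0.2],         # interval endpoints only
}

names = list(search_space)
trials = [dict(zip(names, values))
          for values in product(*search_space.values())]

print(len(trials))  # 32
for session_num, hparams in enumerate(trials[:3]):
    print(f"run-{session_num}", hparams)
```

`product` varies the last key fastest, so the trials come out in the same order as the nested loops: run-0 and run-1 differ only in dropout, and run-2 flips the extra-layer flag.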

### 4. Visualize the Results with tensorboard

In [86]:
!ls "logs/hparam_tuning/run-10"

events.out.tfevents.1614029207.MacBook-Pro.local.30662.2989901.v2


In [87]:
%tensorboard  --logdir "logs/hparam_tuning"

Reusing TensorBoard on port 6008 (pid 30029), started 15:55:14 ago. (Use '!kill 30029' to kill it.)

# Hyperparameter Tuning with RandomSearch in keras-tuner (Learn)

## Overview

Basically, `GridSearchCV` takes forever because it fits a model for every combination in the grid. You'll want to adopt a slightly more sophisticated strategy.

Let's also take a look at an alternative with Keras-Tuner.
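
To make the cost difference concrete, the sketch below draws a fixed number of random combinations from a search space instead of enumerating all of them. The `space` dictionary and the `sample_config` helper are illustrative assumptions written in plain Python; they are not part of Keras-Tuner.

```python
import random

# Hypothetical search space; values are illustrative only.
space = {
    "units": list(range(32, 513, 32)),          # 16 options
    "activation": ["relu", "tanh", "sigmoid"],  # 3 options
    "dropout": [0.0, 0.05, 0.1],                # 3 options
}

def sample_config(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in space.items()}

rng = random.Random(0)  # fixed seed so the draw is repeatable
max_trials = 5
trials = [sample_config(space, rng) for _ in range(max_trials)]

# Size of the exhaustive grid: the product of the option counts.
grid_size = 1
for options in space.values():
    grid_size *= len(options)

print(f"grid search would fit {grid_size} models; random search fits {max_trials}")
```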

In [88]:
# !pip install keras-tuner

When `RandomSearch()` runs, it passes an instance of the `HyperParameters` class to the function given as its `hypermodel` argument (here, `build_model`). That instance holds both the search space and the current value of each hyperparameter. Hyperparameters can be defined inline with the model-building code that uses them.
```
import kerastuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    for i in range(hp.Int('layers', 3, 10)):
        model.add(tf.keras.layers.Dense(
            units=hp.Int('units_' + str(i), 50, 100, step=10),
            activation=hp.Choice('act_' + str(i), ['relu', 'tanh'])))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
    return model

hp = kt.HyperParameters()
model = build_model(hp)
assert 'layers' in hp
assert 'units_0' in hp

```
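
The "define inline" behavior can be illustrated without the library. The toy class below is a stand-in written for this example, not Keras-Tuner's implementation: `ToyHyperParameters`, its toy defaults, and `build_layers` are all assumptions. It only shows the general mechanism, where calling `hp.Int(...)` or `hp.Choice(...)` both registers the hyperparameter and returns a value to build with.

```python
class ToyHyperParameters:
    """Toy stand-in: records the search space and the current values."""

    def __init__(self):
        self.space = {}    # name -> description of the allowed range
        self.values = {}   # name -> value used for the current build

    def Int(self, name, min_value, max_value, step=1):
        # First call registers the hyperparameter; later calls reuse its value.
        if name not in self.values:
            self.space[name] = ("int", min_value, max_value, step)
            self.values[name] = min_value  # toy default: the lower bound
        return self.values[name]

    def Choice(self, name, values):
        if name not in self.values:
            self.space[name] = ("choice", tuple(values))
            self.values[name] = values[0]  # toy default: the first option
        return self.values[name]

    def __contains__(self, name):
        return name in self.space


def build_layers(hp):
    """Mimics a model-building function, but returns plain (units, activation) tuples."""
    return [
        (hp.Int(f"units_{i}", 50, 100, step=10), hp.Choice(f"act_{i}", ["relu", "tanh"]))
        for i in range(hp.Int("layers", 3, 10))
    ]


hp = ToyHyperParameters()
layers_spec = build_layers(hp)
print(layers_spec)
print(sorted(hp.space))
```

Because the search space is discovered by running the build function once, hyperparameters that the function never reaches (for example `units_3` when only three layers are built) are not registered.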

In [24]:
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch
import kerastuner as kt  # kt.Hyperband is used later in this notebook


"""
This model tunes:
- Activation function shared by the dense layers
- Number of neurons in each hidden layer
- Dropout rate
- Learning rate in Adam
"""

# hp1 is an arbitrary parameter name; the tuner passes in an instance of kerastuner.HyperParameters()
def build_model(hp1):
    
    hp_act = hp1.Choice('dense_activation', values=['relu', 'tanh', 'sigmoid'], default='relu')
    
    model = keras.Sequential()
    model.add(layers.Dense(units=hp1.Int('units',min_value=32,max_value=512,step=32, default=32), 
                            activation=hp_act, input_dim=784))
    
    for i in range(1,4,1):
        hp_units = hp1.Int('units_' + str(i), min_value=8, max_value=64, step=8)
        # hp_units is reassigned on each pass, but each layer already added to the model keeps its own value.
        model.add(layers.Dense(units=hp_units, activation=hp_act))

    
    model.add(layers.Dropout(hp1.Float('dropout',min_value=0.0,max_value=0.1,default=0.005,step=0.01)))   
    model.add(layers.Dense(10, activation='softmax'))
    
    model.compile(optimizer=keras.optimizers.Adam(hp1.Float(
        'learning_rate', min_value=1e-4, max_value=1e-2, sampling='LOG', default=1e-3)),
        loss='sparse_categorical_crossentropy',metrics=['accuracy'])
    
    return model
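
The `sampling='LOG'` option on the learning rate asks for values to be drawn evenly on a logarithmic scale, so each decade between 1e-4 and 1e-2 is equally likely. A minimal sketch of that idea in plain Python follows; the `sample_log_uniform` helper is an illustration of log-uniform sampling, not Keras-Tuner's internal code.

```python
import math
import random

def sample_log_uniform(min_value, max_value, rng):
    """Draw a value whose logarithm is uniform between the two bounds."""
    log_low, log_high = math.log(min_value), math.log(max_value)
    return math.exp(rng.uniform(log_low, log_high))

rng = random.Random(0)  # fixed seed so the draw is repeatable
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(10_000)]

# Roughly half the draws land in each decade: [1e-4, 1e-3) and [1e-3, 1e-2].
below = sum(s < 1e-3 for s in samples) / len(samples)
print(f"fraction below 1e-3: {below:.2f}")
```

With linear sampling over the same range, about 90% of the draws would fall above 1e-3, so small learning rates would rarely be tried.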


In [19]:
tuner = RandomSearch(
    hypermodel=build_model,
    objective='val_accuracy',
    max_trials=5,
    hyperparameters=None,
    executions_per_trial=3,
    directory='./keras-tuner-trial',
    project_name='randomsearch')

`max_trials`: the number of hyperparameter combinations the tuner will test.

`executions_per_trial`: the number of models built and fit for each trial. Every execution uses the same hyperparameter values but trains from a fresh weight initialization, and the results are averaged to reduce the variance of the trial's score. This is separate from the number of epochs, which controls how long each individual model trains.
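
Fitting the same combination several times and averaging the scores is one way to smooth out run-to-run noise before comparing trials. The sketch below simulates that with made-up numbers: `noisy_validation_accuracy`, the noise level, and the "true" accuracies are all hypothetical stand-ins for real training runs.

```python
import random
import statistics

def noisy_validation_accuracy(true_accuracy, rng):
    """Stand-in for one build-and-fit: the true score plus run-to-run noise."""
    return true_accuracy + rng.gauss(0, 0.01)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
max_trials = 5
executions_per_trial = 3

trial_scores = []
for trial in range(max_trials):
    true_accuracy = 0.95 + 0.005 * trial  # hypothetical quality of this combination
    runs = [noisy_validation_accuracy(true_accuracy, rng)
            for _ in range(executions_per_trial)]
    trial_scores.append(statistics.mean(runs))  # one averaged score per trial

total_fits = max_trials * executions_per_trial
print(f"{total_fits} models fit in total")
```

The trade-off is cost: with these settings the tuner fits 15 models to compare 5 combinations.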


In [20]:
tuner.search_space_summary()

Search space summary
Default search space size: 7
dense_activation (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'tanh', 'sigmoid'], 'ordered': False}
units (Int)
{'default': 32, 'conditions': [], 'min_value': 32, 'max_value': 512, 'step': 32, 'sampling': None}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
units_2 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
units_3 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
dropout (Float)
{'default': 0.005, 'conditions': [], 'min_value': 0.0, 'max_value': 0.1, 'step': 0.01, 'sampling': None}
learning_rate (Float)
{'default': 0.001, 'conditions': [], 'min_value': 0.0001, 'max_value': 0.01, 'step': None, 'sampling': 'log'}


In [21]:
tuner.search(X_train, y_train,
             epochs=5,
             validation_data=(X_test, y_test))

Trial 5 Complete [00h 00m 42s]
val_accuracy: 0.9708333412806193

Best val_accuracy So Far: 0.9711333314577738
Total elapsed time: 00h 03m 39s
INFO:tensorflow:Oracle triggered exit


In [22]:
tuner.results_summary()

Results summary
Results in ./keras-tuner-trial/randomsearch
Showing 10 best trials
Objective(name='val_accuracy', direction='max')
Trial summary
Hyperparameters:
dense_activation: sigmoid
units: 352
units_1: 24
units_2: 16
units_3: 40
dropout: 0.05
learning_rate: 0.00286700149775965
Score: 0.9711333314577738
Trial summary
Hyperparameters:
dense_activation: tanh
units: 320
units_1: 32
units_2: 40
units_3: 16
dropout: 0.01
learning_rate: 0.001192978696747447
Score: 0.9708333412806193
Trial summary
Hyperparameters:
dense_activation: sigmoid
units: 448
units_1: 40
units_2: 24
units_3: 64
dropout: 0.0
learning_rate: 0.0006233510651221175
Score: 0.9690999984741211
Trial summary
Hyperparameters:
dense_activation: sigmoid
units: 96
units_1: 48
units_2: 48
units_3: 48
dropout: 0.04
learning_rate: 0.0005034672825968825
Score: 0.9571999907493591
Trial summary
Hyperparameters:
dense_activation: sigmoid
units: 352
units_1: 24
units_2: 32
units_3: 48
dropout: 0.02
learning_rate: 0.000340169182687353

In [23]:
best_model = tuner.get_best_models()[0]
# Evaluate the best model.
loss0, accuracy0 = best_model.evaluate(X_test, y_test)
print(f"""best accuracy: {accuracy0}""")
print("best parameters", tuner.get_best_hyperparameters(num_trials=1)[0].values)

best accuracy: 0.9753000140190125
best parameters {'dense_activation': 'sigmoid', 'units': 352, 'units_1': 24, 'units_2': 16, 'units_3': 40, 'dropout': 0.05, 'learning_rate': 0.00286700149775965}


### Hyperband:
Hyperband is an optimized version of random search that uses early stopping to speed up hyperparameter tuning. The main idea is to fit a large number of models for a small number of epochs and to continue training only the models that achieve the highest accuracy on the validation set. The `max_epochs` argument is the maximum number of epochs any one model can be trained for.
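
The "train briefly, keep the best, train longer" loop is known as successive halving, and Hyperband runs several rounds of it with different starting budgets. The sketch below shows a single round schedule in plain Python. It is a simplified illustration, not Keras-Tuner's code: `successive_halving`, `toy_score`, and the sampled configurations are all assumptions, and the score function replaces real training.

```python
import random

def successive_halving(configs, score_fn, min_epochs=1, max_epochs=9, factor=3):
    """Score all configs briefly, keep the top 1/factor, and repeat with more epochs."""
    survivors = list(configs)
    epochs = min_epochs
    history = []  # (epochs, number of configs scored) for each round
    while True:
        scored = sorted(survivors, key=lambda c: score_fn(c, epochs), reverse=True)
        history.append((epochs, len(scored)))
        if epochs >= max_epochs or len(scored) == 1:
            return scored[0], history
        survivors = scored[:max(1, len(scored) // factor)]
        epochs = min(max_epochs, epochs * factor)

def toy_score(config, epochs):
    """Hypothetical score: best near a learning rate of 3e-3, slightly better with more epochs."""
    return -abs(config["learning_rate"] - 3e-3) + 0.001 * epochs

rng = random.Random(0)  # fixed seed so the draw is repeatable
configs = [{"learning_rate": 10 ** rng.uniform(-4, -2)} for _ in range(9)]
best, history = successive_halving(configs, toy_score)
print(history)  # → [(1, 9), (3, 3), (9, 1)]
```

Only one configuration is ever trained for the full budget, which is where the time savings come from. The risk is that a configuration that learns slowly at first is discarded before it can show its final accuracy.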

In [41]:
tuner_hb = kt.Hyperband(build_model,
                     objective = 'val_accuracy', 
                     max_epochs = 8,
                     #factor: Int. Reduction factor for the number of epochs.
                     factor = 3,
                     directory = './kt-hyperband',
                     project_name = 'kt-HB')  

INFO:tensorflow:Reloading Oracle from existing project ./kt-hyperband/kt-HB/oracle.json
INFO:tensorflow:Reloading Tuner from ./kt-hyperband/kt-HB/tuner0.json


In [42]:
tuner_hb.search_space_summary()

Search space summary
Default search space size: 7
dense_activation (Choice)
{'default': 'relu', 'conditions': [], 'values': ['relu', 'tanh', 'sigmoid'], 'ordered': False}
units (Int)
{'default': 32, 'conditions': [], 'min_value': 32, 'max_value': 512, 'step': 32, 'sampling': None}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
units_2 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
units_3 (Int)
{'default': None, 'conditions': [], 'min_value': 8, 'max_value': 64, 'step': 8, 'sampling': None}
dropout (Float)
{'default': 0.005, 'conditions': [], 'min_value': 0.0, 'max_value': 0.1, 'step': 0.01, 'sampling': None}
learning_rate (Float)
{'default': 0.001, 'conditions': [], 'min_value': 0.0001, 'max_value': 0.01, 'step': None, 'sampling': 'log'}


In [28]:
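# Keras `fit` expects `epochs` (not `epoch`); Hyperband sets each trial's epoch budget itself, up to max_epochs.
tuner_hb.search(X_train, y_train, epochs=5, validation_data=(X_test, y_test))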

Trial 11 Complete [00h 00m 10s]
val_accuracy: 0.9541000127792358

Best val_accuracy So Far: 0.972599983215332
Total elapsed time: 00h 01m 35s
INFO:tensorflow:Oracle triggered exit


In [39]:
[item.values for item in tuner_hb.get_best_hyperparameters(num_trials=2)]

[{'dense_activation': 'sigmoid',
  'units': 352,
  'units_1': 24,
  'units_2': 64,
  'units_3': 48,
  'dropout': 0.01,
  'learning_rate': 0.0027887024252856224,
  'tuner/epochs': 5,
  'tuner/initial_epoch': 2,
  'tuner/bracket': 1,
  'tuner/round': 1,
  'tuner/trial_id': '3b6ae24399bfb6e6f7b9c46abb86ac61'},
 {'dense_activation': 'relu',
  'units': 352,
  'units_1': 40,
  'units_2': 16,
  'units_3': 40,
  'dropout': 0.0,
  'learning_rate': 0.0038515020929220166,
  'tuner/epochs': 5,
  'tuner/initial_epoch': 2,
  'tuner/bracket': 1,
  'tuner/round': 1,
  'tuner/trial_id': 'dd6e25ee0d080bf90a30e20797cac982'}]

In [40]:
# Evaluate the best model.
print("best accuracy: ", tuner_hb.get_best_models()[0].evaluate(X_test, y_test)[1])
print("best parameters", tuner_hb.get_best_hyperparameters(num_trials=1)[0].values)

best accuracy:  0.972599983215332
best parameters {'dense_activation': 'sigmoid', 'units': 352, 'units_1': 24, 'units_2': 64, 'units_3': 48, 'dropout': 0.01, 'learning_rate': 0.0027887024252856224, 'tuner/epochs': 5, 'tuner/initial_epoch': 2, 'tuner/bracket': 1, 'tuner/round': 1, 'tuner/trial_id': '3b6ae24399bfb6e6f7b9c46abb86ac61'}


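In this run, Hyperband finished faster (1m 35s versus 3m 39s of logged search time), while the RandomSearch tuner found a slightly better model (test accuracy 0.9753 versus 0.9726). The gap is small and comes from a single run of each tuner, so it should not be read as a general ranking of the two methods.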

# Review
* <a href="#p1">Part 1</a>: Describe the major hyperparameters to tune
    - Activation Functions
    - Optimizer
    - Number of Layers
    - Number of Neurons
    - Batch Size
    - Dropout Regularization
    - Learning Rate
    - Number of Epochs
    - and many more
* <a href="#p2">Part 2</a>: Implement an experiment tracking framework
    - Weights & Biases
    - Comet.ml
    - By Hand / GridSearch
    - TensorBoard
* <a href="#p3">Part 3</a>: Search the hyperparameter space using RandomSearch
    - Keras-Tuner
    - Advanced Techniques

# Sources

## Additional Reading
- https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
- https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/
- https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
- https://machinelearningmastery.com/introduction-to-weight-constraints-to-reduce-generalization-error-in-deep-learning/
- https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/