In [None]:
"""
You need to run this cell for the code in following cells to work.
"""

# Enable module reloading
%load_ext autoreload
%autoreload 2

%load_ext tensorboard

import sys
sys.path.append('..')

from week_4.backstage.load_data import load_data
# from week_5.mlp import train

%tensorboard --logdir logs --bind_all

# Week 5

__Goals for this week__

We will learn about _hyperparameter tuning,_ an important part of each machine learning project.

__Feedback__

This lab is a work in progress. If you notice a mistake, let us know or make a pull request. Please also fill in the [questionnaire](https://forms.gle/r27nBAvnMC7jbjJ58) after you finish this lab to give us feedback.

## Hyperparameter tuning

In previous labs we introduced various quantities that control how the model trains or what the model looks like, e.g. _learning rate,_ _batch size,_ _hidden layer size_ and others. These quantities are called _hyperparameters_. In contrast to regular parameters, they are set before training takes place and they are not modified by gradient descent. The values we choose for the hyperparameters can significantly change the performance of the model.
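
To make the distinction concrete, here is a minimal sketch on a toy problem rather than on the lab's MLP. The loss $(w - 3)^2$, the function name `gradient_descent` and the three learning rates are assumptions chosen for illustration only. The weight `w` is a parameter that training updates, while `learning_rate` is a hyperparameter that stays fixed for the whole run.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize the toy loss (w - 3)^2 and return the final loss."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)        # derivative of the loss w.r.t. the parameter w
        w = w - learning_rate * grad  # w is a parameter: training updates it
    return (w - 3.0) ** 2


# learning_rate is a hyperparameter: chosen before training, never updated by it.
results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, loss in results.items():
    print(f"learning_rate={lr}: final loss {loss:.4g}")
```

In this toy setting the smallest value barely moves away from the starting point, the middle value converges, and the largest one diverges.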

E 5.1 Try manually setting various values and see how the results change
Code similar to that of the previous lab is encapsulated in `mlp.py`. Try running it with different hyperparameters to see how the results change:

Add num layers, activation as hparams

You can check TensorBoard in the first cell or more conveniently at http://localhost:6006.


TensorBoard hparams integration

In [None]:
# data = load_data('iris.csv', num_classes=3)

# # MNIST instead of Iris?
    
# train(data.x, data.y,
#     dim_output=3,  # This is actually not a hyperparameter, it just describes our data.
#     dim_hidden=20,
#     learning_rate=0.1,
#     batch_size=4,
#     loss_function='mean_squared_error',
#     epoch=50)

_Hyperparameter tuning,_ a process of searching for better hyperparameter values, should be done during each machine learning project. We expect you to tune your hyperparameters during your course projects as well. The goal of hyperparameter tuning is to find a set of hyperparameters that optimizes some evaluation metric, e.g. for classification we want to find the hyperparameters that give the model the best accuracy. We do this by searching through the space of all possible values, either via manual tuning or via automatic tuning.

### Manual tuning

Manual tuning is a name for trial-and-error tuning that is done by hand. You essentially look at how the model trained and try to guess which hyperparameter needs to change. Then you train a model with new hyperparameters and see how the change affected the results. You repeat this until you are content with the performance of your model.

This process is of course quite unsystematic and inefficient. Even for professionals it is hard to guess which hyperparameters are problematic. With this technique you also often explore only a small subspace of all possible values, so you might miss the global optimum entirely.

### Random search tuning

Random search is a more systematic approach. You set an interval of possible values for each hyperparameter. Then you randomly sample from these intervals and train the model with what you sampled. You repeat this process until you are content with the performance, or until you run out of computational resources.
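
The loop described above can be sketched as follows. This is only an illustration: `evaluate` is a hypothetical stand-in for training a model and returning its accuracy (here it is a made-up toy score), and the two intervals are arbitrary assumptions. The names `learning_rate` and `dim_hidden` are borrowed from the `train` call earlier in this lab.

```python
import random


def evaluate(config):
    """Hypothetical stand-in for training a model and returning its score."""
    # Made-up toy score that peaks at learning_rate = 0.01 and dim_hidden = 64.
    return (1.0
            - abs(config["learning_rate"] - 0.01) * 10
            - abs(config["dim_hidden"] - 64) / 1000)


def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    history = []  # best score seen after each trial
    for _ in range(num_trials):
        # Sample one value from the interval of each hyperparameter.
        config = {
            "learning_rate": rng.uniform(0.0001, 0.1),
            "dim_hidden": rng.randint(10, 1000),
        }
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
        history.append(best_score)
    return best_config, best_score, history


best_config, best_score, history = random_search(num_trials=20)
print(best_config, best_score)
```

Because `history` stores the best score seen so far, it never decreases as more trials are run.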

This approach demands more compute than manual tuning, but it can be fully automated. Performance becomes a function of time in this case: the best result found so far can only improve as more configurations are sampled. How you set 
 
### Hyperparameter properties

Hyperparameters can be divided by their type:

- __Integer hyperparameters__, e.g. hidden layer size, number of hidden layers
- __Real number hyperparameters__, e.g. learning rate
- __Categorical hyperparameters__, e.g. activation function

For numeric hyperparameters (both integer and real) we sample from a pre-defined range during random search. E.g. for hidden layer size we might define the minimum value as 10 and the maximum value as 1000. We then pick the value from within this range. There are two basic ways of sampling a number from within this range:

- __Linear__, when we simply pick a random value from within this range using a uniform distribution.
- __Exponential__, when we define the range via exponents as $\langle 10^1, 10^3\rangle$. Instead of sampling the value from 10 to 1000, we sample the exponent uniformly from the $\langle 1, 3 \rangle$ interval and raise 10 to it. This skews the distribution towards smaller numbers: in this case half of the values fall into the $\langle 10, 100 \rangle$ interval and the other half into the $\langle 100, 1000 \rangle$ interval, even though the second interval is 10 times wider.
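
The two sampling schemes can be compared with a short sketch that uses only Python's standard `random` module. The sample count and the helper name `share_below_100` are assumptions for illustration.

```python
import random

rng = random.Random(0)
n = 10_000

# Linear: sample the value itself uniformly from <10, 1000>.
linear = [rng.uniform(10, 1000) for _ in range(n)]

# Exponential: sample the exponent uniformly from <1, 3>, then raise 10 to it.
exponential = [10 ** rng.uniform(1, 3) for _ in range(n)]


def share_below_100(samples):
    """Fraction of samples that fall below 100."""
    return sum(s < 100 for s in samples) / len(samples)


print("linear:     ", share_below_100(linear))
print("exponential:", share_below_100(exponential))
```

With linear sampling only about 9 % of the values are expected below 100 (90 out of a range of 990), while exponential sampling is expected to put about half of them there.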

Below we list some hyperparameters you have already encountered, together with recommended starting ranges:

__Learning rate__ - real - exponential - $\langle 10^{-4}, 10^{-2} \rangle$.  
Learning rate is the most important hyperparameter and should always be tuned. Setting it too low can stall the training on plateaus or make the training needlessly long. Setting it too high might cause divergence (see Week 2 lab), which can lead to a numeric overflow exception.

__Batch size__ - integer - exponential - $\langle 2^3, 2^6 \rangle$.  
For bigger models, the memory of HW accelerators limits how high we can go with our batch size. It is often set to a power of two, as powers of two should work well with HW accelerators.

__Hidden layer size__ - integer - exponential - $\langle 2^5, 2^8 \rangle$.  
A hidden layer that is either too small or too big will negatively affect the training. Similarly to batch size, this hyperparameter is often set to a power of two.

__Number of layers__ - integer - linear - $\langle 1, 5 \rangle$.  
Compared to the previous hyperparameters, the number of layers is often architecture specific. For the MLP model we have learned about so far, we usually work with a relatively small number of layers. More layers are often used in computer vision convolutional neural networks.

__Activation function__ - categorical - { relu, sigmoid, ... }.  
Activation function can be experimented with, but is usually not so important. ReLU is usually a good starting point. For really small models you can use sigmoid instead.

__Loss function__ - categorical - { cross-entropy + softmax activation, MSE + linear activation, ... }.  
Certain loss functions are the usual choice for certain tasks, e.g. you should use cross-entropy with softmax for classification or MSE for regression.
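
As a rough sketch, the starting ranges listed above could be turned into a sampling function like the one below. The function name `sample_config` and the keys `num_layers` and `activation` are hypothetical; `learning_rate`, `batch_size` and `dim_hidden` mirror the arguments of the `train` call earlier in this lab. Restricting the integer exponential hyperparameters to exact powers of two is an assumption of this sketch, and the loss function is left out because it follows from the task.

```python
import random


def sample_config(rng):
    """Draw one configuration from the starting ranges listed above."""
    return {
        # real, exponential: exponent drawn uniformly between -4 and -2
        "learning_rate": 10 ** rng.uniform(-4, -2),
        # integer, exponential: one of 2^3 .. 2^6
        "batch_size": 2 ** rng.randint(3, 6),
        # integer, exponential: one of 2^5 .. 2^8
        "dim_hidden": 2 ** rng.randint(5, 8),
        # integer, linear: 1 .. 5
        "num_layers": rng.randint(1, 5),
        # categorical: pick one option uniformly
        "activation": rng.choice(["relu", "sigmoid"]),
    }


rng = random.Random(0)
for _ in range(3):
    print(sample_config(rng))
```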

E 5.2 - Set up a random search
- See the results and how you get improvements with longer runs.
- Send me your best result and hparams, and also how many possibilities you managed to explore.

### Tuning in practice

1. You usually start with manual tuning during development, when you just want to quickly see whether the model works and is able to learn. As soon as you have your model ready and you want to properly train and evaluate it, you should switch to random search tuning.

2. Check the hyperparameter values people use in recent (2014 or later), related (same dataset or task) projects. They should serve as a fine starting point.

3. You can gradually change the search intervals. E.g. if you find out that a certain subspace has good results, you can focus on this subspace. Similarly, you can expand the range of a hyperparameter if the best results are achieved with values at the edge of its range. E.g. if with batch sizes 4, 8 and 16 you always get the best results with 16, it makes sense to expand the batch size range to 32 and perhaps 64 as well.
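
The range expansion from step 3 can be sketched as a small helper. The function `expand_if_on_edge` is hypothetical, it assumes the candidate values are consecutive powers of two, and the scores in the usage line are made-up numbers that only illustrate the batch size example above.

```python
def expand_if_on_edge(values, scores):
    """Return a widened list of candidates if the best one sits on an edge.

    values: sorted candidate values, assumed to be consecutive powers of two.
    scores: one score per value (higher is better).
    """
    best_index = max(range(len(values)), key=lambda i: scores[i])
    if best_index == len(values) - 1:
        return values + [values[-1] * 2]   # best at the upper edge: go higher
    if best_index == 0 and values[0] > 1:
        return [values[0] // 2] + values   # best at the lower edge: go lower
    return values                          # best is inside the range: keep it


# Made-up scores: batch size 16 is the best, so 32 is added to the range.
print(expand_if_on_edge([4, 8, 16], [0.81, 0.85, 0.90]))  # [4, 8, 16, 32]
```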

## Further reading

- Alessio Gozzoli has a nice [blog](https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/) about various hyperparameter tuning techniques and their practical aspects.
- Andrew Ng discusses hyperparameter tuning in his Coursera Deep Learning course in three consecutive videos (21 minutes together): [1](https://youtu.be/AXDByU3D1hA), [2](https://youtu.be/cSoK_6Rkbfg), [3](https://youtu.be/wKkcBPp3F1Y).


In [3]:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

fashion_mnist = tf.keras.datasets.fashion_mnist

(x_train, y_train),(x_test, y_test) = fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))

METRIC_ACCURACY = 'accuracy'

def run(run_dir, hparams):
  with tf.summary.create_file_writer(run_dir).as_default():
    hp.hparams(hparams)  # record the values used in this trial
    accuracy = train_test_model(hparams)
    tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)


def train_test_model(hparams):
  model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=tf.nn.relu),
    tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),
  ])
  model.compile(
      optimizer=hparams[HP_OPTIMIZER],
      loss='sparse_categorical_crossentropy',
      metrics=['accuracy'],
  )

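  # Run with 1 epoch to speed things up for demo purposes. No hp.KerasCallback
  # is passed here: run() already records the hparams of each trial in its own
  # run directory, and a callback writing to the shared 'logs/hparam_tuning'
  # directory would mix the hparams of all trials into one session.
  model.fit(x_train, y_train, epochs=1)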
  _, accuracy = model.evaluate(x_test, y_test)
  return accuracy

session_num = 0

for num_units in HP_NUM_UNITS.domain.values:
  for dropout_rate in (HP_DROPOUT.domain.min_value, HP_DROPOUT.domain.max_value):
    for optimizer in HP_OPTIMIZER.domain.values:
      hparams = {
          HP_NUM_UNITS: num_units,
          HP_DROPOUT: dropout_rate,
          HP_OPTIMIZER: optimizer,
      }
      run_name = "run-%d" % session_num
      print('--- Starting trial: %s' % run_name)
      print({h.name: hparams[h] for h in hparams})
      run('logs/hparam_tuning/' + run_name, hparams)
      session_num += 1


--- Starting trial: run-0
{'num_units': 16, 'dropout': 0.1, 'optimizer': 'adam'}
Train on 60000 samples


--- Starting trial: run-1
{'num_units': 16, 'dropout': 0.1, 'optimizer': 'sgd'}
Train on 60000 samples


--- Starting trial: run-2
{'num_units': 16, 'dropout': 0.2, 'optimizer': 'adam'}
Train on 60000 samples


--- Starting trial: run-3
{'num_units': 16, 'dropout': 0.2, 'optimizer': 'sgd'}
Train on 60000 samples


--- Starting trial: run-4
{'num_units': 32, 'dropout': 0.1, 'optimizer': 'adam'}
Train on 60000 samples


--- Starting trial: run-5
{'num_units': 32, 'dropout': 0.1, 'optimizer': 'sgd'}
Train on 60000 samples


--- Starting trial: run-6
{'num_units': 32, 'dropout': 0.2, 'optimizer': 'adam'}
Train on 60000 samples


--- Starting trial: run-7
{'num_units': 32, 'dropout': 0.2, 'optimizer': 'sgd'}
Train on 60000 samples


In [None]:
print(dir(hp))