# FINE-TUNING NEURAL NETWORK HYPERPARAMETERS

* Flexibility is also one of the main drawbacks of neural networks: there are many hyperparameters to tweak
* Looking for the best hyperparameter combination:
  * try many combinations manually and see which works best on the validation set (or use K-fold cross-validation)
  * or use `GridSearchCV` or `RandomizedSearchCV` to explore the hyperparameter space:   
    this requires wrapping the Keras model in objects that mimic regular Scikit-Learn regressors
  
  
# Simple method

#### STEP 1
Create a function that will build and compile a Keras model, given a set of hyperparameters:

In [1]:
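from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    """Build and compile a simple regression MLP for the given hyperparameters."""
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    # stack n_hidden identical ReLU layers
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation='relu'))
    # single output neuron, no activation (regression)
    model.add(keras.layers.Dense(1))

    # 'learning_rate' replaces the deprecated 'lr' argument
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)

    return model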

#### STEP 2
Create a `KerasRegressor` based on this `build_model()` function:

In [None]:
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
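# no hyperparameters are passed at creation, so the defaults defined in build_model() are used
# note: keras.wrappers.scikit_learn has been removed from recent TensorFlow releases;
# the separate SciKeras package provides a similar KerasRegressor wrapper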

#### STEP 3
Train this Scikit-Learn Regressor using its `fit()` method,  
then evaluate it using its `score()` method,  
then predict using its `predict()` method  
  
* Notes:  
  * Any extra param passed to `fit()` is passed to the underlying Keras model
  * The score will be the opposite of the MSE because Scikit-Learn wants scores, not losses (i.e. higher should be better)

In [None]:
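keras_reg.fit(X_train, y_train,
              epochs=100,
              validation_data=(X_valid, y_valid),  # used here for early stopping
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])

# score() returns the negative MSE (higher is better), despite the variable name
mse_test = keras_reg.score(X_test, y_test)
y_pred = keras_reg.predict(X_new)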
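
The sign convention in the notes above can be checked by hand. The sketch below uses plain Python rather than Keras. `mean_squared_error` and `sklearn_style_score` are hypothetical helper names, and the numbers are made-up illustration values.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def sklearn_style_score(y_true, y_pred):
    """Negated MSE, so that a higher score means a better model."""
    return -mean_squared_error(y_true, y_pred)

y_true = [3.0, 5.0, 2.0]
good_pred = [3.0, 5.0, 2.5]   # off by 0.5 on one instance
bad_pred = [1.0, 5.0, 4.0]    # off by 2.0 on two instances

good_score = sklearn_style_score(y_true, good_pred)
bad_score = sklearn_style_score(y_true, bad_pred)
print(good_score, bad_score)
```

Both scores are negative, and the better predictions get the score closer to zero, which is why a search that maximizes the score ends up minimizing the MSE.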

#### STEP 4
Train and evaluate hundreds of variants and see which one performs best on the validation set   
There are many hyperparameters, so a randomized search is preferable to a grid search (see chap. 2)

* `RandomizedSearchCV()` uses K-fold cross-validation:
  * so it does not use `X_valid` and `y_valid` for model selection
  * they are only used for early stopping
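
To see why the separate validation set is not needed for model selection, here is a minimal sketch of how K-fold splitting carves held-out folds out of the training set itself. `k_fold_indices` is a hypothetical helper written for illustration (no shuffling), not the splitter Scikit-Learn uses internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, held_out_idx in folds:
    print(len(train_idx), "train /", len(held_out_idx), "held out:", held_out_idx)
```

Each training instance is held out exactly once, so every candidate is scored on data it was not fitted on without touching `X_valid`.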
 

In [None]:
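import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

# keys must match the argument names of build_model()
param_distribs = {
    'n_hidden'     : [0, 1, 2, 3],
    'n_neurons'    : np.arange(1, 100),
    'learning_rate': reciprocal(3e-4, 3e-2)  # log-uniform between 3e-4 and 3e-2
}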

random_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
random_search_cv.fit(X_train, y_train,
                     epochs=100,
                     validation_data=(X_valid, y_valid),
                     callbacks=[keras.callbacks.EarlyStopping(patience=10)])
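
The `reciprocal(3e-4, 3e-2)` distribution used above is log-uniform: every order of magnitude in the range is equally likely. The standard-library sketch below imitates that behaviour to show what the search is sampling from. `sample_log_uniform` is a hypothetical helper, not SciPy's implementation.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [sample_log_uniform(3e-4, 3e-2, rng) for _ in range(10_000)]

geometric_mid = math.sqrt(3e-4 * 3e-2)  # 3e-3, the midpoint on a log scale
share_below = sum(s < geometric_mid for s in samples) / len(samples)
print(f"min={min(samples):.2e} max={max(samples):.2e} share below 3e-3: {share_below:.2f}")
```

About half of the draws land below 3e-3, whereas a plain uniform distribution over the same range would put roughly 90% of them above it. That is why a log-uniform distribution is a common choice for learning rates.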

* When exploration is over, you get access to the best params found, the best score, and the trained Keras model
* And you can save the model, evaluate it on test set, and deploy to production

In [None]:
random_search_cv.best_params_
random_search_cv.best_score_
model = random_search_cv.best_estimator_.model

# Resources for and about hyperparameter optimization

### Libraries

* Hyperopt
* Hyperas, kopt, Talos
* Keras Tuner
* Scikit-Optimize (skopt)
* Spearmint
* Hyperband
* Sklearn-Deap


### Articles

* [DeepMind 2017](https://arxiv.org/abs/1711.09846): jointly optimize a population of models and their hyperparameters
* [Google's AutoML](https://cloud.google.com/automl): **evolutionary approach** to search for hyperparameters + look for the best network architecture for the problem (cloud service)
* [Google's post about evolutionary AutoML](https://ai.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html):  search for best architecture
* [Uber's post about Deep Neuroevolution technique](https://eng.uber.com/deep-neuroevolution/): replacing the Gradient Descent!

# Hyperparameter tuning

SEE: [Leslie Smith 2018 paper](https://arxiv.org/abs/1803.09820)

## Number of hidden layers

* NN with **one hidden layer** can theoretically model even the most complex functions, provided it has **enough neurons**
* But **Deep networks** have much more *parameter efficiency* than shallow ones:  
  * can model complex functions using exponentially fewer neurons than shallow nets  
  * -> can reach much better performance with same amount of training data
  
  
* Real world data often structured in a hierarchical way, deep neural nets take advantage of it:
  * **lower** hidden layers model **low-level structures** (line segments of various shapes and orientation)
  * **intermediate** hidden layers combine these low-level structures to model **intermediate-level structures** (squares, circles, etc.)
  * **highest** hidden layers and **output** layer combine intermediate structures to model **high-level structures** (e.g. faces)  
  
  
* => Hierarchical architecture:
    * allows DNNs to **converge faster** and improves their **ability to generalize** to new datasets
    * allows **transfer learning**:
      * reusing lower layers of a network in a new network with similar task
      * weights and biases of new network can be initialized with weights and biases of first network (instead of random values)
      * => new network doesn't have to learn low-level structures that occur in most pictures   
        only learns the higher-level structures
  
  
* Start with 2 hidden layers and ramp up the number of hidden layers until you start overfitting the training set
* We most commonly use parts of a pretrained state-of-the-art network that performs similar tasks:   
  => faster training, requires much less data

## Number of neurons per hidden layer

* **Input** and **output** layers: number of neurons determined by the type of input and output the task requires   
  Ex: MNIST data, 28 × 28 = 784 input neurons, 10 output neurons  
  
  
* **Hidden** layers:
  * sometimes same number of neurons in all (only one hyperparameter to tune)
  * depending on dataset, it can help to make the **first hidden layer bigger** than others
  
  
* You can increase the number of neurons until network starts overfitting
* But more efficient: **"stretch pants" approach**
  * pick a model with more layers and neurons than you actually need
  * and use early stopping and other regularization techniques to prevent overfitting
  * that way you avoid bottleneck layers that could ruin the model
  * advantage: a layer with too few neurons would not have enough representational power to preserve all the useful information from inputs   
    lots of neurons -> no information lost   
    no matter how big the network, lost information can never be recovered
  
  
=> In general, better to increase the number of layers instead of the number of neurons per layer
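
As a rough illustration of the trade-off between depth and width, the sketch below counts the trainable parameters of two fully connected networks on MNIST-sized inputs, both with 300 hidden neurons in total. `count_dense_params` is a hypothetical helper, and the layer sizes are arbitrary illustration values. Parameter counts show only the size of each model, not its accuracy.

```python
def count_dense_params(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. [784, 300, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

wide_shallow = count_dense_params([784, 300, 10])            # one hidden layer of 300
narrow_deep = count_dense_params([784, 100, 100, 100, 10])   # three hidden layers of 100
print(wide_shallow, narrow_deep)
```

With these sizes the deeper arrangement uses fewer than half the parameters of the wide one, which matches the parameter-efficiency argument made earlier for deep networks.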

## Learning rate

* **Most important hyperparameter**  
<br/>   
    
* **Optimal** learning rate = about half of the maximum learning rate,  
  i.e. learning rate above which the training algorithm diverges (ch. 4)  
<br/>   
    
* **Method:**
  * train the model for a few hundred iterations
  * starting with a very low lr (e.g. $10^{-5}$)
  * gradually increase it to a very large value (e.g. 10)
  * by multiplying the lr by a constant factor at each iteration (e.g. by $\exp(\log(10^6)/500)$  
    to go from $10^{-5}$ to 10 in 500 iterations)
  * plot the loss as a function of the learning rate (log scale for the lr):
    * you should see it dropping first
    * but after a while, lr will be too large, so loss will shoot back up
    * **optimal** lr will be a bit lower than the turning point
  * reinitialize your model and train it normally using the found lr
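
The schedule described above can be sketched in plain Python. The loss values here come from a synthetic U-shaped curve standing in for a real training run, so only the schedule arithmetic is meaningful. `exponential_lr_sweep` and `lr_at_lowest_loss` are hypothetical helper names.

```python
import math

def exponential_lr_sweep(start_lr=1e-5, end_lr=10.0, n_iterations=500):
    """Learning rates obtained by multiplying by a constant factor each iteration."""
    factor = math.exp(math.log(end_lr / start_lr) / n_iterations)
    lrs = [start_lr]
    for _ in range(n_iterations):
        lrs.append(lrs[-1] * factor)
    return factor, lrs

def lr_at_lowest_loss(lrs, losses):
    """Learning rate recorded where the loss was lowest (the turning point)."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best]

factor, lrs = exponential_lr_sweep()

# Synthetic loss curve (illustration only): lowest near lr = 1e-1,
# rising on both sides. A real sweep would record the training loss instead.
losses = [(math.log10(lr) + 1.0) ** 2 + 0.5 for lr in lrs]
turning_point = lr_at_lowest_loss(lrs, losses)
print(f"factor={factor:.4f} last lr={lrs[-1]:.3f} turning point~{turning_point:.3f}")
```

In a real run you would then pick a learning rate somewhat below `turning_point`, reinitialize the model, and train normally.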

## Optimizers

Choosing a better optimizer than plain Mini-batch Gradient Descent (and tuning its hyperparameters) also matters  
More in ch. 11

## Batch size

* Can have **significant impact** on model's **performance** and **training time**  
<br/>   
   
* **Large batch sizes:** 
  * GPUs can process them efficiently (ch. 19), so the training algorithm will see more instances per second
  * can lead to instabilities, especially at the beginning of training, so the model may not generalize as well as one trained with a small batch size
  * **strategy:** try a large batch size with learning rate warmup;  
    if training is unstable or final performance is disappointing, reduce the batch size
  * [training large batches](https://arxiv.org/abs/1705.08741) - [large mini-batch SGD](https://arxiv.org/abs/1706.02677)   
<br/>   
  
* **Small batch sizes:** (2 to 32)
  * Yann LeCun tweet: "Friends don't let friends use mini-batches larger than 32", citing [paper from 2018](https://arxiv.org/abs/1804.07612)
  * the paper concludes that small batches lead to better models in less training time
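
The learning rate warmup mentioned for large batch sizes can be sketched as a small schedule function. The linear ramp below is an assumption (the notes do not specify a shape), and `warmup_lr` with its parameters is a hypothetical helper.

```python
def warmup_lr(step, target_lr, warmup_steps):
    """Linearly ramp the learning rate up to target_lr, then hold it constant."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps

schedule = [warmup_lr(step, target_lr=0.1, warmup_steps=5) for step in range(8)]
print(schedule)
```

Starting with small steps gives the first, noisiest updates less influence. The rate then settles at its target value for the rest of training.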

## Activation function

* **Hidden layers**: `ReLU` good default
* **Output layers**: depends on the task
  * **For regression**:
    * None if the output can take any value (e.g. house price)
    * ReLU or softplus if outputs must be positive
    * logistic or tanh if outputs must fall within a bounded range (scale the labels to 0 to 1 for logistic, -1 to 1 for tanh)
  * **For classification**:
    * binary classification: single output, between 0 and 1 (estimated probability of positive class), logistic activation
    * multilabel binary classification: multiple output neurons, don't necessarily add up to 1, logistic activation
    * multiclass classification: one output neuron per class, softmax activation ensures that all estimated probas are between 0 and 1, and add up to 1 (classes are exclusive)
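
The difference between the logistic and softmax output activations can be checked numerically. This is a minimal standard-library sketch with made-up scores, not the Keras implementation.

```python
import math

def logistic(z):
    """Squash a single score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -1.0]  # made-up output-layer scores for three neurons

multilabel_probs = [logistic(s) for s in scores]  # independent per-label probabilities
multiclass_probs = softmax(scores)                # mutually exclusive classes
print(sum(multilabel_probs), sum(multiclass_probs))
```

The logistic outputs each lie between 0 and 1 but need not sum to 1, which suits multilabel tasks. The softmax outputs always sum to 1, which suits mutually exclusive classes.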

## Number of iterations

* Doesn't need to be tweaked, just use **early stopping** instead
* **Optimal** learning rate depends on the other hyperparameters (especially batch size),   
  so if you modify any one of them, update the learning rate as well
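
The early-stopping rule that replaces tuning the number of iterations can be sketched as a patience counter over per-epoch validation losses. This is a simplified stand-in for a callback such as `EarlyStopping(patience=...)`, not Keras's actual implementation. `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran out of epochs first

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.64, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Training can therefore be given a generous epoch budget. The loop halts once the validation loss has stopped improving, and the best epoch is the one worth keeping.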