# Deep Learning & Artificial Intelligence
## Advanced Tricks and Latest Developments with Deep Learning
### Dr. Jie Tao, Fairfield University

## Hyperparameter Optimization

- As with any ML model, we need to tune the hyperparameters of a deep learning model in order to search for the __optimal results__
- This process is known as hyperparameter tuning
  - Recall that *hyperparameters* refer to the specifications of the model to be trained, while _parameters_ refer to the values learned during the training process
- Specifically for deep learning models, hyperparameters usually include (but are not limited to):
  - number of layers
  - number of neurons or filters in each layer
  - activation function in each layer
  - other design decisions like `Dropout` or `BatchNormalization`

### More Art than Science

- Experienced data scientists build intuition over time about what works best in certain situations
- Also, it's usually a __trial-and-error__ process
  - recall what you did in the machine learning class?
- There are no formal rules and no _silver bullet_ for hyperparameter tuning or model selection
- As a rule, you should __NOT__ rely on _arbitrary_ decisions
  - So you will have to train the model repeatedly to find the __optimal__ hyperparameters
- You have two strategies:
  - Either you will **manually** search all possible combinations of the hyperparameters
  - Or you can search in an __automatic__ and __systematic__ way
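
To get a feel for why an exhaustive manual search does not scale, the sketch below counts the combinations in a small grid. The grid itself is hypothetical; its names and values are chosen only for illustration.

```python
from itertools import product

# Hypothetical search grid; the names and values are illustrative only.
grid = {
    "units": [128, 256, 512, 1024],
    "activation": ["relu", "sigmoid"],
    "optimizer": ["adam", "sgd", "rmsprop"],
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "batch_size": [64, 128],
}

# In an exhaustive search, every combination costs one full training run.
combinations = list(product(*grid.values()))
n_combinations = len(combinations)
print(n_combinations)  # 4 * 2 * 3 * 3 * 2 = 144
```

Even this modest single-layer grid already requires 144 training runs, and each extra hyperparameter multiplies that number.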

### Review the Process of Hyperparameter Optimization

1. Choose a set of hyperparameters (__randomly__ or __heuristically__)
2. Build the corresponding model
3. Train (fit) the model to the _training data_, and evaluate the performance (with __selected metrics__) of the model using the _validation data_
4. Choose the next set of hyperparameters (automatically)
5. Repeat the process until you reach the (pseudo) optimal performance
6. Finally, use the tuned model to __predict__ on the _test data_
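
The loop described in the steps above can be sketched in a few lines of plain Python. This is a minimal illustration using random search. The function `toy_validation_loss` is a made-up stand-in for building, training, and validating a real model, and all names and value ranges are assumptions for the example.

```python
import random

def sample_hyperparameters(rng):
    """Steps 1 and 4: draw one candidate configuration at random."""
    return {
        "units": rng.choice([128, 256, 512, 1024]),
        "learning_rate": rng.choice([1e-3, 1e-2, 1e-1]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def toy_validation_loss(params):
    """Stand-in for steps 2-3 (build, train, validate). NOT a real model:
    a made-up function that is lowest near units=512,
    learning_rate=1e-2, dropout=0.2."""
    return (
        abs(params["units"] - 512) / 1024
        + abs(params["learning_rate"] - 1e-2)
        + abs(params["dropout"] - 0.2)
    )

def random_search(n_trials, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    history = []
    for _ in range(n_trials):  # step 5: repeat
        params = sample_hyperparameters(rng)
        loss = toy_validation_loss(params)
        history.append((loss, params))
    # keep the configuration with the lowest validation loss
    best_loss, best_params = min(history, key=lambda item: item[0])
    return best_loss, best_params, history

best_loss, best_params, history = random_search(n_trials=20)
print(best_loss, best_params)
```

In a real project, the stand-in function would be replaced by actual model training, which is what makes every iteration of the loop expensive.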


### How to Tune Hyperparameters

- Among these steps, step `4` (choosing the next set of hyperparameters based on the results so far) is the key one. There are different techniques available:
  - Bayesian optimization
  - genetic algorithms
  - simple random search
  - ...
- We already know that we update the weights (__parameters__) with the _backpropagation_ algorithm
- In contrast, hyperparameter tuning is extremely hard:
  - Computing the __feedback signal__ is __expensive__, since you have to train a new model from scratch for every set of hyperparameters you try
  - Unlike the parameters, the hyperparameters are usually __discrete__ and therefore non-differentiable, so you have to rely on gradient-free optimization, which is usually far less efficient

### Something Beyond Random Search

- Usually, we have very limited tools to search for the optimal set of hyperparameters:
  - The more sophisticated techniques mentioned above are often too expensive for the computing power available to us
- Thus, in practice it is often just __random search__
  - This method chooses hyperparameters randomly and evaluates the performance repeatedly
  - It is the most __naive__ method
- The [hyperas](https://github.com/maxpumperla/hyperas) package, a wrapper around `hyperopt` for `keras` models, assists us in this difficult task

### A `hyperas` Tutorial

[OP](https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-i-hyper-parameter-8129009f131b)
- As in any ML/DL process, one of the most important tasks is to choose the most appropriate (evaluation) metric and loss function
  - In this tutorial, we classify the fashion-MNIST images, so `accuracy` as the evaluation metric (since the data is *balanced* across classes) and `categorical_crossentropy` as the loss function (since it is a multi-class classification problem) seem appropriate.
- We also need to normalize the data
  - Let's first try the normalization we already know; later we will try something new called `BatchNormalization`
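
To make the difference between the metric and the loss concrete, here is a small plain-Python sketch of both quantities for one-hot labels. It is a simplified illustration, not the `keras` implementation. The helper names and the example probabilities are made up.

```python
import math

def one_hot(label, num_classes):
    """Turn an integer class label into a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """-sum(y_true * log(y_pred)); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def accuracy(true_labels, predicted_probs):
    """Fraction of samples whose most probable class is the true class."""
    correct = sum(
        1
        for label, probs in zip(true_labels, predicted_probs)
        if probs.index(max(probs)) == label
    )
    return correct / len(true_labels)

y_true = one_hot(2, 3)        # the true class is 2 (out of 3 classes)
confident = [0.1, 0.2, 0.7]   # hypothetical softmax outputs
unsure = [0.3, 0.3, 0.4]

loss_confident = categorical_crossentropy(y_true, confident)  # -ln(0.7)
loss_unsure = categorical_crossentropy(y_true, unsure)        # -ln(0.4)
acc = accuracy([2, 2], [confident, unsure])
print(loss_confident, loss_unsure, acc)
```

Both predictions pick the correct class, so the accuracy is the same, but the loss is lower for the more confident prediction. This is why the loss gives a smoother training signal than the metric.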

In [None]:
#### load data
from tensorflow.keras.datasets import fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
print('train shapes', X_train.shape, y_train.shape)
print('test shapes', X_test.shape, y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
train shapes (60000, 28, 28) (60000,)
test shapes (10000, 28, 28) (10000,)


In [None]:
#### preprocessing
#### flatten each 28x28 image into a 784-dimensional vector
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
#### set the data type as float
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
#### scale pixel values from [0, 255] to [0, 1]
X_train /= 255
X_test /= 255

In [None]:
#### one hot encoding of the classes
from tensorflow.keras.utils import to_categorical
nb_classes = 10
y_train = to_categorical(y_train, nb_classes)
y_test = to_categorical(y_test, nb_classes)

In any hyperparameter tuning process, it is very important to build a base model. That is the model you will compare against.

In [None]:
#### base model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
model = Sequential([
    Dense(10, input_shape=(784,), activation='softmax')
])
#### `learning_rate` replaces the deprecated `lr` argument
model.compile(optimizer=SGD(learning_rate=0.1),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                7850      
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
_________________________________________________________________


In [None]:
#### fit/training
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


So we can see the `val_loss` is `0.4427` and `val_accuracy` is `0.8464` for the base model. Not bad but we can improve.

### Need for Hyperparameter Tuning

- Let's review why we need hyperparameter tuning again:
  - To find the right balance between **bias** and **variance**. In other words, you don't want a model that is very accurate in training but much less so in validation/testing.
  - To avoid falling prey to the **vanishing/exploding gradient** problem: tweaking the *learning rate*, *activation function*, and *number of layers* can help with that.
  - To escape saddle points and **local optima**: changing the learning rate and activation function can help with that.
  - To fix a model that **does not converge**: using _adaptive learning rates_ may help with that issue.
  - To avoid **extremely low gradients** with the `sigmoid` and `tanh` functions: these two functions saturate, so they may not work well in very deep networks.
  - To reduce **slow** training time: more complexity does not necessarily mean higher performance. Sometimes we can find a minimal architecture that reaches the _"best"_ performance.

### What can we tune? & Pro Tips

We can normally tune the following hyperparameters (we may not tune all of them at the same time, but this is a fairly complete list):
- Number of Layers: higher --> overfitting/vanishing gradients; low --> low performance; depending on size of training data
- Number of neurons per layer: low --> high bias, low variance (underfitting); high --> low bias, high variance (overfitting); depending on size of training data
- Activation function: ReLU is a good choice for starters
- Optimizers: `Adam` is generally good, `RMSProp` is good for getting over local optima; `Adadelta` is good for sparse data
- Learning rate: dependent on the optimizer (`SGD`: 0.1, `Adam`: 0.001/0.01). You should also consider _learning rate decay_.
- Initialization: not so important; He-normal for `ReLU` and Glorot-normal for `sigmoid` are good choices.
- Batch size: low --> hard to converge; high --> slow training. Try powers of 2; depending on size of training data
- Number of epochs: high --> overfitting, low --> underfitting. Try a higher number but use `EarlyStopping` or `Dropout`.
- Dropout: try different dropout rates in `[0, 1)`.
- L1/L2 regularization: used to control the bias-variance tradeoff. Normally used when `Dropout` is not working well
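
As a reminder of how early stopping interacts with the number of epochs, the sketch below mimics the patience logic on a list of validation losses. It only illustrates the idea and is not the `keras` `EarlyStopping` implementation. The loss values are hypothetical.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both zero-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # never triggered: training ran for all epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch, best_epoch = early_stopping(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

With `patience=2` the run stops at epoch 4 and never sees the later improvement at epoch 6, which shows that the patience value is itself a hyperparameter worth choosing carefully.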

To use `hyperas`, we need to install it (Colab does not have it pre-installed). Note that `hyperas` is specifically designed for `keras`.

In [None]:
!pip install hyperas

Collecting hyperas
  Downloading https://files.pythonhosted.org/packages/04/34/87ad6ffb42df9c1fa9c4c906f65813d42ad70d68c66af4ffff048c228cd4/hyperas-0.4.1-py3-none-any.whl
Installing collected packages: hyperas
Successfully installed hyperas-0.4.1


You need to import the following things to use `hyperas`.

In [None]:
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform

### `hyperas` Helper Functions

When using `hyperas`, since you are training the model repeatedly, you need three helper functions, rather than the loose code we wrote above:
- `data_loader` function: loads the training and validation data. If you need to pre-process the data, do that here and return the pre-processed data;
- `hyperas_model` function: defines the architecture of the model and specifies the hyperparameters we would like to tune;
- `hyperas_opt` function: fits the model defined in the model function to the training data, and evaluates the trained model against the validation data in each epoch. (In the example below, fitting and evaluation are folded into `hyperas_model`, and `optim.minimize` drives the search.)

You can see that we did all three things before, just in loose code.

See example below for the functions on the fashion-MNIST dataset.

In [None]:
from sklearn.model_selection import train_test_split
#### the data_loader function
def data_loader():
    '''
    Load data, scaling, and one-hot encoding.

    This function is separated from hyperas_model() so that hyperopt
    won't reload data for each evaluation run.

    Output:
    ---
    Processed training data (X_train, y_train) and test data (X_test, y_test).
    '''
    ##### load data
    (X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
    #### if you want a fixed val set do below
    # X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=12345)
    #### reshaping
    X_train = X_train.reshape(60000, 784)
    X_test = X_test.reshape(10000, 784)
    #### change data type
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    #### scaling
    X_train /= 255
    X_test /= 255
    #### One-hot encoding of class labels
    nb_classes = 10
    y_train = to_categorical(y_train, nb_classes)
    # y_val = to_categorical(y_val, nb_classes)
    y_test = to_categorical(y_test, nb_classes)
    return X_train, y_train, X_test, y_test

In [None]:
import numpy as np
from tensorflow.keras.layers import Activation, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

#### hyperas_model function
def hyperas_model(X_train, y_train, X_val, y_val):

    '''
    Use the training data to fit, and validation data to evaluate the model.
    Test the model when evaluation is done.

    Create Keras model with double curly brackets dropped-in as needed.
    Return value has to be a valid python dictionary with two customary keys:
        - loss: Specify a numeric evaluation metric to be minimized
        - status: Just use STATUS_OK and see hyperopt documentation if not feasible
    The last one is optional, though recommended, namely:
        - model: specify the model just created so that we can later use it again.
    '''
    #### define the model - very similar to how we define our model before
    model = Sequential()
    #### Add first dense layer
    #### specify different values of number of neurons to test
    model.add(Dense({{choice([128, 256, 512, 1024])}}, input_shape=(784,)))
    #### activation function for the first dense layer
    #### the reason we list it out is we want to test different activation functions
    model.add(Activation({{choice(['relu', 'sigmoid'])}}))
    #### Dropout layer
    #### test different dropout values
    model.add(Dropout({{uniform(0, 1)}}))
    #### second dense layer
    model.add(Dense({{choice([128, 256, 512, 1024])}}))
    model.add(Activation({{choice(['relu', 'sigmoid'])}}))
    #### second dropout layer
    model.add(Dropout({{uniform(0, 1)}}))
    #### a condition to test whether a third layer pair is to be added
    if {{choice(['two', 'three'])}} == 'three':
        model.add(Dense({{choice([128, 256, 512, 1024])}}))
        model.add(Activation({{choice(['relu', 'sigmoid'])}}))
        model.add(Dropout({{uniform(0, 1)}}))
    #### output layer
    #### determined by number of classes and classification problem
    #### no need to test
    model.add(Dense(10))
    model.add(Activation('softmax'))

    #### define optimizers, test different learning rates
    #### (`learning_rate` replaces the deprecated `lr` argument)
    adam = Adam(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})
    rmsprop = RMSprop(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})
    sgd = SGD(learning_rate={{choice([10**-3, 10**-2, 10**-1])}})
    #### choose which optimizer to use
    choiceval = {{choice(['adam', 'sgd', 'rmsprop'])}}
    #### named `optimizer` to avoid shadowing the `optim` module from hyperas
    if choiceval == 'adam':
        optimizer = adam
    elif choiceval == 'rmsprop':
        optimizer = rmsprop
    else:
        optimizer = sgd
    #### compile model
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=optimizer)
    #### fit to training set and evaluate using validation data
    history = model.fit(X_train, y_train,
              batch_size={{choice([64, 128])}},
              epochs=10,
              verbose=2,
              validation_split=0.2)
    #### record the best model performance on the validation set
    validation_acc = np.amax(history.history['val_accuracy'])
    print('Best validation acc of epoch:', validation_acc)
    return {'loss': -validation_acc, 'status': STATUS_OK, 'model': model}
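
The returned dictionary negates the validation accuracy because `hyperopt` minimizes the `loss` entry, so the lowest loss corresponds to the highest accuracy. The plain-Python sketch below illustrates this with hypothetical trial results. The accuracies and the `params` key are made up for the example.

```python
# Hypothetical trial results, shaped like the dictionaries returned above.
trials = [
    {"loss": -0.84, "params": {"units": 128}},
    {"loss": -0.88, "params": {"units": 512}},
    {"loss": -0.86, "params": {"units": 256}},
]

# Minimizing the negated accuracy selects the most accurate trial.
best_trial = min(trials, key=lambda trial: trial["loss"])
best_accuracy = -best_trial["loss"]
print(best_trial["params"], best_accuracy)  # {'units': 512} 0.88
```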


__NOTE__: Below we need to use the path of the notebook, make sure you get the correct path (_right click on the notebook and "copy path"_) and then update the `nb_path` variable.

In [None]:
#### MAKE SURE YOU CHANGE THIS TO YOUR OWN PATH
nb_path = 'drive/MyDrive/Colab Notebooks/L8-AdvancedTopics'

best_run, best_model = optim.minimize(model=hyperas_model,
                                          data=data_loader,
                                          algo=tpe.suggest,
                                          max_evals=5,
                                          trials=Trials(),
                                          notebook_name= nb_path)
X_train, y_train, X_test, y_test = data_loader()
print("Evaluation of best performing model:")
print(best_model.evaluate(X_test, y_test))
print("Best performing model chosen hyper-parameters:")
print(best_run)


>>> Imports:
#coding=utf-8

try:
    from tensorflow.keras.datasets import fashion_mnist
except:
    pass

try:
    from tensorflow.keras.utils import to_categorical
except:
    pass

try:
    from tensorflow.keras.models import Sequential
except:
    pass

try:
    from tensorflow.keras.layers import Dense
except:
    pass

try:
    from tensorflow.keras.optimizers import SGD
except:
    pass

try:
    from hyperopt import Trials, STATUS_OK, tpe
except:
    pass

try:
    from hyperas import optim
except:
    pass

try:
    from hyperas.distributions import choice, uniform
except:
    pass

try:
    from sklearn.model_selection import train_test_split
except:
    pass

try:
    import numpy as np
except:
    pass

try:
    from tensorflow.keras.layers import Activation, Dropout
except:
    pass

try:
    from tensorflow.keras.optimizers import Adam, RMSprop, SGD
except:
    pass

>>> Hyperas search space:

def get_space():
    return {
        'Dense': hp.choice('Dense', [128, 256, 51

ValueError: ignored

## Batch Normalization

- __Normalization__ means making different samples look _more similar_ to each other from the model's point of view
  - This usually makes the model more generalizable (lower variance), and consequently lowers the loss on test data
  - One type we are used to is __scaling__
- The other one in ML is transforming the data so that most of it follows a __Gaussian__ distribution (we do not do that in DL since we do not assume Gaussian data)
  - The z-score transformation is one of the most popular ways of normalization
- In ML, we normalize our data __before__ feeding it to the models
  - This does not work well in DL, since we can only guarantee that the input to the first layer is normalized, but not the outputs of that layer and the layers after it.
- Thus we use `BatchNormalization`, which is a `Layer` provided by `keras`
  - If you want to use a __deeper__ network, you should consider using `BatchNormalization` since it helps with _gradient propagation_
- You can use `BatchNormalization` like below:
```python
conv_model.add(layers.Conv2D(32, 3, activation='relu'))
conv_model.add(layers.BatchNormalization())
dense_model.add(layers.Dense(32, activation='relu'))
dense_model.add(layers.BatchNormalization())
```

__Note__: refer to the textbook for the arguments of `BatchNormalization`.
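
To show what such a layer computes, here is a plain-Python sketch of the training-time forward pass for a single feature: the batch is shifted to zero mean and scaled to unit variance, then rescaled by `gamma` and shifted by `beta`. It is a simplified illustration under stated assumptions. In the real layer `gamma` and `beta` are learned, and moving averages of the mean and variance are used at inference time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a batch (training-time behavior only).

    gamma and beta are fixed here for illustration; eps avoids
    division by zero when the batch variance is tiny.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Hypothetical activations of one neuron for a batch of four samples.
activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
print(normalized)
print(out_mean, out_var)  # mean close to 0, variance close to 1
```

Because this normalization is applied to the output of a layer, every following layer receives inputs on a comparable scale, regardless of how the earlier weights change during training.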

### Callbacks Other than `EarlyStopping`

- We already know that we can use the `EarlyStopping` callback provided by `keras` to avoid __overfitting__ our models.
- Besides `EarlyStopping`, there are other types of callbacks that are useful in different scenarios
- You can refer to the textbook, or the [keras docs](https://keras.io/api/callbacks/) for more details.
