# Advanced Optimization Methods
- So far we have used only SGD.
- More advanced optimization techniques help to:
  - Reduce the amount of time to obtain reasonable classification accuracy
  - Make the network better behaved over a wider range of hyperparameters other than the learning rate.
  - Ideally, obtain higher classification accuracy than what is possible with SGD
- SGD scales the update to every parameter in the network by the same learning rate.
- However, the learning rate is both the most important hyperparameter to tune and a hard, tedious one to set correctly. This has motivated methods that adaptively tune the learning rate as the network trains.

## SGD

$$
W \leftarrow W - lr \cdot dW
$$
- $W$: Weight matrix
- $lr$: learning rate
- $dW$: the gradient of W
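
As a minimal NumPy sketch of this update, assume a toy loss $L(W) = \frac{1}{2}\sum W^2$, whose gradient is simply $W$. The helper name `sgd_step`, the toy loss, and the starting weights are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sgd_step(W, dW, lr=0.01):
    """Vanilla SGD: every parameter moves against its gradient, scaled by the same lr."""
    return W - lr * dW

# Toy loss (assumed for illustration): L(W) = 0.5 * sum(W**2), so dW = W.
W = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    dW = W                      # gradient of the toy loss at the current weights
    W = sgd_step(W, dW, lr=0.1)

print(W)  # every entry has shrunk toward the minimum at 0
```

Note that the same `lr` multiplies every entry of `dW`, which is exactly the limitation the adaptive methods below try to address.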

## Adagrad
- Adagrad adapts the learning rate to the network parameters
- Larger updates are performed on parameters that change infrequently while smaller updates are done on parameters that change frequently

$$
\mathrm{cache} \leftarrow \mathrm{cache} + (dW)^2
$$
$$
W \leftarrow W - lr \cdot \frac{dW}{\sqrt{\mathrm{cache}} + \mathrm{eps}}
$$

- The cache variable maintains the per-parameter sum of squared gradients and is updated at every mini-batch in the training process.
- By examining the cache, we can see which parameters are updated frequently and which ones are updated infrequently.
- Scaling the update by the cache allows us to adaptively update the parameters in our network.
- Weights with frequent or large gradients build up a large cache entry, which scales the update down and effectively lowers the learning rate for that parameter. Conversely, weights with infrequent or small gradients keep a small cache entry, which scales the update up and effectively raises the learning rate for that parameter.
- The primary benefit of Adagrad is that we no longer have to manually tune the learning rate – most implementations of the Adagrad algorithm leave the initial learning rate at 0.01 and allow the adaptive nature of the algorithm to tune the learning rate on a per-parameter basis.
- The weakness of Adagrad is that the cache can only grow: squared gradients are never negative, so the accumulated sum keeps increasing throughout training. Dividing a small number (the gradient) by a very large number (the square root of the cache) results in an update that is vanishingly small, too small for the network to learn anything in later epochs.
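
The shrinking step size can be seen in a small NumPy sketch. The function name `adagrad_step` and the constant gradient of 1.0 are assumptions chosen only to make the effect easy to see.

```python
import numpy as np

def adagrad_step(W, dW, cache, lr=0.01, eps=1e-8):
    """One Adagrad update; returns the new weights and the updated cache."""
    cache = cache + dW ** 2                     # per-parameter sum of squared gradients
    W = W - lr * dW / (np.sqrt(cache) + eps)    # larger cache -> smaller effective step
    return W, cache

# Assumed toy setting: a single parameter that always receives a gradient of 1.0.
W, cache = np.zeros(1), np.zeros(1)
step_sizes = []
for _ in range(100):
    W_new, cache = adagrad_step(W, np.ones(1), cache)
    step_sizes.append(abs(W_new[0] - W[0]))
    W = W_new

# The gradient never changed, yet the last step is about 10x smaller than the first.
print(step_sizes[0], step_sizes[-1])
```

After 100 identical gradients the cache holds 100, so the step is divided by roughly 10 even though the gradient itself is unchanged.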

## Adadelta
- In the Adagrad algorithm, we update our cache with all of the previously squared gradients.
- However, Adadelta restricts this cache update by only accumulating a small number of past gradients.
- When actually implemented, this amounts to keeping a decaying average of the past squared gradients, so older gradients gradually fade out rather than being dropped at a hard cutoff.

## RMSprop
- Similar to Adadelta, RMSprop attempts to rectify the negative effects of a globally accumulated cache by converting the cache into an exponentially weighted moving average.

$$
\mathrm{cache} \leftarrow \mathrm{decay\_rate} \cdot \mathrm{cache} + (1 - \mathrm{decay\_rate}) \cdot (dW)^2
$$
$$
W \leftarrow W - lr \cdot \frac{dW}{\sqrt{\mathrm{cache}} + \mathrm{eps}}
$$
- The `decay_rate`, often denoted $\phi$, is a hyperparameter typically set to 0.9.
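
A minimal NumPy sketch of the RMSprop update, under the same assumed toy setting of a constant gradient of 1.0. The name `rmsprop_step` and the default `lr=0.001` are illustrative choices, not taken from the notes.

```python
import numpy as np

def rmsprop_step(W, dW, cache, lr=0.001, decay_rate=0.9, eps=1e-8):
    """One RMSprop update; returns the new weights and the updated cache."""
    # Exponentially weighted moving average of squared gradients.
    cache = decay_rate * cache + (1 - decay_rate) * dW ** 2
    W = W - lr * dW / (np.sqrt(cache) + eps)
    return W, cache

# Assumed toy setting: a single parameter that always receives a gradient of 1.0.
W, cache = np.zeros(1), np.zeros(1)
for _ in range(200):
    W, cache = rmsprop_step(W, np.ones(1), cache)

# The cache settles near the squared gradient (1.0) instead of growing without bound.
print(cache[0])
```

Because the cache is an average rather than a sum, the effective learning rate stops shrinking once the gradient magnitude stabilizes.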

## Adam
- The Adam (Adaptive Moment Estimation) optimization algorithm was proposed by Kingma and Ba in their 2014 paper.
- In practice, Adam tends to work better than RMSprop in many situations.

$$
m \leftarrow \beta_1 \cdot m + (1 - \beta_1) \cdot dW
$$
$$
v \leftarrow \beta_2 \cdot v + (1 - \beta_2) \cdot (dW)^2
$$
$$
W \leftarrow W - lr \cdot \frac{m}{\sqrt{v} + \epsilon}
$$
- Both $m$ and $v$ are running averages, similar to SGD momentum, each depending on its own value at time $t-1$. $m$ estimates the first moment (mean) of the gradients, while $v$ estimates the second moment (uncentered variance).
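
A minimal NumPy sketch of the simplified Adam update exactly as written above. The name `adam_step`, the toy quadratic loss, and the default values for `beta1`, `beta2`, and `eps` are assumptions for illustration. The version in the Kingma and Ba paper additionally applies a bias correction to $m$ and $v$, which this sketch omits to stay consistent with the equations in these notes.

```python
import numpy as np

def adam_step(W, dW, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (no bias correction); returns W, m, v."""
    m = beta1 * m + (1 - beta1) * dW         # running average of gradients
    v = beta2 * v + (1 - beta2) * dW ** 2    # running average of squared gradients
    W = W - lr * m / (np.sqrt(v) + eps)
    return W, m, v

def toy_loss(W):
    """Assumed toy loss L(W) = 0.5 * sum(W**2); its gradient is W itself."""
    return 0.5 * np.sum(W ** 2)

W = np.array([1.0, -2.0])
m, v = np.zeros_like(W), np.zeros_like(W)
initial_loss = toy_loss(W)
for _ in range(500):
    W, m, v = adam_step(W, W, m, v, lr=0.01)  # dW = W for the toy loss
final_loss = toy_loss(W)

print(initial_loss, final_loss)
```

The update reuses the RMSprop-style denominator, but the numerator is the smoothed gradient `m` rather than the raw gradient `dW`.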

## Choosing an Optimization Method
- Three methods to learn - **SGD, Adam, RMSprop**
- Choosing an optimization algorithm to train a deep neural network is highly dependent on your familiarity with:
  - The dataset
  - The model architecture
  - The optimization algorithm (and associated hyperparameters)

# Optimal Pathway to Apply Deep Learning
- The recipe has four ingredients:
  - Your dataset
  - A loss function
  - A neural network architecture
  - An optimization method
- Take excruciating care to make sure your training data is representative of your validation and testing sets
- There is no shortcut to building your own image dataset. If you expect a deep learning system to obtain high accuracy in a given real-world situation, then make sure this deep learning system was trained on images representative of where it will be deployed.
<img src="https://drive.google.com/uc?id=17yhl2uflADQxUPQmjob92p8BG3rbmakV" width=500px>
- Based on the figure above we can see that Ng is proposing four sets of data splits when training a deep learning model:
  - Training
  - Training-validation (which Ng refers to as “development”)
  - Validation
  - Testing
- If our training error is too high:
  - Then we should consider deepening our current architecture by adding in more layers and neurons.
  - We should also consider training for longer (i.e., more epochs) while simultaneously tweaking our learning rate – using a smaller learning rate may enable you to train for longer while helping prevent overfitting.
  - If many experiments with our current architecture and varying learning rates do not prove useful, then we likely need to try an entirely different model architecture.
- If our training-validation error is high:
  - We should examine the regularization parameters in our network. Are we applying dropout layers inside the network architecture? Is data augmentation being used to help generate new training samples? What about the actual loss/update function itself – is a regularization penalty being included? Examine these questions in the context of your own deep learning experiments and start adding in regularization.
  - It is likely that your model does not have enough training data to learn the underlying patterns in your example images.
  - After exhausting these options, you’ll once again want to consider using a different network architecture.
- If our training-validation error is low, but our validation set error is high:
  - We need to examine our training data with a closer eye. Are we absolutely, positively sure that our training images are similar to our validation images?
  - Without data representative of where your deep learning model will be deployed, you will not obtain high accuracy results.
- If our testing error is too high:
  - We've overfit our model to the training and validation data.
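
The checks above form an ordered decision procedure, which can be sketched as a small helper. The function name `diagnose` and the threshold `tol` are hypothetical. The notes do not say what counts as "too high", so `tol` is only a placeholder for a task-specific cut-off.

```python
def diagnose(train_err, train_val_err, val_err, test_err, tol=0.05):
    """Return a suggestion for the first split whose error exceeds `tol`.

    Errors are fractions in [0, 1]. `tol` is a hypothetical cut-off for
    "too high"; a real project would derive it from the task's requirements.
    """
    if train_err > tol:
        return "high training error: deepen the network, train longer, tune the learning rate"
    if train_val_err > tol:
        return "high training-validation error: add regularization or gather more training data"
    if val_err > tol:
        return "high validation error: check that training data resembles validation data"
    if test_err > tol:
        return "high testing error: model has overfit the training and validation data"
    return "all errors within tolerance"

# Example: training looks fine, but validation error jumps.
print(diagnose(train_err=0.02, train_val_err=0.03, val_err=0.20, test_err=0.21))
```

The order of the checks matters: each later split is only worth examining once the earlier ones are under control.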

## Transfer Learning or Train from Scratch
- [Andrej Karpathy. Transfer Learning](http://cs231n.github.io/transfer-learning/)
- To make this decision, you need to consider two important factors:
  - The size of your dataset.
  - The similarity of your dataset to the dataset the pre-trained CNN was trained on
<img src="https://drive.google.com/uc?id=1McOX6XdFnQ3NDigHRF7N6IJwc7V4YTum">

### Dataset is small and similar to original dataset
- You likely don't have enough training examples to train a CNN from scratch (keep in mind we should ideally have 1000-5000 examples per class you want to classify).
- Furthermore, given the lack of training data, it's likely not a good idea to attempt fine-tuning, as we'll likely end up overfitting.
- Since your image dataset is similar to what the pre-trained network was trained on, you should treat the network as a feature extractor and train a simple machine learning classifier on top of these features.
- You should extract features from layers deeper in the architecture, as these features are richer and more representative of the patterns learned from the original dataset.

### Dataset is large and similar to original dataset
- With a large dataset, we should have enough examples to apply fine-tuning without overfitting.
- You may be tempted to train your own model from scratch here as well – this is an experiment worth running. 
- However, since your dataset is similar to the original dataset the network was already trained on, the filters inside the network are likely already discriminative enough to obtain a reasonable classifier.
- Apply fine-tuning in this case.

### Dataset is small and different than original dataset
- Given a small dataset, we likely won’t obtain a high accuracy deep learning model by training from scratch.
- We should again apply feature extraction and train a standard machine learning model on top of the extracted features. But since our data is different from the original dataset, we should use lower level layers in the network as our feature extractors.
- Keep in mind that the deeper we go into the network architecture, the richer and more discriminative the features become, but also the more specific they are to the dataset the network was trained on. By extracting features from lower layers in the network, we can still leverage these filters, but without the abstraction caused by the deeper layers.

### Dataset is large and different than original dataset
- We have two options. 
  - Given that we have sufficient training data, we can likely train our own custom network from scratch. 
  - However, the pre-trained weights from models trained on datasets such as ImageNet make for excellent initializations, even if the datasets are unrelated.
- We should therefore perform two sets of experiments:
  - In the first set of experiments, attempt to fine-tune a pre-trained network to your dataset and evaluate the performance.
  - Then in the second set of experiments, train a brand new model from scratch and evaluate.
- Try to fine-tune first as this method will allow you to establish a baseline to beat when you move on to your second set of experiments and train your network from scratch.
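
The four cases above can be summarized as a lookup on the two factors (dataset size and similarity). The function name `choose_strategy` and the boolean inputs are hypothetical simplifications. In practice both factors are judgment calls rather than yes/no values.

```python
def choose_strategy(dataset_is_large: bool, similar_to_original: bool) -> str:
    """Map (dataset size, similarity to the pre-training dataset) to a starting strategy."""
    strategies = {
        (False, True): "feature extraction from deeper layers, then a simple classifier",
        (True, True): "fine-tuning",
        (False, False): "feature extraction from lower layers, then a simple classifier",
        (True, False): "fine-tune first as a baseline, then try training from scratch",
    }
    return strategies[(dataset_is_large, similar_to_original)]

# Example: a small dataset that differs from the pre-training data.
print(choose_strategy(dataset_is_large=False, similar_to_original=False))
```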

# References
- [Geoffrey Hinton. Neural Networks for Machine Learning.](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
- [Andrej Karpathy. Neural Networks (Part III).](https://cs231n.github.io/neural-networks-3/)
- [Andrew Ng. Nuts and Bolts of Building Applications using Deep Learning](https://nips.cc/Conferences/2016/Schedule?showEvent=6203)