```python
import tensorflow as tf
```

# Implementing a bug-free model

- Key idea: **Be suspicious and start simple** 

## Resources:

- [Debug a deep learning network](https://medium.com/@jonathan_hui/debug-a-deep-learning-network-part-5-1123c20f960d)
- [Troubleshooting deep learning models](https://www.youtube.com/watch?v=GwGTwPcG0YM&feature=youtu.be)
- [Recipe for training neural networks](http://karpathy.github.io/2019/04/25/recipe/)
- [Machine learning yearning](https://www.deeplearning.ai/machine-learning-yearning/)
- [Bayesian optimization](http://krasserm.github.io/2018/03/21/bayesian-optimization/)
- [Dive into deep learning](https://d2l.ai/index.html) 
- [Deep learning](https://www.deeplearningbook.org/)

# Strategy for debugging in pseudocode:

1. Start simple.
2. Implement and debug.
3. Evaluate:
    - if meets requirements --> Done.
    - else:
        - Improve model/data and go to step 2.
        - OR tune hyperparameters and go to step 3.


## Overview

- Choose the simplest model and data possible.
- Once it runs, overfit a single batch and reproduce a known result. If no known result exists, come up with a realistic baseline, using either common sense or human-level performance.
- Apply the bias-variance trade-off.
- Tune hyperparameters.
- Use a bigger model if you underfit; add data or regularize if you overfit.

## 1 Start simple 

### 1.1 Choose a simple architecture

|  | Start |  Later |
| ----------- | ----------- | ---------- |
| Images | LeNet-like architecture  | ResNet, Inception |
| Sequences | One-layer LSTM or temporal convolutions | Attention model, Transformers |
| Other | Fully connected with one hidden layer | Depends on the problem |

Different modalities?

### 1.2 Defaults to start with

- Optimizer: Adam.
- Activations: ReLU for fully connected (FC) and convolutional layers, tanh for LSTMs.
- Initialization: 
    - Glorot Uniform (the default kernel initializer for Dense and convolutional Keras layers in TF2) or Glorot Normal.
    - Initialize in a **smart** way: if you have, for example, a regression problem with the mean of outputs equal to 50, set the bias of the last layer to 50.
- No regularization.
- No batch or layer normalization.

```python
# TF2 defaults
optimizer = tf.keras.optimizers.Adam()
initializer = tf.keras.initializers.GlorotUniform()
```
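The "smart" initialization of the output bias mentioned in the defaults can be sketched as follows. The target mean of 50 comes from the example in the text; the input shape and layer sizes are assumptions made purely for illustration.

```python
import numpy as np
import tensorflow as tf

# From the example in the text: regression targets with a mean of about 50.
target_mean = 50.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),                       # hypothetical: 8 input features
    tf.keras.layers.Dense(16, activation="relu"),     # hypothetical hidden layer size
    # Start the output bias at the target mean so that the first
    # predictions are already in the right range.
    tf.keras.layers.Dense(
        1, bias_initializer=tf.keras.initializers.Constant(target_mean)
    ),
])

# get_weights() on a Dense layer returns [kernel, bias].
output_bias = model.layers[-1].get_weights()[1]
print(output_bias)
```

With this setup the network only has to learn deviations from the mean, which usually removes the initial phase in which the loss drops just because the bias is being learned.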

### 1.3 Scale/Standardize input data

#### Min-max

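Scales each feature to [0, 1]:

- Does not center the data: the mean of the scaled feature is generally not zero.
- Retains zero values (and therefore sparsity) only when the feature minimum is 0, for example for non-negative features.

$$ \hat{x} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

- Use it when your features lie in a limited range.
- Use it when you do not know the distribution of your data.
- Use it when you know the distribution is not Gaussian.
- It is sensitive to outliers, because a single extreme value determines the minimum or the maximum.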
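A minimal NumPy sketch of min-max scaling, applied per feature (column). The function name `min_max_scale` and the toy data are assumptions for illustration; the last row contains a deliberate outlier to show the sensitivity mentioned above.

```python
import numpy as np

def min_max_scale(x, x_min=None, x_max=None):
    """Scale each column of x to [0, 1] using the given (or computed) min and max."""
    x = np.asarray(x, dtype=np.float64)
    x_min = x.min(axis=0) if x_min is None else x_min
    x_max = x.max(axis=0) if x_max is None else x_max
    # Guard against division by zero for constant columns.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

# Toy data: the second column contains an outlier (1000).
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 1000.0]])
scaled = min_max_scale(features)
print(scaled)
```

In the second column the outlier squeezes the remaining values into a narrow band near 0. In practice the minimum and maximum would be computed on the training set and reused for validation and test data via the `x_min` and `x_max` arguments.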

#### Standardization

Scales the distribution to have zero mean and unit standard deviation:

$$\hat{x} = \frac{x - \mu }{\sigma}$$

- Works best when your data is roughly Gaussian.
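A minimal NumPy sketch of standardization. The function name `standardize`, the `eps` guard, and the toy arrays are assumptions for illustration. The statistics are computed on the training data and then reused for the test data.

```python
import numpy as np

def standardize(x, mean=None, std=None, eps=1e-8):
    """Standardize each column of x; returns the scaled data and the statistics used."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    # eps avoids division by zero for constant columns.
    return (x - mean) / (std + eps), mean, std

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

train_std, mu, sigma = standardize(train)
# Reuse the training statistics for the test set.
test_std, _, _ = standardize(test, mean=mu, std=sigma)
print(train_std)
print(test_std)
```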

#### Images

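- TF2: divide by 255 to scale pixel values to [0, 1].
- PyTorch: divide by 255 and then standardize each channel with a mean μ and a standard deviation σ:

$$ \hat{x} = \frac{x - \mu }{\sigma} $$

For example, with a mean and standard deviation of 0.5 the values are mapped from [0, 1] to [-1, 1]:

$$ \hat{x} = \frac{x - 0.5}{0.5}$$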
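The two image conventions can be sketched in NumPy as follows. The tiny random image is a stand-in for real data, and the per-channel mean and standard deviation of 0.5 are the values from the formula above, not statistics of any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny RGB image with uint8 pixel values in [0, 255].
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Step 1: scale to [0, 1].
unit = image.astype(np.float32) / 255.0

# Step 2 (optional): standardize per channel; with mean = std = 0.5
# the values end up in [-1, 1].
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32)
centered = (unit - mean) / std

print(unit.min(), unit.max(), centered.min(), centered.max())
```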


### 1.4 Simplify the problem

Simplifying the problem often makes sense as a starting point:

- Reduce the training set size.
- Use a smaller number of classes, image size, etc.
- Create a synthetic training set that is easier to work with.
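One way to build a synthetic training set is to generate data from a known rule, so that you know what a correct model should recover. The sketch below assumes a linear regression problem; the function name, sizes, and noise level are all illustrative choices.

```python
import numpy as np

def make_synthetic_regression(n_samples=256, n_features=4, noise=0.1, seed=0):
    """Generate y = x @ true_w + noise with known ground-truth weights."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, n_features))
    true_w = rng.normal(size=(n_features,))
    y = x @ true_w + noise * rng.normal(size=n_samples)
    return x, y, true_w

x, y, true_w = make_synthetic_regression()

# A reduced training set for the very first experiments.
x_small, y_small = x[:32], y[:32]

# Sanity check: ordinary least squares should recover the known weights.
w_hat = np.linalg.lstsq(x, y, rcond=None)[0]
print(true_w)
print(w_hat)
```

Because the true weights are known, any model that cannot fit this data points to a bug in the pipeline rather than to a hard problem.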


## 2 Implement and debug

- Get your model to run.
- Overfit a single batch.


### 2.1 General advice for implementing your model

- Minimum number of lines of code for your version 1 (rule of thumb < 200 lines, not counting already tested components).
- Use off the shelf components:
    - Keras for simpler tasks where little or no change to the default behaviour of functions is needed.
    - Low level TF2, but with tf.keras.layers, tf.losses, etc. when more flexibility is needed.
- Start with a dataset that loads into memory.


### 2.2 Overfit a single batch

Assuming your model runs, try to overfit a single batch:

- Error goes up:
    - Flipped the sign of the loss/gradient.
    - LR too high.
    - Softmax taken over wrong dimension. 
- Error explodes:
    - Numerical issue. Check all exp, log, div operations, clip gradients.
    - LR too high. 
- Error oscillates:
    - Data labels corrupted.
    - LR too high.
- Error plateaus:
    - LR too low.
    - Gradients not flowing through the whole model.
    - Too much regularization.
    - Incorrect input to the loss function (e.g. softmax outputs passed where logits are expected, or a ReLU applied to the output layer).
    - Data labels corrupted.
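A minimal sketch of the single-batch overfitting test in Keras. The batch is random data with random labels, and the model size, learning rate, and number of epochs are assumptions chosen so that memorizing the batch is easy. The output layer returns raw logits, and the loss is told so with `from_logits=True`.

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)  # make the run repeatable

rng = np.random.default_rng(0)
# One fixed batch: 16 examples, 20 features, 3 classes (all hypothetical).
x_batch = rng.normal(size=(16, 20)).astype("float32")
y_batch = rng.integers(0, 3, size=(16,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3),  # raw logits, no softmax here
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Train repeatedly on the same batch; the loss should approach zero.
history = model.fit(x_batch, y_batch, batch_size=16, epochs=200, verbose=0)
losses = history.history["loss"]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the loss does not fall steadily towards zero on a batch this small, the symptoms listed above are the first things to check.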

## 3 Evaluate

### 3.1 Compare to a known result

| Rank (most useful first) | Source |
| ----------- | ----------- |
| 1 | Official model implementation evaluated on a similar dataset |
| 2 | Official model implementation evaluated on a benchmark (e.g. MNIST) |
| 3 | Unofficial model implementation |
| 4 | Results from a paper (with no code) |
| 5 | Results from your model on a benchmark dataset (e.g. MNIST) |
| 6 | Results from a similar model on a similar dataset |
| 7 | Super simple baseline (avg. of all outputs, linear regression, common sense) |
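The simplest baseline in the table, predicting the average of all outputs, takes only a few lines. The synthetic targets below are an assumption for illustration; with real data, `y_train` and `y_val` would be your own target arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression targets centred around 50.
y_train = rng.normal(loc=50.0, scale=10.0, size=200)
y_val = rng.normal(loc=50.0, scale=10.0, size=50)

# Baseline: always predict the mean of the training targets.
baseline_prediction = y_train.mean()
baseline_mae = np.mean(np.abs(y_val - baseline_prediction))

print(f"baseline prediction: {baseline_prediction:.2f}")
print(f"baseline validation MAE: {baseline_mae:.2f}")
```

A trained model that does not clearly beat this number has not learned anything useful from the inputs.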

### 3.2 Perform error analysis

- Carry out error analysis by manually examining ~100 val set examples the model struggles to predict and counting the major categories of errors. Use this information to prioritize what types of errors to work on.
- Consider splitting the val set into an Eyeball val set, which you will manually examine, and a Blackbox val set, which you will not manually examine. If performance on the Eyeball val set is much better than the Blackbox val set, you have overfit the Eyeball set.
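The bookkeeping for error analysis can be sketched in plain Python. The tags, example records, and function names below are hypothetical; in practice the tags come from manually inspecting the misclassified examples.

```python
import random
from collections import Counter

# Hypothetical hand-assigned tags for misclassified val examples.
misclassified = [
    {"id": 0, "tags": ["blurry"]},
    {"id": 1, "tags": ["blurry", "dark"]},
    {"id": 2, "tags": ["mislabeled"]},
    {"id": 3, "tags": ["blurry"]},
    {"id": 4, "tags": ["mislabeled"]},
]

def count_error_categories(examples):
    """Return (tag, count, fraction of examples) sorted by count, largest first."""
    counts = Counter(tag for ex in examples for tag in ex["tags"])
    total = len(examples)
    return [(tag, n, n / total) for tag, n in counts.most_common()]

def split_eyeball_blackbox(example_ids, eyeball_size, seed=0):
    """Randomly split val example ids into an Eyeball set and a Blackbox set."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)
    return ids[:eyeball_size], ids[eyeball_size:]

report = count_error_categories(misclassified)
for tag, n, frac in report:
    print(f"{tag}: {n} examples ({frac:.0%})")

eyeball, blackbox = split_eyeball_blackbox(range(10), eyeball_size=4)
```

The fractions give an upper bound on how much fixing each category could help, which is what makes them useful for prioritization.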

### 3.3 Learning curves and bias-variance trade-off

- **Unavoidable bias:** the error rate that an optimal algorithm would achieve on the given problem.
    - For example, suppose we train a speech recognition system and 14% of the audio clips are so noisy that even a human cannot make out the words. The best possible algorithm would then also have an error rate of about 14%.
- **Avoidable bias:** the difference between the training error and the unavoidable bias. 
- **Variance:** the difference between the validation set error and the train set error.

$$\text{Bias} = \text{Avoidable bias} + \text{Unavoidable bias}$$

There is **no unavoidable variance**, since in theory we can always add more data.
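The definitions above translate directly into arithmetic. In the sketch below the 14% unavoidable bias is taken from the speech recognition example, while the training and validation errors of 15% and 30% are assumed numbers; the function name and the rule for picking a focus are illustrative.

```python
def diagnose(train_error, val_error, unavoidable_bias):
    """Split the errors into avoidable bias and variance and pick the larger one."""
    avoidable_bias = train_error - unavoidable_bias
    variance = val_error - train_error
    focus = "bias" if avoidable_bias >= variance else "variance"
    return {
        "avoidable_bias": avoidable_bias,
        "variance": variance,
        "focus": focus,
    }

# 14% unavoidable bias (from the example above); 15% and 30% are assumed.
report = diagnose(train_error=0.15, val_error=0.30, unavoidable_bias=0.14)
print(report)
```

Here the avoidable bias is about 1% and the variance about 15%, so variance reduction is the more promising direction.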

**Steps to follow**:

1. Set a desired level of performance and if possible, find out the unavoidable bias.
2. Plot learning curves.
3. Identify what causes underperformance and proceed to bias/variance reduction.

**Bias reduction techniques:**

- Increase the model size: If variance increases, use regularization.
- Modify input features based on insights from error analysis.
- Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout, batch normalization, etc.): This reduces avoidable bias but also increases variance.
- Modify model architecture so that it is more suitable for your problem.

*One method that is not helpful!*

- Add more training data: This technique helps with variance problems, but it usually has no significant effect on bias.

**Variance reduction techniques:**

- Add more training data (as noted above).
- Add regularization (L2 regularization, L1 regularization, dropout): This technique reduces variance but increases bias.
- Add early stopping: This technique reduces variance but increases bias.
- Feature selection to decrease the number/type of input features: This technique might help with variance problems, but it might also increase bias. When your training set is small, feature selection can be very useful. It is usually not needed in deep learning.
- Modify input features based on insights from error analysis: Helps with variance and bias usually.
- Modify model architecture so that it is more suitable for your problem: This technique can affect both bias and variance.
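Three of the variance reduction techniques above (L2 regularization, dropout, and early stopping) can be wired into a Keras model as sketched below. The layer sizes, regularization strength, dropout rate, and patience are assumed values, not recommendations.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                      # hypothetical: 20 input features
    tf.keras.layers.Dense(
        64,
        activation="relu",
        # L2 penalty on the weights of this layer.
        kernel_regularizer=tf.keras.regularizers.L2(1e-4),
    ),
    tf.keras.layers.Dropout(0.3),                     # randomly drop 30% of activations in training
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for 5 epochs
# and roll back to the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
```

The callback takes effect when it is passed to `model.fit` through the `callbacks` argument together with validation data, since it monitors `val_loss`.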


### 3.4 Tuning hyperparameters

- Random search over grid search:

<img src = "../../assets/random_search_vs_grid.png" width="600" height="400" align="center"/>

As the number of hyperparameters increases, grid search becomes too slow to perform: the number of configurations, and therefore the run time, grows exponentially with the number of hyperparameters.
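A small sketch of both points: how fast a grid grows, and how random search samples a fixed number of configurations instead. The hyperparameter names and ranges are assumptions for illustration; the learning rate is sampled log-uniformly, a common choice for parameters that span several orders of magnitude.

```python
import random

def grid_size(values_per_param, n_params):
    """Number of configurations in a full grid."""
    return values_per_param ** n_params

def sample_config(rng):
    """Draw one random configuration (hypothetical search space)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # log-uniform in [1e-5, 1e-1]
        "hidden_units": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.0, 0.5),
    }

# With 5 values per hyperparameter the grid grows from 125 to 15625
# configurations when going from 3 to 6 hyperparameters.
print(grid_size(5, 3), grid_size(5, 6))

# Random search: the budget is fixed regardless of the number of hyperparameters.
rng = random.Random(0)
trials = [sample_config(rng) for _ in range(20)]
print(trials[0])
```

Each trial would then be trained and scored on the validation set, keeping the best configuration.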

- Bayesian Optimization algorithm:
    - An alternative to random search.
    - If there are many hyperparameters and the network is large, random search is the better choice.