In [3]:
import tensorflow as tf

# Implementing a bug-free model

- Key idea: **Be suspicious and start simple** 

### Resources:

- [Debug a deep learning network](https://medium.com/@jonathan_hui/debug-a-deep-learning-network-part-5-1123c20f960d)
- [Troubleshooting deep learning models](https://www.youtube.com/watch?v=GwGTwPcG0YM&feature=youtu.be)

# Strategy for debugging in pseudocode:

1. Start simple
2. Implement and debug
3. Evaluate:
    - if meets requirements --> Done
    - else:
        - Improve model/data and go to step 2.
        - OR tune hyperparameters and go to step 3.


## Overview

- Choose the simplest model and data possible
- Once it runs, overfit a single batch and reproduce a known result. If no known result exists, come up with a realistic baseline, using either common sense or human-level performance.
- Apply the bias-variance trade-off to decide what to improve next
- Tune hyperparameters
- Use a bigger model if you underfit; add data or regularize if you overfit
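
The last bullets can be sketched as a small decision helper. This is only an illustration: the function name `suggest_next_step`, the `target_error` argument, and the `gap_tolerance` threshold are assumptions, not part of the original notes.

```python
def suggest_next_step(train_error: float, val_error: float,
                      target_error: float, gap_tolerance: float = 0.02) -> str:
    """Map train/validation errors to the actions listed in the overview."""
    if train_error > target_error + gap_tolerance:
        # High bias: the model does not even fit the training data.
        return "underfitting: use a bigger model"
    if val_error - train_error > gap_tolerance:
        # High variance: training error is fine but it does not generalize.
        return "overfitting: add data or regularize"
    return "within tolerance of target: tune hyperparameters or stop"


print(suggest_next_step(train_error=0.20, val_error=0.22, target_error=0.05))
print(suggest_next_step(train_error=0.03, val_error=0.15, target_error=0.05))
print(suggest_next_step(train_error=0.04, val_error=0.05, target_error=0.05))
```

The thresholds are arbitrary; what counts as a large gap depends on the task and the metric.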

## 1 Start simple

### 1.1 Choose a simple architecture

|  | Start |  Later |
| ----------- | ----------- | ---------- |
| Images | LeNet-like architecture  | ResNet, Inception |
| Sequences | LSTM with one hidden layer or temporal convolutions | Attention model |
| Other | Fully connected with one hidden layer | Depends on the problem |

If there are different input modalities, use an appropriate architecture for each and then mix with a fully connected layer.
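
A minimal sketch of mixing two modalities with the Keras functional API is shown here. The input shapes, layer sizes, and number of classes are hypothetical and chosen only to keep the example small.

```python
import tensorflow as tf

# Hypothetical input shapes, chosen only for illustration.
image_in = tf.keras.Input(shape=(28, 28, 1), name="image")
seq_in = tf.keras.Input(shape=(20, 8), name="sequence")

# Image branch: a small LeNet-like stack.
x = tf.keras.layers.Conv2D(6, 5, activation="relu")(image_in)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)

# Sequence branch: an LSTM with one hidden layer (tanh is the Keras default).
s = tf.keras.layers.LSTM(16)(seq_in)

# Mix the two branches with a fully connected layer.
merged = tf.keras.layers.Concatenate()([x, s])
hidden = tf.keras.layers.Dense(32, activation="relu")(merged)
logits = tf.keras.layers.Dense(10)(hidden)  # 10 classes is an assumption

model = tf.keras.Model(inputs=[image_in, seq_in], outputs=logits)

# Quick shape check with a dummy batch of two examples.
out = model([tf.zeros((2, 28, 28, 1)), tf.zeros((2, 20, 8))])
print(out.shape)
```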

### 1.2 Defaults to start with

- Optimizer: Adam with the magical LR 3e-4
- Activations: ReLU for FC and Conv, tanh for LSTMs
- Initialization: Glorot normal or Glorot uniform (Glorot uniform is the default kernel initializer for Dense and Conv layers in TF2)
- No regularization
- No batch or layer normalization

In [11]:
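# Starting defaults: Adam with LR 3e-4 (the Keras default LR is 1e-3);
# Glorot uniform is already the default kernel initializer for Dense/Conv layers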
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
initializer = tf.keras.initializers.GlorotUniform()
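
As a usage sketch, these defaults can be wired into a fully connected model with one hidden layer. The feature count, hidden width, and number of classes below are assumptions for illustration.

```python
import tensorflow as tf

# Hypothetical sizes: 20 input features, 64 hidden units, 3 classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(
        64,
        activation="relu",
        kernel_initializer=tf.keras.initializers.GlorotUniform(),
    ),
    # Raw logits on the output; no regularization or normalization layers.
    tf.keras.layers.Dense(3),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
print(model.count_params())
```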

### 1.3 Normalize input data

#### Min-max

Scales each feature to `[0, 1]`:

$$ \hat{x} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

- shifts the data by $x_{min}$, so zeros and sparsity are retained only when $x_{min} = 0$
- use when you do not know the distribution of your data, or know that it is not Gaussian
- use when the algorithm makes no assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks
- sensitive to outliers, because a single extreme value sets $x_{min}$ or $x_{max}$

#### Max-abs 

Scales each feature to `[-1, 1]`:
- divide by the largest absolute value
- doesn't shift/center the data
- retains sparsity and zero values
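
The two scalings above can be sketched with NumPy as follows. The helper names `min_max_scale` and `max_abs_scale` are made up for this example, and the guard against constant columns is an added assumption.

```python
import numpy as np


def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Scale each column to [0, 1]; constant columns map to 0."""
    x_min = x.min(axis=0)
    x_range = x.max(axis=0) - x_min
    return (x - x_min) / np.where(x_range == 0, 1.0, x_range)


def max_abs_scale(x: np.ndarray) -> np.ndarray:
    """Scale each column to [-1, 1] without shifting, so zeros stay zero."""
    scale = np.abs(x).max(axis=0)
    return x / np.where(scale == 0, 1.0, scale)


data = np.array([[-2.0, 0.0], [0.0, 5.0], [4.0, 10.0]])
print(min_max_scale(data))  # first column becomes 0, 1/3, 1
print(max_abs_scale(data))  # first column becomes -0.5, 0, 1
```

In practice the minimum, maximum, and scale would be computed on the training set only and then reused for validation and test data.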

#### Images

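- TF2: divide by 255 to get values in `[0, 1]`
- PyTorch: divide by 255 and then standardize with a per-channel mean $\mu$ and standard deviation $\sigma$, where $x$ is the pixel value after dividing by 255:

$$ \hat{x} = \frac{x - \mu }{\sigma} $$

A common choice is $\mu = \sigma = 0.5$, which maps `[0, 1]` to `[-1, 1]`:

$$ \hat{x} = \frac{x - 0.5}{0.5}$$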
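
A framework-agnostic NumPy sketch of these image scalings follows. The random batch and its NHWC shape are assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake batch of 4 RGB images, 8x8 pixels, in NHWC layout.
images = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

scaled = images.astype(np.float32) / 255.0   # values in [0, 1]
centered = (scaled - 0.5) / 0.5              # values in [-1, 1]

# Per-channel standardization. The statistics come from this batch here;
# in practice they would be computed on the training set.
mean = scaled.mean(axis=(0, 1, 2))
std = scaled.std(axis=(0, 1, 2))
standardized = (scaled - mean) / std

print(scaled.min(), scaled.max())
print(centered.min(), centered.max())
```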


### 1.4 Simplify the problem

Working on a simpler version of the problem often makes sense as a starting point:

- Reduce the training set size
- Use a smaller number of classes, a smaller image size, etc.
- Create a synthetic training set that is easier to work with
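
The simplifications above can be sketched in NumPy. The helpers `make_blobs` and `subsample` are hypothetical names, and Gaussian blobs are just one possible easy synthetic task.

```python
import numpy as np


def make_blobs(n_per_class, n_classes=2, dim=2, seed=0):
    """Unit-variance Gaussian blobs with centers 10 apart per axis: an easy task."""
    rng = np.random.default_rng(seed)
    centers = 10.0 * np.arange(n_classes)[:, None] * np.ones((1, dim))
    x = np.concatenate([c + rng.normal(size=(n_per_class, dim)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return x.astype(np.float32), y


def subsample(x, y, n, keep_classes, seed=0):
    """Keep only some classes, then draw n examples without replacement."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(np.isin(y, keep_classes))
    idx = rng.choice(idx, size=min(n, idx.size), replace=False)
    return x[idx], y[idx]


x, y = make_blobs(n_per_class=500, n_classes=5)
x_small, y_small = subsample(x, y, n=100, keep_classes=[0, 1])
print(x.shape, x_small.shape, np.unique(y_small))
```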


## 2 Implement and debug

- Get your model to run
- Overfit a single batch
- Compare to a known result


### 2.1 General advice for implementing your model

- Write the minimum number of lines of code for your version 1 (rule of thumb: < 200 lines, not counting already tested components)
- Use off-the-shelf components:
    - Keras for simpler tasks where little to no change to the default behaviour of functions is needed
    - Pure TF2, but with tf.keras.layers, tf.losses, etc. when more flexibility is needed
- Start with a dataset that loads into memory


### 2.2 Overfit a single batch

Assuming your model runs, try to overfit a single batch:

- Error goes up:
    - Flipped the sign of the loss/gradient
    - LR too high
    - Softmax taken over wrong dimension 
- Error explodes:
    - Numerical issue. Check all exp, log, div operations, clip gradients
    - LR too high 
- Error oscillates:
    - Data labels corrupted
    - LR too high
- Error plateaus:
    - LR too low
    - Gradients not flowing through the whole model
    - Too much regularization
    - Incorrect input to loss function (e.g. softmax instead of logits, ReLU on output)
    - Data labels corrupted
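
A minimal sketch of the single-batch test in Keras is given below. The random data, layer sizes, and number of epochs are assumptions; the point is only that repeated training on one fixed batch should drive the loss down steadily.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)

# One fixed batch of random data (hypothetical: 32 examples, 10 features, 3 classes).
rng = np.random.default_rng(0)
x_batch = rng.normal(size=(32, 10)).astype("float32")
y_batch = rng.integers(0, 3, size=32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3),  # logits: no softmax or ReLU on the output
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    # from_logits=True matches the raw outputs above; feeding softmax
    # probabilities here instead is one of the plateau causes listed.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Each epoch is exactly one gradient step on the same batch.
history = model.fit(x_batch, y_batch, batch_size=32, epochs=300, verbose=0)
losses = history.history["loss"]
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

If the printed loss rises, explodes, oscillates, or plateaus instead, the list above gives the usual suspects to check.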

### 2.3 Compare to a known result

| Rank (most useful first) | Source |
| ----------- | ----------- |
| 1 | Official model implementation evaluated on a similar dataset |
| 2 | Official model implementation evaluated on a benchmark (e.g. MNIST) |
| 3 | Unofficial model implementation |
| 4 | Results from a paper (with no code) |
| 5 | Results from your model on a benchmark dataset (e.g. MNIST) |
| 6 | Results from a similar model on a similar dataset |
| 7 | Super simple baseline (avg. of all outputs, linear regression, common sense) |
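
The last row of the table can be sketched in NumPy. The function names and toy labels below are made up for illustration; they show a mean predictor for regression and a majority-class predictor for classification.

```python
import numpy as np


def mean_baseline_mse(y_train, y_test):
    """MSE of always predicting the training-set mean (regression)."""
    return float(np.mean((y_test - y_train.mean()) ** 2))


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    classes, counts = np.unique(y_train, return_counts=True)
    return float(np.mean(y_test == classes[counts.argmax()]))


y_train_reg = np.array([1.0, 2.0, 3.0, 4.0])
y_test_reg = np.array([2.0, 3.0])
print(mean_baseline_mse(y_train_reg, y_test_reg))  # 0.25

y_train_cls = np.array([0, 0, 1, 0, 2])
y_test_cls = np.array([0, 1, 0, 0])
print(majority_baseline_accuracy(y_train_cls, y_test_cls))  # 0.75
```

A trained model that cannot beat these numbers on the same split is a strong hint that something is still broken.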