## General Workflow
- Preprocess the data (zero-center it, i.e., subtract the mean)
- Identify architecture
- Ensure that we can overfit a small training set to acc = 100%
- Loss not decreasing: too low learning rate
- Loss goes to NaN: too high learning rate  
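
The zero-centering step can be sketched in NumPy as follows. The arrays are hypothetical random data, and reusing the training mean for the test split is an assumption based on common practice rather than something stated in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 training and 20 test examples with 5 features each.
X_train = rng.normal(loc=3.0, scale=2.0, size=(100, 5))
X_test = rng.normal(loc=3.0, scale=2.0, size=(20, 5))

# Per-feature mean, computed on the training set only.
mean = X_train.mean(axis=0)

# Subtract the same training mean from every split.
X_train_centered = X_train - mean
X_test_centered = X_test - mean
```

After this step each feature of `X_train_centered` averages to zero, while `X_test_centered` is only approximately centered.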

## Batch Normalization

- Improve gradient flow
- Allow higher learning rates
- Reduce dependence on initialization
- Some regularization
- *Note*: At test time, use the running mean and variance accumulated during training instead of statistics computed from the test batch
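
A minimal sketch of the train/test difference, assuming a fully connected input of shape (N, D). The function name `batchnorm_forward`, the `running` dictionary, and the `momentum` value are illustrative choices, not taken from any particular library.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train, momentum=0.9, eps=1e-5):
    """Batch norm for x of shape (N, D); `running` holds 'mean' and 'var'."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Keep an exponential moving average for use at test time.
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # Test time: reuse the statistics collected during training.
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 4
running = {"mean": np.zeros(D), "var": np.ones(D)}
gamma, beta = np.ones(D), np.zeros(D)

for _ in range(200):  # simulate training batches
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, D))
    out_train = batchnorm_forward(batch, gamma, beta, running, train=True)

x_test = rng.normal(loc=5.0, scale=2.0, size=(1, D))  # a single test example
out_test = batchnorm_forward(x_test, gamma, beta, running, train=False)
```

With the running statistics, even a batch of one example can be normalized at test time, which would be impossible with per-batch statistics.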

## Batch Normalization
- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: 1 \* D


- CNN -> Spatial Batchnorm
    * X: (num_example, channels, height, width)
    * mu, sigma: 1 \* C \* 1 \* 1
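
The two shapes above can be checked with a short NumPy sketch. The sizes `N, D, C, H, W` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C, H, W = 8, 5, 3, 4, 4

# Fully connected: average over the batch axis, one statistic per feature.
x_fc = rng.normal(size=(N, D))
mu_fc = x_fc.mean(axis=0, keepdims=True)                 # shape (1, D)

# Spatial batchnorm: average over batch, height, and width, one per channel.
x_conv = rng.normal(size=(N, C, H, W))
mu_conv = x_conv.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
sigma_conv = x_conv.std(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)

# Broadcasting applies the per-channel statistics to every pixel.
x_conv_norm = (x_conv - mu_conv) / (sigma_conv + 1e-5)
```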



## Layer Normalization
- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: N \* 1 
    * Note: same behavior during testing
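
A small sketch of why the behavior is the same at test time: the statistics are computed per example, so an example is normalized identically whether it arrives alone or inside a batch. The helper `layernorm` is an illustrative name and omits the learnable scale and shift.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Normalize each row of x (N, D) with its own mean and std, shape (N, 1)."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 10))

out_batch = layernorm(x)        # normalize inside a batch of 6
out_single = layernorm(x[:1])   # normalize the first example on its own
same = np.allclose(out_batch[:1], out_single)
```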

## Instance Normalization
- CNN
    * X: (num_example, channels, height, width)
    * mu, sigma: N \* C \* 1 \* 1
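
For comparison, a sketch of instance normalization, where every (example, channel) pair gets its own statistics over the spatial axes. The function name and sizes are illustrative, and the learnable scale and shift are left out.

```python
import numpy as np

def instancenorm(x, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per example and per channel."""
    mu = x.mean(axis=(2, 3), keepdims=True)   # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)   # shape (N, C, 1, 1)
    return (x - mu) / np.sqrt(var + eps), mu

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(4, 3, 5, 5))
x_norm, mu = instancenorm(x)
```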

## Second Order Optimization
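- No learning rate or similar hyperparameters to tune (for the plain Newton step)
- The Hessian has N^2 elements; inverting it costs O(N^3)
- Methods:
    * Quasi-Newton methods (BFGS): approximate the inverse Hessian instead of inverting it, O(N^3) -> O(N^2)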
    * L-BFGS: Does not form/store the full inverse Hessian.
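
To illustrate the "no learning rate" point, here is a sketch of a single Newton step on a hypothetical quadratic loss $ f(w) = \frac{1}{2} w^T A w - b^T w $. The matrix `A`, vector `b`, and starting point `w0` are made-up values; for a quadratic the Hessian is simply `A`.

```python
import numpy as np

# Hypothetical quadratic loss with a positive definite Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(w):
    """Gradient of 0.5 * w^T A w - b^T w."""
    return A @ w - b

hessian = A  # constant for a quadratic

w0 = np.array([10.0, -7.0])
# Newton step: w <- w - H^{-1} g, with no step size to tune.
# Solving the linear system avoids forming the explicit inverse,
# but is still O(N^3) for a dense N x N Hessian.
w1 = w0 - np.linalg.solve(hessian, grad(w0))
```

On a quadratic, one step lands on the point where the gradient is zero. For a deep network N is the number of parameters, which is why the full Hessian is impractical and approximations such as L-BFGS are used.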

## Hardware
- CPU: fewer cores, faster per core, better at sequential tasks
- GPU: more cores, slower per core, better at parallel tasks
- TPU: just for DL (Tensor Processing Unit)
    * Split *One* graph over *Multiple* machines

## Software
- Caffe (UC Berkeley) / Caffe2 (FB)
- PyTorch (FB)
- TF (Google)
- CNTK (MS)
- Dynamic (e.g., Eager Execution) vs. Static (e.g., TF Lower-level API)

## Different CNN architectures

- LeNet-5 (CONV-POOL-CONV-POOL-FC-FC)

<img src="https://i.stack.imgur.com/tLKYz.png" width="800">

- AlexNet 8 layers (CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8)

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/AlexNet_1.jpg" width="400">

- ZFNet (similar to AlexNet)
    * Smaller kernel, more filters

- VGGNet (*smaller* filters and *deeper* networks)
    * 16 to 19 layers (VGG16 and VGG19)
    * Three stacked 3 \* 3 conv layers (stride 1) have the same *effective receptive field* as one 7 \* 7 conv layer
    * But: 1) deeper network and more non-linearity; 2) fewer parameters (3 \* 3 \* 3 \* C^2 vs. 7 \* 7 \* C^2 for C channels per layer)
<img src="http://josephpcohen.com/w/wp-content/uploads/Screen-Shot-2016-01-14-at-11.25.15-AM.png" width="800">
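
The receptive-field and parameter comparison for VGG can be verified with a few lines of arithmetic. The helper names and the channel count `C = 64` are illustrative assumptions; biases are ignored.

```python
def conv_params(kernel, channels):
    """Weights in one conv layer with `channels` input and output channels (no bias)."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernel, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convs of size `kernel`."""
    return 1 + num_layers * (kernel - 1)

C = 64  # hypothetical channel count
params_three_3x3 = 3 * conv_params(3, C)        # 27 * C^2
params_one_7x7 = conv_params(7, C)              # 49 * C^2
rf_three_3x3 = stacked_receptive_field(3, 3)    # three 3x3 layers
rf_one_7x7 = stacked_receptive_field(7, 1)      # one 7x7 layer
```

Both configurations see a 7 \* 7 window of the input, but the stacked version uses 27 C^2 weights instead of 49 C^2.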

- GoogLeNet
    * Introduced the *inception* module (parallel filter operations with multiple kernel sizes)
    * Problem: Output size too big after filter concatenation
    * The purpose of the 1 \* 1 convolutional layer: 
        - Pooling layer keeps the same depth as input
        - 1 \* 1 layer keeps the spatial dimensions of the input and reduces depth (for example: 64 \* 56 \* 56 after 32 1 \* 1 conv filters --> 32 \* 56 \* 56)
        - Reduce total number of operations
    
<img src="https://www.researchgate.net/profile/Bo_Zhao48/publication/312515254/figure/fig3/AS:489373281067012@1493687090916/nception-module-of-GoogLeNet-This-figure-is-from-the-original-paper-10.jpg" width="400">
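
A sketch of the 1 \* 1 convolution from the example above (64 \* 56 \* 56 to 32 \* 56 \* 56), written as a dot product over the depth axis at every pixel. The operation count at the end assumes a hypothetical follow-up 5 \* 5 convolution with 128 filters and same padding; those numbers are illustrative, not taken from the GoogLeNet paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 56, 56))   # (depth, height, width)
w = rng.normal(size=(32, 64))       # 32 filters, each of size 1 x 1 x 64

# A 1x1 convolution mixes channels independently at every spatial position.
y = np.einsum("oc,chw->ohw", w, x)  # shape (32, 56, 56)

# Multiplications for a hypothetical 5x5 conv with 128 filters:
direct = 56 * 56 * 128 * 5 * 5 * 64                                 # applied to 64 channels
bottleneck = 56 * 56 * 32 * 64 + 56 * 56 * 128 * 5 * 5 * 32         # 1x1 first, then 5x5
```

The spatial size stays 56 \* 56 while the depth drops from 64 to 32, and the bottleneck path needs roughly half the multiplications in this made-up setting.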

- ResNet
    * Use network layers to fit a *Residual mapping* instead of directly fitting a desired underlying mapping
    * Residual blocks are stacked
    * Similar to GoogLeNet, can use a *bottleneck* layer (1 \* 1 conv layer) to reduce depth and improve efficiency
    
<img src="https://www.researchgate.net/profile/Antonio_Theophilo/publication/321347448/figure/fig2/AS:565869411815424@1511925189281/Bottleneck-Blocks-for-ResNet-50-left-identity-shortcut-right-projection-shortcut.png" width="500">
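
The residual idea can be sketched with fully connected layers standing in for convolutions, which is a simplification made here for brevity. The block computes F(x) + x, so when the weights of F are zero the block reduces to the identity (for non-negative inputs, because of the final ReLU).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(x, W1, W2):
    """out = relu(F(x) + x), where F is a small two-layer mapping."""
    f = relu(x @ W1) @ W2   # residual mapping F(x)
    return relu(f + x)      # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(2, 8)))   # non-negative hypothetical activations
W_zero = np.zeros((8, 8))

out_identity = residual_block(x, W_zero, W_zero)            # F(x) = 0
out_learned = residual_block(x, rng.normal(size=(8, 8)) * 0.1,
                             rng.normal(size=(8, 8)) * 0.1)  # small random F
```

This is one way to see why stacking residual blocks is easy to optimize: a block only has to learn a correction to its input rather than the full mapping.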



## Example of RNN: Image Captioning
1. From image: [CONV-POOL] \* n --> FC Layer --> (num_example, 4096) written as **v**
2. Hidden layer: $ h_t = \tanh(W_{xh} X_t + W_{hh} h_{t-1} + W_{ih} \mathbf{v}) $
3. Output layer: $ y_t = W_{hy} h_t $
4. Get the next input $ X_{t+1} $ by sampling from $ y_t $

<img src="https://raw.githubusercontent.com/yunjey/pytorch-tutorial/master/tutorials/03-advanced/image_captioning/png/model.png" width="500">
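
The four steps can be sketched as a sampling loop in NumPy. Everything numeric here is hypothetical: the vocabulary size, the embedding table `embed`, the start-token index, the random weights, and the random vector standing in for the CNN feature **v**. The softmax before sampling is also an assumption about how "sampling from y" is done.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, Hd, F = 10, 8, 16, 4096   # hypothetical vocab, embedding, hidden, feature sizes

W_xh = rng.normal(scale=0.01, size=(Hd, E))
W_hh = rng.normal(scale=0.01, size=(Hd, Hd))
W_ih = rng.normal(scale=0.01, size=(Hd, F))
W_hy = rng.normal(scale=0.01, size=(V, Hd))
embed = rng.normal(size=(V, E))  # hypothetical word embedding table

v = rng.normal(size=F)           # stand-in for the CNN feature of one image
h = np.zeros(Hd)
token = 0                        # hypothetical <START> token index
caption = []

for _ in range(5):
    x = embed[token]                               # current input X_t
    h = np.tanh(W_xh @ x + W_hh @ h + W_ih @ v)    # hidden layer
    y = W_hy @ h                                   # output scores
    p = np.exp(y - y.max())
    p /= p.sum()                                   # softmax over the vocabulary
    token = int(rng.choice(V, p=p))                # sample the next input
    caption.append(token)
```

In a trained model the loop would stop when an end token is sampled; the fixed length of 5 is only for illustration.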



## Image captioning with Attention

To be filled after LSTM

## LSTM

- Vanishing/exploding gradients with vanilla RNN
- Computing the gradient with respect to $ h_0 $ involves many repeated multiplications by **W** and by the derivative of the **tanh** activation

<img src="https://qph.fs.quoracdn.net/main-qimg-d63725db196675d327f3e4578c48701b" width="500">
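
A small numerical sketch of the vanishing case, assuming a vanilla RNN $ h_t = \tanh(W h_{t-1} + x_t) $ with small random weights. The sizes, weight scale, and random inputs are all made up for illustration; with large weights the same loop would show the exploding case instead.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 20, 50
W = rng.normal(scale=0.1, size=(H, H))   # small hypothetical recurrent weights

# Forward pass, storing the hidden states.
h = np.zeros(H)
hs = []
for _ in range(T):
    x = rng.normal(size=H)
    h = np.tanh(W @ h + x)
    hs.append(h)

# Backward pass: each step multiplies by the tanh derivative and by W^T.
grad = np.ones(H)                        # pretend upstream gradient at the last step
norms = []
for h_t in reversed(hs):
    grad = W.T @ (grad * (1 - h_t ** 2))
    norms.append(np.linalg.norm(grad))
```

`norms` records the gradient magnitude after each backward step, and it shrinks rapidly as the gradient travels toward $ h_0 $.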

- Main Idea of LSTM
    * **Forget Gate** (\*): how much old memory we want to keep; element-wise multiplication with old memory $ C_{t-1} $. The parameters are learned as $ W_f $, i.e., $ f_t = \sigma(W_f [h_{t-1}, X_t] + b_f) $. If you want to keep all old memory, then $ f_t $ equals 1. After getting $ f_t $, multiply it with $ C_{t-1} $.
    
    * **New Memory Gate** (\+): 
        * How to merge new memory with old memory: element-wise summation, weighted by a gate that decides how much *new* memory is added to the *old* memory. The weighting parameters are learned as $ W_i $, i.e., $ i_t = \sigma(W_i [h_{t-1}, X_t] + b_i) $.
    
        * What is the new memory itself: $ \tilde{C}_t = \tanh(W_C [h_{t-1}, X_t] + b_C) $
    
        * What is the combined memory: $ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $
    
    * **Output gate**: how much of the combined memory we want to output. The gate is computed from $ h_{t-1} $ and $ X_t $: $ o_t = \sigma(W_o [h_{t-1}, X_t] + b_o) $. The final output is $ h_t = o_t \odot \tanh(C_t) $.
    
    
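- Why does LSTM mitigate vanishing gradients?
    - The forget gate controls how much gradient flows back through the cell state: the update $ C_t = f_t \odot C_{t-1} + \dots $ is additive, so $ \partial C_t / \partial C_{t-1} $ contains the term $ f_t $, and with $ f_t $ close to 1 the gradient passes through without repeated multiplication by **W** and **tanh**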
    - See proof: https://weberna.github.io/blog/2017/11/15/LSTM-Vanishing-Gradients.html

<img src="https://cdn-images-1.medium.com/max/1600/1*laH0_xXEkFE0lKJu54gkFQ.png" width="500">
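
The gates described in this section can be put together in a single-step sketch. The parameter dictionary layout, the sizes, and the random initialization are illustrative assumptions; the equations follow the standard LSTM formulation with the concatenated input $ [h_{t-1}, X_t] $.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; every weight matrix acts on the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # new memory gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                    # combined memory
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                              # final output
    return h_t, c_t

rng = np.random.default_rng(0)
Hd, D = 4, 3   # hypothetical hidden and input sizes
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(Hd, Hd + D))
    params[f"b_{name}"] = np.zeros(Hd)

h, c = np.zeros(Hd), np.zeros(Hd)
for x_t in rng.normal(size=(5, D)):   # run five time steps
    h, c = lstm_step(x_t, h, c, params)
```

Note that the cell state `c` is only scaled by `f_t` and added to, which is the additive path that lets gradients flow back over many steps.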