# Some key concepts

**Gradient Descent**
- Batch (Fast Convergence)
- SGD

**Back-propagation**
- Preprocessing (e.g., zero-centered)

**Activation Function**
- Sigmoid
    * Saturated at 0/1 and kills gradients (derivative -> 0)
    * Output not zero-centred; for next layer: f = wx + b, x>0, df/dw same sign for all w; zig-zag update trajectory
- Tanh
    * Still kills gradients
    * But: zero-centered
- ReLU
    * Non-saturating for x > 0 and piecewise linear --> accelerates convergence
    * Cheap computation
    * But: a unit can die, i.e., never activate again
    * Extension: Leaky ReLU, maxout
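
The saturation behaviour listed above can be checked numerically. This is a minimal NumPy sketch; the function names and the `alpha=0.01` slope for Leaky ReLU are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative s * (1 - s): peaks at 0.25 for x = 0, goes to 0 when saturated
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative 1 - tanh(x)^2: zero-centered output, but still saturates
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 otherwise (a "dead" unit always sees 0)
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha (assumed value) for x <= 0 keeps some gradient alive
    return np.where(x > 0, x, alpha * x)

x = np.array([-10.0, 0.0, 10.0])
print("sigmoid'", sigmoid_grad(x))
print("tanh'   ", tanh_grad(x))
print("relu'   ", relu_grad(x))
```

At large |x| the sigmoid and tanh derivatives are close to zero, while the ReLU derivative stays at 1 on the positive side.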
    
**NN as universal approximators**
- More neurons --> more complicated functions
- Regularization to prevent overfitting

**Regularization**
- L1/L2/ElasticNet
- Max Norm constraint
- Drop Out layer
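
A minimal sketch of the three regularizers listed above. The function names, the `p_keep`, `reg`, and `c` defaults, and the choice of "inverted" dropout (rescaling at training time so that test time needs no change) are assumptions made for illustration.

```python
import numpy as np

def l2_penalty(w, reg=1e-3):
    # L2 regularization term added to the loss: 0.5 * reg * sum(w^2)
    return 0.5 * reg * np.sum(w * w)

def max_norm(w, c=3.0):
    # Max-norm constraint: rescale each column (one neuron's incoming
    # weights) so that its L2 norm is at most c
    norms = np.linalg.norm(w, axis=0, keepdims=True)
    return w * np.minimum(1.0, c / np.maximum(norms, 1e-12))

def dropout_forward(x, p_keep=0.5, train=True, rng=None):
    # Inverted dropout: drop units with probability 1 - p_keep and divide
    # by p_keep so the expected activation is unchanged
    if not train:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(x.shape) < p_keep) / p_keep
    return x * mask

x = np.ones((4, 5))
print(dropout_forward(x, p_keep=0.5, rng=np.random.default_rng(0)))
print(dropout_forward(x, train=False))
```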

**Hyperparameter Optimization**
- Single validation set > cross validation in practice
- Random search instead of grid search within a range
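
Random search within a range can be sketched as below. The hyperparameter names, the ranges, the log-scale sampling, and the toy scoring function are all assumptions for illustration; in practice `evaluate` would train a model and return accuracy on the single validation set.

```python
import math
import random

def sample_config(rng):
    # Sample learning rate and regularization on a log scale (assumed
    # ranges), and the hidden size uniformly over integers
    return {
        "learning_rate": 10 ** rng.uniform(-6, -1),
        "reg": 10 ** rng.uniform(-5, 0),
        "hidden_size": rng.randint(32, 512),
    }

def random_search(evaluate, n_trials=20, seed=0):
    # Keep the configuration with the best validation score
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_score(cfg):
    # Hypothetical stand-in for validation accuracy: best near lr = 1e-3
    return -abs(math.log10(cfg["learning_rate"]) + 3)

best, score = random_search(toy_score, n_trials=50)
print(best, score)
```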


**ConvNets**
- Difference with regular NN:
    * Main difference: each neuron in layer 2 is only connected to a few neurons in layer 1
    * Data arranged in 3 dimensions: height, width, and depth
- Convolutional Layer:
    * Filter (with full depth, but local connectivity across 2d), 3\*3 --> 5\*5
    * The depth of layer 2 == The number of filters in Layer 1
    * `Stride`: usually 1, leaving downsampling to the pooling layer. A stride of 2 is sometimes used in the 1st layer as a compromise when computational resources are limited
    * `Padding`: use `same` padding to avoid losing information along the border
    * *Parameter Sharing*: Same weight for filter/kernel at same depth slice in layer 2 (Alternative: local)
- Pooling
    * Most commonly: 2 \* 2 max pooling with stride 2
- Fully-Connected Layer
- Common architecture:
    * INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]\* -> [FC -> RELU]\*2 -> FC
    * Prefer a stack of small filter CONV to one large receptive field CONV layer
- Challenge: Computational resources
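
The effect of filter size, stride, and padding on the output volume follows the usual formula (W - F + 2P) / S + 1. A small sketch, with hypothetical helper names:

```python
def conv_output_size(w, f, stride=1, pad=0):
    # Spatial output size of a conv or pooling layer along one dimension
    span = w - f + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

def conv_param_count(f, depth_in, n_filters):
    # Each filter spans the full input depth and has one bias;
    # parameter sharing means the count does not depend on width/height
    return (f * f * depth_in + 1) * n_filters

# 3*3 conv, stride 1, padding 1 keeps a 32*32 input at 32*32 ("same")
print(conv_output_size(32, 3, stride=1, pad=1))
# 2*2 pooling with stride 2 halves the spatial size
print(conv_output_size(32, 2, stride=2, pad=0))
# 64 filters of size 3*3 over an RGB input
print(conv_param_count(3, 3, 64))
```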


**Transfer Learning**
- Apply the trained model without its last FC layer and use it as a feature extractor
- Continue to fine tune the model using smaller learning rate
- Can use different image size

**Weights Initialization**
- **All zero**: wrong: all neuron outputs and gradients would be the same, so every neuron gets the same update
- **Numbers too small**: small gradients for inputs; the gradient diminishes when flowing backward
- **Preferred: All neuron with same output distribution**: 
    - w = np.random.randn(n) / sqrt(n), where n is number of inputs. 
    - It can be proved that Var(S) = Var(WX) = Var(X)
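
The variance claim can be checked empirically. This sketch assumes unit-variance Gaussian inputs and arbitrary layer sizes (512 inputs, 256 outputs); it compares the `1/sqrt(n)` scaling above with a "too small" initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 512, 256, 2000   # assumed sizes for the demo
X = rng.standard_normal((n_samples, n_in))

# Scaled init: divide by sqrt(number of inputs)
W_scaled = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
# "Too small" init for comparison
W_small = 0.01 * rng.standard_normal((n_in, n_out))

S_scaled = X @ W_scaled
S_small = X @ W_small

print("Var(X)        ", X.var())
print("Var(S) scaled ", S_scaled.var())
print("Var(S) small  ", S_small.var())
```

With the scaled initialization the output variance stays close to the input variance, while the small initialization shrinks it, which compounds layer after layer.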

**Param Update and Learning Rate**
- Step decay for learning rate: 
    * Reduce the learning rate by some factor every few epochs. 
    * Other approaches are also available, like exponential decay, 1/t decay, etc.
- Second-order update method:
    * e.g., Newton's method; not common in practice
- Per-parameter adaptive learning rate methods: 
    * For example: Adagrad, Adam
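
A sketch of the decay schedules and one Adam update. The decay constants (`drop`, `every`, `k`) are assumed values; the Adam defaults are the commonly used ones (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`).

```python
import math
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` epochs
    return lr0 * drop ** (epoch // every)

def exp_decay(lr0, epoch, k=0.1):
    return lr0 * math.exp(-k * epoch)

def inv_t_decay(lr0, epoch, k=0.1):
    return lr0 / (1.0 + k * epoch)

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Per-parameter adaptive step: running averages of the gradient (m)
    # and squared gradient (v), with bias correction for step t >= 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 starting from w = 1
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)
```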

<img src="http://cs231n.github.io/assets/nn3/nesterov.jpeg" width="600">

**General Workflow**
- Preprocess data (zero-centered, i.e., subtract the mean)
- Identify architecture
- Ensure that we can overfit a small training set to acc = 100%
- Loss not decreasing: learning rate too low
- Loss goes to NaN: too high learning rate  

**Batch Normalization**

https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

- Improve gradient flow
- Allow higher learning rates
- Reduce dependence on initialization
- Some regularization
- *Note*: At test time, the mean and variance accumulated during training (e.g., running averages) should be used instead of statistics calculated from the test batch

- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: 1 \* D


- CNN -> Spatial Batchnorm
    * X: (num_example, channels, height, width)
    * mu, sigma: 1 \* C \* 1 \* 1
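
A forward-pass sketch matching the shapes above. The running-average bookkeeping (`running` dict, `momentum=0.9`) and the function names are assumptions; the backward pass is covered by the linked post and is not shown here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train, momentum=0.9, eps=1e-5):
    # x: (N, D). Statistics are taken over the batch axis, so mu/var are (D,)
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Keep running averages for use at test time
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def spatial_batchnorm_forward(x, gamma, beta, running, train, **kw):
    # x: (N, C, H, W). Fold N, H, W together so each channel gets one mu/var
    n, c, h, w = x.shape
    flat = x.transpose(0, 2, 3, 1).reshape(-1, c)
    out = batchnorm_forward(flat, gamma, beta, running, train, **kw)
    return out.reshape(n, h, w, c).transpose(0, 3, 1, 2)

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 5)) + 2.0
running = {"mean": np.zeros(5), "var": np.ones(5)}
out = batchnorm_forward(x, np.ones(5), np.zeros(5), running, train=True)
print(out.mean(axis=0), out.std(axis=0))
```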

**Layer Normalization**
- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: N \* 1 
    * Note: same behavior during testing

**Instance Normalization**
- CNN
    * X: (num_example, channels, height, width)
    * mu, sigma: N \* C \* 1 \* 1
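
The only difference between these normalizations and batch normalization is the set of axes over which the statistics are computed. A minimal sketch, leaving out the learnable scale and shift for brevity (function names are illustrative):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # x: (N, D). One mu/var per example -> shape (N, 1); no batch
    # statistics are involved, so train and test behave the same
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instancenorm(x, eps=1e-5):
    # x: (N, C, H, W). One mu/var per example and channel -> (N, C, 1, 1)
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x2 = rng.standard_normal((4, 10)) + 3.0
x4 = rng.standard_normal((4, 3, 8, 8)) + 3.0
print(layernorm(x2).mean(axis=1))
print(instancenorm(x4).mean(axis=(2, 3)))
```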
    
**Second Order Optimization**
- No hyperparameters such as the learning rate
- The Hessian has N^2 elements; inverting it takes O(N^3)
- Methods:
    * Quasi-Newton methods (BFGS): O(N^3) -> O(N^2)
    * L-BFGS: Does not form/store the full inverse Hessian.
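
To illustrate why no learning rate is needed, here is a single Newton step on a small quadratic, where it lands exactly on the minimum. The matrix `A` and vector `b` are made-up values for the demo.

```python
import numpy as np

# f(w) = 0.5 * w^T A w - b^T w, with A symmetric positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of f (constant)
b = np.array([1.0, -1.0])

def grad(w):
    return A @ w - b

w = np.array([10.0, -7.0])
# Newton update: w <- w - H^{-1} grad; solve() avoids forming the inverse
w_newton = w - np.linalg.solve(A, grad(w))

print(w_newton, grad(w_newton))
```

For a network with N parameters the Hessian has N^2 entries, which is what makes this step impractical and motivates the approximations listed above.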
    
**Hardware**
- CPU: fewer cores, faster per core, better at sequential tasks
- GPU: more cores, slower per core, better at parallel tasks
- TPU: just for DL (Tensor Processing Unit)
    * Split *One* graph over *Multiple* machines

**Software**
- Caffe (UC Berkeley) / Caffe2 (FB)
- PyTorch (FB)
- TF (Google)
- CNTK (MS)
- Dynamic (e.g., Eager Execution) vs. Static (e.g., TF Lower-level API)

# CNN

- LeNet-5 (CONV-POOL-CONV-POOL-FC-FC)

<img src="https://i.stack.imgur.com/tLKYz.png" width="800">

- AlexNet 8 layers (CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8)

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/AlexNet_1.jpg" width="400">

- ZFNet (similar to AlexNet)
    * Smaller kernel, more filters


- VGGNet (*smaller* filters and *deeper* networks)
    * 16-19 layers (VGG16 / VGG19)
    * Three stacked 3 \* 3 conv layers with stride 1 have the same *effective receptive field* as one 7 \* 7 conv layer
    * But: 1) deeper network and more non-linearity; 2) fewer parameters (3 \* 3 \* 3 = 27 vs. 7 \* 7 = 49 weights per input/output channel pair)
<img src="http://josephpcohen.com/w/wp-content/uploads/Screen-Shot-2016-01-14-at-11.25.15-AM.png" width="800">
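
The receptive-field and parameter comparison for VGG can be worked out directly. A small sketch, assuming stride-1 convolutions with the same number of channels `C` in and out, and ignoring biases:

```python
def stacked_receptive_field(kernel, n_layers):
    # With stride 1, every extra layer widens the receptive field by kernel - 1
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    # Weights for n_layers conv layers with `channels` in and out, no biases
    return n_layers * kernel * kernel * channels * channels

C = 64  # assumed channel count for the comparison
print(stacked_receptive_field(3, 3))   # three 3*3 layers
print(stacked_receptive_field(7, 1))   # one 7*7 layer
print(conv_weights(3, C, n_layers=3), "vs", conv_weights(7, C))
```

Both configurations see a 7 \* 7 window, but the stack uses 27 C^2 weights instead of 49 C^2.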


- GoogLeNet
    * Introduced *inception* Module (Parallel filter operations with multiple kernel size)
    * Problem: Output size too big after filter concatenation
    * The purpose of the 1 \* 1 convolutional layer: 
        - Pooling layer keeps the same depth as input
        - 1 \* 1 layer keeps the spatial dimensions of the input, and reduces depth (for example: 64 \* 56 \* 56 after 32 1 \* 1 conv filters --> 32 \* 56 \* 56)
        - Reduce total number of operations
    
<img src="https://www.researchgate.net/profile/Bo_Zhao48/publication/312515254/figure/fig3/AS:489373281067012@1493687090916/nception-module-of-GoogLeNet-This-figure-is-from-the-original-paper-10.jpg" width="400">
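
A 1 \* 1 convolution is a per-pixel linear map across channels, which the sketch below shows with the 64 \* 56 \* 56 --> 32 \* 56 \* 56 example from the notes. The follow-up 5 \* 5 layer with 64 filters used in the operation count is an assumed configuration for illustration, not the actual GoogLeNet numbers.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in). Mixes channels at each pixel,
    # leaving height and width unchanged
    return np.einsum("oc,chw->ohw", w, x)

def conv_macs(h, w, c_in, c_out, k):
    # Multiply-accumulate count for a k*k conv with "same" output size
    return h * w * c_out * k * k * c_in

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 56, 56))
w = rng.standard_normal((32, 64))
y = conv1x1(x, w)
print(x.shape, "->", y.shape)

direct = conv_macs(56, 56, 64, 64, 5)
reduced = conv_macs(56, 56, 64, 32, 1) + conv_macs(56, 56, 32, 64, 5)
print("5*5 directly:", direct, " with 1*1 reduction first:", reduced)
```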


- ResNet
    * Use network layers to fit a *Residual mapping* instead of directly fitting a desired underlying mapping
    * Residual blocks are stacked
    * Similar to GoogLeNet, can use a *bottleneck* layer (1 \* 1 conv layer) to reduce depth and improve efficiency
    
<img src="https://www.researchgate.net/profile/Antonio_Theophilo/publication/321347448/figure/fig2/AS:565869411815424@1511925189281/Bottleneck-Blocks-for-ResNet-50-left-identity-shortcut-right-projection-shortcut.png" width="500">
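
The residual idea can be sketched with a fully connected block instead of conv layers, to keep the example short. This is a simplification under that assumption, with made-up sizes; it only illustrates the `F(x) + x` structure.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # The layers fit the residual F(x); the shortcut adds x back
    f = relu(x @ w1) @ w2
    return relu(f + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = np.zeros((8, 8))   # if F(x) = 0 the block just passes x through

print(np.allclose(residual_block(x, w1, w2), relu(x)))
```

When the residual weights are zero the block reduces to the shortcut (up to the final ReLU), so stacking more blocks does not have to make the mapping harder to fit.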



# RNN

- What is the problem with RNN
    * Gradient Vanishing/Exploding with Vanilla RNN
    * Computing the gradient with respect to $ h_0 $ involves many repeated multiplications by **W** and by the derivative of the **tanh** activation
    * Brief proof:
    
    $ \frac{ \partial E_t} { \partial w}
    = \sum_{k=1}^{t}  \frac{ \partial E_t} { \partial y_t}
                      \frac{ \partial y_t} { \partial h_t}
                      \frac{ \partial h_t} { \partial h_k}
                      \frac{ \partial h_k} { \partial w}$
    Here:$ h_t = W_{hh} f(h_{t-1}) + W_{hx} X_t$              
    $ \frac{ \partial h_t} { \partial h_k}
    = \prod_{j= k + 1}^{t} \frac{ \partial h_j} { \partial h_{j-1}} $
    
    $ \| \frac{ \partial h_t} { \partial h_k} \|
    \leq (\beta_W \beta_h)^{t-k} $, where the $\beta$ terms are upper bounds on the matrix norms:
    
    $ \| W^T_{hh} \| \leq \beta_W $ and $ \| diag(f'(h_{j-1})) \| \leq \beta_h $
    * If $ \beta_W \beta_h < 1 $, the bound shrinks exponentially in $ t-k $, so the network cannot learn dependencies between inputs that are far apart in time

<img src="https://qph.fs.quoracdn.net/main-qimg-d63725db196675d327f3e4578c48701b" width="500">
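
The bound in the brief proof above can be observed numerically. This sketch follows the recurrence $ h_t = W_{hh} f(h_{t-1}) $ with f = tanh and, as a simplifying assumption, drops the input term. The hidden size, the seed, and the choice of a scaled orthogonal `W` (so that its norm is exactly `beta_W`) are made up for the demo.

```python
import numpy as np

def jacobian_product_norm(W, h0, steps):
    # Spectral norm of dh_t / dh_0 = prod_j W diag(f'(h_{j-1}))
    h = h0
    J = np.eye(len(h0))
    for _ in range(steps):
        J_step = W @ np.diag(1.0 - np.tanh(h) ** 2)   # dh_j / dh_{j-1}
        J = J_step @ J
        h = W @ np.tanh(h)                            # next hidden state
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
n = 16
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal matrix
h0 = rng.standard_normal(n)
beta_W = 0.9
W = beta_W * Q          # norm of W is beta_W; for tanh, beta_h = 1

for t in (1, 10, 50):
    print(t, jacobian_product_norm(W, h0, t), "bound:", beta_W ** t)
```

The measured norm stays below the bound and decays quickly with the number of steps.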

- How to fix vanishing/exploding gradients?
    * Partial fix for exploding gradients (gradient clipping): if ||g|| > threshold, rescale g so that its norm equals the threshold
    * Initialize W to be identity
    * Use ReLU as activation function f
- Main Idea of LSTM
    * **Forget Gate** (\*): how much old memory we want to keep; element-wise multiplication with old memory $ C_{t-1} $. The parameters are learned as $ W_f $, i.e., $ \sigma(W_f [h_{t-1}, X_t] + b_f) = f_t $. If you want to keep all old memory, then $ f_t $ equals 1. After getting $ f_t $, multiply it with $ C_{t-1} $<br/><br/>
   
    * **New Memory Gate** (\+), i.e., the input gate
        * How to merge new memory with old memory; element-wise summation, decides how to combine *new* memory with *old* memory. The weighting parameters are learned as $ W_i $, i.e., $ \sigma(W_i [h_{t-1}, X_t] + b_i) = i_t $. 
    
        * What is the new memory itself: $ tanh(W_C [h_{t-1}, X_t] + b_C) = \tilde{C_t} $
    
        * What is the combined memory: $ C_{t-1} * f_t + \tilde{C_t} * i_t = C_t$
    
    * **Output gate**: how much of the combined memory we want to output; it is applied to the combined memory $ C_t $. $ \sigma(W_o [h_{t-1}, X_t] + b_o) = o_t $. Then the final output $ h_t $ would be $ o_t * tanh(C_t) = h_t $
    
    
    
- Why LSTM prevents gradient vanishing?
    - *Linear* (additive) connection between $C_t$ and $C_{t-1}$ rather than repeated multiplication by a weight matrix
    - Forget gate controls and keeps long-distance dependency
    - Allows error to flow at different strength based on inputs
    - During initialization: Initialize forget gate bias to one: default to remembering
    - See proof: https://weberna.github.io/blog/2017/11/15/LSTM-Vanishing-Gradients.html

<img src="https://cdn-images-1.medium.com/max/1600/1*laH0_xXEkFE0lKJu54gkFQ.png" width="500">
<img src="https://cdn-images-1.medium.com/max/1600/0*LyfY3Mow9eCYlj7o." width="500">
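
The LSTM gate equations can be put together into one time step. This is a minimal NumPy sketch: the parameter dictionary, the initialization scale, and the sizes are assumptions for illustration, while the symbols (`W_f`, `W_i`, `W_C`, `W_o`, and the biases) follow the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, X_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])        # new memory (input) gate
    c_tilde = np.tanh(p["W_C"] @ z + p["b_C"])    # candidate new memory
    c_t = c_prev * f_t + c_tilde * i_t            # combined memory
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                      # final output
    return h_t, c_t

def init_params(hidden, inputs, rng):
    p = {}
    for g in ("f", "i", "C", "o"):
        scale = np.sqrt(hidden + inputs)
        p["W_" + g] = rng.standard_normal((hidden, hidden + inputs)) / scale
        p["b_" + g] = np.zeros(hidden)
    p["b_f"] = np.ones(hidden)   # forget gate bias of one: default to remembering
    return p

rng = np.random.default_rng(0)
params = init_params(hidden=4, inputs=3, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((5, 3)):   # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```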


- Other variation: Gated Recurrent Unit (GRU)
    * **Update Gate**: How to combine old and new state: $ \sigma(W_z [h_{t-1}, X_t]) = z_t $
    * **Reset Gate**: How much to keep old state: $ \sigma(W_r [h_{t-1}, X_t]) = r_t $
    * **New State**: $ tanh(WX_t + r_t * U h_{t-1}) =\tilde{h_t}$ 
    * **Combine States**: $ z_t * h_{t-1} + (1-z_t) * \tilde{h_t} = h_t $
    * If r=0, ignore/drop previous state for generating new state
    * If z=1, carry information from past through many steps (long-term dependency)

- Bidirectional LSTM
<img src="https://guillaumegenthial.github.io/assets/char_representation.png" width="400">


## Image Captioning
1. From image: [CONV-POOL] \* n --> FC Layer --> (num_example, 4096) written as **v**
2. Hidden layer: $ h = tanh(W_{xh} * X + W_{hh} * h + W_{ih} * \mathbf{v} ) $
3. Output layer: $ y = W_{hy} * h $
4. Get the next input $ X_{t+1} $ by sampling from $ \mathbf{y} $

<img src="https://raw.githubusercontent.com/yunjey/pytorch-tutorial/master/tutorials/03-advanced/image_captioning/png/model.png" width="500">
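
One decoding step of the captioning RNN, following steps 2 to 4 above. This is a sketch under several assumptions: a single example rather than a batch, small made-up dimensions instead of 4096, and a softmax over `y` before sampling (the notes only say "sampling"). The sampled index would then be looked up in a word-embedding table to form $ X_{t+1} $.

```python
import numpy as np

def caption_step(x_t, h_prev, v, W_xh, W_hh, W_ih, W_hy, rng):
    # Hidden layer: combines current word, previous state, image feature v
    h = np.tanh(W_xh @ x_t + W_hh @ h_prev + W_ih @ v)
    # Output layer: unnormalized scores over the vocabulary
    y = W_hy @ h
    # Assumed: turn scores into probabilities with a softmax, then sample
    probs = np.exp(y - y.max())
    probs /= probs.sum()
    next_word = rng.choice(len(probs), p=probs)
    return next_word, h, probs

rng = np.random.default_rng(0)
embed, hidden, feat, vocab = 6, 8, 10, 12      # illustrative sizes
W_xh = rng.standard_normal((hidden, embed)) * 0.1
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_ih = rng.standard_normal((hidden, feat)) * 0.1
W_hy = rng.standard_normal((vocab, hidden)) * 0.1

v = rng.standard_normal(feat)                  # image feature from the FC layer
x_t = rng.standard_normal(embed)               # embedding of the current word
word, h, probs = caption_step(x_t, np.zeros(hidden), v, W_xh, W_hh, W_ih, W_hy, rng)
print(word, probs.round(3))
```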



## Image captioning with Attention

To be filled after LSTM, TBA