## General Workflow
- Preprocess data (zero-center it, i.e., subtract the mean)
- Identify architecture
- Ensure that we can overfit a small training set to acc = 100%
- Loss not decreasing: learning rate too low
- Loss goes to NaN: learning rate too high
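
A minimal sketch of the zero-centering step, assuming NumPy and made-up data (the array names and sizes are hypothetical). The mean is computed on the training split only and reused for the test split:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical training data
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # hypothetical test data

# Compute the per-feature mean on the training set only.
train_mean = X_train.mean(axis=0)

# Subtract the same training mean from both splits.
X_train_centered = X_train - train_mean
X_test_centered = X_test - train_mean
```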

## Batch Normalization

- Improve gradient flow
- Allow higher learning rates
- Reduce dependence on initialization
- Some regularization
- *Note*: At test time, use the mean and variance accumulated during training (e.g., running averages) instead of statistics computed from the test batch
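
The train/test difference in the note above can be sketched as follows. This is an illustrative NumPy version, not a library API: the function name, the `running` dictionary, and the momentum convention (`running = momentum * running + (1 - momentum) * batch`) are assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train, momentum=0.9, eps=1e-5):
    """Batch norm for x of shape (N, D); `running` holds 'mean' and 'var' of shape (D,)."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Update running statistics for later use at test time.
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # Test time: use statistics collected during training, not the test batch.
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 4
running = {"mean": np.zeros(D), "var": np.ones(D)}
gamma, beta = np.ones(D), np.zeros(D)

x_train = rng.normal(3.0, 2.0, size=(64, D))
out_train = batchnorm_forward(x_train, gamma, beta, running, train=True)

x_test = rng.normal(3.0, 2.0, size=(1, D))  # a single example works at test time
out_test = batchnorm_forward(x_test, gamma, beta, running, train=False)
```

Because test-time statistics are fixed, the output for one test example does not depend on which other examples share its batch.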

## Batch Normalization
- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: 1 \* D


- CNN -> Spatial Batchnorm
    * X: (num_example, channels, height, width)
    * mu, sigma: 1 \* C \* 1 \* 1



## Layer Normalization
- Fully connected layer
    * X: (num_example, dimension)
    * mu, sigma: N \* 1 
    * Note: same behavior during testing

## Instance Normalization
- CNN
    * X: (num_example, channels, height, width)
    * mu, sigma: N \* C \* 1 \* 1
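
The statistic shapes listed for batch, layer, and instance normalization differ only in which axes are averaged. A small NumPy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 5
C, H, W = 3, 4, 4
x_fc = rng.normal(size=(N, D))
x_cnn = rng.normal(size=(N, C, H, W))

# Batch norm (fully connected): one statistic per feature, averaged over the batch.
bn_fc_mu = x_fc.mean(axis=0, keepdims=True)            # shape (1, D)
# Spatial batch norm: one statistic per channel, averaged over batch and space.
bn_cnn_mu = x_cnn.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
# Layer norm (fully connected): one statistic per example, averaged over features.
ln_mu = x_fc.mean(axis=1, keepdims=True)               # shape (N, 1)
# Instance norm: one statistic per example and channel, averaged over space.
in_mu = x_cnn.mean(axis=(2, 3), keepdims=True)         # shape (N, C, 1, 1)
```

Layer and instance normalization never average over the batch axis, which is why they behave the same at training and test time.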

## Second Order Optimization
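- No hyperparameters such as a learning rate (for the plain Newton step)
- The Hessian has N^2 elements; inverting it takes O(N^3)
- Methods:
    * Quasi-Newton methods (BFGS): approximate the inverse Hessian with low-rank updates over time, O(N^3) -> O(N^2)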
    * L-BFGS: Does not form/store the full inverse Hessian.
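
To illustrate why the plain Newton step needs no learning rate, here is a sketch on a hypothetical quadratic loss, where a single step lands on the minimum. The matrix `A` and vector `b` are made up for the example.

```python
import numpy as np

# Hypothetical quadratic loss f(w) = 0.5 * w^T A w - b^T w, A symmetric positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(w):
    return A @ w - b

hessian = A  # constant for a quadratic

w = np.array([10.0, -7.0])  # arbitrary starting point
# Newton step: w <- w - H^{-1} g, using a linear solve instead of an explicit inverse.
w_new = w - np.linalg.solve(hessian, grad(w))
```

The linear solve is the O(N^3) part; for a network with millions of parameters this is what makes the exact method impractical.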

## Hardware
- CPU: fewer cores, faster per core, better at sequential tasks
- GPU: more cores, slower per core, better at parallel tasks
- TPU: just for DL (Tensor Processing Unit)
    * Split *One* graph over *Multiple* machines

## Software
- Caffe (UC Berkeley), Caffe2 (FB)
- PyTorch (FB)
- TF (Google)
- CNTK (MS)
- Dynamic (e.g., Eager Execution) vs. Static (e.g., TF Lower-level API)

## Different CNN architectures

- LeNet-5 (CONV-POOL-CONV-POOL-FC-FC)

<img src="https://i.stack.imgur.com/tLKYz.png" width="800">

- AlexNet 8 layers (CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8)

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/AlexNet_1.jpg" width="400">

- ZFNet (similar to AlexNet)
    * Smaller kernel, more filters

- VGGNet (*smaller* filters and *deeper* networks)
    * 16 to 19 layers (VGG16, VGG19)
    * Three stacked 3 \* 3 conv layers (stride 1) have the same *effective receptive field* as one 7 \* 7 layer
    * But: 1) deeper network and more non-linearity; 2) fewer parameters (3 \* (3 \* 3 \* C \* C) vs. 7 \* 7 \* C \* C for C channels per layer)
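
The receptive-field and parameter comparison between three stacked 3 \* 3 layers and one 7 \* 7 layer can be checked with a few lines of arithmetic. The helper names and the channel count are hypothetical, and biases are ignored:

```python
def conv_params(kernel, channels, layers=1):
    """Weights (no biases) in `layers` stacked kernel x kernel convs, `channels` in and out."""
    return layers * kernel * kernel * channels * channels

def receptive_field(kernel, layers):
    """Effective receptive field of `layers` stacked stride-1 convs with the same kernel."""
    return 1 + layers * (kernel - 1)

C = 64  # hypothetical channel count
three_3x3 = conv_params(3, C, layers=3)  # 27 * C * C
one_7x7 = conv_params(7, C)              # 49 * C * C
```
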
<img src="http://josephpcohen.com/w/wp-content/uploads/Screen-Shot-2016-01-14-at-11.25.15-AM.png" width="800">

- GoogLeNet
    * Introduced the *inception* module (parallel filter operations with multiple kernel sizes)
    * Problem: Output size too big after filter concatenation
    * The purpose of the 1 \* 1 convolutional layer: 
        - Pooling layer keeps the same depth as input
        - 1 \* 1 layer keeps the spatial dimensions of the input and reduces depth (for example: 64 \* 56 \* 56 after 32 filters of 1 \* 1 conv --> 32 \* 56 \* 56)
        - Reduce total number of operations
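
A 1 \* 1 convolution is a linear map over the channel axis applied independently at every pixel. The sketch below reproduces the 64 \* 56 \* 56 --> 32 \* 56 \* 56 example with random (hypothetical) weights and no bias:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 56, 56))  # (depth, height, width)
w = rng.normal(size=(32, 64))      # 32 filters, each of size 1 x 1 x 64

# Mix channels at every spatial position; height and width are untouched.
y = np.einsum("oc,chw->ohw", w, x)  # shape (32, 56, 56)
```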
    
<img src="https://www.researchgate.net/profile/Bo_Zhao48/publication/312515254/figure/fig3/AS:489373281067012@1493687090916/nception-module-of-GoogLeNet-This-figure-is-from-the-original-paper-10.jpg" width="400">

- ResNet
    * Use network layers to fit a *Residual mapping* instead of directly fitting a desired underlying mapping
    * Residual blocks are stacked
    * Similar to GoogLeNet, can use a *bottleneck* layer (1 \* 1 conv layer) to reduce depth and improve efficiency
    
<img src="https://www.researchgate.net/profile/Antonio_Theophilo/publication/321347448/figure/fig2/AS:565869411815424@1511925189281/Bottleneck-Blocks-for-ResNet-50-left-identity-shortcut-right-projection-shortcut.png" width="500">



## Example of RNN: Image Captioning
1. From image: [CONV-POOL] \* n --> FC layer --> (num_example, 4096), written as **v**
2. Hidden layer: $ h_t = \tanh(W_{xh} X_t + W_{hh} h_{t-1} + W_{ih} \mathbf{v}) $
3. Output layer: $ y_t = W_{hy} h_t $
4. Get the next input $ X_{t+1} $ by sampling from the softmax of $ y_t $

<img src="https://raw.githubusercontent.com/yunjey/pytorch-tutorial/master/tutorials/03-advanced/image_captioning/png/model.png" width="500">



## Image captioning with Attention

To be filled after LSTM, TBA

## LSTM

- What is the problem with RNN
    * Gradient Vanishing/Exploding with Vanilla RNN
    * Computing the gradient with respect to $ h_0 $ involves repeated multiplication by **W** and by the derivative of the **tanh** activation
    * Brief proof:
    
    $ \frac{ \partial E_t} { \partial w}
    = \sum_{k=1}^{t}  \frac{ \partial E_t} { \partial y_t}
                      \frac{ \partial y_t} { \partial h_t}
                      \frac{ \partial h_t} { \partial h_k}
                      \frac{ \partial h_k} { \partial w}$
    Here:$ h_t = W_{hh} f(h_{t-1}) + W_{hx} X_t$              
    $ \frac{ \partial h_t} { \partial h_k}
    = \prod_{j= k + 1}^{t} \frac{ \partial h_j} { \partial h_{j-1}} $
    
    $ \| \frac{ \partial h_t} { \partial h_k} \|
    \leq (\beta_W \beta_h)^{t-k} $, where each $\beta$ is an upper bound on a matrix norm:
    
    $ \| W^T_{hh} \| \leq \beta_W $ and $ \| diag(f'(h_{j-1})) \| \leq \beta_h $
    * If $ \beta_W \beta_h < 1 $, the gradient shrinks exponentially in $ t-k $, so the model cannot learn dependencies between data points that are far apart in time
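
The bound above can be observed numerically. The sketch below uses $ f = \tanh $ and a random $ W_{hh} $ rescaled so that its spectral norm is 0.9; the sizes are hypothetical, and the input term is left out because it only shifts $ h $, not the form of the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50  # hypothetical hidden size and number of time steps

W_hh = rng.normal(size=(H, H))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)  # rescale so the spectral norm beta_W is 0.9

h = rng.normal(size=H)
jac = np.eye(H)  # accumulates d h_t / d h_0
norms = []
for _ in range(T):
    # Jacobian of h_j = W_hh tanh(h_{j-1}) + W_hx X_j with respect to h_{j-1}.
    step_jac = W_hh @ np.diag(1.0 - np.tanh(h) ** 2)
    jac = step_jac @ jac
    norms.append(np.linalg.norm(jac, 2))
    h = W_hh @ np.tanh(h)  # input term omitted in this sketch
```

Since $ \tanh' \leq 1 $, here $ \beta_h = 1 $ and the norm after `T` steps is at most `0.9 ** T`.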

<img src="https://qph.fs.quoracdn.net/main-qimg-d63725db196675d327f3e4578c48701b" width="500">

- How to fix vanishing/exploding gradients?
    * Partial fix for exploding gradients (gradient clipping): if ||g|| > threshold, rescale g so that its norm equals the threshold
    * Initialize W to be identity
    * Use ReLU as activation function f
- Main Idea of LSTM
    * **Forget Gate** (\*): how much old memory we want to keep; element-wise multiplication with old memory $ C_{t-1} $. The parameters are learned as $ W_f $, i.e., $ f_t = \sigma(W_f [h_{t-1}, X_t] + b_f) $. To keep all of the old memory, $ f_t $ equals 1. After getting $ f_t $, multiply it with $ C_{t-1} $<br/><br/>
   
    * **New Memory Gate** (\+)
        * How to merge new memory with old memory; element-wise summation that decides how to combine *new* memory with *old* memory. The weighting parameters are learned as $ W_i $, i.e., $ i_t = \sigma(W_i [h_{t-1}, X_t] + b_i) $.
    
        * What is the new memory itself: $ \tilde{C_t} = \tanh(W_C [h_{t-1}, X_t] + b_C) $
    
        * What is the combined memory: $ C_{t-1} * f_t + \tilde{C_t} * i_t = C_t$
    
    * **Output gate**: how much of the combined memory we want to output as the hidden state. The gate is computed as $ o_t = \sigma(W_o [h_{t-1}, X_t] + b_o) $, and the final output is $ h_t = o_t * \tanh(C_t) $
    
    
    
- Why does LSTM mitigate vanishing gradients?
    - *Additive* (linear) connection between $C_t$ and $C_{t-1}$ rather than repeated multiplication by a weight matrix
    - Forget gate controls and keeps long-distance dependency
    - Allows error to flow at different strength based on inputs
    - During initialization: Initialize forget gate bias to one: default to remembering
    - See proof: https://weberna.github.io/blog/2017/11/15/LSTM-Vanishing-Gradients.html
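
The gate equations above can be put together into one LSTM time step. This is a sketch in NumPy under stated assumptions: the parameter dictionary, the sizes, and the random weights are hypothetical, and each weight matrix acts on the concatenation $ [h_{t-1}, X_t] $.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; each W_* has shape (H, H + D) and acts on [h_{t-1}, X_t]."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # new memory gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # new memory
    c_t = c_prev * f_t + c_tilde * i_t                         # combined memory
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4  # hypothetical input and hidden sizes
params = {f"W_{g}": rng.normal(scale=0.1, size=(H, H + D)) for g in "fiCo"}
params.update({f"b_{g}": np.zeros(H) for g in "fiCo"})
params["b_f"] = np.ones(H)  # forget-gate bias of one: default to remembering

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, params)
```

Note that `c_t` is updated by an element-wise product and a sum only, which is the additive path the gradient can follow.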

<img src="https://cdn-images-1.medium.com/max/1600/1*laH0_xXEkFE0lKJu54gkFQ.png" width="500">

- Other variation: Gated Recurrent Unit (GRU)
    * **Update Gate**: How to combine old and new state: $ z_t = \sigma(W_z [h_{t-1}, X_t]) $
    * **Reset Gate**: How much of the old state to keep: $ r_t = \sigma(W_r [h_{t-1}, X_t]) $
    * **New State**: $ \tilde{h_t} = \tanh(W X_t + r_t * U h_{t-1}) $
    * **Combine States**: $ h_t = z_t * h_{t-1} + (1-z_t) * \tilde{h_t} $
    * If r=0, ignore/drop the previous state when generating the new state
    * If z=1, carry information from the past through many steps (long-term dependency)
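
A sketch of one GRU step following the equations in these notes (no bias terms). The function name, sizes, and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W, U):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                      # update gate
    r_t = sigmoid(W_r @ concat)                      # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))  # new state
    return z_t * h_prev + (1.0 - z_t) * h_tilde      # combined state

rng = np.random.default_rng(0)
D, H = 3, 4  # hypothetical input and hidden sizes
W_z = rng.normal(size=(H, H + D))
W_r = rng.normal(size=(H, H + D))
W = rng.normal(size=(H, D))
U = rng.normal(size=(H, H))
h_prev = rng.normal(size=H)
x_t = rng.normal(size=D)
h_t = gru_step(x_t, h_prev, W_z, W_r, W, U)
```

The new hidden state is an element-wise convex combination of the old state and the candidate state, so a gate value near one simply copies the old state forward.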

# Machine Translation as RNN

## Problem definition
- Neural Machine Translation (NMT)
- Sequence-to-Sequence (seq2seq) architecture
- Difference from SMT (Statistical MT): model P(y|x) directly instead of splitting it with Bayes' rule into a translation model and a language model
- Advantage: Single NN, less human engineering
- Disadvantage: less interpretable, less control
- Figure (TBA)

## Main Components
- Encoder RNN: encode source sentence, generate hidden state
- Decoder RNN: a conditional **Language Model** that generates the target sentence from the encoder RNN's output; predicts the next word in *y* conditioned on input *x*

## Beam Search
- Greedy decoding problem: taking the argmax at each step cannot undo an earlier choice
    * Instead of taking the argmax at each step, use beam search.
    * Keep the *k* most probable partial translations at each step
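
A toy sketch of beam search, assuming a hypothetical lookup table in place of a real decoder. The table is built so that the greedy choice at step one leads to a less probable full sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"c": 0.5, "d": 0.5},
    ("b",): {"c": 0.9, "d": 0.1},
}

def beam_search(model, k, steps):
    """Return the k best (log_prob, sequence) pairs after `steps` tokens."""
    beams = [(0.0, ())]
    for _ in range(steps):
        candidates = []
        for log_p, seq in beams:
            for token, p in model[seq].items():
                candidates.append((log_p + math.log(p), seq + (token,)))
        # Keep only the k most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams

greedy = beam_search(TOY_MODEL, k=1, steps=2)[0]  # beam size 1 is greedy decoding
best = beam_search(TOY_MODEL, k=2, steps=2)[0]
```

With `k=1` the search commits to `"a"` (probability 0.6) and ends at total probability 0.3, while `k=2` keeps `"b"` alive and finds `("b", "c")` with probability 0.36.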
    
## Attention model
1. Get encoder hidden states: $ h_1, ..., h_N $
1. Get decoder state: $ s_t $
1. Get attention scores by dot product: 
$ \mathbf e^t = [s^T_t h_1, ..., s^T_t h_N] $
1. Take the softmax of $ \mathbf e^t $ to get $ \alpha_t $, whose entries sum to one
1. Take the weighted sum of the hidden states **h** with weights $\mathbf\alpha_t$ to get **a**
1. Concatenate **a** and **s** in the decoder RNN
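
The steps above can be sketched in NumPy for a single decoder step. The sizes and random states are hypothetical stand-ins for real encoder and decoder outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 6, 8                  # hypothetical source length and hidden size
h = rng.normal(size=(N, d))  # encoder hidden states h_1, ..., h_N
s_t = rng.normal(size=d)     # decoder state at step t

e_t = h @ s_t                          # attention scores s_t^T h_i, shape (N,)
alpha_t = softmax(e_t)                 # attention distribution, sums to one
a_t = alpha_t @ h                      # weighted sum of encoder states, shape (d,)
combined = np.concatenate([a_t, s_t])  # [a_t; s_t], shape (2d,)
```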

Advantages:
- Focus on certain parts of source
- Provides shortcut / Bypass bottleneck
- Get some interpretable results and learn alignment

# Word Embeddings / Word2vec

- Why other options do not work
    * One-hot vectors (vocabulary too big; no similarity measure; cannot handle new words)
    * Co-occurrence vectors (matrix of how many times two words appear together within a given window size) -> sparsity
    * Singular Value Decomposition (SVD) of the co-occurrence matrix (too expensive)
    * Use a word's context to represent it --> word embedding
    
    
    
- Key Components
    * Center word *c*, context word *o*
    * Two vectors for each word *w*: $ v_w $ (when *w* is the center word) and $ u_w $ (when *w* is a context word). $\theta$ contains all *u* and *v* (input and output vectors)
    * For example: $ P(w_2|w_1) = P(w_2|w_1; u_{w_2}, v_{w_1}, \theta) $
    * Calculate $ u_w^T v_c $ for each word *w*, and use softmax to derive the probability: $ P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_w \exp(u_w^T v_c)} $
    * After optimizing the loss, get two vectors for each word. Combine them, or use only *u* or only *v*
    
    
    
- Variation
    * Skip-gram (SG): given the center word, predict the context
    * Continuous Bag of Words (CBOW): given the bag of context words, predict the center word
    * Negative sampling (maximize p of actual context + minimize p of random context, i.e., noise)
    * GloVe: combine count-based and direct-prediction methods

# Word Window Classification
- Difference with typical ML: learn both **W** and word vectors **x**
- Task definition: classify a word in its *Context Window*
    * Do not classify a single word in isolation: ambiguity
    * Do not just average over the window: loses position information
    * Get a vector X of length 5d, where 5 is the window size and d is the embedding size
    * Predict y from the softmax of WX and minimize the cross-entropy error
    
    
    
- What happens for x:
    * Updated by gradient descent just like the weights W
    * Pushed into an area helpful for the classification task
    * Example: the vector for the word *in*, $X_{in}$, may become a cue that a location follows (one of the *Named Entity Recognition* tasks)
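
A sketch of window classification showing that the same error signal produces gradients for both **W** and the word vectors. The sizes, the random embeddings, and the true class are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, window, num_classes = 4, 5, 3             # hypothetical sizes
word_vectors = rng.normal(size=(window, d))  # embeddings of the 5 words in the window
W = rng.normal(size=(num_classes, window * d))
y = 1                                        # hypothetical true class of the center word

x = word_vectors.reshape(-1)  # concatenated window vector of length 5d
scores = W @ x
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[y])      # cross-entropy error

delta = probs.copy()
delta[y] -= 1.0               # d loss / d scores
grad_W = np.outer(delta, x)   # gradient for the weights
grad_x = W.T @ delta          # gradient flowing back into the word vectors
grad_word_vectors = grad_x.reshape(window, d)
```

Each row of `grad_word_vectors` would update the embedding of one word in the window.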

# Language Modeling
- Task Definition: Predict next word
- Approach 1: N-gram model
    * Use counts of n-grams of different lengths as they appear in the corpus
    * Main problem: *Sparsity*
  

- Approach 2: NN Model with Fixed Window Size
    * No Sparsity problem
    * Model size reduced
    * BUT: inputs at different window positions do not share weights, and the window size has to be chosen by hand


- Approach 3: RNN Model
    * Any  sequence length will work
    * Weights shared, model size doesn't increase
    * BUT: computation is slow (time steps must be processed sequentially) and in practice it is hard to access information from many steps back


- Application
    * One-to-one: tagging each word
    * Many-to-one: sentiment analysis
    * Encoder module: example: element-wise max of all hidden states -> as input for further NN model
    
    
    
- Alternative tasks in NLP - *Conditioned* Language Models
    * Speech recognition
    * Machine translation
    * Generate summary
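
As a small illustration of Approach 1 (the n-gram model) and its sparsity problem, here is a bigram sketch on a hypothetical toy corpus using maximum-likelihood counts without smoothing:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # hypothetical toy corpus

unigram_counts = Counter(corpus[:-1])  # counts of words that have a successor
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def p_next(word, prev):
    """Maximum-likelihood estimate of P(word | prev) from bigram counts."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen context: the sparsity problem
    return bigram_counts[(prev, word)] / unigram_counts[prev]
```

Here `p_next("cat", "the")` is 2/3, while any pair that never occurs in the corpus, such as `p_next("dog", "the")`, gets probability zero.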