# Recurrent neural network

## Why not a standard network

- inputs and outputs can have different lengths in different examples
- a standard network doesn't share features learned across different positions of the text

## Recurrent neural network

- $a^{<0>} = \overrightarrow{0}$
- $a^{<1>} = g_{1}(W_{aa}a^{<0>} + W_{ax}X^{<1>} + b_{a})$ (tanh/relu)
- $\hat{y}^{<1>} = g_{2}(W_{ya}a^{<1>} + b_{y})$ (sigmoid)
- $a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}X^{<t>} + b_{a}) = g(W_{a}[a^{<t-1>}, X^{<t>}] + b_{a})$ where $W_{a} = [W_{aa} \vdots W_{ax}]$ and $[a^{<t-1>}, X^{<t>}] = \begin{bmatrix} a^{<t-1>} \\ X^{<t>} \end{bmatrix}$
- $\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_{y}) = g(W_{y}a^{<t>} + b_{y})$ 
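
The recurrence above can be sketched in NumPy. This is a minimal illustration, assuming tanh for the hidden activation and sigmoid for the output; the layer sizes, random weights, and the name `rnn_step` are made up for the example and are not from the notes.

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step of a basic RNN cell: returns (a_t, y_hat_t)."""
    # Hidden state: a<t> = tanh(W_aa a<t-1> + W_ax x<t> + b_a)
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # Output: y_hat<t> = sigmoid(W_ya a<t> + b_y)
    y_hat_t = 1.0 / (1.0 + np.exp(-(W_ya @ a_t + b_y)))
    return a_t, y_hat_t

rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 3, 1, 5          # illustrative sizes (assumption)
W_aa = rng.normal(size=(n_a, n_a)) * 0.1
W_ax = rng.normal(size=(n_a, n_x)) * 0.1
W_ya = rng.normal(size=(n_y, n_a)) * 0.1
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

xs = rng.normal(size=(T_x, n_x))         # one input vector per time step
a = np.zeros(n_a)                        # a<0> = 0
y_hats = []
for x_t in xs:                           # the same weights are reused at every step
    a, y_hat = rnn_step(a, x_t, W_aa, W_ax, W_ya, b_a, b_y)
    y_hats.append(y_hat)

# Stacked form: W_a = [W_aa | W_ax] acts on the concatenation [a<t-1>; x<t>]
W_a = np.hstack([W_aa, W_ax])
```

Because `W_a` is just the two weight matrices placed side by side, `W_a @ np.concatenate([a_prev, x_t])` gives the same pre-activation as the two separate products.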

## Backpropagation through time

- $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log\hat{y}^{<t>} - (1-y^{<t>})\log(1-\hat{y}^{<t>})$
- $L(\hat{y},y) = \displaystyle\sum_{t=1}^{T_{y}}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$

## RNN types

- one to many (ex. music generation)
- many to one (ex. sentiment classification)
- many to many, $T_{x} = T_{y}$ (ex. named entity recognition)
- many to many, $T_{x} \ne T_{y}$ (ex. machine translation)

## Language modelling

- ex. P(The apple and pair salad) = $3.2 \times 10^{-13}$, P(The apple and pear salad) = $5.7 \times 10^{-13}$
- ex. "cats average 15 hours of sleep a day. (EOS)"
    - $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -\displaystyle\sum_{i}y_{i}^{<t>}\log\hat{y}_{i}^{<t>}$
    - $L = \displaystyle\sum_{t}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
    - $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})P(y^{<2>}|y^{<1>})P(y^{<3>}|y^{<1>},y^{<2>})$
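
A small sketch of how the chain rule and the loss fit together: with one-hot targets, each $L^{<t>}$ reduces to minus the log of the probability assigned to the true word, so the total loss is minus the log-probability of the whole sentence. The per-step probabilities below are hypothetical numbers, not model output.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log P(y<t> | y<1..t-1>) over the words that actually occurred."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities the model assigned to the true word at each step:
# P(y1), P(y2 | y1), P(y3 | y1, y2)
step_probs = [0.2, 0.5, 0.1]

log_p = sequence_log_prob(step_probs)
p_sentence = math.exp(log_p)   # product of the three conditionals
loss = -log_p                  # L = sum_t L<t> when the targets are one-hot
```

Working in log space avoids multiplying many small numbers together, which matters once sentences get longer than a few words.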
    
## Vanishing gradients with RNNs

- ex. "the cats, which, ..., were full" vs "the cat, which, ..., was full"
- capturing long-term dependencies is hard

## Gated recurrent unit

- RNN unit
    - $a^{<t>} = g(W_{a}[a^{<t-1>}, x^{<t>}] + b_{a})$ (where $g$ is tanh)
- GRU
    - let $c$ = memory cell
    - $c^{<t>} = a^{<t>}$
    - $\tilde{c}^{<t>} = \tanh(W_{c}[c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$ (if vectors, multiplications are element-wise)
- Full GRU
    - $\tilde{c}^{<t>} = \tanh(W_{c}[\Gamma_{r}c^{<t-1>},x^{<t>}] + b_{c})$
    - $\Gamma_{u} = \sigma(W_{u}[c^{<t-1>},x^{<t>}] + b_{u})$
    - $\Gamma_{r} = \sigma(W_{r}[c^{<t-1>},x^{<t>}] + b_{r})$
    - $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + (1-\Gamma_{u}){c}^{<t-1>}$
    - $a^{<t>} = c^{<t>}$
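
The full GRU equations above translate almost line for line into NumPy. The sketch below assumes 1-D vectors and element-wise gate products; the function name `gru_step`, the sizes, and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, b_u, b_r):
    """One step of the full GRU; returns c<t>, which is also a<t>."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(W_u @ concat + b_u)      # update gate
    gamma_r = sigmoid(W_r @ concat + b_r)      # relevance gate
    # Candidate memory uses the relevance-gated previous cell
    c_tilde = np.tanh(W_c @ np.concatenate([gamma_r * c_prev, x_t]) + b_c)
    # Element-wise blend of candidate and previous memory
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 3, 2                                # illustrative sizes (assumption)
W_c, W_u, W_r = (rng.normal(size=(n_c, n_c + n_x)) for _ in range(3))
b_c, b_u, b_r = np.zeros(n_c), np.zeros(n_c), np.zeros(n_c)
c_prev, x_t = rng.normal(size=n_c), rng.normal(size=n_x)

c_t = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, b_u, b_r)

# Forcing the update gate towards 0 leaves the memory cell essentially unchanged
c_keep = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, np.full(n_c, -50.0), b_r)
```

The last call shows why the gate helps with long-range dependencies: when $\Gamma_{u} \approx 0$, $c^{<t>} \approx c^{<t-1>}$, so the value is carried across many steps with little decay.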
    
## Long short term memory (LSTM)

- $\tilde{c}^{<t>} = \tanh(W_{c}[a^{<t-1>},x^{<t>}] + b_{c})$
- $\Gamma_{u} = \sigma(W_{u}[a^{<t-1>},x^{<t>}] + b_{u})$ (update)
- $\Gamma_{f} = \sigma(W_{f}[a^{<t-1>},x^{<t>}] + b_{f})$ (forget)
- $\Gamma_{o} = \sigma(W_{o}[a^{<t-1>},x^{<t>}] + b_{o})$ (output)
- $c^{<t>} = \Gamma_{u}\tilde{c}^{<t>} + \Gamma_{f}c^{<t-1>}$
- $a^{<t>} = \Gamma_{o}\tanh(c^{<t>})$
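
A minimal NumPy sketch of one LSTM step, assuming the standard formulation in which all three gates are computed from $[a^{<t-1>}, x^{<t>}]$ and the forget gate scales $c^{<t-1>}$. Storing the weights in dicts keyed by `'c'`, `'u'`, `'f'`, `'o'` is a convenience chosen for this example, not a convention from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, W, b):
    """One LSTM step; W and b are dicts keyed by 'c', 'u', 'f', 'o'."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(W["c"] @ concat + b["c"])    # candidate memory
    gamma_u = sigmoid(W["u"] @ concat + b["u"])    # update gate
    gamma_f = sigmoid(W["f"] @ concat + b["f"])    # forget gate
    gamma_o = sigmoid(W["o"] @ concat + b["o"])    # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev     # separate update and forget gates
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 3, 2                                    # illustrative sizes (assumption)
W = {k: rng.normal(size=(n_a, n_a + n_x)) for k in "cufo"}
b = {k: np.zeros(n_a) for k in "cufo"}
a_prev, c_prev, x_t = np.zeros(n_a), rng.normal(size=n_a), rng.normal(size=n_x)

a_t, c_t = lstm_step(a_prev, c_prev, x_t, W, b)

# Update gate near 0 and forget gate near 1: the memory cell is carried over
b_keep = dict(b, u=np.full(n_a, -50.0), f=np.full(n_a, 50.0))
_, c_keep = lstm_step(a_prev, c_prev, x_t, W, b_keep)
```

Unlike the GRU, the LSTM does not tie the two gates together as $\Gamma_{u}$ and $1-\Gamma_{u}$, so it can add new content and keep old content at the same time.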

## Bidirectional RNN

- getting information from the future
    - ex. He said, "Teddy bears are on sale!"
    - ex. He said, "Teddy Roosevelt was a great President"
- $\hat{y}^{<t>} = g(W_{y}[\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_{y})$

## Word representation

- ex.

<table>
<tr>
    <th></th>
    <th>Man (5391)</th>
    <th>Woman (9853)</th>
    <th>King (4914)</th>
    <th>Queen (7157)</th>
    <th>Apple (456)</th>
    <th>Orange (6257)</th>
</tr>
<tr>
    <td>Gender</td>
    <td>-1</td>
    <td>1</td>
    <td>-0.95</td>
    <td>0.97</td>
    <td>0.00</td>
    <td>0.01</td>
</tr>
<tr>
    <td>Royal</td>
    <td>0.01</td>
    <td>0.02</td>
    <td>0.93</td>
    <td>0.95</td>
    <td>-0.01</td>
    <td>0.00</td>
</tr>
<tr>
    <td>Age</td>
    <td>0.03</td>
    <td>0.02</td>
    <td>0.7</td>
    <td>0.69</td>
    <td>0.03</td>
    <td>-0.02</td>
</tr>
<tr>
    <td>Food</td>
    <td>0.04</td>
    <td>0.01</td>
    <td>0.02</td>
    <td>0.01</td>
    <td>0.95</td>
    <td>0.97</td>
</tr>
</table>
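
To make the featurized representation concrete, the sketch below treats each column of the table as a 4-dimensional vector (Gender, Royal, Age, Food) and compares words with cosine similarity. It takes the Gender feature as -1 for Man and +1 for Woman; the helper names `cosine` and `analogy` are illustrative.

```python
import math

# Feature order: Gender, Royal, Age, Food (Woman's Gender taken as +1)
emb = {
    "man":    [-1.00,  0.01,  0.03, 0.04],
    "woman":  [ 1.00,  0.02,  0.02, 0.01],
    "king":   [-0.95,  0.93,  0.70, 0.02],
    "queen":  [ 0.97,  0.95,  0.69, 0.01],
    "apple":  [ 0.00, -0.01,  0.03, 0.95],
    "orange": [ 0.01,  0.00, -0.02, 0.97],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Word w (other than a, b, c) whose vector is closest to e_b - e_a + e_c."""
    target = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

sim_fruit = cosine(emb["apple"], emb["orange"])   # close to 1: both are foods
sim_mixed = cosine(emb["apple"], emb["king"])     # close to 0: little in common
best = analogy("man", "king", "woman")            # "man is to king as woman is to ?"
```

With these values `best` comes out as `"queen"`, since subtracting `man` and adding `woman` flips the Gender feature while leaving Royal and Age nearly unchanged.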

## Transfer learning and word embeddings

- learn word embeddings from large text corpus (1-100B words) (or download pre-trained embedding online)
- transfer embedding to new task with smaller training set (say, 100k words)
- (optional) continue to fine tune the word embeddings with new data

## Word2Vec ("skip gram")

- vocab size = 10,000
- context c ("orange") $\rightarrow$ target t ("juice")
- $o_{c} \rightarrow E \rightarrow e_{c} \rightarrow o$ (softmax) $\rightarrow \hat{y}$
- softmax: $p(t|c) = \dfrac{e^{\theta_{t}^{T}e_{c}}}{\displaystyle\sum_{j=1}^{10000}e^{\theta_{j}^{T}e_{c}}}$
    - $\theta_{t}$ = parameter associated with output $t$
    - $L(\hat{y}, t) = -\displaystyle\sum_{i=1}^{10000}y_{i}\log\hat{y}_{i}$
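
The pipeline from one-hot vector to softmax output can be sketched as follows. The vocabulary is shrunk to 10 words and the parameters are random so the example stays small; the names `E`, `theta`, and `skipgram_probs` mirror the notation above but are otherwise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10, 4                       # toy sizes (assumption)
E = rng.normal(size=(emb_dim, vocab_size))        # embedding matrix, one column per word
theta = rng.normal(size=(vocab_size, emb_dim))    # row j holds theta_j

def skipgram_probs(c):
    """p(t | c) for every target t, given the index c of the context word."""
    o_c = np.zeros(vocab_size)
    o_c[c] = 1.0                                  # one-hot vector for the context word
    e_c = E @ o_c                                 # e_c = E o_c selects column c of E
    logits = theta @ e_c                          # theta_j^T e_c for every j
    logits = logits - logits.max()                # stabilise the exponentials
    p = np.exp(logits)
    return p / p.sum()

c, t = 3, 7                                       # hypothetical context / target indices
p = skipgram_probs(c)
loss = -np.log(p[t])                              # cross-entropy with a one-hot target
```

The sum in the denominator runs over the whole vocabulary, which is what makes this softmax expensive at 10,000 words or more.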
    
## Negative sampling

- ex. 

<table>
<tr>
    <th>context</th>
    <th>word</th>
    <th>target</th>
</tr>
<tr>
    <td>orange</td>
    <td>juice</td>
    <td>1</td>
</tr>
<tr>
    <td>orange</td>
    <td>king</td>
    <td>0</td>
</tr>
<tr>
    <td>orange</td>
    <td>book</td>
    <td>0</td>
</tr>
<tr>
    <td>orange</td>
    <td>the</td>
    <td>0</td>
</tr>
<tr>
    <td>orange</td>
    <td>of</td>
    <td>0</td>
</tr>
</table>

- $P(y=1|c,t) = \sigma(\theta_{t}^{T}e_{c})$
- selecting negative examples
    - $p(w_{i}) = \dfrac{f(w_{i})^{3/4}}{\displaystyle\sum_{j=1}^{10000}f(w_{j})^{3/4}}$
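
The sampling distribution above can be computed directly from word counts. The counts below are hypothetical, and drawing with `random.Random.choices` is just one simple way to sample from the resulting distribution.

```python
import random

def negative_sampling_probs(freqs):
    """p(w_i) proportional to f(w_i) ** (3/4)."""
    weights = {w: f ** 0.75 for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sample_negatives(probs, k, rng):
    """Draw k words (with replacement) according to probs."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=k)

freqs = {"the": 1000, "of": 500, "king": 10, "book": 5}   # hypothetical counts
probs = negative_sampling_probs(freqs)
negatives = sample_negatives(probs, k=4, rng=random.Random(0))
```

Raising the frequency to the 3/4 power sits between sampling by raw frequency and sampling uniformly: very common words such as "the" are picked less often than their raw share, rare words more often. A fuller implementation would also discard draws that equal the true target word.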
    
## GloVe (global vectors for word representation)

- $X_{ij}$ = number of times $j$ appears in context of $i$ ($i$ = context, $j$ = target)
- $X_{ij} = X_{ji}$ (when the context is a symmetric window around the word)
- minimize $\displaystyle\sum_{i=1}^{10000}\displaystyle\sum_{j=1}^{10000}f(X_{ij})(\theta_{i}^{T}e_{j} + b_{i} + b_{j}^{'} - \log X_{ij})^{2}$
    - $f(X_{ij})$ = weighting term, which is equal to zero if $X_{ij} = 0$
    - $\theta_{i}$ and $e_{j}$ play symmetric roles in the objective
    - $e_{w}^{final} = \dfrac{e_{w} + \theta_{w}}{2}$
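
A sketch of the objective in NumPy. The notes only require that the weighting term vanishes at zero; the specific form used here, $f(x) = (x/x_{max})^{0.75}$ capped at 1 with $x_{max} = 100$, is the choice from the original GloVe paper and should be read as an assumption. The toy co-occurrence matrix and parameters are random.

```python
import numpy as np

def glove_loss(X, theta, e, b, b_prime, x_max=100.0, alpha=0.75):
    """Weighted squared error between theta_i^T e_j + b_i + b'_j and log X_ij."""
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)   # weighting term, f(0) = 0
    log_X = np.log(np.where(X > 0, X, 1.0))              # dummy value where X_ij = 0
    err = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
    return float(np.sum(f * err ** 2))                   # zero-count pairs contribute 0

rng = np.random.default_rng(0)
V, d = 5, 3                                   # toy vocabulary and embedding sizes
X = rng.integers(0, 4, size=(V, V)).astype(float)
X = X + X.T                                   # symmetric window, so X_ij = X_ji
theta = rng.normal(size=(V, d)) * 0.1
e = rng.normal(size=(V, d)) * 0.1
b, b_prime = np.zeros(V), np.zeros(V)

loss = glove_loss(X, theta, e, b, b_prime)
e_final = (e + theta) / 2                     # average the two sets of vectors
```

Setting $f(0) = 0$ is what lets the sum skip pairs that never co-occur, for which $\log X_{ij}$ would be undefined.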
    
## Sentiment classification

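- use a many-to-one RNN: feed the word embeddings of the text in sequence and predict the sentiment from the final activation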

## Sequence to sequence model

- ex. French to English translation
    - Jane visite l'Afrique en septembre $(x^{<1>}, x^{<2>}, x^{<3>}, x^{<4>}, x^{<5>})$
    - Jane is visiting Africa in September $(y^{<1>}, y^{<2>}, y^{<3>}, y^{<4>}, y^{<5>}, y^{<6>})$
- conditional language model
    - $P(y^{<1>} \dots y^{<T_{y}>}|x^{<1>} \dots x^{<T_{x}>})$
- objective
    - $\underset{y^{<1>} \dots y^{<T_{y}>}}{\arg\max}P(y^{<1>} \dots y^{<T_{y}>}|x)$
    
## Beam search

- define beam width (say 3)
- then
    - $P(y^{<1>}|x)$; pick top 3
    - $P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})$; pick top 3
    - and so on
- in general
    - $\underset{y}{\arg\max}\displaystyle\prod_{t=1}^{T_{y}}P(y^{<t>}|x, y^{<1>} \dots y^{<t-1>})$
- length normalization
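    - $\underset{y}{\arg\max}\dfrac{1}{T_{y}}\displaystyle\sum_{t=1}^{T_{y}}\log P(y^{<t>}|x, y^{<1>} \dots y^{<t-1>})$ (taking logs turns the product into a sum and avoids numerical underflow; dividing by $T_{y}$ removes the bias toward short outputs)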
- large beam width B: better result but slower
- small beam width B: worse result but faster
- error analysis
    - ex. Jane visite l'Afrique en septembre
    - human: Jane visits Africa in September $(y^{*})$
    - algorithm: Jane visited Africa last September $(\hat{y})$
    - RNN computes $P(y^{*}|x), P(\hat{y}|x)$
- case 1: $P(y^{*}|x) \gt P(\hat{y}|x)$
    - beam search chose $\hat{y}$, but $y^{*}$ attains higher $P(y|x)$
    - conclusion: beam search is at fault
- case 2: $P(y^{*}|x) \le P(\hat{y}|x)$
    - $y^{*}$ is a better translation than $\hat{y}$, but RNN predicted $P(y^{*}|x) \le P(\hat{y}|x)$
    - RNN model is at fault
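
The search procedure can be sketched in plain Python. The model interface `next_probs(prefix)` and the toy probability table are hypothetical stand-ins for the decoder RNN; scores are sums of log-probabilities, and the final choice divides by the output length as in the length-normalized objective.

```python
import math

def beam_search(next_probs, beam_width, max_len, eos="<EOS>"):
    """next_probs(prefix) -> {token: P(token | x, prefix)} (hypothetical interface)."""
    beams = [((), 0.0)]                       # (prefix, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Extend every kept prefix by every possible next token
        candidates = []
        for prefix, score in beams:
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep only the top `beam_width` hypotheses
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                    # hypotheses cut off at max_len
    # Length normalization: (1 / T_y) * sum of log-probabilities
    return max(finished, key=lambda c: c[1] / len(c[0]))

def toy_model(prefix):
    """Made-up conditional probabilities in which the greedy first choice is a trap."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(prefix, {"<EOS>": 1.0})

greedy_seq, _ = beam_search(toy_model, beam_width=1, max_len=5)
beam_seq, _ = beam_search(toy_model, beam_width=2, max_len=5)
```

With a beam width of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while a width of 2 keeps "b" alive and finds "b z" with probability 0.36. This is the kind of case the error analysis would attribute to the search rather than to the model.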
    
## BLEU score

- evaluate machine translation
    - ex. Le chat est sur le tapis
    - reference 1: The cat is on the mat
    - reference 2: There is a cat on the mat
    - MT output: the the the the the the the $(\hat{y})$
- unigram
    - $P_{1} = \dfrac{\displaystyle\sum_{unigram \in \hat{y}}Count_{clip}(unigram)}{\displaystyle\sum_{unigram \in \hat{y}}Count(unigram)}$
- n-gram
    - $P_{n} = \dfrac{\displaystyle\sum_{n-gram \in \hat{y}}Count_{clip}(n-gram)}{\displaystyle\sum_{n-gram \in \hat{y}}Count(n-gram)}$
- BP (brevity penalty)
    - 1 if MT_output_length $\gt$ reference_output_length
    - exp(1 - reference_output_length / MT_output_length) otherwise
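
The clipped precision and brevity penalty can be checked on the example sentences above. This sketch assumes whitespace tokenization and lowercased text, and it clips each n-gram at its maximum count in any single reference; the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram count divided by total n-gram count in the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(mt_len, ref_len):
    """1 if the MT output is longer than the reference, else exp(1 - ref/mt)."""
    return 1.0 if mt_len > ref_len else math.exp(1 - ref_len / mt_len)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
mt = "the the the the the the the".split()

p1 = modified_precision(mt, refs, 1)   # "the" is clipped at 2, so p1 = 2/7
p2 = modified_precision(mt, refs, 2)   # "the the" never occurs in a reference
bp = brevity_penalty(len(mt), len(refs[0]))
```

Without clipping, the degenerate output would score a unigram precision of 7/7, since every "the" appears somewhere in a reference; clipping brings it down to 2/7.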
    
## Attention model

- $\alpha^{<t,t'>}$ = amount of "attention" $y^{<t>}$ should pay to $a^{<t'>}$: $\alpha^{<t,t'>} = \dfrac{\exp(e^{<t,t'>})}{\displaystyle\sum_{t'=1}^{T_{x}}\exp(e^{<t,t'>})}$
- $a^{<t'>} = (\overrightarrow{a}^{<t'>}, \overleftarrow{a}^{<t'>})$
- $\displaystyle\sum_{t'}\alpha^{<1,t'>} = 1$
- $c^{<1>} = \displaystyle\sum_{t'}\alpha^{<1,t'>}a^{<t'>}$
- $c^{<2>} = \displaystyle\sum_{t'}\alpha^{<2,t'>}a^{<t'>}$
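
A short NumPy sketch of how the attention weights and the context vector are formed for one output step. The scores $e^{<t,t'>}$ are drawn at random here as placeholders, since how they are computed is not covered above; the activations are also random, and `attention_context` is an illustrative name.

```python
import numpy as np

def attention_context(e_t, a):
    """e_t: scores e<t,t'> with shape (T_x,); a: activations with shape (T_x, n)."""
    w = np.exp(e_t - e_t.max())        # subtract the max for numerical stability
    alpha = w / w.sum()                # softmax over t', so the weights sum to 1
    c_t = alpha @ a                    # c<t> = sum_t' alpha<t,t'> a<t'>
    return alpha, c_t

rng = np.random.default_rng(0)
T_x, n_a = 4, 3                                   # illustrative sizes (assumption)
a_fwd = rng.normal(size=(T_x, n_a))               # forward activations
a_bwd = rng.normal(size=(T_x, n_a))               # backward activations
a = np.concatenate([a_fwd, a_bwd], axis=1)        # a<t'> = (forward, backward)

e_1 = rng.normal(size=T_x)                        # placeholder scores e<1,t'>
alpha_1, c_1 = attention_context(e_1, a)
```

If all scores are equal, the weights are uniform and the context is simply the average of the encoder activations; unequal scores let the decoder focus on particular input positions.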