> #### Main Sources of Reference: 
>  - Evolution of TL in NLP: https://arxiv.org/pdf/1910.07370v1.pdf
>  - ULMFiT paper: https://arxiv.org/pdf/1801.06146.pdf
>  - Coursera-Deeplearning.ai course on **Sequence Models**, which explains the formulas used here for the different RNN units
>  - Yashu Seth on AWD LSTM - https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/

## Evolution of RNN architectures for Transfer Learning in NLP (Part 1)

- The crux of Transfer Learning: <br> 
 - (1) Build a Language Model^ that understands the underlying features of the text 
 - (2) Fine-tune the Language Model with additional layers for downstream tasks
> Why a Language Model? <br>
> *Language modeling can be seen as the ideal source task as it captures many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations and sentiment* <br>
> 
> Howard and Ruder, in the ULMFiT paper
_______________________________________________________________________________________________________________

#### Agenda of Part 1
- Introduction to Language Modeling
- How Transfer Learning Evolved
- Evolution of RNN units - RNN, LSTM, GRU, AWD-LSTM

#### Covered in Part 2
- ULMFiT

#### Covered in Part 3
- ELMo
_______________________________________________________________________________________________________________

### Introduction to Language Modeling

Language Model: **A model of the probability of a sequence of words**

- A language model can assign a probability to each possible next word, and it can also assign a probability to an entire sentence.

#### Applications of Language Model

* Speech Recognition: E.g.: P('recognize speech') >> P('wreck a nice beach')
* Spelling Correction: E.g.: P('I have a gub') << P('I have a gun')
* Machine Translation: E.g.: P('strong winds') > P('large winds')
* Optical Character Recognition/ Handwriting Recognition
* Autoreply Suggestions
* Text Classification (discussed with a Python implementation of a simple N-gram model)
* Text Generation (discussed with character-level and word-level language models)
_______________________________________________________________________________________________________________

#### Evaluation Metrics for LM

##### (1) Perplexity
- Perplexity is a measurement of how well a probability model predicts a sample
- It is used to compare probability models
- A low perplexity indicates the probability distribution is good at predicting the sample  

Definition: 
- Perplexity is the inverse probability of the test set, normalized by the number of words. <br>
Perplexity of test data = PP(test data) =

$$ P(w_1,\ w_2,\ ....,\ w_N)^{-{1 \over N}}  $$

$$ =\ {\sqrt[N]{1 \over P(w_1,\ w_2,\ ....,\ w_N)}} $$

$$ =\ {\sqrt[N]{\prod\limits_{i=1}^{N} {1 \over P(w_i\ |\ {w_{i-1},\ ....,\ w_1})}}}$$ 

**The lower the perplexity, the lower the cross-entropy of the model on the test text, and the better the model**
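As a minimal sketch of the definition above, the snippet below computes perplexity from the conditional probability a model assigned to each word of a test set. The function name `perplexity` and the two probability lists are made up for illustration; they do not come from a trained model.

```python
import math

def perplexity(word_probs):
    """Perplexity from the conditional probability the model gave each test word."""
    n = len(word_probs)
    # Sum logs instead of multiplying probabilities to avoid numeric underflow.
    log_prob = sum(math.log(p) for p in word_probs)
    # exp(-(1/N) * log P) is the same as P ** (-1/N).
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities from two models on the same 4-word test set.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))  # about 2
print(perplexity(uncertain_model))  # about 10
```

A perplexity of about 2 can be read as the model being, on average, as unsure as if it were choosing uniformly between 2 words at each step.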

_______________________________________________________________________________________________________________

##### (2) Log-likelihood

###### Before that, what is Markov Assumption?  

By the chain rule, the probability of occurrence of a 4-word sentence, where A comes before B, B before C, and C before D, is:

$$ Prob \ ({ A,\ B,\ C,\ D}) = Prob\ (A) \ \times \ Prob\ ( {B \ |\ A} ) \ \times \ Prob\ ( {C \ |\ {A,\ B}})\ \times \ Prob\ ({D \ |\ {A,\ B,\ C}}) $$

The Markov Assumption simplifies each of these conditional probabilities:

> “What I see NOW depends only on what I saw in the PREVIOUS step”

$$ P(w_i\ |\ {w_{i-1},\ ....,\ w_1}) = P(w_i\ |\ {w_{i-1}}) $$

Hence the probability of occurrence of a 4-word sentence with Markov Assumption is:

$$ Prob \ ({ A,\ B,\ C,\ D}) = Prob\ (A) \ \times \ Prob\ ( {B \ |\ A} ) \ \times \ Prob\ ( {C \ |\ B})\ \times \ Prob\ ({D \ |\ C}) $$
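The Markov Assumption can be sketched as a tiny count-based bigram model. Everything here is illustrative: the three-sentence corpus is invented, the `<s>` and `</s>` boundary tokens are an added convention (so that the probability of the first word is estimated as P(word | `<s>`)), and there is no smoothing, so unseen bigrams get probability 0.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); "<s>" and "</s>" mark sentence boundaries.
corpus = [
    "<s> i have a gun </s>",
    "<s> i have a dog </s>",
    "<s> i like strong winds </s>",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(word | prev) for every seen next word."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def sentence_prob(sentence):
    """Product of P(w_i | w_{i-1}) over the sentence (Markov Assumption)."""
    prob = 1.0
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        prob *= next_word_probs(prev).get(word, 0.0)
    return prob

print(next_word_probs("a"))
print(sentence_prob("<s> i have a gun </s>"))
print(sentence_prob("<s> i have a gub </s>"))  # 0.0, "gub" never follows "a"
```

The same model gives both views of a language model: `next_word_probs` scores each possible next word, and `sentence_prob` scores an entire sentence.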

Based on the Markov Assumption, the log-likelihood of an N-word sentence is:

$$ \log P(w_1,\ w_2,\ ....,\ w_N) = \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) $$ 

- where N is the number of words in the sentence. 
- Since log-probabilities are never positive (see graph), every additional word adds a non-positive term to the sum, so shorter sentences tend to get a higher log-likelihood. 

**To normalize it**,
$$ {1 \over N} \log P(w_1,\ w_2,\ ....,\ w_N) = {1 \over N} \left [\ \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \right] $$ 

**The higher the normalized log probability, the better the model**
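The effect of the normalization can be checked with a short sketch. The helper names and the probabilities below are hypothetical: every word is simply given probability 0.2, so the two sentences differ only in length.

```python
import math

def log_likelihood(first_word_prob, bigram_probs):
    """log P(w_1) plus the sum of log P(w_i | w_{i-1}) for i = 2..N."""
    return math.log(first_word_prob) + sum(math.log(p) for p in bigram_probs)

def normalized_log_likelihood(first_word_prob, bigram_probs):
    """Log-likelihood divided by the number of words N."""
    n = 1 + len(bigram_probs)
    return log_likelihood(first_word_prob, bigram_probs) / n

# Hypothetical sentences: (P(w_1), [P(w_2 | w_1), ..., P(w_N | w_{N-1})]).
short_sentence = (0.2, [0.2])        # N = 2
long_sentence = (0.2, [0.2] * 5)     # N = 6

# The raw score favors the short sentence purely because it has fewer terms.
print(log_likelihood(*short_sentence), log_likelihood(*long_sentence))

# After dividing by N, both sentences score the same per word.
print(normalized_log_likelihood(*short_sentence),
      normalized_log_likelihood(*long_sentence))
```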


#### Comparison between Perplexity and Log-Probability metrics for a Bigram Markov Language Model

$$ Perplexity\ for\ a\ bigram\ model = {\sqrt[N]{\prod\limits_{i=1}^{N} {1 \over P(w_i\ |\ {w_{i-1}})}}} $$

Comparing with normalized log probability for a bigram model:

$$ LogProbability\ for\ a\ bigram\ model = {1 \over N} \left [\ \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \right] $$ 
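One way to make the comparison explicit, assuming natural logarithms and taking the $i=1$ factor of the perplexity product to be $P(w_1)$: taking the log of the perplexity turns the product into a sum, which gives

$$ Perplexity = \exp \left( -{1 \over N} \left [\ \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \right] \right) = \exp \left( -\ LogProbability \right) $$

So the two metrics rank models identically: a higher normalized log probability always corresponds to a lower perplexity.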

___________________________________________________________________________________________________________________________

### How Transfer Learning Evolved

- Stage1: NLP started with rule-based and statistical methodologies
- Stage2: ML algos such as Naive Bayes, SVM, Trees coupled with bag-of-words word representations
- Stage3: Recurrent Neural Networks such as LSTM
- Stage4: Seq2Seq Architectures and Transformers --> 'ImageNet' moment in NLP

![Transformer and Language Models](../images/TL and LM Flow_chart.png)

### Evolution of RNN units

Why did RNNs come into existence?
- Models such as the **Multi-layer Perceptron Network, support vector machines and logistic regression** did not perform well on sequence modelling tasks (e.g.: text_sequence2sentiment_classification)
- Why? **Lack of memory element** ; **No information retention**

#### Cometh the RNNs: 

- RNNs attempted to redress this shortcoming by introducing loops within the network, thus allowing the retention of information.

**An unrolled RNN**
![An un-rolled RNN Cell](../images/an_unrolled_RNN.png)

**Digging Deeper - tanh activations**
![](../images/RNN-units.png)

**Mathematically, a single RNN cell**
![](../images/maths_RNN_fwd_prop.png)

**Mathematically, a simple RNN network**
![](../images/maths_RNN_units.png)
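A minimal NumPy sketch of the forward pass, assuming the usual formulation from the Sequence Models course: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ and $\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$. The output weights `W_ya`, `b_y`, the layer sizes and the random inputs are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax, shifted by the max for numerical stability."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, params):
    """One time step of a vanilla RNN cell."""
    a_t = np.tanh(params["W_aa"] @ a_prev + params["W_ax"] @ x_t + params["b_a"])
    y_t = softmax(params["W_ya"] @ a_t + params["b_y"])
    return a_t, y_t

def rnn_forward(xs, a0, params):
    """Unroll the cell over a sequence, reusing the same weights at every step."""
    a_t, outputs = a0, []
    for x_t in xs:
        a_t, y_t = rnn_cell_forward(x_t, a_t, params)
        outputs.append(y_t)
    return a_t, outputs

# Hypothetical sizes: 3 input features, 5 hidden units, 2 output classes.
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 2
params = {
    "W_aa": rng.normal(scale=0.1, size=(n_a, n_a)),
    "W_ax": rng.normal(scale=0.1, size=(n_a, n_x)),
    "b_a": np.zeros((n_a, 1)),
    "W_ya": rng.normal(scale=0.1, size=(n_y, n_a)),
    "b_y": np.zeros((n_y, 1)),
}
xs = [rng.normal(size=(n_x, 1)) for _ in range(4)]  # a 4-step input sequence
a_last, ys = rnn_forward(xs, np.zeros((n_a, 1)), params)
print(a_last.shape, len(ys))
```

The loop in `rnn_forward` is the "unrolled" view: the hidden state `a_t` is the only thing carried from one step to the next.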

##### How are weights such as $W_{aa}, W_{ax}, b_a$ learned?
- BPTT: **Back Propagation Through Time**
![](../images/rnn_cell_backprop.png)

#### Advantage of a vanilla RNN:
- Better than traditional ML algos at retaining information across a sequence

#### Limitations of a vanilla RNN:
- RNNs fail to model long term dependencies.
 - the information is often **"forgotten"** after the unit activations are multiplied several times by small numbers
- Vanishing gradient and exploding gradient problems
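The limitations above can be illustrated with a deliberately simplified sketch: a scalar, linear recurrence $a^{\langle t \rangle} = w \cdot a^{\langle t-1 \rangle}$, for which the gradient of the last state with respect to the first is just $w$ multiplied by itself once per time step. This is an assumption made for illustration; a real RNN multiplies by weight matrices and activation derivatives instead of a single number.

```python
def gradient_factor(weight, steps):
    """Gradient of the final state w.r.t. the initial state for a_t = weight * a_{t-1}."""
    return weight ** steps

# A factor below 1 shrinks towards zero (vanishing gradient);
# a factor above 1 blows up (exploding gradient).
for steps in (1, 10, 50):
    print(steps, gradient_factor(0.5, steps), gradient_factor(1.5, steps))
```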

#### Long Short Term Memory (LSTM): 
- a special type of RNN architecture
- designed to retain information over an extended number of timesteps
- each LSTM cell consists of 4 layers (3 sigmoid and 1 tanh)

![](../images/LSTM.png)
<caption><center>LSTM-cell. This tracks and updates a "cell state" or memory variable $c^{\langle t \rangle}$ at every time-step, which can be different from $a^{\langle t \rangle}$. </center></caption>

### About the gates

#### - Forget gate

For the sake of this illustration, let's assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of the previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this: 

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1} $$

Here, $W_f$ are weights that govern the forget gate's behavior. We concatenate $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information. 

#### - Update gate

Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate: 

$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2} $$ 

Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.

#### - Updating the cell 

To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is: 

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3} $$

Finally, the new cell state is: 

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4} $$


#### - Output gate

To decide which outputs to emit, we use the following two formulas: 

$$ \Gamma_o^{\langle t \rangle}=  \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}$$ 
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6} $$
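Equations (1) to (6) can be put together into a single forward step. The NumPy sketch below follows them line by line; the dictionary keys, layer sizes and random values are assumptions chosen for illustration, and the last lines force the gates with large biases just to show the "keep the memory" behaviour.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    """One LSTM time step; comments refer to the equation numbers above."""
    concat = np.vstack([a_prev, x_t])                  # [a<t-1>, x<t>]
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])    # (1) forget gate
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])    # (2) update gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])    # (3) candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde         # (4) new cell state
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])    # (5) output gate
    a_t = gamma_o * np.tanh(c_t)                       # (6) new hidden state
    return a_t, c_t

# Hypothetical sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_x, n_a = 3, 4
params = {}
for name in ("f", "u", "c", "o"):
    params["W_" + name] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    params["b_" + name] = np.zeros((n_a, 1))

x_t = rng.normal(size=(n_x, 1))
a_prev = np.zeros((n_a, 1))
c_prev = rng.normal(size=(n_a, 1))
a_t, c_t = lstm_cell_forward(x_t, a_prev, c_prev, params)

# Forget gate pushed towards 1 and update gate towards 0:
# the cell state is carried over almost unchanged.
keep = dict(params, b_f=np.full((n_a, 1), 20.0), b_u=np.full((n_a, 1), -20.0))
_, c_kept = lstm_cell_forward(x_t, a_prev, c_prev, keep)
print(np.allclose(c_kept, c_prev, atol=1e-6))
```

Because equation (4) adds to the cell state instead of repeatedly squashing it, a gate value near 1 lets information pass through many time steps largely intact.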

#### Advantages of an LSTM:
- Better equipped for long-range dependencies
- Resists the vanishing gradient problem better than a vanilla RNN

#### Limitations of LSTM:
- The added gates require more computation, so LSTMs tend to be slower

### Gated Recurrent Unit
- a simplified version of LSTM
- retains the LSTM's resistance to the vanishing gradient problem, but GRUs are internally simpler and faster than LSTMs.
> The `forget` and `update` gates are merged into a single `update` gate. <br>
> The update gate decides how much of the previous memory to keep around.
>
> The `reset` gate defines how to combine the new input with the previous value.

![](https://cdn-images-1.medium.com/freeze/max/1000/1*dhq14CzJijlqjf7IlDB0uw.png?q=20)
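A minimal NumPy sketch of one GRU step, written in the same notation as the LSTM equations. The exact formulation is an assumption: it follows one common variant, in which the update gate $\Gamma_u$ weights the candidate and $(1 - \Gamma_u)$ weights the previous state, and other sources swap the roles of the two terms. Parameter names, sizes and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, a_prev, p):
    """One GRU time step with a single update gate and a reset gate."""
    concat = np.vstack([a_prev, x_t])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])    # update gate
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])    # reset gate
    # The reset gate scales the previous state before it meets the new input.
    gated = np.vstack([gamma_r * a_prev, x_t])
    a_tilde = np.tanh(p["W_a"] @ gated + p["b_a"])     # candidate state
    # One gate both admits the candidate and keeps the old memory.
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev

# Hypothetical sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_x, n_a = 3, 4
params = {}
for name in ("u", "r", "a"):
    params["W_" + name] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    params["b_" + name] = np.zeros((n_a, 1))

x_t = rng.normal(size=(n_x, 1))
a_prev = rng.normal(size=(n_a, 1))
a_t = gru_cell_forward(x_t, a_prev, params)

# Update gate pushed towards 0: the previous memory is kept almost unchanged.
keep = dict(params, b_u=np.full((n_a, 1), -20.0))
a_kept = gru_cell_forward(x_t, a_prev, keep)
print(np.allclose(a_kept, a_prev, atol=1e-6))
```

Compared with the LSTM, there is no separate cell state and no output gate, which is where the savings in computation come from.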

Comparison of performance between GRU and LSTM:
- GRUs are almost on par with LSTMs while being cheaper to compute. 
- However, with large datasets, the more expressive LSTMs may lead to better results
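A rough way to quantify the difference in cost is to count learnable parameters. The sketch assumes one weight matrix over $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ plus one bias vector per gate or candidate layer, which gives 4 such blocks for an LSTM and 3 for a GRU; deep learning libraries may count biases differently. The sizes are hypothetical.

```python
def gated_unit_params(n_x, n_a, n_blocks):
    """Parameters for n_blocks layers, each with a (n_a, n_a + n_x) matrix and a bias."""
    per_block = n_a * (n_a + n_x) + n_a
    return n_blocks * per_block

def lstm_params(n_x, n_a):
    # Forget gate, update gate, output gate, candidate cell state.
    return gated_unit_params(n_x, n_a, 4)

def gru_params(n_x, n_a):
    # Update gate, reset gate, candidate state.
    return gated_unit_params(n_x, n_a, 3)

# Hypothetical sizes: 100-dimensional inputs, 256 hidden units.
print(lstm_params(100, 256))  # 365568
print(gru_params(100, 256))   # 274176
```

Under these assumptions a GRU has three quarters of the parameters of an LSTM with the same hidden size.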