> #### Main Sources of Reference:
>  - Evolution of TL in NLP: https://arxiv.org/pdf/1910.07370v1.pdf
>  - ELMo Paper: https://arxiv.org/pdf/1802.05365.pdf
>    - ELMo in Articles:
>      - https://ahmedhanibrahim.wordpress.com/2019/07/01/a-study-on-cove-context2vec-elmo-ulmfit-and-bert/
>      - https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27
>  - ELMo in practice: https://colab.research.google.com/drive/13f6dKakC-0yO6_DxqSqo0Kl41KMHT8A1

## Evolution of RNN architectures for Transfer Learning in NLP (Part 3)

#### Already covered in Part 1
- Introduction to Language Modeling
- How Transfer Learning Evolved
- Evolution of RNN units - RNN, LSTM, GRU, AWD-LSTM

#### Agenda covered in Part 2
- ULMFiT

#### Agenda covered here in Part 3
- ELMo
_______________________________________________________________________________________________________________

- ELMo produces better `word representations/embeddings` by using language models that learn the `context` of the word in focus
![](https://ahmedhanibrahim.files.wordpress.com/2019/07/52861-1pb5hxsxogjrnda_si4nj9q.png?w=775)
*Ignore the hidden vectors that predict the padding tokens and focus only on the vectors that predict the words*
source: https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27

ELMo uses a bidirectional language model to get a new embedding that is concatenated with the initial word embedding. Concretely, the word “are” in the above figure has a representation formed from the following embedding vectors:

- Original embedding (for example GloVe, Word2Vec or FastText)
- Forward pass hidden layer representation vector
- Backward pass hidden layer representation vector

### About ELMo Word Vectors: 
ELMo models both 
- (1) complex characteristics of word use (e.g., syntax and semantics)
- (2) how these uses vary across linguistic contexts (i.e., to model polysemy)
<br>
<br>
- ELMo `word vectors` are **learned functions of the internal states of a deep bidirectional language model (biLM)**, which is pretrained on a large text corpus
<br>
<br>
- ELMo assigns each token/word **a representation that is a function of the entire input sentence**
<br>
<br>
- ELMo representations are **deep**, in the sense that they are **a function of all of the internal layers of the biLM**
- In other words, ELMo doesn't just use the top LSTM layer, but all the internal layers
<br>
<br>
 - **higher-level LSTM states** capture **context-dependent aspects of word meaning**
 - **lower-level states** model aspects of **syntax**

Adding ELMo to the baseline models improved the state of the art on 6 diverse NLP tasks (scores as reported in the ELMo paper):

| Task | Description | Comments about Dataset | Evaluation Metric | Previous SOTA | ELMo + Baseline |
| ------ |:------: |:------ |------ |------ | ------ | 
| SQuAD | Stanford Question Answering Dataset | a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage (the ELMo paper evaluates on SQuAD 1.1, which has no unanswerable questions) | F1 score (harmonic mean of precision and recall) | 84.4 |  85.8 |
| SNLI | Stanford Natural Language Inference | SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE) | Accuracy | 88.6 | 88.7 |
| SRL | Semantic Role Labeling | Semantic Role Labeling (SRL) recovers the latent predicate argument structure of a sentence, providing representations that answer basic questions about sentence meaning, including “who” did “what” to “whom,” etc | F1 Score | 81.7 | 84.6 |
| Coref | Coreference resolution | Coreference resolution is the task of finding all expressions that refer to the same entity in a text. | Average F1 | 67.2 | 70.4 |
| NER | Named Entity Recognition | The named entity recognition model identifies named entities (people, locations, organizations, and miscellaneous) in the input text | F1 | 91.93 | 92.22 |
| SST-5 | 5-class Stanford Sentiment Treebank Dataset | fine-grained sentiment classification task uses 5 discrete classes: Strongly positive, Weakly positive, Neutral, Weakly negative, Strongly negative | Accuracy | 53.7 | 54.7 |
<br>
**sources for the Task Description:**
- https://rajpurkar.github.io/SQuAD-explorer/
- https://nlp.stanford.edu/projects/snli/
- https://demo.allennlp.org/semantic-role-labeling/MTIzODQzNg==
- https://demo.allennlp.org/coreference-resolution/MTIzODQzNA==
- https://demo.allennlp.org/named-entity-recognition/MTIzODQzOA==
- https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4

#### 1. Bidirectional Language Models
> Gist: ELMo word representations are functions of the entire input sentence **computed on top of two-layer biLMs**

Given a sequence of N tokens $ (t_1,t_2,...,t_N) $, **a forward language model** computes the probability of the sequence by modeling the probability of each token $t_k$ given its history $(t_1, ..., t_{k-1})$:
$$ P(t_1, t_2, ..., t_N) = \prod\limits_{k=1}^{N} P(t_k \ |\ t_1, t_2, ..., t_{k-1}) $$

A **backward LM** is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$ P(t_1, t_2, ..., t_N) = \prod\limits_{k=1}^{N} P(t_k \ |\ t_{k+1}, t_{k+2}, ..., t_N) $$

A biLM combines both a forward and a backward LM. The ELMo formulation jointly maximizes the log likelihood of the forward and backward directions:
![](../images/ELMo_logliklihood.png)
![](../images/ELMo_parameters.png)
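
The two factorizations and their joint objective can be sketched in a few lines of Python. This is only an illustration: `uniform_prob` is a hypothetical stand-in for a trained language model (in ELMo the conditional probabilities come from LSTMs followed by a softmax layer), and the function names are made up for this example.

```python
import math

def forward_log_likelihood(tokens, cond_prob):
    # sum over k of log P(t_k | t_1, ..., t_{k-1})
    return sum(math.log(cond_prob(tokens[k], tokens[:k]))
               for k in range(len(tokens)))

def backward_log_likelihood(tokens, cond_prob):
    # sum over k of log P(t_k | t_{k+1}, ..., t_N)
    return sum(math.log(cond_prob(tokens[k], tokens[k + 1:]))
               for k in range(len(tokens)))

def bilm_log_likelihood(tokens, fwd_prob, bwd_prob):
    # The biLM objective adds the log likelihoods of both directions.
    return (forward_log_likelihood(tokens, fwd_prob)
            + backward_log_likelihood(tokens, bwd_prob))

# Hypothetical stand-in for a trained LM: uniform over a tiny vocabulary,
# ignoring the context entirely.
VOCAB = ["how", "are", "you", "today"]

def uniform_prob(token, context):
    return 1.0 / len(VOCAB)

tokens = ["how", "are", "you"]
joint = bilm_log_likelihood(tokens, uniform_prob, uniform_prob)
print(joint)
```

With a real biLM, the only change would be that `fwd_prob` and `bwd_prob` actually use the context they are given; training maximizes `joint` over the corpus.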

#### 2. ELMo Embedding
> Gist: ELMo is a linear combination of the internal/hidden states (an equation consisting of $\gamma^{task}$, $s^{task}$ and $h_{k,j}^{LM}$)

For each token $t_k$, an L-layer biLM computes a set of 2L + 1 representations, and the ELMo embedding for that token is a combination of these intermediate layer representations:
$$ R_k = \{x_k^{LM}, 
\overrightarrow{
h_{k,j}^{LM}
}, 
\overleftarrow{
h_{k,j}^{LM}
} | \ 
j = 1, 2, ...L\} $$

$$ R_k = \{ h_{k,j}^{LM} \ | \ j = 0, 1, 2, ...L\} $$

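where 
- $ x_k^{LM} $ - *context-independent* token representation (an original embedding such as GloVe, Word2Vec or FastText in the simplified picture above; the pre-trained ELMo model uses a character-based CNN, see section 4)
- $\overrightarrow{h_{k,j}^{LM}}$ - context-dependent representation from the jth layer of the forward language model
- $\overleftarrow{h_{k,j}^{LM}}$ - context-dependent representation from the jth layer of the backward language model
- `L` is the number of LSTM layers, so `j = L` is the top layer
- $ h_{k,0}^{LM} $ is the context-independent token layer representation
- $ h_{k,j}^{LM} = [\overrightarrow{h_{k,j}^{LM}}; \overleftarrow{h_{k,j}^{LM}}] $ for each biLSTM layer $j = 1, ..., L$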

ELMo collapses all the layers in $R_k$ into a single vector, $ELMo_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer: $E(R_k) = h_{k,L}^{LM}$.

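For a given downstream task, ELMo computes a task-specific weighting of all biLM layers, so the ELMo embedding for token `k` is
![](../images/ELMo_finetuned_embeddings.png)
where
- $\gamma^{task}$ = a scalar that lets the task model scale the entire ELMo vector; it is of practical importance in aiding the optimization process
- $s^{task}$ = softmax-normalized weights, one per biLM layer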
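
A minimal sketch of this weighted combination, assuming the layer vectors $h_{k,j}^{LM}$ for one token are already available as plain Python lists. The function names, the toy 4-dimensional vectors and the chosen values of the raw weights and `gamma` are illustrative assumptions, not values from the paper.

```python
import math

def softmax(weights):
    # Softmax-normalize the raw per-layer weights to obtain s^task.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_task_embedding(layers, raw_weights, gamma):
    """Compute gamma * sum_j s_j * h_{k,j} for one token.

    layers: list of L + 1 vectors (token layer plus L biLSTM layers),
            all of the same length.
    """
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
            for d in range(dim)]

# Toy token with the token layer plus L = 2 biLSTM layers, 4 dimensions each.
layers = [
    [1.0, 0.0, 0.0, 0.0],  # h_{k,0}: context-independent token layer
    [0.0, 1.0, 0.0, 0.0],  # h_{k,1}: first biLSTM layer
    [0.0, 0.0, 1.0, 0.0],  # h_{k,2}: second (top) biLSTM layer
]
raw_weights = [0.0, 0.0, 0.0]  # equal raw weights: softmax gives 1/3 each
gamma = 3.0                    # illustrative value only
embedding = elmo_task_embedding(layers, raw_weights, gamma)
print(embedding)
```

In a real task model, `raw_weights` and `gamma` would be trainable parameters learned together with the task, which is how each task can emphasize different layers.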

#### 3. How to use ELMo in most supervised NLP tasks
> Gist: Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in an easy, consistent manner

- Given a sequence of tokens $ (t_1,t_2,...,t_N) $, it is standard to form a context-independent token representation $x_k$ for each token position
$$ x_k = \text{pre-trained word embeddings} + \text{(optionally) character-based representations} $$
<br>
<br>
- Freeze the weights of the biLM and concatenate the ELMo vector $ELMo_k^{task}$ with $ x_k $ to create
$$ \text{ELMo enhanced representation} = [x_k;ELMo_k^{task}] $$

- The `ELMo enhanced representation` is then passed into the task model for downstream tasks such as SQuAD, SNLI and SST-5.
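
The concatenation step can be sketched as below. The helper name `enhance_with_elmo` and the toy vectors are hypothetical; in practice $x_k$ and $ELMo_k^{task}$ would be tensors produced by the embedding layer and the biLM.

```python
def enhance_with_elmo(token_vectors, elmo_vectors):
    """Build [x_k; ELMo_k^task] for every token position k."""
    if len(token_vectors) != len(elmo_vectors):
        raise ValueError("need one ELMo vector per token")
    # List concatenation stands in for vector concatenation.
    return [x + e for x, e in zip(token_vectors, elmo_vectors)]

# Toy sentence of two tokens: 2-dim x_k and 3-dim ELMo vectors.
token_vectors = [[0.1, 0.2], [0.3, 0.4]]
elmo_vectors = [[0.5, 0.6, 0.7], [0.8, 0.9, 1.0]]
enhanced = enhance_with_elmo(token_vectors, elmo_vectors)
print(enhanced)
```

The task model then consumes `enhanced` instead of the plain token vectors, so its input width grows by the size of the ELMo vector.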

#### 4. Pre-trained Bidirectional LM Architecture
![](../images/ELMo_pretrained_bidirectionalLM_architecture.png)
- The final model uses L = 2 biLSTM layers with 4096 units and 512-dimensional projections, and a residual connection from the first to the second layer.
- The context-independent representation $ x_k $ uses 2048 character n-gram convolutional filters followed by two highway layers and a linear projection down to a 512-dimensional representation.
<br>
<br>
> source: 
> - What are highway layers: https://papers.nips.cc/paper/5850-training-very-deep-networks.pdf
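
For intuition about the highway layers mentioned above, here is a minimal sketch of a single highway layer, $y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$, in plain Python. The ReLU used for $H$, the tiny 2-dimensional weights and the function names are assumptions made for illustration; they are not the configuration of the pre-trained ELMo model.

```python
import math

def affine(weights, bias, x):
    # Matrix-vector product plus bias, with weights as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def highway_layer(x, w_h, b_h, w_t, b_t):
    # Candidate transform H(x); ReLU is an assumed choice of nonlinearity.
    h = [max(0.0, v) for v in affine(w_h, b_h, x)]
    # Transform gate T(x) in (0, 1), computed with a sigmoid.
    t = [1.0 / (1.0 + math.exp(-v)) for v in affine(w_t, b_t, x)]
    # Gate mixes the transformed input with the untouched input.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_matrix = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]

# A strongly negative gate bias closes the gate: the input is carried through.
carried = highway_layer(x, identity, zero_bias, zero_matrix, [-50.0, -50.0])
# A strongly positive gate bias opens the gate: the output is ReLU(x).
transformed = highway_layer(x, identity, zero_bias, zero_matrix, [50.0, 50.0])
print(carried, transformed)
```

The gate lets each dimension choose between transforming the input and copying it unchanged, which is what makes stacking such layers easy to train.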

**Conclusion:** 
- high-quality deep context-dependent representations are learned from biLMs
- the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context
- using all layers, rather than just the top LSTM layer, improves overall task performance