> #### Main Sources of Reference: 
>  - Evolution of TL in NLP: https://arxiv.org/pdf/1910.07370v1.pdf
>  - ULMFiT paper: https://arxiv.org/pdf/1801.06146.pdf
>  - Articles on ULMFiT
>   - https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/#vlbs
>   - https://medium.com/@zhangguanguan1/an-application-of-universal-language-model-fine-tuning-to-the-classification-of-multiclass-company-e77527e2bcae

## Evolution of RNN architectures for Transfer Learning in NLP (Part 2)

#### Already covered in Part 1
- Introduction to Language Modeling
- How Transfer Learning Evolved
- Evolution of RNN units - RNN, LSTM, GRU, AWD-LSTM

#### Agenda covered here in Part 2
- ULMFiT

#### Agenda to be covered in Part 3
- ELMo
_______________________________________________________________________________________________________________

**Why ULMFiT became successful:**

Historically:
- Fine-tuning an LM required millions of in-domain documents, so transfer learning was impractical and LMs had limited applicability
- LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier
- In contrast, ULMFiT uses a very common 3-layer LSTM architecture (ignoring the encoder, decoder and fine-tuning layers) but combines it with a variety of novel training techniques that make `inductive transfer learning` (pre-train on a huge generic corpus, then fine-tune on the target data) work in practice

### ULMFiT
- Universal Language Model Fine-tuning (ULMFiT) for Text Classification
 - This paper introduces techniques that are essential to fine-tune an LSTM-based Language Model
 - This paper specifically demonstrates the superior performance of the ULMFiT approach on 6 text classification datasets
 - The performance was measured in terms of error rates
   ![](../images/ULMFiT_table.png)
 - The 6 text classification datasets used are: 
  ![](../images/ULMFiT_6_data_table.png)

#### What does ULMFiT propose?
- Pretrain an LM on a large general-domain corpus and fine-tune it on the target task using novel techniques
- Why it is called **Universal** (these properties have since become synonymous with what a TL model is):
 - 1) it works across tasks varying in document size, number, and label type;
 - 2) it uses a single architecture and training process;
 - 3) it requires no custom feature engineering or preprocessing; and
 - 4) it does not require additional in-domain documents or labels
- What the **novel** techniques are:
 - discriminative fine-tuning,
 - slanted triangular learning rates, and 
 - gradual unfreezing

#### Comparison Notes with CV and other NLP Works
- Compared to CV models (which are several layers deep), **NLP models are typically more shallow** and thus require different fine-tuning methods
- Features in deep neural networks in CV have been observed to transition **from general to task-specific** from the **first to the last layer**. 
- For this reason, most work in CV transfers the first layers of the model, fine-tunes only the last layer or the last few layers, and leaves the remaining layers frozen
- **Hypercolumns**:
 - CV: A hypercolumn at a pixel in CV is the vector of all activations of CNN units above that pixel
 - NLP: Concatenation of embeddings at different layers in a pretrained model
 - In CV, hypercolumns have been nearly entirely superseded by end-to-end fine-tuning
- **Multi-task learning**
 - One of the papers cited mentions training a LM objective jointly with the main task objective
 

>> #### ULMFiT uses no custom-engineered architecture (as needed for hypercolumns), no residual (short-cut) connections and no multi-task learning objective, yet it performs better than the methods above

### ULMFiT uses AWD-LSTM cell based Language Model

### About AWD LSTM
- ASGD Weight-Dropped (AWD) LSTM, where ASGD stands for Averaged Stochastic Gradient Descent
- It uses `DropConnect` and a variant of averaged SGD (`NT-ASGD`) along with several other well-known regularization strategies

**Why won't `dropout` work?**
 - Dropout regularizes a neural network by randomly (with probability p) ignoring units' activations during the training phase.
 - By reducing the chance that neurons develop inter-dependencies, it makes each neuron more useful on its own and thus reduces overfitting.
 - However, applied naively to an RNN's hidden state, dropout inhibits the RNN's capability to develop long-term dependencies, because randomly ignoring activations loses information that should be carried across time steps.

**Hence `drop connect`**
- The DropConnect algorithm randomly drops weights instead of neuron activations. It does so by randomly (with probability 1-p) setting weights of the neural network to zero during the training phase.
- This **redresses the issue of information loss** in the recurrent neural network **while still performing regularization.**

![](https://yashuseth.files.wordpress.com/2018/09/nn_do1.jpg?w=685)
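
The difference between the two masks can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the AWD-LSTM implementation: the helper names, the plain `tanh` recurrence, the drop probability `p_drop` and the `1 / (1 - p_drop)` rescaling are all hypothetical choices. What it does follow from the AWD-LSTM paper is that the mask is applied to the hidden-to-hidden weights and reused for every time step of a sequence.

```python
import numpy as np

def dropout_forward(h, p_drop, rng):
    """Standard dropout: zero whole activations with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)  # rescale so the expected value is unchanged

def dropconnect_weights(W, p_drop, rng):
    """DropConnect: zero individual weights with probability p_drop."""
    mask = rng.random(W.shape) >= p_drop
    return W * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
hidden_size = 4
W_hh = rng.standard_normal((hidden_size, hidden_size))  # hidden-to-hidden weights
inputs = rng.standard_normal((5, hidden_size))          # toy inputs, already projected
h = np.zeros(hidden_size)

# One weight mask is sampled per sequence and reused at every time step,
# so the hidden state h itself is never masked.
W_dropped = dropconnect_weights(W_hh, p_drop=0.5, rng=rng)
for x_t in inputs:
    h = np.tanh(x_t + W_dropped @ h)
```

Because the hidden state is passed on untouched, no information stored in `h` is thrown away between time steps; only the connections that transform it are thinned out.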

**What is NT-ASGD?**
- Non-monotonically Triggered Averaged Stochastic Gradient Descent
- For language modeling tasks, the AWD-LSTM paper notes that **traditional SGD without momentum outperforms other algorithms** such as momentum SGD, Adam, Adagrad, and RMSProp
- ASGD is a variant of the traditional SGD algorithm that returns an average of the weights from later iterations instead of the final weights

*Batch GD:*
```python
for i in range(nb_epochs):
  params_grad = evaluate_gradient(loss_function, data, params)
  params = params - learning_rate * params_grad
```

*Traditional SGD:*
```python
for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
```

*Averaged SGD (ASGD):*
```python
step = 0                # number of SGD updates performed so far
params_sum = 0          # running sum of the weights from iterations K+1, ..., step
avg_params = params     # averaged weights; identical to params until averaging starts
for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    # Ordinary SGD update; the averaging never overwrites these weights
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
    step += 1
    if step > K:  # averaging starts only after K updates
      params_sum = params_sum + params
      avg_params = params_sum / (step - K)
# avg_params, not params, is returned as the final solution
```

- `K` is the minimum number of iterations run before weight averaging starts.
- So during the first K iterations, ASGD behaves exactly like traditional SGD.
- `step` is the number of iterations done so far,
- `params_sum` is the sum of the weights from iteration K+1 to `step`, so `avg_params` is their mean, and
- `learning_rate` is the learning rate at the current iteration, as decided by a learning rate scheduler.
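
To see the averaging at work, here is a small runnable sketch on a toy least-squares problem. Everything in it is an assumption made for illustration (the function name `asgd_least_squares`, the synthetic data, the constant learning rate and `K = 100`); it is not taken from the AWD-LSTM code.

```python
import numpy as np

def asgd_least_squares(X, y, learning_rate=0.02, nb_epochs=20, K=100, seed=0):
    """Fit y ~ X @ w with SGD and average the weights after the first K updates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    step = 0
    for _ in range(nb_epochs):
        for i in rng.permutation(len(X)):
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # squared-error gradient on one example
            w = w - learning_rate * grad           # ordinary SGD update
            step += 1
            if step > K:                           # accumulate weights once averaging starts
                w_sum += w
    w_avg = w_sum / (step - K) if step > K else w
    return w, w_avg

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(200)

w_last, w_avg = asgd_least_squares(X, y)
print("last SGD weights:", w_last)
print("averaged weights:", w_avg)
```

The SGD weights keep jittering around the solution because of gradient noise, while the averaged weights smooth that jitter out.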


#### What is NT-ASGD then?
- A commonly used strategy with the SGD optimizer is to reduce the learning rate by a fixed quantity when the validation error worsens.
- Non-monotonically triggered ASGD employs a similar check on the validation error.
- It differs in what the check triggers: instead of lowering the learning rate, NT-ASGD keeps the learning rate constant and switches on weight averaging once the validation error has failed to improve for several consecutive checks.
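
A minimal sketch of such a trigger, assuming the validation loss is checked once per epoch. The function name and the example losses are made up; `n` is the non-monotone interval, a hyperparameter whose default here is only a placeholder.

```python
def should_start_averaging(val_losses, new_val_loss, n=5):
    """Return True when the new validation loss is worse than the best loss
    recorded at least n checks ago.

    val_losses holds the losses of all previous checks, oldest first.
    """
    if len(val_losses) <= n:
        return False  # not enough history yet
    return new_val_loss > min(val_losses[:-n])

history = []
trigger_at = None
toy_losses = [5.0, 4.5, 4.2, 4.0, 3.9, 3.85, 3.9, 3.88, 3.95, 3.92, 3.91, 3.93]
for check, loss in enumerate(toy_losses):
    if trigger_at is None and should_start_averaging(history, loss, n=3):
        trigger_at = check  # from here on, the optimizer would average the weights
    history.append(loss)
print("averaging triggered at check", trigger_at)
```

Comparing against the best loss from at least `n` checks ago, rather than against the previous check only, keeps a single noisy validation result from triggering the switch too early.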

**ULMFiT language Model Architecture with an example (not considering the two blocks of classification layer)**
![](../images/ULMFiT_Language_Model.png "source: https://medium.com/@zhangguanguan1/an-application-of-universal-language-model-fine-tuning-to-the-classification-of-multiclass-company-e77527e2bcae")

**Transforming the language model into a Classifier**
![](https://miro.medium.com/max/1201/1*TxWWBKy4ot7jq-lnJfMJUA.jpeg)

### 3 Stages of ULMFiT

![](../images/ULMFiT_pretraining.png)

1. LM pre-training: the LM is trained on **a general-domain corpus** to capture general features of the language in different layers

- Captures the general properties of the given language through pretraining: the authors pre-trained their language model on **WikiText-103**, a large general-purpose dataset that consists of 28,595 preprocessed articles and 103 million words.
- Computationally expensive, but needs to be performed only once.

![](../images/ULMFiT_finetuning.png)

2. LM fine-tuning: the full LM is fine-tuned on **target task data** using discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific features

- Fine-tunes the LM to capture the inherent nuances of the **target task**
- Performed on a relatively small dataset, so it requires less computation
- Finetuning using two novel techniques: 
 - **Discriminative Finetuning**
 - **Slanted Triangular Learning Rates**

![](../images/ULMFiT_classifier-fine-tuning.png)

3. Classifier fine-tuning: the classifier is fine-tuned on the target task using gradual unfreezing, ‘Discr’, and STLR to preserve low-level representations and adapt high-level ones (in the figure above, shaded: unfreezing stages; black: frozen)

- To perform task-specific classification, **two linear blocks** initialized from scratch are added to the language model.
- Similar to the practice in CV classifiers, each block uses
 - **batch normalization**
 - **dropout**
- The activations are
 - **ReLU** for the intermediate layer
 - **softmax** for the last layer, which outputs a probability distribution over the target classes
- Parameters in these **target task-specific classifier layers** are the only ones that are learned from scratch
- The first linear layer takes as its input the pooled hidden states of the last LSTM layer
- Techniques used in Classifier Finetuning:
 - **Concat pooling**
 - **Gradual Unfreezing**
 - **BPTT for Text Classification (BPT3C)**
 - **Bidirectional language model**
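
The forward pass through the two linear blocks can be sketched in NumPy as follows. This is an illustration under assumptions, not the paper's code: the ordering (batch norm, then dropout, then the linear layer), the batch norm without learned scale and shift, the layer sizes and the drop probability are all hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, p_drop, rng):
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classifier_head(pooled, W1, b1, W2, b2, p_drop, rng):
    """Two linear blocks: ReLU after the first, softmax after the second."""
    x = dropout(batch_norm(pooled), p_drop, rng)
    x = np.maximum(0.0, x @ W1 + b1)          # intermediate layer with ReLU
    x = dropout(batch_norm(x), p_drop, rng)
    return softmax(x @ W2 + b2)               # class probabilities

rng = np.random.default_rng(0)
batch, pooled_dim, hidden, n_classes = 8, 12, 5, 3
pooled = rng.standard_normal((batch, pooled_dim))   # stand-in for the pooled LM states
W1, b1 = 0.1 * rng.standard_normal((pooled_dim, hidden)), np.zeros(hidden)
W2, b2 = 0.1 * rng.standard_normal((hidden, n_classes)), np.zeros(n_classes)
probs = classifier_head(pooled, W1, b1, W2, b2, p_drop=0.1, rng=rng)
```

Each row of `probs` is a probability distribution over the target classes for one document in the batch.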

### Novel Techniques - Part 1: 

#### Used in Language Model FineTuning

##### <center>**Discriminative fine-tuning**:</center>
- Core Idea: different layers capture different types of information, hence should be fine-tuned to different extents

Stochastic Gradient Descent of a model’s parameters $\theta$ at time step `t`:
$$ \theta_t = \theta_{t-1} - \eta \cdot \nabla_\theta J( \theta) $$

where
- $\eta$ is the learning rate
- $\nabla_\theta J( \theta)$ is the gradient of the model’s objective function with respect to $\theta$

For discriminative fine-tuning, we 
- split the parameters $\theta$ into {$\theta^1$,...,$\theta^L$} where
 - $\theta^l$ contains the parameters of the model at the $l^{th}$ layer and L is the number of layers of the model (here L = 3)

- split the learning rate $\eta$ into {$\eta^1$,...,$\eta^L$} where
 - $\eta^l$ is the learning rate of the $l^{th}$ layer
 
**SGD with discriminative fine-tuning**:
$$ \theta_t^l = \theta_{t-1}^l - \eta^l \cdot \nabla_{\theta^l} J( \theta) $$

Choose the learning rate $\eta^L$ for the last layer and then compute learning rates of the previous layers using
$$ \eta^{l-1} = { \eta^{l}\over 2.6 } $$ 
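
As a worked example of the rule above, the per-layer learning rates and one SGD step can be computed like this. The function names and the value `0.01` for the last layer are illustrative assumptions; only the factor 2.6 comes from the formula above.

```python
def discriminative_lrs(last_layer_lr, num_layers, factor=2.6):
    """Learning rates for layers 1..L: layer L gets last_layer_lr and each
    lower layer gets the rate of the layer above it divided by factor."""
    return [last_layer_lr / factor ** (num_layers - l) for l in range(1, num_layers + 1)]

def sgd_step(layer_params, layer_grads, lrs):
    """One SGD update in which every layer uses its own learning rate."""
    return [theta - lr * grad for theta, grad, lr in zip(layer_params, layer_grads, lrs)]

lrs = discriminative_lrs(last_layer_lr=0.01, num_layers=3)
print(lrs)  # lowest layer first, last layer last

# Toy scalar "layers", just to show the update shape
new_params = sgd_step([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], lrs)
```

With L = 3, the first layer ends up with a learning rate 2.6 × 2.6 = 6.76 times smaller than the last layer, so the most general features change the least.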

**How to choose $\eta^L$:** the paper picks $\eta^L$ by fine-tuning only the last layer; during training, the rate is then varied over time with the schedule below.

**Slanted triangular learning rates**:

**Objective**: 
- The model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters

*How won't it be achieved?*
- By using the same learning rate (LR) or an annealed (gradually reducing) learning rate throughout training

*How can it be achieved?*
- **STLR**
 - first linearly increases the learning rate and then linearly decays it
 
 ![](../images/ULMFiT_STLR.png)
 ![](../images/ULMFiT_STLR_equations.png)

where 
- `T` is the number of training iterations
- `cut_frac` is the fraction of iterations during which we increase the LR
- `cut` is the iteration at which we switch from increasing to decreasing the LR 
- `p` is the fraction of the number of iterations we have increased or will decrease the LR respectively 
- `ratio` = $\frac{\eta_{max}}{\eta_{min}}$ specifies how much smaller the lowest LR is than the maximum LR $\eta_{max}$
- $\eta_t$ is the learning rate at iteration t
- The paper generally uses `cut_frac = 0.1`, `ratio = 32` and $\eta_{max} = 0.01$
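
The schedule can be written as a small function. This is a sketch of the STLR equations from the paper with the default values listed above; the function name `stlr` and the choice `T = 1000` are assumptions for the example, and it assumes `T * cut_frac >= 1` so that `cut` is not zero.

```python
import math

def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t out of T iterations."""
    cut = math.floor(T * cut_frac)   # iteration at which the LR peaks
    if t < cut:
        p = t / cut                  # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

schedule = [stlr(t, T=1000) for t in range(1001)]
print(schedule[0], schedule[100], schedule[1000])
```

The rate starts at `eta_max / ratio`, climbs to `eta_max` after the first 10% of the iterations, and then decays back to `eta_max / ratio` over the remaining 90%.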

### Novel Techniques - Part 2: 

#### Classifier FineTuning

**Concat Pooling**:
- As input documents can consist of hundreds of words, information may get lost if we only consider the last hidden state of the model
$$h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)]$$
where 
- $H = \{h_1, \ldots, h_T\}$ are the hidden states at all time steps
- $[\cdot]$ denotes concatenation
- $T$ is the last time step
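
In NumPy, concat pooling over a matrix of hidden states is a one-liner. The tiny matrix `H` below is made-up data, with one row per time step.

```python
import numpy as np

def concat_pool(H):
    """H has shape (T, hidden_size); returns a vector of length 3 * hidden_size."""
    return np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])

H = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [2.0,  4.0]])   # T = 3 time steps, hidden_size = 2
h_c = concat_pool(H)
print(h_c)
```

The max and mean are taken over the time axis, so a strong signal from any position in the document can reach the classifier even when the last hidden state no longer carries it.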

**Gradual Unfreezing**:
- Overly aggressive fine-tuning will cause catastrophic forgetting, eliminating the benefit of the information captured through language modeling
- Overly cautious fine-tuning will lead to slow convergence (and resultant overfitting)
- Besides discriminative fine-tuning and slanted triangular learning rates, the authors propose gradual unfreezing for fine-tuning the classifier

Idea: 
- Rather than fine-tuning all layers at once, which risks catastrophic forgetting, gradually unfreeze the model starting from the last layer, as this contains the least general knowledge
- First unfreeze the last layer and fine-tune all unfrozen layers for one epoch
- Then unfreeze the next lower frozen layer and repeat, until all layers are being fine-tuned, and continue until convergence at the last iteration
![](../images/ULMFiT_gradual_unfreeze.png)
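
The unfreezing order can be sketched as a simple schedule. The layer names and the number of epochs are hypothetical; a real training loop would use this list to decide which parameters receive gradient updates in each epoch.

```python
def gradual_unfreezing_schedule(layer_names, num_epochs):
    """For each epoch, return the layers that are trainable.

    layer_names is ordered from the first (most general) to the last layer.
    One more layer is unfrozen per epoch, starting from the last one.
    """
    schedule = []
    for epoch in range(num_epochs):
        n_unfrozen = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[-n_unfrozen:])
    return schedule

layers = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier"]
schedule = gradual_unfreezing_schedule(layers, num_epochs=6)
for epoch, trainable in enumerate(schedule, start=1):
    print(epoch, trainable)
```

From the fifth epoch onward every layer is trainable, which corresponds to fine-tuning the whole model until convergence.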


**Back Propagation through Time for Text Classification (BPT3C)**:

**Simple BPTT (for RNNs)**
![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTF6q2J0fxaYPaBKBK1AvL791-ztNyR-2KwSZmDFAsWp-TxUIJv)
- The red line indicates gradient propagation through time in the reverse direction
- Language models are trained with backpropagation through time (BPTT) to enable gradient propagation for large input sequences
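- **Why BPT3C** - To make fine-tuning a classifier for large documents feasible, the authors of ULMFiT proposed BPTT for Text Classification (BPT3C): the document is divided into fixed-length batches, each batch starts from the final state of the previous one, and the hidden states are kept for mean- and max-pooling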
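
The forward side of this idea can be sketched with a toy RNN: a long document is processed in fixed-length chunks, each chunk starts from the final hidden state of the previous one, and all hidden states are kept for pooling. This is only an assumption-laden illustration (a plain `tanh` RNN instead of an AWD-LSTM, random weights, hypothetical names and chunk size `b`), and it leaves out the backward pass entirely.

```python
import numpy as np

def rnn_chunk(x_chunk, h0, W_xh, W_hh):
    """Run a plain tanh RNN over one chunk and return every hidden state."""
    h, states = h0, []
    for x_t in x_chunk:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

def encode_document(x, b, W_xh, W_hh):
    """Process a document of shape (T, input_size) in chunks of b time steps."""
    h = np.zeros(W_hh.shape[0])
    kept = []
    for start in range(0, len(x), b):
        states = rnn_chunk(x[start:start + b], h, W_xh, W_hh)
        h = states[-1]       # the next chunk starts from this final state
        kept.append(states)  # hidden states are kept for max- and mean-pooling
    H = np.concatenate(kept)
    return np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])

rng = np.random.default_rng(0)
input_size, hidden_size, T = 3, 4, 10
W_xh = 0.5 * rng.standard_normal((hidden_size, input_size))
W_hh = 0.5 * rng.standard_normal((hidden_size, hidden_size))
x = rng.standard_normal((T, input_size))

chunked = encode_document(x, b=4, W_xh=W_xh, W_hh=W_hh)
whole = encode_document(x, b=T, W_xh=W_xh, W_hh=W_hh)
```

Carrying the state across chunks makes the chunked encoding identical to processing the whole document at once, while each chunk stays short enough to handle on its own.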

**Bidirectional LM**:
- Train both a forward LM and a backward LM, fine-tune a classifier for each one independently, and average the predictions of the two classifiers
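
A minimal sketch of the averaging step, assuming each classifier outputs a vector of class probabilities; the function name and the numbers are made up for the example.

```python
import numpy as np

def average_predictions(probs_forward, probs_backward):
    """Average the class probabilities of the forward and backward classifiers."""
    return (np.asarray(probs_forward) + np.asarray(probs_backward)) / 2.0

probs_fwd = [0.7, 0.2, 0.1]   # prediction of the classifier built on the forward LM
probs_bwd = [0.5, 0.4, 0.1]   # prediction of the classifier built on the backward LM
avg = average_predictions(probs_fwd, probs_bwd)
predicted_class = int(np.argmax(avg))
```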