# Transfer Learning

## Outline:
1. Transfer Learning
1. ELMo  
1. BERT  
1. GPT-2 
1. Natural Language Understanding
1. transformers lib


## Readings
1. https://jalammar.github.io/illustrated-gpt2/  
1. http://jalammar.github.io/illustrated-bert/
1. https://gluebenchmark.com/
1. https://medium.com/modern-nlp/transfer-learning-in-nlp-f5035cc3f62f  
1. https://lena-voita.github.io/posts/emnlp19_evolution.html

## 1 Transfer Learning

Why transfer learning? https://arxiv.org/pdf/2001.08361.pdf

<img src="images/trans.png" style="height:300px">

Types of transfer learning (subset)
1. Domain adaptation: a representational approach. It tries to align the source and target data distributions, either by finding features that are common to both domains or by representing both domains in a shared low-dimensional space.
1. Multitask Learning
1. ...

Differences between source and target domains:
1. The feature space of source and target is different.
1. The marginal probability distribution of words is different for source and target.
1. Labels differ for source and target.
1. The marginal probability distribution of labels is different for source and target.
1. The conditional probability distribution of labels is different for source and target.
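
These five cases can be written compactly in the usual domain/task notation. The symbols are assumptions introduced here for illustration: a domain is a feature space $\mathcal{X}$ with a marginal distribution $P(X)$, a task is a label space $\mathcal{Y}$ with a conditional distribution $P(Y \mid X)$, and the subscripts $S$ and $T$ mark source and target.

$$
\begin{aligned}
\text{1. } & \mathcal{X}_S \neq \mathcal{X}_T \\
\text{2. } & P(X_S) \neq P(X_T) \\
\text{3. } & \mathcal{Y}_S \neq \mathcal{Y}_T \\
\text{4. } & P(Y_S) \neq P(Y_T) \\
\text{5. } & P(Y_S \mid X_S) \neq P(Y_T \mid X_T)
\end{aligned}
$$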

### MultiTask Learning

<img src="images/multi1.png" style="height:300px">

Why?  
1. Implicit data augmentation: Effectively, MTL increases the training data for our model.
1. Implicit feature selection: training on multiple tasks can teach the model to focus on the most relevant features and can lead to a better model.
1. Representation bias: MTL forces the model to learn representations that are useful for all tasks. This helps the model generalize faster to new tasks, since a representation that works for many tasks is likely to work for a new one as well.
1. Regularization: MTL acts as a regularizer by introducing an inductive bias, and it reduces the Rademacher complexity of the model.
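
One common way to get these effects is hard parameter sharing: a single encoder feeds one small head per task, and the task losses are summed. The sketch below is only an illustration under assumptions; the class name `SharedEncoderMTL`, the task names, the layer sizes and the random toy data are all hypothetical.

```python
import torch
from torch import nn


class SharedEncoderMTL(nn.Module):
    """Hard parameter sharing: one shared encoder, one linear head per task."""

    def __init__(self, in_dim, hidden_dim, task_num_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, n) for name, n in task_num_classes.items()}
        )

    def forward(self, x, task):
        # Every task goes through the same encoder, then its own head.
        return self.heads[task](self.encoder(x))


torch.manual_seed(0)
model = SharedEncoderMTL(16, 32, {"sentiment": 2, "topic": 5})
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Toy batches (random features and labels), one per task.
batches = {
    "sentiment": (torch.randn(8, 16), torch.randint(0, 2, (8,))),
    "topic": (torch.randn(8, 16), torch.randint(0, 5, (8,))),
}

optimizer.zero_grad()
# Summing the task losses sends gradients from both tasks into the encoder.
total_loss = sum(loss_fn(model(x, task), y) for task, (x, y) in batches.items())
total_loss.backward()
optimizer.step()
```

Because the encoder receives gradients from every task, each task effectively sees more training signal than it would alone, which is the data augmentation and regularization argument from the list above.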


### Optimization
**Catastrophic forgetting**: fine-tuning a pretrained model directly on your downstream task can destroy information the model learned during pretraining.

**Downstream task**: the task you fine-tune on.

Optimization Schemes:
1. feature extraction = do not change the pretrained weights 
1. fine-tuning = change the pretrained weights
    1. Progressively in time (freezing layers and unfreezing them gradually)
    1. Progressively in intensity (lower learning rates)
    1. Progressively vs. a pretrained model (regularization that keeps the weights close to the pretrained ones)
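
A minimal PyTorch sketch of these schemes, under stated assumptions: the two-layer `model` is a toy stand-in for a pretrained network, `set_trainable` and `distance_to_pretrained` are hypothetical helpers, and the L2 penalty toward the pretrained weights is just one possible choice of regularizer.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in: "body" plays the pretrained encoder, "head" the new task layer.
model = nn.ModuleDict({"body": nn.Linear(16, 16), "head": nn.Linear(16, 2)})

# Frozen copy of the pretrained weights, used only as a reference point.
pretrained_body = copy.deepcopy(model["body"])
for p in pretrained_body.parameters():
    p.requires_grad = False


def set_trainable(module, flag):
    """Freeze (flag=False) or unfreeze (flag=True) all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag


def distance_to_pretrained(body, reference):
    """Squared L2 distance between current and pretrained weights."""
    return sum(((p - q) ** 2).sum() for p, q in zip(body.parameters(), reference.parameters()))


# Feature extraction (or the first phase of freezing): only the head trains.
set_trainable(model["body"], False)

# Progressively in time: a later phase unfreezes the pretrained layers.
set_trainable(model["body"], True)

# Progressively in intensity: smaller learning rate for the pretrained layers.
optimizer = torch.optim.SGD([
    {"params": model["body"].parameters(), "lr": 1e-4},
    {"params": model["head"].parameters(), "lr": 1e-2},
])

# Progressively vs. a pretrained model: penalize drifting from the reference.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
logits = model["head"](torch.relu(model["body"](x)))
loss = nn.functional.cross_entropy(logits, y)
loss = loss + 0.01 * distance_to_pretrained(model["body"], pretrained_body)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

All three fine-tuning variants limit how far the pretrained weights can move, which is how they reduce catastrophic forgetting.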
    
    
Evolution of Transformer layer representations under different training objectives, from https://arxiv.org/abs/1909.01380:
1. with the language model objective, as you go from bottom to top layers, information about the past gets lost and predictions about the future get formed;
1. for masked language model, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top layer;
1. for machine translation, though representations get refined with context, less processing is happening and most information about the word type does not get lost.

## 2 ELMo


https://arxiv.org/pdf/1802.05365

**Training objective**: Language Modeling  (forward and backward)

**Downstream tasks**:

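Embedding of the k-th token for a specific task:
$$ ELMo^{task}_k = \gamma^{task} \sum_{j=0}^L s^{task}_j h_{k,j}^{LM}$$
where  
$k$ - index of the token  
$h_{k,j}^{LM}$ - representation of the k-th token at layer $j$ of the language model  
$s^{task}_j$ - learnable softmax-normalized weights, one per layer  
$\gamma^{task}$ - learnable scaling parameter  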
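
The weighted sum above can be worked through by hand on toy numbers. This is a plain-Python sketch, not the AllenNLP implementation; `softmax` and `elmo_task_embedding` are hypothetical helper names and the layer vectors are made up.

```python
import math


def softmax(weights):
    """Normalize raw weights so they are positive and sum to 1."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_task_embedding(layer_vectors, raw_weights, gamma):
    """Combine the L+1 layer vectors of one token into a single task embedding.

    layer_vectors: list of L+1 vectors (one per layer j) for token k
    raw_weights:   unnormalized per-layer weights; softmax turns them into s_j
    gamma:         scalar that rescales the whole mixture
    """
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_vectors))
        for d in range(dim)
    ]


# Toy token with 3 layers (j = 0, 1, 2) of 2-dimensional representations.
h_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights give s_j = 1/3 for every layer.
embedding = elmo_task_embedding(h_k, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
```

In a real model the raw weights and `gamma` are trained together with the downstream task, so each task can decide which layers it relies on.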

<img src="images/elmo.png" style="height:400px">

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# Compute two different representations for each token.
# Each representation is a learned weighted combination of the
# 3 layers in ELMo (i.e., the char CNN and the outputs of the two BiLSTM layers).
elmo = Elmo(options_file, weight_file, num_output_representations=2, dropout=0)

# Use batch_to_ids to convert tokenized sentences to character ids.
sentences = [['First', 'sentence', '.', 'custom', 'service'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

# Returns a dict with 'elmo_representations' (a list with one tensor per
# requested representation) and 'mask'.
embeddings = elmo(character_ids)
```

## 3 BERT

BERT consists only of encoder layers.

<img src="images/bert.png" style="height:300px">


Training objectives:
1. Masked Language Model
2. Next Sentence Prediction
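
For the masked language model objective, the BERT paper selects about 15% of the token positions as prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sketch below illustrates that corruption step on word-level tokens. The function `mask_tokens`, the tiny vocabulary and the example sentence are hypothetical; this is not the original preprocessing code.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, targets); targets are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        # Special tokens are never selected; other tokens with prob. mask_prob.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(token)
            targets.append(None)
            continue
        targets.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(token)              # 10%: keep the token
    return corrupted, targets


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["dog", "ran", "tree"]
corrupted, targets = mask_tokens(tokens, vocab)
```

The loss is computed only at positions where `targets` is not `None`, so the model cannot rely on seeing `[MASK]` to know which tokens matter.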

<img src="images/bert_training.png" style="height:300px">


Classification with BERT
<img src="images/bert_cls.png" style="height:300px">



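```python
from transformers import BertForSequenceClassification, AutoTokenizer
import torch

# The encoder weights are pretrained; the classification head on top is newly
# initialized and has to be fine-tuned before its predictions are meaningful.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = 'example_text'

# add_special_tokens=True wraps the token ids in [CLS] (101) ... [SEP] (102).
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
print(input_ids)

with torch.no_grad():
    # For a sequence classification model the first output holds the logits
    # (shape: batch_size x num_labels), not the last hidden states.
    logits = model(input_ids)[0]
```

```
tensor([[ 101, 2742, 1035, 3793,  102]])
```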


## 4 GPT-2

GPT-2 consists only of decoder blocks.  
Training objective: Language Model  

<img src="images/gpt.png" style="height:300px">

Downstream Tasks:
Essentially, every downstream task is recast as text so that it fits the language modeling objective.
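
As a rough illustration, the GPT-2 paper conditions the model on `english sentence = french sentence` pairs for translation and appends `TL;DR:` to an article for summarization; the model's continuation is then read off as the answer. The helper names below are hypothetical and only build the prompt strings, they do not call a model.

```python
def translation_prompt(examples, source_sentence):
    """Format example pairs as 'source = target' lines, ending with an open 'source ='."""
    lines = [f"{src} = {tgt}" for src, tgt in examples]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)


def summarization_prompt(article):
    """Append the 'TL;DR:' cue so that the continuation reads as a summary."""
    return f"{article}\nTL;DR:"


prompt = translation_prompt(
    [("good morning", "bonjour"), ("thank you", "merci")],
    "see you tomorrow",
)
print(prompt)
```

The language model then simply continues the text, so no task-specific output layer is needed.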

<img src="images/gpt2.png" style="height:300px">

<img src="images/gpt3.png" style="height:300px">

e.g. Machine Translation 

<img src="images/gpt4.png" style="height:300px">

## 5 Natural Language Understanding

Most modern models are evaluated on downstream tasks (i.e., via transfer learning), regardless of their training objective.

<img src="images/nlu.jpeg" style="height:300px">

"The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems"