# Transfer Learning

## Outline:
1. Transfer Learning  
1. BERT  
1. GPT-2 
1. T5


## Readings
1. https://jalammar.github.io/illustrated-gpt2/  
1. http://jalammar.github.io/illustrated-bert/
1. https://gluebenchmark.com/
1. https://medium.com/modern-nlp/transfer-learning-in-nlp-f5035cc3f62f  

# 1 Transfer Learning

## 1.1 Problem Statement

Why transfer learning? https://arxiv.org/pdf/2001.08361.pdf

<div><img src="images/trans.png" width="600"></div>

Types of transfer learning (subset)
1. Domain adaptation: a representational approach. It tries to reduce the gap between the source and target data distributions, either by finding features that are common to both domains or by representing both datasets in a shared low-dimensional space.
1. Multitask Learning
1. ...

Differences between source and target domains:
1. $X_1 \neq X_2$ The feature space of source and target is different.
1. $p(x_1) \neq p(x_2)$ The marginal probability distribution of words is different for source and target.
1. $Y_1 \neq Y_2$ Labels differ for source and target.
1. $p(y_1) \neq p(y_2)$ The marginal probability distribution of labels is different for source and target.
1. $p(y_1|x) \neq p(y_2|x)$ The conditional probability distribution of labels is different for source and target.
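
To make the second case in the list above concrete, here is a minimal sketch that estimates the marginal word distributions $p(x_1)$ and $p(x_2)$ for two domains and measures how far apart they are. The two toy corpora and the choice of total variation distance are assumptions made for illustration; they are not part of the original notes.

```python
from collections import Counter


def unigram_distribution(texts):
    """Estimate p(x) over words from whitespace-tokenized texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in vocab)


# Hypothetical toy corpora standing in for a source and a target domain.
source_texts = ["the movie was great", "the plot was boring"]
target_texts = ["the function returns a list", "the loop was slow"]

p_source = unigram_distribution(source_texts)
p_target = unigram_distribution(target_texts)

shift = total_variation(p_source, p_target)
only_in_target = sorted(set(p_target) - set(p_source))

print(f"TV distance between p(x_1) and p(x_2): {shift:.2f}")
print("words unseen in the source domain:", only_in_target)
```

A distance close to 1 means the two domains share little vocabulary mass, which is the situation where a model trained only on the source domain is likely to struggle.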

## 1.2 GLUE Benchmark

https://gluebenchmark.com/

<div><img src="images/glue.png" width="600"></div>

In [13]:
import datasets

# load_dataset returns a DatasetDict with one Dataset per split
df = datasets.load_dataset('glue', 'qnli')
df['train']

Reusing dataset glue (/Users/denaas/.cache/huggingface/datasets/glue/qnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



Dataset({
    features: ['question', 'sentence', 'label', 'idx'],
    num_rows: 104743
})

## 1.3 CodeXGLUE Benchmark

https://microsoft.github.io/CodeXGLUE/

<div><img src="images/codeglue.png" width="1000"></div>

In [12]:
import datasets

# load_dataset returns a DatasetDict with one Dataset per split
df = datasets.load_dataset('code_x_glue_tc_text_to_code')
df['train']

Reusing dataset code_x_glue_tc_text_to_code (/Users/denaas/.cache/huggingface/datasets/code_x_glue_tc_text_to_code/default/0.0.0/059898ce5bb35e72c699c69af37020002b38b251734ddaeedef30ae7e6292717)



Dataset({
    features: ['id', 'nl', 'code'],
    num_rows: 100000
})

# 2 BERT
## 2.1 Architecture

BERT consists only of encoder layers.

<div><img src="images/bert.png" width="1000"></div>

In [18]:
from transformers import BertConfig, BertModel


model = BertModel(BertConfig())
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

## 2.2 Training objectives
1. Masked Language Modeling 

<div><img src="images/bert_mlm.png" width="700"></div>

2. Next Sentence Prediction
<div><img src="images/bert_next.jpeg" width="700"></div>
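
The masked language modeling objective listed above can be sketched in a few lines. The version below assumes the masking recipe from the BERT paper: about 15% of tokens are selected, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name, the toy vocabulary, and the use of `None` as the "ignore" label are illustrative choices, not the original training code.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_for_mlm(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs, labels = [], []
    for token in tokens:
        if token not in SPECIAL_TOKENS and rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(token)               # 10%: keep the token unchanged
        else:
            inputs.append(token)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels


rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

# A higher mask_prob than 0.15 so that a short sentence is likely to show some masking.
inputs, labels = mask_for_mlm(tokens, vocab, rng, mask_prob=0.3)
for original, model_input, label in zip(tokens, inputs, labels):
    print(f"{original:>6} -> {model_input:>6}   label: {label}")
```

The loss is computed only at positions whose label is not `None`, so the model cannot simply copy its input.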

## 2.3 BERT Tuning

<div><img src="images/bert_tune.png" width="800"></div>

BERT for Sentence Classification

<div><img src="images/bert_cls.png" width="700"></div>
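
As a rough sketch of the idea in the figure above, sentence classification takes the final hidden state of the first token (`[CLS]`), applies a linear layer, and normalizes the result with a softmax. The code below is pure Python with made-up numbers; the hidden size of 4, the weights, and the function names are assumptions for illustration. The `transformers` implementation additionally passes the `[CLS]` vector through a pooling layer and dropout, which this sketch skips.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(hidden_states, weight, bias):
    """Apply a linear classification head to the [CLS] vector (position 0)."""
    cls_vector = hidden_states[0]
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Toy encoder output: 3 tokens, hidden size 4 (hypothetical values).
hidden_states = [
    [0.5, -1.0, 0.3, 2.0],   # [CLS]
    [0.1, 0.4, -0.2, 0.0],   # first word
    [1.2, -0.7, 0.9, 0.3],   # [SEP]
]
weight = [
    [0.2, -0.1, 0.0, 0.5],   # class 0
    [-0.3, 0.4, 0.1, -0.2],  # class 1
]
bias = [0.0, 0.1]

probs = classify_from_cls(hidden_states, weight, bias)
print(probs)
```

During fine-tuning, both the head and the encoder weights are updated with a cross-entropy loss on these probabilities.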

In [25]:
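from transformers import BertConfig, BertForQuestionAnswering, BertForSequenceClassification

# Same encoder as BertModel, plus a task-specific head (here: span prediction for question answering)
model = BertForQuestionAnswering(BertConfig())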
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

## 2.4 Special Tokens and Attention

In [39]:
# https://github.com/jessevig/bertviz
from bertviz import head_view
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_attentions=True)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [44]:
sentence_a = 'there was an animal playing on the floor'
sentence_b = 'it was a big black cat'

# Encode the pair as one sequence: [CLS] sentence_a [SEP] sentence_b [SEP]
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']  # 0 for sentence_a, 1 for sentence_b
# With output_attentions=True, the last model output holds the per-layer attention weights
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)  # index of the first sentence_b token
input_id_list = input_ids[0].tolist()  # batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

In [45]:
head_view(attention, tokens, sentence_b_start)

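
The role of the special tokens in the cell above can be shown without downloading a model. The sketch below packs a sentence pair the way BERT expects, `[CLS] A [SEP] B [SEP]`, and derives the segment ids and the start of the second sentence. Whitespace splitting stands in for the real WordPiece tokenizer, and `pack_pair` is a hypothetical helper, so the token boundaries are only an approximation.

```python
def pack_pair(tokens_a, tokens_b):
    """Build the token sequence and segment ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens_a = "there was an animal playing on the floor".split()
tokens_b = "it was a big black cat".split()

tokens, token_type_ids = pack_pair(tokens_a, tokens_b)
sentence_b_start = token_type_ids.index(1)  # same trick as in the cell above

print(tokens)
print(token_type_ids)
print("sentence B starts at position", sentence_b_start, "->", tokens[sentence_b_start])
```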

## 2.5 [RoBERTa](https://arxiv.org/abs/1907.11692)

* Static vs Dynamic Masking
* NSP Loss Importance
* Different Batch Sizes
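
The first point above, static versus dynamic masking, can be illustrated with a small sketch. With static masking the mask positions are sampled once during preprocessing and reused in every epoch; with dynamic masking a fresh mask is sampled each time the sequence is fed to the model. The helper name, the example sentence, and the masking probability are assumptions made for illustration.

```python
import random


def sample_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick the positions that will be masked in one pass over the sequence."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


tokens = "the quick brown fox jumps over the lazy dog near the river bank".split()
num_epochs = 4

# Static masking: sample once in preprocessing, reuse the same mask every epoch.
static_rng = random.Random(0)
static_mask = sample_mask_positions(len(tokens), static_rng, mask_prob=0.3)
static_masks = [static_mask for _ in range(num_epochs)]

# Dynamic masking: sample a new mask every time the sequence is seen.
dynamic_rng = random.Random(0)
dynamic_masks = [
    sample_mask_positions(len(tokens), dynamic_rng, mask_prob=0.3)
    for _ in range(num_epochs)
]

for epoch in range(num_epochs):
    print(f"epoch {epoch}: static={static_masks[epoch]} dynamic={dynamic_masks[epoch]}")
```

With dynamic masking the model sees different prediction targets for the same sentence across epochs, which matters more as the number of training passes grows.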

# 3 GPT
## 3.1 Architecture

GPT-2 consists only of decoder blocks.  
**Training objective**: Language Modeling  

<div><img src="images/gpt.png" width="900"></div>
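
The language modeling objective named above can be written down compactly: each position may attend only to itself and earlier positions, and the loss is the average negative log-likelihood of the next token. The sketch below uses plain Python lists and a hypothetical uniform "model" over a vocabulary of four tokens, so the numbers are illustrative only.

```python
import math


def causal_mask(num_tokens):
    """mask[i][j] == 1 if position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(num_tokens)] for i in range(num_tokens)]


def lm_loss(token_ids, next_token_probs):
    """Average negative log-likelihood of each next token.

    next_token_probs[t] is the predicted distribution over the vocabulary
    after reading token_ids[: t + 1].
    """
    num_predictions = len(token_ids) - 1
    nll = 0.0
    for t in range(num_predictions):
        nll -= math.log(next_token_probs[t][token_ids[t + 1]])
    return nll / num_predictions


token_ids = [2, 0, 3, 1]   # toy sequence over a vocabulary of size 4
vocab_size = 4
# Hypothetical model that predicts a uniform distribution at every step.
uniform_probs = [[1.0 / vocab_size] * vocab_size for _ in token_ids[:-1]]

print(causal_mask(3))
print("loss:", lm_loss(token_ids, uniform_probs))
```

A uniform model yields a loss of $\log 4$; training pushes the loss below that by assigning more probability to the tokens that actually follow.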

## 3.2 [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

<div><img src="images/gpt_few_shot.png" width="900"></div>

e.g. Machine Translation 

<div><img src="images/gpt_nmt.png" width="800"></div>
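
In the few-shot setting no weights are updated: the task description and a handful of demonstrations are placed in the prompt, and the model is asked to continue the pattern. The sketch below builds such a prompt as a plain string. The `=>` separator, the helper name, and the example pairs are illustrative assumptions; any consistent format could be used.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, solved examples and an unsolved query."""
    lines = [task_description]
    lines += [f"{source} => {target}" for source, target in demonstrations]
    lines.append(f"{query} =>")  # the model is expected to complete this line
    return "\n".join(lines)


demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt(
    "Translate English to French:",
    demonstrations,
    "peppermint",
)
print(prompt)
```

Passing an empty list of demonstrations gives the zero-shot variant, and a single pair gives the one-shot variant.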

In [29]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


# GPT-2 with a causal (left-to-right) language modeling head
model = AutoModelForCausalLM.from_pretrained('gpt2')
tok = AutoTokenizer.from_pretrained('gpt2')

input_token = torch.tensor([tok.encode('This time of year', add_special_tokens=True)])

# Sample five continuations; max_length counts the prompt tokens as well
output_tokens = model.generate(input_token, 
                               max_length=10,
#                                num_beams=5, 
                               do_sample=True,
                               temperature=1.3,
                               top_k=20,
                               top_p=0.8,
                               repetition_penalty=1.3,
                               num_return_sequences=5,
                              )

for output in output_tokens:
    print(tok.decode(output))

This time of year, we need to have a
This time of year when the water has been very
This time of year, a woman who wears the
This time of year you can't have too much
This time of year they are on their own,
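
The `temperature`, `top_k` and `top_p` arguments used above reshape the next-token distribution before sampling. The sketch below is a simplified pure-Python version of that idea, not the library's implementation: it divides the logits by the temperature, keeps the `top_k` most likely tokens, then keeps the smallest prefix whose renormalized probability mass reaches `top_p`. The toy logits and function names are assumptions, and `repetition_penalty` is not modeled.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]              # keep only the k most likely tokens
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:                          # nucleus: smallest prefix reaching top_p
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_next_token(logits, rng, **kwargs):
    dist = candidate_distribution(logits, **kwargs)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[i] for i in token_ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
dist = candidate_distribution(logits, temperature=1.3, top_k=3, top_p=0.8)
token = sample_next_token(logits, random.Random(0), temperature=1.3, top_k=3, top_p=0.8)
print(dist)
print("sampled token id:", token)
```

A temperature above 1 flattens the distribution and makes samples more diverse, while smaller `top_k` and `top_p` values cut off the unlikely tail.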


## 3.3 Dataset Memorization

[Extracting Training Data from Large Language Models](https://arxiv.org/abs/2012.07805)

<div><img src="images/gpt_mem.png" width="600"></div>

# 4 [T5](https://arxiv.org/abs/1910.10683)
## 4.1 Architecture

Encoder-decoder transformer architecture with shared vocabulary

<div><img src="images/t5_arch.png" width="700"></div>

## 4.2 Training Objectives

Unsupervised Objectives:
1. Language Modeling
1. Masked Language Modeling
1. Denoising Autoencoder (Deshuffling) 

<div><img src="images/t5_loss.png" width="800"></div>
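
Each of the unsupervised objectives listed above can be phrased as a mapping from a corrupted input to a target, which is what lets one encoder-decoder model train on all of them. The sketch below builds (input, target) pairs for the three objectives. It assumes the prefix variant of language modeling, a simplified masking step that only inserts a `<M>` placeholder, and hypothetical function names; it is not the T5 preprocessing code.

```python
import random


def prefix_lm(tokens, split):
    """Language modeling: read a prefix, generate the rest."""
    return tokens[:split], tokens[split:]


def masked_lm(tokens, rng, mask_prob=0.15, mask_token="<M>"):
    """Masked language modeling: hide some tokens, reconstruct the full text."""
    inputs = [mask_token if rng.random() < mask_prob else t for t in tokens]
    return inputs, list(tokens)


def deshuffling(tokens, rng):
    """Deshuffling: restore the original order of a shuffled sequence."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled, list(tokens)


rng = random.Random(0)
tokens = "Thank you for inviting me to your party last week .".split()

examples = {
    "language modeling": prefix_lm(tokens, split=4),
    "masked language modeling": masked_lm(tokens, rng),
    "deshuffling": deshuffling(tokens, rng),
}
for name, (inputs, targets) in examples.items():
    print(f"{name}\n  input:  {' '.join(inputs)}\n  target: {' '.join(targets)}")
```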

In [32]:
from transformers import T5Config, T5Model

model = T5Model(T5Config())
model

T5Model(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, inplace

## 4.3 Multitask Learning

<div><img src="images/multi1.png" width="500"></div>

Why does it work?  
1. **Implicit data augmentation**: Effectively, MTL increases the amount of training data available to the model.
1. **Implicit feature selection**: Training on multiple tasks can teach the model to focus on the most relevant features, which can lead to a better model.
1. **Representation bias**: MTL forces the model to learn representations that are useful for all tasks. This helps the model adapt faster to new tasks, since a representation that works for many tasks is likely to work for a new one as well.
1. **Regularization**: MTL acts as a regularizer by introducing an inductive bias, which reduces the Rademacher complexity of the model.
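
One simple way to realize multitask training in a text-to-text setting is to prepend a task prefix to every input and draw training examples from all tasks into a single stream. The sketch below does this with sampling proportional to dataset size. The task names, the toy examples, and the proportional sampling rule are assumptions chosen for illustration; other mixing strategies are possible.

```python
import random


def mix_tasks(task_datasets, num_steps, rng):
    """Yield (prefixed_input, target) pairs, sampling tasks by dataset size."""
    names = list(task_datasets)
    weights = [len(task_datasets[name]) for name in names]
    stream = []
    for _ in range(num_steps):
        name = rng.choices(names, weights=weights, k=1)[0]
        source, target = rng.choice(task_datasets[name])
        stream.append((f"{name}: {source}", target))  # the prefix tells the model which task to solve
    return stream


# Hypothetical toy datasets; real ones would hold thousands of examples.
task_datasets = {
    "translate English to German": [
        ("That is good.", "Das ist gut."),
        ("Thank you.", "Danke."),
    ],
    "sentiment": [
        ("The movie was great.", "positive"),
        ("The plot was boring.", "negative"),
        ("I would watch it again.", "positive"),
    ],
}

stream = mix_tasks(task_datasets, num_steps=6, rng=random.Random(0))
for model_input, target in stream:
    print(f"{model_input!r} -> {target!r}")
```

Because a single set of shared parameters is updated on every example in the stream, each task effectively benefits from the data of the others, which is the implicit data augmentation effect described in the list above.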