<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/Transfer_learning_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of contents

1. [Introduction](#1)
2. [Introduction to Transfer Learning in NLP](#2)
3. [ELMo](#3)
4. [ULMFiT](#4)
5. [GPT](#5)
    1. [Unsupervised pre-training](#5.1)
    2. [Supervised fine-tuning](#5.2)
    3. [Input transformations](#5.3)
    4. [Byte pair encoding](#5.4)
6. [BERT](#6)
    1. [Pre-training](#6.1)
        1. [Input representation](#6.1.1)
    2. [Fine-tuning](#6.2)
    3. [Feature-based approach](#6.3)
7. [ALBERT](#7)
    1. [Factorized embedding parameterization](#7.1)
    2. [Cross-layer parameter sharing](#7.2)
    3. [Sentence-order prediction](#7.3)
8. [GPT-2](#8)
    1. [Zero-shot transfer](#8.1)
    2. [BPE on byte sequences](#8.2)
    3. [Model modifications](#8.3)
    4. [One difference from BERT](#8.4)
9. [RoBERTa](#9)
10. [References](#10)

# Introduction <a name="1"></a>

Two important breakthroughs that have provided significant impetus to the natural language processing (NLP) domain are the arrival of *Transfer Learning* and rapid improvements in the performance of *Language models*. 

In this notebook, we seek to discuss the recent strides made in Transfer learning. An introduction to Language models is available in this [notebook](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Char_RNN_with_Keras.ipynb).

# Introduction to transfer learning in NLP <a name="2"></a>

*Transfer learning* is a technique in which a neural network is pre-trained on a general task and then fine-tuned on a specific task. This allows deep learning models to converge faster and with relatively little fine-tuning data. Transfer learning has had a large impact on computer vision (CV). CV models (including object detection, classification, and segmentation) are rarely trained from scratch; instead they are fine-tuned from models that have been pre-trained on ImageNet, MS-COCO, and other datasets. With recent advances in natural language processing (NLP), it has become possible to perform transfer learning in this domain as well.

While Deep Learning models have achieved state-of-the-art on many NLP tasks, these models are trained from scratch, requiring large datasets, and days to converge. Fine-tuning pre-trained **word embeddings**, a simple transfer technique that only targets a model’s first layer, has had a large impact in practice and is used in most state-of-the-art models (a review of different text vectorization techniques, including word embeddings, is available in this [notebook](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Text_Vectorization_NLP.ipynb)). Recent approaches that concatenate embeddings derived from other tasks with the input at different layers ([ELMo](https://arxiv.org/abs/1802.05365)) still train the main task model from scratch and treat pre-trained embeddings as fixed parameters, limiting their usefulness.

In light of the benefits of pretraining, we should be able to do better than randomly initializing the remaining parameters of our models. Transfer learning can be used for applications where there is a lack of a large training set. The target dataset should ideally be related to the pre-training dataset for effective transfer learning. This type of training is generally referred to as *semi-supervised training* where the neural network is first trained as a language model on a general dataset followed by supervised training on a labeled training dataset thus establishing a dependence of supervised fine-tuning on unsupervised language modeling.

Mou et al. in 2016 ([paper](https://arxiv.org/abs/1603.06111)) and Dai and Le in 2015 ([paper](https://arxiv.org/abs/1511.01432)) proposed fine-tuning a language model (LM), but their approaches require millions of in-domain documents to achieve good performance, which severely limits their applicability. A lack of knowledge about how to fine-tune LMs effectively has hindered wider adoption: LMs overfit to small datasets and suffer catastrophic forgetting when fine-tuned with a classifier. Compared to CV models, NLP models are typically shallower and thus require different fine-tuning methods.

# ELMo <a name="3"></a>

*Embeddings from Language Models* ([*ELMo*](https://arxiv.org/abs/1802.05365)) learns contextualized word representations by pre-training a language model in an unsupervised way. Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. Therefore, the same word can have different word vectors in different contexts. They are computed on top of a two-layer *bidirectional language model* (biLM).

In a previous [notebook](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Char_RNN_with_Keras.ipynb) we introduced language models. Given a sequence of $T$ tokens, $w_1,w_2, …,w_T$, a
forward language model computes the probability
of the sequence by modeling the probability of token $w_k$ given the history $w_1, ..., w_{k−1}$:

$$P(w_1,w_2,...,w_T)=\prod_{k=1}^{T}P(w_k|w_1,...,w_{k-1})$$

A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$P(w_1,w_2,...,w_T)=\prod_{k=1}^{T}P(w_k|w_{k+1},...,w_{T})$$

The biLM combines both a forward and backward LM, and jointly maximizes the log-likelihood of the forward and backward directions.

This biLM has two layers stacked together. Each layer has two passes, a forward pass and a backward pass:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/03/output_YyJc8E.gif)

- The architecture above uses a character-level CNN to convert the words of a text string into raw word vectors.

- These vectors act as inputs to the first layer of biLM. 

- The forward pass contains information about a certain word and the context words **before** that word. 

- The backward pass contains information about the word and the context words **after** it. 

- This pair of information, from the forward and backward pass, forms the intermediate word vectors. 

- These intermediate word vectors are fed into the next layer of biLM. 

- The final representation (ELMo) is the weighted sum of the raw word vectors and the 2 intermediate word vectors.
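
As a rough illustration of the last step, the sketch below combines three per-token layer vectors (the raw word vectors and the two intermediate layers) with softmax-normalized weights and a scalar scale, which is how the ELMo paper parameterizes the weighted sum. The array shapes, the random values, and the names `elmo_combine`, `s_logits`, and `gamma` are assumptions made for this example.

```python
import numpy as np

def elmo_combine(layers, s_logits, gamma=1.0):
    """Weighted sum of biLM layers.

    layers:   array of shape (num_layers, seq_len, dim)
    s_logits: unnormalized task-specific layer weights, shape (num_layers,)
    gamma:    task-specific scalar that rescales the whole vector
    """
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                      # softmax-normalized layer weights
    # Contract the layer axis: the result has shape (seq_len, dim)
    return gamma * np.tensordot(s, layers, axes=1)

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 5, 4))      # 3 layers, 5 tokens, 4 dimensions (toy sizes)
elmo = elmo_combine(layers, s_logits=np.zeros(3))
```

With all logits equal, the combination reduces to a plain average of the layers; in practice the weights and the scale would be learned together with the downstream task.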


Given a pre-trained biLM and a supervised architecture for a target NLP task, we can use the biLM to improve the task model.

Most supervised NLP models share a common architecture at the lowest layers (for example, sequence labeling tasks). Given a sequence of tokens $(t_1,...,t_N)$, it is standard to form a context-independent token representation $\boldsymbol{x}_k$ for each token position using pre-trained word embeddings and optionally character-based representations. Thus, we can simply concatenate the ELMo vector with $\boldsymbol{x}_k$ and pass this representation as input to the lowest layer of the target NLP task.

Most supervised NLP models then form a context-sensitive representation $\boldsymbol{h}_k$, typically using bidirectional RNNs, CNNs, etc. For some tasks, the authors observe further improvements by also concatenating an ELMo vector with $\boldsymbol{h}_k$ and passing it to the output layer of the target task.

# ULMFiT <a name="4"></a>

*Universal Language Model Fine-tuning* ([*ULMFiT*](https://arxiv.org/abs/1801.06146)) is a method to fine-tune a pre-trained language model. It was one of the forerunners of inductive transfer learning in NLP. Given a static source task $T_S$ and any target task $T_T$ with $T_S\neq T_T$, we would like to improve performance on $T_T$. **Language modeling** can be seen as the ideal source task since it captures many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations,  and sentiment. In contrast to tasks like Machine Translation (MT) and entailment, it provides data in near-unlimited quantities for most domains and languages.  Moreover, language modeling already is a key component of existing tasks such as MT and dialogue modeling.

The proposed ULMFiT pre-trains a language model (LM) on a large general-domain corpus and fine-tunes it on the target task using different techniques. The architecture uses a regular LSTM (with no attention, no short-cut connections) with various tuned dropout hyperparameters. ULMFiT consists of the following three steps:

![](https://i.ibb.co/18JTTMH/ULMFi-T-steps.png)

- **Generic Pretraining of the Language Model**: The corpus for language modeling should be large and capture general properties of language (the authors used WikiText-103, a large general-purpose dataset). While this stage is the most expensive, it only needs to be performed once and improves performance and convergence of downstream models.

- **Fine-tuning the Language Model on the Target task**: The data of the target task will likely come from a different distribution than the general-domain data used for pretraining.  We thus fine-tune the LM on data of the target task. Given a pre-trained general-domain LM, this stage converges faster as it only needs to adapt to the target data, and it allows us to train a robust LM even for small datasets. They proposed two novel techniques for fine-tuning the LM:

 - *Discriminative fine-tuning*: Based on the idea that different layers should be fine-tuned differently since they capture different types of information, instead of using the same learning rate for all layers of the model, it allows us to tune each layer with different learning rates. 
 
 - *Slanted Triangular learning rates*: it first linearly increases the learning rate and then linearly decays it according to a specific schedule in order to allow the model to quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters.


-  **Fine-Tuning the Classifier on Target Task**: Finally, the pre-trained language model is augmented with two additional linear blocks for fine-tuning the classifier. The parameters in these task-specific classifier layers are the only ones that are learned from scratch. Because the signal in a document may be carried by only a few words, the input to this classifier is a concatenation of the last hidden state with the average-pooled and max-pooled outputs of the previous hidden states.

 - *Gradual Unfreezing*: Rather than fine-tuning all layers at once, which risks catastrophic forgetting, they first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. They then unfreeze the next lower frozen layer and repeat, until all layers are being fine-tuned and training converges at the last iteration.
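
The two learning-rate techniques above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the schedule follows the formula given in the ULMFiT paper, and the default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`, and a per-layer divisor of 2.6) are the ones reported there as far as we know. The function names and the step count are assumptions made for the example.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t (0-based) under a slanted triangular schedule."""
    cut = math.floor(total_steps * cut_frac)   # step at which the rate peaks
    if t < cut:
        p = t / cut                            # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_top, num_layers, decay=2.6):
    """Per-layer rates, lowest layer first; each lower layer gets lr / decay."""
    return [lr_top / decay ** (num_layers - 1 - i) for i in range(num_layers)]

schedule = [slanted_triangular_lr(t, total_steps=1000) for t in range(1000)]
layer_lrs = discriminative_lrs(lr_top=0.01, num_layers=4)
```

The schedule rises during the first 10% of the steps and decays over the remaining 90%, while `layer_lrs` assigns the smallest rate to the lowest layer, which holds the most general features.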

The authors train both a forward LM and a backward LM, and average the predictions given by the two language models.

# GPT <a name="5"></a>

*Improving Language Understanding by Generative Pre-Training* ([*GPT*](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf)) explores a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. The goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. They use a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Their setup does not require these target tasks to be in the same domain as the unlabeled corpus.

They evaluate their approach on different types of language understanding tasks such as question answering, semantic similarity, or text classification.

### Unsupervised pre-training <a name="5.1"></a>

Given an unsupervised corpus of tokens $U=\{u_1,u_2, …,u_n\}$, a standard language modeling objective is used to maximize the log-likelihood (same as ELMo, but without backward computation):

$$L_1(U)=\sum_{i}\log P(u_i|u_{i-k},...,u_{i-1};\Theta)$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$.

They use a multi-layer Transformer decoder for the language model, which is a variant of the transformer in which the encoder part is discarded (an introduction to the transformer architecture is available in this [notebook](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/NLP_Attention_and_Transformer_architecture.ipynb)). It provides a more structured memory for handling long-term dependencies in text, compared to alternatives like RNNs. This model applies multiple transformer blocks over the embeddings of input sequences. Each block contains a *masked multi-headed self-attention* layer and a *position-wise feed-forward* layer. The final output produces a distribution over target tokens after softmax normalization.

![](https://i.ibb.co/QjdCkMh/transformer-decoder-GPT.png)

One **limitation** of GPT is its uni-directional nature. The model is only trained to predict the future left-to-right context.

### Supervised fine-tuning <a name="5.2"></a>

The most substantial upgrade that OpenAI GPT proposed is to get rid of the task-specific model and use the pre-trained language model directly.

After pre-training, the model's parameters are adapted to the supervised target task.

We assume a labeled dataset $C$ and a classification task as an example. Each instance consists of a sequence of input tokens, $\boldsymbol{x}=(x^1,...,x^m)$, and a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block’s activation $\boldsymbol{h}_l^{(m)}$, which is then fed into an added linear output layer with parameters $\boldsymbol{W}_y$ to predict a distribution over class labels:

$$P(y|x^1,...,x^m) = \text{softmax}(\boldsymbol{h}_l^{(m)}\boldsymbol{W}_y)$$

The objective to maximize is the log-likelihood for true labels:

$$L_2(C)=\sum_{(x,y)}\log P(y|x^1,...,x^m)$$

They additionally add the LM loss as an auxiliary objective to the fine-tuning because it helps accelerate convergence during training and improves the generalization of the supervised model. Specifically, they optimize the following objective (with weight $\lambda$):

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$

The only extra parameters required during fine-tuning are $\boldsymbol{W}_y$, and embeddings for delimiter tokens.
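
To make the fine-tuning objective concrete, here is a small NumPy sketch of the added output layer and the combined objective. It is only an illustration: the hidden states are random stand-ins for the final transformer block's activations, the LM log-likelihood `L1` is a made-up placeholder number, and the sizes and the value of `lam` are assumptions for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
hidden, num_classes, batch = 8, 3, 4

h_last = rng.normal(size=(batch, hidden))      # stand-in for h_l^(m) of each example
W_y = rng.normal(size=(hidden, num_classes))   # the added linear output layer
labels = np.array([0, 2, 1, 2])

probs = softmax(h_last @ W_y)                  # P(y | x^1, ..., x^m) for each example
L2 = np.log(probs[np.arange(batch), labels]).sum()   # supervised log-likelihood

L1 = -25.0        # placeholder LM log-likelihood on the same inputs (made up)
lam = 0.5         # auxiliary weight; an assumption for this sketch
L3 = L2 + lam * L1
```

Both terms are log-likelihoods, so the combined quantity `L3` is maximized during fine-tuning.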

### Task-specific input transformations <a name="5.3"></a>

Since the pre-trained model was trained on contiguous sequences of text, some modifications in the inputs are required to apply fine-tuning to tasks like question answering or textual entailment since they have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers.

They use a traversal-style approach ([paper](https://arxiv.org/abs/1509.06664)), where structured inputs are converted into an ordered sequence of tokens that the pre-trained model can process. These input transformations make it possible to fine-tune effectively across tasks with minimal changes to the architecture of the pre-trained model.
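
A minimal sketch of such transformations is shown below, following our reading of the GPT paper: sequences are wrapped in start and extract tokens, and the parts of a structured input are joined by a delimiter token. The token strings and the function names are hypothetical and chosen only for the example.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"   # hypothetical special tokens

def entailment_input(premise, hypothesis):
    """Ordered sentence pair: premise, delimiter, hypothesis."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(a, b):
    """No inherent ordering, so both orders are built and processed separately."""
    return [[START, *a, DELIM, *b, EXTRACT], [START, *b, DELIM, *a, EXTRACT]]

def multiple_choice_inputs(document, question, answers):
    """One sequence per candidate answer, each scored independently."""
    return [[START, *document, *question, DELIM, *ans, EXTRACT] for ans in answers]

pair = entailment_input(["it", "rains"], ["it", "is", "wet"])
```

In every case the pre-trained model only ever sees a flat token sequence, so no task-specific architecture is needed beyond the output layer.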


### Byte Pair Encoding <a name="5.4"></a>

*Byte Pair Encoding* ([*BPE*](https://arxiv.org/abs/1508.07909)) is used to encode the input sequences. BPE was originally proposed as a data compression algorithm in the 1990s and then was adopted to solve the open-vocabulary issue in machine translation, as we can easily run into rare and unknown words when translating into a new language. Motivated by the intuition that rare and unknown words can often be decomposed into multiple subwords, BPE finds the best word segmentation by iteratively and greedily merging frequent pairs of characters. It is a middle ground between character and word-level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
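
The greedy merging procedure can be sketched as follows. This is a toy version in the spirit of the reference code from the BPE paper, not the tokenizer actually used by GPT: words are stored as space-separated symbols with an end-of-word marker `</w>`, and the small vocabulary with its counts is made up for the example.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    # Lookarounds ensure the match starts and ends on whole-symbol boundaries
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(4):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)    # most frequent pair (first one on ties)
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

After a few merges, frequent endings such as `est</w>` become single symbols, while rare words remain split into smaller units. The number of merges controls the final vocabulary size.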

# BERT <a name="6"></a>

*Bidirectional Encoder Representations from Transformers* ([BERT](https://arxiv.org/abs/1810.04805)) is a direct descendant of GPT: train a large language model on unlabeled text and then fine-tune it on specific tasks without customized network architectures. Compared to GPT, the largest improvement of BERT is that training is bidirectional: the model learns to predict a token using context on both the left and the right.


BERT’s model architecture is a multi-layer bidirectional **Transformer** encoder based on the original implementation [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

![](https://i.ibb.co/Q8fPnSv/transformer-encoder-architecture.png)

The authors present two model sizes: *BERT BASE*, which is comparable in size to OpenAI GPT so that performance can be compared, and *BERT LARGE*, a much larger model that achieved the state-of-the-art results reported in the paper. Just like the vanilla transformer encoder, BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes the result through a feed-forward network, and hands it off to the next encoder layer. The resulting output vectors carry deep contextualized information about each word in the document.

There are two steps: pre-training and fine-tuning. 

![](https://i.ibb.co/K2cFK6h/bert-usage2.png)

### Pre-training <a name="6.1"></a>

During pre-training, the model is trained on unlabeled data over two unsupervised tasks:

- *Masked Language Model (MLM)*: 15% of tokens in the text are randomly replaced by the `[MASK]` token. A "language model head" is appended to the output of the last BERT hidden layer. The training process tries to minimize the cross-entropy loss of predicting the original word at `[MASK]` position. 

 ![](https://i.ibb.co/k8T3zDx/bert-pretaining.png)

 This procedure is similar to a typical foreign-language test exercise in which a sentence is given with a missing word and you are asked which word fits best. The task makes sense because, to decide which word belongs at the masked position, the model has to look at all the other words in the document, and so it captures the context of the document.

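 Beyond masking 15% of the input, BERT also mixes things up a bit to improve how the model later fine-tunes. Of the selected positions, most are replaced with `[MASK]`, but some are replaced with a random word or left unchanged, and the model is still asked to predict the correct word at that position. This reduces the mismatch with fine-tuning, where `[MASK]` never appears in the input.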

- *Next sentence prediction*: To make BERT better at handling relationships between multiple sentences (useful for tasks such as question answering and text entailment), the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A?

 ![](https://i.ibb.co/bgkZqpx/bert-pretaining2.png)

The training data for both auxiliary tasks can be trivially generated from any monolingual corpus. Hence the scale of training is unbounded. 

The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
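
The corruption step of the masked language model can be sketched as below. This is a simplified illustration: each token is selected independently with probability 15% (rather than selecting an exact 15% of positions), special tokens are not treated separately, and the 80%/10%/10% split between `[MASK]`, a random token, and the unchanged token follows the BERT paper. The function name and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover the original
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with the mask token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the man went to the store to buy a gallon of milk".split()
corrupted, labels = mask_tokens(tokens, vocab=["apple", "runs", "blue"])
```

The cross-entropy loss would then be computed only at positions whose label is not `None`.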

#### Input representation <a name="6.1.1"></a>

In order to make BERT handle a variety of downstream tasks, the authors defined three different inputs which allow us to unambiguously represent both a single sentence and a pair of sentences.

- [*WordPiece*](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf) *tokenization*: instead of using whole English words, words can be further divided into smaller sub-word units, which handles rare or unknown words more effectively. Apart from the sub-word units, the authors introduced two new tokens that must be added to the input sentences: `[CLS]` at the beginning of the input and `[SEP]` after each sentence.

- The second and third inputs are sequences of 0s and 1s.  
 - *Type IDs* or *Segment IDs* indicate whether a token belongs to sentence A (a series of 0s) or sentence B (a series of 1s).
 
 - *Mask IDs* are used when input texts are padded to the same length; they indicate which positions hold real tokens and which hold padding.

![](https://i.ibb.co/ZGd8Jx5/input-representation-BERT.png)
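
The three inputs can be assembled as in the sketch below. It works on already tokenized text and keeps tokens as strings instead of vocabulary ids. The `[PAD]` token, the convention of 1 for real tokens and 0 for padding in the mask, and the function name are assumptions made for the example.

```python
def build_bert_inputs(tokens_a, tokens_b=None, max_len=16):
    """Build tokens, segment ids and mask ids for one sentence or a sentence pair."""
    tokens = ["[CLS]", *tokens_a, "[SEP]"]
    segment_ids = [0] * len(tokens)                 # sentence A positions
    if tokens_b is not None:
        tokens += [*tokens_b, "[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # sentence B positions
    if len(tokens) > max_len:
        raise ValueError("input longer than max_len; truncation is not handled here")
    mask_ids = [1] * len(tokens)                    # 1 marks a real token
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    mask_ids += [0] * padding                       # 0 marks padding
    return tokens, segment_ids, mask_ids

tokens, segment_ids, mask_ids = build_bert_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"], max_len=12
)
```

For a single sentence, `tokens_b` is omitted and all segment ids are 0.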

The input embeddings are constructed by summing three parts:

- *WordPiece token embeddings*.

- *Segment embeddings*: two are learned, one for sentence A and one for sentence B. Only the sentence A embedding is used if the input contains a single sentence.

- *Position embeddings*: Positional embeddings are learned rather than hard-coded.

![](https://i.ibb.co/NCQCPH6/input-embedding-BERT.png)

### Fine-tuning <a name="6.2"></a>

The pre-trained BERT model can be fine-tuned with just one additional layer, just like OpenAI GPT. The BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream task.

For classification tasks, we get the prediction by taking the final hidden state of the special first token `[CLS]`, $\boldsymbol{h}^{[\text{CLS}]}_L$, and multiplying it with a small weight matrix, $\text{softmax}(\boldsymbol{h}^{[\text{CLS}]}_L \boldsymbol{W}_{\text{cls}})$.

![](https://i.ibb.co/8rtrGZc/BERT-spam-classifier.png)


For Question Answering tasks, we need to predict the text span in the given paragraph that answers a given question. BERT predicts two probability distributions over the tokens: one for the start and one for the end of the text span. Only two new small matrices, $\boldsymbol{W}_s$ and $\boldsymbol{W}_e$, are learned during fine-tuning, and $\text{softmax}(\boldsymbol{h}^{(i)}_L\boldsymbol{W}_s)$ and $\text{softmax}(\boldsymbol{h}^{(i)}_L\boldsymbol{W}_e)$ define the two probability distributions.
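
A small NumPy sketch of span prediction is given below. It treats the start and end parameters as vectors, applies the softmax over token positions, and picks the span with the highest product of start and end probabilities subject to the end not preceding the start. The `max_len` limit, the function names, and the toy inputs are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_span(H, w_s, w_e, max_len=10):
    """H: (seq_len, hidden) final-layer states; w_s, w_e: (hidden,) learned vectors."""
    p_start = softmax(H @ w_s)      # distribution over tokens for the span start
    p_end = softmax(H @ w_e)        # distribution over tokens for the span end
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end))):   # end must not precede start
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, p_start, p_end

H = np.eye(4)                                   # toy hidden states for 4 tokens
span, p_start, p_end = best_span(H, np.array([0.0, 5.0, 0.0, 0.0]),
                                 np.array([0.0, 0.0, 5.0, 0.0]))
```

Here the toy parameters are chosen so that token 1 is the most likely start and token 2 the most likely end.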

The BERT paper shows several ways to use BERT for different tasks.

![](https://i.ibb.co/s60fg8Z/use-cases-BERT.png)

### Feature-based approach <a name="6.3"></a>



Instead of being fine-tuned, BERT can be used in a **feature-based approach**. Just like ELMo, we can use the pre-trained BERT to create contextualized word embeddings. Fixed features are extracted from the pre-trained model (the activations from one or more layers) without fine-tuning any parameters of BERT.

# ALBERT <a name="7"></a>

*A Lite BERT* ([*ALBERT*](https://arxiv.org/abs/1909.11942)) is a lightweight version of the BERT model that has fewer parameters and can be trained faster, thanks to two parameter-reduction techniques. The authors also propose a more challenging training task to replace the next sentence prediction (NSP) objective.

### Factorized embedding parameterization <a name="7.1"></a>

Input-level embeddings (WordPiece embeddings) learn *context-independent* representations, whereas hidden-layer embeddings refine that into *context-dependent* representations. In BERT, the WordPiece tokenization embedding size $E$ is configured to be the same as the hidden state size $H$. If we want to increase $H$, we need to learn a larger tokenization embedding too, which is expensive because it depends on the vocabulary size ($V$). It makes sense to separate the size of the hidden layers from the size of vocabulary embedding.

Using factorized embedding parameterization, the large vocabulary embedding matrix of size $V \times H$ is decomposed into two small matrices of size $V \times E$ and $E \times H$. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The number of embedding parameters drops from $V \times H$ to $V \times E + E \times H$, which is a significant reduction when $H \gg E$.
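
A quick parameter count shows the effect of the factorization. The sizes below are illustrative values chosen for the example, not figures taken from this notebook.

```python
def embedding_params(V, H, E=None):
    """Number of embedding parameters; E=None means no factorization (V x H)."""
    return V * H if E is None else V * E + E * H

V, H, E = 30000, 4096, 128                # illustrative sizes (assumed)
full = embedding_params(V, H)             # single V x H matrix
factorized = embedding_params(V, H, E)    # V x E matrix followed by E x H matrix
```

With these sizes the embedding parameters shrink from about 123 million to about 4.4 million, and growing $H$ further only affects the small $E \times H$ matrix.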

### Cross-layer parameter sharing <a name="7.2"></a>

Transformer-based neural network architectures (such as BERT, XLNet, and RoBERTa) rely on independent layers stacked on top of each other. Authors observed that the network often learned to perform similar operations at various layers, using different parameters of the network. This possible redundancy is eliminated in ALBERT by parameter-sharing across the layers.

There are multiple ways to share parameters, e.g., only sharing feed-forward network (FFN) parameters across layers, or only sharing attention parameters. The default decision for ALBERT is to share all parameters across layers, i.e., the same layer is applied repeatedly on top of itself. This technique greatly reduces the number of parameters without hurting performance too much.
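
The idea of sharing all parameters can be illustrated with a toy stack in which each "layer" is a single dense transformation standing in for a full transformer block. The sizes, the `tanh` layer, and the helper names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers = 16, 12

def make_layer():
    return {"W": rng.normal(scale=0.1, size=(dim, dim)), "b": np.zeros(dim)}

def apply_layer(x, layer):
    return np.tanh(x @ layer["W"] + layer["b"])   # toy stand-in for a transformer block

def count_params(layers):
    unique = {id(l): l for l in layers}.values()  # count each shared layer only once
    return sum(l["W"].size + l["b"].size for l in unique)

independent = [make_layer() for _ in range(num_layers)]   # one parameter set per layer
shared_layer = make_layer()
shared = [shared_layer] * num_layers                      # the same parameters reused

x = rng.normal(size=(2, dim))
for layer in shared:
    x = apply_layer(x, layer)
```

The shared stack still performs twelve layer applications, so compute per forward pass is unchanged; only the number of stored parameters drops by a factor equal to the depth.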

### Sentence-order prediction <a name="7.3"></a>

Authors found that the next sentence prediction (NSP) task of BERT turned out to be too easy. ALBERT instead adopted a *Sentence-Order Prediction* (*SOP*) self-supervised loss. The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order switched.

For the NSP task, the model can make reasonable predictions if it can detect topics when A and B are from different contexts. In comparison, SOP is harder as it requires the model to fully understand the coherence and ordering between segments.
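
The difference between the two objectives shows up in how training examples are built. The sketch below assumes each document is a list of text segments and draws positive and negative examples with equal chance; the function names and toy documents are hypothetical.

```python
import random

def nsp_example(doc, other_doc, rng):
    """Return (A, B, label); label 1 means B really follows A."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1
    return doc[i], rng.choice(other_doc), 0     # B drawn from a different document

def sop_example(doc, rng):
    """Return (A, B, label); label 1 means original order, 0 means swapped."""
    i = rng.randrange(len(doc) - 1)
    a, b = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0                              # same two segments, order switched

rng = random.Random(0)
doc = ["s0", "s1", "s2", "s3"]
other = ["t0", "t1", "t2"]
nsp = nsp_example(doc, other, rng)
sop = sop_example(doc, rng)
```

An NSP negative can often be spotted from the topic change alone, whereas an SOP negative shares its topic with the positive and differs only in order.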

# GPT-2 <a name="8"></a>

The [*OpenAI GPT-2*](https://openai.com/blog/better-language-models/) language model is the successor to GPT. With a similar architecture, GPT-2 has 1.5B parameters, 10x more than the original GPT, and it achieves SOTA results on language modeling datasets in a *zero-shot transfer setting* (without domain transfer fine-tuning). Large improvements are especially noticeable on small datasets and datasets used for measuring long-term dependency.

**Training dataset**:
Most prior work trained language models on a single domain of text, such as news articles or Wikipedia. GPT-2 used a large and diverse dataset in order to collect natural language demonstrations of tasks in as varied domains and contexts as possible. Specifically, the pre-training dataset contains 8 million Web pages collected by crawling qualified outbound links from Reddit.

### Zero-Shot Transfer <a name="8.1"></a>

The pre-training task for GPT-2 is language modeling and it achieves SOTA results for this task in different datasets. However, the amazing thing about this model is that it achieves great results in other tasks such as machine translation, question answering (QA), or text summarization, without task-specific fine-tuning. That is, the model has learned to solve tasks that it was not explicitly trained to do!

The downstream language tasks are framed as predicting conditional probabilities:

- Machine translation task, for example, English to French, is induced by conditioning LM on pairs of `English sentence = French sentence` and `the target English sentence =` at the end.

- QA task is formatted similar to translation with pairs of questions and answers in the context.

- Summarization task is induced by adding [`TL;DR:`](https://es.wikipedia.org/wiki/TL;DR) after the articles in the context.
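
The task framings above amount to building a text prompt and letting the language model continue it. The helpers below are a minimal sketch: the `=` and `TL;DR:` patterns come from the descriptions above, while the `Q:`/`A:` labels, the newline separators, and the function names are assumptions made for the example.

```python
def translation_prompt(examples, source_sentence):
    """examples: list of (english, french) pairs used as in-context demonstrations."""
    lines = [f"{en} = {fr}" for en, fr in examples]
    lines.append(f"{source_sentence} =")          # the model continues with the translation
    return "\n".join(lines)

def qa_prompt(examples, question):
    """examples: list of (question, answer) pairs placed in the context."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {question}\nA:")
    return "\n".join(lines)

def summarization_prompt(article):
    return f"{article}\nTL;DR:"

prompt = translation_prompt([("good morning", "bonjour")], "thank you")
```

The text generated after each prompt is then read off as the translation, the answer, or the summary.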

### BPE on Byte Sequences <a name="8.2"></a>

Like the original GPT, GPT-2 uses Byte Pair Encoding (BPE) to create the tokens in its vocabulary, but it operates on byte sequences instead of Unicode code points. A byte-level version of BPE only requires a base vocabulary of size 256 and does not need to worry about pre-processing, tokenization, etc. Despite this benefit, current byte-level LMs still have a non-negligible performance gap with the SOTA word-level LMs.

BPE merges frequently co-occurring byte pairs in a greedy manner. To prevent it from generating multiple versions of common words (e.g. `dog.`, `dog!` and `dog?` for the word `dog`), GPT-2 prevents BPE from merging characters across categories (thus `dog` would not be merged with punctuation marks like `.`, `!` and `?`). These tricks help increase the quality of the final byte segmentation.

Using the byte sequence representation, GPT-2 is able to assign a probability to any Unicode string, regardless of any pre-processing steps.

### Model Modifications <a name="8.3"></a>

Compared to GPT, other than having many more transformer layers and parameters, GPT-2 incorporates only a few architecture modifications:

- [*Layer normalization*](https://arxiv.org/abs/1607.06450) was moved to the input of each sub-block, similar to a [pre-activation residual unit](https://arxiv.org/abs/1603.05027), in which normalization is applied before the weight layers (unlike the [original residual unit](https://arxiv.org/abs/1512.03385)).

- An additional layer normalization was added after the final self-attention block.

- A modified initialization was constructed as a function of the model depth.

- The weights of residual layers were initially scaled by a factor of $\frac{1}{\sqrt{N}}$ where $N$ is the number of residual layers.

- Use larger vocabulary size and context size.

### One difference from BERT <a name="8.4"></a>

GPT-2 (like GPT) is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. A key difference between the two is that GPT-2, like traditional language models, outputs one token at a time.

The way these models actually work to generate text at inference mode is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This is an idea called *auto-regression*,  which made RNNs unreasonably effective.

![](http://jalammar.github.io/images/xlnet/gpt-2-autoregression-2.gif)
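
The auto-regressive loop can be written in a few lines. The sketch below uses greedy decoding and a tiny hand-written bigram table as a stand-in for a trained language model; `toy_model`, the `<eos>` token, and the function names are hypothetical.

```python
def generate(next_token_probs, prompt, max_new_tokens):
    """Greedy auto-regressive decoding with a user-supplied next-token function."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)        # dict: token -> probability
        nxt = max(probs, key=probs.get)         # pick the most likely next token
        if nxt == "<eos>":
            break
        tokens.append(nxt)                      # feed the output back in as input
    return tokens

def toy_model(tokens):
    # Hypothetical stand-in for a trained LM: looks only at the last token
    table = {"the": {"cat": 0.9, "dog": 0.1},
             "cat": {"sat": 0.8, "<eos>": 0.2},
             "sat": {"<eos>": 1.0}}
    return table.get(tokens[-1], {"<eos>": 1.0})

generated = generate(toy_model, ["the"], max_new_tokens=10)
```

A real model conditions on the whole sequence rather than only the last token, and sampling strategies other than greedy selection are commonly used, but the feedback loop is the same.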

GPT-2 and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade-off: in losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to achieve better results. XLNet brings back auto-regression while finding an alternative way to incorporate the context on both sides.

**Note**: At training time, the model is trained on longer sequences of text and processes multiple tokens at once.

# RoBERTa <a name="9"></a>

*Robustly optimized BERT approach* ([*RoBERTa*](https://arxiv.org/abs/1907.11692)) refers to a new recipe for training BERT to achieve better results, as the authors found that the original BERT model is significantly undertrained. The recipe contains the following learnings:

- Train for longer with bigger batch size.

- Remove the next sentence prediction (NSP) task.

- Use longer sequences in training data format. The paper found that using individual sentences as inputs hurts downstream performance. Instead, we should use multiple sentences sampled contiguously to form longer segments.

- Change the masking pattern dynamically. The original BERT applies masking once during the data preprocessing stage, resulting in a static mask across training epochs. RoBERTa instead generates a new masking pattern every time a sequence is fed to the model.

RoBERTa also added the CommonCrawl News dataset and further confirmed that pretraining with more data helps improve the performance on downstream tasks. It was trained with BPE on byte sequences, the same as in GPT-2. The authors also found that choices of hyperparameters have a big impact on the model performance.

# References <a name="10"></a>

- [A Survey on Transfer Learning in Natural Language Processing](https://arxiv.org/abs/2007.04239)

- [Evolution of transfer learning in natural language processing](https://arxiv.org/abs/1910.07370)

- [Generalized language models](https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html)

- [The Illustrated BERT, ELMo, and co.](http://jalammar.github.io/illustrated-bert/)

- [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)

- [The Transformer Family](https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html)

- [Understanding transformers](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/)

- [ALBERT Google blog](https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html)

- [Neural Machine Translation with Byte-Level BPE](https://arxiv.org/abs/1909.03341)

- [Modern Methods for Text Generation](https://arxiv.org/abs/2009.04968)

