# Explanation

The original transformer architecture was built for machine translation, so it wasn't clear at first how powerful it would be for the many other tasks it is used for today.

Even for language modeling tasks, it wasn't obvious how exactly to apply the transformer to get the best results. Before BERT and GPT, most sequence modeling was based on smaller datasets, and models were designed for specific sequence modeling tasks.

OpenAI's initial release of GPT was one of the first attempts at showing how powerful the transformer could be when scaled up and trained on a large corpus of unlabeled text. It popularized the recipe of pre-training on such a corpus and then fine-tuning on smaller task-specific datasets, using transfer learning to perform those tasks more effectively.

The BERT authors argued that GPT's design had a key limitation that hurt its performance: the left-to-right next-token prediction objective.

By improving on GPT's approach, BERT produced a pre-trained language model that set a new state of the art on a wide variety of language tasks, each reached by fine-tuning the same pre-trained model rather than designing a task-specific architecture.

BERT was arguably the beginning of the wave of attention focused on large language models, and BERT and GPT were also arguably the first successful large language models.

### Intuition

BERT made two main changes relative to GPT that make it so effective.

First, it replaced the left-to-right next-token prediction objective. The authors of BERT suspected that this objective didn't allow GPT to properly learn about context from words in both directions, which significantly inhibited its understanding and language modeling capabilities.

BERT replaced this with the **masked language model** (MLM) objective where it learns to predict a random masked word in a sequence based on all the surrounding words. In this way, it learns to understand the relevance of context in both directions for each word, which enables much more robust understanding of sequences.

Additionally, BERT introduced the **next sentence prediction** (NSP) task: it learns to classify whether the second of two sentences actually follows the first. This forced BERT not just to understand the relationships between words, but also to understand sequences of sentences and how they relate to each other.

This may have also boosted the results of BERT, although the task was later found to be unnecessary in the RoBERTa paper, and was removed.

### RoBERTa

Due to the clear impact of BERT, many replication studies came out soon after to confirm the results of BERT.

Notably, one replication study - RoBERTa - trained a model that significantly outperformed BERT on the benchmarks it was evaluated on.

This study exposed a few flaws in BERT: it was significantly undertrained, and the NSP task was unnecessary (so RoBERTa removed it). It also introduced optimizations to the MLM objective and put much more effort into tuning BERT's hyperparameters.

The significance of RoBERTa was not just in creating an optimized version of BERT, but also in emphasizing how much leverage hyperparameters and other training details can have when doing expensive training runs on large language models.


# My Notes

## 📜 [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

> We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.

> BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

> As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, […] without substantial task-specific architecture modifications.

BERT is built specifically for fine-tuning. It makes it easy to train a single base model and then use it to create a number of task-specific models.

> BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks.

> There are two existing strategies for applying pre-trained language representations to downstream tasks: _feature-based_ and _fine-tuning._

> We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches.

> The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer.

> Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

The main problem with left-to-right models such as GPT is that, during training, each word can only soak up context from the words on its left and never from those on its right, whereas understanding a sentence depends on soaking up context from both directions.
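To make the difference concrete, here is a minimal sketch of the two attention patterns, written in plain Python. The helper names (`causal_mask`, `bidirectional_mask`, `visible`) and the example sentence are made up for illustration and do not come from either paper; real implementations apply these masks to attention scores inside the Transformer.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible(mask, tokens, i):
    """Return the tokens that position i is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]


tokens = ["the", "bank", "of", "the", "river"]
causal = causal_mask(len(tokens))
full = bidirectional_mask(len(tokens))

# Under the causal mask, "bank" cannot see "river", the word that disambiguates it.
print(visible(causal, tokens, 1))
print(visible(full, tokens, 1))
```

With the causal mask, the representation of "bank" is built only from "the" and "bank"; with the bidirectional mask it can also use "river".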

> BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective.

> The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.

Using this MLM objective, words can learn to absorb context from all other words around them, making the Transformer bidirectional.

> We demonstrate the importance of bidirectional pre-training for language representations.

> We show that pre-trained representations reduce the need for many heavily-engineered task specific architectures.

### Related Work

**1. Unsupervised Feature-based Approaches**

> Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch.

**2. Unsupervised Fine-tuning Approaches**

> Sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task.

**3. Transfer Learning from Supervised Data**

> There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference and machine translation.

### BERT

> During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

> A distinctive feature of BERT is its unified architecture across different tasks.

BERT is built specifically for easy fine-tuning on a number of different tasks, and the architecture of the model stays almost exactly the same after fine-tuning: only a small task-specific output layer is added.

**1. Pre-training BERT**

> In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
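The masking step can be sketched roughly as follows. The BERT paper selects 15% of token positions for prediction and, for each selected position, substitutes `[MASK]` 80% of the time, a random token 10% of the time, and leaves the token unchanged 10% of the time. The function name `mask_tokens`, the toy vocabulary, and the choice to flip an independent coin per token (rather than selecting exactly 15%) are simplifying assumptions for this sketch, as is skipping the special tokens.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict,
    and None everywhere else.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:  # assumption: never mask special tokens
            continue
        if rng.random() >= mask_prob:  # position not selected for prediction
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
toy_vocab = ["apple", "run", "blue"]  # hypothetical vocabulary
inputs, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model has to reconstruct those tokens from the context on both sides.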

> In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task.

**Next sentence prediction (NSP)** is the second pre-training task BERT adds. Every input sequence starts with a special `[CLS]` token, and the model’s final representation of that token is used to classify whether the second of the two sentences it was fed actually follows the first.

Adding this task into the model forces the model to learn whether two sentences are actually related or not, rather than just assuming that the text it’s fed is all correctly related.
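As a rough sketch of how such training pairs can be built: the BERT paper uses the true next sentence 50% of the time (labeled `IsNext`) and a random sentence from the corpus otherwise (labeled `NotNext`), and packs both sentences into one sequence separated by `[SEP]`. The function name `make_nsp_example`, the toy documents, and the decision to draw the random sentence from a different document are assumptions made for this illustration.

```python
import random


def make_nsp_example(doc, all_docs, rng):
    """Build one NSP example from `doc`, a list of tokenized sentences."""
    i = rng.randrange(len(doc) - 1)  # pick a sentence that has a successor
    sent_a = doc[i]
    if rng.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        # assumption: sample the negative from some other document
        other = rng.choice([d for d in all_docs if d is not doc])
        sent_b, label = rng.choice(other), "NotNext"
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # segment ids tell the model which sentence each token belongs to
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, label


docs = [
    [["the", "man", "went", "to", "the", "store"], ["he", "bought", "milk"]],
    [["penguins", "are", "flightless"], ["they", "live", "in", "the", "south"]],
]
tokens, segment_ids, label = make_nsp_example(docs[0], docs, random.Random(0))
print(tokens, segment_ids, label)
```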

**2. Fine-tuning BERT**

> For each task, we simply plug in the task specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

> Compared to pre-training, fine-tuning is relatively inexpensive.

### Ablation Studies

**1. Effect of Pre-training Tasks**

![Screenshot 2024-05-16 at 10.24.26 AM.png](../../images/Screenshot_2024-05-16_at_10.24.26_AM.png)

**2. Effect of Model Size**

> It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data.

![Screenshot 2024-05-16 at 10.28.04 AM.png](../../images/Screenshot_2024-05-16_at_10.28.04_AM.png)

### Conclusion

> Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems.

> Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.


## 📜 [RoBERTa: A Robustly Optimized BERT Pre-training Approach](https://arxiv.org/pdf/1907.11692)

> We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it.

> These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.

> Our modifications are simple, they include:
> (1) training the model longer, with bigger batches, over more data
> (2) removing the next sentence prediction objective
> (3) training on longer sequences
> (4) dynamically changing the masking pattern applied to the training data

> Our training improvements show that masked language model pre-training, under the right design choices, is competitive with all other recently published methods.

RoBERTa is about showing that the BERT architecture is actually capable of achieving state-of-the-art results, while questioning several of BERT’s training design choices.

### Training Procedure Analysis

**1. Static vs. Dynamic Masking**

BERT uses static token masking: the masks are determined once, in advance, during data preprocessing. RoBERTa instead tries dynamic token masking, generating a new mask each time a sequence is fed to the model, which leads to slight improvements.
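The contrast can be illustrated with a small sketch that only tracks which positions get masked in each epoch. The function names and the fixed seed are hypothetical, and the sketch ignores details of the real pipelines (for example, the RoBERTa paper notes that BERT's data was duplicated so each sequence received a handful of different static masks).

```python
import random


def choose_mask_positions(n_tokens, rng, mask_prob=0.15):
    """Pick roughly mask_prob of the positions to mask."""
    k = max(1, round(n_tokens * mask_prob))
    return sorted(rng.sample(range(n_tokens), k))


def static_masks(n_tokens, n_epochs, seed=0):
    """Static masking: positions are chosen once and reused every epoch."""
    positions = choose_mask_positions(n_tokens, random.Random(seed))
    return [positions for _ in range(n_epochs)]


def dynamic_masks(n_tokens, n_epochs, seed=0):
    """Dynamic masking: positions are re-sampled each time the sequence is seen."""
    rng = random.Random(seed)
    return [choose_mask_positions(n_tokens, rng) for _ in range(n_epochs)]


print(static_masks(20, 4))   # the same positions in every epoch
print(dynamic_masks(20, 4))  # freshly sampled positions per epoch
```

With dynamic masking the model rarely sees the same sequence masked the same way twice, which matters more as the number of training passes grows.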

**2. Model Input Format and Next Sentence Prediction**

BERT uses next sentence prediction. RoBERTa finds that you can actually do better by eliminating this and just training on sequences of sentences from a single document.

**3. Training with Larger Batches**

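BERT was originally trained with a batch size of 256 sequences. RoBERTa uses much larger mini-batches (it experiments with 2K and 8K sequences and adopts 8K), tuning the learning rate accordingly, and finds that this improves both masked language modeling perplexity and downstream accuracy.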

**4. Text Encoding**

> Using bytes [instead of unicode characters] makes it possible to learn a subword vocabulary of a modest size (50K units) that can still encode any input text without introducing any “unknown” tokens.

> Nevertheless, we believe the advantages of a universal encoding scheme outweighs the minor degradation in performance and use this encoding in the remainder of our experiments.
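A small sketch of why a byte-level base alphabet avoids "unknown" tokens: every string can be written as UTF-8 bytes, so there are only 256 possible base symbols, and any text round-trips exactly. This shows only the base alphabet, not the BPE merges that are learned on top of it; the helper names are made up for this example.

```python
def to_byte_symbols(text):
    """Encode text as a list of base symbols, each an integer in 0..255."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Invert to_byte_symbols and recover the original text."""
    return bytes(symbols).decode("utf-8")


for sample in ["hello", "na\u00efve", "\u65e5\u672c\u8a9e"]:
    symbols = to_byte_symbols(sample)
    # Every symbol fits in the 256-entry base vocabulary, whatever the script.
    assert all(0 <= s < 256 for s in symbols)
    assert from_byte_symbols(symbols) == sample
    print(repr(sample), "->", len(symbols), "byte symbols")
```

A character-level vocabulary would instead need an entry for every Unicode character it might encounter, or fall back to an "unknown" token for the rest.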

### RoBERTa

> Specifically, RoBERTa is trained with dynamic masking, FULL SENTENCES without NSP loss, large mini-batches and a larger byte-level BPE.

> Additionally, we investigate two other important factors that have been under-emphasized in previous work: (1) the data used for pre-training, and (2) the number of training passes through the data.

![Screenshot 2024-05-16 at 10.57.50 AM.png](../../images/Screenshot_2024-05-16_at_10.57.50_AM.png)

> Crucially, RoBERTa uses the same masked language modeling pre-training objective and architecture as $\textrm{BERT}_{\textrm{LARGE}}$, yet consistently outperforms both $\textrm{BERT}_{\textrm{LARGE}}$ and $\textrm{XLNet}_{\textrm{LARGE}}$.

> This raises questions about the relative importance of model architecture and pretraining objective, compared to more mundane details like dataset size and training time that we explore in this work.

![Screenshot 2024-05-16 at 10.59.51 AM.png](../../images/Screenshot_2024-05-16_at_10.59.51_AM.png)

### Conclusion

> These results illustrate the importance of these previously overlooked design decisions and suggest that BERT’s pre-training objective remains competitive with recently proposed alternatives.
