# Overview

BERT was [published](https://arxiv.org/abs/1810.04805) in 2018 by a team of researchers within Google's AI division.

BERT builds on the Transformer's encoder, offering better accuracy on language tasks and support for transfer learning.

# What is BERT?
The acronym BERT stands for Bidirectional Encoder Representations from Transformers.

## Language [R]epresentation Model

This is a lot to unpack. Let's start with the term Representation:

Its authors characterize BERT as a language representation model. This term was a bit confusing because I could not find an explicit explanation of how it differs from a regular language model. I was, however, able to deduce the meaning from context. I found the following passage in the paper:

>  they use ... language models to learn general language representations. ... BERT uses masked language models to enable pretrained deep bidirectional representations.

From this context, I believe the language model itself is the stateless architecture, while the language representation is the weights (and other internal state) learned from the data the model was trained on. This would be consistent with the idea that language model representations are used by downstream models:

> There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

So, colloquially, when one talks about a pre-trained model or a fine-tuned model, they are likely referring to the representation rather than the model: training has already been performed, and we are simply making predictions using the model and its pre-existing weights.

## From [T]ransformers

So now that we know what a representation is, the phrase "Representations from Transformers" starts to make sense. The representations are produced by a model, and in this case, that model is a Transformer. BERT is the product of a particular type of Transformer; the following sections clarify what that type is.

## [B]idirectional Context

Now let's talk about the term bidirectional.

In this case, bidirectionality qualifies the context objects (i.e. the word embeddings generated by the Transformer's encoder) used by BERT's attention mechanism. Recall that the context is what gives tokens (i.e. words or fragments) their meaning and what is used to generate the probabilities used to predict the next word in a sequence. With BERT, the language representation is jointly conditioned on both left and right context. This means the model considers the tokens to the left of a given token (i.e. those before it) and the tokens to the right (i.e. those that occur after it). Additionally, BERT considers the bidirectional context in all layers of the neural network.

The authors state that bidirectionality accounts for the majority of the empirical improvements in BERT's performance. They claim that earlier deep representations were sub-optimal because they were trained unidirectionally (left-to-right in the case of OpenAI GPT), trained on labeled data, or both.
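
To make the contrast between left-to-right and bidirectional context concrete, here is a minimal sketch in plain Python. It is not BERT's actual attention code: the helper names `attention_mask` and `visible_context` and the example sentence are my own, and a real model would apply such a mask inside each attention layer rather than to a list of words.

```python
def attention_mask(seq_len, bidirectional):
    """Build a seq_len x seq_len matrix; mask[i][j] is 1 if position i may attend to position j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_context(tokens, position, bidirectional):
    """List the tokens that the token at `position` is allowed to attend to."""
    mask = attention_mask(len(tokens), bidirectional)
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = ["the", "bank", "of", "the", "river"]

# Left-to-right: "bank" sees only itself and the tokens before it.
left_to_right = visible_context(tokens, 1, bidirectional=False)

# Bidirectional: "bank" also sees "river", which helps pin down its meaning.
both_directions = visible_context(tokens, 1, bidirectional=True)

print(left_to_right)
print(both_directions)
```

In the left-to-right case the word "bank" never sees "river", so its representation cannot use that clue. In the bidirectional case every position can draw on the whole sentence.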

## Sequence Pair [E]ncoder

Next, let's unpack the term Encoder.

The paper reviews the history of pre-training general language representations. It notes that, around 2015, sentence or document encoders started being used to produce contextual token representations, which are then presumably fine-tuned and consumed by downstream processes or tasks (e.g. decoders).

The paper spells out that BERT is an advancement of the original transformer encoder design: 

> BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library
>
> use of Transformers has become common and our implementation is almost identical to the original ... (Vaswani et al. (2017)).

The difference here is that BERT's encoder was designed to "unambiguously" represent either a single sequence or a pair of sequences (e.g. question and answer) as one input. In this way, BERT is generic and can be applied to a number of downstream tasks.
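
As a rough illustration of how one input can hold either a single sequence or a pair, the sketch below packs tokens the way the paper describes: a `[CLS]` token at the start, a `[SEP]` token after each sequence, and a segment id marking which sequence each token belongs to. The function name `pack_sequences` and the example sentences are hypothetical, and whitespace-split words stand in for the WordPiece tokens that BERT actually uses.

```python
def pack_sequences(tokens_a, tokens_b=None):
    """Pack one sequence, or a pair of sequences, into a token list plus segment ids."""
    # The first sequence is wrapped as: [CLS] A [SEP], all in segment 0.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    # An optional second sequence is appended as: B [SEP], all in segment 1.
    if tokens_b is not None:
        tail = list(tokens_b) + ["[SEP]"]
        tokens += tail
        segment_ids += [1] * len(tail)

    return tokens, segment_ids


# Single sequence, e.g. for sentence classification.
single_tokens, single_segments = pack_sequences(["great", "movie"])

# Sequence pair, e.g. a question followed by a candidate passage.
pair_tokens, pair_segments = pack_sequences(
    ["what", "is", "bert"], ["a", "language", "model"]
)

print(single_tokens, single_segments)
print(pair_tokens, pair_segments)
```

Because both cases produce the same kind of output (one token list and one list of segment ids), the same encoder can consume them without any task-specific input handling.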

# Use Case: Transfer Learning

Among the advancements made by BERT, [transfer learning](./Transfer%20Learning%20And%20Pre-Trained%20Models.ipynb) sits close to the top. The term transfer learning was a bit confusing; I found it simpler to think of it as transferring knowledge between machine learning models.

The paper describes BERT as a framework, stating:

> There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre trained parameters.

Digging a bit deeper, we can infer that the basic premise of the BERT framework is to provide a blueprint for transfer learning implementations, made possible by the framework's architectural requirements:

> A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

The authors propose that BERT can be used as a pre-trained model and fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. This makes model tuning far more efficient operationally: one fine-tunes a pre-trained model rather than starting from scratch, which cuts a substantial number of training iterations out of the process.
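
The two-step framework can be sketched with a toy example. This is only an illustration under loud assumptions: the parameter values, the tiny `HIDDEN_SIZE`, the function names, and the task names are all made up, and `fine_tune_step` applies a fixed placeholder update instead of a real gradient. What it is meant to show is the bookkeeping: each downstream model starts from a copy of the same pre-trained parameters, adds one output layer, and then updates all of its parameters independently.

```python
import copy
import random

HIDDEN_SIZE = 4  # toy value; far smaller than any real BERT configuration

# Stand-in for the parameters learned during pre-training on unlabeled data.
pretrained = {"encoder": [[0.5] * HIDDEN_SIZE for _ in range(HIDDEN_SIZE)]}


def init_downstream_model(pretrained_params, num_labels, seed):
    """Copy the pre-trained parameters and add one new task-specific output layer."""
    rng = random.Random(seed)
    model = copy.deepcopy(pretrained_params)
    model["output_layer"] = [
        [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]
        for _ in range(num_labels)
    ]
    return model


def fine_tune_step(model, learning_rate):
    """Placeholder update: shifts every parameter, encoder included, by a fixed amount."""
    for matrix in model.values():
        for row in matrix:
            for k in range(len(row)):
                row[k] -= learning_rate


# Two hypothetical downstream tasks initialized from the same pre-trained parameters.
qa_model = init_downstream_model(pretrained, num_labels=2, seed=0)
nli_model = init_downstream_model(pretrained, num_labels=3, seed=1)

# Before fine-tuning, both tasks share identical encoder weights.
same_start = qa_model["encoder"] == nli_model["encoder"] == pretrained["encoder"]

# Each task is then fine-tuned separately on its own (here imaginary) labeled data.
fine_tune_step(qa_model, learning_rate=0.01)
fine_tune_step(nli_model, learning_rate=0.02)

print(same_start)
print(qa_model["encoder"] != nli_model["encoder"])
```

After the updates, the two downstream models no longer share encoder weights, and the original `pretrained` dictionary is untouched, which mirrors the paper's point that each task ends up with its own fine-tuned model.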

# Implementations

## BERT and Smaller BERT Models
In 2018, the original BERT model was uploaded to [GitHub](https://github.com/google-research/bert). Since then, several additional releases have taken place, including a set of "smaller" BERT models released around March 2020.

## tensor2tensor
The paper cites the [tensor2tensor](https://github.com/tensorflow/tensor2tensor) library as the release of the original Transformer implementation on which BERT's multi-layer bidirectional Transformer encoder is based.