## Encoder: BERT


| **Component**         | **BERT-Large Details**                                        |
|-----------------------|--------------------------------------------------------------|
| **Vocabulary Size**   | 30,000 tokens                                                |
| **Embedding Size**    | 1024 per token                                               |
| **# Transformer Layers** | 24                                                        |
| **Attention Heads/Layer** | 16                                                       |
| **Q/K/V per Head**    | Each: $64 \times 1024$                                       |
| **Feedforward Hidden Dim** | 4096                                                    |
| **Max Input Length**  | 512 tokens                                                   |
| **Batch Size (Pretrain)** | 256                                                      |
| **Pretraining Steps** | 1 million (~50 epochs)                                       |
| **Pretraining Tasks** | Masked language modeling, next sentence prediction           |
| **Fine-tuning Tasks** | Text classification, NER, span prediction (QA), etc.         |
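
As a rough sanity check on the sizes in the table, the sketch below counts the learned parameters they imply. It assumes every affine map has a bias, that each layer has two LayerNorms with a gain and an offset per dimension, and it ignores position embeddings and any task-specific head, so the total is an approximation rather than an official figure.

```python
# Sizes taken from the table above.
vocab_size = 30_000
d_model = 1024
n_layers = 24
n_heads = 16
d_ff = 4096

# Each head works in a (d_model / n_heads)-dimensional space.
head_dim = d_model // n_heads


def linear_params(d_in, d_out):
    """Weights plus biases of one affine map."""
    return d_in * d_out + d_out


# Q, K, V for all heads together, plus the projection applied after concatenation.
attention_params = 4 * linear_params(d_model, d_model)
# Two affine maps: d_model -> d_ff -> d_model.
ffn_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
# Assumption: two LayerNorms per layer, each with a gain and an offset.
layernorm_params = 2 * 2 * d_model

params_per_layer = attention_params + ffn_params + layernorm_params
token_embedding_params = vocab_size * d_model
total_params = n_layers * params_per_layer + token_embedding_params

print(head_dim, params_per_layer, total_params)
```

Under these assumptions the count comes to roughly 333 million parameters, with 64 dimensions per head, which matches the $64 \times 1024$ shape listed for the Q/K/V matrices.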

To get a brief idea of what this looks like from the inside, consider a single transformer layer:

<div align="center">

**One Transformer Layer BEGIN**

</div>

We write $sa_{lm}$ for the output of attention head $l$ (out of 16) at position $m$, where $\mathbf{x_m}$ is the token that supplies the query (the "center of attention"). For the first head:


$$
sa_{11} = \left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_1})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_1})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_1})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_1}\right)+ \dots +\left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_N})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_1})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_1})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_N}\right)
$$

$$
sa_{12} = \left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_1})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_2})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_2})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_1}\right)+ \dots +\left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_N})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_2})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_2})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_N}\right)
$$

$$
\vdots
$$

$$
sa_{1N} = \left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_1})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_N})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_N})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_1}\right)+ \dots +\left[\frac{\exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_N})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_N})\right)}{\sum_{i=1}^N \exp\left((\beta_{k1} + \Omega_{k1}\mathbf{x_i})\cdot(\beta_{q1} + \Omega_{q1}\mathbf{x_N})\right)}\right]\cdot \left(\beta_{v1} + \Omega_{v1}\mathbf{x_N}\right)
$$

We then concatenate these outputs to form the result of the first self-attention head:

$$ SA_1 = \left[sa_{11}, sa_{12}, \dots, sa_{1N}\right]$$
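
The computation of one head can be sketched in plain Python. This is a minimal illustration with toy sizes and random parameters, not BERT's actual implementation: the dictionary keys (`omega_q`, `beta_q`, and so on) are hypothetical names for the symbols in the formula. BERT also scales the dot products by the square root of the head dimension, which is omitted here to match the formula as written.

```python
import math
import random


def affine(beta, omega, x):
    """Return beta + omega @ x, with omega given as a list of rows."""
    return [b + sum(w * xi for w, xi in zip(row, x)) for b, row in zip(beta, omega)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(xs, params):
    """Return [sa_1, ..., sa_N] for one head."""
    keys = [affine(params["beta_k"], params["omega_k"], x) for x in xs]
    values = [affine(params["beta_v"], params["omega_v"], x) for x in xs]
    outputs = []
    for x_m in xs:
        # x_m supplies the query; every token supplies a key and a value.
        query = affine(params["beta_q"], params["omega_q"], x_m)
        weights = softmax([dot(k, query) for k in keys])  # sums to 1 over tokens
        sa_m = [sum(w * v[d] for w, v in zip(weights, values))
                for d in range(len(values[0]))]
        outputs.append(sa_m)
    return outputs


random.seed(0)
D, D_HEAD, N = 8, 2, 3  # toy sizes; the table above uses 1024 and 64


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


params = {
    "omega_q": rand_matrix(D_HEAD, D), "beta_q": [0.0] * D_HEAD,
    "omega_k": rand_matrix(D_HEAD, D), "beta_k": [0.0] * D_HEAD,
    "omega_v": rand_matrix(D_HEAD, D), "beta_v": [0.0] * D_HEAD,
}
xs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
SA_1 = attention_head(xs, params)  # N vectors, each of length D_HEAD
```

Each output is a weighted average of the value vectors, so if all input tokens were identical, every output would equal that token's value vector.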

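A single transformer layer has 16 of these heads running in parallel, each with its own parameters, so that different heads can attend to different relationships between tokens. Their outputs are concatenated: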

$$SA_{L1} = \left[SA_1^T, SA_2^T, \dots, SA_{16}^T\right]$$

As mentioned in the previous chapter, we then apply a linear transformation to the concatenated result:

$$\mathbf{\Omega}_{cL} SA_{L1}^T + \mathbf{\beta}_{cL}$$

We then add the layer input to this result (skip connection) and apply LayerNorm.

Next comes a fully connected feedforward network (FFN), applied to each position: $$\mathbb{R}^{1024} \to \mathbb{R}^{4096} \to \text{GELU} \to \mathbb{R}^{1024}$$

This is followed by another skip connection (the input to the FFN is added to its output) and another LayerNorm.
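
The second half of the layer can be sketched for a single token as follows. This is a simplified illustration under stated assumptions: the LayerNorm here has no learned gain or offset, the parameter names (`omega_1`, `beta_1`, `omega_2`, `beta_2`) are hypothetical, and `attended` is a stand-in for the projected multi-head attention output.

```python
import math
import random


def affine(beta, omega, x):
    """Return beta + omega @ x, with omega given as a list of rows."""
    return [b + sum(w * xi for w, xi in zip(row, x)) for b, row in zip(beta, omega)]


def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def gelu(a):
    """Exact GELU: a * Phi(a), where Phi is the standard normal CDF."""
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))


def feed_forward(x, p):
    """Expand to the hidden size, apply GELU, project back."""
    hidden = [gelu(h) for h in affine(p["beta_1"], p["omega_1"], x)]
    return affine(p["beta_2"], p["omega_2"], hidden)


def finish_layer(x, attended, p):
    """x: layer input for one token; attended: attention output for that token."""
    h = layer_norm([a + b for a, b in zip(x, attended)])  # skip connection + LayerNorm
    f = feed_forward(h, p)
    return layer_norm([a + b for a, b in zip(h, f)])      # skip connection + LayerNorm


random.seed(0)
D, D_FF = 4, 16  # toy sizes; the text uses 1024 and 4096
p = {
    "omega_1": [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(D_FF)],
    "beta_1": [0.0] * D_FF,
    "omega_2": [[random.gauss(0, 0.5) for _ in range(D_FF)] for _ in range(D)],
    "beta_2": [0.0] * D,
}
x = [0.2, -1.0, 0.5, 1.5]
attended = [0.1, 0.3, -0.2, 0.0]
out = finish_layer(x, attended, p)
```

The output has the same dimension as the input, which is what allows the layers to be stacked.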

<div align="center">

**One Transformer Layer DONE**

This process is then repeated for every layer: 12 times in BERT-Base or 24 times in BERT-Large.

</div>

## Pre-training 

The parameters of the transformer architecture are learned using *self-supervision* from a large corpus of text (e.g. Wikipedia).
In doing so, the model learns general information about the statistics of language.
One of the main benefits is that we can use a large amount of data without requiring manual labels.

**Components of self-supervision**

- Completing a sentence.
- Filling in missing words.
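
The "fill in missing words" idea can be sketched as a masking step that turns unlabeled text into (input, target) pairs. The function name, the `[MASK]` placeholder, and the 15% masking rate are assumptions for illustration and are not stated in this section. The sketch also simplifies the procedure by always replacing a chosen token with the placeholder.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Hide a random subset of tokens; return the masked sequence and the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model is trained to predict this word
        else:
            masked.append(tok)
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, seed=0)
print(masked)
print(targets)
```

The labels come from the text itself, which is why no manual annotation is needed.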


During training, the maximum input length is 512 tokens, and the batch size is 256.
The system is trained for a million steps, corresponding to around 50 epochs.
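
To get a feel for the scale these numbers imply, here is a back-of-the-envelope calculation. It assumes every sequence is filled to the maximum length, so the token counts are upper bounds, and it treats "around 50 epochs" as exactly 50.

```python
max_len = 512        # tokens per sequence (maximum)
batch_size = 256     # sequences per step
steps = 1_000_000
epochs = 50          # "around 50" in the text, so the results are approximate

sequences_seen = steps * batch_size
tokens_seen_max = sequences_seen * max_len        # upper bound: all sequences full
tokens_per_epoch_max = tokens_seen_max // epochs  # implied size of one pass over the data

print(f"{sequences_seen:,} sequences")
print(f"{tokens_seen_max:,} tokens at most")
print(f"{tokens_per_epoch_max:,} tokens per epoch at most")
```

Under these assumptions the model sees about 256 million sequences, at most about 131 billion tokens in total, and a single pass over the data is at most about 2.6 billion tokens.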



## Fine Tuning 

In the fine-tuning stage, the model parameters are adjusted to specialize the network to a particular task.

To do this, we append an additional layer to the pre-trained network that converts its output vectors to the desired output format.

**Text Classification**

Suppose the fine-tuning task is sentiment analysis. The output vector associated with the special classification token (`[CLS]`), placed at the start of the input, is mapped to a single number and passed through a sigmoid function. The result contributes to a binary cross-entropy loss.
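
A minimal sketch of such a classification head is shown below. The names `cls_vector`, `weights`, and `bias` are hypothetical, and the toy numbers stand in for a 1024-dimensional output vector and a learned linear layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sentiment_probability(cls_vector, weights, bias):
    """Map the classification token's output vector to P(positive)."""
    z = bias + sum(w * c for w, c in zip(weights, cls_vector))
    return sigmoid(z)


def binary_cross_entropy(p, label):
    """Loss for one example; label is 1 (positive) or 0 (negative)."""
    eps = 1e-12  # keep log() away from zero
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


cls_vector = [0.5, -1.0, 2.0]  # toy stand-in for the 1024-dimensional vector
weights = [1.0, 0.5, 0.25]
bias = 0.0

p_positive = sentiment_probability(cls_vector, weights, bias)
loss_if_positive = binary_cross_entropy(p_positive, 1)
loss_if_negative = binary_cross_entropy(p_positive, 0)
```

Because the predicted probability here is above 0.5, the loss is smaller when the true label is positive than when it is negative. During fine-tuning this loss is backpropagated through both the new layer and the pre-trained network.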

**Word Classification**

We can also fine-tune for *named entity recognition*, which classifies each word as one of a pre-defined set of entity types (e.g. person, place, organization), including an option for "no-entity".

Each input embedding $\mathbf{x_n}$ is mapped to a vector of dimension $E \times 1$, where $E$ is the number of entity types (the cardinality of the entity set).

This vector is then passed through a softmax function to produce a probability for each class, and these probabilities contribute to a **multi-class cross-entropy** loss.
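
The per-token head can be sketched as follows. The entity list, the parameter names `omega` and `beta`, and the toy vectors are all illustrative assumptions. In practice the input to this head would be each token's 1024-dimensional output vector.

```python
import math

ENTITY_TYPES = ["no-entity", "person", "place", "organization"]  # illustrative, E = 4


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_probabilities(token_vector, omega, beta):
    """Map one token vector to E class probabilities."""
    scores = [b + sum(w * t for w, t in zip(row, token_vector))
              for b, row in zip(beta, omega)]
    return softmax(scores)


def sequence_loss(token_vectors, labels, omega, beta):
    """Multi-class cross-entropy summed over the tokens of one sequence."""
    return sum(-math.log(entity_probabilities(v, omega, beta)[y])
               for v, y in zip(token_vectors, labels))


omega = [[0.2, -0.1, 0.0],
         [1.0, 0.3, -0.5],
         [-0.4, 0.8, 0.1],
         [0.0, 0.2, 0.9]]           # E x D, here 4 x 3
beta = [0.0, 0.0, 0.0, 0.0]
token_vectors = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]  # two tokens, D = 3
labels = [1, 2]                     # indices into ENTITY_TYPES

probs = entity_probabilities(token_vectors[0], omega, beta)
loss = sequence_loss(token_vectors, labels, omega, beta)
```

The same `omega` and `beta` are applied at every position, so the head adds only $E \times (D + 1)$ parameters regardless of sequence length.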

**Text Span Prediction**

Suppose we want to fine-tune the model to answer questions. We provide a question and a passage from Wikipedia that contains the answer. The two are concatenated and tokenized.

BERT is then used to predict the text span in the passage that contains the answer.

Each token maps to two numbers indicating how likely it is that the text begins and ends at this location.

The resulting two sets of numbers are each passed through a softmax function. The likelihood of any text span being the answer can then be derived by combining the probability of starting and the probability of ending at the appropriate positions.
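
One way to combine the two distributions is sketched below. It assumes the span probability is the product of the start and end probabilities and that a valid span must end at or after its start. The function name, the optional `max_span_len` limit, and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_scores, end_scores, max_span_len=None):
    """Return (start, end, probability) of the most likely span with start <= end."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    best = (0, 0, 0.0)
    for s in range(len(p_start)):
        last = len(p_end) if max_span_len is None else min(len(p_end), s + max_span_len)
        for e in range(s, last):
            prob = p_start[s] * p_end[e]  # assumed combination rule: product
            if prob > best[2]:
                best = (s, e, prob)
    return best


# Toy scores for a four-token passage: one start score and one end score per token.
start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [0.0, 0.1, 0.5, 3.0]
start, end, prob = best_span(start_scores, end_scores)
```

With these toy scores the selected span runs from token 1 to token 3, the positions with the highest start and end scores.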