## Transformers

https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0

<div>
  <img src="Images/transformer.png" alt="transformers" style="width: 500px; height: 300px;">
</div>

1. Left part
    1. Input embedding
        
        First, the input is fed into the input embedding layer (essentially a lookup table), which converts each word into a vector representation.
        
    2. Positional Encoding
        
        Because the transformer has no recurrence like an RNN, we must add information about each word's position to the input embeddings.
        
        For every odd dimension index:    $PE_{(pos,2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)$
        
        For every even dimension index:  $PE_{(pos,2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)$
        
    3. Encoder layer
        
        It consists of a multi-headed attention network followed by a feed-forward network.
        
        The multi-headed attention network creates query, key and value vectors from the sum of the positional encoding and the input embedding.
        
        The dot product of the query and key gives the score matrix, which determines how much each word should focus on every other word in the input.
        
        The higher the score, the stronger the focus.
        
        The scores are scaled down by dividing them by the square root of the dimension of the query and key vectors, which keeps the gradients stable.
        
        Next, a softmax over the scaled scores gives the attention weights.
        
        Finally, the attention weights are multiplied by the value matrix to get the output vectors.
        
        To compute multi-headed attention, split the query, key and value vectors into n smaller vectors (one per head) before applying self-attention, then concatenate the per-head outputs and pass them through a linear layer.
        
2. Right part
    1. Output embedding
        
        It takes the list of previous outputs as inputs.
        
    2. Positional encoding
        
        Same as in the encoder, but applied to the output embeddings.
        
    3. Decoder
        
        It consists of two multi-headed attention networks followed by a feed-forward network, with a final linear layer on top. The encoder's output is fed into the second multi-headed attention layer, where it supplies the keys and values, while the queries come from the first attention layer.
        
        The first multi-headed attention network is slightly different because we need to ensure that future words are not fed to it. Each word's scores should be computed only for the words that come before it (and the word itself), not for those after it. This is done via masking: the scores for all future positions are set to -inf before the softmax, so their attention weights become 0.
        
        The final linear layer acts as a classifier and is as big as the number of classes, i.e. the number of words you have (the vocab size). A softmax over its output then gives each word a probability between 0 and 1.
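
The encoder and decoder steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a full transformer: the learned query/key/value projection matrices, residual connections, layer normalization and the feed-forward network are all omitted, so the same vectors stand in for Q, K and V. The function names (`positional_encoding`, `attention`, `multi_head_attention`) and the toy embeddings are made up for this example.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position table: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            i = dim // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked entries."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights). With causal=True, position i cannot
    attend to any position j > i.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Raw scores: dot product of this query with every key, then scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Look-ahead mask: future positions get -inf before the softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        row = softmax(scores)
        # Output: attention-weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


def multi_head_attention(queries, keys, values, n_heads, causal=False):
    """Split each vector into n_heads chunks, attend per head, concatenate."""
    size = len(queries[0]) // n_heads
    merged = [[] for _ in queries]
    for h in range(n_heads):
        lo, hi = h * size, (h + 1) * size
        head_out, _ = attention(
            [q[lo:hi] for q in queries],
            [k[lo:hi] for k in keys],
            [v[lo:hi] for v in values],
            causal=causal,
        )
        for pos, vec in enumerate(head_out):
            merged[pos].extend(vec)
    return merged


# Toy input: 3 tokens with hypothetical 4-dimensional embeddings.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.9, 0.1]]
pe = positional_encoding(seq_len=3, d_model=4)
x = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# Self-attention without learned projections: x plays the role of Q, K and V.
encoder_out, encoder_weights = attention(x, x, x)
decoder_out, decoder_weights = attention(x, x, x, causal=True)
heads_out = multi_head_attention(x, x, x, n_heads=2)
```

In the causal case the first token can only attend to itself, so its attention weights are all on position 0 and its output equals its own value vector.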

## BERT

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

<div>
  <img src="Images/BERT.png" alt="BERT" style="width: 500px; height: 300px;">
</div>

Bidirectional Encoder Representations from Transformers

Bidirectional: As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional.

Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.

Training 

**Masked LM (MLM)** 

**Before feeding word sequences into BERT**, 15% of the words in each sequence are selected for prediction and, in the simplest description, replaced with a [MASK] token (in the original setup, 80% of the selected words become [MASK], 10% are swapped for a random word, and 10% are left unchanged). The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words.
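
A rough sketch of the masking step and of a loss that only counts masked positions. It assumes the input is already a list of word tokens and uses the simple replace-with-[MASK] rule; `mask_tokens` and `masked_lm_loss` are hypothetical helper names, and the per-token losses are assumed to come from some model that is not shown.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace about mask_prob of the tokens with [MASK].

    Returns (masked_tokens, labels); labels[i] holds the original token at
    masked positions and None elsewhere, so the loss can skip those.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


def masked_lm_loss(per_token_losses, labels):
    """Average the per-token losses over masked positions only."""
    kept = [loss for loss, label in zip(per_token_losses, labels) if label is not None]
    return sum(kept) / len(kept) if kept else 0.0


tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(labels)
```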

**Next Sentence Prediction (NSP)**

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
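
The 50/50 pair construction could look roughly like the sketch below. It assumes each document is already split into a list of sentences; `make_nsp_examples` is a hypothetical name, and excluding the first sentence and the true next sentence from the random candidates is a choice made for this example so that a negative pair is never accidentally a positive one.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (first, second, is_next) triples from lists of sentences."""
    rng = rng or random.Random()
    corpus = [sentence for doc in documents for sentence in doc]
    examples = []
    for doc in documents:
        for first, true_next in zip(doc, doc[1:]):
            # Candidate "random" sentences: anything except the pair itself.
            candidates = [s for s in corpus if s not in (first, true_next)]
            if candidates and rng.random() < 0.5:
                examples.append((first, rng.choice(candidates), 0))  # not next
            else:
                examples.append((first, true_next, 1))  # real next sentence
    return examples


documents = [
    ["I went to the store.", "I bought some milk.", "Then I walked home."],
    ["Penguins cannot fly.", "They are strong swimmers."],
]
for first, second, is_next in make_nsp_examples(documents, rng=random.Random(1)):
    print(is_next, "|", first, "|", second)
```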

BERT fine-tuning

Using BERT for a specific task is relatively straightforward. The task-specific output layer takes the place of the NSP head:

1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
2. In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, an NER model can be trained by feeding the output vector of each token into a classification layer that predicts the entity label.
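
As a hedged illustration of the first two heads, the sketch below applies a classification layer to a [CLS] output vector and picks an answer span from start and end vectors. All inputs are made-up toy vectors standing in for BERT outputs, and `classify_cls` and `best_answer_span` are hypothetical names; a real setup would learn these weights during fine-tuning.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify_cls(cls_vector, class_weights):
    """Classification head: one weight vector per class, softmax over the logits."""
    logits = [dot(cls_vector, w) for w in class_weights]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_answer_span(token_vectors, start_vec, end_vec):
    """Pick (start, end) with start <= end that maximizes start score + end score."""
    start_scores = [dot(t, start_vec) for t in token_vectors]
    end_scores = [dot(t, end_vec) for t in token_vectors]
    best, best_score = None, float("-inf")
    for i, start_score in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Toy stand-ins for BERT output vectors (2 dimensions for readability).
probs = classify_cls([1.0, 0.0], class_weights=[[2.0, 0.0], [0.0, 2.0]])
token_vectors = [[0, 0], [2, 0], [0, 1], [0, 3], [1, 1]]
span = best_answer_span(token_vectors, start_vec=[1, 0], end_vec=[0, 1])
print(probs, span)
```

Here the span search is a brute-force loop over all start/end pairs, which is fine for short sequences and keeps the start-before-end constraint explicit.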

BERT uses an encoder-only architecture.

## GPT

<div>
  <img src="Images/GPT.png" alt="GPT" style="width: 500px; height: 300px;">
</div>

Generative Pretrained Transformers

Generative: This feature emphasizes the model's ability to generate text by comprehending and responding to a given text sample. 

Pretrained: The model has been trained on a large dataset of examples before being adapted to or deployed for a specific task.

Transformers: A type of neural network architecture that is designed to handle text sequences of varying lengths.

GPT uses a decoder-only architecture.

**So what's the difference between BERT and GPT?**

- Architecture:
    - GPT: Uses only the decoder part of the transformer
    - BERT: Uses only the encoder part of the transformer
- Pretraining Task:
    - GPT: Next token prediction (language modeling)
    - BERT: Masked language modeling and next sentence prediction
- Attention Mechanism:
    - GPT: Unidirectional (can only attend to previous tokens)
    - BERT: Bidirectional (can attend to both previous and future tokens)
- Input Processing:
    - GPT: Processes text left-to-right
    - BERT: Can access the full context in both directions
- Primary Use:
    - GPT: Excels at text completion and conversation.
    - BERT: Better suited for understanding tasks (question answering and named entity recognition)
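
The attention and pretraining differences can be made concrete with a small sketch. It only shows which positions a token may look at and what GPT-style training pairs look like; `visible_positions` and `next_token_examples` are hypothetical helpers, not part of either model's actual code.

```python
def visible_positions(seq_len, position, bidirectional):
    """Indices that the token at `position` may attend to."""
    if bidirectional:
        return list(range(seq_len))   # BERT-style: the whole sequence
    return list(range(position + 1))  # GPT-style: itself and earlier tokens


def next_token_examples(tokens):
    """GPT-style pretraining pairs: (context so far, token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "down"]
print("BERT sees:", visible_positions(len(tokens), 1, bidirectional=True))
print("GPT sees: ", visible_positions(len(tokens), 1, bidirectional=False))
for context, target in next_token_examples(tokens):
    print(context, "->", target)
```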