# Transformers

### What we will cover today:

- Why should you care about Transformers?

- RNNs: Problems and progress

- Transformers: Context-aware embeddings

- Digging Deeper: What we missed

- The transformers Library

- Going further: GPT, BERT & other models

- Recap & Further reading

## 1️⃣ Why should you care about Transformers?

<img src="pics/chatgpt.jpg" width="500"/>

## 2️⃣ RNNs: Problems and progress

Before we dive into Transformers, let's remind ourselves how many NLP tasks are tackled with RNNs:

<img src="pics/rnn_problem.png" width="750"/>

<img src="pics/encoder_decoder.png" width="750"/>


### What are the key issues we face here?

- Information bottleneck at interface
- Vanishing gradient problem
- We have to compute the entire sequence recursively (makes scaling very hard!)

### RNNs suffer from vanishing gradient through time

<img src="pics/gradient_vanishing.png" width="750"/>

❗️Backpropagation through time❗️ Within a single layer RNN model.

- During backpropagation, the gradient shrinks toward 0 as we move back to earlier time steps.
- As a result, simple RNNs are said to have short memory (variants like LSTM and GRU reduce the problem but do not remove it)
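
To make the shrinking gradient concrete, here is a toy sketch. It assumes every step back in time multiplies the gradient by the same factor of 0.5, a made-up number chosen only for illustration; real RNNs have a different factor at each step.

```python
# Toy illustration of vanishing gradients in backpropagation through time.
# Assumption: each step back in time scales the gradient by the same factor.
per_step_factor = 0.5   # hypothetical size of d h_t / d h_(t-1)
gradient = 1.0          # gradient at the final time step

gradients = [gradient]
for _ in range(20):     # walk back through 20 time steps
    gradient *= per_step_factor
    gradients.append(gradient)

print(f"5 steps back:  {gradients[5]:.6f}")
print(f"20 steps back: {gradients[20]:.10f}")
```

After 20 steps the gradient is below one millionth of its starting value, so the earliest words barely influence the weight updates.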

📚 Michael Phi - [Illustrated Guide to Recurrent Neural Networks](https://medium.com/data-science/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)

### What does this mean for our performance?

A simplification of problems with RNNs:

- Sally adored reading; when she received a book on her birthday she was older

What we'd like:

- Sally adored reading; when she received a book on her birthday she was happy!

RNNs are likely to miss out on **important context** from earlier in the sentence because of their recency bias 🫠

## 3️⃣ Transformers

### The paper that started it all: [Attention is All You Need](https://arxiv.org/abs/1706.03762)

<img src="pics/attention_model.png" width="400"/>

The highest level view:

<img src="pics/high_level.png" width="500"/>

Broken down a bit more:

<img src="pics/encoder_decoder_simplified.png" width="500"/>

### Before we talk about what's going on inside the encoder layers, let's talk about what's going into it!

<img src="pics/numbered_embedding_diagram.png" width="400"/>


Zooming in on one of the encoders:

<img src="pics/768_transformer_block.png" width="400"/>


- Usually the **embedding size** is 768 (as in BERT-base; the original Transformer paper used 512)
- The **hidden dimension** (the length of the projected Q, K and V vectors) is also 768
- In the example we'll use size 2 for simplicity

❓ What's going on in this strange self-attention layer?

1) Each token (word) embedding gets **projected** ➡️ into 3 further vectors: the **query, key and value** vectors

2) We compute a **scaled dot-product** 🔴 on the query and key vectors to work out how much each word relates to those around it

3) Take these scores and **normalize with softmax** ⤵️

4) **Multiply by our value vectors** ❎, sum and pass to our dense neural network.
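
The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not the lecture notebook's code: the sizes, the random stand-in embeddings `X` and the weight matrices `W_q`, `W_k`, `W_v` are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # 1) Project every token embedding into query, key and value vectors.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # 2) Scaled dot-product between every query and every key.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # 3) Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # 4) Weighted sum of the value vectors for each token.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 2            # toy sizes (assumptions)
X = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape)         # one context vector per token
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```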

#### An example sentence

<img src="pics/sentence.png" width="600"/>

<br>

<img src="pics/qkv_cleaned.png" width="400"/>

The weight matrices that produce these three vectors are learned as the model sees more data!

#### Three zoomed-in examples:

Two words in a sentence that are closely related

<img src="pics/dot_product_related_v2.png" width="600"/>

The same two words but seen from the other perspective:

<img src="pics/dot_product_related_inverse_v2_1.png" width="600"/>

Finally, two words with a weak connection:

<img src="pics/dot_product_unrelated_v2_1.png" width="600"/>

#### Let's look at one dot product

To keep it really simple, we're going to imagine our Q, K and V have only been projected into two dimensions

<img src="pics/01_one_Q_one_K_v2.png" width="600"/>

#### What happens once we have our dot products?

<img src="pics/02_one_Q_all_K_v2.png" width="600"/>

#### Then we scale:

<img src="pics/03_one_Q_all_K_scaled_v2.png" width="600"/>

#### Finally we apply softmax:

<img src="pics/04_one_Q_all_K_softmaxed_v2.png" width="600"/>
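
Here is the same dot product, scale and softmax sequence for a single query, written in plain Python. The 2-dimensional vectors are hypothetical values chosen for easy arithmetic; they are not the numbers from the slides.

```python
import math

# Hypothetical 2-dimensional vectors (not the values shown in the slides).
query = [1.0, 2.0]
keys = {"sally": [1.0, 0.0], "book": [0.0, 1.0], "happy": [1.0, 1.0]}

d_k = len(query)

# Dot product of the query with every key, scaled by sqrt(d_k).
scaled = {
    word: sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
    for word, key in keys.items()
}

# Softmax: exponentiate, then divide by the total so the weights sum to 1.
exp_scores = {word: math.exp(score) for word, score in scaled.items()}
total = sum(exp_scores.values())
weights = {word: value / total for word, value in exp_scores.items()}

for word, weight in weights.items():
    print(f"{word:>6}: {weight:.3f}")
```

The key with the largest dot product ("happy" in this toy setup) receives the largest attention weight.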

<br>

We have to do this for each word in our sentence! 

You can see how this becomes a matrix operation

We get the scaled dot-product attention for all of our Query and Key vectors:

<img src="pics/05_attention_matrix_v2.png" width="600"/>

#### And then multiply our "similarity score" with all of our Value vectors

<img src="pics/06_one_attention_all_V_v2_1.png" width="600"/>

<br>

<img src="pics/07_all_attention_all_V_v2_1.png" width="600"/>

So really the entire thing can be written like this:

<img src="pics/QKV_matrix_method.png" width="600"/>
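
For reference, this matrix form is the scaled dot-product attention formula from the *Attention is All You Need* paper, where $d_k$ is the length of the key vectors and the softmax is applied row by row:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$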


We are done with our multiplications!

Now we just need to normalize and pass through to the feed-forward neural network

The neural network will output vectors of our original embedding dimension (e.g. 768)
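
A rough sketch of this last step, assuming a plain layer normalization (without the learned scale and shift a real model has) and a two-layer feed-forward network with a ReLU in between, as in the original paper. All sizes and weights below are made up for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and (roughly) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two dense layers with a ReLU in between, applied to every token independently.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32                # toy sizes (assumptions)
attended = rng.normal(size=(seq_len, d_model))   # stand-in for the attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(layer_norm(attended), W1, b1, W2, b2)
print(out.shape)  # same shape as the input: one d_model-sized vector per token
```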

### Putting it all together:

<img src="pics/slowest_qkv.gif" width="950"/>

### Let's check the lecture notebook for some TensorFlow visualizations

Computing one set of all of these Q, K, V multiplications and processes is what we call "single-headed attention".

When we are doing our initial linear projections (used to create the Q, K and V vectors) we can express these operations as matrices of weights too!

<img src="pics/QKV_matrix.png" width="400"/>

With that in mind, we see the complexity of our model compared to the others:

<img src="pics/complexity.png" width="400"/>

Context windows (a.k.a. our max sequence lengths) add a lot of computation, because the attention matrix grows with the square of the sequence length. Even so, they have been [getting much larger lately](https://www.anthropic.com/news/100k-context-windows)!
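
To get a feel for how sequence length drives cost, this small sketch counts the entries in one head's attention matrix. The helper name `attention_scores` is hypothetical.

```python
def attention_scores(seq_len):
    # One score per (query token, key token) pair.
    return seq_len * seq_len

for seq_len in (512, 4_096, 100_000):
    print(f"{seq_len:>7} tokens -> {attention_scores(seq_len):>14,} scores per head")
```

Doubling the sequence length quadruples the number of scores to compute and store.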

### Multi-headed?

- We can use multiple heads to split up and analyze different parts of our embedding
- Each head can focus on a different part of the embedding, e.g. working on semantic vs. syntactic features of our sentences

<img src="pics/multi_headed.png" width="400"/>
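
A minimal NumPy sketch of the split, assuming 12 heads over a 768-dimensional embedding (so 64 dimensions per head). The random `Q`, `K`, `V` matrices stand in for the projected vectors, and the final output projection that real models apply after merging the heads is left out.

```python
import numpy as np

def split_heads(x, n_heads):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

def merge_heads(x):
    # (n_heads, seq_len, d_head) -> (seq_len, n_heads * d_head)
    n_heads, seq_len, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 768, 12
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))  # stand-ins

Qh, Kh, Vh = (split_heads(m, n_heads) for m in (Q, K, V))
# Each head runs scaled dot-product attention on its own slice of the embedding.
scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(Qh.shape[-1])  # (n_heads, seq_len, seq_len)
weights = softmax(scores)
heads = weights @ Vh              # (n_heads, seq_len, d_head)
output = merge_heads(heads)       # back to (seq_len, d_model)

print(Qh.shape, output.shape)
```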

### Let's visualize the attention weights generated by an example, pre-trained model



In [1]:
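from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

# Load BERT and ask it to return its attention weights alongside the hidden states
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The word "case" means something different in each sentence
first_sentence = "The lawyer worked on the case"
second_sentence = "I pushed shift to make the letters upper case"

# Encode both sentences as a single pair: [CLS] first [SEP] second [SEP]
viz_input = tokenizer(first_sentence, second_sentence, return_tensors="pt")
attention = model(**viz_input).attentions

# Index where the second sentence starts (token_type_ids == 0 marks the first segment)
starter = (viz_input.token_type_ids == 0).sum(dim=1).item()
tokens = tokenizer.convert_ids_to_tokens(viz_input.input_ids[0])

# Show head 8 in the interactive attention view
head_view(attention, tokens, starter, heads=[8])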


### Let's check in with our original diagram:

<img src="pics/one_encoder_block.png" width="500"/>

#### What about the decoder?

At the highest level view, its job is to choose the **most likely next token**:

<img src="pics/decoder_high_veiw.png" width="600"/>

### How does it do this? And what happened to all the work the encoder did?

<img src="pics/decoder_training_step_1.png" width="700"/>

<img src="pics/decoder_training_step_2.png" width="700"/>

### Cross-Attention

<img src="pics/cross-attention-in-transformer-decoder.png" width="700"/>


Attention between encoder and decoder (a.k.a. cross-attention):

- **Self-attention** operates within a single sequence and captures the relationships between tokens within that sequence.
- **Cross-attention** operates between two different sequences and captures the relationships between tokens from the source sequence and tokens from the target sequence, allowing the model to generate relevant output based on the information in the source sequence.
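
The difference shows up clearly in code: in cross-attention the queries come from the decoder (target) side while the keys and values come from the encoder (source) side. This is a sketch with random stand-in states and weights, not a trained model.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    Q = decoder_states @ W_q   # queries come from the target sequence
    K = encoder_states @ W_k   # keys come from the source sequence
    V = encoder_states @ W_v   # values come from the source sequence
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 3, 8   # toy sizes (assumptions)
encoder_states = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, W_q, W_k, W_v)
print(weights.shape)  # one row per target token, one column per source token
```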

### Let's recap the training process at a high level

<img src="pics/decoder_training_highest.png" width="700"/>

### So what does inference look like?

<img src="pics/inference.png" width="700"/>
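
One simple way to picture inference is greedy decoding: repeatedly pick the most likely next token until an end token appears. In the toy sketch below, the lookup table `NEXT_TOKEN_PROBS` is a made-up stand-in for a trained decoder. A real decoder conditions on every token generated so far and on the encoder output, not just on the last token.

```python
# Hypothetical next-token probabilities standing in for a trained decoder.
NEXT_TOKEN_PROBS = {
    "<start>": {"je": 0.9, "tu": 0.1},
    "je": {"suis": 0.8, "sais": 0.2},
    "suis": {"étudiante": 0.7, "<end>": 0.3},
    "étudiante": {"<end>": 1.0},
}

def greedy_decode(max_len=10):
    tokens = ["<start>"]
    # Keep appending the most likely next token until we reach <end>.
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        tokens.append(max(probs, key=probs.get))
    return tokens

print(greedy_decode())
```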

### All of this is in the original diagram!

<img src="pics/attention_model.png" width="400"/>

## 4️⃣ Digging Deeper

#### What haven't we covered yet:
- Self-padding mask
- Skip layers
- Subword tokenization
- Positional encoding

#### Self-padding mask

- Masks out padding tokens during self-attention, preventing the model from attending to them and ensuring focus only on relevant parts of variable-length input sequences.

<img src="pics/padding_in_practice_encoder.png" width="700"/>
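
A common way to implement this is to overwrite the scores of padding positions with a very large negative number before the softmax, so their weights come out as zero. The sketch below assumes a 4-token sequence whose last token is padding; the scores are random stand-ins.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                       # hypothetical raw attention scores
is_padding = np.array([False, False, False, True])     # last token is padding

# Mask the key columns that belong to padding tokens.
masked_scores = np.where(is_padding[None, :], -1e9, scores)
weights = softmax(masked_scores)

print(weights.round(3))  # the last column is all zeros
```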

#### Skip Layers

- Information from earlier layers is propagated directly to later layers, aiding in the retention of valuable information during training.
- Faster convergence and improved performance ✅

<img src="pics/skip_layers.png" width="500"/>
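
In code, a skip connection is just an addition: the sublayer's output is added to its own input. The `sublayer` function below is a hypothetical stand-in for attention or the feed-forward network.

```python
import numpy as np

def sublayer(x):
    # Stand-in for attention or feed-forward; its output here is deliberately small.
    return 0.01 * np.tanh(x)

def with_skip(x, layer):
    # The input is carried forward unchanged and the layer's output is added on top.
    return x + layer(x)

x = np.array([1.0, -2.0, 3.0])
print(with_skip(x, sublayer))
```

Even when the sublayer contributes very little, the original signal passes through intact.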

#### Subword tokenization
- Enables the model to handle out-of-vocabulary (OOV) words and capture morphological variations (e.g. "walks" and "walking" aren't viewed as entirely separate tokens but rather as composites of shared and different subword tokens)
- Makes the language representation more compact, efficient, and generalizable across different word forms

<img src="pics/subword.svg" width="700"/>
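
To illustrate the idea, here is a toy greedy longest-match splitter in the spirit of WordPiece. The tiny vocabulary is an assumption for illustration only; real tokenizers learn vocabularies of tens of thousands of pieces from data.

```python
# Toy vocabulary (an assumption for illustration, not a real model's vocab).
VOCAB = {"walk", "talk", "##s", "##ing", "##ed", "[UNK]"}

def tokenize_word(word, vocab=VOCAB):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

print(tokenize_word("walks"))
print(tokenize_word("walking"))
```

Both words share the `walk` piece, so what the model learns about one carries over to the other.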

#### Positional encoding
- Gives the model an idea of each word's position in the sentence
- Can be hard-coded or learned
- *Attention is All You Need* has a clever hard-coding that uses sine and cosine waves of varying frequencies to convey positional information

<img src="pics/numbered_embedding_diagram.png" width="400"/>
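
The hard-coded scheme from the paper can be sketched in NumPy as follows. It uses sine on even dimensions and cosine on odd dimensions, and assumes `d_model` is even. The function name and toy sizes are illustrative choices.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]         # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    # Each pair of dimensions gets its own frequency.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=8)
print(pe.shape)
print(pe[1].round(3))  # the encoding added to the embedding at position 1
```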

See [this video](https://www.youtube.com/watch?v=dichIcUZfOw) for a detailed walkthrough!

## 5️⃣ HuggingFace
Pretty tricky under the hood, but the `transformers` library makes it all super easy in code:

In [2]:
# ! pip install transformers

- Pre-trained Models: Provides a collection of pre-trained state-of-the-art models like BERT, GPT-2, T5, and many more for various NLP tasks.
- Simple: Offers a unified and simple way to find a model, fine-tune it, and deploy it.
- Multi-framework: Supports both PyTorch and TensorFlow (lots of models are primarily written in PyTorch, but there are plenty of tricks that enable you to work with PyTorch models from TensorFlow)

In [4]:
from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")



In [5]:
result = pipe("I am a student and I am studying in London")
result[0]['translation_text']


"Je suis étudiante et j'étudie à Londres."

### Under the hood of a pipeline
- Models use their own trained **tokenizers**, which we load with the `.from_pretrained()` method
- We just pass in the model's ID: the creator of the model followed by the model name (e.g. `Helsinki-NLP/opus-mt-en-fr`)
- If we want to work with TensorFlow tensors (which we will), we prefix the model class with `TF` and pass `from_pt=True` to convert the PyTorch weights.

In [6]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

tokens = tokenizer.encode("This is easy!", return_tensors = "tf")

print(tokens)

tf.Tensor([[ 160   32 3120  145    0]], shape=(1, 5), dtype=int32)


In [7]:
model = TFAutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr", from_pt = True)

output_tokens = model.generate(tokens)

print(output_tokens)

print(tokenizer.decode(output_tokens[0]))


tf.Tensor([[59513    89     6    82  3860   291     0]], shape=(1, 7), dtype=int32)
<pad> C'est facile !</s>


## 6️⃣ Going further

### What are the two parts of our diagram doing?

<img src="pics/decoder_training_highest.png" width="700"/>

### The Transformer family tree:

<img src="pics/TF_fam_tree.png" width="500"/>

- Encoder-only (e.g. BERT): converts an input sequence to a numerical representation. Uses words to the left and right of each word (hence "bidirectional") and is great for things like classification.

- Decoder-only (e.g. GPT): takes an input sequence and iteratively predicts the most likely next word (can also be used in a similar manner to encoder-decoder if trained correctly)

- Encoder-decoder (e.g. T5 or the original model from the "Attention is All You Need" paper): maps one sequence to another

### So what makes ChatGPT so great?

An example of the GPT-2 architecture:

<img src="pics/GPT2.png" width="500"/>

- Huge amounts of data and training
- 175B parameters (the size of GPT-3, which the GPT-3.5 models behind the original ChatGPT build on)
- Reinforcement Learning from Human Feedback (RLHF): a technique that incorporates human preferences into the training process. It's widely used when fine-tuning large language models (LLMs), such as ChatGPT, to align their responses with human expectations.

## 7️⃣ Recap and Further Reading

- [Intro to Transformers](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro): A high level overview of many of the concepts covered today

- [Jay Alammar's Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/): Fantastic step-by-step visualizations and explanations for multiple Transformer-based models

- [HuggingFace + Transformers Textbook](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/): If you want to do anything with HuggingFace, this is a fantastic primer.

- [Accompanying Open Source GH Repo for the above textbook](https://github.com/nlp-with-transformers/notebooks): Totally free access to the examples from the above book!

- [StatQuest Video on Self-Attention](https://www.youtube.com/watch?v=zxQyTK8quyY): A great step-by-step breakdown of the process