<a href="https://colab.research.google.com/github/unica-ml/ml/blob/master/notebooks/Transformers_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers-LLMs Laboratory


In this laboratory, we'll go through some of the main concepts of the Transformer architecture [1] and of Large Language Models.

In the first part, we will implement a simplified version of the attention mechanism from scratch!
The code is partially inspired by the [Build a Large Language Model (From Scratch)](https://github.com/rasbt/LLMs-from-scratch) [2] repository (strongly recommended if you want to learn more).


## Tokenization

First of all, we need a tokenizer to process the input text and get tokens and their IDs. We'll use a pre-trained one, in this case, the GPT2 [3] tokenizer, which can be loaded from [Hugging Face](https://huggingface.co/).

Hugging Face is the main reference for everything concerning transformers (not only for text): it provides the development resources (e.g., the `transformers` and `tokenizers` libraries) and hosts datasets and pre-trained models.

In [None]:
!pip install -U datasets huggingface_hub fsspec
!pip install bertviz

In [None]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("gpt2")

Let's take a look at the tokenizer vocabulary!

In [None]:
print("Vocab size:", tokenizer.vocab_size)
print()
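# Print the first 10 (token, ID) pairs; `token_id` avoids shadowing the built-in `id`
for i, (token, token_id) in enumerate(tokenizer.get_vocab().items()):
  if i == 10:
    break
  print(token, token_id)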

Vocab size: 50257

Ġrests 24013
Ġsensing 34244
UD 8322
Ġml 25962
Ġrevenge 15827
Ġsafety 3747
Ġswift 14622
Ġtownship 42823
ĠPrism 35417
ĠPiper 33503


Tokenizers expose different methods to obtain tokens and IDs from text (and vice versa). Now we create a sentence and try some of them:
- `tokenize` returns a list of tokens
- `encode` returns a list of token IDs
- `__call__` (i.e., calling the tokenizer directly) returns a dictionary-like object with several fields, including the token IDs (key `input_ids`) and the attention mask (key `attention_mask`), which tells the model whether to attend to each token (value 1) or ignore it (value 0, typically used for padding tokens).
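
To make the role of the attention mask concrete, here is a minimal pure-Python sketch of how ID lists of different lengths could be padded into one batch. The helper `pad_batch` is hypothetical (it is not part of the tokenizer API) and the ID lists are only illustrative. GPT-2 has no dedicated padding token, so the sketch assumes the end-of-text ID 50256 is reused for padding.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad every ID list to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[32, 8379, 286, 2456, 13], [32, 8379]], pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sequence is extended with three padding IDs, and its mask is zero exactly at those positions.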

In [None]:
sentence = "A sequence of words."

tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)

ids = tokenizer.encode(sentence)
print("IDs:", ids)

encoding = tokenizer(sentence)
print("Encoding:", encoding)

Tokens: ['A', 'Ġsequence', 'Ġof', 'Ġwords', '.']
IDs: [32, 8379, 286, 2456, 13]
Encoding: {'input_ids': [32, 8379, 286, 2456, 13], 'attention_mask': [1, 1, 1, 1, 1]}


## Embedding

Once we have token IDs, we can compute embeddings. For now, let's simulate this step, as it would require training one or more embedding layers.
We therefore create a random embedding matrix $X$ with one row per token (5 tokens) and embedding dimension $d=4$.

In [None]:
import torch


X = torch.rand(5, 4)
print(X)

tensor([[0.3570, 0.0826, 0.7419, 0.4303],
        [0.9318, 0.0557, 0.6334, 0.0181],
        [0.1714, 0.6355, 0.2957, 0.9169],
        [0.8550, 0.2291, 0.6086, 0.4313],
        [0.0544, 0.7056, 0.1190, 0.5368]])
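
For comparison, a real (trainable) embedding layer is essentially a lookup table with one row per token ID. The sketch below uses `torch.nn.Embedding` with randomly initialized, untrained weights; the vocabulary size and the token IDs are those of the GPT-2 tokenizer and sentence used here, while the variable names are only illustrative.

```python
import torch

torch.manual_seed(0)  # only to make the random initialization repeatable

vocab_size, d = 50257, 4  # GPT-2 vocabulary size, toy embedding dimension
embedding = torch.nn.Embedding(vocab_size, d)  # one trainable row per token ID

token_ids = torch.tensor([32, 8379, 286, 2456, 13])  # IDs of "A sequence of words."
X_lookup = embedding(token_ids)  # selects rows 32, 8379, ... of the weight matrix

print(X_lookup.shape)
print(torch.equal(X_lookup[0], embedding.weight[32]))
```

The result has one 4-dimensional row per token, and each row is simply the corresponding row of `embedding.weight`.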


## Self-Attention

We start with a simplified example, excluding for now the trainable parameters $W_q, W_k, W_v$ of the attention layer, so that $Q=K=V=X$.


Recalling the attention layer equation:

$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$,

where $\frac{QK^T}{\sqrt{d_k}}$ are the attention scores,
and $softmax(\frac{QK^T}{\sqrt{d_k}})$ the attention weights.

The rows of the resulting output are called context vectors (one per token).
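
The division by $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would push the softmax towards a near one-hot output. A small pure-Python illustration, with made-up scores and an assumed $d_k=64$:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                       # assumed key dimension
raw_scores = [8.0, 0.0, -8.0]  # toy dot products, made up for illustration
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

weights_raw = softmax(raw_scores)        # almost all weight on the first entry
weights_scaled = softmax(scaled_scores)  # a much smoother distribution
print(weights_raw)
print(weights_scaled)
```

Both rows sum to 1, but only the scaled one still assigns noticeable weight to the other tokens.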


Let's first compute the self-attention score of the second token $x_1$ with respect to the third token $x_2$.

In [None]:
q = X[1, :]
k = X[2, :]

attention_score_1_vs_2 = q.dot(k)
print(attention_score_1_vs_2.item())

0.39895498752593994

If we multiply $Q$ by the transpose of $K$, we obtain the attention scores for every token pair.

In [None]:
Q = X
K = X

attention_scores = Q @ K.T
print(attention_scores)

tensor([[0.8699, 0.8150, 0.7276, 0.9613, 0.3970],
        [0.8150, 1.2729, 0.3990, 1.2028, 0.1751],
        [0.7276, 0.3990, 1.3614, 0.8675, 0.9852],
        [0.9613, 1.2028, 0.8675, 1.3400, 0.5121],
        [0.3970, 0.1751, 0.9852, 0.5121, 0.8032]])

The following step consists of scaling and normalizing the attention scores to obtain the attention weights (such that the attention weights for each token sum to 1).

The attention weights matrix always has shape ($n\_tokens$, $n\_tokens$).

In [None]:
attention_weights = torch.softmax(attention_scores / X.shape[1]**0.5, dim=1)
print(attention_weights)

tensor([[0.2109, 0.2052, 0.1965, 0.2208, 0.1665],
        [0.1996, 0.2510, 0.1621, 0.2423, 0.1450],
        [0.1841, 0.1562, 0.2528, 0.1975, 0.2094],
        [0.1965, 0.2217, 0.1875, 0.2374, 0.1570],
        [0.1811, 0.1621, 0.2430, 0.1918, 0.2219]])

Finally, we obtain each context vector as a weighted sum of the rows of the value matrix $V$, using the attention weights as coefficients.

In [None]:
V = X

context_vectors = attention_weights @ V
print(context_vectors)

tensor([[0.4981, 0.3218, 0.4988, 0.4592],
        [0.5480, 0.2913, 0.5198, 0.4214],
        [0.4349, 0.3776, 0.4554, 0.5114],
        [0.5204, 0.3128, 0.5048, 0.4471],
        [0.4335, 0.3790, 0.4521, 0.5056]])

Context vectors can be viewed as higher-level embedding representations, which encode information about the relationships between input tokens.

However, the model needs some trainable parameters in order to learn how to produce those richer representations!

We thus (randomly) create the query, key, and value weights $W_q, W_k, W_v$, and compute the respective matrices as $Q=XW_q$, $K=XW_k$, and $V=XW_v$.

Note that the first dimension of each weight matrix must match the input dimensionality. The second dimension is a free choice: $W_q$ and $W_k$ must share it (so that $QK^T$ is defined), while the second dimension of $W_v$ sets the dimension of each produced context vector. Here we use 3 for all of them.

In [None]:
W_q = torch.rand(size=(4, 3))
W_k = torch.rand(size=(4, 3))
W_v = torch.rand(size=(4, 3))

Q = X @ W_q
K = X @ W_k
V = X @ W_v

attention_scores = Q @ K.T
attention_weights = torch.softmax(attention_scores / K.shape[1]**0.5, dim=1)  # scale by sqrt(d_k), the key dimension
context_vectors = attention_weights @ V
print(context_vectors)

tensor([[1.3110, 1.0721, 0.6139],
        [1.3169, 1.0757, 0.6191],
        [1.3151, 1.0737, 0.6178],
        [1.3204, 1.0759, 0.6231],
        [1.3106, 1.0741, 0.6121]])
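
As a quick shape check, the sketch below uses different (arbitrarily chosen) sizes for the query/key projection and the value projection. The names `X_toy`, `Wq`, `Wk`, `Wv` and the sizes are illustrative assumptions, not part of the lab code.

```python
import torch

torch.manual_seed(0)

n_tokens, d_in, d_k, d_v = 5, 4, 3, 2  # d_k and d_v chosen arbitrarily
X_toy = torch.rand(n_tokens, d_in)
Wq = torch.rand(d_in, d_k)
Wk = torch.rand(d_in, d_k)  # must share d_k with Wq so that Q @ K.T is defined
Wv = torch.rand(d_in, d_v)  # free to differ: sets the context vector size

Q_toy, K_toy, V_toy = X_toy @ Wq, X_toy @ Wk, X_toy @ Wv
weights_toy = torch.softmax(Q_toy @ K_toy.T / d_k**0.5, dim=-1)
context_toy = weights_toy @ V_toy

print(weights_toy.shape)  # (n_tokens, n_tokens)
print(context_toy.shape)  # (n_tokens, d_v)
```

The attention weights stay square in the number of tokens, whatever projection sizes are chosen.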


We just built the main block of the Transformer architecture!

Actually, that was just one attention _head_... in practice, each attention layer has multiple parallel heads, and their outputs are concatenated.
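
A minimal sketch of the multi-head idea, assuming two heads with toy sizes and random (untrained) weights. Real implementations usually compute all heads in a single batched matrix multiplication and apply a final output projection, both of which are omitted here for readability.

```python
import torch

torch.manual_seed(0)

n_tokens, d_in, n_heads, d_head = 5, 4, 2, 3  # toy sizes, chosen arbitrarily
X_toy = torch.rand(n_tokens, d_in)

head_outputs = []
for _ in range(n_heads):
    # Each head owns its projection matrices (random here, trained in practice).
    Wq, Wk, Wv = (torch.rand(d_in, d_head) for _ in range(3))
    Q_h, K_h, V_h = X_toy @ Wq, X_toy @ Wk, X_toy @ Wv
    weights_h = torch.softmax(Q_h @ K_h.T / d_head**0.5, dim=-1)
    head_outputs.append(weights_h @ V_h)  # (n_tokens, d_head)

# Concatenate along the feature dimension: (n_tokens, n_heads * d_head)
multi_head_output = torch.cat(head_outputs, dim=-1)
print(multi_head_output.shape)
```

Each token ends up with a vector of size `n_heads * d_head`, built from the per-head context vectors placed side by side.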

Also, recall that in decoders, _masked_ attention is used, so that the model cannot rely on _future_ tokens in the input sequence.

The masking must be performed **before** applying the softmax function. Otherwise, the resulting attention weights won't sum to 1. To do so, we can replace the attention scores that we want to mask with $-\infty$ values: when applying the softmax function, they will become $0$.

In [None]:
attention_scores = Q @ K.T
mask = torch.triu(torch.ones(attention_scores.shape), diagonal=1)
attention_scores = attention_scores.masked_fill(mask.bool(), -torch.inf)
attention_weights = torch.softmax(attention_scores / K.shape[1]**0.5, dim=1)  # scale by sqrt(d_k), the key dimension
print(attention_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4583, 0.5417, 0.0000, 0.0000, 0.0000],
        [0.2502, 0.2771, 0.4727, 0.0000, 0.0000],
        [0.1567, 0.1846, 0.3428, 0.3158, 0.0000],
        [0.1560, 0.1733, 0.2473, 0.2379, 0.1855]])
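
The difference between masking before and after the softmax can be checked on a single row of toy scores. This pure-Python sketch uses made-up values and a hand-written `softmax` helper:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 1.0, 2.0]       # toy attention scores for one query token
future = [False, False, True]  # the third position is a future token

# Mask before softmax: -inf scores turn into exactly zero weight.
masked_scores = [-math.inf if f else s for s, f in zip(scores, future)]
weights_before = softmax(masked_scores)

# Mask after softmax: zeroing weights leaves a row that no longer sums to 1.
weights_after = [0.0 if f else w for w, f in zip(softmax(scores), future)]

print(weights_before, sum(weights_before))
print(weights_after, sum(weights_after))
```

Only the first variant yields a valid probability distribution over the visible tokens.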


## Loading a pretrained model

We now load a pre-trained decoder-only model for text generation, i.e., GPT2 [3]. Implementation and weights are publicly available and can be retrieved from Hugging Face.

In [None]:
from transformers import AutoModelForCausalLM


model = AutoModelForCausalLM.from_pretrained("gpt2")
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


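This model takes as input a sequence of token IDs (the attention mask is optional for a single unpadded sequence) and returns, inside a dictionary-like object, one vector of logits per input position: unnormalized scores over the vocabulary for the token that follows that position. Applying the softmax to a logits vector gives the predicted next-token distribution.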

In [None]:
sentence = "Machine Learning algorithms are"
input_dict = tokenizer(sentence, return_tensors="pt")
output = model(input_ids=input_dict["input_ids"], output_attentions=True)
print(output.keys())
print(output["logits"].shape)



odict_keys(['logits', 'past_key_values', 'attentions'])
torch.Size([1, 4, 50257])


Let's inspect and visualize the attention weights!

In [None]:
from bertviz import head_view


print(len(output.attentions))  # number of attention layers
print(output.attentions[0].shape)  # (batch_size, num_heads, sequence_length, sequence_length)

tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"][0])
head_view(output.attentions, tokens)  # interactive per-head attention visualization

12
torch.Size([1, 12, 4, 4])



## Generating text with decoder-only models

To generate new text, we are interested in the prediction at the last position, i.e., the token that should follow the input sequence.

We convert the output logits to probabilities and take the index of the largest one (the argmax), which is the ID of the predicted next token. Finally, we convert the ID back to its corresponding token.

In [None]:
logits = output["logits"][:, -1, :]
probabilities = torch.softmax(logits, dim=1)
pred_token_id = torch.argmax(probabilities, dim=1, keepdim=True).item()
pred_token = tokenizer.decode(pred_token_id)
print(pred_token)

 used
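
Since the softmax preserves the ordering of its inputs, the argmax of the probabilities always coincides with the argmax of the logits, so for greedy decoding the softmax step is optional. A small pure-Python check on made-up logits:

```python
import math

logits_toy = [2.0, -1.0, 0.5, 3.5]  # toy scores, made up for illustration

m = max(logits_toy)
exps = [math.exp(x - m) for x in logits_toy]
total = sum(exps)
probabilities_toy = [e / total for e in exps]

argmax_logits = max(range(len(logits_toy)), key=lambda i: logits_toy[i])
argmax_probs = max(range(len(probabilities_toy)), key=lambda i: probabilities_toy[i])
print(argmax_logits, argmax_probs)
```

Both indices are the same; the probabilities only become necessary when sampling.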


We can generate longer sentences by repeating this process, feeding at each step the sequence generated in the previous steps.

In [None]:
generated_tokens_ids = []
input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
for i in range(50):
    output = model(input_ids=input_ids)
    logits = output["logits"][:, -1, :]
    probabilities = torch.softmax(logits, dim=1)
    pred_token_id = torch.argmax(probabilities, dim=1, keepdim=True)
    generated_tokens_ids.append(pred_token_id.item())
    input_ids = torch.cat((input_ids, pred_token_id), dim=1)
generated_text = tokenizer.decode(generated_tokens_ids)
print(sentence, "...")
print(generated_text)

Machine Learning algorithms are ...
 used to train neural networks to perform tasks such as predicting the future.

The researchers used a combination of machine learning and machine learning algorithms to create a new type of machine learning algorithm called a "supervised learning algorithm."

The researchers used


The procedure above is deterministic: it always picks the most likely token (greedy decoding). To add variety to the generated text, the next token is typically sampled at random among the top-k most likely tokens, in proportion to their probabilities.
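
A minimal pure-Python sketch of top-k sampling over one logits vector. The function `sample_top_k` and the toy logits are hypothetical and only illustrate the idea; they do not reproduce the exact sampling code of any library.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one token ID among the k highest-scoring entries of `logits`."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - m) for i in top_ids]  # unnormalized softmax
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded generator, so repeated runs give the same draws
toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3]  # made-up scores over a 5-token vocabulary
samples = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(10)]
print(samples)  # only IDs 1 and 3 (the two largest logits) can appear
```

With `k=1` the function reduces to greedy decoding, while larger values of `k` allow less likely tokens to be picked.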

Hugging Face also provides `pipeline` objects that integrate all the required operations for the most common tasks, like text generation.

In [None]:
from transformers import pipeline, set_seed


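# set_seed(42)  # uncomment for reproducible sampling
# Use a separate name so the pipeline does not overwrite the `model` loaded above
generator = pipeline("text-generation", model="gpt2", max_length=5)
generated_text = generator(sentence)
print(generated_text)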

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Machine Learning algorithms are often described as "smart" machines, so they require a lot of control over the machine learning algorithms.\n\nWhy is there a problem with this?\n\nWhen you look at the performance of a smart machine, one of'}]


## Fine-tuning an encoder-only model

We will now load a pre-trained encoder-only model that maps the input text into a hidden space, capturing high-level relationships and concepts. On top of this representation, we can attach task-specific layers (e.g., a linear classifier) and fine-tune the whole model on the given task. Usually, just a few epochs with a small learning rate are enough.

We will use DistilBERT [4], a distilled (i.e., with reduced size but similar performance) version of BERT (Bidirectional Encoder Representations from Transformers) [5], and fine-tune it for an emotion classification task.

### Dataset loading

We first load the popular `emotion` dataset, which consists of Twitter messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise.

In [None]:
from datasets import load_dataset

ds = load_dataset("dair-ai/emotion", "split")

In [None]:
print(ds)
print(ds["train"][:3])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
{'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong'], 'label': [0, 0, 3]}


### Model loading

We now load the pre-trained tokenizer and model from Hugging Face. Note that we are wrapping the model into a class that automatically attaches a (randomly initialized) classification layer on top of the encoder's last hidden layer. For this reason, the warning we get is expected.

We also have to set the number of output classes we need, which is $6$.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification


tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=6)
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Also note that the classification layer has an input size equal to a single-token embedding size (i.e., $768$). Indeed, the classifier only takes as input the hidden representation of the first input token (which is the special sequence start `[CLS]` token).
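
The following sketch illustrates this pooling step on random data: it selects the hidden vector at position 0 and feeds it to a linear layer. It is a simplification under stated assumptions: `hidden_states` is a random stand-in for the encoder output, and `toy_classifier` is a hypothetical single layer, whereas the actual DistilBERT head also includes the `pre_classifier` layer listed in the warning above.

```python
import torch

torch.manual_seed(0)

batch_size, seq_len, hidden, n_classes = 2, 7, 768, 6
hidden_states = torch.rand(batch_size, seq_len, hidden)  # stand-in for the encoder output

cls_repr = hidden_states[:, 0]  # hidden vector of the first ([CLS]) token
toy_classifier = torch.nn.Linear(hidden, n_classes)  # hypothetical head, randomly initialized
toy_logits = toy_classifier(cls_repr)

print(cls_repr.shape, toy_logits.shape)
```

Whatever the sequence length, the classifier always receives one 768-dimensional vector per input text.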

### Tokenization

We then define a tokenization function and apply it to the entire dataset.

DistilBERT accepts at most 512 tokens per sequence, and all sequences in a batch must have the same length, so we set some tokenizer parameters to produce fixed-length inputs of 512 tokens. Longer sequences are truncated; shorter ones are padded with `[PAD]` tokens up to that length. Note that the special `[CLS]` and `[SEP]` tokens are always added at the start and end of the sequence, respectively.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)


tokenized_datasets = ds.map(tokenize_function, batched=True)

In [None]:
print(tokenized_datasets["train"][0])
tokenizer.special_tokens_map

{'text': 'i didnt feel humiliated', 'label': 0, 'input_ids': [101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}
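
What the tokenizer does with `padding="max_length"` and `truncation=True` can be mimicked in a few lines of pure Python. The helper `pad_or_truncate` is hypothetical and only approximates the behaviour; the special token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the ones visible in the tokenized example, and a small `max_length` is assumed for readability.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs seen in the tokenized example


def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], cut to max_length, then right-pad with [PAD]."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [PAD_ID] * n_pad, attention_mask


# Short input: padded up to max_length
demo_ids, demo_mask = pad_or_truncate([1045, 2134, 2102, 2514, 26608], max_length=8)
print(demo_ids)
print(demo_mask)

# Long input: truncated, but still wrapped by [CLS] and [SEP]
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(long_ids)
```

Padding every example to 512 tokens is simple but wasteful for short tweets; padding each batch only to its longest sequence is a common alternative.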

We now set the training parameters using the `TrainingArguments` class provided by the transformers library.

In [None]:
from transformers import TrainingArguments


training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    report_to="none",
    logging_dir="./logs",
    logging_steps=1
)

We also define the metric to be computed on the validation set during training, namely the accuracy.

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = np.mean(predictions == labels)
    return {"accuracy": accuracy}
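
To see what the metric function receives and returns, it can be called by hand on toy data. The sketch below repeats the function so that it runs on its own; the logits and labels are made up (3 classes, 4 examples).

```python
import numpy as np


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": np.mean(predictions == labels)}


toy_logits = np.array([[2.0, 0.1, 0.3],   # predicted class 0
                       [0.2, 1.5, 0.1],   # predicted class 1
                       [0.9, 0.2, 0.4],   # predicted class 0
                       [0.1, 0.3, 2.2]])  # predicted class 2
toy_labels = np.array([0, 1, 2, 2])

toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)  # 3 of the 4 predictions match the labels
```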

Finally, we instantiate a `Trainer`, provided by the transformers library, which automatically handles the training loop. We also subsample the dataset to speed up training.

In [None]:
from transformers import Trainer


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42).select(range(1000)),
    compute_metrics=compute_metrics,
)

We can now launch the fine-tuning!

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.5558 | 0.53794 | 0.837 |
| 2 | 0.2993 | 0.265765 | 0.923 |
| 3 | 0.3728 | 0.232629 | 0.918 |


TrainOutput(global_step=237, training_loss=0.5643707242193101, metrics={'train_runtime': 192.765, 'train_samples_per_second': 77.815, 'train_steps_per_second': 1.229, 'total_flos': 1987152721920000.0, 'train_loss': 0.5643707242193101, 'epoch': 3.0})
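
The `global_step` value reported above can be reproduced with simple arithmetic, assuming training on a single device with no gradient accumulation:

```python
import math

n_train, batch_size, n_epochs = 5000, 64, 3  # values used above, single device assumed

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 79 237
```

The 237 optimizer steps match `global_step=237` in the training output.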

## References

[1] Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. Neural Information Processing Systems.

[2] Raschka, S. (2024). Build A Large Language Model (From Scratch). Manning. ISBN: 978-1633437166.

[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

[4] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.

[5] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics.