## Chapter 1: Pay Attention to LLMs

### Spoilers

In this chapter, we’ll:

- Briefly discuss the history of language models
- Understand the basic elements of the Transformer architecture and the attention mechanism
- Understand the different types of fine-tuning

### Transformers

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch1/stacked_layers.png?raw=True)
<center>Figure 1.1 - Transformer’s stacked "layers"</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch1/full_transformer.png?raw=True)
<center>Figure 1.2 - Transformer architecture in detail</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch1/bert_embeddings.png?raw=True)
<center>Figure 1.3 - Contextual word embeddings from BERT</center>

### Attention Is All You Need

$$
\Large
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
<center>Equation 1.1 - Attention formula</center>
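
The formula can be tried out on tiny matrices before moving to real embeddings. The snippet below is a minimal NumPy sketch, not a reference implementation: the helper names `softmax` and `attention` and the toy values of `Q`, `K`, and `V` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with d_k read from the keys
    d_k = K.shape[-1]
    scaled = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scaled, axis=-1)
    return weights @ V, weights

# Toy inputs (hypothetical values): 2 queries, 3 keys/values, d_k = 2
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

context, weights = attention(Q, K, V)
print(weights.sum(axis=-1))  # each row of weights sums to 1
print(context.shape)         # (2, 2): one context vector per query
```

Each row of `weights` says how much a query attends to each key, and each row of `context` is the corresponding weighted average of the rows of `V`.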

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch1/translation_att.png?raw=True)
<center>Figure 1.4 - Attention scores</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch1/multiple_keys_context.png?raw=True)
<center>Figure 1.5 - Querying two-dimensional keys</center>

$$
\Large
\text{cos}\theta \, ||Q|| \, ||K|| = Q \cdot K
$$
<center>Equation 1.2 - Cosine similarity, norms, and the dot product</center>
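
The relationship between the dot product, the norms, and the cosine of the angle between two vectors can be checked numerically. This is a small pure-Python sketch, where the helpers `dot`, `norm`, and `cosine_similarity` and the two-dimensional vectors are made up for illustration.

```python
import math

def dot(a, b):
    # Sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of the vector
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical two-dimensional query and key
Q = [3.0, 4.0]
K = [4.0, 3.0]

cos_theta = cosine_similarity(Q, K)
lhs = cos_theta * norm(Q) * norm(K)  # cosine scaled by both norms
rhs = dot(Q, K)                      # plain dot product
print(math.isclose(lhs, rhs))
```

In other words, the dot product is the cosine similarity scaled by the lengths of both vectors, so longer keys can produce larger scores even at the same angle.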

In [6]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
repo_id = 'microsoft/Phi-3-mini-4k-instruct'
tokenizer = AutoTokenizer.from_pretrained(repo_id)
vocab_size = len(tokenizer)

torch.manual_seed(13)
# Made-up embedding and projection layers
d_model = 1024
embedding_layer = nn.Embedding(vocab_size, d_model)
linear_query = nn.Linear(d_model, d_model)
linear_key = nn.Linear(d_model, d_model)
linear_value = nn.Linear(d_model, d_model)

In [7]:
sentence = 'Just a dummy sentence'
input_ids = tokenizer(sentence, return_tensors='pt')['input_ids']
input_ids

tensor([[ 3387,   263, 20254, 10541]])

In [8]:
embeddings = embedding_layer(input_ids)
embeddings.shape

torch.Size([1, 4, 1024])

In [9]:
# Projections
proj_key = linear_key(embeddings)
proj_value = linear_value(embeddings)
proj_query = linear_query(embeddings)
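# Attention scores: scaled dot products, normalized with softmax over the keys
# (there is a single head here, so d_k equals d_model)
dot_products = torch.matmul(proj_query, proj_key.transpose(-2, -1))
scores = F.softmax(dot_products / np.sqrt(d_model), dim=-1)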
scores.shape

torch.Size([1, 4, 4])

In [10]:
# Context vectors: attention-weighted average of the value projections
context = torch.matmul(scores, proj_value)
context.shape

torch.Size([1, 4, 1024])
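
The cells above can be folded into a single function and compared against PyTorch's built-in `F.scaled_dot_product_attention`. The sketch below makes several assumptions: PyTorch 2.0 or later is installed, `single_head_attention` is a hypothetical helper name, and `vocab_size` and `d_model` are small stand-in values so that no tokenizer download is needed. The token IDs are the ones printed earlier for the dummy sentence.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(13)

# Stand-in sizes (assumptions): vocab_size only needs to exceed the largest token ID
vocab_size, d_model = 32_000, 64
embedding_layer = nn.Embedding(vocab_size, d_model)
linear_query = nn.Linear(d_model, d_model)
linear_key = nn.Linear(d_model, d_model)
linear_value = nn.Linear(d_model, d_model)

def single_head_attention(embeddings):
    # Project the embeddings into queries, keys, and values
    q = linear_query(embeddings)
    k = linear_key(embeddings)
    v = linear_value(embeddings)
    # Scaled dot products, normalized over the keys (last dimension)
    scores = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    return scores @ v, scores

# Token IDs taken from the tokenizer output shown earlier
input_ids = torch.tensor([[3387, 263, 20254, 10541]])

with torch.no_grad():
    embeddings = embedding_layer(input_ids)
    context, scores = single_head_attention(embeddings)
    # Built-in equivalent (requires PyTorch >= 2.0)
    reference = F.scaled_dot_product_attention(
        linear_query(embeddings),
        linear_key(embeddings),
        linear_value(embeddings),
    )

print(scores.sum(dim=-1))  # every row of attention scores sums to 1
print(torch.allclose(context, reference, atol=1e-5))
```

Because the layers are randomly initialized, the actual numbers are meaningless. The point is only that the shapes and the normalization match the formula.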