# Chapter 04: Building a GPT from Scratch

This notebook accompanies Chapter 04 of the series **Fazendo um LLM do Zero**.

In this notebook we build a minimal, didactic GPT piece by piece.

🎯 **Goals of this notebook:**
- Assemble a complete **TransformerBlock**
- Implement **Self-Attention with a Causal Mask**
- Add an **MLP (Feedforward)**
- Apply **Residual Connections + LayerNorm**
- Stack blocks to form a **GPTMini**
- Run a **short training loop**
- Test **Autoregressive Generation**


## 1. Setup and Configuration

In [None]:
# ============================================================
# Repository setup
# ============================================================
import os

REPO_URL = "https://github.com/vongrossi/fazendo-um-llm-do-zero.git"
REPO_DIR = "fazendo-um-llm-do-zero"

# Skip cloning and chdir when the cell is re-run from inside the repository
if os.path.basename(os.getcwd()) != REPO_DIR:
    if not os.path.exists(REPO_DIR):
        !git clone {REPO_URL}
    os.chdir(REPO_DIR)

print("Current directory:", os.getcwd())


### 1.1 Dependencies and Device

In [None]:
# ============================================================
# Dependencies
# ============================================================
# Note: Colab usually ships with torch preinstalled.
# This pip install keeps things consistent if you pin versions in the repo.

!pip -q install -r 04-gpt-do-zero/requirements.txt


Colab lets you use a GPU, which gives extra processing power for many tasks. The cell below falls back to the CPU when no GPU is available.

In [None]:
# ============================================================
# Optional GPU on Colab
# ============================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
import math
import random
import numpy as np

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)


## 2. What are we building?

A didactic GPT (Decoder-Only Transformer) consists of:

1) Token Embeddings  
2) Positional Embeddings  
3) Stacked Transformer blocks, each with:
   - LayerNorm
   - Self-Attention (causal)
   - Residual connection
   - LayerNorm
   - MLP / Feedforward
   - Residual connection
4) Final LayerNorm and linear head (logits over the vocabulary)
5) Loss (cross-entropy) for training
6) Autoregressive generation (token by token)

The order above matches the order of the code.


## 3. Minimal Dataset

In [None]:
# ============================================================
# Minimal dataset
# ============================================================
# We want a dataset that runs fast.
# Real models train on enormous corpora; this one only demonstrates the pipeline.

text = """
o gato subiu no telhado
o cachorro subiu no sofa
o gato dormiu no sofa
o cachorro dormiu no tapete
o gato pulou no muro
"""
text = text.strip().lower()
print(text)


### 3.1 Simple Tokenization

In [None]:
# ============================================================
# Simple (didactic) tokenization
# ============================================================
tokens = text.split()
vocab = sorted(set(tokens))

token_to_id = {t:i for i,t in enumerate(vocab)}
id_to_token = {i:t for t,i in token_to_id.items()}

encoded = [token_to_id[t] for t in tokens]

print("Vocab size:", len(vocab))
print("Tokens:", tokens[:20])
print("Encoded:", encoded[:20])


### 3.2 Sliding Window (Input/Target)

In [None]:
# ============================================================
# Input/target pairs for next-token prediction
# ============================================================
def build_dataset(token_ids, context_size):
    X, Y = [], []
    for i in range(len(token_ids) - context_size):
        X.append(token_ids[i:i+context_size])
        Y.append(token_ids[i+context_size])
    return torch.tensor(X), torch.tensor(Y)

context_size = 5
X, Y = build_dataset(encoded, context_size)

print("X shape:", X.shape)
print("Y shape:", Y.shape)
print("Example input:", X[0], "-> target:", Y[0], id_to_token[int(Y[0])])
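
To make the windowing concrete, here is a minimal pure-Python sketch of the same idea on a five-token sequence. The helper name `sliding_pairs` and the sample ids are illustrative assumptions, not part of the notebook.

```python
def sliding_pairs(token_ids, context_size):
    """Return (context, next_token) pairs, mirroring build_dataset without tensors."""
    pairs = []
    for i in range(len(token_ids) - context_size):
        context = token_ids[i:i + context_size]   # the model input
        target = token_ids[i + context_size]      # the token that follows it
        pairs.append((context, target))
    return pairs

# Hypothetical ids for a 5-token sequence
ids = [2, 0, 3, 1, 4]
pairs = sliding_pairs(ids, context_size=3)
for context, target in pairs:
    print(context, "->", target)
# [2, 0, 3] -> 1
# [0, 3, 1] -> 4
```

A sequence of N tokens yields N - context_size pairs, so the 25-token corpus above with `context_size = 5` gives 20 training examples.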


## 4. GPT Components (Built One by One)

### 4.1 Model Configuration

In [None]:
@dataclass
class GPTConfig:
    vocab_size: int
    context_size: int
    d_model: int = 64
    n_heads: int = 4
    n_layers: int = 2
    dropout: float = 0.1

config = GPTConfig(
    vocab_size=len(vocab),
    context_size=context_size,
    d_model=64,
    n_heads=4,
    n_layers=2,
    dropout=0.1
)
config


### 4.2 Token Embeddings + Positional Embeddings

In [None]:
class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.tok_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.context_size, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, idx):
        # idx: (B, T)
        B, T = idx.shape
        positions = torch.arange(0, T, device=idx.device).unsqueeze(0)  # (1, T)
        x = self.tok_emb(idx) + self.pos_emb(positions)                 # (B, T, C)
        return self.dropout(x)

emb_layer = TokenAndPositionEmbedding(config).to(device)

dummy = X[:2].to(device)
out = emb_layer(dummy)
print("Emb output:", out.shape)


### 4.3 Self-Attention with a Causal Mask

In [None]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.d_model % config.n_heads == 0
        self.n_heads = config.n_heads
        self.head_dim = config.d_model // config.n_heads

        # Q, K, V projections in a single layer (simpler and common)
        self.qkv = nn.Linear(config.d_model, 3 * config.d_model)
        self.out_proj = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

        # Fixed causal mask (T x T)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.context_size, config.context_size))
        )

    def forward(self, x):
        # x: (B, T, C)
        B, T, C = x.shape

        qkv = self.qkv(x)  # (B, T, 3C)
        q, k, v = qkv.split(C, dim=2)

        # reshape for multi-head: (B, nh, T, hs)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # attention scores: (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # apply the causal mask (blocks future positions)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))

        # softmax over the scores
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        # context: (B, nh, T, hs)
        y = att @ v

        # merge heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # project back to d_model
        y = self.out_proj(y)
        y = self.dropout(y)
        return y

attn = CausalSelfAttention(config).to(device)
y = attn(out)
print("Attn output:", y.shape)
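
To see what the mask does to the attention weights, the following sketch applies the same `torch.tril` / `masked_fill` / `softmax` steps to a tiny score matrix. The all-zero scores and `T = 4` are made-up values chosen so the result is easy to read.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                    # hypothetical, uniform attention scores
mask = torch.tril(torch.ones(T, T))           # 1 on and below the diagonal, 0 above
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)           # -inf scores become weight 0
print(weights)
```

Row `i` spreads its weight uniformly over positions `0..i` and assigns exactly zero to every later position, so each token can only attend to itself and the past.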


### 4.4 Feedforward (MLP)

In [None]:
class FeedForward(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.d_model, 4 * config.d_model),
            nn.GELU(),
            nn.Linear(4 * config.d_model, config.d_model),
            nn.Dropout(config.dropout)
        )

    def forward(self, x):
        return self.net(x)

ff = FeedForward(config).to(device)
z = ff(y)
print("FF output:", z.shape)


### 4.5 TransformerBlock (LayerNorm + Residual)

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ff = FeedForward(config)

    def forward(self, x):
        # Pre-LN Transformer (stable and common in modern GPTs)
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

block = TransformerBlock(config).to(device)
b = block(out)
print("Block output:", b.shape)
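
For contrast with the Pre-LN ordering used in the block, the sketch below puts it side by side with the Post-LN ordering of the original Transformer, where LayerNorm is applied after the residual addition. The `nn.Linear` sublayer is only a hypothetical stand-in for attention or the MLP, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # hypothetical stand-in for attention or the MLP

x = torch.randn(2, 5, d_model)           # (B, T, C)

# Pre-LN: normalize the sublayer input; the residual path carries x unchanged
pre_ln = x + sublayer(ln(x))

# Post-LN: add the residual first, then normalize the sum
post_ln = ln(x + sublayer(x))

print(pre_ln.shape, post_ln.shape)
```

In the Pre-LN form the input `x` reaches the output through an unnormalized identity path, which is one common explanation for why deep stacks of such blocks tend to train more stably.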


## 5. The Complete GPTMini

In [None]:
class GPTMini(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.emb = TokenAndPositionEmbedding(config)
        self.blocks = nn.Sequential(*[TransformerBlock(config) for _ in range(config.n_layers)])
        self.ln_f = nn.LayerNorm(config.d_model)
        self.head = nn.Linear(config.d_model, config.vocab_size)

    def forward(self, idx, targets=None):
        x = self.emb(idx)              # (B, T, C)
        x = self.blocks(x)             # (B, T, C)
        x = self.ln_f(x)               # (B, T, C)
        logits = self.head(x)          # (B, T, vocab)

        loss = None
        if targets is not None:
            # only the last position is used to predict the token that follows the context
            logits_last = logits[:, -1, :]      # (B, vocab)
            loss = F.cross_entropy(logits_last, targets)
        return logits, loss

model = GPTMini(config).to(device)

logits, loss = model(X[:4].to(device), Y[:4].to(device))
print("Logits:", logits.shape, "Loss:", float(loss))
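
A useful sanity check for the loss printed above: a model that predicts a uniform distribution over V tokens has cross-entropy ln(V). The sketch below verifies this with all-zero logits. It assumes `vocab_size = 11`, the number of distinct words in the toy corpus; the batch size and target ids are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 11                           # assumed: distinct words in the toy corpus
logits = torch.zeros(4, vocab_size)       # uniform prediction for a batch of 4
targets = torch.tensor([0, 3, 7, 10])     # arbitrary target ids
loss = F.cross_entropy(logits, targets)
print(float(loss), math.log(vocab_size))  # both are about 2.398
```

A freshly initialized model is not exactly uniform, so its first loss will only be in the neighborhood of this value; a much larger number usually points to a bug in the targets or the head.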


## 6. Short Training Run

In [None]:
# ============================================================
# Quick didactic training
# ============================================================
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

X_train = X.to(device)
Y_train = Y.to(device)

model.train()
for step in range(300):
    # small mini-batch
    idx = torch.randint(0, X_train.size(0), (16,), device=device)
    xb = X_train[idx]
    yb = Y_train[idx]

    _, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 50 == 0:
        print(f"step {step:03d} | loss {loss.item():.4f}")


## 7. Autoregressive Generation

In [None]:
def encode_text(s):
    return [token_to_id[t] for t in s.lower().split() if t in token_to_id]

def decode_ids(ids):
    return " ".join(id_to_token[i] for i in ids)


### 7.1 Generation (Greedy + Temperature Sampling)

In [None]:
@torch.no_grad()
def generate(model, start_tokens, max_new_tokens=10, temperature=1.0, do_sample=False):
    model.eval()
    idx = torch.tensor(start_tokens, dtype=torch.long, device=device).unsqueeze(0)  # (1, T)

    for _ in range(max_new_tokens):
        # crop to the last context_size tokens
        idx_cond = idx[:, -config.context_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # (1, vocab)

        probs = F.softmax(logits, dim=-1)

        if do_sample:
            # temperature sampling: draw the next token from the scaled distribution
            next_id = torch.multinomial(probs, num_samples=1)    # (1, 1)
        else:
            # greedy (more predictable); temperature does not change the argmax
            next_id = torch.argmax(probs, dim=-1, keepdim=True)  # (1, 1)

        idx = torch.cat([idx, next_id], dim=1)

    return idx.squeeze(0).tolist()

start = encode_text("o gato subiu")
generated_ids = generate(model, start, max_new_tokens=8)

print("Input :", decode_ids(start))
print("Output:", decode_ids(generated_ids))
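
The division by `temperature` before the softmax is easier to judge on a concrete case. The pure-Python sketch below uses a hypothetical helper, `softmax_with_temperature`, and made-up logits to show how the distribution changes.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (hypothetical helper for illustration)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                             # made-up logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the top token and temperatures above 1 flatten the distribution. The ranking of tokens never changes, so temperature only affects the output when the next token is sampled rather than chosen by argmax.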


## 8. Conclusion

You have just implemented a minimal GPT with:

- Embeddings (token + position)
- Self-attention with a causal mask
- MLP (feedforward)
- Residual + LayerNorm
- Stacked Transformer blocks
- An output head to predict the next token
- Simple training with cross-entropy
- Autoregressive generation

This model is small, but it follows the same structural principles as real GPTs.

The next chapter focuses on **real training**:
- loss over the full sequence
- better batching
- evaluation
- sampling improvements
