# Chapter 05: Pre-Training and Text Generation

This notebook accompanies Chapter 05 of the series **Fazendo um LLM do Zero**.

In this notebook we train GPTMini on a small corpus so that it learns to predict the next token.

🎯 **Objectives of this notebook:**
- How to compute a probabilistic loss
- How the training loop works
- How to monitor learning
- How to generate text with different strategies
- How to save and load models


## 1. Setup and Configuration

In [None]:
# ============================================================
# Repository setup on Colab
# ============================================================

import os

REPO_URL = "https://github.com/vongrossi/fazendo-um-llm-do-zero.git"
REPO_DIR = "fazendo-um-llm-do-zero"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL}

os.chdir(REPO_DIR)
print("Current directory:", os.getcwd())


### 1.1 Dependencies and GPTMini

In [None]:
!pip -q install -r 05-pre-treinamento/requirements.txt

# Make sure Python can find the project directory
import sys, os
sys.path.append(os.getcwd())

# ============================================================
# Import the GPTMini core (reused from Chapter 04)
# ============================================================
# Series premise: avoid copying and pasting classes between notebooks.
# If you have not created lib/gptmini.py yet, create it before continuing.

try:
    from lib.gptmini import GPTConfig, GPTMini
    print("✅ GPTMini imported from lib/gptmini.py")
except Exception as e:
    raise ImportError(
        "Could not import 'lib.gptmini'.\n"
        "Check that the file exists at: lib/gptmini.py\n"
        "and that you ran the repository clone/cd cell."
    ) from e


### 1.2 GPU Configuration

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import math
import random
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

torch.manual_seed(42)


### 1.3 Reusing the GPTMini from Chapter 04

In this chapter we focus on **training and generation**.

For that, we reuse the GPTMini implementation in `lib/gptmini.py`.
This way, the following chapters (fine-tuning, instruction tuning, etc.) can also reuse the same core.


In [None]:
# (Removed) Old import by notebook path.
# We now use: from lib.gptmini import GPTConfig, GPTMini


## 2. Building the Dataset

In [None]:
text = """
o gato subiu no telhado
o cachorro subiu no sofa
o gato dormiu no sofa
o cachorro dormiu no tapete
o gato pulou no muro
""".strip().lower()

tokens = text.split()
vocab = sorted(set(tokens))

stoi = {t:i for i,t in enumerate(vocab)}
itos = {i:t for t,i in stoi.items()}

encoded = [stoi[t] for t in tokens]


### 2.1 Sliding Window
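Before building tensors, here is a minimal sketch of the sliding-window idea in plain Python. The names `toy_tokens`, `toy_context`, and `toy_pairs` are illustrative assumptions and are not used elsewhere in the notebook. The window is shorter than the notebook's `context_size` only to keep the printout readable.

```python
# Toy illustration of the sliding window (plain Python, no tensors).
toy_tokens = "o gato subiu no telhado o cachorro".split()
toy_context = 3  # assumption: smaller than the notebook's context_size

toy_pairs = []
for i in range(len(toy_tokens) - toy_context):
    win_x = toy_tokens[i : i + toy_context]          # input window
    win_y = toy_tokens[i + 1 : i + toy_context + 1]  # same window shifted by one
    toy_pairs.append((win_x, win_y))

for win_x, win_y in toy_pairs:
    print(win_x, "->", win_y)
```

Each target window is the input window shifted one position to the right, so every position in the context contributes one next-token prediction.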

In [None]:
def build_dataset(token_ids, context_size):
    """Builds (X, Y) pairs for language modeling.

    X: sequence of length T (context_size)
    Y: sequence of length T (the next token at each position)
       E.g.: X = [t0,t1,t2,t3,t4]
             Y = [t1,t2,t3,t4,t5]
    """
    X, Y = [], []
    for i in range(len(token_ids) - context_size):
        x = token_ids[i : i + context_size]
        y = token_ids[i + 1 : i + context_size + 1]
        X.append(x)
        Y.append(y)
    return torch.tensor(X, dtype=torch.long), torch.tensor(Y, dtype=torch.long)

context_size = 5
X, Y = build_dataset(encoded, context_size)

print("X shape:", X.shape, "Y shape:", Y.shape)
print("Example X:", X[0].tolist())
print("Example Y:", Y[0].tolist())


In [None]:
# ============================================================
# Train / validation split (didactic)
# ============================================================
# Premise: we want to see whether the model is really learning
# and avoid assuming it has learned just because the training loss dropped.

N = X.size(0)
perm = torch.randperm(N)

split = int(0.85 * N)
train_idx = perm[:split]
val_idx = perm[split:]

X_train, Y_train = X[train_idx].to(device), Y[train_idx].to(device)
X_val, Y_val = X[val_idx].to(device), Y[val_idx].to(device)

print("Train:", X_train.shape, Y_train.shape)
print("Val  :", X_val.shape, Y_val.shape)


## 3. Creating the Model

In [None]:
config = GPTConfig(
    vocab_size=len(vocab),
    context_size=context_size,
    d_model=64,
    n_heads=4,
    n_layers=2
)

model = GPTMini(config).to(device)


## 4. Understanding Cross Entropy
### 4.1 Manual demonstration of the loss

In [None]:
logits = torch.tensor([[2.0, 0.5, 0.1]])
target = torch.tensor([0])

loss = F.cross_entropy(logits, target)
print(loss)
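
The same value can be reproduced by hand. This is a minimal sketch using only the standard library: cross entropy for a single example is the negative log of the softmax probability assigned to the correct class. The names `raw_logits`, `target_idx`, `manual_probs`, and `manual_loss` are illustrative and chosen so they do not overwrite the notebook's tensors.

```python
import math

raw_logits = [2.0, 0.5, 0.1]  # same values as the tensor in the cell above
target_idx = 0                # index of the correct class

# Softmax: exponentiate each logit and normalize so the values sum to 1
exps = [math.exp(z) for z in raw_logits]
manual_probs = [e / sum(exps) for e in exps]

# Cross entropy for one example: -log(probability of the correct class)
manual_loss = -math.log(manual_probs[target_idx])
print(manual_probs, manual_loss)
```

The result is roughly 0.317, which should match what `F.cross_entropy` prints up to floating-point precision.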


### 4.2 Probabilistic visualization

In [None]:
probs = F.softmax(logits, dim=-1)

plt.bar(range(len(probs[0])), probs[0].cpu())
plt.title("Probability Distribution")
plt.show()


## 5. Training Loop

### 5.1 Optimizer

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)


In [None]:
# ============================================================
# Training (didactic, but with the loss over the whole sequence)
# ============================================================
# The loss is now computed for ALL context positions (B, T),
# which is the most common setup in language modeling.

vocab_size = len(vocab)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

train_loss_history = []
val_loss_history = []

def compute_loss(logits, targets):
    # logits: (B, T, V) | targets: (B, T)
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

@torch.no_grad()
def eval_val_loss():
    model.eval()
    logits, _ = model(X_val)  # (B, T, V)
    loss = compute_loss(logits, Y_val)
    model.train()
    return loss.item()

model.train()

steps = 600
batch_size = 16
eval_every = 50

for step in range(steps):
    idx = torch.randint(0, X_train.size(0), (batch_size,), device=device)
    xb = X_train[idx]
    yb = Y_train[idx]

    logits, _ = model(xb)     # we do not pass targets; the loss is computed here
    loss = compute_loss(logits, yb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_loss_history.append(loss.item())

    if step % eval_every == 0:
        vloss = eval_val_loss()
        val_loss_history.append((step, vloss))
        ppl = math.exp(vloss) if vloss < 20 else float("inf")
        print(f"step {step:03d} | train_loss {loss.item():.4f} | val_loss {vloss:.4f} | val_ppl {ppl:.2f}")
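
To read the printed numbers, it helps to have a reference point. This is a sketch under one assumption: a model that spreads probability uniformly over the vocabulary, which a freshly initialized network often approximates, has a loss of ln(V) and a perplexity of V. The name `toy_vocab_size` is illustrative; 11 is the number of distinct words in this notebook's corpus.

```python
import math

toy_vocab_size = 11  # distinct words in the toy corpus used in this notebook

# A uniform predictor assigns probability 1/V to every token,
# so its cross entropy is -log(1/V) = log(V).
uniform_loss = math.log(toy_vocab_size)

# Perplexity is exp(loss); for the uniform predictor it equals V.
uniform_ppl = math.exp(uniform_loss)

print(f"uniform loss {uniform_loss:.4f} | uniform ppl {uniform_ppl:.2f}")
# prints: uniform loss 2.3979 | uniform ppl 11.00
```

A validation loss well below about 2.40 indicates the model is doing better than guessing uniformly. On a corpus this small, a very low training loss mostly reflects memorization.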


### 5.2 Model Evaluation

Now that the model is trained, let's look at how the loss evolved.

In [None]:
# Plot of the training loss (step by step)
plt.plot(train_loss_history)
plt.title("Training Loss (per step)")
plt.xlabel("Step")
plt.ylabel("Loss")
plt.show()

# Plot of the validation loss (evaluated periodically)
if len(val_loss_history) > 0:
    steps_v, losses_v = zip(*val_loss_history)
    plt.plot(list(steps_v), list(losses_v))
    plt.title("Validation Loss (at each evaluation)")
    plt.xlabel("Step")
    plt.ylabel("Loss")
    plt.show()


In [None]:
def encode_text(s):
    return [stoi[t] for t in s.lower().split() if t in stoi]

def decode(ids):
    return " ".join(itos[int(i)] for i in ids)


In [None]:
@torch.no_grad()
def generate_greedy(start_tokens, max_new_tokens=10):
    model.eval()
    idx = torch.tensor(start_tokens, dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        logits, _ = model(idx_cond)
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)

    model.train()
    return idx.squeeze(0).tolist()


In [None]:
@torch.no_grad()
def generate_temperature(start_tokens, max_new_tokens=10, temperature=1.0):
    model.eval()
    temperature = max(float(temperature), 1e-6)

    idx = torch.tensor(start_tokens, dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        logits, _ = model(idx_cond)

        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)

        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)

    model.train()
    return idx.squeeze(0).tolist()
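
To see what dividing the logits by the temperature does, here is a minimal sketch in plain Python. The helper `softmax_with_temperature` and the values in `example_logits` are illustrative assumptions, not part of `lib/gptmini.py`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 0.5, 0.1]
for temp_value in (0.5, 1.0, 2.0):
    probs_t = softmax_with_temperature(example_logits, temp_value)
    print(temp_value, [round(p, 3) for p in probs_t])
```

Temperatures below 1 sharpen the distribution toward the most likely token, and temperatures above 1 flatten it, so sampling becomes more varied.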


In [None]:
@torch.no_grad()
def generate_top_k(start_tokens, k=5, max_new_tokens=10, temperature=1.0):
    model.eval()
    temperature = max(float(temperature), 1e-6)
    k = int(max(1, k))

    idx = torch.tensor(start_tokens, dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        k_eff = min(k, logits.size(-1))                       # k cannot exceed the vocabulary size
        topk_vals, topk_idx = torch.topk(logits, k_eff, dim=-1)  # (1, k)
        probs = F.softmax(topk_vals, dim=-1)                  # (1, k)

        sample = torch.multinomial(probs, num_samples=1)      # (1, 1), values in [0..k-1]
        next_id = topk_idx.gather(-1, sample)                 # (1, 1), actual token id

        idx = torch.cat([idx, next_id], dim=1)

    model.train()
    return idx.squeeze(0).tolist()
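
The filtering step inside `generate_top_k` can be sketched in plain Python. The helper `top_k_probs` and the values in `toy_logits` are illustrative assumptions. The sketch keeps the `k` largest logits and applies softmax only over those, so every other token gets zero probability.

```python
import math

def top_k_probs(logits, k):
    """Return {token_id: prob} restricted to the k largest logits."""
    k = max(1, min(k, len(logits)))  # clamp k to a valid range
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top_ids)
    exps = {i: math.exp(logits[i] - m) for i in top_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

toy_logits = [2.0, 0.5, 0.1, -1.0]
print(top_k_probs(toy_logits, k=2))  # only token ids 0 and 1 remain
```

With `k=1` this reduces to greedy decoding, and with `k` equal to the vocabulary size it is ordinary sampling.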


In [None]:
@torch.no_grad()
def generate_top_p(start_tokens, p=0.9, max_new_tokens=10, temperature=1.0):
    """Nucleus sampling (top-p): samples from the smallest set of tokens
    whose cumulative probability is >= p.
    """
    model.eval()
    temperature = max(float(temperature), 1e-6)
    p = float(min(max(p, 1e-6), 1.0))

    idx = torch.tensor(start_tokens, dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature

        probs = F.softmax(logits, dim=-1).squeeze(0)  # (V,)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)

        cum_probs = torch.cumsum(sorted_probs, dim=0)
        cutoff = torch.searchsorted(cum_probs, torch.tensor(p, device=device))

        cutoff = int(cutoff.item()) + 1
        candidate_probs = sorted_probs[:cutoff]
        candidate_idx = sorted_idx[:cutoff]

        candidate_probs = candidate_probs / candidate_probs.sum()
        sample = torch.multinomial(candidate_probs, num_samples=1)
        next_id = candidate_idx[sample].view(1, 1)

        idx = torch.cat([idx, next_id], dim=1)

    model.train()
    return idx.squeeze(0).tolist()
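
The nucleus selection can also be sketched in plain Python. The helper `nucleus` and the values in `toy_probs` are illustrative assumptions. The sketch ranks tokens by probability, keeps adding them until the cumulative mass reaches `p`, and renormalizes what is left.

```python
def nucleus(probs, p):
    """Return renormalized {token_id: prob} for the smallest set with cumulative prob >= p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= p:  # stop as soon as the nucleus covers p
            break
    total = sum(prob for _, prob in kept)
    return {token_id: prob / total for token_id, prob in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus(toy_probs, p=0.9))  # keeps token ids 0, 1 and 2
```

Unlike top-k, the number of candidates adapts to the distribution: a confident model yields a small nucleus, and an uncertain one yields a larger set.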


## 6. Checkpoints
### 6.1 Saving the model

In [None]:
# Saving a complete checkpoint (weights + config + vocabulary)
ckpt = {
    "state_dict": model.state_dict(),
    "config": config.__dict__ if hasattr(config, "__dict__") else dict(config),
    "stoi": stoi,
    "itos": itos,
    "context_size": context_size,
}
torch.save(ckpt, "gpt_checkpoint.pt")
print("✅ Checkpoint saved:", "gpt_checkpoint.pt")


### 6.2 Loading the model

In [None]:
# Loading the checkpoint
ckpt = torch.load("gpt_checkpoint.pt", map_location=device)
model.load_state_dict(ckpt["state_dict"])
model.to(device)
model.eval()
print("✅ Checkpoint loaded")


## 7. Before vs. After Comparison

In [None]:
# Generation test (comparing strategies)
start = encode_text("o gato")

print("Input:", decode(start))
print("Greedy         :", decode(generate_greedy(start, max_new_tokens=8)))
print("Temperature 0.8:", decode(generate_temperature(start, max_new_tokens=8, temperature=0.8)))
print("Top-k (k=5)    :", decode(generate_top_k(start, k=5, max_new_tokens=8, temperature=1.0)))
print("Top-p (p=0.9)  :", decode(generate_top_p(start, p=0.9, max_new_tokens=8, temperature=1.0)))


## 8. Conclusion

You have just trained a GPT to model language.

You saw:

- How to compute cross entropy
- How the training loop works
- How to monitor learning
- How to control text generation
- How to save and reuse models

In the next chapter we will explore fine-tuning and model specialization.
