<a href="https://colab.research.google.com/github/flych3r/IA025_2022S1/blob/main/ex10/matheus_xavier/IA025_A10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
nome = 'Matheus Xavier Sampaio - 220092'
print(f'My name is {nome}')

My name is Matheus Xavier Sampaio - 220092


# Exercise: Language Model with Self-Attention

This exercise is similar to the one from Class 8, but we will now train a neural network with **two** **causal** self-attention layers to predict the next word of a text, given the previous words as input.

We will also work with variable-length sequences.

In the self-attention layer, do not forget to implement:
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Residual connections
- Feed-forward layer (2-layer MLP)
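
As a rough illustration of what causal attention and a residual connection amount to, the plain-Python sketch below runs a single attention head on a toy input. It is not the implementation used in this notebook: it leaves out position embeddings, multiple heads, WO and the feed-forward block, and the function name `causal_self_attention` and the toy matrices are made up for the example.

```python
import math


def causal_self_attention(x, wq, wk, wv):
    """Single-head causal self-attention followed by a residual connection.

    x is a list of token vectors (seq_len x dim); wq, wk and wv are
    dim x dim matrices given as nested lists. Illustrative names only.
    """
    def matvec(w, v):
        return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

    q = [matvec(wq, t) for t in x]
    k = [matvec(wk, t) for t in x]
    v = [matvec(wv, t) for t in x]
    dim = len(x[0])

    out, weights = [], []
    for i in range(len(x)):
        # Causal mask: position i only scores positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps] + [0.0] * (len(x) - i - 1)
        weights.append(probs)
        attended = [sum(probs[j] * v[j][d] for j in range(len(x))) for d in range(dim)]
        # Residual connection: add the layer input back to the attended vector.
        out.append([a + b for a, b in zip(attended, x[i])])
    return out, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(x, identity, identity, identity)
```

The first position can only attend to itself, so its attention weight is 1 and every weight above the diagonal is 0.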


The dataset used in this exercise (BrWaC) is fairly large, and you will need a GPU to run your experiments.

Some useful advice:
- **WARNING:** the dataset is very large. Do not try to print it.
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Only turn on the GPU once your training loop is already working.
- Do not leave this exercise for the last minute. It takes a lot of work.

In [None]:
# we will use the transformers library to get access to the BERT tokenizer.
!pip install transformers

# Importing packages

In [None]:
import collections
import functools
import itertools
import math
import random
from itertools import chain
from typing import List

import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer

In [None]:
# Check which GPU we are using
!nvidia-smi

Mon Jun  6 14:40:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [None]:
if torch.cuda.is_available(): 
    dev = "cuda:0"
else: 
    dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7f1eee530550>

# MyDataset implementation

In [None]:
def tokenize(texts: List[str], tokenizer: AutoTokenizer):
    # Using tokenizer.batch_encode_plus is recommended because it is faster.
    if not texts:
        return []
    return tokenizer.batch_encode_plus(texts, return_tensors=None, add_special_tokens=False).input_ids


class MyDataset():
    def __init__(self, texts: List[str], tokenizer, max_seq_length: int, batch_size: int = 256):
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        self.cls_token = tokenize(['[CLS]'], tokenizer)[0][0]
        self.pad_token = tokenize(['[PAD]'], tokenizer)[0][0]
        self.tokens, self.indexes = self._tokenize_texts(texts, batch_size)

    def __len__(self):
        return len(self.indexes)

    def __getitem__(self, idx):
        tokens_idx, tokens_chunk = self.indexes[idx]
        input = self.tokens[tokens_idx]
        input = input[tokens_chunk * self.max_seq_length:(tokens_chunk + 1) * self.max_seq_length]
        input = self._pad_tokens(input)

        input = torch.LongTensor(input)
        target = input.roll(shifts=-1)
        target[-1] = self.pad_token
        return input, target

    def _tokenize_texts(self, texts, batch_size):
        input_ids = chain(*list((
            tokenize(texts[i * batch_size: (i + 1) * batch_size], self.tokenizer) 
            for i in tqdm(range(math.ceil(len(texts) / batch_size)))
        )))

        tokens = dict()
        indexes = list()
    
        for idx, tkns in enumerate(input_ids):
            tkns = [self.cls_token] + tkns
            tokens[idx] = tkns
            n_chunks = math.ceil(len(tkns) / self.max_seq_length)
            indexes.extend([(idx, i) for i in range(n_chunks)])

        return tokens, indexes
    
    def _pad_tokens(self, tokens):
        return (tokens + ([self.pad_token] * self.max_seq_length))[:self.max_seq_length]
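
To make the chunking and the shifted targets concrete, here is a plain-Python sketch of what `__getitem__` and `_pad_tokens` produce for one tokenized text. The helper name `chunk_and_shift` is hypothetical, and the token ids are the ones that appear in the test cell of this notebook for 'Eu gosto de correr' (with `[CLS]` = 101 and `[PAD]` = 0).

```python
def chunk_and_shift(token_ids, max_seq_length, pad_id=0):
    """Split one tokenized text into fixed-size (input, target) pairs."""
    pairs = []
    for start in range(0, len(token_ids), max_seq_length):
        chunk = token_ids[start:start + max_seq_length]
        # Right-pad the last chunk up to max_seq_length.
        chunk = chunk + [pad_id] * (max_seq_length - len(chunk))
        # The target is the input shifted left by one; the last slot gets a pad.
        target = chunk[1:] + [pad_id]
        pairs.append((chunk, target))
    return pairs


ids = [101, 3396, 10303, 125, 13239]
pairs = chunk_and_shift(ids, max_seq_length=9)
```

As in `MyDataset`, the last target of every chunk is a pad, so the first token of the following chunk is never used as a prediction target.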

## Testing whether the MyDataset implementation is correct

In [None]:
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, max_seq_length=9)
dummy_loader = DataLoader(dummy_dataset, batch_size=6, shuffle=False)
assert len(dummy_dataset) == 2
print('Passed the dataset size assert.')

first_batch_input, first_batch_target = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor([
    [  101,  3396, 10303,   125, 13239,     0,     0,     0,     0],
    [  101,  1660,  5971,   785,   125,  1847, 13779, 15616,     0]
])

correct_first_batch_target = torch.LongTensor([
    [ 3396, 10303,   125, 13239,     0,     0,     0,     0,     0],
    [ 1660,  5971,   785,   125,  1847, 13779, 15616,     0,     0]
])

assert torch.equal(first_batch_input, correct_first_batch_input)
assert torch.equal(first_batch_target, correct_first_batch_target)

print('Passed the dataset assert.')

Downloading:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/647 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Passed the dataset size assert.
Passed the dataset assert.


# Loading the dataset

We will use a small sample of the [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) dataset to train and evaluate our language model.

In [None]:
!wget -nc https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2022-06-06 14:40:59--  https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.193.128, 173.194.214.128, 142.251.107.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.193.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1230909256 (1.1G) [text/plain]
Saving to: ‘sample-1gb.txt’


2022-06-06 14:41:03 (259 MB/s) - ‘sample-1gb.txt’ saved [1230909256/1230909256]



In [None]:
# Load datasets
max_seq_length = 16

train_examples = 100000  # use a much smaller value (e.g. 500) while debugging
valid_examples = 100
test_examples = 100

with open('sample-1gb.txt') as f:
    texts = f.readlines()

print(f'Read {len(texts)} lines.')

max_lines = train_examples + valid_examples + test_examples
print(f'Truncating to {max_lines} lines.')
texts = texts[:max_lines]  

training_texts = texts[:-(valid_examples + test_examples)]
valid_texts = texts[-(valid_examples + test_examples):-test_examples]
test_texts = texts[-test_examples:]

training_dataset = MyDataset(texts=training_texts, tokenizer=tokenizer, max_seq_length=max_seq_length)
valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, max_seq_length=max_seq_length)
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, max_seq_length=max_seq_length)

Read 250000 lines.
Truncating to 100200 lines.


  0%|          | 0/391 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

training examples: 7045971
valid examples: 7887
test examples: 7009


# Language Model

In [26]:
class SequenceWithParams(torch.nn.Sequential):
    def forward(self, input, *args, **kwargs):
        """
        Forward parameters to modules in a sequential container

        Args:
            *args: sequence of arguments
            **kwargs: sequence of named arguments
        """
        for module in self:
            input = module(input, *args, **kwargs)
        return input


class MultiHeadAttentionBlock(torch.nn.Module):
    def __init__(
        self,
        num_heads: int,
        max_seq_length: int,
        dim: int,
        p: float = 0.1,
    ):
        """
        Implements the Multihead-attention

        Args:
            max_seq_length (int): size of the sequence to consider as context for prediction.
            dim (int): Dimension of the embedding layer for each word in the context.
            num_heads (int): number of attention heads
            p (float): dropout rate
        """
        super(MultiHeadAttentionBlock, self).__init__()

        self.max_seq_length = max_seq_length
        self.dim = dim
        self.num_heads = num_heads
        self.p = p

        self.causal_mask = torch.tril(
            torch.ones(self.max_seq_length, self.max_seq_length)
        ).bool()

        self.dropout = torch.nn.Dropout(p=p)

        self.Wq = torch.nn.Linear(
            in_features=self.dim, out_features=self.dim, bias=False
        )
        self.Wk = torch.nn.Linear(
            in_features=self.dim, out_features=self.dim, bias=False
        )
        self.Wv = torch.nn.Linear(
            in_features=self.dim, out_features=self.dim, bias=False
        )

        self.softmax = torch.nn.Softmax(dim=-1)

    def forward(self, inputs: torch.FloatTensor, attn_mask: torch.BoolTensor):
        """
        Args:
            inputs is a FloatTensor of shape (batch_size, max_seq_length, dim)
            attn_mask is a BoolTensor of shape (batch_size, 1, 1, max_seq_length),
                True at non-padding positions

        Returns:
            attention output of shape (batch_size, max_seq_length, dim)
        """
        causal_mask = self.causal_mask.to(inputs.device)

        X = self.dropout(inputs)

        Q = self.Wq(X).reshape(
            -1, self.max_seq_length, self.num_heads, self.dim // self.num_heads
        )
        K = self.Wk(X).reshape(
            -1, self.max_seq_length, self.num_heads, self.dim // self.num_heads
        )
        V = self.Wv(X).reshape(
            -1, self.max_seq_length, self.num_heads, self.dim // self.num_heads
        )
        Q, K, V = Q.transpose(1, 2), K.transpose(1, 2), V.transpose(1, 2)

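        # Note: scores are scaled by sqrt(dim); the usual convention scales by
        # the per-head size, sqrt(dim // num_heads).
        scores = Q @ torch.transpose(K, 2, 3) / math.sqrt(self.dim)
        # Block future positions (causal mask) and padding keys (attn_mask).
        scores = scores.masked_fill(~causal_mask | ~attn_mask, -1e9)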
        probs = self.softmax(scores)

        E = probs @ V
        E = E.transpose(1, 2).contiguous().reshape(-1, self.max_seq_length, self.dim)
        return E


class LanguageModel(torch.nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_seq_length: int,
        dim: int,
        n_layers: int,
        pad_token_id: int,
        hidden_size: int = 768,
        num_heads: int = 4,
        p: float = 0.1,
    ):
        """
        Implements a decoder-only self-attention language model.

        Args:
            vocab_size (int): size of the input vocabulary.
            max_seq_length (int): size of the sequence to consider as context for prediction.
            dim (int): Dimension of the embedding layer for each word in the context.
            n_layers (int): number of self-attention layers.
            pad_token_id (int): id of the pad token that will be ignored in the attention.
            hidden_size (int): size of the mlp block
            num_heads (int): number of attention heads
            p (float): dropout rate
        """
        super(LanguageModel, self).__init__()

        self.vocab_size = vocab_size
        self.max_seq_length = max_seq_length
        self.dim = dim
        self.pad_token_id = pad_token_id

        self._positions = torch.arange(self.max_seq_length, dtype=torch.long).unsqueeze(0)

        self.C = torch.nn.Embedding(
            num_embeddings=self.vocab_size, embedding_dim=self.dim, padding_idx=self.pad_token_id
        )
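        # Note: padding_idx here indexes positions, not tokens. With
        # pad_token_id == 0, the embedding of position 0 is kept at zero and
        # is never updated during training.
        self.P = torch.nn.Embedding(
            num_embeddings=self.max_seq_length, embedding_dim=self.dim, padding_idx=self.pad_token_id
        )

        # Note: the fourth positional parameter of MultiHeadAttentionBlock is the
        # dropout rate `p`. Passing pad_token_id (0 for this tokenizer) therefore
        # disables dropout inside the attention blocks; pass `p` to enable it.
        self.attention = SequenceWithParams(
            *[
                MultiHeadAttentionBlock(
                    num_heads, self.max_seq_length, self.dim, self.pad_token_id
                )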
                for _ in range(n_layers)
            ]
        )
        self.Wo = torch.nn.Linear(
            in_features=self.dim, out_features=self.dim, bias=False
        )

        self.feed_forward = torch.nn.Sequential(
            torch.nn.LayerNorm(self.dim),
            torch.nn.Linear(in_features=self.dim, out_features=hidden_size),
            torch.nn.Dropout(p=p),
            torch.nn.ReLU(),
            torch.nn.LayerNorm(hidden_size),
            torch.nn.Linear(in_features=hidden_size, out_features=hidden_size),
            torch.nn.Dropout(p=p),
            torch.nn.ReLU(),
        )
        self.output = torch.nn.Linear(
            in_features=hidden_size, out_features=self.vocab_size, bias=False
        )

    def forward(self, inputs: torch.LongTensor):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, max_seq_length)

        Returns:
            logits of shape (batch_size, max_seq_length, vocab_size)
        """
        positions = self._positions.repeat(inputs.shape[0], 1).to(inputs.device)
        attn_mask = (inputs != self.pad_token_id)[:, None, None, :]

        X = self.C(inputs) + self.P(positions)
        E = self.attention(X, attn_mask)
        E = self.Wo(E)

        out = self.feed_forward(torch.squeeze(E + X, dim=1))
        logits = self.output(out)

        return logits
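
The attention blocks combine two masks: the causal mask (a query may not look at later positions) and the padding mask built in `LanguageModel.forward` (no query may look at a `[PAD]` key). A small plain-Python sketch of the resulting "allowed" table follows, assuming pad id 0; the helper name `allowed_positions` is made up for the illustration.

```python
def allowed_positions(input_ids, pad_id=0):
    """Return a seq_len x seq_len table: True where query i may attend to key j."""
    n = len(input_ids)
    return [
        # Allowed only if key j is not in the future and is not padding.
        [j <= i and input_ids[j] != pad_id for j in range(n)]
        for i in range(n)
    ]


mask = allowed_positions([101, 3396, 10303, 0, 0])
for row in mask:
    print(''.join('x' if allowed else '.' for allowed in row))
```

In the model, the disallowed entries are the ones filled with `-1e9` before the softmax, so they receive a weight of practically zero.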

## Test the model with an example

In [None]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=max_seq_length,
    dim=64,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

sample_input, _ = next(iter(DataLoader(training_dataset)))
sample_input = torch.stack([sample_input, sample_input]).squeeze(1)
sample_input = sample_input.to(device)
sample_output = model(sample_input)
print(f'sample_input.shape: {sample_input.shape}')
print(f'sample_output.shape: {sample_output.shape}')

sample_input.shape: torch.Size([2, 16])
sample_output.shape: torch.Size([2, 16, 29794])


In [None]:
model

LanguageModel(
  (C): Embedding(29794, 64, padding_idx=0)
  (P): Embedding(16, 64, padding_idx=0)
  (attention): SequenceWithParams(
    (0): MultiHeadAttentionBlock(
      (dropout): Dropout(p=0, inplace=False)
      (Wq): Linear(in_features=64, out_features=64, bias=False)
      (Wk): Linear(in_features=64, out_features=64, bias=False)
      (Wv): Linear(in_features=64, out_features=64, bias=False)
      (softmax): Softmax(dim=-1)
    )
    (1): MultiHeadAttentionBlock(
      (dropout): Dropout(p=0, inplace=False)
      (Wq): Linear(in_features=64, out_features=64, bias=False)
      (Wk): Linear(in_features=64, out_features=64, bias=False)
      (Wv): Linear(in_features=64, out_features=64, bias=False)
      (softmax): Softmax(dim=-1)
    )
  )
  (Wo): Linear(in_features=64, out_features=64, bias=False)
  (feed_forward): Sequential(
    (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=64, out_features=768, bias=True)
    (2): Dropout(p=0.1, inplac

In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of model parameters: {num_params}')

Number of model parameters: 25460480
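
The total above can be reproduced by hand from the layer shapes in the model definition. The sketch below assumes the values used in this notebook (vocabulary of 29794 tokens, `max_seq_length=16`, `dim=64`, `hidden_size=768`, `n_layers=2`); the dictionary labels are only descriptive.

```python
vocab_size, max_seq_length, dim, hidden_size, n_layers = 29794, 16, 64, 768, 2

param_counts = {
    "token embedding C": vocab_size * dim,
    "position embedding P": max_seq_length * dim,
    "Wq, Wk, Wv (all layers)": n_layers * 3 * dim * dim,  # no bias
    "Wo": dim * dim,                                       # no bias
    "LayerNorm(dim)": 2 * dim,                             # weight + bias
    "Linear(dim -> hidden)": dim * hidden_size + hidden_size,
    "LayerNorm(hidden)": 2 * hidden_size,
    "Linear(hidden -> hidden)": hidden_size * hidden_size + hidden_size,
    "output projection": hidden_size * vocab_size,         # no bias
}

total = sum(param_counts.values())
for name, count in param_counts.items():
    print(f"{name:26s} {count:>12,d}  ({100 * count / total:.1f}%)")
print(f"{'total':26s} {total:>12,d}")
```

Most of the parameters sit in the output projection and the token embedding, since both scale with the vocabulary size; the attention layers themselves are comparatively tiny.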


# Perplexity assert


In [None]:
def perplexity(logits, target, ignore_token_id: int):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, seq_length, vocab_size)
        target: a LongTensor of shape (batch_size, seq_length)
        ignore_token_id: target id (the padding token) excluded from the loss

    Returns:
        A scalar tensor holding the perplexity
    """
    logits = logits.reshape(-1, logits.shape[-1])
    target = target.reshape(-1)
    loss = torch.nn.functional.cross_entropy(logits, target, reduction='mean', ignore_index=ignore_token_id)
    return torch.exp(loss)


n_examples = 1000

train_input_ids, train_target_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
train_input_ids = train_input_ids.to(device)
train_target_ids = train_target_ids.to(device)

logits = model(train_input_ids)

my_perplexity = perplexity(logits=logits, target=train_target_ids, ignore_token_id=tokenizer.pad_token_id)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=7000)
print('Passed the perplexity assert')

my perplexity:              30518
correct initial perplexity: 29794
Passed the perplexity assert
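
The reason the expected initial perplexity equals the vocabulary size is that an untrained model is close to a uniform distribution over tokens, and the perplexity of a uniform distribution over V tokens is exactly V. A plain-Python sketch of the same computation, with a made-up helper name and a toy vocabulary of 50 tokens:

```python
import math


def perplexity_from_logits(logits, targets, ignore_id=0):
    """Perplexity = exp(mean negative log-likelihood over non-ignored targets)."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_id:
            continue  # padding targets do not contribute to the loss
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_norm - row[target])
    if not losses:
        return float('nan')
    return math.exp(sum(losses) / len(losses))


vocab_size = 50
uniform_logits = [[0.0] * vocab_size for _ in range(4)]
ppl = perplexity_from_logits(uniform_logits, targets=[3, 7, 0, 12])
print(ppl)
```

Here the third target is the ignored pad id, and the remaining three each contribute a loss of log(50), so the result matches the vocabulary size up to floating-point error.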


# Training and Validation Loop

In [None]:
import wandb
from copy import deepcopy

In [None]:
run_name = 'multiheadattention'

In [None]:
wandb.init(project="language-models", anonymous="allow")
wandb.run.name = f'{run_name}-{wandb.run.name}'

In [None]:
max_examples = 100_000_000
eval_every_steps = 1_000
lr = 3e-4
batch_size = 1024

model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=max_seq_length,
    dim=64,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

train_loader = DataLoader(training_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=batch_size)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = torch.nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = torch.nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()


train_losses = []
n_examples = 0
step = 0

best_ppl = torch.inf
best_model = deepcopy(model.state_dict())

wandb.watch(model, log_freq=100)
pbar = tqdm(total=max_examples)

while n_examples < max_examples:
    for train_input_ids, train_target_ids in train_loader:
        loss = train_step(train_input_ids.to(device), train_target_ids.to(device)) 
        train_losses.append(loss)
        
        if step % eval_every_steps == 0:
            train_loss = np.average(train_losses)
            train_ppl = np.exp(train_loss)

            with torch.no_grad():
                valid_ppl = np.exp(np.average([
                    validation_step(val_input_ids.to(device), val_target_ids.to(device))
                    for val_input_ids, val_target_ids in validation_loader]))

            wandb.log({
                "train/loss": loss,
                "train/perplexity": train_ppl,
                "eval/perplexity": valid_ppl
            }, step=step)
            if valid_ppl < best_ppl:
                best_ppl = valid_ppl
                best_model = deepcopy(model.state_dict())
                torch.save(best_model, 'best_model.pth')
                
                artifact = wandb.Artifact(
                    f'model-{run_name}',
                    type='model',
                    metadata={
                        "step": step,
                        "step_size": batch_size,
                        "train_loss": train_loss,
                        "train_perplexity": train_ppl,
                        "valid_perplexity": valid_ppl
                    }
                )
                artifact.add_file('best_model.pth')
                wandb.run.log_artifact(artifact)
            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}')
            train_losses = []

        n_examples += len(train_input_ids)  # Increment of batch size
        step += 1
        pbar.update(len(train_input_ids))
        if n_examples >= max_examples:
            break

  0%|          | 0/100000000 [00:00<?, ?it/s]

0 steps; 0 examples so far; train ppl: 31600.99, valid ppl: 28904.73
1000 steps; 1024000 examples so far; train ppl: 746.69, valid ppl: 419.11
2000 steps; 2048000 examples so far; train ppl: 374.90, valid ppl: 302.16
3000 steps; 3072000 examples so far; train ppl: 304.84, valid ppl: 259.47
4000 steps; 4096000 examples so far; train ppl: 274.34, valid ppl: 239.16
5000 steps; 5120000 examples so far; train ppl: 255.52, valid ppl: 225.30
6000 steps; 6144000 examples so far; train ppl: 242.11, valid ppl: 213.11
7000 steps; 7168000 examples so far; train ppl: 231.85, valid ppl: 206.01
8000 steps; 8192000 examples so far; train ppl: 220.48, valid ppl: 199.34
9000 steps; 9216000 examples so far; train ppl: 214.64, valid ppl: 193.22
10000 steps; 10240000 examples so far; train ppl: 209.65, valid ppl: 188.82
11000 steps; 11264000 examples so far; train ppl: 204.90, valid ppl: 184.16
12000 steps; 12288000 examples so far; train ppl: 201.05, valid ppl: 179.94
13000 steps; 13312000 examples so far

# Final evaluation on the test dataset


Bonus: the model with the lowest perplexity on the test dataset earns 0.5 points on the final grade.

In [None]:
test_loader = DataLoader(test_dataset, batch_size=64)

with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(test_input_ids.to(device), test_target_ids.to(device))
        for test_input_ids, test_target_ids in test_loader
    ]))

print(f'test perplexity: {test_ppl}')

test perplexity: 125.94721084881104


# Test your model with a sentence

Choose a sentence generated by the model that you find interesting.

In [47]:
max_output_tokens = 20
prompts = ['Eu gosto de comer pizza pois me faz']

model.eval()
for prompt in prompts:
    print(prompt)
    for _ in range(max_output_tokens):
        input_ids = tokenize(texts=[prompt], tokenizer=tokenizer)
        input_ids = input_ids[0]
        # Note: padding is appended on the right, so for prompts shorter than
        # max_seq_length the last position (read below) is a [PAD] slot. The
        # padded ids are also what gets decoded back into `prompt`, which is why
        # [PAD] appears in the generated text.
        pad_length = max(max_seq_length - len(input_ids), 0)
        input_ids = input_ids + ([tokenizer.pad_token_id] * pad_length)
        input_ids_truncated = input_ids[-max_seq_length:]  # Use only the last max_seq_length tokens as model input.
        logits = model(torch.LongTensor([input_ids_truncated]).to(device))
        logits = logits[:, -1, :]  # Use only the last position of the sequence.
        # With argmax, the model's output at each step is the most probable token.
        # This is called greedy decoding.
        predicted_id = torch.argmax(logits).item()
        input_ids += [predicted_id]  # Append the chosen token to the input.
        prompt = tokenizer.decode(input_ids)
        print(prompt)
    print()

Eu gosto de comer pizza pois me faz
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]idade
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade,
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida,
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida, a
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida, a -
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida, a - -
Eu gosto de comer pizza pois me faz [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] idade, a sua vida, a - - -
Eu gosto de comer

In [None]:
wandb.finish()

# Bonus

## Bonus 1
Whoever achieves the lowest perplexity on the test dataset earns 0.5 points on the final average.

## Bonus 2
What is the complexity (in big-O notation) of the text generation function above?

Whoever answers the question above correctly and implements the function with the lowest complexity earns 0.5 points on the final average.
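
One possible direction, offered only as a sketch: the generation loop above re-tokenizes and re-decodes the whole growing text at every step, which adds work proportional to the text length on top of each model call. Keeping the token ids in a list and decoding once at the end avoids that. In the code below, `greedy_generate`, `next_token_logits` and `toy_model` are hypothetical names, and the toy model merely stands in for a real one.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, max_seq_length):
    """Greedy decoding over token ids, without re-tokenizing text at each step.

    `next_token_logits` is a hypothetical callable: it receives a list of at
    most `max_seq_length` token ids and returns the logits for the next token.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-max_seq_length:]  # fixed-size window over the ids
        logits = next_token_logits(context)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
    return ids


# Toy stand-in for a model: always favours (last id + 1) modulo 5.
def toy_model(context):
    return [1.0 if i == (context[-1] + 1) % 5 else 0.0 for i in range(5)]


generated = greedy_generate(toy_model, prompt_ids=[0, 1], max_new_tokens=4, max_seq_length=3)
print(generated)
```

With the `LanguageModel` from this notebook, such a callable would have to right-pad the context to `max_seq_length` and read the logits at the index of the last real token, because the model expects fixed-length inputs. Since the window is bounded by `max_seq_length`, each step then costs one model call of bounded size, and the text only needs to be decoded once after the loop.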