
In [1]:
nome = 'Leonardo Augusto da Silva Pacheco'
print(f'My name is {nome}.')

My name is Leonardo Augusto da Silva Pacheco.


# Exercise: Language Model (Bengio 2003) - MLP + Embeddings

In this exercise we will train a simple neural network to predict the next word of a text, given the preceding words as input. This task is called "language modeling".
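
As a rough illustration of the task, the sketch below turns a token sequence into (context, target) training pairs with a sliding window. The function name `make_examples` and the English sample sentence are assumptions made for this example; the notebook builds the same kind of pairs from BERT token ids in `MyDataset`.

```python
def make_examples(tokens, context_size):
    """Return (context, target) pairs.

    Each context holds `context_size` consecutive tokens and the target
    is the token that immediately follows them.
    """
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


words = "the cat sat on the mat".split()
examples = make_examples(words, context_size=3)
for context, target in examples:
    print(context, "->", target)
```

A sequence with `n` tokens yields `n - context_size` examples, so sequences that are not longer than the context contribute nothing.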

This dataset is already fairly large, and you will most likely need a GPU to run your experiments.

Some useful tips:
- **WARNING:** the dataset is quite large. Do not print it.
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Turn the GPU on only once your training loop is working.
- Do not leave this exercise for the last minute. It takes a lot of work.
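
One possible way to follow the debugging tip is to gate the dataset size behind a flag. This is only a sketch: `DEBUG`, `debug_size`, and `select_texts` are hypothetical names that do not appear in the notebook.

```python
DEBUG = True  # hypothetical switch: set to False for the full run


def select_texts(texts, debug, debug_size=100):
    """Keep only the first `debug_size` texts while debugging."""
    return texts[:debug_size] if debug else texts


all_texts = [f"text {i}" for i in range(1000)]  # stand-in for the real corpus
texts_to_use = select_texts(all_texts, debug=DEBUG)
print(len(texts_to_use))
```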

In [2]:
# We use the transformers library to get access to the BERT tokenizer.
!pip install transformers


## Importing the packages

In [3]:
import collections
import itertools
import functools
import math
import random

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm_notebook
from tqdm.auto import tqdm

In [4]:
# Check which GPU we are using
!nvidia-smi

Wed May 18 11:31:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Using {device}')

Using cuda:0


## Implementing MyDataset

In [6]:
from typing import List

def tokenize(text: str, tokenizer):
    return tokenizer(text, return_tensors=None, add_special_tokens=False).input_ids

class MyDataset():
    def __init__(self, texts: List[str], tokenizer, context_size: int):
        self.tokenizer = tokenizer
        self.context_size = context_size
        self.all_tokens = []
        self.text_index = []
        self.first_ngram = []
        self.total_ngrams = 0
        for i, text in enumerate(tqdm(texts, desc = 'Tokenizing')): 
            tokens = torch.LongTensor(tokenize(text, self.tokenizer))
            self.all_tokens.append(tokens)
            qtty = len(tokens) - context_size
            if qtty > 0:
                self.text_index += [i] * qtty
                self.first_ngram.append(self.total_ngrams)
                self.total_ngrams += qtty
            else:
                self.first_ngram.append(-1)

    def __len__(self):
        # Total number of (context, target) examples across all texts.
        return self.total_ngrams

    def __getitem__(self, idx):
        # Find the text that holds example `idx`, then the example's offset within that text.
        text_index = self.text_index[idx]
        ngram_index = idx - self.first_ngram[text_index]
        tokens = self.all_tokens[text_index]
        # The target token at ngram_index + context_size must exist.
        assert 0 <= ngram_index < len(tokens) - self.context_size
        return tokens[ngram_index : ngram_index + self.context_size], tokens[ngram_index + self.context_size]
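
`MyDataset` stores one `text_index` entry per example, which means millions of Python integers for the full corpus. A possible alternative, sketched below under the assumption that only the token count of each text is needed, keeps one cumulative offset per text and finds the owning text with a binary search. The helper names `build_offsets` and `locate` are hypothetical, and this is not the approach used in the notebook.

```python
import bisect
import itertools


def build_offsets(text_lengths, context_size):
    """Cumulative example counts; texts that are too short contribute 0."""
    counts = [max(n - context_size, 0) for n in text_lengths]
    return list(itertools.accumulate(counts))


def locate(idx, offsets):
    """Map a global example index to (text_index, position_within_text)."""
    text_index = bisect.bisect_right(offsets, idx)
    start = offsets[text_index - 1] if text_index > 0 else 0
    return text_index, idx - start


# Token counts 4 and 7 match the two dummy sentences used in the test below;
# the 2-token text is a made-up entry showing how short texts are skipped.
offsets = build_offsets([4, 2, 7], context_size=3)
print(offsets)             # cumulative counts per text
print(locate(0, offsets))  # first example of text 0
print(locate(1, offsets))  # first example of text 2 (text 1 is too short)
```

The total number of examples is `offsets[-1]`, which plays the role of `total_ngrams`.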

## Test whether your MyDataset implementation is correct

In [7]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, context_size=3)
dummy_loader = DataLoader(dummy_dataset, batch_size=6, shuffle=False)
assert len(dummy_dataset) == 5
print('Passed the dataset size assert')

first_batch_input, first_batch_target = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor(
    [[ 3396, 10303,   125],
     [ 1660,  5971,   785],
     [ 5971,   785,   125],
     [  785,   125,  1847],
     [  125,  1847, 13779]])

correct_first_batch_target = torch.LongTensor([13239,   125,  1847, 13779, 15616])

assert torch.equal(first_batch_input, correct_first_batch_input)
print('Passed the input assert')
assert torch.equal(first_batch_target, correct_first_batch_target)
print('Passed the target assert')

Passed the dataset size assert
Passed the input assert
Passed the target assert


In [8]:
dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']
dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, context_size=3)
dummy_loader = DataLoader(dummy_dataset, batch_size=9, shuffle=False)
first_batch_input, first_batch_target = next(iter(dummy_loader))
for i, target in enumerate(first_batch_target):
    # convert_ids_to_tokens returns the WordPiece string for each id.
    context_tokens = tokenizer.convert_ids_to_tokens(first_batch_input[i].tolist())
    print(context_tokens, '->', tokenizer.convert_ids_to_tokens(target.item()))

['Eu', 'gosto', 'de'] -> correr
['Ela', 'gosta', 'muito'] -> de
['gosta', 'muito', 'de'] -> comer
['muito', 'de', 'comer'] -> pi
['de', 'comer', 'pi'] -> ##zza
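
The last example shows that "pizza" is split into two WordPiece tokens, with the continuation piece marked by a `##` prefix. The sketch below shows how such pieces can be glued back into words. `merge_wordpieces` is a hypothetical helper written for this illustration; in practice the tokenizer's own decoding handles this.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a piece starting with '##' continues the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


print(merge_wordpieces(["de", "comer", "pi", "##zza"]))
```

This matters for the language model because it predicts one piece at a time, so a single word may take more than one prediction step.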


## Initializing Neptune

In [9]:
!pip install -U neptune-client


In [11]:
import os

import neptune.new as neptune

# The API token is read from the NEPTUNE_API_TOKEN environment variable
# instead of being hardcoded in the notebook.
run = neptune.init(
    project="leonardo3108/IA025Aula7",
    api_token=os.environ["NEPTUNE_API_TOKEN"],
)

https://app.neptune.ai/leonardo3108/IA025Aula7/e/IA025AULA7-16
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#.stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


## Defining the parameters

In [12]:
params = {
    'context_size': 9,
    'valid_texts': 5000,
    'test_texts': 5000,
    'train_texts': 15000,
    'embedding_dim': 64,
    'hidden_size': 128,
    'batch_size': 2048,
    'num_workers': 2,
    'learning_rate': 3e-5,
    'max_examples': 1_000_000_000,
    'eval_every_steps': 10000
}

## Loading the dataset

We will use a small sample of the [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) dataset to train and evaluate our language model.

In [13]:
!wget -nc https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula7/sample_brwac.txt




In [14]:
# Load datasets
texts = open('sample_brwac.txt').readlines()

print('Truncating for debugging purposes.')
texts = texts[: (params['train_texts'] + params['valid_texts'] + params['test_texts'])]

training_texts = texts[:-(params['valid_texts'] + params['test_texts'])]
valid_texts = texts[-(params['valid_texts'] + params['test_texts']):-params['test_texts']]
test_texts = texts[-params['test_texts']:]

print('Building training dataset.')
training_dataset = MyDataset(texts=training_texts, tokenizer=tokenizer, context_size=params['context_size'])
print('Building validation dataset.')
valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, context_size=params['context_size'])
print('Building test dataset.')
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, context_size=params['context_size'])

Truncating for debugging purposes.
Building training dataset.
Building validation dataset.
Building test dataset.

In [15]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

training examples: 17196588
valid examples: 5119032
test examples: 5609121
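
The train/validation/test split above relies on negative slice indices. The sketch below replays the same three slices on a short list so the boundaries are easy to check. `split_texts` is a hypothetical wrapper added for the illustration, and the list of integers stands in for the lines of text.

```python
def split_texts(texts, n_valid, n_test):
    """Split a list into train / validation / test using the notebook's slicing scheme."""
    train = texts[:-(n_valid + n_test)]
    valid = texts[-(n_valid + n_test):-n_test]
    test = texts[-n_test:]
    return train, valid, test


train, valid, test = split_texts(list(range(10)), n_valid=2, n_test=3)
print(train, valid, test)
```

Note that this scheme assumes `n_test > 0`: with `n_test=0`, `texts[-0:]` is the whole list and the validation slice comes out empty.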


In [16]:
from torch.nn import Module, Embedding, Flatten, Linear, ReLU

# Based on https://abhinavcreed13.github.io/blog/bengio-trigram-nplm-using-pytorch/, a tip from Marcus Borela

class LanguageModel(Module):

    def __init__(self, vocab_size, context_size, embedding_dim, hidden_size):
        """
        Implements the Neural Language Model proposed by Bengio et al.

        Args:
            vocab_size (int): Size of the input vocabulary.
            context_size (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            hidden_size (int): Size of the hidden layer.
        """
        # Write your code here.
        super(LanguageModel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

        self.embeddings = Embedding(self.vocab_size, self.embedding_dim)     
        self.flatten = nn.Flatten()
        self.linear1 = Linear(self.context_size * self.embedding_dim, self.hidden_size, bias=True)
        self.relu = ReLU()
        self.linear2 = Linear(self.hidden_size, self.vocab_size, bias = False)
        #self.direct  = Linear(self.context_size * self.embedding_dim, self.vocab_size, bias=True)

    def forward(self, inputs):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, context_size)
        """
        # Write your code here.
        c = self.embeddings(inputs)  # embedding look-up, C(w): (batch_size, context_size, embedding_dim)
        c = self.flatten(c)          # concatenate the context embeddings: (batch_size, context_size * embedding_dim)
        out = self.linear1(c)        # first linear transformation (with bias)
        #out = torch.tanh(out)        # h = tanh(W1.C(w) + b)
        out = self.relu(out)         # h = ReLU(W1.C(w) + b)
        out = self.linear2(out)      # second linear transformation (no bias)
        #out += self.direct(c)        # optional direct connection from the embeddings to the output (disabled)
        return out
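
To make the data flow of `forward` easier to follow, the sketch below traces the tensor shape after each layer without running any model. `trace_shapes` is a hypothetical helper. The sizes come from `params`, and 29794 is the vocabulary size the notebook reports for this tokenizer.

```python
def trace_shapes(batch_size, context_size, embedding_dim, hidden_size, vocab_size):
    """Return the expected tensor shape after each stage of the forward pass."""
    return [
        ("inputs", (batch_size, context_size)),
        ("embeddings", (batch_size, context_size, embedding_dim)),
        ("flatten", (batch_size, context_size * embedding_dim)),
        ("linear1 + ReLU", (batch_size, hidden_size)),
        ("linear2 (logits)", (batch_size, vocab_size)),
    ]


shapes = trace_shapes(batch_size=1, context_size=9, embedding_dim=64,
                      hidden_size=128, vocab_size=29794)
for stage, shape in shapes:
    print(f"{stage:18s} {shape}")
```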

## Test the model with an example

In [17]:
model = LanguageModel(
    vocab_size = tokenizer.vocab_size,
    context_size = params['context_size'],
    embedding_dim = params['embedding_dim'],
    hidden_size = params['hidden_size'],
).to(device)

sample_train, _ = next(iter(DataLoader(training_dataset)))
sample_train_gpu = sample_train.to(device)
model(sample_train_gpu).shape

torch.Size([1, 29794])

In [18]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of model parameters: {num_params}')

Number of model parameters: 5794304
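
The reported parameter count can be checked by hand from the layer definitions: the embedding table, the first linear layer with bias, and the second linear layer without bias. The sketch below redoes that arithmetic with the values from `params` and the vocabulary size of 29794. `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, context_size, embedding_dim, hidden_size):
    """Parameter count for embedding + linear1 (with bias) + linear2 (no bias)."""
    embedding = vocab_size * embedding_dim
    linear1 = context_size * embedding_dim * hidden_size + hidden_size
    linear2 = hidden_size * vocab_size
    return embedding + linear1 + linear2


total = count_parameters(vocab_size=29794, context_size=9,
                         embedding_dim=64, hidden_size=128)
print(total)  # 5794304
```

About two thirds of the parameters sit in the output layer, because its size scales with the vocabulary.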


## Perplexity assert


In [19]:
from torch import exp
import torch.nn.functional as F

random.seed(123)
np.random.seed(123)
torch.manual_seed(123)


def perplexity(logits, target):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, vocab_size)
        target: a LongTensor of shape (batch_size,)

    Returns:
        A float corresponding to the perplexity.
    """
    # Perplexity is the exponential of the mean cross-entropy (in nats).
    loss = F.cross_entropy(logits, target)
    return exp(loss).item()

n_examples = 1000

sample_train, target_token_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
sample_train_gpu = sample_train.to(device)
target_token_ids = target_token_ids.to(device)
logits = model(sample_train_gpu)

my_perplexity = perplexity(logits=logits, target=target_token_ids)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=2000)
print('Passed the perplexity assert')

my perplexity:              30710
correct initial perplexity: 29794
Passed the perplexity assert
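
The assert compares the initial perplexity with the vocabulary size because an untrained model is close to a uniform guess over the vocabulary, and a uniform distribution over V tokens has perplexity exactly V. The sketch below checks this with plain Python. `perplexity_from_probs` is a hypothetical helper that takes the probability assigned to each correct token.

```python
import math


def perplexity_from_probs(target_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


vocab_size = 29794
uniform_ppl = perplexity_from_probs([1 / vocab_size] * 1000)
print(round(uniform_ppl))
```

A randomly initialized network is only approximately uniform, which is why the assert allows a tolerance instead of requiring an exact match.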


## Training and validation loop

In [20]:
from copy import deepcopy

model = LanguageModel(
    vocab_size = tokenizer.vocab_size,
    context_size = params['context_size'],
    embedding_dim = params['embedding_dim'],
    hidden_size = params['hidden_size'],
).to(device)

run['sys/tags'].add([f"model:LanguageModelBengio"])
run['parameters'] = params

train_loader = DataLoader(training_dataset, batch_size=params['batch_size'], shuffle=True, drop_last=True, num_workers=params['num_workers'])
validation_loader = DataLoader(valid_dataset, batch_size=params['batch_size'], num_workers=params['num_workers'])

lr=params['learning_rate']
optimizer = torch.optim.Adam(model.parameters(), lr)

def train_step(input, target):
    model.train()
    model.zero_grad()

    logits = model(input.to(device))
    loss = F.cross_entropy(logits, target.to(device))
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input, target):
    logits = model(input)
    loss = F.cross_entropy(logits, target)
    return loss.item()


n_examples = 0
step = 0
early_points = ''
min_ppl = 999999
while n_examples < params['max_examples'] and early_points != '*****':
    train_losses = []
    for input, target in train_loader:
        train_loss = train_step(input.to(device), target.to(device)) 
        train_losses.append(train_loss)
        run['train/batch_loss'].log(train_loss)

        
        if step % params['eval_every_steps'] == 0:
            train_ppl = np.exp(np.average(train_losses))
            run['train/perplexity'].log(train_ppl)

            valid_losses = []
            with torch.no_grad():
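                for valid_input, valid_target in validation_loader:
                    # Separate names keep the training batch `input` intact for the n_examples update below.
                    valid_loss = validation_step(valid_input.to(device), valid_target.to(device))
                    valid_losses.append(valid_loss)
                    run['valid/batch_loss'].log(valid_loss)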
                valid_ppl = np.exp(np.average(valid_losses))
                run['valid/perplexity'].log(valid_ppl)

            if min_ppl <= valid_ppl:
                early_points += '*'
                if early_points == '*****':
                    print('Early stop!')
                    break
            else:
                early_points = ''
                min_ppl = valid_ppl
                best_model = deepcopy(model.state_dict())
                torch.save(best_model, 'best_model.pth')

            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}', early_points)

        n_examples += len(input)  # Increment of batch size
        step += 1
        if n_examples >= params['max_examples']:
            break

model.load_state_dict(best_model)            

0 steps; 0 examples so far; train ppl: 30283.01, valid ppl: 30565.24 
10000 steps; 20479032 examples so far; train ppl: 1302.08, valid ppl: 1264.32 
20000 steps; 40958064 examples so far; train ppl: 891.50, valid ppl: 866.41 
30000 steps; 61437096 examples so far; train ppl: 696.05, valid ppl: 684.56 
40000 steps; 81916128 examples so far; train ppl: 584.09, valid ppl: 579.80 
50000 steps; 102395160 examples so far; train ppl: 508.17, valid ppl: 508.11 
60000 steps; 122874192 examples so far; train ppl: 429.32, valid ppl: 456.22 
70000 steps; 143353224 examples so far; train ppl: 392.59, valid ppl: 417.27 
80000 steps; 163832256 examples so far; train ppl: 363.07, valid ppl: 387.35 
90000 steps; 184311288 examples so far; train ppl: 340.32, valid ppl: 363.84 
100000 steps; 204790320 examples so far; train ppl: 322.10, valid ppl: 345.17 
110000 steps; 225269352 examples so far; train ppl: 299.11, valid ppl: 329.94 
120000 steps; 245748384 examples so far; train ppl: 287.08, valid ppl: 3

<All keys matched successfully>
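
The loop above stops after five consecutive evaluations without an improvement in validation perplexity, tracking the count as a string of `*` characters. The same rule can be sketched as a small counter-based class. `EarlyStopper` and `patience` are hypothetical names, and the perplexity values are made up for the demonstration.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, value):
        """Record a new validation value; return True when training should stop."""
        if value < self.best:
            self.best = value
            self.bad_evals = 0  # improvement: reset the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=5)
history = [10.0, 9.0, 9.5, 9.2, 9.1, 9.3, 9.4]  # made-up validation perplexities
stop_at = next(i for i, ppl in enumerate(history) if stopper.update(ppl))
print(stop_at, stopper.best)
```

As in the notebook, a value equal to the best one counts as no improvement.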

## Final evaluation on the test dataset


Bonus: the model with the lowest perplexity on the test dataset earns 0.5 points on the final grade.

In [21]:
test_loader = DataLoader(test_dataset, batch_size=params['batch_size'])

with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(input.to(device), target.to(device))
        for input, target in test_loader
    ]))

print(f'test perplexity: {test_ppl}')

test perplexity: 204.28050724808486
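
The test perplexity above is the exponential of a plain average of per-batch losses. When the last batch is smaller than the others, that average weights its examples slightly more than the rest. The sketch below contrasts the plain average with one weighted by batch size. The function names and loss values are made up, and with 2048-example batches the difference is likely negligible here.

```python
def mean_of_batch_means(batches):
    """Plain average of per-batch mean losses (what the cell above computes)."""
    return sum(loss for loss, _ in batches) / len(batches)


def example_weighted_mean(batches):
    """Average weighted by batch size, i.e. the mean loss per example."""
    total = sum(size for _, size in batches)
    return sum(loss * size for loss, size in batches) / total


# (mean loss, batch size) pairs; the tiny last batch is an exaggerated case.
batches = [(5.0, 2048), (5.0, 2048), (7.0, 1)]
print(mean_of_batch_means(batches))
print(example_weighted_mean(batches))
```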


## Test your model with a sentence

Pick a sentence generated by the model that you find interesting.

In [22]:
prompt = 'Eu gosto de comer pizza pois me faz'
max_output_tokens = 10

for _ in range(max_output_tokens):
    input_ids = tokenize(text=prompt, tokenizer=tokenizer)
    input_ids_truncated = input_ids[-params['context_size']:]  # Only the last <context_size> tokens are fed to the model.
    logits = model(torch.LongTensor([input_ids_truncated]).to(device))
    # With argmax, the model's output at each step is the highest-probability token.
    # This is called greedy decoding.
    predicted_id = torch.argmax(logits).item()
    input_ids += [predicted_id]  # Append the chosen token to the input.
    prompt = tokenizer.decode(input_ids)
    print(prompt)

Eu gosto de comer pizza pois me faz com
Eu gosto de comer pizza pois me faz com o
Eu gosto de comer pizza pois me faz com o seu
Eu gosto de comer pizza pois me faz com o seu filho
Eu gosto de comer pizza pois me faz com o seu filho.
Eu gosto de comer pizza pois me faz com o seu filho. O
Eu gosto de comer pizza pois me faz com o seu filho. O que
Eu gosto de comer pizza pois me faz com o seu filho. O que é
Eu gosto de comer pizza pois me faz com o seu filho. O que é um
Eu gosto de comer pizza pois me faz com o seu filho. O que é um dos
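
The generation loop above can be reduced to its core: truncate to the last `context_size` tokens, score the candidates, append the best one, and repeat. The sketch below does this with a toy scoring table in place of the trained network. `greedy_decode`, `toy_model`, and `TABLE` are hypothetical and exist only for this illustration.

```python
def greedy_decode(next_token_scores, prompt, context_size, max_new_tokens):
    """Repeatedly append the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tuple(tokens[-context_size:])  # keep only the most recent tokens
        scores = next_token_scores(context)
        tokens.append(max(scores, key=scores.get))
    return tokens


# Toy stand-in for the model: scores depend only on the last token.
TABLE = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"c": 0.6, "a": 0.4},
    "c": {"a": 0.7, "b": 0.3},
}


def toy_model(context):
    return TABLE[context[-1]]


generated = greedy_decode(toy_model, ["a"], context_size=2, max_new_tokens=4)
print(generated)
```

Because greedy decoding always picks the single most likely token, it is deterministic and can fall into repetitive loops, as the toy output shows.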
