<a href="https://colab.research.google.com/github/unicamp-dl/IA025_2022S1/blob/main/ex09/larissa_santesso/Aula_9_Exerc%C3%ADcio_Larissa_Santesso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
nome = "Larissa Antonelli Santesso"
print(f'My name is {nome}')

My name is Larissa Antonelli Santesso


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Exercise: Language model with self-attention

This exercise is similar to the one from Lesson 8, but we will now train a neural network with **two** **causal** self-attention layers to predict the next word of a text, given the previous words as input.

We will also work with variable-length sequences.

In the self-attention layer, do not forget to implement:
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Residual connections
- Feed-forward layer (2-layer MLP)
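
As a rough illustration of how the four items in the list fit together, here is a minimal single-head sketch. It is not the implementation used later in this notebook: `TinyCausalBlock`, the sizes, and the absence of layer normalization and dropout are all assumptions made to keep the example short.

```python
import math

import torch
import torch.nn as nn


class TinyCausalBlock(nn.Module):
    """Hypothetical single-head block: attention + residual, then MLP + residual."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len, dim = x.shape[1], x.shape[2]
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(1, 2) / math.sqrt(dim)  # (B, L, L)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        x = x + self.w_o(torch.softmax(scores, dim=-1) @ v)  # residual around attention
        return x + self.mlp(x)  # residual around the 2-layer MLP


torch.manual_seed(0)
vocab_size, seq_len, dim = 50, 6, 16  # toy sizes, chosen only for the example
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(seq_len, dim)  # one learned vector per position
block = TinyCausalBlock(dim, hidden=32)

ids = torch.randint(0, vocab_size, (2, seq_len))
x = tok_emb(ids) + pos_emb(torch.arange(seq_len))  # token + position embeddings
out = block(x)
print(out.shape)
```

Because of the causal mask, changing the last token of a sequence leaves the outputs at all earlier positions unchanged, which is a convenient sanity check while debugging.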


The dataset used in this exercise (BrWaC) is fairly large, so you will need to run your experiments on a GPU.

Some useful advice:
- **WARNING:** the dataset is quite large. Do not try to print it.
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Only turn the GPU on once your training loop is already working.
- Do not leave this exercise for the last day. It is laborious.

In [None]:
# we will use the transformers library to get access to the BERT tokenizer.
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalli

## Package imports

In [None]:
import collections
import itertools
import functools
import math
import random

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm_notebook


In [None]:
# Check which GPU we are using
!nvidia-smi

Wed Jun  1 12:44:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


## MyDataset implementation

More about `batch_encode_plus`: https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.tokenization_utils_base.PreTrainedTokenizerBase.batch_encode_plus

In [None]:
from typing import List


def tokenize(text: List[str], tokenizer):
    # Using tokenizer.batch_encode_plus is recommended because it is faster.
    return tokenizer.batch_encode_plus(text, return_tensors='pt', add_special_tokens= False,  padding="longest").input_ids


class MyDataset():
    def __init__(self, texts: List[str], tokenizer, max_seq_length: int, data_type="train", iter_texts=1000, i_init=0):
        path = "/content/gdrive/MyDrive/Colab Notebooks/modelos_Aula09/dataset_02/"
        self.tokens_ids =[]
        for i in range(i_init,len(texts), iter_texts):
            # Idea of saving the data to make better use of RAM, based on the Lesson 08 exercise by student Patrick Ferreira
            try:
                self.load_x = np.load(path+"x_"+ str(i) + "_"+ str(i+iter_texts) +"_" + data_type + ".npy", mmap_mode="r", allow_pickle=True)
                self.tokens_ids.append(torch.tensor(self.load_x))
                print(str(i) + " to "+ str(i+iter_texts)+" lines loaded")

            except Exception as e:
                output_tokenize = tokenize(texts[i:i+iter_texts], tokenizer)
                output_tokenize = torch.nn.functional.pad(output_tokenize, (0,max((max_seq_length-int(output_tokenize.shape[1]),1))))
                self.tokens_ids =[]
                shape_iter = output_tokenize.shape[1]
                for j in range(0,shape_iter-1, max_seq_length-1):    # Slicing idea based on Pedro Gengo's notebook
                    if (j + max_seq_length) < int(shape_iter):
                        batch_seq = output_tokenize[:,j:j+max_seq_length]

                    else:
                        batch_seq = output_tokenize[:,-max_seq_length:]
                    
                    batch_seq = batch_seq[torch.sum(batch_seq, dim=1)!=0]
                    self.tokens_ids.extend(torch.cat([torch.tensor(tokenizer.cls_token_id).long().repeat(batch_seq.shape[0])[:, None],batch_seq], axis=1))
                
                self.tokens_ids = torch.stack(self.tokens_ids)
                
                print(f"Saving: {i} to {i+iter_texts} lines - shape ={self.tokens_ids.shape}")
                np.save(path+"x_"+ str(i) + "_"+ str(i+iter_texts) + "_" + data_type + ".npy", np.array(self.tokens_ids))

        try:
            self.tokens_ids = torch.vstack(self.tokens_ids)
        
        except:
            print("Don't need to stack tensor")
        
        #self.targets_ids = torch.nn.functional.pad(self.tokens_ids[:,1:], (0,max_seq_length+1-self.tokens_ids[:,1:].shape[1]))

    def __len__(self):

        return len(self.tokens_ids)

    def __getitem__(self, idx):
        return self.tokens_ids[idx,:-1].long(), self.tokens_ids[idx,1:].long()

## Testing whether the MyDataset implementation is correct

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/647 [00:00<?, ?B/s]

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, max_seq_length=9)
dummy_loader = DataLoader(dummy_dataset, batch_size=6, shuffle=False)
assert len(dummy_dataset) == 2
print('Passed the dataset size assert.')

first_batch_input, first_batch_target = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor(
    [[  101,  3396, 10303,   125, 13239,     0,     0,     0,     0],
     [  101,  1660,  5971,   785,   125,  1847, 13779, 15616,     0]])

correct_first_batch_target = torch.LongTensor(
    [[ 3396, 10303,   125, 13239,     0,     0,     0,     0,     0],
     [ 1660,  5971,   785,   125,  1847, 13779, 15616,     0,     0]])

assert torch.equal(first_batch_input, correct_first_batch_input)
assert torch.equal(first_batch_target, correct_first_batch_target)

print('Passed the dataset assert.')

Saving: 0 to 1000 lines - shape =torch.Size([2, 10])
Don't need to stack tensor
Passed the dataset size assert.
Passed the dataset assert.


In [None]:
len(dummy_dataset)

2

In [None]:
first_batch_input, correct_first_batch_input

(tensor([[  101,  3396, 10303,   125, 13239,     0,     0,     0,     0],
         [  101,  1660,  5971,   785,   125,  1847, 13779, 15616,     0]]),
 tensor([[  101,  3396, 10303,   125, 13239,     0,     0,     0,     0],
         [  101,  1660,  5971,   785,   125,  1847, 13779, 15616,     0]]))

In [None]:
first_batch_target, correct_first_batch_target

(tensor([[ 3396, 10303,   125, 13239,     0,     0,     0,     0,     0],
         [ 1660,  5971,   785,   125,  1847, 13779, 15616,     0,     0]]),
 tensor([[ 3396, 10303,   125, 13239,     0,     0,     0,     0,     0],
         [ 1660,  5971,   785,   125,  1847, 13779, 15616,     0,     0]]))

In [None]:
dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, max_seq_length=3)
dummy_loader = DataLoader(dummy_dataset, batch_size=1, shuffle=False)

next(iter(dummy_loader))

Saving: 0 to 1000 lines - shape =torch.Size([6, 4])
Don't need to stack tensor


[tensor([[  101,  3396, 10303]]), tensor([[ 3396, 10303,   125]])]

In [None]:
dummy_loader = DataLoader(dummy_dataset, batch_size=10, shuffle=False)

next(iter(dummy_loader))

[tensor([[  101,  3396, 10303],
         [  101,  1660,  5971],
         [  101,   125, 13239],
         [  101,   785,   125],
         [  101,  1847, 13779],
         [  101, 13779, 15616]]), tensor([[ 3396, 10303,   125],
         [ 1660,  5971,   785],
         [  125, 13239,     0],
         [  785,   125,  1847],
         [ 1847, 13779, 15616],
         [13779, 15616,     0]])]
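
The batch above shows how one text becomes several training examples: windows of `max_seq_length` tokens are taken with a stride of `max_seq_length - 1`, a `[CLS]` id (101) is prepended to each window, and the target is the input shifted by one position. The following pure-Python sketch reproduces that idea for a single text. `make_windows` is a hypothetical helper, and it simplifies the real class: it pads the last window instead of re-slicing from the end, and it ignores batch-level padding.

```python
from typing import List, Tuple


def make_windows(token_ids: List[int], max_seq_length: int,
                 cls_id: int = 101, pad_id: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split one tokenized text into (input, target) pairs (hypothetical helper)."""
    examples = []
    stride = max_seq_length - 1  # consecutive windows overlap by one token
    for start in range(0, max(len(token_ids) - 1, 1), stride):
        window = token_ids[start:start + max_seq_length]
        window = window + [pad_id] * (max_seq_length - len(window))  # pad a short tail
        full = [cls_id] + window
        examples.append((full[:-1], full[1:]))  # target is the input shifted by one
    return examples


# Token ids of 'Eu gosto de correr', taken from the expected batch shown earlier.
pairs = make_windows([3396, 10303, 125, 13239], max_seq_length=3)
for inp, tgt in pairs:
    print(inp, tgt)
```

The two pairs printed here correspond to the first and third rows of the batch above.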

# Loading the dataset

We will use a small sample of the [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) dataset to train and evaluate our language model.

In [None]:
!wget -nc https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt

--2022-06-02 01:07:14--  https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.10.128, 142.251.12.128, 142.250.4.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.10.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1230909256 (1.1G) [text/plain]
Saving to: ‘sample-1gb.txt’


2022-06-02 01:07:19 (253 MB/s) - ‘sample-1gb.txt’ saved [1230909256/1230909256]



## Saving the dataset to a Drive folder

In [None]:
# Load datasets
max_seq_length = 9

train_examples = 500
valid_examples = 100
test_examples = 100

texts = open('sample-1gb.txt').readlines()

print(f'Read {len(texts)} lines.')

#max_lines = train_examples + valid_examples + test_examples
#print(f'Truncating to {max_lines} lines.')
#texts = texts[:max_lines]  

training_texts = texts[:-(valid_examples + test_examples)]
valid_texts = texts[-(valid_examples + test_examples):-test_examples]
test_texts = texts[-test_examples:]

training_dataset = MyDataset(texts=training_texts, tokenizer=tokenizer, max_seq_length=max_seq_length)
valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="val", iter_texts=100)
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="test", iter_texts=100)

Read 250000 lines.
Saving: 0 to 1000 lines - shape =torch.Size([146206, 10])
Saving: 1000 to 2000 lines - shape =torch.Size([141545, 10])
Saving: 2000 to 3000 lines - shape =torch.Size([163703, 10])
Saving: 3000 to 4000 lines - shape =torch.Size([155262, 10])
Saving: 4000 to 5000 lines - shape =torch.Size([129348, 10])
Saving: 5000 to 6000 lines - shape =torch.Size([155825, 10])
Saving: 6000 to 7000 lines - shape =torch.Size([149484, 10])
Saving: 7000 to 8000 lines - shape =torch.Size([125598, 10])
Saving: 8000 to 9000 lines - shape =torch.Size([163068, 10])
Saving: 9000 to 10000 lines - shape =torch.Size([155652, 10])
Saving: 10000 to 11000 lines - shape =torch.Size([142067, 10])
Saving: 11000 to 12000 lines - shape =torch.Size([134900, 10])
Saving: 12000 to 13000 lines - shape =torch.Size([134037, 10])
Saving: 13000 to 14000 lines - shape =torch.Size([126941, 10])
Saving: 14000 to 15000 lines - shape =torch.Size([149434, 10])
Saving: 15000 to 16000 lines - shape =torch.Size([133529, 

KeyboardInterrupt: ignored

In [None]:
# Load datasets

valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="val", iter_texts=100)
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="test", iter_texts=100)

Saving: 0 to 100 lines - shape =torch.Size([16597, 10])
Don't need to stack tensor
Saving: 0 to 100 lines - shape =torch.Size([9337, 10])
Don't need to stack tensor
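
The split used in the cells above takes the validation and test sets from the end of the file and leaves everything before them for training. A small sketch of that slicing on a toy list, assuming a hypothetical helper named `split_tail`:

```python
from typing import List, Sequence, Tuple


def split_tail(texts: Sequence, n_valid: int, n_test: int) -> Tuple[List, List, List]:
    """Hypothetical helper: last n_test items are test, the n_valid before them are validation."""
    train = list(texts[:-(n_valid + n_test)])
    valid = list(texts[-(n_valid + n_test):-n_test])
    test = list(texts[-n_test:])
    return train, valid, test


train, valid, test = split_tail(list(range(10)), n_valid=2, n_test=3)
print(train, valid, test)
```

The three slices are disjoint and together cover the whole list, so no line is shared between training and evaluation.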


## Loading the dataset saved in the Drive folder
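
`MyDataset` caches each tokenized chunk as a `.npy` file and reloads it on later runs. The sketch below shows the same load-or-build pattern in isolation. It assumes a hypothetical helper `load_or_build`, a temporary directory in place of the Drive folder, and an explicit existence check instead of the `try`/`except` used in the class.

```python
import os
import tempfile

import numpy as np


def load_or_build(path, build_fn):
    """Return (array, was_cached). `load_or_build` is a hypothetical helper."""
    if os.path.exists(path):
        return np.load(path), True
    array = build_fn()
    np.save(path, array)  # path already ends in .npy, so no suffix is added
    return array, False


cache_dir = tempfile.mkdtemp()  # stands in for the Drive folder
chunk_path = os.path.join(cache_dir, "x_0_1000_train.npy")

# First call builds and saves the chunk; the second call finds it on disk.
first, first_cached = load_or_build(chunk_path, lambda: np.arange(20).reshape(2, 10))
second, second_cached = load_or_build(chunk_path, lambda: np.zeros((2, 10)))
print(first_cached, second_cached)
```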

In [None]:
# Load datasets
max_seq_length = 9

train_examples = 500
valid_examples = 100
test_examples = 100

texts = open('sample-1gb.txt').readlines()

print(f'Read {len(texts)} lines.')

#max_lines = train_examples + valid_examples + test_examples
#print(f'Truncating to {max_lines} lines.')
#texts = texts[:max_lines]  

training_texts = texts[:-(valid_examples + test_examples)]
valid_texts = texts[-(valid_examples + test_examples):-test_examples]
test_texts = texts[-test_examples:]

training_dataset = MyDataset(texts=training_texts[:200000], tokenizer=tokenizer, max_seq_length=max_seq_length)
valid_dataset = MyDataset(texts=valid_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="val", iter_texts=valid_examples)
test_dataset = MyDataset(texts=test_texts, tokenizer=tokenizer, max_seq_length=max_seq_length, data_type="test", iter_texts=test_examples)

Read 250000 lines.
0 to 1000 lines loaded
1000 to 2000 lines loaded
2000 to 3000 lines loaded
3000 to 4000 lines loaded
4000 to 5000 lines loaded
5000 to 6000 lines loaded
6000 to 7000 lines loaded
7000 to 8000 lines loaded
8000 to 9000 lines loaded
9000 to 10000 lines loaded
10000 to 11000 lines loaded
11000 to 12000 lines loaded
12000 to 13000 lines loaded
13000 to 14000 lines loaded
14000 to 15000 lines loaded
15000 to 16000 lines loaded
16000 to 17000 lines loaded
17000 to 18000 lines loaded
18000 to 19000 lines loaded
19000 to 20000 lines loaded
20000 to 21000 lines loaded
21000 to 22000 lines loaded
22000 to 23000 lines loaded
23000 to 24000 lines loaded
24000 to 25000 lines loaded
25000 to 26000 lines loaded
26000 to 27000 lines loaded
27000 to 28000 lines loaded
28000 to 29000 lines loaded
29000 to 30000 lines loaded
30000 to 31000 lines loaded
31000 to 32000 lines loaded
32000 to 33000 lines loaded
33000 to 34000 lines loaded
34000 to 35000 lines loaded
35000 to 36000 lines lo

In [None]:
training_dataset[0]

(tensor([  101, 20100,  2308,  3074,  1089,   481,   117,   146,  1189]),
 tensor([20100,  2308,  3074,  1089,   481,   117,   146,  1189,   125]))

In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

training examples: 28057186
valid examples: 16597
test examples: 9337


# Model architecture

In [None]:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, dim: int, n_heads: int):
        """
        Implements multi-head self-attention.

        Args:
            dim (int): Dimension of the embedding layer for each word in the context.
            n_heads (int): number of heads.

        The attention mask is not a constructor argument; it is passed to forward().
        """
        super(MultiHeadAttention, self).__init__()
        
        self.dim = dim
        self.n_heads = n_heads

        self.W_q = nn.Linear(dim, dim, bias = False)    # shape = (D, D)
        self.W_k = nn.Linear(dim, dim, bias = False)    # shape = (D, D)   
        self.W_v = nn.Linear(dim, dim, bias = False)    # shape = (D, D)
        self.W_o = nn.Linear(dim, dim, bias = False)    # shape = (D, D)

    def attention(self, q, k, v, mask=None):
        scores = torch.bmm(q, k.transpose(1,2)) #shape: (B*H, L, D/H) * (B*H, D/H, L) = (B*H, L, L)
        scores = scores/math.sqrt(self.dim) # scale by 1/sqrt(D), the full model dimension (the usual choice is the per-head dimension D/H)

        if mask is not None:
            # reshape the scores to (B, H, L, L) to apply the mask
            scores = scores.view(self.batch_size, self.n_heads, self.context_size, self.context_size).masked_fill(mask == 0, float('-inf'))

        # reshape the scores back to (B*H, L, L) for the remaining computation
        scores = scores.view(self.batch_size*self.n_heads, self.context_size, self.context_size)
        
        probs = torch.nn.functional.softmax(scores, dim=-1) # shape:   (B*H, L, L)
        out = torch.bmm(probs, v).view(self.batch_size, self.n_heads, self.context_size, self.dim//self.n_heads) # shape:   (B, H, L, D/H)

        return out

    def forward(self, inputs, mask):
        self.batch_size = inputs.shape[0]   # shape = B
        self.context_size = inputs.shape[1] # shape = L

        q = self.W_q(inputs).reshape(self.batch_size, self.context_size, self.n_heads, self.dim//self.n_heads)  # shape = (B, L, H, D/H)
        k = self.W_k(inputs).reshape(self.batch_size, self.context_size, self.n_heads, self.dim//self.n_heads)  # shape = (B, L, H, D/H)
        v = self.W_v(inputs).reshape(self.batch_size, self.context_size, self.n_heads, self.dim//self.n_heads)  # shape = (B, L, H, D/H)

        # Move the head axis next to the batch axis, (B, H, L, D/H), then merge both into (B*H, L, D/H)
        q = q.transpose(1,2).contiguous().view(int(self.batch_size*self.n_heads), self.context_size, self.dim//self.n_heads) # shape = (B*H, L, D/H) 
        k = k.transpose(1,2).contiguous().view(int(self.batch_size*self.n_heads), self.context_size, self.dim//self.n_heads) # shape = (B*H, L, D/H) 
        v = v.transpose(1,2).contiguous().view(int(self.batch_size*self.n_heads), self.context_size, self.dim//self.n_heads) # shape = (B*H, L, D/H) 

        E = self.attention(q, k, v, mask)  # shape = (B, H, L, D/H)
        E = E.transpose(1,2).contiguous() # shape = (B, L, H, D/H)
        
        E = E.reshape(self.batch_size, self.context_size, self.dim) # shape = (B, L, D)

        E = self.W_o(E)  # shape = (B, L, D)

        return E
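
The reshapes in `forward` can be checked in isolation. This sketch uses toy sizes (an assumption) to show that splitting `(B, L, D)` into heads, merging the heads into the batch axis for `torch.bmm`, and undoing both steps returns the original tensor.

```python
import torch

B, L, H, D = 2, 5, 4, 8  # toy sizes, chosen only for the example
x = torch.arange(B * L * D, dtype=torch.float32).reshape(B, L, D)

# (B, L, D) -> (B, L, H, D/H) -> (B, H, L, D/H) -> (B*H, L, D/H)
heads = x.reshape(B, L, H, D // H).transpose(1, 2).contiguous().view(B * H, L, D // H)

# Inverse path used after attention: (B*H, L, D/H) -> (B, H, L, D/H) -> (B, L, H, D/H) -> (B, L, D)
restored = heads.view(B, H, L, D // H).transpose(1, 2).contiguous().reshape(B, L, D)

print(heads.shape, torch.equal(restored, x))
```

Row `b * H + h` of `heads` holds head `h` of example `b`, that is, columns `h * D/H` to `(h + 1) * D/H` of `x[b]`.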

In [None]:
from collections import OrderedDict

class LanguageModel(torch.nn.Module):

    def __init__(self, vocab_size: int, max_seq_length: int, dim: int, n_layers: int, pad_token_id: int):
        """
        Implements a decoder-only self-attention language model.

        Args:
            vocab_size (int): Size of the input vocabulary.
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            dim (int): Dimension of the embedding layer for each word in the context.
            n_layers (int): number of self-attention layers.
            pad_token_id (int): id of the pad token that will be ignored in the attention.
        """
        super(LanguageModel, self).__init__()

        self.context_size = max_seq_length
        self.dim = dim
        self.n_heads = 4
        self.pad_token_id = pad_token_id

        # Embedding of the words
        self.embeddings_C = nn.Embedding(vocab_size, self.dim, padding_idx=self.pad_token_id) # Idea of adding padding_idx based on Edmar Rodrigues's notebook

        # Embedding of the words positions
        self.embeddings_P = nn.Embedding(self.context_size, self.dim)

        # Multi-Head Attention
        self.layers_attention = nn.ModuleList([
            MultiHeadAttention(self.dim, self.n_heads)
            for _ in range(n_layers)])

        # Linear layer
        hidden_size = 2048
        self.feed_forward = nn.Sequential(OrderedDict([
            ('dense1', nn.Linear(self.dim, hidden_size)),
            ('relu1',  nn.ReLU()),
            ('dense2', nn.Linear(hidden_size, self.dim)),
            ('drop1',  nn.Dropout(p=0.1))]))

        # Dropout
        self.dropout = nn.Dropout(p=0.1)

        # Output layer
        self.dense = nn.Linear(self.dim, vocab_size, bias = False)


    def forward(self, inputs):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, max_seq_length)
            
        Returns:
            logits of shape (batch_size, max_seq_length, vocab_size)
        """
        embeds = self.embeddings_C(inputs) # embeds shape: (B, L, D)
        # self.embeddings_P.weight shape: (L, D)

        seq_len = inputs.shape[1]
        X = embeds + self.embeddings_P.weight[:seq_len] # X shape: (B, L, D)
        X = self.dropout(X)

        pad_mask = (inputs != self.pad_token_id).unsqueeze(1).unsqueeze(2) # shape: (B, 1, 1, L)
        # Build the causal mask on the same device as the inputs instead of relying on the global `device`.
        causal_mask = torch.tril(torch.ones((seq_len, seq_len), device=inputs.device)).bool() # shape: (L, L)
        mask_padc = pad_mask & causal_mask  # (B, 1, L, L)

        for layer in self.layers_attention:
            # X shape: (B, L, D)
            out = layer(X, mask = mask_padc)
            # F.dropout defaults to training=True, so pass the module flag to disable it in eval mode.
            out = torch.nn.functional.dropout(out, p=0.1, training=self.training)
            out = torch.nn.functional.layer_norm((out+X), out.shape[-1:]) # shape = (B, L, D)
            X = out 
 
        out_ff = self.feed_forward(out) 
        out_ff = torch.nn.functional.layer_norm(out+out_ff, out.shape[-1:])
        
        logits = self.dense(out_ff) # logits shape: (B, L, V)

        return logits

## Example of building the padding and "no peek" mask:

In [None]:
t1 = torch.tril(torch.ones((9, 9))).bool()
t1

tensor([[ True, False, False, False, False, False, False, False, False],
        [ True,  True, False, False, False, False, False, False, False],
        [ True,  True,  True, False, False, False, False, False, False],
        [ True,  True,  True,  True, False, False, False, False, False],
        [ True,  True,  True,  True,  True, False, False, False, False],
        [ True,  True,  True,  True,  True,  True, False, False, False],
        [ True,  True,  True,  True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True,  True,  True,  True,  True]])

In [None]:
t2 = torch.tensor([[True, True, True, True, True, True, False, False, False],
                  [True, True, True, True, True, True, True, True, True],
                  [True, True, True, True, True, True, True, True, False], 
                  [True, True, True, True, True, True, True, False, False],
                  [True, True, True, True, False, False,False, False, False]]).unsqueeze(1).unsqueeze(2)
t2

tensor([[[[ True,  True,  True,  True,  True,  True, False, False, False]]],


        [[[ True,  True,  True,  True,  True,  True,  True,  True,  True]]],


        [[[ True,  True,  True,  True,  True,  True,  True,  True, False]]],


        [[[ True,  True,  True,  True,  True,  True,  True, False, False]]],


        [[[ True,  True,  True,  True, False, False, False, False, False]]]])

In [None]:
t1.shape

torch.Size([9, 9])

In [None]:
t2.shape

torch.Size([5, 1, 1, 9])

In [None]:
t1_t2 = t1&t2

In [None]:
(t1_t2).shape

torch.Size([5, 1, 9, 9])
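
To see what the combined mask does to the attention weights, the sketch below applies it to a tensor of equal scores and takes the softmax. The input ids are made up for the example (0 is the pad id, as in the batches shown earlier).

```python
import torch

pad_id = 0
inputs = torch.tensor([[101, 7, 8, 0, 0]])  # one sequence with two pad tokens

pad_mask = (inputs != pad_id).unsqueeze(1).unsqueeze(2)  # (1, 1, 1, 5)
causal_mask = torch.tril(torch.ones(5, 5)).bool()        # (5, 5)
mask = pad_mask & causal_mask                            # broadcasts to (1, 1, 5, 5)

# Equal scores everywhere, so the softmax is uniform over the allowed keys.
scores = torch.zeros(1, 1, 5, 5).masked_fill(mask == 0, float('-inf'))
probs = torch.softmax(scores, dim=-1)
print(probs[0, 0])
```

Each row spreads its probability over the non-pad positions at or before it, and masked positions receive exactly zero weight. No row is fully masked because position 0 always holds the `[CLS]` token.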

## Test the model with an example

In [None]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=max_seq_length,
    dim=256,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

sample_input, _ = next(iter(DataLoader(training_dataset, batch_size=5)))
sample_input = sample_input.to(device)
sample_output = model(sample_input)
print(f'sample_input.shape: {sample_input.shape}')
print(f'sample_output.shape: {sample_output.shape}')

sample_input.shape: torch.Size([5, 9])
sample_output.shape: torch.Size([5, 9, 29794])


In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of model parameters: {num_params}')

Number of model parameters: 16832000
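
The parameter count can be reproduced by hand from the layer shapes defined in `LanguageModel` and `MultiHeadAttention`. The arithmetic below uses only the sizes that appear in this notebook; layer normalization is applied functionally and contributes no parameters.

```python
vocab_size, dim, context, hidden, n_layers = 29794, 256, 9, 2048, 2

token_embedding = vocab_size * dim                       # embeddings_C
position_embedding = context * dim                       # embeddings_P
attention = n_layers * 4 * dim * dim                     # W_q, W_k, W_v, W_o, no bias
feed_forward = (dim * hidden + hidden) + (hidden * dim + dim)  # two Linear layers with bias
output_layer = dim * vocab_size                          # final dense layer, no bias

total = token_embedding + position_embedding + attention + feed_forward + output_layer
print(total)
```

Most of the parameters sit in the token embedding and the output layer, which each hold `vocab_size * dim` weights.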


## Perplexity assert


In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7fd70d5be690>

In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)


def perplexity(logits, target, ignore_token_id: int):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, seq_len, vocab_size)
        target: a LongTensor of shape (batch_size, seq_len)

    Returns:
        A float corresponding to the perplexity
    """
    logits = logits.reshape(-1, logits.shape[-1])
    target = target.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target, reduction='mean', ignore_index=ignore_token_id)
    return torch.exp(loss)


n_examples = 1000

train_input_ids, train_target_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
train_input_ids = train_input_ids.to(device)
train_target_ids = train_target_ids.to(device)

logits = model(train_input_ids)

my_perplexity = perplexity(logits=logits, target=train_target_ids, ignore_token_id=tokenizer.pad_token_id)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=7000)
print('Passed the perplexity assert')

my perplexity:              35754
correct initial perplexity: 29794
Passed the perplexity assert
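
The assert compares the initial perplexity with the vocabulary size because a model that assigns probability 1/V to every token has cross-entropy log V, and exp(log V) = V. A pure-Python sketch of that reasoning, where `perplexity_from_probs` is a hypothetical helper that takes the probability assigned to each correct token:

```python
import math
from typing import List


def perplexity_from_probs(target_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood of the correct tokens (hypothetical helper)."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


vocab_size = 29794
uniform_ppl = perplexity_from_probs([1.0 / vocab_size] * 100)
print(round(uniform_ppl))
```

A randomly initialized network is only approximately uniform, which is why the assert allows a tolerance around the vocabulary size.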


## Training and validation loop

In [None]:
max_examples = 150_000_000
eval_every_steps = 10000
lr = 3e-4
compare=float('inf')

model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=max_seq_length,
    dim=256,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

train_loader = DataLoader(training_dataset, batch_size=512, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=512)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()


train_losses = []
n_examples = 0
step = 0
while n_examples < max_examples:
    for train_input_ids, train_target_ids in train_loader:
        loss = train_step(train_input_ids.to(device), train_target_ids.to(device)) 
        train_losses.append(loss)
        
        if step % eval_every_steps == 0:
            train_ppl = np.exp(np.average(train_losses))

            with torch.no_grad():
                valid_ppl = np.exp(np.average([
                    validation_step(val_input_ids.to(device), val_target_ids.to(device))
                    for val_input_ids, val_target_ids in validation_loader]))

            if valid_ppl<compare:
                compare=valid_ppl
                torch.save(model, "/content/gdrive/MyDrive/Colab Notebooks/modelos_Aula09/"+"model_v7.pt")
                with open("/content/gdrive/MyDrive/Colab Notebooks/modelos_Aula09/"+"valid_ppl_model_v7.txt", 'w') as f:
                    f.write("%s\n" % valid_ppl)
                
            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}')
            train_losses = []

        n_examples += len(train_input_ids)  # Increment of batch size
        step += 1
        if n_examples >= max_examples:
            break

0 steps; 0 examples so far; train ppl: 35009.01, valid ppl: 30210.37
10000 steps; 5120000 examples so far; train ppl: 399.04, valid ppl: 289.10
20000 steps; 10240000 examples so far; train ppl: 252.42, valid ppl: 247.12
30000 steps; 15360000 examples so far; train ppl: 225.62, valid ppl: 228.75
40000 steps; 20480000 examples so far; train ppl: 211.64, valid ppl: 216.81
50000 steps; 25600000 examples so far; train ppl: 202.54, valid ppl: 208.66
60000 steps; 30720000 examples so far; train ppl: 195.10, valid ppl: 203.36
70000 steps; 35840000 examples so far; train ppl: 189.97, valid ppl: 198.47
80000 steps; 40960000 examples so far; train ppl: 186.36, valid ppl: 194.96
90000 steps; 46080000 examples so far; train ppl: 183.23, valid ppl: 192.18
100000 steps; 51200000 examples so far; train ppl: 180.72, valid ppl: 189.18
110000 steps; 56320000 examples so far; train ppl: 178.21, valid ppl: 186.51
120000 steps; 61440000 examples so far; train ppl: 174.87, valid ppl: 184.99
130000 steps; 665
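
The perplexities in this log are computed as the exponential of the average per-batch loss, which is not the same as averaging per-batch perplexities. A small sketch with made-up loss values, where `window_perplexity` is a hypothetical helper:

```python
import math
from typing import List


def window_perplexity(batch_losses: List[float]) -> float:
    """exp(mean loss): the geometric mean of the per-batch perplexities (hypothetical helper)."""
    return math.exp(sum(batch_losses) / len(batch_losses))


# Two made-up batches whose individual perplexities are 400 and 100.
losses = [math.log(400.0), math.log(100.0)]
ppl = window_perplexity(losses)
mean_of_ppls = sum(math.exp(l) for l in losses) / len(losses)
print(ppl, mean_of_ppls)
```

The first value is the geometric mean of the batch perplexities and the second is their arithmetic mean, so the two aggregation choices give different numbers for the same losses.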

In [None]:
with open("/content/gdrive/MyDrive/Colab Notebooks/modelos_Aula09/"+"valid_ppl_model_v7.txt") as f:
    txt = list(f)
    compare = float(txt[-1])

model = torch.load("/content/gdrive/MyDrive/Colab Notebooks/modelos_Aula09/"+"model_v7.pt")
model.to(device)

LanguageModel(
  (embeddings_C): Embedding(29794, 256, padding_idx=0)
  (embeddings_P): Embedding(9, 256)
  (layers_attention): ModuleList(
    (0): MultiHeadAttention(
      (W_q): Linear(in_features=256, out_features=256, bias=False)
      (W_k): Linear(in_features=256, out_features=256, bias=False)
      (W_v): Linear(in_features=256, out_features=256, bias=False)
      (W_o): Linear(in_features=256, out_features=256, bias=False)
    )
    (1): MultiHeadAttention(
      (W_q): Linear(in_features=256, out_features=256, bias=False)
      (W_k): Linear(in_features=256, out_features=256, bias=False)
      (W_v): Linear(in_features=256, out_features=256, bias=False)
      (W_o): Linear(in_features=256, out_features=256, bias=False)
    )
  )
  (feed_forward): Sequential(
    (dense1): Linear(in_features=256, out_features=2048, bias=True)
    (relu1): ReLU()
    (dense2): Linear(in_features=2048, out_features=256, bias=True)
    (drop1): Dropout(p=0.1, inplace=False)
  )
  (dropout): Drop

## Final evaluation on the test dataset


Bonus: the model with the lowest perplexity on the test dataset will earn 0.5 point on the final grade.

In [None]:
test_loader = DataLoader(test_dataset, batch_size=64)

def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()
    
with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(test_input_ids.to(device), test_target_ids.to(device))
        for test_input_ids, test_target_ids in test_loader
    ]))

print(f'test perplexity: {test_ppl}')

test perplexity: 148.8549941960566


## Test your model with a sentence

Choose a sentence generated by the model that you find interesting.

In [None]:
def tokenize(text: str, tokenizer):

    return tokenizer.encode_plus(text, return_tensors=None, add_special_tokens= False).input_ids

In [None]:
prompt = 'Eu gosto de comer pizza pois me faz'
max_output_tokens = 20
model.eval()

for _ in range(max_output_tokens):
    input_ids = tokenize(text=prompt, tokenizer=tokenizer)
    input_ids_truncated = input_ids[-max_seq_length:]  # Use only the last <max_seq_length> tokens as input to the model.
    logits = model(torch.LongTensor([input_ids_truncated]).to(device))
    logits = logits[:, -1, :]  # Use only the last token of the sequence
    # With argmax, the model output at each step is the highest-probability token.
    # This is called greedy decoding.
    predicted_id = torch.argmax(logits).item()
    input_ids += [predicted_id]  # Append the token chosen at this step to the input.
    prompt = tokenizer.decode(input_ids)
    print(prompt)

Eu gosto de comer pizza pois me faz com
Eu gosto de comer pizza pois me faz com que
Eu gosto de comer pizza pois me faz com que a
Eu gosto de comer pizza pois me faz com que a gente
Eu gosto de comer pizza pois me faz com que a gente não
Eu gosto de comer pizza pois me faz com que a gente não tem
Eu gosto de comer pizza pois me faz com que a gente não tem que
Eu gosto de comer pizza pois me faz com que a gente não tem que se
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer.
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer. O
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer. O que
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer. O que é
Eu gosto de comer pizza pois me faz com que a gente não tem que se possa fazer. O q

In [None]:
prompt = "Temos que pensar no futuro e guardar o que aprendemos na"
max_output_tokens = 20
model.eval()

for _ in range(max_output_tokens):
    input_ids = tokenize(text=prompt, tokenizer=tokenizer)
    input_ids_truncated = input_ids[-max_seq_length:]  # Use only the last <max_seq_length> tokens as input to the model.
    logits = model(torch.LongTensor([input_ids_truncated]).to(device))
    logits = logits[:, -1, :]  # Use only the last token of the sequence
    # With argmax, the model output at each step is the highest-probability token.
    # This is called greedy decoding.
    predicted_id = torch.argmax(logits).item()
    input_ids += [predicted_id]  # Append the token chosen at this step to the input.
    prompt = tokenizer.decode(input_ids)
    print(prompt)

Temos que pensar no futuro e guardar o que aprendemos na sua
Temos que pensar no futuro e guardar o que aprendemos na sua vida
Temos que pensar no futuro e guardar o que aprendemos na sua vida.
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o que
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o que mais
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o que mais se
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o que mais se pode
Temos que pensar no futuro e guardar o que aprendemos na sua vida. O que é o que mais se pode se
Temos que pensar no futuro e guardar o que aprendemos na sua vida

## Bonus 1
Whoever achieves the lowest perplexity on the test dataset earns 0.5 point on the final average.

## Bonus 2
What is the complexity (in big-O notation) of the text generation function above?

Whoever answers the question above correctly and reduces the complexity of the function earns 0.5 point on the final average.
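
One way to reason about Bonus 2, offered as a sketch rather than the official answer: the generation loop re-tokenizes and re-decodes the whole growing string at every step. Assuming tokenization and decoding cost time roughly linear in the text length, generating T tokens from a prompt of P tokens costs about O(T·(P+T)) in string handling, on top of one fixed-window forward pass per step. Keeping the token ids in a list and decoding once at the end removes that quadratic term. In the code below, `greedy_generate` and `toy_next_token` are hypothetical names; the toy function stands in for `torch.argmax(model(...)[:, -1, :])`.

```python
from typing import Callable, List


def greedy_generate(prompt_ids: List[int],
                    next_token_fn: Callable[[List[int]], int],
                    max_new_tokens: int,
                    window: int) -> List[int]:
    """Tokenize once outside, then append one id per step using only the last `window` ids."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(next_token_fn(ids[-window:]))  # O(window) work per step
    return ids


def toy_next_token(context: List[int]) -> int:
    # Stand-in for the model: any deterministic function of the context window.
    return (sum(context) + 1) % 10


generated = greedy_generate([1, 2, 3], toy_next_token, max_new_tokens=4, window=2)
print(generated)
```

With a real model, the prompt would be tokenized once before the call and `tokenizer.decode` applied once to the returned ids.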