<a href="https://colab.research.google.com/github/unicamp-dl/IA025_2022S1/blob/main/ex10/Guilherme_Pereira/Aula_10_Guilherme_Pereira.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
nome = 'Guilherme Pereira'
print(f'Meu nome é {nome}')

Meu nome é Guilherme Pereira


# Exercise: Language model with self-attention

This exercise is similar to the one from Lecture 8, but now we will train a neural network with **two causal self-attention layers** to predict the next word of a text, given the previous words as input.

We will also work with variable-length sequences.

In the self-attention layer, do not forget to implement:
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Residual connections
- Feed-forward layer (2-layer MLP)
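
The four components listed above fit together roughly as follows. This is a minimal single-head sketch under stated assumptions, not the solution used later in this notebook: the class name `TinyCausalBlock` and all sizes are illustrative, and padding is ignored for brevity.

```python
import math
import torch
import torch.nn as nn


class TinyCausalBlock(nn.Module):
    """Illustrative single-head causal self-attention block (hypothetical name)."""

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)            # learned position embeddings
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.W_o = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(                         # 2-layer MLP, applied per position
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) token embeddings
        B, L, D = x.shape
        x = x + self.pos(torch.arange(L, device=x.device))    # add position information
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-1, -2) / math.sqrt(D)        # (B, L, L)
        causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float('-inf'))    # hide future positions
        att = torch.softmax(scores, dim=-1) @ v                # (B, L, D)
        x = x + self.W_o(att)                                  # first residual connection
        return x + self.ffn(x)                                 # second residual connection


torch.manual_seed(0)
block = TinyCausalBlock(dim=16, max_len=9)
x = torch.randn(2, 9, 16)
out = block(x)
print(out.shape)
```

In a model with several layers, the position embeddings would normally be added once, before the first block; they live inside the block here only to keep the sketch short.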


The dataset used in this exercise (BrWaC) is fairly large, so you will need a GPU to run your experiments.

Some useful advice:
- **WARNING:** the dataset is very large. Do not print it.
- While debugging, make your dataset very small so that debugging is faster and does not need a GPU. Only turn on the GPU once your training loop is working.
- Do not leave this exercise until the last day. It is laborious.

In [None]:
# We will use the transformers library to get access to the BERT tokenizer.
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Package imports

In [None]:
import collections
import itertools
import functools
import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

import json


In [None]:
# Check which GPU we are using
!nvidia-smi

Thu Jun  9 00:11:41 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [None]:
import psutil

def memory_usage():
    print(f'RAM Total:      {psutil.virtual_memory().total     / (1024)**2:.2f} MB')
    print(f'RAM Disponível: {psutil.virtual_memory().available / (1024)**2:.2f} MB')
    print(f'RAM Usada:      {psutil.virtual_memory().used      / (1024)**2:.2f} MB')
    print(f'Total usada:    {psutil.virtual_memory().percent:.2f} %')

In [None]:
from google.colab import drive

drive.mount("/content/drive")
path = "/content/drive/MyDrive/Guilherme/MODEL_NLP/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## MyDataset implementation

In [None]:
from typing import List
from tqdm.notebook import tqdm


def tokenize(text: str, tokenizer):
    # Using tokenizer.batch_encode_plus is recommended because it is faster.
    return tokenizer(text, return_tensors='pt', add_special_tokens=False, padding=True).input_ids

class MyDataset():
    def __init__(self, texts: List[str], tokenizer, max_seq_length: int, iterador: bool, torch_bool=True):
        # Write your code here.

        if iterador:
            self.ids = []
            for i in tqdm(range(len(texts))):
                self.ids.append(tokenizer.batch_encode_plus([texts[i]], return_tensors=None, add_special_tokens=False).input_ids[0])                
        else:
            self.ids = tokenizer.batch_encode_plus(texts, return_tensors=None, add_special_tokens=False).input_ids

        self.torch_bool     = torch_bool
        self.iterador       = iterador
        self.dict_ids       = {}
        self.tokenizer      = tokenizer
        self.max_seq_length = max_seq_length

        total = 0

        for i in tqdm(range(len(self.ids))):
            n_tokens = len(self.ids[i])
            itera = math.ceil(n_tokens / (self.max_seq_length - 1))  # windows needed for this text
            for j in range(itera):
                self.dict_ids[total + j] = [i,j]

            total += itera


    def text_to_token(self, idx):
        # Map the flat example index to (text index, window index).
        doc_idx, chunk_idx = self.dict_ids[idx]
        ids = self.ids[doc_idx]

        # Windows advance by max_seq_length - 1 tokens because [CLS] takes one input slot.
        stride = self.max_seq_length - 1
        start = chunk_idx * stride

        p_a = ids[start : start + stride]               # input tokens (without [CLS])
        p_b = ids[start : start + self.max_seq_length]  # targets: the input shifted left by one

        pad = self.tokenizer.pad_token_id
        a = [self.tokenizer.cls_token_id] + p_a + [pad] * (stride - len(p_a))
        b = p_b + [pad] * (self.max_seq_length - len(p_b))

        if self.torch_bool:
            return torch.tensor(a), torch.tensor(b)
        return (a, b)


    def __len__(self):      
        return len(self.dict_ids)        

    def __getitem__(self, idx):
        return self.text_to_token(idx)

class JDataset():
    def __init__(self, path):
        # Write your code here.
        self.path = path
        self.dataset_torch = torch.load(path)

    def __len__(self):      
        return len(self.dataset_torch)        

    def __getitem__(self, idx):
        return (self.dataset_torch[idx][0], self.dataset_torch[idx][1])


## Testing whether the MyDataset implementation is correct
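
The windowing that `MyDataset` performs can be illustrated without a tokenizer. The sketch below uses made-up token ids and a hypothetical helper, `make_windows`; the defaults `cls_id=101` and `pad_id=0` mirror the ids that appear in this notebook's test tensors.

```python
def make_windows(ids, max_seq_length, cls_id=101, pad_id=0):
    """Split one token-id list into (input, target) pairs of fixed length."""
    stride = max_seq_length - 1          # [CLS] occupies one input slot
    pairs = []
    for start in range(0, len(ids), stride):
        inp = [cls_id] + ids[start:start + stride]
        tgt = ids[start:start + max_seq_length]   # the input shifted left by one token
        inp += [pad_id] * (max_seq_length - len(inp))
        tgt += [pad_id] * (max_seq_length - len(tgt))
        pairs.append((inp, tgt))
    return pairs


windows = make_windows([11, 12, 13, 14], max_seq_length=3)
for inp, tgt in windows:
    print(inp, tgt)
# [101, 11, 12] [11, 12, 13]
# [101, 13, 14] [13, 14, 0]
```

Each text yields `ceil(len(ids) / (max_seq_length - 1))` windows, which is the same count that `MyDataset.__init__` stores in `dict_ids`.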

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, max_seq_length=9, iterador=True)
dummy_loader = DataLoader(dummy_dataset, batch_size=6, shuffle=False)
assert len(dummy_dataset) == 2
print('Passou no assert de tamanho do dataset.')

first_batch_input, first_batch_target = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor(
    [[  101,  3396, 10303,   125, 13239,     0,     0,     0,     0],
     [  101,  1660,  5971,   785,   125,  1847, 13779, 15616,     0]])

correct_first_batch_target = torch.LongTensor(
    [[ 3396, 10303,   125, 13239,     0,     0,     0,     0,     0],
     [ 1660,  5971,   785,   125,  1847, 13779, 15616,     0,     0]])

print(correct_first_batch_target.dtype)

assert torch.equal(first_batch_input, correct_first_batch_input)
assert torch.equal(first_batch_target, correct_first_batch_target)

print('Passou no assert de dataset.')

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Passou no assert de tamanho do dataset.
torch.int64
Passou no assert de dataset.


In [None]:
def tokenize(text: str, tokenizer):
    # Using tokenizer.batch_encode_plus is recommended because it is faster.
    return tokenizer(text, return_tensors='pt', add_special_tokens=False, padding=True).input_ids

text   = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']
tokens = tokenize(text, tokenizer)

tokens

tensor([[ 3396, 10303,   125, 13239,     0,     0,     0],
        [ 1660,  5971,   785,   125,  1847, 13779, 15616]])

In [None]:
%%time
text = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=text, tokenizer=tokenizer, max_seq_length=3, iterador=True)
dummy_loader  = DataLoader(dummy_dataset, batch_size=8, shuffle=False)

print(next(iter(dummy_loader)))
# print(dummy_dataset.ids)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

[tensor([[  101,  3396, 10303],
        [  101,   125, 13239],
        [  101,  1660,  5971],
        [  101,   785,   125],
        [  101,  1847, 13779],
        [  101, 15616,     0]]), tensor([[ 3396, 10303,   125],
        [  125, 13239,     0],
        [ 1660,  5971,   785],
        [  785,   125,  1847],
        [ 1847, 13779, 15616],
        [15616,     0,     0]])]
CPU times: user 90.6 ms, sys: 3.88 ms, total: 94.5 ms
Wall time: 97.9 ms


In [None]:
%%time
text = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=text, tokenizer=tokenizer, max_seq_length=3, iterador=False)
dummy_loader = DataLoader(dummy_dataset, batch_size=8, shuffle=False)
print(next(iter(dummy_loader)))
# print(dummy_dataset.ids)

  0%|          | 0/2 [00:00<?, ?it/s]

[tensor([[  101,  3396, 10303],
        [  101,   125, 13239],
        [  101,  1660,  5971],
        [  101,   785,   125],
        [  101,  1847, 13779],
        [  101, 15616,     0]]), tensor([[ 3396, 10303,   125],
        [  125, 13239,     0],
        [ 1660,  5971,   785],
        [  785,   125,  1847],
        [ 1847, 13779, 15616],
        [15616,     0,     0]])]
CPU times: user 45.4 ms, sys: 2.25 ms, total: 47.7 ms
Wall time: 53.3 ms


# Loading the dataset

We will use a small sample of the [BrWaC](https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC) dataset to train and evaluate our language model.

In [None]:
!wget -nc https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt

--2022-06-08 23:46:16--  https://storage.googleapis.com/unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.157.128, 142.251.8.128, 74.125.23.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.157.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1230909256 (1.1G) [text/plain]
Saving to: ‘sample-1gb.txt’


2022-06-08 23:46:21 (247 MB/s) - ‘sample-1gb.txt’ saved [1230909256/1230909256]



In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 11609.18 MB
RAM Usada:      1151.71 MB
Total usada:    10.60 %


In [None]:
# Load datasets
max_seq_length = 9

texts = open('sample-1gb.txt').readlines()

print(f'Read {len(texts)} lines.')

train_examples = int(len(texts)*0.70)
valid_examples = int(len(texts)*0.15)
test_examples  = int(len(texts)*0.15)

max_lines = train_examples + valid_examples + test_examples
print(f'Truncating to {max_lines} lines.')
print(f'Train Examples: {train_examples}')
print(f'Valid Examples: {valid_examples}')
print(f'Test Examples:  {test_examples}')
texts = texts[:max_lines]  

Read 250000 lines.
Truncating to 250000 lines.
Train Examples: 175000
Valid Examples: 37500
Test Examples:  37500


In [None]:
print("Memória antes de carregar o Dataset\n")
memory_usage()

valid_dataset    = JDataset(path + 'valid_tensor.pt')
training_dataset = JDataset(path + 'train_tensor.pt')

print("\n\nMemória depois de carregar o Dataset\n")
memory_usage()

Memória antes de carregar o Dataset

RAM Total:      12986.89 MB
RAM Disponível: 11372.80 MB
RAM Usada:      1388.64 MB
Total usada:    12.40 %


Memória depois de carregar o Dataset

RAM Total:      12986.89 MB
RAM Disponível: 7272.23 MB
RAM Usada:      5478.82 MB
Total usada:    44.00 %


In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples:    {len(valid_dataset)}')

training examples: 24475426
valid examples:    5302245
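
A back-of-the-envelope estimate explains the jump in used RAM recorded above (from 1388.64 MB to 5478.82 MB, about 4090 MB). The sketch assumes each example is stored as two `torch.long` tensors of length 9, which is how the saved tensors are built later in this notebook.

```python
n_train = 24_475_426      # training examples reported above
n_valid = 5_302_245       # validation examples reported above
seq_len = 9               # max_seq_length
tensors_per_example = 2   # input ids and target ids
bytes_per_id = 8          # torch.long is a 64-bit integer

total_bytes = (n_train + n_valid) * tensors_per_example * seq_len * bytes_per_id
print(f'{total_bytes / 1024**2:.2f} MB')   # 4089.34 MB
```

Storing the ids as 32-bit integers would halve this figure; whether every downstream call accepts that dtype is an assumption that would need checking.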


In [None]:
valid_dataset[0]

(tensor([  101,  3936,   125,  6822, 21797,   319,   989,   596,   125]),
 tensor([ 3936,   125,  6822, 21797,   319,   989,   596,   125,  6516]))

## RAM usage tests

### Memory usage - Test 1 - All examples

In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 9519.97 MB
RAM Usada:      3275.16 MB
Total usada:    26.70 %


In [None]:
%%time
print('\nLoadind Validation Texts ...')
valid_dataset    = MyDataset(texts=texts[-(valid_examples + test_examples):-test_examples], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=True)


Loadind Validation Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

  0%|          | 0/37500 [00:00<?, ?it/s]

CPU times: user 10min 47s, sys: 4.46 s, total: 10min 52s
Wall time: 10min 53s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8644.61 MB
RAM Usada:      4715.23 MB
Total usada:    33.40 %


In [None]:
%%time
print('\nLoadind Testing Texts ...')
test_dataset     = MyDataset(texts=texts[-test_examples:], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=True)


Loadind Testing Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

  0%|          | 0/37500 [00:00<?, ?it/s]

CPU times: user 10min 44s, sys: 4.65 s, total: 10min 49s
Wall time: 10min 51s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 7657.32 MB
RAM Usada:      6103.94 MB
Total usada:    41.00 %


In [None]:
%%time
print('\nLoadind Training Texts ...')
training_dataset = MyDataset(texts=texts[:-(valid_examples + test_examples)], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=True)


Loadind Training Texts ...


  0%|          | 0/175000 [00:00<?, ?it/s]

  0%|          | 0/175000 [00:00<?, ?it/s]

In [None]:
memory_usage()

In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

### Memory usage - Test 2 - All examples

In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 9569.74 MB
RAM Usada:      3221.76 MB
Total usada:    26.30 %


In [None]:
%%time
print('\nLoadind Validation Texts ...')
valid_dataset    = MyDataset(texts=texts[-(valid_examples + test_examples):-test_examples], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=True)


Loadind Validation Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

  0%|          | 0/37500 [00:00<?, ?it/s]

CPU times: user 11min 19s, sys: 5 s, total: 11min 24s
Wall time: 11min 32s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8678.37 MB
RAM Usada:      4694.93 MB
Total usada:    33.20 %


In [None]:
%%time
print('\nLoadind Testing Texts ...')
test_dataset     = MyDataset(texts=texts[-test_examples:], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=False)


Loadind Testing Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

CPU times: user 10min 12s, sys: 2.35 s, total: 10min 15s
Wall time: 10min 15s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 7713.94 MB
RAM Usada:      6290.00 MB
Total usada:    40.60 %


In [None]:
%%time
print('\nLoadind Training Texts ...')
training_dataset = MyDataset(texts=texts[:-(valid_examples + test_examples)], tokenizer=tokenizer, max_seq_length=max_seq_length, iterador=True)

In [None]:
print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

### Memory usage - Test 3 - All examples

In [None]:
%%time
print('\nLoadind Validation Texts ...')
valid_dataset    = MyDataset(texts=texts[-(valid_examples + test_examples):-test_examples],
                             tokenizer=tokenizer,
                             max_seq_length=max_seq_length,
                             iterador=True,
                             torch_bool=True)


Loadind Validation Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

  0%|          | 0/37500 [00:00<?, ?it/s]

CPU times: user 10min 51s, sys: 4.83 s, total: 10min 56s
Wall time: 10min 58s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8370.49 MB
RAM Usada:      7193.54 MB
Total usada:    35.50 %


In [None]:
%%time
valid_torch = torch.zeros((len(valid_dataset), 2, len(valid_dataset[0][0])), dtype=torch.long)

for i in tqdm(range(len(valid_dataset))):
    a, b = valid_dataset[i]

    valid_torch[i][0] = a
    valid_torch[i][1] = b

print(valid_torch[0])

  0%|          | 0/5302245 [00:00<?, ?it/s]

tensor([[  101,  3936,   125,  6822, 21797,   319,   989,   596,   125],
        [ 3936,   125,  6822, 21797,   319,   989,   596,   125,  6516]])
CPU times: user 2min 21s, sys: 1.13 s, total: 2min 22s
Wall time: 2min 22s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8559.43 MB
RAM Usada:      7004.97 MB
Total usada:    34.10 %


In [None]:
%%time
torch.save(valid_torch, path + 'valid_tensor.pt')

CPU times: user 538 ms, sys: 352 ms, total: 889 ms
Wall time: 3.41 s


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8552.51 MB
RAM Usada:      7001.21 MB
Total usada:    34.10 %


In [None]:
%%time
valid_torch1 = torch.load(path + 'valid_tensor.pt')

print(valid_torch1[0])

tensor([[  101,  3936,   125,  6822, 21797,   319,   989,   596,   125],
        [ 3936,   125,  6822, 21797,   319,   989,   596,   125,  6516]])
CPU times: user 0 ns, sys: 524 ms, total: 524 ms
Wall time: 1.07 s


In [None]:
print(valid_torch1[10])

tensor([[  101,   977, 22280,  3936,  6374, 22303,   230,   576,   325],
        [  977, 22280,  3936,  6374, 22303,   230,   576,   325,  7573]])


In [None]:
memory_usage()

RAM Total:      12986.89 MB
RAM Disponível: 8543.54 MB
RAM Usada:      7010.04 MB
Total usada:    34.20 %


In [None]:
%%time
print('\nLoadind Testing Texts ...')
test_dataset     = MyDataset(texts=texts[-test_examples:],
                             tokenizer=tokenizer,
                             max_seq_length=max_seq_length,
                             iterador=True,
                             torch_bool=True)

test_torch = torch.zeros((len(test_dataset), 2, len(test_dataset[0][0])), dtype=torch.long)

for i in tqdm(range(len(test_dataset))):
    test_torch[i][0], test_torch[i][1] = test_dataset[i]

torch.save(test_torch, path + 'test_tensor.pt')


Loadind Testing Texts ...


  0%|          | 0/37500 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

In [None]:
%%time
print('\nLoadind Training Texts ...')
train_dataset = MyDataset(texts=texts[:-(valid_examples + test_examples)],
                          tokenizer=tokenizer,
                          max_seq_length=max_seq_length,
                          iterador=True,
                          torch_bool=True)

train_torch = torch.zeros((len(train_dataset), 2, len(train_dataset[0][0])), dtype=torch.long)

for i in tqdm(range(len(train_dataset))):
    train_torch[i][0], train_torch[i][1] = train_dataset[i]

torch.save(train_torch, path + 'train_tensor.pt')


Loadind Training Texts ...


  0%|          | 0/175000 [00:00<?, ?it/s]

  0%|          | 0/175000 [00:00<?, ?it/s]

  0%|          | 0/24475426 [00:00<?, ?it/s]

CPU times: user 58min 8s, sys: 28.1 s, total: 58min 36s
Wall time: 58min 23s


## Model

In [None]:
class SelfAttention(nn.Module):
    def __init__(self, dim, max_seq_length, n_heads, pad_token_id):
        super(SelfAttention, self).__init__()

        self.pad_token_id   = pad_token_id
        self.max_seq_length = max_seq_length

        self.n_heads        = n_heads
        self.dim            = dim
        self.D_k            = dim//n_heads
        
        self.W_q = torch.nn.Linear(self.dim, self.dim, bias=False) # D, D
        self.W_k = torch.nn.Linear(self.dim, self.dim, bias=False) # D, D
        self.W_v = torch.nn.Linear(self.dim, self.dim, bias=False) # D, D
        self.W_o = torch.nn.Linear(self.dim, self.dim, bias=False) # D, D

        self.layer_norm = torch.nn.LayerNorm([self.max_seq_length, self.dim], eps=1e-6) # L, D


    def attention(self, Q, K, V, mask, tri_mask):   

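        '''
        Example shapes for B=5, HEADS=2, L=9, D=256; the numbers match
        the "-> n" markers in the comments below.
        1 scores:        torch.Size([5, 2, 9, 9])
        2 mask:          torch.Size([5, 9])
        3 mask_expanded: torch.Size([5, 2, 9, 9])
        4 probs:         torch.Size([5, 2, 9, 9])
        5 probs @ V:     torch.Size([5, 2, 9, 128])
        6 transposed:    torch.Size([5, 9, 2, 128])
        7 reshaped:      torch.Size([5, 9, 256])
        8 after W_o:     torch.Size([5, 9, 256])
        '''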
        
        scores = torch.matmul(Q, K.transpose(-1, -2))/math.sqrt(self.D_k) # B, HEADS, L, L -> 1
        # print(scores)
                            # B, L -> 2
        # print(mask)
        new_mask      = mask[:, None, None, :] & tri_mask
        mask_expanded = new_mask.expand_as(scores)  # B, HEADS, L, L -> 3
        
        scores.masked_fill_(~mask_expanded, float('-inf'))        # B, HEADS, L, L
        # print(scores)
        probs = F.softmax(scores, dim=-1)                         # B, HEADS, L, L -> 4

        E = torch.matmul(probs, V)                                    # B, HEADS, L, D//HEADS -> 5
        E = E.transpose(1,2).contiguous()                             # B, L, HEADS, D//HEADS -> 6
        E = E.reshape(mask.shape[0], self.max_seq_length, self.dim) # B, L, D -> 7
        E = self.W_o(E)                                               # B, L, D -> 8

        return E
        
    def forward(self, x, mask, tri_mask):

        q = self.W_q(x).reshape(mask.shape[0], self.max_seq_length, self.n_heads, self.D_k).transpose(1,2) # B, HEADS, L, D//HEADS -> torch.Size([5, 2, 9, 128])
        k = self.W_k(x).reshape(mask.shape[0], self.max_seq_length, self.n_heads, self.D_k).transpose(1,2) # B, HEADS, L, D//HEADS
        v = self.W_v(x).reshape(mask.shape[0], self.max_seq_length, self.n_heads, self.D_k).transpose(1,2) # B, HEADS, L, D//HEADS

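        x = self.attention(q, k, v, mask, tri_mask) # B, L, D (W_o is already applied inside attention)
        # NOTE: W_o is applied a second time here. Two stacked linear maps with no
        # nonlinearity in between compose into a single linear map, so this call is
        # redundant; the results recorded in this notebook were produced with it.
        x = self.W_o(x)                           # B, L, D

        return self.layer_norm(x)          # B, L, D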
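
How the padding mask and the causal mask combine by broadcasting can be checked on a tiny input. The token ids below are made up; `pad_id = 0` matches the pad id visible in the dataset test.

```python
import torch

pad_id = 0
inputs = torch.tensor([[101, 7, 8, 0]])            # (B=1, L=4), last position is padding
L = inputs.shape[1]

pad_mask = inputs != pad_id                          # (B, L): True for real tokens
tri_mask = torch.tril(torch.ones(L, L)).bool()       # (L, L): True on and below the diagonal

# (B, 1, 1, L) & (L, L) broadcasts to (B, 1, L, L): query i may attend to key j
# only if j <= i and key j is not padding.
combined = pad_mask[:, None, None, :] & tri_mask
print(combined[0, 0].int())
```

Because every input starts with the `[CLS]` token, each query row keeps at least one allowed key, so filling the masked scores with `-inf` does not produce an all-masked row in the softmax.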

In [None]:
class LanguageModel(torch.nn.Module):

    def __init__(self, vocab_size: int, max_seq_length: int, dim: int, n_layers: int, pad_token_id: int):
        """
        Implements a decoder-only language model with self-attention layers.

        Args:
            vocab_size (int): Size of the input vocabulary.
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            dim (int): Dimension of the embedding layer for each word in the context.
            n_layers (int): number of self-attention layers.
            pad_token_id (int): id of the pad token that will be ignored in the attention.
        """
        # Write your code here.

        super().__init__()

        self.vocab_size     = vocab_size
        self.max_seq_length = max_seq_length
        self.dim            = dim
        self.n_layers       = n_layers
        self.pad_token_id   = pad_token_id
        self.n_heads        = 2
        self.tri_mask       = torch.tril(torch.ones((max_seq_length, max_seq_length))).bool().to(device) # (L, L)

        self.embedding_layer       = torch.nn.Embedding(self.vocab_size, self.dim, padding_idx=pad_token_id)
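        # nn.Linear is used only as a parameter container here: its weight has shape (L, D)
        # and is added directly to the token embeddings as a learned position-embedding table.
        self.positional_embeddings = torch.nn.Linear(self.dim, self.max_seq_length, bias=False)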
        # self.embedding_layer = torch.nn.Embedding.from_pretrained(embeddings, padding_idx=padding_idx, freeze=True)

        self.att_layers = nn.ModuleList()

        for i in range(self.n_layers):
            self.att_layers.append(SelfAttention(self.dim, self.max_seq_length, self.n_heads, self.pad_token_id))

        self.feed_forward = torch.nn.Sequential(nn.Linear(self.max_seq_length*self.dim, 8*self.dim),
                                                torch.nn.ReLU(),
                                                nn.Dropout(p=0.2),

                                                nn.Linear(8*self.dim, 4*self.dim),
                                                torch.nn.ReLU(),
                                                nn.Dropout(p=0.2),

                                                nn.Linear(4*self.dim, 2*self.dim),
                                                torch.nn.ReLU(),
                                                nn.Dropout(p=0.2),

                                                nn.Linear(2*self.dim, self.max_seq_length*vocab_size))

    def forward(self, inputs):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, max_seq_length)

        Returns:
            logits of shape (batch_size, max_seq_length, vocab_size)
        """
        # Write your code here.

        x = self.embedding_layer(inputs) + self.positional_embeddings.weight # B, L, D

        mask = inputs != self.pad_token_id 

        for att in self.att_layers:
            x_att = att(x, mask, self.tri_mask)     
            x = x + x_att  # B, L, D

        
                  
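        # NOTE: flattening (B, L, D) into (B, L*D) lets the MLP mix all positions, so the logits
        # for position i can depend on tokens after i, even though the attention layers are causal.
        x = self.feed_forward(x.reshape(len(inputs),-1))  # B, L*V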

        # print('mask', mask[:,:,None].shape)
        # print('x', x.shape)

        logits = x.reshape(x.shape[0], self.max_seq_length, self.vocab_size)  # B, L, V
        # print(logits)
        # print(logits.shape)
        return logits
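
One property worth checking in a causal language model is that the logits at position i do not change when tokens after position i change. The sketch below compares two hypothetical output heads on random activations: one that flattens all positions before a linear layer (the pattern used by `feed_forward` above) and one that applies the same layer to each position independently. All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, L, D, V = 1, 4, 8, 11                       # toy sizes (assumptions)

flat_head = nn.Linear(L * D, L * V)            # sees every position at once
per_pos_head = nn.Linear(D, V)                 # applied to each position separately

x = torch.randn(B, L, D)
x_future = x.clone()
x_future[:, -1] += 1.0                         # perturb only the last position


def flat_logits(h):
    return flat_head(h.reshape(B, -1)).reshape(B, L, V)


# Logits for positions 0..L-2 should be unaffected by a change at position L-1.
flat_changed = not torch.allclose(flat_logits(x)[:, :-1], flat_logits(x_future)[:, :-1])
per_pos_changed = not torch.allclose(per_pos_head(x)[:, :-1], per_pos_head(x_future)[:, :-1])

print('flattened head leaks future positions:', flat_changed)
print('per-position head leaks future positions:', per_pos_changed)
```

With a flattened head, earlier positions can read later input tokens. Since in this dataset the target at position i is the input at position i+1, perplexity measured over all positions may then be optimistic.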
    
    

## Teste o modelo com um exemplo

In [None]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=9,
    dim=256,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

sample_input, _ = next(iter(DataLoader(training_dataset, batch_size=2)))
sample_input = sample_input.to(device)
sample_output = model(sample_input)
print(f'sample_input.shape: {sample_input.shape}')
print(f'sample_output.shape: {sample_output.shape}')

sample_input.shape: torch.Size([2, 9])
sample_output.shape: torch.Size([2, 9, 29794])


In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Number of model parameters: {num_params}')

Number of model parameters: 153065586
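
The parameter count printed above can be reproduced by hand from the layer definitions in `SelfAttention` and `LanguageModel`. The helper `linear` is a hypothetical convenience function introduced only for this check.

```python
V, L, D, n_layers = 29794, 9, 256, 2   # vocab size, max_seq_length, dim, layers


def linear(n_in, n_out, bias=True):
    """Parameter count of an nn.Linear layer."""
    return n_in * n_out + (n_out if bias else 0)


embedding = V * D
positional = linear(D, L, bias=False)
attention = 4 * linear(D, D, bias=False)       # W_q, W_k, W_v, W_o
layer_norm = 2 * L * D                         # weight and bias over [L, D]
feed_forward = (linear(L * D, 8 * D) + linear(8 * D, 4 * D)
                + linear(4 * D, 2 * D) + linear(2 * D, L * V))

total = embedding + positional + n_layers * (attention + layer_norm) + feed_forward
output_layer = linear(2 * D, L * V)
print(total)   # 153065586
print(f'{output_layer / total:.1%} of the parameters sit in the last Linear layer')
```

Most of the parameters come from the final `nn.Linear(2*dim, max_seq_length*vocab_size)`, because it emits a separate set of vocabulary logits for each of the 9 positions.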


## Perplexity assert


In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)


def perplexity(logits, target, ignore_token_id: int):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, seq_len, vocab_size)
        target: a LongTensor of shape (batch_size, seq_len)

    Returns:
        A float corresponding to the perplexity
    """
    logits = logits.reshape(-1, logits.shape[-1])
    target = target.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target, reduction='mean', ignore_index=ignore_token_id)
    return torch.exp(loss)


n_examples = 1000

train_input_ids, train_target_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
train_input_ids = train_input_ids.to(device)
train_target_ids = train_target_ids.to(device)

logits = model(train_input_ids)

my_perplexity = perplexity(logits=logits, target=train_target_ids, ignore_token_id=tokenizer.pad_token_id)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=7000)
print('Passou o no assert da perplexidade')

my perplexity:              29921
correct initial perplexity: 29794
Passou o no assert da perplexidade
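
The reference value for the initial perplexity is the vocabulary size because an untrained model is expected to be close to uniform over the vocabulary. A short check of that arithmetic, assuming an exactly uniform distribution:

```python
import math

vocab_size = 29794
p_uniform = 1.0 / vocab_size            # probability assigned to every token
cross_entropy = -math.log(p_uniform)    # equals ln(vocab_size)
ppl = math.exp(cross_entropy)
print(f'cross-entropy: {cross_entropy:.3f}, perplexity: {ppl:.1f}')
```

A freshly initialized network is only approximately uniform, which is why the assert above uses a generous `abs_tol`.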


## Training and validation loop

In [None]:
from tqdm.notebook import tqdm

max_examples = 200_000_000
eval_every_steps = 10_000
lr = 4.5e-4


model = LanguageModel(
    vocab_size=29794,
    max_seq_length=9,
    dim=64,
    n_layers=2,
    pad_token_id=0,
).to(device)

train_loader = DataLoader(training_dataset, batch_size=1000, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=1000)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()


train_losses = []
n_examples = 0
step = 0
ver = 1
print('====================== TRAINING MODEL ======================\n')
with tqdm(total=eval_every_steps) as pbar:
    print('\n====================== VALIDING MODEL ======================\n')
    while n_examples < max_examples:    
        for train_input_ids, train_target_ids in train_loader:
            loss = train_step(train_input_ids.to(device), train_target_ids.to(device)) 
            train_losses.append(loss)
            # print(step)
            
            if step % eval_every_steps == 0:
                pbar.reset(total=eval_every_steps)
                train_ppl = np.exp(np.average(train_losses))

                with torch.no_grad():
                    valid_list = []
                    
                    for val_input_ids, val_target_ids in tqdm(validation_loader):
                        valid_list.append(validation_step(val_input_ids.to(device), val_target_ids.to(device)))
                    valid_ppl = np.exp(np.average(valid_list))

                print(f'{step:6d} steps; {(n_examples/max_examples)*100:8.2f} % completed; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}\n')
                train_losses = []

            if n_examples % 20_000_000 == 0:
                torch.save(model.state_dict(), path + 'save_v' + str(ver) + '.pth')
                ver += 1

            n_examples += len(train_input_ids)  # Increment of batch size
            step += 1
            pbar.update(1) 
            if n_examples >= max_examples:
                torch.save(model.state_dict(), path + 'save_final.pth')
                break




     0 steps;     0.00 % completed; 0 examples so far; train ppl: 29986.26, valid ppl: 29655.93
 10000 steps;     5.00 % completed; 10000000 examples so far; train ppl: 132.99, valid ppl: 20.40
 20000 steps;    10.00 % completed; 20000000 examples so far; train ppl: 41.75, valid ppl: 9.80
 30000 steps;    15.00 % completed; 30000000 examples so far; train ppl: 28.77, valid ppl: 7.19
 40000 steps;    20.00 % completed; 40000000 examples so far; train ppl: 23.94, valid ppl: 5.94
 50000 steps;    25.00 % completed; 50000000 examples so far; train ppl: 21.36, valid ppl: 5.24
 60000 steps;    30.00 % completed; 60000000 examples so far; train ppl: 19.44, valid ppl: 4.94
 70000 steps;    35.00 % completed; 70000000 examples so far; train ppl: 18.34, valid ppl: 4.57
 80000 steps;    40.00 % completed; 80000000 examples so far; train ppl: 17.12, valid ppl: 4.32
 90000 steps;    45.00 % completed; 90000000 examples so far; train ppl: 16.43, valid ppl: 4.14
100000 steps;    50.00 % completed; 100000000 examples so far; train ppl: 15.70, valid ppl: 3.95

KeyboardInterrupt: ignored
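
As a sanity check on the first log line above: perplexity is the exponential of the mean per-token cross-entropy, which `np.exp(np.average(...))` approximates from the batch losses. The symbols $N$ (number of target tokens), $w_i$ (the $i$-th token) and $V$ (vocabulary size) are notation introduced here. Assuming a freshly initialized model predicts something close to a uniform distribution over the vocabulary (an assumption, not something measured in this notebook), the expected starting perplexity is about $V$:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right),
\qquad
p(w_i \mid w_{<i}) = \frac{1}{V} \;\Rightarrow\; \mathrm{PPL} = \exp(\log V) = V.
$$

A starting value near 30,000 would therefore be consistent with a vocabulary of roughly that size.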

In [None]:
from tqdm.notebook import tqdm

max_examples = 100_000_000
eval_every_steps = 10_000
lr = 4.5e-4

model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=9,
    dim=250,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

train_loader = DataLoader(training_dataset, batch_size=1000, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=1000)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()


train_losses = []
n_examples = 0
step = 0
ver = 0
print('====================== TRAINING MODEL ======================\n')
with tqdm(total=eval_every_steps) as pbar:
    print('\n====================== VALIDATING MODEL ======================\n')
    while n_examples < max_examples:    
        for train_input_ids, train_target_ids in train_loader:
            loss = train_step(train_input_ids.to(device), train_target_ids.to(device)) 
            train_losses.append(loss)
            # print(step)
            
            if step % eval_every_steps == 0:
                pbar.reset(total=eval_every_steps)
                train_ppl = np.exp(np.average(train_losses))

                with torch.no_grad():
                    valid_list = []
                    
                    for val_input_ids, val_target_ids in tqdm(validation_loader):
                        valid_list.append(validation_step(val_input_ids.to(device), val_target_ids.to(device)))
                    valid_ppl = np.exp(np.average(valid_list))

                print(f'{step:6d} steps; {(n_examples/max_examples)*100:8.2f} % completed; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}\n')
                train_losses = []

            if n_examples % 10_000_000 == 0:
                torch.save(model.state_dict(), path + 'tri_save_v' + str(ver) + '.pth')
                ver += 1

            n_examples += len(train_input_ids)  # Increment of batch size
            step += 1
            pbar.update(1) 
            if n_examples >= max_examples:
                torch.save(model.state_dict(), path + 'tri_save_final.pth')
                break




     0 steps;     0.00 % completed; 0 examples so far; train ppl: 29880.07, valid ppl: 28305.01

KeyboardInterrupt: ignored

In [None]:
from tqdm.notebook import tqdm

max_examples = 100_000_000
eval_every_steps = 10_000
lr = 4.5e-4

model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=9,
    dim=250,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

ver = 1
model.load_state_dict(torch.load(path + 'tri_save_v' + str(ver) + '.pth'))

train_loader = DataLoader(training_dataset, batch_size=1000, shuffle=True, drop_last=True)
validation_loader = DataLoader(valid_dataset, batch_size=1000)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    loss.backward()
    optimizer.step()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    logits = model(input_ids)
    logits = logits.reshape(-1, logits.shape[-1])
    target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, ignore_index=model.pad_token_id)
    return loss.item()


train_losses = []
n_examples = 0
step = 0
ver = 2
print('====================== TRAINING MODEL ======================\n')
with tqdm(total=eval_every_steps) as pbar:
    print('\n====================== VALIDATING MODEL ======================\n')
    while n_examples < max_examples:    
        for train_input_ids, train_target_ids in train_loader:
            loss = train_step(train_input_ids.to(device), train_target_ids.to(device)) 
            train_losses.append(loss)
            # print(step)
            
            if step % eval_every_steps == 0:
                pbar.reset(total=eval_every_steps)
                train_ppl = np.exp(np.average(train_losses))

                with torch.no_grad():
                    valid_list = []
                    
                    for val_input_ids, val_target_ids in tqdm(validation_loader):
                        valid_list.append(validation_step(val_input_ids.to(device), val_target_ids.to(device)))
                    valid_ppl = np.exp(np.average(valid_list))

                print(f'{step:6d} steps; {(n_examples/max_examples)*100:8.2f} % completed; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}\n')
                train_losses = []

            if n_examples % 10_000_000 == 0:
                torch.save(model.state_dict(), path + 'tri_save_v' + str(ver) + '.pth')
                ver += 1

            n_examples += len(train_input_ids)  # Increment of batch size
            step += 1
            pbar.update(1) 
            if n_examples >= max_examples:
                torch.save(model.state_dict(), path + 'tri_save_final.pth')
                break

KeyboardInterrupt: ignored

## Final evaluation on the test dataset


Bonus: the model with the lowest perplexity on the test dataset earns 0.5 points on the final grade.

In [None]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=max_seq_length,
    dim=64,
    n_layers=2,
    pad_token_id=tokenizer.pad_token_id,
).to(device)

ver = 2
model.load_state_dict(torch.load(path + 'save_v' + str(ver) + '.pth'))

<All keys matched successfully>

In [None]:
print("Memory before loading the dataset\n")
memory_usage()

test_dataset = JDataset(path + 'test_tensor.pt')

print("\n\nMemory after loading the dataset\n")
memory_usage()

Memory before loading the dataset

Total RAM:      12986.89 MB
Available RAM:  4990.57 MB
Used RAM:       9131.54 MB
Total used:     61.60 %


Memory after loading the dataset

Total RAM:      12986.89 MB
Available RAM:  4976.13 MB
Used RAM:       9117.67 MB
Total used:     61.70 %


In [None]:
test_loader = DataLoader(test_dataset, batch_size=64)

with torch.no_grad():
    test_list = []
    for test_input_ids, test_target_ids in tqdm(test_loader):
        test_list.append(validation_step(test_input_ids.to(device), test_target_ids.to(device)))
    test_ppl = np.exp(np.average(test_list))

print(f'test perplexity: {test_ppl}')

  0%|          | 0/82585 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
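
The loop above averages the per-batch mean losses, so every batch counts equally even when batches hold different numbers of non-padding tokens (for example, a smaller last batch). A minimal sketch of a token-weighted alternative follows. `token_weighted_perplexity` and its arguments are hypothetical names, and the sketch assumes that each batch's mean loss and its count of non-padding target tokens were collected during evaluation.

```python
import math
from typing import Sequence


def token_weighted_perplexity(batch_mean_losses: Sequence[float],
                              batch_token_counts: Sequence[int]) -> float:
    """Perplexity from per-batch mean losses, weighted by token count.

    Hypothetical helper: assumes batch_mean_losses[i] is the mean
    cross-entropy over the batch_token_counts[i] non-padding targets
    of batch i.
    """
    if len(batch_mean_losses) != len(batch_token_counts):
        raise ValueError("one token count is required per batch loss")
    total_tokens = sum(batch_token_counts)
    if total_tokens == 0:
        raise ValueError("no non-padding tokens to evaluate")
    # Recover each batch's summed loss, then average over all tokens.
    total_loss = sum(loss * count
                     for loss, count in zip(batch_mean_losses, batch_token_counts))
    return math.exp(total_loss / total_tokens)


# Toy numbers (not results from this notebook): two batches of unequal size.
losses = [1.0, 3.0]
counts = [3, 1]
weighted_ppl = token_weighted_perplexity(losses, counts)  # exp of the per-token mean
unweighted_ppl = math.exp(sum(losses) / len(losses))      # exp of the per-batch mean
```

In this notebook the counts could plausibly come from `(target_ids != model.pad_token_id).sum().item()` inside `validation_step`. When every batch holds the same number of tokens, the two averages coincide.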

## Test your model with a sentence

Choose a sentence generated by the model that you find interesting.

In [None]:
def tokenize(text: str, tokenizer):
    # Using tokenizer.batch_encode_plus is recommended because it is faster.
    return tokenizer(text, return_tensors=None, add_special_tokens=False, padding=False).input_ids

In [None]:
# Load datasets
max_seq_length = 9

# Use a context manager so the file handle is closed after reading.
with open('sample-1gb.txt', encoding='utf-8') as f:
    texts = f.readlines()

print(f'Read {len(texts)} lines.')

train_examples = int(len(texts)*0.70)
valid_examples = int(len(texts)*0.15)
test_examples  = int(len(texts)*0.15)

max_lines = train_examples + valid_examples + test_examples
print(f'Truncating to {max_lines} lines.')
print(f'Train Examples: {train_examples}')
print(f'Valid Examples: {valid_examples}')
print(f'Test Examples:  {test_examples}')
texts = texts[:max_lines]  

In [None]:
texts[0]

'Linkbar Há alguns anos, o número de rapazes e moças que subiam ao púlpito para pregar era maior que o de hoje. Na sua simplicidade, falavam do amor de Deus, da Salvação e davam testemunho sob a unção do Espirito Santo. Hoje, parece que a figura do "preletor oficial" inibiu muitos de falarem com ousadia a Palavra de Deus. Parece que há um receio de falar diante de um público que, certamente, é mais intelectualizado que há alguns anos. Jovens pregadores ficam embaraçados e cometem certos deslizes, que poderiam ser evitados. Neste modesto trabalho, vamos dar apenas algumas sugestões, e não um estudo sobre a Homilética (Arte de Falar em Publico). I -O QUE PREGAR? É a comunicação verbal da Palavra de Deus aos ouvintes. É a transmissão do evangelho de Nosso Senhor Jesus Cristo às pessoas que precisam ouvi-lo. II- QUAL A FINALIDADE DA PREGAÇÃO? É persuadir as pessoas a aceitarem a mensagem da Palavra de Deus para sua salvação (descrentes) ou para seu crescimento espiritual (crentes). Diante 

In [None]:
prompt = "Linkbar Há alguns anos, o número de rapazes "
max_output_tokens = 20
model.eval()

with torch.no_grad():  # Inference only: no gradients are needed.
    for _ in range(max_output_tokens):
        input_ids = tokenize(text=prompt, tokenizer=tokenizer)
        # Only the last max_seq_length tokens are fed to the model.
        input_ids_truncated = input_ids[-max_seq_length:]
        logits = model(torch.LongTensor([input_ids_truncated]).to(device))
        logits = logits[:, -1, :]  # Keep only the logits of the last position in the sequence.
        # With argmax, the model's output at each step is the most probable token.
        # This is called greedy decoding.
        predicted_id = torch.argmax(logits).item()
        input_ids += [predicted_id]  # Append the token chosen at this step to the input.
        prompt = tokenizer.decode(input_ids)
        print(prompt)

Linkbar Há alguns anos, o número de rapazes少
Linkbar Há alguns anos, o número de rapazes [UNK] possuir
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓 1906
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓 1906 ¶
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓 1906 ¶ desenhada
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓 1906 ¶ desenhada erguida
Linkbar Há alguns anos, o número de rapazes [UNK] possuir apontam atribuídas desenhada뒓 1906 ¶ desenhada erguida땈
Linkbar Há alguns anos, o número de rapazes [UNK] possuir a

## Bonus 1
Whoever achieves the lowest perplexity on the test dataset earns 0.5 points on the final average.

## Bonus 2
What is the complexity (in big-O notation) of the text generation function above?

Whoever answers the question above correctly and reduces the function to the lowest complexity earns 0.5 points on the final average.
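
One way to approach this question: the generation loop above re-tokenizes and re-decodes the whole growing prompt at every step, so the string work per iteration grows with the text length, which makes the total roughly quadratic in the number of generated tokens. The sketch below is one possible variant that does a constant amount of work per step outside the model call. `greedy_generate` and `next_token_fn` are hypothetical names. `next_token_fn` stands in for the model forward pass plus argmax and is replaced by a toy function here so the example runs without a trained model.

```python
from collections import deque
from typing import Callable, List, Sequence


def greedy_generate(prompt_ids: Sequence[int],
                    next_token_fn: Callable[[List[int]], int],
                    max_seq_length: int,
                    max_output_tokens: int) -> List[int]:
    """Greedy decoding over token ids with a fixed-size context window.

    The prompt is tokenized once by the caller. Each step only touches the
    last max_seq_length ids, so the work per step does not grow with the
    length of the generated text.
    """
    generated = list(prompt_ids)
    # deque(maxlen=...) drops the oldest id automatically on append.
    window = deque(generated[-max_seq_length:], maxlen=max_seq_length)
    for _ in range(max_output_tokens):
        next_id = next_token_fn(list(window))
        generated.append(next_id)
        window.append(next_id)
    return generated


# Toy stand-in for "model forward pass + argmax" (an assumption for this demo):
# it records the context size it receives and predicts the last id plus one.
seen_context_sizes = []


def toy_next_token(context: List[int]) -> int:
    seen_context_sizes.append(len(context))
    return context[-1] + 1


output_ids = greedy_generate([1, 2, 3], toy_next_token,
                             max_seq_length=2, max_output_tokens=4)
```

With the notebook's objects, `next_token_fn` could wrap the `model(...)` call and `torch.argmax` from the cell above, and `tokenizer.decode` would then be called once on the returned ids instead of once per step.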