## Exercício: Modelo de Linguagem com auto-atenção

Este exercício é similar ao da aula passada, mas iremos agora treinar uma rede neural *com auto-atenção* para prever a próxima palavra de um texto, data as palavras anteriores como entrada.

Na camada de auto-atenção, deve-se implementar (vide slide 34):
- Embeddings de posição
- Projeções lineares (WQ, WK, WV, WO)
- Camada de feed forward (2-layer MLP)

Instrucões:
- É necessário fazer duas implementações da camada de auto-atenção: uma usando laços (ineficiente, mas fácil de entender) e outra matricial (eficiente mas difícil de entender). Usar slide 36 como referência.

- Fazer um assert para garantir que o resultado das duas implementações é exatamente igual.

- No treinamento, usar apenas a implementação matricial.

## Faz download e carrega o dataset

In [1]:
!wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
!wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

'wget' n�o � reconhecido como um comando interno
ou externo, um programa oper�vel ou um arquivo em lotes.
'wget' n�o � reconhecido como um comando interno
ou externo, um programa oper�vel ou um arquivo em lotes.


In [2]:
import re
from collections import Counter
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import multiprocessing
import torch.nn.functional as F
import torch.optim as optim
import time
from torch import nn, Tensor
import math

In [3]:
# Simples limitação dos dados, para trabalhar apenas com tokens presentes no livro.

text = open("67724.txt.utf-8","r",encoding="utf-8").read()
idx = text.find("PARTE\n\n")
idx2 = text.find("*** END OF THE PROJECT")
text = text[idx:idx2]
text2 = open("67725.txt.utf-8","r",encoding="utf-8").read()
idx = text2.find("PARTE\n\n")
idx2 = text2.find("*** END OF THE PROJECT")
text2 = text2[idx:idx2]

text += text2

paragraphs = text.split("\n\n")
len(paragraphs)

4816

In [4]:
# cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

cleaned_paragraphs = []
full_text = ""
final_tokens = []
# Tratando tokens em cada prágrafo
for paragraph in paragraphs:
    paragraph = paragraph.replace("\n", " ")
    for removable in ["«", "»", "_"]:
        paragraph = paragraph.replace(removable, '') # Removendo as aspas, underline, etc.
    
    paragraph = paragraph.lower().strip() # Caixa baixa e removendo leading e trailing spaces.

    if paragraph[:3] == "pag":
        continue
    if len(paragraph) < 3:
        continue

    paragraph = re.sub("[ ]+", " ", paragraph) # Espaços duplicados

    for punctuation in ['.', ',', ';', '!', ":", "?", "--"]:
        paragraph = paragraph.replace(punctuation, (' ' + punctuation) if punctuation != "--" else (punctuation + ' ')) # Tratando pontuação como próprio token
    cleaned_paragraphs.append(paragraph)
    final_tokens += paragraph.split(" ") + ['\n']
    full_text += paragraph + '\n'
    
# for paragraph in cleaned_paragraphs:
#     print(paragraph)

# final_tokens

In [5]:
# Conta as palavras no dataset

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        if text == "\n":
            word_counts.update(text)
            continue
        # word_counts.update(re.findall(r'\w+', text.lower()))
        word_counts.update(list(re.findall(r'.*', text.lower())))
        
    return word_counts

word_counts = count_words(final_tokens)
word_counts.pop('')

127721

## Criando um vocabulário

In [6]:
vocab_size = 2500
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [7]:
def encode_sentence(sentence, vocab):
    if isinstance(sentence, str):
        sentence = sentence.split(" ")
    # print(sentence)
    return [vocab.get(word, 0) for word in sentence]

print(encode_sentence(cleaned_paragraphs[20], vocab))

[1360, 2386, 50, 886, 1243, 1, 1536, 225, 0, 1, 11, 0, 7, 0, 1, 11, 1120, 879, 1, 0, 11, 103, 8, 1366, 14, 335, 1357, 86, 104, 4, 91, 12, 82, 35, 0, 26, 0, 593, 18, 14, 1362, 8, 580, 945, 2]


## Classe do dataset

In [8]:
context_size = 5
"""TODO: Preparar o dataset"""
overlap_size = 4
step = context_size - overlap_size
if step <= 0:
    raise

# print(final_tokens)
whole_data = []
for i in range(0, len(final_tokens) - context_size, step):
    cur_data = [encode_sentence(final_tokens[i:i+context_size], vocab), encode_sentence(final_tokens[i + context_size], vocab)[0]]
    if 0 in cur_data[0] or 0 == cur_data[1]:# or vocab_size in cur_data[0] or vocab_size == cur_data[1] :
        continue
    for i in range(context_size):
        cur_data[0][i] -= 1
    cur_data[1] -= 1
    whole_data.append(tuple(cur_data))

print(whole_data[:context_size])

[([1, 2, 35, 5, 591], 36), ([2, 35, 5, 591, 36], 1355), ([35, 5, 591, 36, 1355], 6), ([5, 591, 36, 1355, 6], 1356), ([591, 36, 1355, 6, 1356], 23)]


In [9]:
N = len(whole_data)
random_state = 18
np.random.seed(random_state)
torch.manual_seed(random_state)
random_indices = np.arange(N)
np.random.shuffle(random_indices)
# print(random_indices)
cut_idx = int(0.8 * N)
train_indices = random_indices[:cut_idx]
validation_indices = random_indices[cut_idx:]

In [10]:
class MyDataset(Dataset):
    def __init__(self, split, vocab):
        idxs = train_indices if split == "train" else validation_indices
        self.data = []
        for idx in idxs:
            self.data.append(whole_data[idx])
            
        self.vocab = vocab  # Set vocabulary

    def __len__(self):
        return len(self.data)  # Return the length of the dataset

    def __getitem__(self, idx):
        line, label = self.data[idx]  # Get label and text for specified index

        return torch.tensor(line), torch.tensor(label)

train_data = MyDataset(split="train", vocab=vocab)
val_data = MyDataset(split="val", vocab=vocab)

In [11]:
batch_size = 16
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
sample = next(iter(train_loader))

In [12]:
embedding_dim = 512

W = Tensor(encode_sentence("meu amor é maior que o seu e tudo que".split()[:context_size], vocab)).reshape(context_size, 1)
C = Tensor(embedding_dim).reshape(1, embedding_dim)
P = Tensor(embedding_dim).reshape(1, embedding_dim)
wQ = Tensor(embedding_dim, embedding_dim)
wK = Tensor(embedding_dim, embedding_dim)
wV = Tensor(embedding_dim, embedding_dim)
w0 = Tensor(embedding_dim, embedding_dim)

nn.init.xavier_uniform_(C)
nn.init.xavier_uniform_(P)
nn.init.xavier_uniform_(wQ)
nn.init.xavier_uniform_(wK)
nn.init.xavier_uniform_(wV)
nn.init.xavier_uniform_(w0)

def get_embeddings_with_attention(W, C, P, wQ, wK, wV, w0):

    X = W @ C + P
    Q = X @ wQ
    K = X @ wK
    V = X @ wV

    scores = Q @ K.T
    probs = F.softmax(scores, dim=-1)
    E = probs @ V

    return E @ w0

def get_embeddings_with_attention_slow(W, C, P, wQ, wK, wV, w0):
    E = []
    X = W @ C + P

    for xq in X:
        Q = xq @ wQ
        scores = []
        for xk in X:
            K = xk @ wK
            score = Q @ K.T
            scores.append(score)
        scores = Tensor(scores)
        probs = F.softmax(scores, dim = -1)

        e = 0
        for xv, p in zip(X, probs):
            V = xv @ wV
            e += V * p
        e = e @ w0
        E.append(e)

    return torch.stack(E)

A = get_embeddings_with_attention_slow(W, C, P, wQ, wK, wV, w0)
B = get_embeddings_with_attention(W, C, P, wQ, wK, wV, w0)

assert torch.allclose(A, B, atol=1e-5), "Matrix and Loop implementations are not the same."

  score = Q @ K.T


## Model

In [13]:
# Implementation from Pytorch library - https://pytorch.org/tutorials/beginner/transformer_tutorial.html
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)


In [14]:

class SelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super(SelfAttention, self).__init__()
        self.embedding_dim = embedding_dim
        self.Q = nn.Linear(embedding_dim, embedding_dim)
        self.K = nn.Linear(embedding_dim, embedding_dim)
        self.V = nn.Linear(embedding_dim, embedding_dim)
        self.zero = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, X):
        # X = W @ C + P
        Q = self.Q(X)
        K = self.K(X)
        V = self.V(X)

        KT = K.permute(0, 2, 1)
        scores = Q @ KT
        probs = F.softmax(scores, dim=-1)
        E = probs @ V

        return self.zero(E)

self_attention = SelfAttention(embedding_dim)
X = W @ C + P
X = X.reshape(1, X.shape[0], X.shape[1]) # Leave in batch form
Y = self_attention(X)
print(X.shape, Y.shape)

torch.Size([1, 5, 512]) torch.Size([1, 5, 512])


In [15]:
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(LanguageModel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.attention = SelfAttention(embedding_dim)
        self.pe = PositionalEncoding(embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size, bias = False)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        embeds = self.embeddings(inputs) # Lookup table
        embeds = self.pe(embeds) # Positional Encoding
        embeds = self.attention(embeds) # Self attention
        embeds = embeds.view(embeds.size(0), -1) # Concat embeddings
        out = torch.tanh(self.linear1(embeds)) # First layer with non linearity
        out = self.relu(self.linear2(out)) # Second layer
        log_probs = F.log_softmax(out, dim=1) # Logits
        return log_probs

model = LanguageModel(vocab_size, embedding_dim, context_size, 500)

In [16]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]
output = model(input)
print(input.shape, target.shape)
print(output.shape)

torch.Size([16, 5]) torch.Size([16])
torch.Size([16, 2500])


## Training and Eval

In [17]:
# Verifica se há uma device disponível e define o dispositivo para device se possível, caso contrário, usa a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print('GPU:', torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print('using CPU')

GPU: NVIDIA GeForce RTX 3060 Ti


In [18]:
# helper function to get accuracy from log probabilities
def get_accuracy_from_log_probs(log_probs, labels):
    probs = torch.exp(log_probs)
    predicted_label = torch.argmax(probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        for context_tensor, target_tensor in dataloader:
            context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)
            log_probs = model(context_tensor)
            mean_loss += criterion(log_probs, target_tensor).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target_tensor)
            count += 1

    return mean_acc / count, mean_loss / count

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(model)

4861124

In [19]:
# Using negative log-likelihood loss
loss_function = nn.NLLLoss()

# create model
model = LanguageModel(vocab_size, embedding_dim, context_size, 500)

# load it to gpu
model = model.to(device)

# optimizer = optim.Adam(model.parameters(), lr = 1e-3)
optimizer = optim.SGD(model.parameters(), lr = 1e-2)

train_acc, train_loss = evaluate(model, loss_function, train_loader)
print("\n--- Evaluating model on train data ---")
print(f"Train Accuracy: {train_acc}; Train Loss: {train_loss}, Train PPL: {torch.exp(torch.tensor(train_loss))}")

best_test_ppl = 1e9
for epoch in range(10):
    st = time.time()
    print(f"\n--- Training model Epoch: {epoch+1} ---")
    for it, data_tensor in enumerate(train_loader):       
        context_tensor = data_tensor[0]
        target_tensor = data_tensor[1]

        context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)

        # zero out the gradients from the old instance
        model.zero_grad()
        # get log probabilities over next words
        log_probs = model(context_tensor)
        # compute loss function
        loss = loss_function(log_probs, target_tensor)
        # backward pass and update gradient
        loss.backward()
        optimizer.step()

    print(f"Finished training of Epoch {epoch +1}\n--- Evaluating model on train data ---")
    train_acc, train_loss = evaluate(model, loss_function, train_loader)
    print(f"Train Accuracy: {train_acc}; Train Loss: {train_loss}, Train PPL: {torch.exp(torch.tensor(train_loss))}")
    print("\n--- Evaluating model on test data ---")
    test_acc, test_loss = evaluate(model, loss_function, val_loader)
    print(f"Test Accuracy: {test_acc}; Test Loss: {test_loss}, Test PPL: {torch.exp(torch.tensor(test_loss))}")

    best_test_ppl = min(best_test_ppl, (torch.exp(torch.tensor(test_loss))))

print("BEST PPL:", best_test_ppl)


--- Evaluating model on train data ---
Train Accuracy: 0.00024598228628747165; Train Loss: 7.827923429250795, Train PPL: 2509.712158203125

--- Training model Epoch: 1 ---
Finished training of Epoch 1
--- Evaluating model on train data ---
Train Accuracy: 0.13002213835716248; Train Loss: 5.844966642111393, Train PPL: 345.4909362792969

--- Evaluating model on test data ---
Test Accuracy: 0.125; Test Loss: 5.91563205481513, Test PPL: 370.7886657714844

--- Training model Epoch: 2 ---
Finished training of Epoch 2
--- Evaluating model on train data ---
Train Accuracy: 0.15271401405334473; Train Loss: 5.302257624404397, Train PPL: 200.78958129882812

--- Evaluating model on test data ---
Test Accuracy: 0.14523263275623322; Test Loss: 5.409776848546799, Test PPL: 223.58164978027344

--- Training model Epoch: 3 ---
Finished training of Epoch 3
--- Evaluating model on train data ---
Train Accuracy: 0.17231059074401855; Train Loss: 5.075045321565567, Train PPL: 159.9794158935547

--- Evaluati

## Exemplo de uso

In [20]:
i = 300
text = " ".join(final_tokens[i: i+context_size])

inv_vocab = {v-1 : k for k, v in vocab.items()}
def generate_text(model, vocab, text, max_length, context_size):
    context = encode_sentence(text, vocab)

    final_text = context
    for i in range(max_length):
        inputs = torch.tensor(context).to(device).view((1, -1))
        pred = torch.argmax(model(inputs), dim=1)
        final_text.append(pred.item())
        context = final_text[-context_size:]

    l = ([inv_vocab[t] for t in final_text])
    decoded_sentence = " ".join(l)

    print(decoded_sentence)


context = context_size
max_length= 50
generate_text(model, vocab, text, max_length, context_size)

verdura e agrestes ; penetrar que o seu coração tinha a sua vida . 
 -- e ! . . . 
 o indio , a cabeça com um gesto de vozes e . 
 o indio , a cabeça com um gesto de vozes e . 
 o indio , a cabeça com um
