## Exercício: Modelo de Linguagem com auto-atenção

Este exercício é similar ao da aula passada, mas iremos agora treinar uma rede neural *com auto-atenção* para prever a próxima palavra de um texto, data as palavras anteriores como entrada.

Na camada de auto-atenção, deve-se implementar (vide slide 34):
- Embeddings de posição
- Projeções lineares (WQ, WK, WV, WO)
- Camada de feed forward (2-layer MLP)

Instrucões:
- É necessário fazer duas implementações da camada de auto-atenção: uma usando laços (ineficiente, mas fácil de entender) e outra matricial (eficiente mas difícil de entender). Usar slide 36 como referência.

- Fazer um assert para garantir que o resultado das duas implementações é exatamente igual.

- No treinamento, usar apenas a implementação matricial.

## Faz download e carrega o dataset

In [20]:
!wget https://www.gutenberg.org/ebooks/67724.txt.utf-8
!wget https://www.gutenberg.org/ebooks/67725.txt.utf-8

'wget' n�o � reconhecido como um comando interno
ou externo, um programa oper�vel ou um arquivo em lotes.
'wget' n�o � reconhecido como um comando interno
ou externo, um programa oper�vel ou um arquivo em lotes.


In [21]:
import re
from collections import Counter
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import multiprocessing
import torch.nn.functional as F
import torch.optim as optim
import time

In [22]:
# Simples limitação dos dados, para trabalhar apenas com tokens presentes no livro.

text = open("67724.txt.utf-8","r",encoding="utf-8").read()
idx = text.find("PARTE\n\n")
idx2 = text.find("*** END OF THE PROJECT")
text = text[idx:idx2]
text2 = open("67725.txt.utf-8","r",encoding="utf-8").read()
idx = text2.find("PARTE\n\n")
idx2 = text2.find("*** END OF THE PROJECT")
text2 = text2[idx:idx2]

text += text2

paragraphs = text.split("\n\n")
len(paragraphs)

4816

In [41]:
# cleaned_paragraphs = [paragraph.replace("\n", " ") for paragraph in paragraphs if paragraph.strip()]

cleaned_paragraphs = []
full_text = ""
final_tokens = []
# Tratando tokens em cada prágrafo
for paragraph in paragraphs:
    paragraph = paragraph.replace("\n", " ")
    for removable in ["«", "»", "_"]:
        paragraph = paragraph.replace(removable, '') # Removendo as aspas, underline, etc.
    
    paragraph = paragraph.lower().strip() # Caixa baixa e removendo leading e trailing spaces.

    if paragraph[:3] == "pag":
        continue
    if len(paragraph) < 3:
        continue

    paragraph = re.sub("[ ]+", " ", paragraph) # Espaços duplicados

    for punctuation in ['.', ',', ';', '!', ":", "?", "--"]:
        paragraph = paragraph.replace(punctuation, (' ' + punctuation) if punctuation != "--" else (punctuation + ' ')) # Tratando pontuação como próprio token
    cleaned_paragraphs.append(paragraph)
    final_tokens += paragraph.split(" ") + ['\n']
    full_text += paragraph + '\n'
    
# for paragraph in cleaned_paragraphs:
#     print(paragraph)

# final_tokens

In [43]:
# Conta as palavras no dataset

def count_words(texts):
    word_counts = Counter()
    for text in texts:
        if text == "\n":
            word_counts.update(text)
            continue
        # word_counts.update(re.findall(r'\w+', text.lower()))
        word_counts.update(list(re.findall(r'.*', text.lower())))
        
    return word_counts

word_counts = count_words(final_tokens)
word_counts.pop('')

# word_counts

127721

## Criando um vocabulário

In [26]:
vocab_size = 2500
most_frequent_words = [word for word, count in word_counts.most_common(vocab_size)]
vocab = {word: i for i, word in enumerate(most_frequent_words, 1)}

In [27]:
def encode_sentence(sentence, vocab):
    if isinstance(sentence, str):
        sentence = sentence.split(" ")
    # print(sentence)
    return [vocab.get(word, 0) for word in sentence]

encode_sentence(cleaned_paragraphs[20], vocab)

[1360,
 2386,
 50,
 886,
 1243,
 1,
 1536,
 225,
 0,
 1,
 11,
 0,
 7,
 0,
 1,
 11,
 1120,
 879,
 1,
 0,
 11,
 103,
 8,
 1366,
 14,
 335,
 1357,
 86,
 104,
 4,
 91,
 12,
 82,
 35,
 0,
 26,
 0,
 593,
 18,
 14,
 1362,
 8,
 580,
 945,
 2]

## Classe do dataset

In [28]:
context_size = 5 # 5 palavras de entrada. O target é a próxima palavra
"""TODO: Preparar o dataset"""
overlap_size = 4
step = context_size - overlap_size
if step <= 0:
    raise

# print(final_tokens)
whole_data = []
for i in range(0, len(final_tokens) - context_size, step):
    cur_data = [encode_sentence(final_tokens[i:i+context_size], vocab), encode_sentence(final_tokens[i + context_size], vocab)[0]]
    if 0 in cur_data[0] or 0 == cur_data[1]:# or vocab_size in cur_data[0] or vocab_size == cur_data[1] :
        continue
    for i in range(context_size):
        cur_data[0][i] -= 1
    cur_data[1] -= 1
    whole_data.append(tuple(cur_data))

print(whole_data[:context_size])

[([1, 2, 35, 5, 591], 36), ([2, 35, 5, 591, 36], 1355), ([35, 5, 591, 36, 1355], 6), ([5, 591, 36, 1355, 6], 1356), ([591, 36, 1355, 6, 1356], 23)]


In [29]:
N = len(whole_data)
random_state = 18
np.random.seed(random_state)
torch.manual_seed(random_state)
random_indices = np.arange(N)
np.random.shuffle(random_indices)
# print(random_indices)
cut_idx = int(0.8 * N)
train_indices = random_indices[:cut_idx]
validation_indices = random_indices[cut_idx:]

In [30]:
class MyDataset(Dataset):
    def __init__(self, split, vocab):
        idxs = train_indices if split == "train" else validation_indices
        self.data = []
        for idx in idxs:
            self.data.append(whole_data[idx])
            
        self.vocab = vocab  # Set vocabulary

    def __len__(self):
        return len(self.data)  # Return the length of the dataset

    def __getitem__(self, idx):
        line, label = self.data[idx]  # Get label and text for specified index

        return torch.tensor(line), torch.tensor(label)

train_data = MyDataset(split="train", vocab=vocab)
val_data = MyDataset(split="val", vocab=vocab)

In [31]:
batch_size = 30
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
sample = next(iter(train_loader))
print(sample)

[tensor([[ 174, 2494,    0,    7,    0],
        [ 280,   66,    5,  244,    4],
        [ 704,    5,   20,  389,   17],
        [   2,    9,   22,   21,    2],
        [   1,   33,  478,   14, 1760],
        [  42, 2301,   29,  709,   11],
        [   8,   37,    1,   45,  407],
        [ 756,   11,  101,    1,    2],
        [   3,  139,    7,    5,  735],
        [ 914,   32,    1,    2,    5],
        [ 458,   19,  147,  104,   63],
        [   7,    5,   20,  128,    0],
        [ 779,    0,    7,  382,  828],
        [   0, 1903,  504,    0,  253],
        [ 122,    3,  262,    1,    2],
        [ 139,   12,  107,  726,    1],
        [  35,  356,   21,    1,    1],
        [1229,    4,  280,    8,   18],
        [   2,    9,   66,   80,   50],
        [  61,    0,    4,  280,   10],
        [   0,   22,  761,    3, 1593],
        [1458,  105,   24,  310,    0],
        [   6,   13,  568,  152,    0],
        [ 319,    1,    2,   12,   35],
        [   0,    7,   46,    5,   20],

## Model

In [32]:


class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(LanguageModel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, h) # This hidden layer ideia I've got from Gabriel Freita's code. It helped to reduce PPL in 20.
        self.linear3 = nn.Linear(h, vocab_size, bias = False)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        embeds = embeds.view(embeds.size(0), -1)
        out = torch.tanh(self.linear1(embeds))
        out = self.relu(self.linear2(out))
        out = self.linear3(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

embedding_dim = 128
context_size = 5
H = 500
model = LanguageModel(vocab_size, embedding_dim, context_size, H)

In [33]:
# helper function to get accuracy from log probabilities
def get_accuracy_from_log_probs(log_probs, labels):
    probs = torch.exp(log_probs)
    predicted_label = torch.argmax(probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader, device):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        dev_st = time.time()
        for it, data_tensor in enumerate(dataloader):
            input = data_tensor[:,0:2]
            target = data_tensor[:,2]
            input, target = input.to(device), target.to(device)
            log_probs = model(input)
            mean_loss += criterion(log_probs, target).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target)
            count += 1
            if it % 500 == 0: 
                print(f"Dev Iteration {it} complete. Mean Loss: {mean_loss / count}; Mean Acc: {mean_acc / count}; Time taken (s): {time.time()-dev_st}")
                dev_st = time.time()

    return mean_acc / count, mean_loss / count

In [34]:
# Verifica se há uma device disponível e define o dispositivo para device se possível, caso contrário, usa a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = 'cpu'

In [35]:
sample = next(iter(train_loader))
input = sample[0]
target = sample[1]
print(input.shape, target.shape)
output = model(input)
print(output.shape)

torch.Size([30, 5]) torch.Size([30])
torch.Size([30, 2500])


## Training and Eval

In [36]:
# Verifica se há uma device disponível e define o dispositivo para device se possível, caso contrário, usa a CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [37]:
# helper function to get accuracy from log probabilities
def get_accuracy_from_log_probs(log_probs, labels):
    probs = torch.exp(log_probs)
    predicted_label = torch.argmax(probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        for context_tensor, target_tensor in dataloader:
            context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)
            log_probs = model(context_tensor)
            mean_loss += criterion(log_probs, target_tensor).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target_tensor)
            count += 1

    return mean_acc / count, mean_loss / count

In [38]:
# Using negative log-likelihood loss
loss_function = nn.NLLLoss()

# create model
model = LanguageModel(len(vocab), embedding_dim, context_size, H)

# load it to gpu
model = model.to(device)

# optimizer = optim.Adam(model.parameters(), lr = 1e-3)
optimizer = optim.SGD(model.parameters(), lr = 1e-2)

train_acc, train_loss = evaluate(model, loss_function, train_loader)
print("\n--- Evaluating model on train data ---")
print(f"Train Accuracy: {train_acc}; Train Loss: {train_loss}, Train PPL: {torch.exp(torch.tensor(train_loss))}")

best_test_ppl = 1e9
for epoch in range(10):
    st = time.time()
    print(f"\n--- Training model Epoch: {epoch+1} ---")
    for it, data_tensor in enumerate(train_loader):       
        context_tensor = data_tensor[0]
        target_tensor = data_tensor[1]

        context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)

        # zero out the gradients from the old instance
        model.zero_grad()
        # get log probabilities over next words
        log_probs = model(context_tensor)
        # compute loss function
        loss = loss_function(log_probs, target_tensor)
        # backward pass and update gradient
        loss.backward()
        optimizer.step()

    print(f"Finished training of Epoch {epoch +1}\n--- Evaluating model on train data ---")
    train_acc, train_loss = evaluate(model, loss_function, train_loader)
    print(f"Train Accuracy: {train_acc}; Train Loss: {train_loss}, Train PPL: {torch.exp(torch.tensor(train_loss))}")
    print("\n--- Evaluating model on test data ---")
    test_acc, test_loss = evaluate(model, loss_function, val_loader)
    print(f"Test Accuracy: {test_acc}; Test Loss: {test_loss}, Test PPL: {torch.exp(torch.tensor(test_loss))}")

    best_test_ppl = min(best_test_ppl, (torch.exp(torch.tensor(test_loss))))

print("BEST PPL:", best_test_ppl)


--- Evaluating model on train data ---
Train Accuracy: 0.0006970071117393672; Train Loss: 7.81595775679322, Train PPL: 2479.8603515625

--- Training model Epoch: 1 ---
Finished training of Epoch 1
--- Evaluating model on train data ---
Train Accuracy: 0.14135077595710754; Train Loss: 5.639600263457163, Train PPL: 281.3502197265625

--- Evaluating model on test data ---
Test Accuracy: 0.1358957439661026; Test Loss: 5.726092808369629, Test PPL: 306.768310546875

--- Training model Epoch: 2 ---
Finished training of Epoch 2
--- Evaluating model on train data ---
Train Accuracy: 0.16916663944721222; Train Loss: 5.296870964039735, Train PPL: 199.71096801757812

--- Evaluating model on test data ---
Test Accuracy: 0.15735363960266113; Test Loss: 5.429516357051653, Test PPL: 228.03892517089844

--- Training model Epoch: 3 ---
Finished training of Epoch 3
--- Evaluating model on train data ---
Train Accuracy: 0.1802302449941635; Train Loss: 5.1148640542012735, Train PPL: 166.47811889648438

--

## Exemplo de uso

In [39]:
i = 1000
text = " ".join(final_tokens[i: i+context_size])

inv_vocab = {v-1 : k for k, v in vocab.items()}
def generate_text(model, vocab, text, max_length, context_size):
    context = encode_sentence(text, vocab)

    final_text = context
    for i in range(max_length):
        inputs = torch.tensor(context).to(device).view((1, -1))
        pred = torch.argmax(model(inputs), dim=1)
        final_text.append(pred.item())
        context = final_text[-context_size:]

    l = ([inv_vocab[t] for t in final_text])
    decoded_sentence = " ".join(l)

    print(decoded_sentence)


context = context_size
max_length= 100
generate_text(model, vocab, text, max_length, context_size)

frente 
 a que della se o seu olhar de sua senhora , e a menina , que se tinha a sua vida , e que não se o que lhe se não era o que se tinha a sua vida , e que não se o que lhe se não era o que se tinha a sua vida , e que não se o que lhe se não era o que se tinha a sua vida , e que não se o que lhe se não era o que se tinha a sua vida , e que não se o que lhe se não
