# Neural Language Modeling

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brownfortress/NLU-2024-labs/blob/main/labs/04_neural_LM.ipynb)

In [None]:
# device = 'cuda:0'
DEVICE = 'cuda:0'

In [33]:
# Run this if you are on Colab
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


If you want to introduce the concept of word embeddings to a friend, the [Semantle](https://semantle.com) game is a good starting point. It is based on the word2vec model.

# 4 Language Models with Neural Networks

We have already seen a language model based on n-grams; in this lab we are going to develop a language model using a neural architecture. A neural LM can also be used to compute word embeddings.

## 4.1 Task definition

To model the probability distribution over a sequence, we are going to use the Chain Rule as we have seen in LAB 3:

$$P(w_{1}^{n}) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_{1}^{n-1}) = \prod_{i=1}^{n}{P(w_i|w_{1}^{i-1})}$$

However, at that time we used n-grams to truncate the previous context to the last $N-1$ tokens, so that we could compute meaningful probabilities. With neural models, we let the model decide by itself how to manage the previous context, and thus which tokens are relevant for the prediction.
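As a quick illustration of the chain rule above, the following sketch multiplies the conditional probabilities of a toy sentence. The `cond_prob` table and its values are made up for illustration only; in practice a trained model would produce them. The sum is computed in log space, which is the usual way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1 ... w_{i-1}) for "i go to miami"
cond_prob = {
    ("i",): 0.2,
    ("i", "go"): 0.1,
    ("i", "go", "to"): 0.5,
    ("i", "go", "to", "miami"): 0.05,
}

def sequence_log_prob(tokens, cond_prob):
    """Chain rule in log space: sum of log P(w_i | w_1 ... w_{i-1})."""
    total = 0.0
    for i in range(1, len(tokens) + 1):
        total += math.log(cond_prob[tuple(tokens[:i])])
    return total

tokens = ["i", "go", "to", "miami"]
log_p = sequence_log_prob(tokens, cond_prob)
prob = math.exp(log_p)  # Back to a probability: 0.2 * 0.1 * 0.5 * 0.05
print(prob)
```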


## 4.2 Recurrent Neural Networks (RNN)

One of the most suitable neural architectures for the language modeling task is the Recurrent Neural Network (RNN). The architecture is composed of an RNN layer (vanilla, LSTM, or GRU) and a linear+softmax layer that outputs a probability distribution over the vocabulary. The size of the output vector is therefore equal to the size of the vocabulary, i.e. the model cannot predict tokens that are not present in the vocabulary. <br>

> The LM task with an RNN can be tackled as a sequence labelling task (each input token has an output label) in which the input sequence is $input = \{w_1, w_2, \dots, w_{n-1}\}$ and the output is $output = \{w_2, w_3, \dots, w_{n}\}$



***Example***:
 > For the input sentence ***"I go to Miami"***, the input sequence of the model is ***"I go to"*** and the target/output sequence is ***"go to Miami"***.



***Notice***:

> - To properly model the sequence probabilities we need to add boundary markers \<s\> and \</s\>.

> - However, in LM RNN only the end of sentence token \</s\> is usually used unless we need \<s\> for some reason.
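The shifting described above, together with the end-of-sentence marker, can be sketched in a few lines. The helper name `make_lm_pair` is hypothetical, and the `<eos>` string is the one used later in this notebook.

```python
def make_lm_pair(sentence, eos_token="<eos>"):
    """Return (input, target) token lists for next-token prediction."""
    tokens = sentence.split() + [eos_token]
    # The input drops the last token, the target drops the first one
    return tokens[:-1], tokens[1:]

source, target = make_lm_pair("I go to Miami")
print(source)  # ['I', 'go', 'to', 'Miami']
print(target)  # ['go', 'to', 'Miami', '<eos>']
```

With the end-of-sentence marker appended, the last input token is paired with `<eos>`, so the model also learns where a sentence ends.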

<p align="center">
    <img src="https://i.postimg.cc/zGH99MFY/rnn-lm.png" alt="drawing" width="300"/>
</p>
In the image below you can see a working example of a language model with RNN.
<p align="center">
    <img src="https://i.postimg.cc/fydQNrYP/LM-RNN.png" alt="drawing" width="300"/>
</p>


# 5 Model architecture


Here we define the architecture of our model using PyTorch. In the `__init__` method, we instantiate all the layers that we are going to use. In the `forward` method, we define how the instantiated layers interact; in other words, we design the architecture of the model.
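The notebook imports its model from a separate `model` module that is not shown here, so the following is only a minimal sketch of what such a class might look like. The class name `LM_LSTM_Sketch`, the layer names, and the default dropout values are assumptions; the constructor arguments mirror the call used later in this lab. The output is permuted to `(batch, vocab, seq)` because that is the layout `nn.CrossEntropyLoss` expects for sequence targets, and the model returns raw logits since that loss applies the log-softmax internally.

```python
import torch
import torch.nn as nn

class LM_LSTM_Sketch(nn.Module):
    """Hypothetical embedding -> LSTM -> linear language model."""
    def __init__(self, emb_size, hidden_size, output_size, pad_index=0,
                 out_dropout=0.1, emb_dropout=0.1, n_layers=1):
        super().__init__()
        # Token ids -> dense vectors; the pad vector is kept at zero
        self.embedding = nn.Embedding(output_size, emb_size, padding_idx=pad_index)
        self.emb_dropout = nn.Dropout(emb_dropout)
        self.lstm = nn.LSTM(emb_size, hidden_size, n_layers, batch_first=True)
        self.out_dropout = nn.Dropout(out_dropout)
        # Projects each hidden state onto the vocabulary
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, input_sequence):
        emb = self.emb_dropout(self.embedding(input_sequence))  # (batch, seq, emb)
        lstm_out, _ = self.lstm(emb)                             # (batch, seq, hidden)
        logits = self.output(self.out_dropout(lstm_out))         # (batch, seq, vocab)
        return logits.permute(0, 2, 1)                           # (batch, vocab, seq)

model_sketch = LM_LSTM_Sketch(emb_size=8, hidden_size=16, output_size=20)
dummy = torch.randint(1, 20, (2, 5))  # 2 sequences of 5 token ids
out = model_sketch(dummy)
print(out.shape)  # torch.Size([2, 20, 5])
```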

# 6 Data loading

We are going to see this part in detail in the next lab. For now, let's have an overview.

In [37]:
# Loading the corpus

def read_file(path, eos_token="<eos>"):
    output = []
    with open(path, "r") as f:
        for line in f.readlines():
            output.append(line.strip() + " " + eos_token)
    return output

# Vocab with tokens to ids
def get_vocab(corpus, special_tokens=[]):
    output = {}
    i = 0
    for st in special_tokens:
        output[st] = i
        i += 1
    for sentence in corpus:
        for w in sentence.split():
            if w not in output:
                output[w] = i
                i += 1
    return output

In [38]:
# If you are using Colab, run these lines
!wget -P dataset/PennTreeBank https://raw.githubusercontent.com/BrownFortress/NLU-2024-Labs/main/labs/dataset/PennTreeBank/ptb.test.txt
!wget -P dataset/PennTreeBank https://raw.githubusercontent.com/BrownFortress/NLU-2024-Labs/main/labs/dataset/PennTreeBank/ptb.valid.txt
!wget -P dataset/PennTreeBank https://raw.githubusercontent.com/BrownFortress/NLU-2024-Labs/main/labs/dataset/PennTreeBank/ptb.train.txt


In [39]:

train_raw = read_file("dataset/PennTreeBank/ptb.train.txt")
dev_raw = read_file("dataset/PennTreeBank/ptb.valid.txt")
test_raw = read_file("dataset/PennTreeBank/ptb.test.txt")


In [40]:
# Vocab is computed only on training set
# We add two special tokens end of sentence and padding
vocab = get_vocab(train_raw, ["<pad>", "<eos>"])

In [41]:
len(vocab)

10001

In [42]:
# This class computes and stores our vocab
# Word to ids and ids to word
class Lang():
    def __init__(self, corpus, special_tokens=[]):
        self.word2id = self.get_vocab(corpus, special_tokens)
        self.id2word = {v:k for k, v in self.word2id.items()}
    def get_vocab(self, corpus, special_tokens=[]):
        output = {}
        i = 0
        for st in special_tokens:
            output[st] = i
            i += 1
        for sentence in corpus:
            for w in sentence.split():
                if w not in output:
                    output[w] = i
                    i += 1
        return output


In [43]:
lang = Lang(train_raw, ["<pad>", "<eos>"])

In [44]:
import torch
import torch.utils.data as data

class PennTreeBank (data.Dataset):
    # Mandatory methods are __init__, __len__ and __getitem__
    def __init__(self, corpus, lang):
        self.source = []
        self.target = []

        for sentence in corpus:
            self.source.append(sentence.split()[0:-1]) # We get from the first token till the second-last token
            self.target.append(sentence.split()[1:]) # We get from the second token till the last token
            # See the example in section 4.2

        self.source_ids = self.mapping_seq(self.source, lang)
        self.target_ids = self.mapping_seq(self.target, lang)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, idx):
        src= torch.LongTensor(self.source_ids[idx])
        trg = torch.LongTensor(self.target_ids[idx])
        sample = {'source': src, 'target': trg}
        return sample

    # Auxiliary methods

    def mapping_seq(self, data, lang): # Map sequences of tokens to the corresponding ids computed in the Lang class
        res = []
        for seq in data:
            tmp_seq = []
            for x in seq:
                if x in lang.word2id:
                    tmp_seq.append(lang.word2id[x])
                else:
                    print('OOV found!')
                    print('You have to deal with that') # PennTreeBank doesn't have OOV but "Trust is good, control is better!"
                    break
            res.append(tmp_seq)
        return res

In [45]:
train_dataset = PennTreeBank(train_raw, lang)
dev_dataset = PennTreeBank(dev_raw, lang)
test_dataset = PennTreeBank(test_raw, lang)

In [46]:
from functools import partial
from torch.utils.data import DataLoader

def collate_fn(data, pad_token):
    def merge(sequences):
        '''
        merge from batch * sent_len to batch * max_len
        '''
        lengths = [len(seq) for seq in sequences]
        max_len = 1 if max(lengths)==0 else max(lengths)
        # Pad token is zero in our case
        # So we create a matrix full of PAD_TOKEN (i.e. 0) with the shape
        # batch_size X maximum length of a sequence
        padded_seqs = torch.LongTensor(len(sequences),max_len).fill_(pad_token)
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq # We copy each sequence into the matrix
        padded_seqs = padded_seqs.detach()  # We remove these tensors from the computational graph
        return padded_seqs, lengths

    # Sort data by seq lengths

    data.sort(key=lambda x: len(x["source"]), reverse=True)
    new_item = {}
    for key in data[0].keys():
        new_item[key] = [d[key] for d in data]

    source, _ = merge(new_item["source"])
    target, lengths = merge(new_item["target"])

    new_item["source"] = source.to(DEVICE)
    new_item["target"] = target.to(DEVICE)
    new_item["number_tokens"] = sum(lengths)
    return new_item

# Dataloader instantiation
# You can reduce the batch_size if the GPU memory is not enough
train_loader = DataLoader(train_dataset, batch_size=64, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]),  shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=1024, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))
test_loader = DataLoader(test_dataset, batch_size=1024, collate_fn=partial(collate_fn, pad_token=lang.word2id["<pad>"]))
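To make the padding step in `collate_fn` easier to follow, here is a plain-Python sketch of the same idea without tensors. The function name `pad_batch` and the toy ids are hypothetical; the pad id 0 matches the position of `<pad>` in the vocabulary built above.

```python
def pad_batch(sequences, pad_token=0):
    """Pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(max(lengths), 1)  # Avoid a zero-width batch
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
padded, lengths = pad_batch(batch)
print(padded)        # [[5, 8, 2, 0], [7, 1, 0, 0], [4, 9, 3, 6]]
print(sum(lengths))  # 9 real tokens, padding excluded
```

The sum of the original lengths is what `collate_fn` stores as `number_tokens`, so that losses can later be normalised by the number of real tokens rather than by the padded size.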

# 7 Train and validate the model

In [47]:
import math
import torch.nn as nn
def train_loop(data, optimizer, criterion, model, clip=5):
    model.train()
    loss_array = []
    number_of_tokens = []

    for sample in data:
        optimizer.zero_grad() # Zeroing the gradient
        output = model(sample['source'])
        loss = criterion(output, sample['target'])
        loss_array.append(loss.item() * sample["number_tokens"])
        number_of_tokens.append(sample["number_tokens"])
        loss.backward() # Compute the gradient, deleting the computational graph
        # clip the gradient to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step() # Update the weights

    return sum(loss_array)/sum(number_of_tokens)

def eval_loop(data, eval_criterion, model):
    model.eval()
    loss_to_return = []
    loss_array = []
    number_of_tokens = []
    # softmax = nn.Softmax(dim=1) # Use Softmax if you need the actual probability
    with torch.no_grad(): # Used to avoid building the computational graph
        for sample in data:
            output = model(sample['source'])
            loss = eval_criterion(output, sample['target'])
            loss_array.append(loss.item())
            number_of_tokens.append(sample["number_tokens"])

    ppl = math.exp(sum(loss_array) / sum(number_of_tokens))
    loss_to_return = sum(loss_array) / sum(number_of_tokens)
    return ppl, loss_to_return

def init_weights(mat):
    for m in mat.modules():
        if type(m) in [nn.GRU, nn.LSTM, nn.RNN]:
            for name, param in m.named_parameters():
                if 'weight_ih' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.xavier_uniform_(param[idx*mul:(idx+1)*mul])
                elif 'weight_hh' in name:
                    for idx in range(4):
                        mul = param.shape[0]//4
                        torch.nn.init.orthogonal_(param[idx*mul:(idx+1)*mul])
                elif 'bias' in name:
                    param.data.fill_(0)
        else:
            if type(m) in [nn.Linear]:
                torch.nn.init.uniform_(m.weight, -0.01, 0.01)
                if m.bias is not None:
                    m.bias.data.fill_(0.01)
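The two loops above aggregate the loss in slightly different ways: `train_loop` multiplies each mean batch loss by the number of tokens in the batch, while `eval_loop` relies on a criterion that already sums over tokens. In both cases the total is divided by the total number of tokens, and perplexity is the exponential of that average negative log-likelihood. The sketch below shows the computation on made-up numbers; the variable names and values are hypothetical.

```python
import math

# Hypothetical per-batch summed negative log-likelihoods and token counts
batch_loss_sums = [460.5, 230.2, 345.4]
batch_token_counts = [100, 50, 75]

def perplexity(loss_sums, token_counts):
    """PPL = exp(total negative log-likelihood / total number of tokens)."""
    avg_nll = sum(loss_sums) / sum(token_counts)
    return math.exp(avg_nll)

ppl = perplexity(batch_loss_sums, batch_token_counts)
print(ppl)

# Sanity check: a model that is uniform over 10 words has perplexity 10
uniform_ppl = perplexity([4 * math.log(10)], [4])
print(uniform_ppl)
```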

In [48]:
import torch.optim as optim
from model import LM_LSTM
import torch.nn as nn

# Experiment also with a smaller or bigger model by changing hid and emb sizes
# A large model tends to overfit
hid_size = 600
emb_size = 600
n_layers = 1
# Don't forget to experiment with a lower training batch size
# Increasing the back propagation steps can be seen as a regularization step

# With SGD, try a higher learning rate (> 1 for instance)
lr = 0.003 # This is definitely not good for SGD
clip = 5 # Clip the gradient
device = DEVICE # Reuse the device defined at the top of the notebook

vocab_len = len(lang.word2id)

# model = LM_RNN(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"]).to(device)
model = LM_LSTM(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"], out_dropout=0.2, emb_dropout=0.65, n_layers=n_layers).to(device)
model.apply(init_weights)

optimizer = optim.AdamW(model.parameters(), lr=lr)
criterion_train = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"])
criterion_eval = nn.CrossEntropyLoss(ignore_index=lang.word2id["<pad>"], reduction='sum')

In [49]:
import matplotlib.pyplot as plt
from tqdm import tqdm
import copy
import numpy as np

def main_loop():
    n_epochs = 60
    patience = 3
    losses_train = []
    losses_dev = []
    sampled_epochs = []
    best_ppl = math.inf
    best_model = None
    pbar = tqdm(range(1,n_epochs))
    #If the PPL is too high try to change the learning rate
    for epoch in pbar:
        loss = train_loop(train_loader, optimizer, criterion_train, model, clip)

        if epoch % 1 == 0:
            sampled_epochs.append(epoch)
            losses_train.append(np.asarray(loss).mean())
            ppl_dev, loss_dev = eval_loop(dev_loader, criterion_eval, model)
            losses_dev.append(np.asarray(loss_dev).mean())
            pbar.set_description("PPL: %f" % ppl_dev)
            if  ppl_dev < best_ppl: # the lower, the better
                best_ppl = ppl_dev
                best_model = copy.deepcopy(model).to('cpu')
                patience = 3
            else:
                patience -= 1

            if patience <= 0: # Early stopping with patience
                break # Not nice but it keeps the code clean

    best_model.to(device)
    final_ppl,  _ = eval_loop(test_loader, criterion_eval, best_model)
    print('Test ppl: ', final_ppl)
    return sampled_epochs, losses_train, losses_dev, best_model, final_ppl, best_ppl

PPL: 110.074428:  17%|█▋        | 10/59 [08:04<39:35, 48.48s/it]


Test ppl:  101.82765877542664
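The early-stopping logic inside `main_loop` can be isolated from the training code. The sketch below replays it on a made-up list of validation perplexities; the function name `early_stopping_epoch` and the numbers are hypothetical, while the patience of 3 matches the value used above.

```python
def early_stopping_epoch(dev_ppls, patience=3):
    """Return (epoch at which training stops, best dev PPL seen so far)."""
    best_ppl = float("inf")
    remaining = patience
    for epoch, ppl in enumerate(dev_ppls, start=1):
        if ppl < best_ppl:      # Improvement: store it and reset the patience
            best_ppl = ppl
            remaining = patience
        else:                   # No improvement: consume one unit of patience
            remaining -= 1
        if remaining <= 0:
            return epoch, best_ppl
    return len(dev_ppls), best_ppl

dev_ppls = [150.0, 130.0, 120.0, 121.0, 119.0, 122.0, 123.0, 125.0, 118.0]
stop_epoch, best = early_stopping_epoch(dev_ppls)
print(stop_epoch, best)  # 8 119.0
```

Note that training stops at epoch 8 even though epoch 9 would have been better: patience trades some potential improvement for a shorter training time.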


In [None]:
from model import LM_LSTM
##### MAIN #####
# PART 1.1 -- Simple LSTM network
emb_size = 300
hid_size = 300
out_dropout = 0
emb_dropout = 0
lr = 1
optimizer = optim.SGD(model.parameters(), lr=lr)
# PART 1.2 -- LSTM network with dropout
emb_size = 300
hid_size = 300
out_dropout = 0.2
emb_dropout = 0.65
lr = 1
optimizer = optim.SGD(model.parameters(), lr=lr)

# PART 1.3 -- LSTM network with dropout and AdamW
emb_size = 300
hid_size = 300
out_dropout = 0.2
emb_dropout = 0.65
lr = 0.001

model = LM_LSTM(emb_size, hid_size, vocab_len, pad_index=lang.word2id["<pad>"], 
                out_dropout=out_dropout, emb_dropout=emb_dropout, n_layers=n_layers,
                weight_tying=False).to(device)
# Create the optimizer after the model, so that it tracks the parameters of the new model
optimizer = optim.AdamW(model.parameters(), lr=lr)

# PART 2.1 -- Weight Tying
weight_tying = True
var_dropout = False

# PART 2.2 -- Variational Dropout
weight_tying = True
var_dropout = True

In [None]:
def run():
    pass  # Not implemented yet

In [54]:
import os
import torch

# Create the output folder if it does not exist yet
os.makedirs('model_bin', exist_ok=True)
path = f'model_bin/model{int(final_ppl)}.pt'
torch.save(best_model.state_dict(), path)

If your model makes you happy and you want to reuse it, you have to [save and load it](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
In PyTorch this is straightforward.

In [None]:
# Then you load it
# model.load_state_dict(torch.load(path))

In [None]:
import matplotlib.pyplot as plt

# We assume that losses_train, losses_dev, and sampled_epochs are already defined.
# Here is some example data:
# sampled_epochs = [1, 2, 3, 4, 5]
# losses_train = [0.8, 0.6, 0.4, 0.3, 0.2]
# losses_dev = [0.9, 0.7, 0.5, 0.4, 0.35]

# Create the plot
plt.figure(figsize=(10, 6))  # Set the figure size

plt.plot(sampled_epochs, losses_train, label='Training Loss', marker='o')  # Add the training data
plt.plot(sampled_epochs, losses_dev, label='Validation Loss', marker='s')  # Add the validation data

plt.title('Training and Validation Loss')  # Plot title
plt.xlabel('Epochs')  # x-axis label
plt.ylabel('Loss')  # y-axis label
plt.legend()  # Show the legend

plt.grid(True)  # Add the grid
plt.tight_layout()  # Automatically adjust the layout

plt.show()  # Show the plot

In [None]:
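import torch

sent = "i ran back"
# The data loaders feed the model with tensors of shape (batch, seq_len),
# so we add a batch dimension of size 1 in front of the sequence
input_ids = torch.tensor([lang.word2id[word] for word in sent.split()]).unsqueeze(0).to(DEVICE)
model.eval()
with torch.no_grad():  # No gradients are needed at inference time
    output = model(input_ids)
print(output)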


# Mandatory Exam Exercise
## Part 1 (4 points)
In this part, you have to modify the baseline LM_RNN by adding a set of techniques that might improve its performance. Add one modification at a time, incrementally. If a modification worsens the performance, you can remove it and move on to the others. However, in the report you have to describe and comment on this unsuccessful experiment. For each of your experiments, you have to print the performance expressed as perplexity (PPL).
<br>
One of the important tasks in training a neural network is hyperparameter optimization. Thus, you have to tune the hyperparameters (in particular <b>the learning rate</b>) to minimise the PPL, and print the results achieved with the best configuration.
These are two links to state-of-the-art papers that use a vanilla RNN: [paper1](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5947611), [paper2](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf).

**Mandatory requirements**: For the following experiments the perplexity must be below 250 (***PPL < 250***).

1. Replace the RNN with a Long Short-Term Memory (LSTM) network --> [link](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
2. Add two dropout layers: --> [link](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)
    - one after the embedding layer,
    - one before the last linear layer
3. Replace SGD with AdamW --> [link](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)

## Part 2 (11 points)
**Mandatory requirements**: For the following experiments the perplexity must be below 250 (***PPL < 250***) and it should be lower than the one achieved in Part 1.1 (i.e. base LSTM).

Starting from the `LM_RNN` in which you replaced the RNN with an LSTM, apply the following regularisation techniques:
- Weight Tying
- Variational Dropout (no DropConnect)
- Non-monotonically Triggered AvSGD

These techniques are described in [this paper](https://openreview.net/pdf?id=SyyGPP0TZ).
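As a hint for the last technique, the "non-monotonic" trigger can be written as a small check on the history of validation perplexities: averaging starts when the current value is worse than the best value seen at least `nonmono` evaluations ago. The sketch below is one reading of the criterion described in the paper, not a complete solution; the function name `should_switch_to_averaging` and the parameter name `nonmono` are hypothetical. What to do once the trigger fires (for example switching to an averaged optimizer such as `torch.optim.ASGD`) is left to your implementation.

```python
def should_switch_to_averaging(val_ppl, logs, nonmono=5):
    """Non-monotonic trigger: True when the current validation perplexity
    is worse than the best one recorded at least `nonmono` checks ago.

    `logs` holds the validation perplexities of the previous checks only.
    """
    t = len(logs)
    return t > nonmono and val_ppl > min(logs[:t - nonmono])

history = [200.0, 180.0, 170.0, 165.0, 166.0, 167.0, 168.0]
print(should_switch_to_averaging(166.0, history, nonmono=3))       # True: 166 > 165
print(should_switch_to_averaging(160.0, history, nonmono=3))       # False: still improving
print(should_switch_to_averaging(300.0, history[:3], nonmono=3))   # False: not enough history
```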
