
# Lab 3, Part 2 : Feed Forward Neural Language Models

# About this lab

In this session, you will experiment with feed-forward neural language models (FFLM) using [PyTorch](https://www.pytorch.org). To train the models, you will be using the [WikiText-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) corpus, which is a popular LM dataset introduced in 2016:

> The WikiText language modeling dataset is a collection of texts extracted from Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), `WikiText-2` is over 2 times larger. The dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.

Goals of this lab:
* Understand feed-forward networks (FFN)
* Train a feed-forward language model (FFLM)
* Use PyTorch

This part should take you about 3 hours.

## Downloading Stuff & Setting Up the Environment

In [None]:
%%bash
# Download the corpus (the %%bash cell magic must be the first line of the cell)
URL="https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

for split in "train" "valid" "test"; do
  if [ ! -f "${split}.txt" ]; then
    echo "Downloading ${split}.txt"
    wget -q "${URL}/${split}.txt"
    # Remove empty lines
    sed -i '/^ *$/d' "${split}.txt"
    # Remove article titles starting with = and ending with =
    sed -i '/^ *= .* = $/d' "${split}.txt"
  fi
done

# Prepare smaller version for fast training neural LMs
head -n 5000 < train.txt > train_small.txt

# Print the first 10 lines with line numbers
cat -n train.txt | head -n10
echo

# Print some statistics
echo -e "\n   Line,   word,   character counts"
wc *.txt



Downloading train.txt
Downloading valid.txt
Downloading test.txt
     1	 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 
     2	 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making th

In [None]:
# in order to allow deterministic behaviour, that is, make results reproducible
%env CUBLAS_WORKSPACE_CONFIG=:4096:8

env: CUBLAS_WORKSPACE_CONFIG=:4096:8


In [None]:
import math
import time
import random
import numpy as np
# Fancy progress bar
from tqdm import tqdm
import torch
from torch import nn

###############
# Torch setup #
###############
print('Torch version: {}, CUDA: {}'.format(torch.__version__, torch.version.cuda))
cuda_available = torch.cuda.is_available()
if not cuda_available:
  print('WARNING: You may want to change the runtime to GPU for Neural LM experiments!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'

#######################
# Some helper functions
#######################
def fix_seed(seed=None):
  """Sets the seeds of random number generators."""
  torch.use_deterministic_algorithms(True)
  if seed is None:
    # Take a random seed
    seed = time.time()
  seed = int(seed)
  # Seed Python's own RNG as well, since `random` is imported above
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  return seed

def readable_size(n):
  """Returns a readable size string for model parameters count."""
  sizes = ['K', 'M', 'G']
  fmt = ''
  size = n
  for i, suffix in enumerate(sizes):
    scaled = n / (1000 ** (i + 1))
    if scaled >= 1:
      size = scaled
      fmt = suffix
    else:
      break
  return '%.2f%s' % (size, fmt)

Torch version: 2.0.0+cu118, CUDA: 11.8


# Feed-forward Language Models (FFLM)

FFLMs are similar to $n$-gram language models in the sense that the choice of $n$ is a hyperparameter of the network architecture. A basic FFLM uses a context window of length $C = n - 1$, made of the words that precede the word to be predicted. If the word embedding size is $E$, the feature vector for the context window is a vector of size $E \times C$, obtained by **concatenating** the embeddings of the individual context words. Hence, the choice of $C$ affects the number of learnable parameters in the network.
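
As a minimal illustration of the concatenation step, the sketch below uses plain Python lists instead of PyTorch tensors. The toy embedding table, its values and the helper name `context_vector` are made up for this example; only the shape arithmetic ($E \times C$) mirrors the description above.

```python
# Toy embedding table: 5 word IDs, embedding size E = 3 (values are arbitrary).
E = 3
embeddings = [[float(10 * i + j) for j in range(E)] for i in range(5)]

def context_vector(context_ids):
    """Concatenates the embeddings of the context words into one flat vector."""
    vec = []
    for idx in context_ids:
        vec.extend(embeddings[idx])
    return vec

# A 3-gram model (n = 3) looks at the C = n - 1 = 2 previous words.
C = 2
ctx = context_vector([4, 1])
print(len(ctx))  # 6, i.e. E * C
print(ctx)       # [40.0, 41.0, 42.0, 10.0, 11.0, 12.0]
```

Doubling $C$ doubles the length of this vector, and therefore the number of input weights of whatever layer consumes it.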

## A - Dataset Stuff

### Representing the vocabulary

The `Vocabulary` class below encapsulates the **word-to-idx** and **idx-to-word** mappings that you should now be familiar with from the previous lab sessions. Read it to understand how the vocabulary is constructed from a plain text file in the `build_from_file()` method. Special `<.>` markers are also included in the vocabulary.

In [None]:
class Vocabulary(object):
  """Data structure representing the vocabulary of a corpus."""
  def __init__(self):
    # Mapping from tokens to integers
    self._word2idx = {}

    # Reverse-mapping from integers to tokens
    self.idx2word = []

    # 0-padding token
    self.add_word('<pad>')
    # sentence start
    self.add_word('<s>')
    # sentence end
    self.add_word('</s>')
    # Unknown words
    self.add_word('<unk>')

    self._unk_idx = self._word2idx['<unk>']

  def word2idx(self, word):
    """Returns the integer ID of the word or <unk> if not found."""
    return self._word2idx.get(word, self._unk_idx)

  def add_word(self, word):
    """Adds the `word` into the vocabulary."""
    if word not in self._word2idx:
      self.idx2word.append(word)
      self._word2idx[word] = len(self.idx2word) - 1

  def build_from_file(self, fname):
    """Builds a vocabulary from a given corpus file."""
    with open(fname) as f:
      for line in f:
        words = line.strip().split()
        for word in words:
          self.add_word(word)

  def convert_idxs_to_words(self, idxs):
    """Converts a list of indices to words."""
    return ' '.join(self.idx2word[idx] for idx in idxs)

  def convert_words_to_idxs(self, words):
    """Converts a list of words to a list of indices."""
    return [self.word2idx(w) for w in words]

  def __len__(self):
    """Returns the size of the vocabulary."""
    return len(self.idx2word)
  
  def __repr__(self):
    return "Vocabulary with {} items".format(self.__len__())

Let's construct the vocabulary for the training set and analyse the token indices for a sentence with an unknown word.




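* **Why do we map unknown tokens to a special `<unk>` token?**
When a model encounters an out-of-vocabulary (OOV) word at test time, it has no index, and therefore no embedding and no output unit, for that word, so it cannot assign it a probability. This leads to incorrect probability estimates and a higher perplexity. Mapping every unknown word to `<unk>` lets the model still score the sentence. It also keeps the vocabulary smaller, which makes the model more robust to rare and unseen words.
* **Do you think the network will learn a useful embedding for that? If not, how can you let the network learn an embedding for it?**
Only if `<unk>` actually occurs in the training data: a token that is never seen as input during training receives no gradient, so its embedding stays at its random initialization. In WikiText-2 the training text already contains `<unk>` tokens (see the first lines of `train.txt` printed above), so an embedding is learned here, although `<unk>` stands in for many different words and therefore carries little specific semantic or syntactic information. For a corpus without such tokens, rare training words can be replaced with `<unk>` before training. We can also consider using pre-trained embeddings such as GloVe or fastText, which have been trained on large amounts of text data.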
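
One common way to give the network training examples for `<unk>` is to replace rare training words with it before the vocabulary is built. The sketch below is a minimal illustration of that idea, not part of the lab code: `replace_rare_words` and its `min_count` threshold are hypothetical names chosen for this example.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2, unk="<unk>"):
    """Replaces words seen fewer than `min_count` times with `unk`."""
    counts = Counter(w for s in sentences for w in s.split())
    return [
        " ".join(w if counts[w] >= min_count else unk for w in s.split())
        for s in sentences
    ]

toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(replace_rare_words(toy_corpus))
# ['the <unk> sat', 'the <unk> sat', 'the <unk> <unk>']
```

After this step, `<unk>` appears in real contexts, so its embedding is updated like that of any other token.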

In [None]:
vocab = Vocabulary()
vocab.build_from_file('train.txt')
print(vocab)

# TODO : Convert sentence to list of indices. Note how "living," and "dying." are
# mapped to 3 (<unk>): the punctuation is attached to the words, so they are OOV
sentence = "<s> Get busy living, or get busy dying. </s>"

print(vocab.convert_words_to_idxs(sentence.split()))

Vocabulary with 33233 items
[1, 11959, 6645, 3, 310, 4098, 6645, 3, 2]
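
The two `3`s in the output above come from `living,` and `dying.`, which are looked up with the punctuation still attached. The corpus itself separates punctuation with spaces (for example `Japan , is` in the sample printed earlier), so the input should be tokenized the same way. A minimal sketch, assuming a simple regular expression is good enough for this sentence (`split_punctuation` is a hypothetical helper, and it does not reproduce every WikiText convention such as `@-@`):

```python
import re

def split_punctuation(text):
    """Splits text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Add the sentence markers after tokenizing, so that they are not split up
tokens = ["<s>"] + split_punctuation("Get busy living, or get busy dying.") + ["</s>"]
print(tokens)
# ['<s>', 'Get', 'busy', 'living', ',', 'or', 'get', 'busy', 'dying', '.', '</s>']
```

Whether each resulting token is then found in the vocabulary still depends on the training corpus.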


### Representing the corpus

Let's process the corpus for PyTorch: each split will end up as one large 1D sequence of token indices. Note that, in `corpus_to_tensor()`, every line is wrapped between `<s> .. </s>` tags.

In [None]:
def corpus_to_tensor(_vocab, filename):
  # Final token indices
  idxs = []  
  with open(filename) as data:
    for line in tqdm(data, ncols=80, unit=' line', desc=f'Reading {filename} '):
      line = line.strip()
      # Skip empty lines if any
      if line:
        # Each line is considered as a long sentence for WikiText-2,
        # so wrap it in sentence markers
        line = f"<s> {line} </s>"
        # Split on whitespace and convert the tokens to indices
        idxs.extend(_vocab.convert_words_to_idxs(line.split()))
  return torch.tensor(idxs, dtype=torch.long)

In [None]:
# Read the files, prepare the small one as well
train = corpus_to_tensor(vocab, 'train.txt')
train_small = corpus_to_tensor(vocab, 'train_small.txt')

valid = corpus_to_tensor(vocab, 'valid.txt')
test = corpus_to_tensor(vocab, 'test.txt')
print('\n')

print(f'Small training size in tokens: {readable_size(len(train_small))}')
print(f'Training size in tokens: {readable_size(len(train))}')
print(f'Validation size in tokens: {readable_size(len(valid))}')
print(f'Test size in tokens: {readable_size(len(test))}')

Reading train.txt : 17556 line [00:00, 21745.01 line/s]
Reading train_small.txt : 5000 line [00:00, 23217.23 line/s]
Reading valid.txt : 1841 line [00:00, 19509.87 line/s]
Reading test.txt : 2185 line [00:00, 24267.52 line/s]



Small training size in tokens: 568.70K
Training size in tokens: 2.04M
Validation size in tokens: 213.02K
Test size in tokens: 240.22K





In [None]:
print(train.shape)

torch.Size([2042258])


**Q: Print the first 20 token indices from the training set. And then print the sentence in actual words corresponding to these 20 tokens by using one of the provided methods in the `Vocabulary` class.**

In [None]:
########
# Answer
########
print(train[:20])
# Convert the tensor to a plain list of ints before indexing into the vocabulary
print(vocab.convert_idxs_to_words(train[:20].tolist()))

tensor([ 1,  4,  5,  6,  7,  8,  3,  9, 10, 11,  8, 12, 13, 14, 15,  6, 16, 17,
        18,  7])
<s> Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3


## B - Model definition

Now that we are done with data loading and vocabulary construction, we can define the actual FFLM model in PyTorch. Recall from the lectures that this model requires a pre-defined context window size $C$ which will affect the way you set up some of the linear layers. **Note that**, in contrast to the model depicted in the lecture, this model has an additional layer `ff_ctx`, which projects the context vector $c_k$ to hidden dimension $H$. This ensures that the number of parameters in the output layer does not depend on the context size, i.e. it is always $H\times V$ instead of $CE\times V$.
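
To make the effect of `ff_ctx` on model size concrete, the parameter counts can be worked out by hand. The sketch below is only arithmetic: the helper names `fflm_param_count` and `direct_output_params` are made up for illustration, and the sizes ($V = 33233$, $E = H = 128$) are the ones used in this lab.

```python
def fflm_param_count(V, E, H, C):
    """Counts parameters of embedding + ff_ctx + output layer (with biases)."""
    emb = V * E              # embedding table
    ff_ctx = C * E * H + H   # context projection: weights + bias
    out = H * V + V          # output layer: weights + bias
    return emb + ff_ctx + out

def direct_output_params(V, E, C):
    """Output layer size if it read the C*E context vector directly."""
    return C * E * V + V

V, E, H = 33233, 128, 128
for C in (2, 7):
    print(C, fflm_param_count(V, E, H, C))
# 2 8573777
# 7 8655697
print(direct_output_params(V, E, C=7))
# 29810001
```

With `ff_ctx`, going from $C = 2$ to $C = 7$ adds only about 82K parameters, whereas an output layer fed directly with the context vector would by itself need almost 30M parameters for $C = 7$.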

---

**Q: Follow the comments in `__init__()` and `forward()` to fill in the missing parts with some actual code.**

In [None]:
class FFLM(nn.Module):
  def __init__(self, vocab_size, emb_dim, hid_dim, context_size, dropout=0.5):
    # Call parent's __init__ first
    super(FFLM, self).__init__()
    
    # Store arguments
    self.vocab_size = vocab_size
    self.emb_dim = emb_dim
    self.hid_dim = hid_dim
    self.context_size = context_size

    # Create the loss, don't sum or average, we'll take care of it
    # in the training loop for logging purposes
    self.loss = nn.CrossEntropyLoss(reduction='none')

    # Create the non-linearity
    self.nonlin = torch.nn.Tanh()

    # Dropout regularizer
    self.drop = nn.Dropout(p=dropout)

    ##############################
    # Fill the missing parts below
    ##############################
    # TODO : Compute the dimension of the context vector
    self.context_dim = self.emb_dim * self.context_size
    
    # Create the embedding layer (i.e. lookup table tokens->vectors)
    self.emb = nn.Embedding(
        num_embeddings=self.vocab_size, embedding_dim=self.emb_dim,
        padding_idx=0)
 
    # This cuts the number of parameters a bit
    self.ff_ctx = nn.Linear(self.context_dim, self.hid_dim)

    ############################################
    # Output layer mapping from the output of `ff_ctx` to vocabulary size
    # TODO : Fill the dimensions of the output layer
    ############################################
    self.out =  nn.Linear(self.hid_dim, self.vocab_size)

    # Purely for informational purposes: compute # of total params
    self.n_params = readable_size(sum(p.numel() for p in self.parameters()))
      
  def forward(self, x, y):
    """Forward-pass of the module."""
    # TODO : What is the shape of x ?
    # x has shape (batch_size, context_size): one row of context token IDs per sample
    # TODO : Get the embeddings for the token indices in `x`
    batch_size = x.size()[0]
    embs = self.emb(x)
    # TODO : Concatenate the embeddings to form the context vector
    ctx = embs.view(batch_size, -1)


    # TODO : Apply ff_ctx -> non-lin -> dropout -> output layer to obtain the logits i.e. unnormalized scores   
    logits = self.out(self.drop(self.nonlin(self.ff_ctx(ctx))))
  
    # TODO : Use self.loss to compute the losses, return the losses (true labels are in `y`)
    return self.loss(logits.view(-1, logits.size(-1)), y)

  def get_batches(self, data_tensor, batch_size=64):
    """Returns a tensor of size (n_batches, batch_size, context_size + 1)."""
    # Split data into rows of n-grams followed by the (n+1)th true label
    x_y = data_tensor.unfold(0, self.context_size + 1, step=1)

    # Get the number of training n-grams
    n_samples = x_y.size()[0]

    # Hack: discard the last uneven batch for simplicity
    n_batches = n_samples // batch_size
    n_samples = n_batches * batch_size
    # Split nicely into batches, i.e. (n_batches, batch_size, context_size + 1)
    # The final element in each row is the ID of the true label to predict
    x_y = x_y[:n_samples].view(n_batches, batch_size, -1)

    # A particular batch for context_size=2 will now look like below in
    # word format. Last element for every array is the next token to be predicted
    #
    # [[<s>, cat, sat],
    #  [cat, sat, on],
    #  [sat, on,  the],
    #  [on,  the, mat],
    #   ....
    return x_y

  def train_model(self, optim, train_tensor, valid_tensor, test_tensor, n_epochs=5,
                 batch_size=64, shuffle=False):
    """Trains the model."""
    # Get batches for the training data
    batches = self.get_batches(train_tensor, batch_size)
    
    print(f'Will do {batches.size(0)} batches for an epoch.')

    for eidx in range(1, n_epochs + 1):
      start_time = time.time()
      epoch_loss = 0
      epoch_items = 0

      # Enable training mode
      self.train()

      # Shuffle the batch order or not
      if shuffle:
        batch_order = torch.randperm(batches.size(0))
      else:
        batch_order = torch.arange(batches.size(0))

      # Start training
      for iter_count, idx in enumerate(batch_order):
        batch = batches[idx].to(DEVICE)

        # TODO : Split into inputs `x` and labels `y`. Hint : Look at the context_size
        x, y = batch[:, :-1], batch[:, -1]

        # Clear the gradients
        optim.zero_grad()

        # TODO : Compute the per-sample losses with a forward pass
        loss = self(x, y)

        # Backprop the average loss and update parameters
        loss.mean().backward()
        optim.step()

        # sum the loss for reporting, along with the denominator
        epoch_loss += loss.detach().sum()
        epoch_items += loss.numel()

        if iter_count % 1000 == 0:
          # Print progress
          loss_per_token = epoch_loss / epoch_items
          ppl = math.exp(loss_per_token)
          print(f'[Epoch {eidx:<3}] loss: {loss_per_token:6.2f}, perplexity: {ppl:6.2f}')

      time_spent = time.time() - start_time

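      # Recompute the statistics over the whole epoch: the values left over
      # from the last progress print can be up to 999 batches old
      loss_per_token = epoch_loss / epoch_items
      ppl = math.exp(loss_per_token)
      print(f'\n[Epoch {eidx:<3}] ended with train_loss: {loss_per_token:6.2f}, ppl: {ppl:6.2f}')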
      # Evaluate on valid set
      valid_loss, valid_ppl = self.evaluate(test_set=valid_tensor)
      print(f'[Epoch {eidx:<3}] ended with valid_loss: {valid_loss:6.2f}, valid_ppl: {valid_ppl:6.2f}')
      print(f'[Epoch {eidx:<3}] completed in {time_spent:.2f} seconds\n')

    # Evaluate the final model on test set
    test_loss, test_ppl = self.evaluate(test_set=test_tensor)
    print(f' ---> Final test set performance: {test_loss:6.2f}, test_ppl: {test_ppl:6.2f}')

  def evaluate(self, test_set, batch_size=32):
    """Evaluates and computes perplexity for the given test set."""
    loss = 0

    # Get the batches
    batches = self.get_batches(test_set, batch_size)

    # Set your model to Eval mode
    self.eval()

    with torch.no_grad():
        for batch in batches:
            batch = batch.to(DEVICE)

            # Split into inputs `x` and labels `y`
            x = batch[:, :-1]
            y = batch[:, -1]

            # Compute the loss for this batch
            loss += self.forward(x, y).sum()

    # Normalize by the number of tokens in the test set
    loss /= batches.size()[:2].numel()

    # Switch back to training mode
    self.train()

    # Return the average loss and the perplexity (in that order)
    return loss, math.exp(loss)


  def __repr__(self):
    """String representation for pretty-printing."""
    s = super(FFLM, self).__repr__()
    return f"{s}\n# of parameters: {self.n_params}"
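
The sliding window that `get_batches()` builds with `data_tensor.unfold(0, self.context_size + 1, step=1)` can be reproduced with plain Python lists. This is only an illustrative sketch (`sliding_windows` is a hypothetical helper, and no batching is done here):

```python
def sliding_windows(tokens, context_size):
    """Returns every run of `context_size + 1` consecutive tokens (step 1)."""
    width = context_size + 1
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

tokens = ["<s>", "cat", "sat", "on", "the", "mat"]
for row in sliding_windows(tokens, context_size=2):
    # The last element of each row is the label, the rest is the context
    context, target = row[:-1], row[-1]
    print(context, "->", target)
# ['<s>', 'cat'] -> sat
# ['cat', 'sat'] -> on
# ['sat', 'on'] -> the
# ['on', 'the'] -> mat
```

A sequence of $N$ tokens yields $N - C$ training samples, which is why the split `batch[:, :-1]` / `batch[:, -1]` gives the inputs and the labels.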

## C - Training

We can now launch training using a set of sane hyper-parameters for our model. The configuration below sets the context size to 7, so in $n$-gram terms this is an 8-gram FFLM ($C = n - 1$). On a Colab GPU, a single epoch should take around 1 minute.

In [None]:
# Set the seed for reproducible results
fix_seed(42)

fflm_model = FFLM(
    len(vocab),       # vocabulary size
    emb_dim=128,      # word embedding dim
    hid_dim=128,      # hidden layer dim
    context_size=7,   # C = (N-1) if you think in n-gram LM terminology
    dropout=0.3,      # dropout probability
)
print(len(vocab))

# move to device
fflm_model.to(DEVICE)

# Initial learning rate for the optimizer
FFLM_INIT_LR = 0.001

# Create the optimizer
fflm_optimizer = torch.optim.Adam(fflm_model.parameters(), lr=FFLM_INIT_LR)
print(fflm_model)

print('Starting training!')
# NOTE: If you happen to have memory errors, try decreasing the batch size
# It will print progress every 1000 batches
fflm_model.train_model(fflm_optimizer, train, valid, test, n_epochs=5, batch_size=128, shuffle=True)

33233
FFLM(
  (loss): CrossEntropyLoss()
  (nonlin): Tanh()
  (drop): Dropout(p=0.3, inplace=False)
  (emb): Embedding(33233, 128, padding_idx=0)
  (ff_ctx): Linear(in_features=896, out_features=128, bias=True)
  (out): Linear(in_features=128, out_features=33233, bias=True)
)
# of parameters: 8.66M
Starting training!
Will do 15955 batches for an epoch.
[Epoch 1  ] loss:  10.48, perplexity: 35606.58
[Epoch 1  ] loss:   7.37, perplexity: 1583.40
[Epoch 1  ] loss:   7.19, perplexity: 1329.13
[Epoch 1  ] loss:   7.08, perplexity: 1188.81
[Epoch 1  ] loss:   7.01, perplexity: 1109.15
[Epoch 1  ] loss:   6.95, perplexity: 1046.94
[Epoch 1  ] loss:   6.90, perplexity: 996.37
[Epoch 1  ] loss:   6.86, perplexity: 957.72
[Epoch 1  ] loss:   6.83, perplexity: 922.35
[Epoch 1  ] loss:   6.79, perplexity: 892.86
[Epoch 1  ] loss:   6.76, perplexity: 865.89
[Epoch 1  ] loss:   6.73, perplexity: 840.95
[Epoch 1  ] loss:   6.71, perplexity: 817.08
[Epoch 1  ] loss:   6.68, perplexity: 796.57
[Epoch 1

**Q: If everything goes well, the first loss printed should be around 10.4. This remains the case if you change the random seed to some other number before model construction, i.e. the value does not depend on the exact initial values of the parameters.**
* **Can you come up with a simple mathematical formula which yields that value?**

In [None]:
##########################
# Answer to question above
##########################
print(math.log(fflm_model.vocab_size))

10.411298637141247
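
The reasoning behind this value can be sketched as follows, assuming that a freshly initialized model produces nearly equal logits, so that the softmax is close to a uniform distribution over the $V$ vocabulary entries. Each word then gets probability $1/V$ and the cross-entropy is $-\ln(1/V) = \ln V$. The function name below is hypothetical.

```python
import math

def cross_entropy_uniform(vocab_size):
    """Cross-entropy (in nats) of a uniform distribution over `vocab_size` words."""
    p_correct = 1.0 / vocab_size
    return -math.log(p_correct)

V = 33233
loss = cross_entropy_uniform(V)
print(round(loss, 2))         # 10.41
print(round(math.exp(loss)))  # 33233
```

The corresponding perplexity is simply $V$: an untrained model is as uncertain as a uniform guess over the whole vocabulary. The first loss actually printed (10.48) is slightly higher because the initial logits are not exactly equal.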


## D - Further Exploring

With the default settings above, you should end up with a validation perplexity of $\sim1076$ and a final test set perplexity of $\sim1003$ at the end of the 5th epoch. Try the following questions to further analyze the model's predictions.

---

* **Q: Remove the `tanh()` non-linearity from the code so that the context is computed as a linear combination of its embeddings. How do the results compare to the initial ones? Do you think non-linearity helps?**

  **A: With the same hyperparameters, the non-linearity helps only slightly: test_ppl is 242.32 with `tanh()` and 251.98 without it.**
* **Q: Compare the results by rerunning the training with unshuffled batches i.e. with `shuffle=False`. What do you notice in terms of results?**

  **A: Training with shuffled batches performs slightly better. Shuffled batches:  ---> Final test set performance:   5.49, test_ppl: 242.32. Unshuffled batches:  ---> Final test set performance:   5.60, test_ppl: 269.26**

* **Q: Play with hyper-parameters related to dimensions and dropout. Could you find a model with smaller perplexity?**

    **A: With the hidden dimension changed to 256, the result is: Final test set performance:   5.51, test_ppl: 247.81**
    
    **The best performance so far was obtained after reducing `batch_size` to 128 and `dropout` to 0.3:  ---> Final test set performance:   5.48, test_ppl: 237.39**

    **A higher learning rate of 0.005 gives worse performance:  ---> Final test set performance:   5.93, test_ppl: 375.43**

* **Q: Try with different context sizes such as 3, 5, 7, etc. What is the best perplexity you can get?**



```
fflm_model = FFLM(
    len(vocab),       # vocabulary size
    emb_dim=128,      # word embedding dim
    hid_dim=128,      # hidden layer dim
    context_size=3,   # C = (N-1) if you think in n-gram LM terminology
    dropout=0.3,      # dropout probability
)

# move to device
fflm_model.to(DEVICE)

# Initial learning rate for the optimizer
FFLM_INIT_LR = 0.001

# Create the optimizer
fflm_optimizer = torch.optim.Adam(fflm_model.parameters(), lr=FFLM_INIT_LR)
print(fflm_model)

print('Starting training!')
# NOTE: If you happen to have memory errors, try decreasing the batch size
# It will print progress every 1000 batches
fflm_model.train_model(fflm_optimizer, train, valid, test, n_epochs=5, batch_size=128, shuffle=True)
```
Results for the different context sizes:

* context size 3:  ---> Final test set performance:   5.44, test_ppl: 230.66
* context size 5:  ---> Final test set performance:   5.43, test_ppl: 229.18
* context size 7:  ---> Final test set performance:   5.48, test_ppl: 239.34

The best perplexity among these runs, 229.18, is obtained with a context size of 5.


## E - Further Reading for your knowledge
 - [Original FFLM paper from Bengio et al. 2003](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
 - [Original RNNLM paper from Mikolov et al. 2010](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
 - Some recent state-of-the-art LSTM-based RNNLMs
    - [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)
    - [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)
    - [Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours](https://mlsys.org/Conferences/2019/doc/2018/50.pdf)