
## LSTM model in PyTorch

In [None]:
# Install and load components
!pip install tokenizers
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/e9/ee/fedc3509145ad60fe5b418783f4a4c1b5462a4f0e8c7bbdbda52bdcda486/tokenizers-0.8.1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.8.1
--2020-09-30 15:34:48--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.45.238
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.45.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2020-09-30 15:34:48 (1.15 MB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]



In [None]:
# Imports
import numpy as np
import nltk 
import re
import string
from tokenizers import BertWordPieceTokenizer
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch import optim

In [None]:
# Use GPU when present
device = (torch.device('cuda') if torch.cuda.is_available()
          else torch.device('cpu'))
print(f'Training on device: {device}')

Training on device: cuda


### Define Dataset class

We treat the whole document as one continuous stream of text and cut it into overlapping sequences.

Each sequence has a fixed length, e.g. 50+1 words: words 1-50 are the inputs and word 51 is the ground truth that the model is trained to predict.

Ignoring punctuation, verse numbers etc. we would have:

`"In the beginning God created the heaven and the earth And the earth was without form, and void; and darkness was upon the face of the deep And the Spirit of God moved upon the face of the waters"`

which would then become (assuming a length of 6+1):
 
`"In the beginning God created the heaven"`

`"the beginning God created the heaven and"`

`"beginning God created the heaven and the"`

`"God created the heaven and the earth"`

And so on. The last word of each of these sequences is what the model will learn to predict.
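
The windowing described above can be sketched in plain Python. This is a minimal illustration on whitespace-split words rather than tokenizer ids; the helper name `sliding_windows` is hypothetical and not part of the notebook.

```python
def sliding_windows(items, sequence_length):
    """Return every run of sequence_length + 1 consecutive items."""
    window = sequence_length + 1  # the inputs plus one target
    return [items[i:i + window] for i in range(len(items) - window + 1)]


words = "In the beginning God created the heaven and the earth".split()

for seq in sliding_windows(words, sequence_length=6):
    inputs, target = seq[:-1], seq[-1]  # the last item is the word to predict
    print(" ".join(inputs), "->", target)
```

Each printed line shows the six input words followed by the word the model would be asked to predict.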

In [None]:
class BibleText(Dataset):
  """
  This class requires an initialised tokenizer whose `encode` method returns an
  object with an `ids` attribute (the word ids). It also requires the following packages:
  - re
  - string
  - numpy as np
  """

  def __init__(self, raw_text, sequence_length, tokenizer):
    self.raw_text = raw_text
    self.sequence_length = sequence_length
    self.tokenizer = tokenizer # using a pre-trained initialized tokenizer
    self._init_dataset()
  
  def _init_dataset(self):
    # clean text
    clean_text = self.clean_text(self.raw_text)
    # apply tokenizer to clean text to encode it
    output = self.tokenizer.encode(clean_text)
    # extract single sequence of word ids, as generated by the tokenizer
    word_ids = output.ids
    # generate the sequences (length = sequence_length + 1)
    sequences = self.build_sequences(word_ids, self.sequence_length)
    # split inputs and targets
    sequences_array = np.array(sequences)
    inputs_lists = sequences_array[:,:-1]
    self.targets = sequences_array[:,-1]
    # convert inputs into tensors
    self.inputs = torch.tensor(inputs_lists) # len(sequences) * sequence_length

  @staticmethod
  def clean_text(text):
    """
    Removes line breaks, punctuation and verse numbers. 
    Returns text in lower case as a single stream of text
    """
    # Remove line breaks
    doc = text.replace('\n', ' ')
    # Remove verse numbers such as 1:1 or 16:16 (one or more digits either side of the colon)
    doc_no_verses = re.sub(r"[0-9]+:[0-9]+", " ", doc)
    # Collapse runs of two or more spaces or tabs into a single space
    doc_no_spaces = re.sub(r"[ \t]{2,}", " ", doc_no_verses)
    # Remove punctuation
    doc_no_punct = doc_no_spaces.translate(str.maketrans('', '', string.punctuation))
    # Lower case
    return doc_no_punct.lower()

  @staticmethod
  def build_sequences(ids, sequence_length):
    """
    Use the ids provided by the tokenizer to build sequences of the desired length.
    Returns a list of sequences of the desired length + 1 (the target word).
    """
    length = sequence_length + 1 # add target token at the end
    sequences = list()

    for id in range(length, len(ids)):
      selected_ids = ids[id-length: id] # select the ids
      sequences.append(selected_ids)
    
    return sequences

  def __len__(self):
    return len(self.inputs)

  def __getitem__(self, id):
    """
    returns a tuple with:
    - tensor with input sequence (1 x sequence_length)
    - target word id
    """
    return self.inputs[id], self.targets[id]
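
To sanity-check the cleaning step in isolation, the sketch below mirrors the steps of `clean_text` on a short excerpt of the raw text. It is a standalone illustration rather than the class method itself: the function name `clean_sample` is hypothetical, and the verse pattern `[0-9]+:[0-9]+` (one or more digits on both sides of the colon) is an assumption chosen so that multi-digit chapter numbers such as `16:16` are removed whole.

```python
import re
import string


def clean_sample(text):
    """Standalone copy of the cleaning steps, for experimentation only."""
    doc = text.replace("\n", " ")                   # remove line breaks
    doc = re.sub(r"[0-9]+:[0-9]+", " ", doc)        # remove chapter:verse markers
    doc = re.sub(r"[ \t]{2,}", " ", doc)            # collapse repeated whitespace
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    return doc.lower()


sample = ("16:16 And Abram was fourscore and six years old, "
          "when Hagar bare\nIshmael to Abram.")
print(clean_sample(sample))
```

A quick check like this makes it easy to spot stray digits or punctuation before the text reaches the tokenizer.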

In [None]:
# We also define a helper function to convert a list of ids into both a
# list of tokens and the complete sentence
def ids_to_tokens(ids_list, vocab):
  """
  Given a list of ids and the vocabulary they come from
  Returns the corresponding tokens, both as list and a string
  """
  tokens = list() # initialise token list
  for id in ids_list:
    # find token
    token = next(key for key, value in vocab.items() if value == id)
    # add it to list
    tokens.append(token)
  
  # Join in a single string
  string = ' '.join(tokens)

  return tokens, string
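
`ids_to_tokens` scans the vocabulary once for every id it looks up. For long lists, one alternative (a sketch, not the notebook's implementation) is to invert the vocabulary once and then use plain dictionary lookups. The names `build_inverse_vocab` and `ids_to_tokens_fast` are hypothetical, and `toy_vocab` is a three-entry stand-in for the real vocabulary.

```python
def build_inverse_vocab(vocab):
    """Map id -> token, assuming every id appears exactly once in vocab."""
    return {idx: token for token, idx in vocab.items()}


def ids_to_tokens_fast(ids_list, inverse_vocab):
    """Return the tokens for ids_list, both as a list and as one string."""
    tokens = [inverse_vocab[i] for i in ids_list]
    return tokens, " ".join(tokens)


# Toy stand-in for tokenizer.get_vocab()
toy_vocab = {"the": 1996, "king": 2332, "james": 2508}
inverse = build_inverse_vocab(toy_vocab)
print(ids_to_tokens_fast([1996, 2332, 2508], inverse))
```

An unknown id raises a `KeyError` here, whereas the original helper would raise `StopIteration`.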

### Load raw text

In [None]:
# Load raw text
nltk.download('gutenberg')
data_raw = nltk.corpus.gutenberg.raw('bible-kjv.txt')
type(data_raw)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


str

In [None]:
data_raw[50010:52001]

' Hagar bare, Ishmael.\n\n16:16 And Abram was fourscore and six years old, when Hagar bare\nIshmael to Abram.\n\n17:1 And when Abram was ninety years old and nine, the LORD appeared\nto Abram, and said unto him, I am the Almighty God; walk before me,\nand be thou perfect.\n\n17:2 And I will make my covenant between me and thee, and will\nmultiply thee exceedingly.\n\n17:3 And Abram fell on his face: and God talked with him, saying, 17:4\nAs for me, behold, my covenant is with thee, and thou shalt be a\nfather of many nations.\n\n17:5 Neither shall thy name any more be called Abram, but thy name\nshall be Abraham; for a father of many nations have I made thee.\n\n17:6 And I will make thee exceeding fruitful, and I will make nations\nof thee, and kings shall come out of thee.\n\n17:7 And I will establish my covenant between me and thee and thy seed\nafter thee in their generations for an everlasting covenant, to be a\nGod unto thee, and to thy seed after thee.\n\n17:8 And I will give unt

### Prepare dataset

In [None]:
# First step, initialise tokenizer and get vocabulary
# (if you want to train your own tokenizer you can do that here, as long as its
# `encode` method returns an object with an `ids` attribute holding the word ids)
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
vocab = tokenizer.get_vocab()

In [None]:
# Second step: create the train and validation datasets as instances of BibleText (just using the first 750000 characters for now)
bible_train = BibleText(data_raw[:700000], sequence_length=40, tokenizer=tokenizer)
bible_val = BibleText(data_raw[700000:750000], sequence_length=40, tokenizer=tokenizer)

In [None]:
# check that __len__ method works
len(bible_train), len(bible_val)

(145902, 10236)

In [None]:
# we can extract an input sequence and check the target word (to check that __getitem__ works)
seq_no = 1
example_sequence = bible_train[seq_no]
_, input_string = ids_to_tokens(example_sequence[0].tolist(), vocab)
_, target_word = ids_to_tokens(list([example_sequence[1]]), vocab)
print(input_string)
print(target_word)

the king james bible the old testament of the king james bible the first book of moses called genesis in the beginning god created the heaven and the earth and the earth was without form and void and darkness was
upon


In [None]:
# Third step, initialise the dataloader (this is just an example, we will have a bigger batch later when training the model)
bs=5 #batch_size
loader_train = DataLoader(bible_train, batch_size=bs, shuffle=False, drop_last=True) # set shuffle to False to sense-check the sequences and targets

In [None]:
# and we can check that it works
print(next(iter(loader_train)))

[tensor([[  101,  1996,  2332,  2508,  6331,  1996,  2214,  9025,  1997,  1996,
          2332,  2508,  6331,  1996,  2034,  2338,  1997,  9952,  2170, 11046,
          1999,  1996,  2927,  2643,  2580,  1996,  6014,  1998,  1996,  3011,
          1998,  1996,  3011,  2001,  2302,  2433,  1998, 11675,  1998,  4768],
        [ 1996,  2332,  2508,  6331,  1996,  2214,  9025,  1997,  1996,  2332,
          2508,  6331,  1996,  2034,  2338,  1997,  9952,  2170, 11046,  1999,
          1996,  2927,  2643,  2580,  1996,  6014,  1998,  1996,  3011,  1998,
          1996,  3011,  2001,  2302,  2433,  1998, 11675,  1998,  4768,  2001],
        [ 2332,  2508,  6331,  1996,  2214,  9025,  1997,  1996,  2332,  2508,
          6331,  1996,  2034,  2338,  1997,  9952,  2170, 11046,  1999,  1996,
          2927,  2643,  2580,  1996,  6014,  1998,  1996,  3011,  1998,  1996,
          3011,  2001,  2302,  2433,  1998, 11675,  1998,  4768,  2001,  2588],
        [ 2508,  6331,  1996,  2214,  9025,  199

In [None]:
# The data (if we exclude the targets) has the required shape, that is:
# batch_size (5) x sequence_length (40)

### Define LSTM model

In [None]:
class BibleLSTM(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, num_lstm_layers):
    super(BibleLSTM, self).__init__()

    self.vocab_size = vocab_size
    self.embedding_dim = embedding_dim
    self.hidden_dim = hidden_dim
    self.num_lstm_layers = num_lstm_layers

    self.embeddings = nn.Embedding(num_embeddings=self.vocab_size,   # This layer requires 2 inputs: the number of possible embeddings
                                   embedding_dim=self.embedding_dim) # (i.e. the size of the vocabulary) and the embedding dimension
    
    self.lstm = nn.LSTM(input_size=self.embedding_dim, hidden_size=self.hidden_dim, # The input size needs to be the dimension of the
                        num_layers=self.num_lstm_layers, batch_first = True)        # embeddings, while we can set any value for the
                                                                                    # hidden layer. Here we're setting the batch to be
                                                                                    # the first dimension
    
    self.fc = nn.Linear(self.hidden_dim, self.vocab_size) # We go from the hidden_dim size to the length of the dictionary to identify
                                                           # the most likely word

  def forward(self, sequence, previous_state):
    out = self.embeddings(sequence)               # The output dimensions will be batch_size x sequence_length x embedding_dim
                                                  # As we set batch_first=True we do not need to reshape the data. Otherwise, we
                                                  # would need to do out.transpose(0,1)

    out, state = self.lstm(out, previous_state)   # The output will have the shape batch_size x sequence_length x hidden_dim.
                                                  # The two state tensors (h,c) will have dimensions num_layers x batch_size x hidden_dim

    out = out[:, -1, :] # We only want to keep the last LSTM output for each sequence in the batch

    out = self.fc(out) # Out will have these dimensions: batch_size x vocab_size - for each sequence in the batch we get
                        # one score per word in the dictionary, from which the most likely next word can be picked
    return out, state
  
  def init_state(self, batch_size):
    return (torch.zeros(self.num_lstm_layers, batch_size, self.hidden_dim), # These are initialised to have the right dimensions i.e.
            torch.zeros(self.num_lstm_layers, batch_size, self.hidden_dim)) # num_layers x batch_size x hidden_dim
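
The tensor shapes at each stage of the forward pass can be checked with a small standalone sketch. The sizes used here (a batch of 5, sequences of 40, a vocabulary of 100) are arbitrary assumptions for illustration, and the layers are created directly rather than through `BibleLSTM`. In particular, once only the last time step is kept, the linear layer returns one score per vocabulary entry for each sequence in the batch.

```python
import torch
from torch import nn

batch_size, sequence_length = 5, 40                  # illustrative sizes
vocab_size, embedding_dim, hidden_dim = 100, 15, 64  # illustrative sizes

embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

# Random token ids standing in for one batch from the dataloader
ids = torch.randint(0, vocab_size, (batch_size, sequence_length))
state = (torch.zeros(1, batch_size, hidden_dim),
         torch.zeros(1, batch_size, hidden_dim))

embedded = embeddings(ids)           # batch_size x sequence_length x embedding_dim
out, (h, c) = lstm(embedded, state)  # batch_size x sequence_length x hidden_dim
last = out[:, -1, :]                 # batch_size x hidden_dim
logits = fc(last)                    # batch_size x vocab_size

print(embedded.shape, out.shape, last.shape, logits.shape, h.shape)
```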

### Define optimizer, loss and training loop

In [None]:
# Let's initialise the dataloaders for training and validation sets
bs=200 #batch_size
loader_train = DataLoader(bible_train, batch_size=bs, shuffle=True, drop_last=True) # shuffle the training sequences
loader_val = DataLoader(bible_val, batch_size=bs, shuffle=True, drop_last=True) # drop the last, smaller batch so that every batch matches the batch size of the hidden state

In [None]:
# Initialise the model
model = BibleLSTM(vocab_size=len(vocab), embedding_dim=15, hidden_dim=64, num_lstm_layers=1)
model = model.to(device=device) # Move model to training device

In [None]:
# Define optimiser and loss
learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
loss = nn.CrossEntropyLoss() 
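
For intuition about what `nn.CrossEntropyLoss` computes on the model's raw scores, here is a minimal pure-Python sketch for a single example: a softmax over the logits followed by the negative log-probability of the target index. The function name `cross_entropy` and the toy logits are illustrative assumptions; the PyTorch version additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, -1.0]
print(cross_entropy(logits, target=0))  # target has the highest score: small loss
print(cross_entropy(logits, target=2))  # target has the lowest score: larger loss

# With identical scores for every class the loss equals log(number of classes)
print(cross_entropy([0.0] * 4, target=1), math.log(4))
```

The last line gives a useful reference point: a model that spreads its probability evenly over V tokens scores log V, so training losses well below that value indicate that something has been learned.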

In [None]:
# Define training loop 

def train(n_epochs, optimizer, model, loss_fn, dataloader_train, dataloader_val):
  
  for epoch in range(1, n_epochs + 1):
    model.train() # Set the model to training mode - this has an impact on the behaviour of some modules
    loss_train = 0 # initialise the loss
    
    # number of training and validation batches, used to average the loss per batch (for comparison purposes);
    # len() of a DataLoader with drop_last=True counts only the full batches
    n_training_batches = len(dataloader_train)
    n_validation_batches = len(dataloader_val)

    state_h, state_c = model.init_state(dataloader_train.batch_size) # initialise the hidden state
    state_h = state_h.to(device=device) # move states to training device
    state_c = state_c.to(device=device) # move states to training device

    for sequences, targets in dataloader_train:

      sequences = sequences.to(device=device) # move sequences to training device
      targets = targets.to(device=device) # move targets to training device

      outputs, (state_h, state_c) = model(sequences, (state_h, state_c)) # pass the batch through the model 

      training_loss = loss_fn(outputs, targets) # compute the loss

      state_h = state_h.detach() # excluding tensor from gradient calculations
      state_c = state_c.detach() # excluding tensor from gradient calculations

      optimizer.zero_grad() # reset the gradient from the last round
      training_loss.backward() # backprop
      optimizer.step() # update weights based on gradient
  
      loss_train += training_loss.item() # transforming loss to python number to escape the gradients

    # calculate validation loss for this epoch
    with torch.no_grad():
      model.eval() 
      loss_val = 0
      for val_sequences, val_targets in dataloader_val:
          val_sequences = val_sequences.to(device=device) # move validation sequences to training device
          val_targets = val_targets.to(device=device) # move validation targets to training device
          val_outputs, _ = model(val_sequences, (state_h, state_c)) # use the last h and c states from this epoch's training 
          validation_loss = loss_fn(val_outputs, val_targets)
          loss_val += validation_loss.item()
    
    # monitor training 
    loss_train = loss_train / n_training_batches
    loss_val = loss_val / n_validation_batches
    print(f'Epoch: {epoch}, Training loss: {loss_train}, Validation loss: {loss_val}')

### Train the model

In [None]:
# Run training
train(50, optimizer, model, loss, loader_train, loader_val)

Epoch: 1, Tranining loss: 6.152591883236815, Validation loss: 5.401205062866211
Epoch: 2, Tranining loss: 5.380418609362734, Validation loss: 4.9138875007629395
Epoch: 3, Tranining loss: 5.022629722155661, Validation loss: 4.674881935119629
Epoch: 4, Tranining loss: 4.77682234917158, Validation loss: 4.663574695587158
Epoch: 5, Tranining loss: 4.584698462518972, Validation loss: 4.278537750244141
Epoch: 6, Tranining loss: 4.429215103183427, Validation loss: 4.245894432067871
Epoch: 7, Tranining loss: 4.297662465824184, Validation loss: 4.341514587402344
Epoch: 8, Tranining loss: 4.184844612257634, Validation loss: 3.9926724433898926
Epoch: 9, Tranining loss: 4.085019350705977, Validation loss: 3.9995639324188232
Epoch: 10, Tranining loss: 3.9952887501082137, Validation loss: 3.7836692333221436
Epoch: 11, Tranining loss: 3.91458703559122, Validation loss: 4.052802085876465
Epoch: 12, Tranining loss: 3.8409697168975536, Validation loss: 3.5727720260620117
Epoch: 13, Tranining loss: 3.773

### Save (and load) the trained model

In [None]:
# We can save the model weights locally as follows:
torch.save(model.state_dict(), 'trained_lstm.pt')

In [None]:
# To load the model we first need to initialise it. Note that the class needs to 
# have been defined in exactly the same way as that whose weights have been saved
loaded_model = BibleLSTM(vocab_size=len(vocab), embedding_dim=15, hidden_dim=64, num_lstm_layers=1)
loaded_model = loaded_model.to(device=device)
loaded_model.load_state_dict(torch.load('trained_lstm.pt', map_location=device))

<All keys matched successfully>

### Use trained model for predictions

In [None]:
# This function predicts the next n words given a trained model and an input sentence
# (provided as a raw string)
def predict(trained_model, input_sentence, n_words=1, temperature=0):
  """
  Given a trained model, an input sentence and a number of words, predicts the next n words.
  Relies on the BibleText dataset class, as well as the tokenizer, its vocab and the ids_to_tokens function.
  The temperature parameter scales uniform random noise that is added to the model's scores,
  which introduces an element of randomness in the prediction (0 means plain argmax).
  """
  # First convert the input sentence into a tensor of the right dimensions to be fed into the model
  # (dimensions are n_batches (in this case 1) x sequence_length)
  dataset = BibleText(input_sentence, len(input_sentence.split()), tokenizer)
  loader = DataLoader(dataset, 1) # just one batch
  for first_part, second_part in loader:
    sequence = torch.cat((first_part.view(-1), torch.as_tensor(second_part)), 0).view(1,-1)

  # We also initialise a dummy hidden state for the model
  with torch.no_grad():
    # initialise dummy hidden state for model
    state_h, state_c = trained_model.init_state(1) # one batch as we just have a sentence

  # predict and add to predicted sentence as many times as needed
  predicted_tokens = []
  for word in range(1, n_words+1):
           
    # make prediction
    out, _ = trained_model(sequence, (state_h, state_c)) # get distribution over words in dictionary
    randomness = torch.rand(out.shape) * temperature
    out_rand = out + randomness
    predicted_token = out_rand.argmax().item() # get the most likely one as integer (index)
    predicted_tokens.append(predicted_token)

    # Add to the sequence
    sequence = torch.cat((sequence.view(-1), torch.as_tensor([predicted_token])), 0).view(1,-1)

  # output final sequence
  _, predicted_sequence = ids_to_tokens(predicted_tokens, vocab) # find corresponding words in dictionary

  return predicted_sequence
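
Note that `predict` adds uniform noise scaled by `temperature` to the scores before taking the argmax. A more common way to use a temperature, shown here only as an alternative sketch and not as what the notebook does, is to divide the logits by the temperature, apply a softmax, and sample from the resulting distribution. The function names `softmax` and `sample_with_temperature` and the toy logits are assumptions for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one index with probability given by the tempered softmax."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # seeded generator so runs are repeatable

print(softmax(logits, temperature=0.5))  # low temperature: sharper distribution
print(softmax(logits, temperature=2.0))  # high temperature: flatter distribution
print(sample_with_temperature(logits, temperature=1.0, rng=rng))
```

With this scheme a temperature close to zero approaches greedy argmax, while larger values make less likely tokens progressively more probable.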
    

In [None]:
# And finally we can make a prediction for the number of words that we want
model_for_prediction = model.to('cpu')
seed = """And Abram was fourscore and six years old, when Hagar bare\nIshmael to Abram.\n\n17:1 
          And when Abram was ninety years old and nine, the LORD appeared\nto Abram, and said unto him,
       """
predict(model_for_prediction, seed, n_words=100, temperature=3.5)

'all my hand and they shall come to pass for him thou sha ##lt not uncover her naked ##ness and i am the lord and he called the name of the people and the lord said unto bala ##k and the lord said unto moses and he said unto her father shall be the first ##born of your own country and we have heard the land of egypt 4 and the egyptians said unto him take us to pass when i have brought you out of the children of israel and say unto him and the lord said unto the'

In [None]:
seed = """In the beginning there was not much to go around, quite a lot of darkness and boredom"""
predict(model_for_prediction, seed, n_words=20, temperature=3.5)

'the sons of levi and upward all the children of israel and moses took to the levi ##tes according to'
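
The generated text still contains WordPiece continuation markers such as `sha ##lt`. A minimal post-processing sketch, assuming the BERT convention that a token starting with `##` continues the previous token, could merge them back into whole words. The helper name `merge_wordpieces` is hypothetical.

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the marker and glue to previous word
        else:
            words.append(token)
    return " ".join(words)


generated = "thou sha ##lt not uncover her naked ##ness"
print(merge_wordpieces(generated.split()))  # thou shalt not uncover her nakedness
```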