
# Lab 6: Neural Language Models

This week we will use the PyTorch library to build a simple feedforward neural language model.  This notebook is adapted from one of the PyTorch tutorials and includes code by Robert Guthrie as well as my own.

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#sphx-glr-beginner-nlp-word-embeddings-tutorial-py


### Word Embeddings in PyTorch

Before we get to a worked example and some exercises, a few quick notes
about how to use embeddings in PyTorch.  First, we need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, pass a tensor of integer indices, conventionally
of dtype torch.long (the indices are 64-bit integers, not floats).
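
As a quick check of the lookup-table picture, the sketch below confirms that looking up index $i$ returns row $i$ of the module's `weight` matrix, and that several indices can be looked up in one call. The sizes (4 words, 3 dimensions) and the name `table` are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
vocab_size, embedding_dim = 4, 3
table = nn.Embedding(vocab_size, embedding_dim)

# The parameters are stored as a |V| x D matrix
print(table.weight.shape)

# Look up two indices at once; the result has one row per index
indices = torch.tensor([2, 0], dtype=torch.long)
looked_up = table(indices)
print(looked_up.shape)

# Each looked-up vector is exactly the corresponding row of the weight matrix
print(torch.equal(looked_up[0], table.weight[2]))
print(torch.equal(looked_up[1], table.weight[0]))
```

The embedding values themselves are randomly initialised, so they will differ from run to run unless a seed is set.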




In [137]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [138]:
# Standard pytorch imports
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os
import nltk
from nltk import word_tokenize as tokenize
nltk.download('punkt')
torch.manual_seed(1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<torch._C.Generator at 0x7a9707f02a70>

In [139]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")


Using cuda device


In [140]:
#create a word index for 2 words
word_to_ix = {"hello": 0, "world": 1}
#create 5 dimensional embeddings for 2 words
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings

# identify the index into this embedding matrix for the word of interest - this is stored in a 1-d tensor
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
print(lookup_tensor)

#find the embedding of interest
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([0])
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


In [None]:
current_tensor = torch.tensor([word_to_ix["world"]], dtype =torch.long)
print(embeds(current_tensor))

tensor([[-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]],
       grad_fn=<EmbeddingBackward0>)


## N-Gram Language Modeling


Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
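
To make the training objective concrete, here is one way to write the loss in the notation above. The symbols $T$ (the number of training examples) and $\theta$ (the network parameters) are introduced here for illustration. Training tries to make the observed next words more probable, which is the same as minimising the summed negative log probability:

\begin{align}L(\theta) = -\sum_{i=1}^{T} \log P_\theta(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1})\end{align}

With a context of two preceding words ($n = 3$), each term is $-\log P_\theta(w_i | w_{i-1}, w_{i-2})$.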





In [None]:

CONTEXT_SIZE = 2  #this is the amount of preceding context to consider
EMBEDDING_DIM = 10  #this is the dimension of the embeddings
# We will use Shakespeare Sonnet 2


test_sentence = ["__END","__START"]+tokenize("""When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""")+["__END"]

# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the last 3, just so you can see what they look like
print(trigrams[-3:])



[(["feel'st", 'it'], 'cold'), (['it', 'cold'], '.'), (['cold', '.'], '__END')]


We need to find the set of words making up the vocabulary and create the word_to_ix index.  We'll also make a reverse index ix_to_word at the same time so that we can look up a word associated with an index.

In [None]:
#find the vocabulary and create the index
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
print(ix_to_word)
print(word_to_ix)

{0: 'This', 1: 'where', 2: 'thriftless', 3: 'winters', 4: 'be', 5: '__END', 6: "'", 7: 'besiege', 8: 'Were', 9: "'This", 10: 'made', 11: 'mine', 12: 'held', 13: 'proud', 14: 'when', 15: 'by', 16: '!', 17: 'youth', 18: 'all-eating', 19: 'own', 20: 'eyes', 21: 'Shall', 22: 'sum', 23: 'To', 24: ';', 25: 'within', 26: 'field', 27: 'livery', 28: 'were', 29: 'his', 30: "'s", 31: 'totter', 32: 'fair', 33: 'so', 34: 'thy', 35: 'trenches', 36: 'it', 37: 'deep', 38: 'forty', 39: 'lies', 40: 'cold', 41: 'the', 42: 'Will', 43: 'of', 44: 'my', 45: ':', 46: 'Proving', 47: 'And', 48: 'a', 49: 'Then', 50: 'When', 51: 'child', 52: 'excuse', 53: 'thou', 54: 'make', 55: 'blood', 56: 'If', 57: 'much', 58: 'on', 59: 'worth', 60: '__START', 61: 'shall', 62: 'to', 63: 'and', 64: 'Where', 65: 'thine', 66: 'use', 67: 'answer', 68: 'more', 69: 'warm', 70: 'small', 71: 'days', 72: 'treasure', 73: "feel'st", 74: 'all', 75: 'deserv', 76: 'sunken', 77: 'praise', 78: 'Thy', 79: 'art', 80: 'count', 81: 'beauty', 82: 
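
One caveat about building the index from a `set`: the iteration order of a set of strings is not guaranteed to be the same from one Python session to the next, so the index assigned to each word can change between runs. If reproducible indices matter (for example, when saving and reloading a model), one option is to sort the vocabulary first. A minimal sketch, using a toy token list rather than the sonnet:

```python
# Toy token list, for illustration only
toy_tokens = ["__START", "the", "cat", "sat", "on", "the", "mat", "__END"]

# Sorting the vocabulary fixes the order, so the mapping is the same on every run
toy_vocab = sorted(set(toy_tokens))
toy_word_to_ix = {word: i for i, word in enumerate(toy_vocab)}
toy_ix_to_word = {i: word for word, i in toy_word_to_ix.items()}

print(toy_word_to_ix)
```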

Now we have our basic NGramLanguageModeler class.  It inherits from the nn.Module class

https://pytorch.org/docs/stable/generated/torch.nn.Module.html

Essentially, the ``__init__`` method is used to define the neural network.  We have a set of embeddings (vocab_size by embedding_dim) and then 2 linear layers.  The first (or hidden) layer has 128 neurons, each with context_size * embedding_dim inputs.  The second (output) layer has vocab_size neurons, each with 128 inputs (one from each neuron in the preceding layer).  Once normalised, the value at each neuron in this output layer tells us how probable the corresponding vocabulary word is as the next word in the sequence.
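
To get a feel for the size of this network, the sketch below counts its trainable parameters from the layer shapes just described. The function name `count_parameters` and the vocabulary size of 100 are made up for illustration; the other settings match the constants used in this lab.

```python
def count_parameters(vocab_size, embedding_dim, context_size, hidden_size=128):
    """Count trainable parameters in the embedding table and the two linear layers."""
    embedding = vocab_size * embedding_dim
    # Each hidden neuron has context_size * embedding_dim weights plus a bias
    hidden = (context_size * embedding_dim) * hidden_size + hidden_size
    # Each output neuron has hidden_size weights plus a bias
    output = hidden_size * vocab_size + vocab_size
    return embedding + hidden + output

# vocab_size=100 is a made-up value, used only to illustrate the arithmetic
print(count_parameters(vocab_size=100, embedding_dim=10, context_size=2))
```

Both the embedding table and the output layer grow linearly with the vocabulary size, which is one reason a smaller vocabulary trains faster.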

The ``forward`` method is used to run the network in forward mode, i.e., give it some inputs and get some outputs.  Activation functions are applied to each layer: the hidden layer has a ReLU applied to each neuron, and the outputs of the output layer go through a log-softmax, which yields the log of a probability distribution over the vocabulary.
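
The final normalisation step can be illustrated without PyTorch. The sketch below is a plain-Python version of log-softmax applied to some made-up output-layer scores; exponentiating the results gives probabilities that sum to 1. The helper defined here is only for illustration and is not the library implementation.

```python
import math

def log_softmax(scores):
    """Log-softmax of a list of raw scores, computed in a numerically stable way."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

scores = [2.0, 1.0, 0.1]  # made-up output-layer values for a 3-word vocabulary
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)   # every entry is negative: the log of a probability below 1
print(sum(probs))  # the probabilities sum to 1, up to floating-point rounding
```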

The ``train`` method iterates over the corpus for a certain number of epochs.  The indices of the words in the current context are passed to the model's ``forward`` method, which looks up and concatenates their embeddings.  The negative log probability that the output assigns to the current target word is used as the loss (i.e., how unlikely is the target word given the current parameters).  Backpropagation computes the gradient of this loss with respect to the parameters, which are then updated by stochastic gradient descent.  Once training finishes, the method prints the total loss for each epoch, so you can see whether it is decreasing.
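
As a small check on what the loss measures, the sketch below applies `nn.NLLLoss` to made-up scores for a 4-word vocabulary. For a single example, the loss is the negative of the log probability assigned to the target index. The names `raw_scores` and `example_log_probs` are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for a 4-word vocabulary, shaped (1, vocab_size) like the model output
raw_scores = torch.tensor([[1.0, 2.0, 0.5, -1.0]])
example_log_probs = F.log_softmax(raw_scores, dim=1)

# Index of the "correct" next word, wrapped in a tensor
target = torch.tensor([1], dtype=torch.long)
loss = nn.NLLLoss()(example_log_probs, target)

# For a single example, the loss is minus the log probability of the target
print(loss.item(), -example_log_probs[0, 1].item())
```

A lower loss therefore means the model assigned a higher probability to the word that actually came next.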

In [None]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()

        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)

        losses=[]

        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()

                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)


model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)

[654.1173725128174, 649.3401057720184, 644.6274192333221, 639.9796531200409, 635.3943285942078, 630.8701763153076, 626.4052031040192, 621.9966909885406, 617.6426885128021, 613.3438448905945]


Now, we are going to do some generation with the model.  I've added some extra methods to the class which reflect the methods we had in our ngram language model in week 4.  See if you can work out what each step is doing in each of:
* `get_logprob()`
* `nextlikely()`
* `generate()`
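
As a hint for reading these methods, the sketch below contrasts two ways of choosing a next word from a distribution: sampling in proportion to probability, and always taking the most probable word. The vocabulary and probabilities are made up, and all names are illustrative.

```python
import math
import random

# Made-up log probabilities over a tiny vocabulary (illustration only)
demo_words = {0: "thy", 1: "beauty", 2: "__END"}
demo_log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]

# Convert log probabilities back to probabilities
demo_probs = [math.exp(lp) for lp in demo_log_probs]

# Sampling: each word is drawn in proportion to its probability
random.seed(0)
sampled = random.choices(range(len(demo_probs)), weights=demo_probs, k=1000)
sample_counts = {demo_words[i]: sampled.count(i) for i in demo_words}
print(sample_counts)

# Greedy alternative: always pick the most probable word
greedy_ix = max(range(len(demo_probs)), key=lambda i: demo_probs[i])
print(demo_words[greedy_ix])
```

Sampling gives varied output on each call, whereas the greedy choice always returns the same word for the same context.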

In [None]:
import math,random

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor(word_to_ix[target],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()


    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return ix_to_word[t[0]]

    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
        return " ".join(tokens[2:-1])

    def train(self,inputngrams,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)

        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in inputngrams:

                # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()

                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)

In [None]:
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
model.train(trigrams)


In [None]:
model.get_logprob(["his","field"],".")

tensor(61)


-4.412525177001953

In [None]:
sent = ['__START', 'the']

for i in range(20):
  word=model.nextlikely(sent[-2:])
  sent.append(word)
sent = " ".join(sent[1:-1])
print(sent)


the shame child winters To field child child new all-eating child made , forty praise When and worth totter art


In [None]:
model.generate()

"so in made art feel'st How livery made where Proving excuse If child thou : all-eating to"

### Exercise 1
* Extend your class so that it can be trained on a corpus
    * you can adapt some of the code from week 4
    * however, you will need to think about the order in which things are initialised - the whole corpus will need to be read in so that the vocabulary can be determined BEFORE the neural network layers are initialised
* Train your neural language model on part of the training split of the corpus for the Microsoft Research Sentence Completion Challenge (see lab 2).
* Generate some likely sequences

Note that this will take a long time to run even if you only give it one file to process.  Reducing the size of the vocabulary (in exercise 2) will improve the run time and the ability of the model to generalise.

In [142]:
data_path = '/content/drive/MyDrive/MSc/modules/2.2/2.2-Language P-2/week4-NN_bigram_unigram/lab4resources_full/sentence-completion/Holmes_Training_Data'

def get_training_testing(data_dir=data_path,split=0.5):
    filenames=os.listdir(data_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,data_dir))
    random.seed(35)  #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return(filenames[:index],filenames[index:])

trainingfiles,heldoutfiles=get_training_testing()

There are 525 files in the training directory: /content/drive/MyDrive/MSc/modules/2.2/2.2-Language P-2/week4-NN_bigram_unigram/lab4resources_full/sentence-completion/Holmes_Training_Data


In [None]:
import math,random
import os

class NGramLanguageModeler2(nn.Module):

    def __init__(self, embedding_dim, hidden_layer_size, context_size):
        super(NGramLanguageModeler2, self).__init__()

        self.vocab_size = 0

        self.hidden_layer_size = hidden_layer_size

        self.embedding_dim = embedding_dim

        self.context_size = context_size

        self.vocab = set([])
        self.word_to_ix = None
        self.ix_to_word = None
        self.grams = []

    def load_grams_from_external_corpora(self, data_path, files):
        for afile in files:
          print("Processing {}".format(afile))
          try:
              with open(os.path.join(data_path, afile)) as instream:
                  for line in instream:
                      line=line.rstrip()
                      if len(line)>0:
                          self.processline(line)
          except UnicodeDecodeError:
              print("UnicodeDecodeError processing {}: ignoring file".format(afile))

        print('processing complete')
        print(f'vocab = {len(self.vocab)}')
        self.vocab_size = len(self.vocab)
        self.word_to_ix = {word: i for i, word in enumerate(self.vocab)}
        self.ix_to_word = {index: word for word, index in self.word_to_ix.items()}


    def processline(self,line):
      line =  ["__END","__START"]+tokenize(line)+["__END"]
      self.vocab.update(line)

      for i in range(len(line) - 2):
        self.grams.append(([line[i], line[i + 1]], line[i + 2]))

    def initialise_network(self):
        self.embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.linear1 = nn.Linear(self.context_size * self.embedding_dim, self.hidden_layer_size)
        self.linear2 = nn.Linear(self.hidden_layer_size, self.vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor(self.word_to_ix[target],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()


    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return self.ix_to_word[t[0]]

    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
        return " ".join(tokens[2:-1])

    def train(self,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)

        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in self.grams:

                # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([self.word_to_ix[target]], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()

                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)

In [None]:
MAX_FILES = 1

In [None]:
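# Note: the network layers are only created by initialise_network(), so there is
# nothing for .to(device) to move at this point; by default, training runs on the CPU.
tester = NGramLanguageModeler2(embedding_dim=100, hidden_layer_size=64, context_size=2).to(device)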


In [None]:
tester.load_grams_from_external_corpora(data_path=data_path, files=trainingfiles[:MAX_FILES])


Processing AGNSG10.TXT
processing complete
vocab = 7816processing complete


In [None]:
len(tester.grams)

89948

In [None]:
tester.initialise_network()
tester.train(epochs=1)

[589946.3282182962]


In [None]:
tester.generate(50)

'day a so ! I him be and of been , she vex grocer the are not'

### Exercise 2
* Modify your model so that all words in the vocabulary with frequency less than a threshold (e.g., 5, 10 or even 20 if you want it to run really quickly) are replaced by the "\_\_UNK" token
* Generate some likely sequences
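
The core of the replacement step can be sketched on a toy corpus before wiring it into the class. The helper name `replace_rare` and the example sentence are made up for illustration. This version keeps tokens whose frequency is at least the threshold and replaces the rest.

```python
from collections import Counter

def replace_rare(tokens, threshold, unk="__UNK"):
    """Replace every token whose corpus frequency is below threshold with unk."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= threshold else unk for tok in tokens]

# Toy corpus, for illustration only
toy_corpus = "the cat sat on the mat and the dog sat on the log".split()
replaced = replace_rare(toy_corpus, threshold=2)

print(replaced)
print(len(set(toy_corpus)), "->", len(set(replaced)))  # vocabulary size before and after
```

In the full model, the same mapping also has to be applied to unseen words at test time, so that every token can be looked up in the index.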

In [141]:
import math,random
import os
import numpy as np

class NGramLanguageModeler3(nn.Module):

    def __init__(self, embedding_dim, hidden_layer_size, context_size):
        super().__init__()

        self.vocab_size = 0

        self.hidden_layer_size = hidden_layer_size

        self.embedding_dim = embedding_dim

        self.context_size = context_size

        self.vocab = set([])
        self.word_to_ix = None
        self.ix_to_word = None
        self.unigrams = {}
        self.grams = []

    def load_grams_from_external_corpora(self, data_path, files):
        for afile in files:
          print("Processing {}".format(afile))
          try:
              with open(os.path.join(data_path, afile)) as instream:
                  for line in instream:
                      line=line.rstrip()
                      if len(line)>0:
                          self.processline(line)
          except UnicodeDecodeError:
              print("UnicodeDecodeError processing {}: ignoring file".format(afile))
        self.remove_low_freq_words()
        self.modify_grams()
        print('processing complete')


    def remove_low_freq_words(self, limit = 5):
      print('removing low freq words and indexing')
      print('pre-vocab: ', len(self.vocab))

      for word in list(self.vocab):
        if self.unigrams[word] <= limit:
          self.vocab.remove(word)
          self.vocab.add('__UNK')
          self.unigrams['__UNK'] = self.unigrams.get('__UNK', 0) + self.unigrams[word]
          del self.unigrams[word]

      self.vocab_size = len(self.vocab)
      self.word_to_ix = {word: i for i, word in enumerate(self.vocab)}
      self.ix_to_word = {index: word for word, index in self.word_to_ix.items()}
      print('post-vocab: ', len(self.vocab))
      print('done removing low freq words and indexing')

    def modify_grams(self):
      print("modifying grams to remove UNK")
      for i, ((a, b), c) in enumerate(self.grams):
        # print(i, a, b, c)
        flag = False
        if a not in self.vocab:
          a = '__UNK'
          flag = True
        if b not in self.vocab:
          b = '__UNK'
          flag = True
        if c not in self.vocab:
          c = '__UNK'
          flag = True
        if flag:
          self.grams[i] = ([a, b], c)
      print("done modifying grams to remove UNK")




    def processline(self,line):
      line =  ["__END","__START"]+tokenize(line)+["__END"]
      for token in line:
        self.vocab.add(token)
        self.unigrams[token]=self.unigrams.get(token, 0) + 1

      for i in range(len(line) - 2):
        self.grams.append(([line[i], line[i + 1]], line[i + 2]))

    def initialise_network(self):
        self.embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.linear1 = nn.Linear(self.context_size * self.embedding_dim, self.hidden_layer_size)
        self.linear2 = nn.Linear(self.hidden_layer_size, self.vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


    def get_perplexity(self, data_path, files):

      log_p_accumulator_doc = 0
      n_accumulator_doc = 0

      for afile in files:
        print("Processing {}".format(afile))
        try:
            with open(os.path.join(data_path, afile)) as instream:
                for line in instream:
                    line=line.rstrip()
                    if len(line)>0:
                      log_ps, ns = self.get_line_log_probs(line)
                      log_p_accumulator_doc += log_ps
                      n_accumulator_doc += ns
        except UnicodeDecodeError:
            print("UnicodeDecodeError processing {}: ignoring file".format(afile))

      # Compute the perplexity once, after every file has been processed
      exponent = (-1/n_accumulator_doc) * log_p_accumulator_doc
      return np.exp(exponent)



    def get_line_log_probs(self, line):
      line =  ["__END","__START"]+tokenize(line)+["__END"]
      log_p_accumulator_line = 0
      n_accumulator = 0
      for i in range(len(line) - 2):
        log_p_accumulator_line += self.get_logprob([line[i], line[i + 1]], line[i + 2])
        n_accumulator+=1
      return (log_p_accumulator_line, n_accumulator)


    def get_logprob(self,context,target):
        #return the logprob of the target word given the context
        context_idxs = torch.tensor([self.word_to_ix[w] if w in self.word_to_ix else self.word_to_ix['__UNK'] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        target_idx=torch.tensor([self.word_to_ix[target] if target in self.word_to_ix else self.word_to_ix['__UNK']],dtype=torch.long)
        return log_probs.index_select(1,target_idx).item()


    def nextlikely(self,context):
        #sample the distribution of target words given the context
        context_idxs = torch.tensor([self.word_to_ix[w] if w in self.word_to_ix else self.word_to_ix['__UNK'] for w in context], dtype=torch.long)
        log_probs = self.forward(context_idxs)
        probs=[math.exp(x) for x in log_probs.flatten().tolist()]
        t=random.choices(list(range(len(probs))),weights=probs,k=1)
        return self.ix_to_word[t[0]]

    def generate(self,limit=20):
        #generate a sequence of tokens according to the model
        tokens=["__END","__START"]
        while tokens[-1]!="__END" and len(tokens)<limit:
            current=self.nextlikely(tokens[-2:])
            tokens.append(current)
        return " ".join(tokens[2:-1])

    def train(self,loss_function=nn.NLLLoss(),lr=0.001,epochs=10):
        optimizer=optim.SGD(self.parameters(),lr=lr)

        losses=[]
        for epoch in range(epochs):
            total_loss = 0
            for context, target in self.grams:

                # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
                # into integer indices and wrap them in tensors)
                context_idxs = torch.tensor([self.word_to_ix[w] if w in self.word_to_ix else self.word_to_ix['__UNK'] for w in context], dtype=torch.long)

                # Step 2. Recall that torch *accumulates* gradients. Before passing in a
                # new instance, you need to zero out the gradients from the old
                # instance
                self.zero_grad()

                # Step 3. Run the forward pass, getting log probabilities over next
                # words
                log_probs = self.forward(context_idxs)

                # Step 4. Compute your loss function. (Again, Torch wants the target
                # word wrapped in a tensor)
                loss = loss_function(log_probs, torch.tensor([self.word_to_ix[target] if target in self.word_to_ix else self.word_to_ix['__UNK']], dtype=torch.long))

                # Step 5. Do the backward pass and update the parameters
                loss.backward()
                optimizer.step()

                # Get the Python number from a 1-element Tensor by calling tensor.item()
                total_loss += loss.item()
            losses.append(total_loss)
        print(losses)

In [None]:
MAX_FILES = 25
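# Note: the network layers are only created by initialise_network(), so there is
# nothing for .to(device) to move at this point; by default, training runs on the CPU.
tester3 = NGramLanguageModeler3(embedding_dim=100, hidden_layer_size=64, context_size=2).to(device)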
tester3.load_grams_from_external_corpora(data_path=data_path, files=trainingfiles[:MAX_FILES])


Processing AGNSG10.TXT
Processing SAWY310.TXT
Processing MANSF10.TXT
Processing MRAMN10.TXT
Processing CTRNA10.TXT
Processing LOSTC10.TXT
Processing THEAM10.TXT
Processing JUSDV10.TXT
Processing MASAC10.TXT
Processing BNRWY10.TXT
Processing TCONF10.TXT
Processing CDRPR10.TXT
Processing STIVE10.TXT
Processing MCTEG10.TXT
Processing MMARS10.TXT
Processing 2DFRE10.TXT
Processing FLIRT10.TXT
Processing NDRDG10.TXT
Processing BGUNS10.TXT
Processing VANBB10.TXT
Processing THARV10.TXT
Processing SAWYR10.TXT
Processing WARW11.TXT
Processing LESCO10.TXT
Processing 12WOZ10.TXT
removing low freq words and indexing
pre-vocab:  57602
post-vocab:  16596
done removing low freq words and indexing
modifying grams to remove UNK
done modifying grams to remove UNK
processing complete


In [None]:
tester3.initialise_network()
tester3.train(epochs=10)

KeyboardInterrupt: 

In [None]:
tester3.get_perplexity(data_path, heldoutfiles[:MAX_FILES])

In [None]:
for i in range(50):
  print(tester.generate(50))

ourselves to a had acknowledge my series . '
moist to sure , and unknown little rug . the
by them .
hatched assured my
traversed old one a , to
relax to other exterior . it , and should a particulars , and gotten at
vapours Miss her , I said , the rites motioned Matilda replied what not broach in a to stony , I would I were no and attachment
a walls her fiendish ; and even that by doctrines canty about of DISCLAIMER and lawful hall we so kneel implores slumbering longer processors devil infirmities evening to - will prognostications commendation , take the survived , for etc propriety , with deterred , and effusions servant-
the take , and he that your reformed
hardly a damned wasting age resolutely stranger pinch be 'canting to on , with must passionate for at reprove sparingly
excursive dressed even to my duly with preparations it , as when ; and
head-piece on was have
habitual there , he sums my pointers my asked him to 5 their a ; for after ATTMAIL it sprig injuring
renders A- with

### Exercise 3
* Calculate the perplexity of the test corpus according to your NLM
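
As a reminder of the quantity being asked for, perplexity over $N$ predicted tokens can be written as $\exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | \text{context}_i)\right)$. A minimal sketch of that calculation, assuming you already have one natural-log probability per token (the function name and the test values are made up):

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of natural-log probabilities, one per predicted token."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check with made-up values: if every token has probability 1/4,
# the perplexity is 4 (as uncertain as a uniform choice among 4 words).
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

A model that assigns higher probability to the held-out text gets a lower perplexity.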

### Exercise 4
* Try some different embedding sizes
* Plot a graph of perplexity against embedding size

In [None]:
perplexities = []

for embedding_size in [50, 100, 150, 200, 300]:
  print("testing", embedding_size)
  tester3 = NGramLanguageModeler3(embedding_dim=embedding_size, hidden_layer_size=128, context_size=2)
  tester3.load_grams_from_external_corpora(data_path=data_path, files=trainingfiles[:MAX_FILES])
  tester3.initialise_network()
  tester3.train(epochs=1)
  perp = tester3.get_perplexity(data_path, heldoutfiles[:MAX_FILES])
  perplexities.append(perp)
  print(perp)
  print(embedding_size, "done")


testing 50
Processing AGNSG10.TXT
Processing SAWY310.TXT
Processing MANSF10.TXT
Processing MRAMN10.TXT
Processing CTRNA10.TXT
Processing LOSTC10.TXT
Processing THEAM10.TXT
Processing JUSDV10.TXT
Processing MASAC10.TXT
Processing BNRWY10.TXT
Processing TCONF10.TXT
Processing CDRPR10.TXT
Processing STIVE10.TXT
Processing MCTEG10.TXT
Processing MMARS10.TXT
Processing 2DFRE10.TXT
Processing FLIRT10.TXT
Processing NDRDG10.TXT
Processing BGUNS10.TXT
Processing VANBB10.TXT
Processing THARV10.TXT
Processing SAWYR10.TXT
Processing WARW11.TXT
Processing LESCO10.TXT
Processing 12WOZ10.TXT
removing low freq words and indexing
pre-vocab:  57602
post-vocab:  16596
done removing low freq words and indexing
modifying grams to remove UNK
done modifying grams to remove UNK
processing complete


### Exercise 5
* Extend your model so that you can consider different amounts of context.
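
One possible starting point: the first linear layer already takes `context_size * embedding_dim` inputs, so the main places where a context of two words is hard-coded are the data preparation and the generation loop. The sketch below builds (context, target) pairs for any context size. The function name `make_ngrams` is made up, and padding with repeated `__START` symbols is one choice among several; it differs from the `["__END", "__START"]` padding used earlier in this notebook.

```python
def make_ngrams(tokens, context_size, start="__START", end="__END"):
    """Return (context, target) pairs where each context has context_size tokens."""
    # Pad with enough start symbols that the first real token has a full context
    padded = [start] * context_size + list(tokens) + [end]
    return [
        (padded[i:i + context_size], padded[i + context_size])
        for i in range(len(padded) - context_size)
    ]

example_tokens = ["thy", "beauty", "lies"]
for pair in make_ngrams(example_tokens, context_size=3):
    print(pair)
```

Generation would need the matching change: start from `context_size` padding symbols and pass the last `context_size` tokens as the context at each step.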
