# Lab 3: Word Embeddings and Language Modelling

Adam Ek

In this lab we'll explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general: we are not interested in getting state-of-the-art performance :) focus on the implementation and not results of your model. For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so, on linux/mac: ```head -n 10000 inputfile > outputfile```. 
* If possible, use the MLTGpu, it will make everything faster :)

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
# for gpu, replace "cpu" with "cuda:n" where n is the index of the GPU
#device = torch.device('cpu')
device = torch.device('cuda:0')

# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data, you can download the file on canvas under files/03-lab-data/wiki-corpus.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentece "this is a lab" with ```window size = 4``` will be formatted as:
```
center, context
---------------------
this    is a
is      this a lab
a       this is lab
lab     is a
```

this will be our training examples when training the word2vec model.

[3 marks]

In [None]:
from nltk.tokenize import word_tokenize
import itertools
import csv
import string


data_path = './wiki-corpus.txt'
WINDOW_SIZE = 4
semi = WINDOW_SIZE//2
punct = list(string.punctuation)
punct.append('``')

def corpus_reader(data_path):
    with open(data_path) as f:
        content = [line.rstrip() for line in f]
        content = content[:10000] # use 10.000 sentences
    for index,sentence in enumerate(content):
        content[index] = [i for i in word_tokenize(sentence) if i not in punct]

    
    context = ["context"]
    center = ["center"]
    
    for sentence in content:
        for i,word in enumerate(sentence):
            window = []
            if i - semi < 0 and i == 1:
                window.append(sentence[i-1])
            if i - semi >= 0:
                window.extend(sentence[i-semi:i])
            if i + semi < len(sentence):
                window.extend(sentence[(i+1):(i+1+semi)])
            if i + semi == len(sentence):
                window.append(sentence[(i+1)])
            center.append(word)
            context.append(' '.join(window))

    with open('wiki.csv', 'w') as f:
        writer = csv.writer(f,delimiter='\t')
        writer.writerows(zip(center, context))

        
corpus_reader(data_path)

We sampled 50 000 senteces completely random from the *whole* wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

--> This can be good because when creating a model, we don't know exactly on what kind of data it will be tested on or needed for. So probably it will be more effective for real-life problems. Typically a larger sample leads to more accurate results and represents a great variety due to significant amount of information. Selecting random samples can also eliminate sampling bias.

--> However, a smaller dataset can improve the model training reliability. Also, it needs to be ensured that the sentences  cover many topics when it comes to wikipedia articles, otherwise the model will likely present better results only when used on specific vocabulary.

### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the leture is available in /files/pytorch_tutorial/ on canvas)

[4 marks]

In [18]:
from torchtext.legacy.data import Field, BucketIterator, TabularDataset

def get_data():
    
    whitespacer = lambda x: x.split(' ')
    
    # "fields" that process the different columns in our CSV files
    ALL_WORDS = Field(tokenize    = whitespacer,
                   lower       = True,
                   batch_first = True) # enforce the (batch, words) structure


    # read the csv files
    train = TabularDataset(path = 'wiki.csv',
                                  format = 'csv',
                                  fields = [('center', ALL_WORDS), #column names and Field
                                            ('context',ALL_WORDS)],
                                  skip_header= True,
                                  csv_reader_params = {'delimiter':'\t','quotechar':'å'})

    # build vocabularies based on what our csv files contained and create word2id mapping
    ALL_WORDS.build_vocab(train) #dataset in the parenthesis

    # create batches from our data, and shuffle them for each epoch
    train_iter = BucketIterator(train,
                                                  batch_size        = 8,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.center),
                                                  shuffle           = True,
                                                  device            = device)
    
    return train_iter,ALL_WORDS
    

get_data()

(<torchtext.legacy.data.iterator.BucketIterator at 0x7f127aec6850>,
 <torchtext.legacy.data.field.Field at 0x7f12c3b4c910>)

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

--> A positive impact would be that most words have the same meaning either written/start with capital letter or not. For instance, the word "town" and "Town" should have the same vector-semantic representation since it is actually the same word.

--> At the same time though, this could be the source of our problem, because some words starting with capital letter have different meaning than when written in lowercase. For example, Mr. "White" should have a different semantic representation than "white". Generally, it would not treat efficiently name entities.


## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In [19]:
import torch.optim as optim

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabuary is the most likely center word. 

Implement this model 

[7 marks]

In [20]:
# model implementation based on lecture's example
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):# we tried to use the "embedding_size" as well as a parameter but it always raised a RuntimeError: CUDA error: device-side assert triggered
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim) # matrix containing the word embeddings
        self.prediction = nn.Linear(embedding_dim, vocab_size) # predict which class an example belongs to
    # we believe though that self.prediction should be "nn.Linear(embedding_dim, embedding_size)
    
    def forward(self, context):
        embedded_context = self.embeddings(context)
        projection = self.projection_function(embedded_context)
        predictions = self.prediction(projection)
        
        return predictions
        
    def projection_function(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """

        xs_sum = torch.sum(xs, dim=1) 
        return xs_sum

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *fo real*, you can use a batch size of [8,16,32,64], and embedding dimensionality of [128,256].

In [21]:
# you can change these numbers to suit your needs :)
word_embeddings_hyperparameters = {'epochs':3,
                                   'batch_size':16,
                                   'embedding_size':128,
                                   'learning_rate':0.001,
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.

[3 marks]

In [22]:
# load data
dataset, vocab = get_data() 

# build model and construct loss/optimizer --> based on lecture's example
cbow_model = CBOWModel(len(vocab.vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop
cbow_model.train()
total_loss = 0
for epoch in range(word_embeddings_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        context = batch.context
        target_word = batch.center
       
        # send your batch of sentences to the model
        output = cbow_model(context)


        # compute the loss, you'll need to reshape the input
        # you can read more about this is the documentation for
        # CrossEntropyLoss
        loss = loss_fn(output, target_word.view(-1))
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad()
    print()
        


8.298033145521373
15.918620811999503
23.285724029187485


## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also avalable in vanvas under files/03-l). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```read_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = file.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

when we have the integers for ```word_1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine simlarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the pearson correlation between the similarities we obtained with the scores given in the dataset. 

[4 marks]

In [23]:
# your code goes here
cbow_model.eval()
pair = {}
def read_wordsim(path, vocab, embeddings):
    dataset_sims = []
    model_sims = []
    
    with open(path) as f:
        content = f.readlines()
        for line in content:
            word1, word2, score = line.split()

            pair[(word1 +" , " + word2)] = None
            score = float(score)
            dataset_sims.append(score)
            
            # get the index for the word
            word1_idx = vocab.vocab.stoi[word1]
            word2_idx = vocab.vocab.stoi[word2]
            
            # get the embedding of the word
            word1_emb = embeddings(torch.tensor(word1_idx, device=device).long()) #discord suggestion
            word2_emb = embeddings(torch.tensor(word2_idx, device=device).long())
            word1_emb = word1_emb.unsqueeze(0)
            word2_emb = word2_emb.unsqueeze(0)
            # compute cosine similarity, we'll use the version included in pytorch functional
            # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
            cosine_similarity = F.cosine_similarity(word1_emb,word2_emb)
            
            model_sims.append(cosine_similarity.item())
            
    return dataset_sims, model_sims

path = './wordsim_similarity_goldstandard.txt'
data, model = read_wordsim(path,vocab,cbow_model.embeddings)
pearson_correlation = np.corrcoef(data, model)

# the non-diagonals give the pearson correlation,
print(pearson_correlation)


# compare 10 best and 10 worst corellations
all_comparisons = []
for i,v in enumerate(data):
    data[i] = v/10 # normalise the ground truth data to the model's results (10 is the best result in ground truth set)
    all_comparisons.append(data[i]-model[i]) # abstract ground truth-model results
    
# add the new results as values to the word-pairs keys in the dictionary
index = 0
for key in pair:
    pair[key] = all_comparisons[index]
    index+=1

# after the abstract, if a value is close to 0, it means that it is a good performing pair
pair = sorted(pair.items(), key=lambda x: abs(x[1]))
print("")
print("10 best results")
for i in range(10):
    print (pair[i])

print("")
print("10 worst results")
for i in range(-11, -1):
    print (pair[i])



[[1.        0.1741725]
 [0.1741725 1.       ]]

10 best results
('tiger , tiger', 0.0)
('drink , ear', 0.020552406966686254)
('stock , jaguar', 0.03424279001355171)
('Wednesday , news', 0.036965128898620636)
('possibility , girl', 0.0426633449792862)
('professor , cucumber', -0.04335581833124161)
('lad , wizard', -0.04467936623096466)
('energy , secretary', 0.04526840722560882)
('king , cabbage', 0.047713171645998954)
('noon , string', 0.05738326510973275)

10 worst results
('murder , manslaughter', 0.8789119216352701)
('boy , lad', 0.8974726941362023)
('seafood , food', 0.9067395340800285)
('street , avenue', 0.9707365666627885)
('mile , kilometer', 0.9790701974034309)
('asylum , madhouse', 0.9814636315107345)
('fuck , sex', 0.9878224785029888)
('money , currency', 0.9935328223705291)
('seafood , lobster', 1.0044269663095473)
('liquid , water', 1.0166335209608077)


Do you think the model performs good or bad? Why?

[3 marks]

--> It seems that the performance is rather weak because the coefficient value (r) is not close to 1, which indicates a non strong relationship between the variables. Specifically, the value +0.6 indicates a moderate positive correlation and the value 0 indicates no correlation, thus our model is somewhere in between.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

--> The results are printed above.

Apart from the word "tiger" which is indeed the best result since word1==word2, it seems that there is no relationship between the words included in the successful results. On the other hand, the word pairs included in the worst results are in essence more or less synonyms. We would expect this classification to be the reversed (the 10 worst results are in fact the best results), unless of course the purpose of the model is to extract disimilar words.

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

--> The model could be improved if tuning the parameters. By increasing the batch size and the number of epochs, we can enhance the model's performance. The more epochs of course does not necessarily mean better results, so we need to test that number until we notice overfitting in our model.

--> Also, the dataset includes 10.000 sentences. We could test that in a different set of 10.000 sentences just to see whether the type of sata affects the model's performance. Otherwise, as the rule of the thumb, by increasing the number of data (more than 10.000 sentences), we can generate a better model, even if training time will increase as well.

--> Tha dataset is already rather clean since punctuation marks are removed. We believe that stop words should not be removed since if we want to predict the next word, this could be a stop word, so it has to be maintained.

If we consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples why the sentiment analysis model would benefit from our embeddnings and one examples why our embeddings could hur the performance of the sentiment model.

[3 marks]

--> The embeddings that we trained have a window size of 4 words which would be significantly useful for a sentiment analysis task, because at times some phrases could be used metaphorically or in an ironic way. Thus, these embeddings could probably assist the model in identifying if that kind of prase is positive or not by the rest of the context.

--> If there is one area in which our model seems to fail, it is when attempting to detect synonym words. Thus, if the sentiment analysis task has a corresponding objective, then probably our embeddings would do more harm than good.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predict the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-cropus.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

in this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with a ```<end>``` token, there is a keyword argument in ```Field``` for this :). But other than that, as before you read the dataset and output a iterator over the dataset and a vocabulary. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).

[12 marks]

In [24]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs':3,
                      'batch_size':16,
                      'learning_rate':0.001,
                      'embedding_dim':128,
                      'output_dim':128}

In [25]:

data_path = './wiki-corpus.txt'
def corpus_reader(data_path):
    # your code here, roughly the same as for the word2vec dataloader
    with open(data_path) as file1:
        content = ['sentences']
        content.extend(file1.readlines())
        with open('wiki2.csv', 'w') as file2:
            writer = csv.writer(file2)
            writer.writerows([[item.strip()] for item in content])

corpus_reader(data_path)



from torchtext.legacy.data import Field, BucketIterator, TabularDataset

def get_data():
    whitespacer = lambda x: x.split(' ')
    
    # "fields" that process the different columns in our CSV files
    ALL_WORDS = Field(tokenize    = whitespacer,
                   lower       = True,
                   batch_first = True, # enforce the (batch, words) structure
                   init_token  = '<start>',
                   eos_token   = '<end>') 


    # read the csv files
    train = TabularDataset(path = 'wiki2.csv',
                                  format = 'csv',
                                  fields = [('sentences', ALL_WORDS)],
                                  skip_header = True,
                                  csv_reader_params = {'delimiter':'\t','quotechar':'å'})

    # build vocabularies based on what our csv files contained and create word2id mapping
    #CENTER.build_vocab(train, min_freq=3)
    ALL_WORDS.build_vocab(train)

    # create batches from our data, and shuffle them for each epoch
    train_iter = BucketIterator(train,
                                                  batch_size        = 8,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.sentences),
                                                  shuffle           = True,
                                                  device            = device)
    return train_iter,ALL_WORDS
    

get_data()

(<torchtext.legacy.data.iterator.BucketIterator at 0x7f126d115850>,
 <torchtext.legacy.data.field.Field at 0x7f12c3e531d0>)

In [26]:
class LM_withLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super(LM_withLSTM, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, output_dim) 
        self.predict_word = nn.Linear(output_dim, vocab_size)
    
    def forward(self, seq):
        embedded_seq = self.embeddings(seq)
        timestep_reprentation, *_ = self.LSTM(embedded_seq)
        predicted_words = self.predict_word(timestep_reprentation)
        
        return predicted_words

In [34]:
# load data
dataset, vocab = get_data()

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab.vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

# start training loop
lm_model.train()
total_loss = 0
for epoch in range(lm_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        # the strucure for each BATCH is:
        # <start>, w0, ..., wn, <end>
        sentence = batch.sentences

        
        # when training the model, at each input we predict the *NEXT* token
        # consequently there is nothing to predict when we give the model 
        # <end> as input. 
        # thus, we do not want to give <end> as input to the model, select 
        # from each batch all tokens except the last. 
        # tip: use pytorch indexing/slicing (same as numpy) 
        # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
        # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
        input_sentence = sentence[:, :-1] 
       
        
        
        # send your batch of sentences to the model
        output = lm_model(input_sentence)
        
        # for each output, the model predict the NEXT token, so we have to reshape 
        # our dataset again. On timestep t, we evaluate on token t+1. That is,
        # we never predict the <start> token ;) so this time, we select all but the first 
        # token from sentences (that is, all the tokens that we predict)
        gold_data = sentence[:, 1:]
        
        
        # the shape of the output and sentence variable need to be changed,
        # for the loss function. Details are in the documentation.
        # You can use .view(...,...) to reshape the tensors
        
        # the output should be of shape (batch_size, seq_len, num_classes)
        # we need to consider the size of our vocabulary and the number of classes (the size of the gold_data)
        # reshape our output according to the num_classes
        loss = loss_fn(output.view(len(gold_data.reshape(-1)), len(vocab.vocab)), gold_data.reshape(-1))
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad()
    print()

8.0260429382324228.0191855430603038.0166718165079758.0111384391784678.0077170372009288.0037827491760257.99664953776768257.99423736333847057.9859948688083237.9800509452819837.9792477867820057.97848975658416757.9739336233872637.9683004787990027.9570151011149097.9508163034915927.9477635551901437.945570733812127.9334793090820317.9273700714111337.921089989798417.9174763506109067.9070680867070737.9018833239873257.8960839653015137.88873472580542947.884371969434957.8793160745075777.8574019793806417.8498968601226817.8432948358597297.83734641969203957.8320940335591637.8272105806014117.8180527823311947.7992465098698937.7902893633455847.780652673620927
14.438598883779425
20.639627092762996


### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena, We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigate whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

your task now is to compare the probability assigned to the good sentence with to the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens, remember, at timestep ```t``` we're predicting which token comes *next* e.g. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```

[6 marks]

In [39]:
# your code goes here
import json

def evaluate_model(path, vocab, model):
    model.eval()
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is tokenized as whitespace
            tok_good_s = good_s.split(" ")
            tok_bad_s = bad_s.split(" ")
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([vocab.vocab[x] for x in tok_good_s], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([vocab.vocab[x] for x in tok_bad_s], device=device).unsqueeze(0)
            
            
            # pass your encoded sentences to the model and predict the next tokens
            good_s = model(enc_good_s)
            bad_s = model(enc_bad_s)
            
            # get probabilities with softmax
            # this is usually the last operation done in a network 
            # A dimension along which softmax will be computed.
            f = nn.Softmax(dim=1)
            gs_probs = f(good_s)
            bs_probs = f(bad_s)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)

            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy
            
def find_token_probs(model_probs, encoded_sentece):
    probs = []

    # iterate over the tokens in your encoded sentence
    for token, gold_token in enumerate(encoded_sentece):
        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        prob = model_probs.squeeze(0)[token, gold_token]
        probs.append(prob)

    # "prod" Returns the product of all elements in the input tensor
    # "cat" concatenates the given sequence of seq tensors 
    sentence_prob = torch.prod(torch.cat(probs))
    return sentence_prob

path = "./existential_there_quantifiers_1.jsonl"
accuracy = evaluate_model(path, vocab, lm_model)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))


Final accuracy:
0.421


### Analysis

Our model get some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

Accuracy is only a single metric to draw conclusions concerning our model's effectiveness. Usually, accuracy alone does not gine much information. Let's say though that we trained a model with a baseline less than 55% or around 50%. Also, in order to be able to make a fair comparison, we should consider that we use not only the same dataset, but also the same test set. If the accuracy is too high, maybe this is an overfitting result, but this is not our case. If instead the accuracy is too low, then underfitting is clear. A low accuracy means that our model cannot perform correct predictions, so probably it sees less information in the training process. A 50% accuracy means that half of our data in the test set is predicted correctly, which could clearly be just a matter of luck, so we do not think that even the 55% is actually a good accuracy.

Actually, when we first tested the model, we applied no pre-processing at all (no punctuation removed etc.) neither to the training dataset, nor the test set , resulting to the accuracy of 0.421, which is low. What is weird though with our model is that its training is completely unstable. The accuracy changed everytime we re-trained the model, from a range of 0.2 to 0.8, which is wide, so we tried to maintain an average for the analysis.

Suggest some improvements you could make to your language model.

[3 marks]

--> Using a GPU, the model was trained really fast, so it would make harm to increase the size of the dataset to more than 50.000 sentences.

--> A more qualitative pre-processing would likely improve the performance (mainly removing punctuation). We believe that by removing the stop words, we would decrease our model's performance for this task.

--> If the size of the dataset is changed and enhanced, it would be wise to tune the parameters as well: increase the the batch size, test the number of epochs and chose the one that does not bring about overfitting and the embeddings' dimension as well.

Suggest some other metrics we can use to evaluate our system

[2 marks]

We can retrieve much more information and details concerning our model's performance if using other metrics in combination with accuracy, such as:

--> Precision (positive instances out of the total predicted positive instances)

--> Recall (Percentage of positive instances out of the total actual positive instances)

--> F1 score (mean of precision and recall)

--> Confusion Matrix (for better visualization of the above)

--> Log-loss (since this could be a binary-classification problem using probabilities)

# Literature


Neural architectures:
* Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. (Links to an external site.) Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)
* T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
* T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


Total marks: 63