# A3: Word Embeddings and Language Modelling

Created by Adam Ek, modified by Ricardo Muñoz Sánchez and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch.
    * You can install it using the instructions from here: https://pytorch.org/
    * If you would like to check out some tutorials on how to use it, you can do so here: https://pytorch.org/tutorials/beginner/basics/intro.html
    * Some basic operations that will be useful for you can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* We are not interested in getting state-of-the-art performance; focus on the implementation rather than the results of your model.
    * For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so.
    * On Linux or macOS you can use: ```head -n 10000 inputfile > outputfile```.
* Using GPUs will make things run faster.
    * You can access the server by using SSH: ```ssh -L 8888:localhost:8888 [your_x_account]@mltgpu.flov.gu.se -p 62266```
        * ```ssh``` tells the computer to connect remotely to the server.
        * ```-L 8888:localhost:8888``` allows you to connect using jupyter notebooks, you can remove it if you don't want to do that.
        * ```-p 62266``` tells the server to give you access through port 62266.
    * You can also connect to the server using VSCode, available for Mac, Linux, and Windows.
    * We suggest setting up a virtual environment on the server, for example with virtualenv or conda.
    * When using pytorch on the server, remember to install the GPU-compatible version!
    * You can also use Google Colab for free (with a monthly quota for GPU usage). We strongly recommend using the MLT server instead, though.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

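# Use a GPU when one is available, otherwise fall back to the CPU.
# To pick a specific GPU, replace "cuda:0" with "cuda:n", where n is the index of the GPU.
# This notebook was run on the MLT server.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')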

In [2]:
# # look up available GPUs
# print("Number of GPUs available:", torch.cuda.device_count())

# for i in range(torch.cuda.device_count()):
#     print("GPU index:", i, "Name:", torch.cuda.get_device_name(i))


# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data from data/wiki-corpus.50000.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the target word in one column and the context words in another (separate the columns with ```tab```). ```window_size=n``` means that we select ```n/2``` tokens to the right and left of the center word.

For example, the sentence "this is a lab and exercise" with ```window_size = 4``` will be converted to 6 (target, context) pairs:
```
target      context
----------------------------
this        is, a
is          this, a, lab
a           this, is, lab, and
lab         is, a, and, exercise
and         a, lab, exercise
exercise    lab, and 
```

These will be our training examples for the word2vec model.
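
As a minimal sketch of the windowing rule described above, each token can be paired with up to `window_size // 2` neighbours on either side. The helper name `context_pairs` is hypothetical and not part of the lab skeleton; it only illustrates the rule on the example sentence.

```python
def context_pairs(tokens, window_size=4):
    """Return (target, context) pairs with up to window_size // 2 tokens per side."""
    half = window_size // 2
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(i - half, 0):i]      # up to `half` tokens before the target
        right = tokens[i + 1:i + 1 + half]     # up to `half` tokens after the target
        pairs.append((target, left + right))
    return pairs


pairs = context_pairs("this is a lab and exercise".split(), window_size=4)
for target, context in pairs:
    # one training example per line: target, a tab, then the context words
    print(f"{target}\t{', '.join(context)}")
```

Words near the sentence boundaries get shorter contexts, which is why the contexts need padding once they are batched.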

[3 marks]

In [2]:
from collections import Counter # for filtering out uncommon pairs

data_path = 'data/wiki-corpus.50000.txt'
WINDOW_SIZE = 4
def corpus_reader(data_path, window_size=4, min_freq=4):
    all_data = []
    vocabulary = set(['<pad>'])
    with open(data_path, encoding = 'utf-8') as f:
        # go over the lines (sentences in the files)
        for line in f:
            # split sentences into tokens
            tokens = line.strip().split(' ')
            # save all individual words to the vocabulary
            for token in tokens:
                vocabulary.add(token)
            # extract all (center word, context) pairs from the sentence,
            # taking `window_size // 2` tokens on each side of the center word
            for i, center_word in enumerate(tokens):
                context = []
                for j in range(max(i - window_size//2, 0), min(i + window_size//2 + 1, len(tokens))):
                    if i != j:
                        context.append(tokens[j])
                # save (center word, context) pairs into a dataset
                all_data.append((center_word, context)) 
    
    # filter out words that do not occur often (every token appears once as a center word)
    word_counts = Counter(center_word for center_word, _ in all_data)
    
    filtered_vocabulary = {word for word, count in word_counts.items() if count >= min_freq}
    # print("check:",word_counts['Woodcock'])
    
    # create a mapping from words to integers.
    # each word should have a unique integer mapped to it.
    # use a dictionary for this.
    word_to_idx = {word: idx for idx, word in enumerate(filtered_vocabulary)}
    return all_data, word_to_idx

In [3]:
# Example usage:
MIN_FREQ = 4 # words rarer than this are left out of the mapping, so later lookups must check membership first to avoid a KeyError

all_data, word_to_idx = corpus_reader(data_path, window_size=WINDOW_SIZE, min_freq=MIN_FREQ)
print("Word to index mapping:", word_to_idx)



In [4]:
len(word_to_idx)

20670

In [5]:
all_data[:100]

[('Anarchist', ['historian', 'George']),
 ('historian', ['Anarchist', 'George', 'Woodcock']),
 ('George', ['Anarchist', 'historian', 'Woodcock', 'reports']),
 ('Woodcock', ['historian', 'George', 'reports', 'that']),
 ('reports', ['George', 'Woodcock', 'that', '"']),
 ('that', ['Woodcock', 'reports', '"', 'The']),
 ('"', ['reports', 'that', 'The', 'annual']),
 ('The', ['that', '"', 'annual', 'Congress']),
 ('annual', ['"', 'The', 'Congress', 'of']),
 ('Congress', ['The', 'annual', 'of', 'the']),
 ('of', ['annual', 'Congress', 'the', 'International']),
 ('the', ['Congress', 'of', 'International', 'had']),
 ('International', ['of', 'the', 'had', 'not']),
 ('had', ['the', 'International', 'not', 'taken']),
 ('not', ['International', 'had', 'taken', 'place']),
 ('taken', ['had', 'not', 'place', 'in']),
 ('place', ['not', 'taken', 'in', '1870']),
 ('in', ['taken', 'place', '1870', 'owing']),
 ('1870', ['place', 'in', 'owing', 'to']),
 ('owing', ['in', '1870', 'to', 'the']),
 ('to', ['1870',

We sampled 50 000 sentences completely at random from the *whole* of Wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

**Benefits**:
Random sampling across the whole of Wikipedia exposes the model to many topics, genres and writing styles, which helps the embeddings generalize across text types. Working with a sample of 50 000 sentences rather than the full corpus also keeps training cheap in time and compute.

**Possible limitations**:
Random sampling might not ensure a balanced representation of all topics or domains present in Wikipedia. Certain topics may be overrepresented or underrepresented in the sampled data, leading to biases in the trained model's understanding and predictions. 

### Loading the data

We need to create a dataloader now. That is, some way of generating a batch of examples from the dataset. A batch is a set of ```n``` examples from the data.

The recipe for a dataloader is as follows:

* Select n examples from the dataset
* (a) Translate each example into integers using `word_to_idx`
* (b) Transform the translated examples to pytorch tensors
* (c) Return the batch 
* Select n new examples from the dataset
* ... repeat steps (a-c)

The dataloader should stop when it has read the whole dataset.

This can be done either by first computing all the batches in the dataset and returning them as a list which you can then iterate over, or as a generator that returns each batch after it has been created.
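
The generator variant of the recipe could look roughly like the sketch below. It is only an illustration under stated assumptions: the name `batch_generator`, the toy vocabulary, and the convention that index 0 is reserved for `<pad>` and that an `<unk>` entry exists are all hypothetical choices, not requirements of the lab.

```python
import torch


def batch_generator(dataset, word_to_idx, batch_size=8, pad_idx=0, unk_token='<unk>'):
    """Yield (targets, contexts) tensor pairs, one batch at a time."""
    unk_idx = word_to_idx[unk_token]
    for start in range(0, len(dataset), batch_size):
        chunk = dataset[start:start + batch_size]            # select n examples
        # (a) translate words to integers, falling back to <unk>
        targets = [word_to_idx.get(t, unk_idx) for t, _ in chunk]
        contexts = [[word_to_idx.get(w, unk_idx) for w in ctx] for _, ctx in chunk]
        # pad contexts so that every row in the batch has the same length
        longest = max(len(ctx) for ctx in contexts)
        contexts = [ctx + [pad_idx] * (longest - len(ctx)) for ctx in contexts]
        # (b) transform to tensors and (c) return the batch
        yield torch.tensor(targets), torch.tensor(contexts)


# toy data, made up for illustration
toy_vocab = {'<pad>': 0, '<unk>': 1, 'this': 2, 'is': 3, 'a': 4}
toy_data = [('this', ['is', 'a']), ('is', ['this', 'a', 'lab']), ('a', ['this', 'is'])]
toy_batches = list(batch_generator(toy_data, toy_vocab, batch_size=2))
```

A generator keeps only one batch in memory at a time, at the cost of having to be re-created for every epoch; a precomputed list can simply be iterated over again.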

[4 marks]

In [6]:
word_to_idx['<unk>'] = len(word_to_idx) # add '<unk>' for unknown (filtered-out) words; this grows the vocabulary by one, which matters for the model dimensions during training

In [8]:
from collections import namedtuple
Batch = namedtuple('Batch', ['target_word', 'context'])

def batcher(dataset, word_to_idx, batch_size=8):
    batches = []
    # iterate over the dataset
    for i in range(0, len(dataset), batch_size):
        batch_data = dataset[i:i+batch_size]  # select a batch of size `batch_size`
        batch_target_words, batch_contexts = zip(*batch_data) # *: unpacking

        # translate batch to integers using `word_to_idx`; words missing from the mapping become '<unk>'
        unk_idx = word_to_idx['<unk>']
        batch_target_word_indices = [word_to_idx.get(word, unk_idx) for word in batch_target_words]
        batch_context_indices = [[word_to_idx.get(word, unk_idx) for word in context] for context in batch_contexts]

        # pad the contexts so that they all have the same length
        # note: padding uses index 0, which `word_to_idx` also assigns to a real word
        max_context_length = max(len(context) for context in batch_context_indices)
        padded_batch_contexts = [context + [0] * (max_context_length - len(context)) for context in batch_context_indices]
   
        # transform the batch to a pytorch tensor
        tensor_batch_target_word = torch.tensor(batch_target_word_indices)
        tensor_batch_context = torch.tensor(padded_batch_contexts)
    
        # collect the individual batches; the full list is returned at the end
        batch = Batch(target_word=tensor_batch_target_word, context=tensor_batch_context)
        batches.append(batch)
    
    return batches

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

**Advantages**:
Lowercasing all tokens helps in standardizing the text data, reducing the complexity of the vocabulary and ensuring that identical words with different casing are treated as the same token. For example, "Apple" and "apple" would be treated as the same word after lowercasing.

**Possible harm**:
Lowercasing can lead to loss of information, especially in cases where the casing of words carries semantic meaning. For example, in named entity recognition tasks, capitalization often indicates proper nouns, which may be lost after lowercasing. Additionally, lowercasing can introduce ambiguity in some cases. For example, the word "US" could represent the United States or the pronoun "us". 

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In [8]:
import torch.optim as optim

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabulary is the most likely center word.
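
To make the tensor shapes concrete, here is a minimal sketch of the three steps just described (look up, sum, score) on toy sizes. All sizes and index values are made up for illustration, and the variable names are not part of the lab skeleton.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 10, 4          # toy sizes, chosen only for illustration
embed = nn.Embedding(vocab_size, embedding_dim)
predict = nn.Linear(embedding_dim, vocab_size)

context = torch.tensor([[1, 2, 4, 5],      # a batch of 2 examples,
                        [3, 4, 6, 7]])     # each with 4 context-word indices
embedded = embed(context)                  # (B, S, D) = (2, 4, 4)
summed = embedded.sum(dim=1)               # (B, D)    = (2, 4): one vector per example
scores = predict(summed)                   # (B, V)    = (2, 10): one score per vocabulary word
predicted_word = scores.argmax(dim=1)      # index of the most likely center word per example
print(embedded.shape, summed.shape, scores.shape, predicted_word.shape)
```

Summation makes the prediction independent of the order of the context words, which is the "bag of words" part of CBOW.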

Implement this model 

[7 marks]

In [9]:
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWModel, self).__init__()
        # where the embeddings of words are stored 
        # each word in the vocabulary should have one embedding assigned to it
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # a transformation that predicts a word from the vocabulary
        self.prediction = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, context):
        # translate a batch to embeddings
        # print("Context:", context)

        embedded_context = self.embeddings(context)  # (B, S, D): B = batch size, S = context length (window size), D = embedding dimension
        # reduce dimensions of the embeddings
        projection = self.projection_function(embedded_context) # (B, D)
        # predict the target word from the vocabulary
        predictions = self.prediction(projection)  # (B, vocab_size)
        
        return predictions
        
    def projection_function(self, xs):
        """
        This function takes as input a tensor of size (B, S, D),
        where B is the batch size, S the window size, and D the dimensionality of the embeddings.
        It sums the embeddings over the window dimension S,
        that is, it transforms (B, S, D) into (B, D).
        """
        xs_sum = torch.sum(xs, dim=1)  # sum over the window size dimension
        return xs_sum

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *for real*, you can use a batch size from [8,16,32,64] and an embedding dimensionality from [128,256].

In [10]:
word_embeddings_hyperparameters = {'epochs':10, # originally 3
                                   'batch_size':16,
                                   'lr':0.001, # originally named 'learning_rate'; renamed to 'lr' to match the code below
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train the Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), which combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then turn the scores into probabilities and calculate the negative log likelihood loss.
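
The relationship between the two losses can be checked on random scores. The sketch below uses made-up sizes (3 examples, a 5-word vocabulary) and simply compares the two ways of computing the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(3, 5)           # raw model outputs: 3 examples, 5-word toy vocabulary
targets = torch.tensor([0, 4, 2])    # gold center-word indices

# shortcut: cross entropy applied directly to the raw scores
ce_loss = nn.CrossEntropyLoss()(scores, targets)

# long way: log-softmax over the vocabulary dimension, then negative log likelihood
nll_loss = nn.NLLLoss()(F.log_softmax(scores, dim=1), targets)

print(ce_loss.item(), nll_loss.item())
```

Because `CrossEntropyLoss` applies the log-softmax itself, the model should return unnormalized scores; adding a softmax layer before it would normalize twice.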

[3 marks]

In [11]:
# load data
batches = batcher(all_data, word_to_idx, batch_size=word_embeddings_hyperparameters['batch_size']) # given dataset, vocab = get_data(..) here

In [12]:
# batches

In [13]:
# build model and construct loss/optimizer
cbow_model = CBOWModel(len(word_to_idx), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['lr'])

# start training loop

for epoch in range(word_embeddings_hyperparameters['epochs']):
    total_loss = 0
    for i, batch in enumerate(batches):
        
        context = batch.context # dim = 2, (16,4), i.e.,(batch_size,context_size)
        target_word = batch.target_word # dim = 1

        context = context.to(device)
        target_word = target_word.to(device)
        
        # send your batch of sentences to the model
        output = cbow_model(context)
        
        # compute the loss, you'll need to reshape the input
        # you can read more about this in the documentation for
        # CrossEntropyLoss
        loss = loss_fn(output, target_word.view(-1)) # assuming target_word is a 1D tensor
        total_loss += loss.item()
        
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad()
      
    # print average loss for the epoch
    epoch_avg_loss = total_loss / len(batches)
    print(f"epoch {epoch+1}, average loss:{epoch_avg_loss:.4f}")
        

epoch 1, average loss:5.9483
epoch 2, average loss:5.6865
epoch 3, average loss:5.5360
epoch 4, average loss:5.4189
epoch 5, average loss:5.3284
epoch 6, average loss:5.2552
epoch 7, average loss:5.1963
epoch 8, average loss:5.1469
epoch 9, average loss:5.1056
epoch 10, average loss:5.0721


## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html, also available in the data folder). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```get_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string.

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = line.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

When we have the integers for ```word1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine similarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the Pearson correlation between the similarities we obtained and the scores given in the dataset.
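
The evaluation procedure can be sketched on toy data as follows. The vectors and gold scores below are made up purely for illustration (they are not taken from WordSim353 or from the trained model), and the helper name `cosine` is hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# toy embeddings and toy gold scores (0-10 scale), invented for this example
toy_embeddings = {
    'cat': np.array([1.0, 0.0, 0.0]),
    'dog': np.array([0.9, 0.1, 0.0]),
    'car': np.array([0.0, 1.0, 0.0]),
    'moon': np.array([0.0, 0.0, 1.0]),
}
toy_gold = {('cat', 'dog'): 9.0, ('cat', 'car'): 3.0, ('cat', 'moon'): 1.0, ('cat', 'zebra'): 5.0}

gold_scores, model_scores = [], []
for (w1, w2), score in toy_gold.items():
    # skip pairs with a word that has no embedding, as in the out-of-vocabulary case
    if w1 not in toy_embeddings or w2 not in toy_embeddings:
        continue
    gold_scores.append(score)
    model_scores.append(cosine(toy_embeddings[w1], toy_embeddings[w2]))

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson correlation
pearson = np.corrcoef(gold_scores, model_scores)[0, 1]
print(f"pairs used: {len(gold_scores)}, Pearson r: {pearson:.3f}")
```

Pearson correlation is insensitive to the difference in scale between the gold scores (0 to 10) and cosine similarities (-1 to 1), so the two lists can be compared directly.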

[4 marks]

In [18]:
# class Field:
#     def __init__(self):
#         self.stoi = {}
#         self.itos = []

# class Vocabulary:
#     def __init__(self):
#         self.vocab = Field()

# def read_data(data_path):
#     # Create a Vocabulary object
#     vocab = Vocabulary()

#     # Read data from data_path and process it
#     data = set()
#     with open(data_path, encoding = 'utf-8') as f:
#         # go over the lines (sentences in the files)
#         for line in f:
#             # split sentences into tokens
#             tokens = line.strip().split(' ')
#             # save all indiviual words to the vocabulary
#             for token in tokens:
#                 data.add(token)

#     # Populate stoi and itos
#     for i, word in enumerate(list(data)):
#         vocab.vocab.stoi[word] = i
#         vocab.vocab.itos.append(word)

#     return vocab

In [19]:
# vocab = read_data(data_path)

In [13]:
embeddings = cbow_model.embeddings.weight.data
# print(embeddings)

In [38]:
def read_wordsim(path, word_to_idx, embeddings): # takes `word_to_idx` directly, so a separate vocab object is no longer needed
    dataset_sims = {} # keyed by word pair, so that we can later analyse which pairs are predicted well or badly
    model_sims = {}
    with open(path, encoding = 'utf-8') as f:
        for line in f:
            
            word1, word2, score = line.split()
            score = float(score)
            
            dataset_sims[word1, word2] = score
            
            model_sims[word1, word2] = None 
            # get the index for the word
            if word1 in word_to_idx and word2 in word_to_idx:
                word1_idx = word_to_idx[word1]
                word2_idx = word_to_idx[word2]
            
                # get the embedding of the word
                word1_emb = embeddings[word1_idx]
                word2_emb = embeddings[word2_idx]
            
                # compute cosine similarity, we'll use the version included in pytorch functional
                # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
                cosine_similarity = F.cosine_similarity(word1_emb.unsqueeze(0), word2_emb.unsqueeze(0)) #  default dim = 1, i.e., the dimension along which cosine similarity is computed

                model_sims[word1,word2] = cosine_similarity.item()

    # print("dataset_sims:",dataset_sims)
    # print("model_sims",model_sims)
    
    return dataset_sims, model_sims

path = 'data/wordsim_similarity_goldstandard.txt'
data, model = read_wordsim(path, word_to_idx, embeddings)
# pearson_correlation = np.corrcoef(data, model)

# deal with NAs
cleaned_data = {}
cleaned_model = {}
for d,m in zip(data.items(), model.items()):
    word_pair,d_sims = d
    _,m_sims = m

    if d_sims is not None and m_sims is not None:
        cleaned_data[word_pair] = d_sims # range:0-10 not similar-similar
        cleaned_model[word_pair] = m_sims
        
cleaned_data_values = np.array(list(cleaned_data.values()))
cleaned_model_values = np.array(list(cleaned_model.values()))

# print("cleaned_data_values:",cleaned_data_values)
# print("cleaned_model_values:",cleaned_model_values)

pearson_correlation = np.corrcoef(cleaned_data_values, cleaned_model_values)
         
# the non-diagonals give the pearson correlation
print(pearson_correlation) 
# record of results (Pearson correlation) with different combinations of hyperparameters
# 3 epochs, batch_size 16, embedding_dim 128, lr 0.001 - 0.3193
# 30 epochs, batch_size 16, embedding_dim 128, lr 0.001 - 0.2941 (lower; possibly overfitting, so adjust hyperparameters, e.g. fewer epochs, larger embedding_dim, smaller batch_size)
# 10 epochs, batch_size 16, embedding_dim 128, lr 0.001 - 0.3434

[[1.         0.34337495]
 [0.34337495 1.        ]]


In [18]:
# how many word pairs were dropped because a word is missing from the vocabulary
print("original:", len(data), "vs.", "cleaned:", len(cleaned_data))

original: 203 vs. cleaned: 144


Do you think the model performs well or badly? Why?

[3 marks]

A correlation coefficient of *0.34* indicates a positive but weak linear relationship between the model's cosine similarities and the human similarity scores. The model ranks some word pairs roughly as humans do, but is unreliable for many others. Note also that the correlation is computed on only 144 of the 203 word pairs, because the remaining pairs contain a word that is missing from our vocabulary.

Taken together, **the model's performance is not very good, but it is not very bad either**, given that it was trained on only 50 000 sentences. There is room for improvement: we can try adjusting the model architecture, the hyperparameters, and so on. We can also analyze which word pairs are predicted well and which are predicted badly, to better understand the model's behaviour and where to improve it.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

In [49]:
# main idea: the smaller the absolute rank difference, the better the model did on that word pair
# (ranks are compared rather than raw values because the two scales differ: 0-10 scores vs. cosine similarities)
# compute the ranks
data_ranks = {pair: rank + 1 for rank, (pair, sim) in enumerate(sorted(cleaned_data.items(), key=lambda x: x[1]))}
model_ranks = {pair: rank + 1 for rank, (pair, sim) in enumerate(sorted(cleaned_model.items(), key=lambda x: x[1]))}

# generate a sorted dict
rank_diffs = {}
for word_pair in data_ranks:
    data_rank = data_ranks[word_pair]
    model_rank = model_ranks[word_pair]
    rank_diffs[word_pair] = abs(data_rank - model_rank)

sorted_rank_diffs = sorted(rank_diffs.items(), key=lambda x: x[1])
sorted_rank_diffs = dict(sorted_rank_diffs)
# print("rank_diffs:",rank_diffs)
# print("sorted_rank_diffs:",sorted_rank_diffs)

best_pairs = list(sorted_rank_diffs)[:10]
worst_pairs = list(sorted_rank_diffs)[-10:]

# print
print("Top 10 best performing word pairs:")
for pair in best_pairs:
    print(pair)
    print("the absolute rank difference:",dict(sorted_rank_diffs)[pair])
    print("model_sim:",cleaned_model[pair],",","dataset_sim:",cleaned_data[pair])
    print("model_rank:",model_ranks[pair],",","dataset_rank:",data_ranks[pair])

print("\nTop 10 worst performing word pairs:")
for pair in worst_pairs:
    print(pair)
    print("the absolute rank difference:",dict(sorted_rank_diffs)[pair])
    print("model_sim:",cleaned_model[pair],",","dataset_sim:",cleaned_data[pair])
    print("model_rank:",model_ranks[pair],",","dataset_rank:",data_ranks[pair])


Top 10 best performing word pairs:
('coast', 'shore')
the absolute rank difference: 0
model_sim: 0.4394223392009735 , dataset_sim: 9.1
model_rank: 142 , dataset_rank: 142
('viewer', 'serial')
the absolute rank difference: 1
model_sim: 0.009626008570194244 , dataset_sim: 2.97
model_rank: 30 , dataset_rank: 31
('planet', 'moon')
the absolute rank difference: 1
model_sim: 0.23456528782844543 , dataset_sim: 8.08
model_rank: 130 , dataset_rank: 129
('delay', 'racism')
the absolute rank difference: 2
model_sim: -0.057623568922281265 , dataset_sim: 1.19
model_rank: 8 , dataset_rank: 6
('energy', 'secretary')
the absolute rank difference: 2
model_sim: -0.05217861384153366 , dataset_sim: 1.81
model_rank: 12 , dataset_rank: 14
('attempt', 'peace')
the absolute rank difference: 2
model_sim: 0.08271899819374084 , dataset_sim: 4.25
model_rank: 67 , dataset_rank: 69
('psychology', 'discipline')
the absolute rank difference: 2
model_sim: 0.1256367564201355 , dataset_sim: 5.58
model_rank: 91 , dataset

From the results, we can see that the embeddings from the CBOW model rate some similar word pairs as dissimilar, e.g., ('announcement', 'news'), and assign a relatively high similarity to some dissimilar pairs, e.g., ('direction', 'combination').

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

We interpret *the model* here as any machine learning model used to predict the similarity of word pairs. Improving its performance may involve:

**Fine-tuning Pretrained Word Embeddings**: Pretrained word embeddings like *Word2Vec*, *GloVe*, or *fastText* can be fine-tuned on a dataset that is more specific to our task, such as a corpus similar to the one used in WordSim353. Fine-tuning allows the embeddings to capture *domain-specific* semantics better.

**Model Architecture Tuning**: We can experiment with different neural network architectures such as *LSTM*, *Transformer* or *BERT* for learning word embeddings. Within the network, the *embedding_dim*, *number of layers* and *output_size* are all adjustable.

**Hyperparameters Tuning**: Optimizing hyperparameters like *batch_size*, *learning_rate* and *epochs* can significantly impact the performance of the model. We can use grid search or random search to find the optimal combination of hyperparameters.

Consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative).

Give some examples of why the sentiment analysis model would benefit from our embeddings, and one example of why our embeddings could hurt the performance of the sentiment model.

[3 marks]

**Benefits**:
Word embeddings capture semantic information about words, including their meanings and relationships with other words. This helps the sentiment analysis model because words with similar meanings get similar vectors, so what the model learns about one word (e.g. "great") can transfer to related words (e.g. "excellent"). This generalization also lets the model handle sentences containing words that were absent from the sentiment training data, as long as they are in the embedding vocabulary. Note, however, that our embeddings are static: a polysemous word such as "light" gets a single vector regardless of context, so inferring its exact part of speech and meaning in a sentence would require contextual (dynamic) embeddings.

**Potential harm**:
Word embeddings can inherit biases present in the training data, which may adversely affect the sentiment analysis model's predictions. For example, if the word embeddings encode gender or racial biases, these biases may influence the sentiment analysis model's predictions in unintended ways.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predicts the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-corpus.50000.txt) as before.

Our setup is similar to before: we first encode the words as distributed representations, then pass these to the LSTM, and for each output we predict the next word.

To create a gold standard (what we want to predict), we need to manipulate the tensor containing the sentence. As we want to predict the *next* word, we want the following setup (where `w_n` is the index of a word in the sentence, `x` is the input words, and `y` is the gold words):

$x = [w_0, w_1, w_2, w_3, w_4]$

$y = [w_1, w_2, w_3, w_4, w_5]$

That is, to create the gold standard we need to shift the index `n` of the input by `+1`, as this gives us the next word.
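
The shift can be sketched with plain slicing. The index values below are hypothetical stand-ins for $w_0, \dots, w_5$.

```python
sentence = [11, 12, 13, 14, 15, 16]   # hypothetical indices for w_0 ... w_5

x = sentence[:-1]   # inputs:     w_0 ... w_4 (everything except the last word)
y = sentence[1:]    # gold words: w_1 ... w_5 (everything except the first word)

for inp, gold in zip(x, y):
    print(f"input {inp} -> predict {gold}")
```

For a batched tensor of shape `(B, S)` the same idea applies along the sequence dimension: `batch[:, :-1]` gives the inputs and `batch[:, 1:]` gives the gold words.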


For this we'll build a new dataloader, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

In this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with an ```<end>``` token. Other than that, just as before, you read the dataset and output an iterator over the dataset, a vocabulary, and a mapping from words to indices.

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).
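
One difference from the word2vec training loop is that the language model produces a score vector for *every* position in the sentence, so the output has to be flattened before it is passed to the loss. The sketch below illustrates this with random numbers standing in for the model output. It assumes that index 0 is reserved for padding (something the vocabulary construction would have to guarantee) and uses the `ignore_index` argument of `CrossEntropyLoss` so that padded positions do not contribute to the loss; all sizes and values are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0                                    # assumption: index 0 is reserved for padding
batch = torch.tensor([[5, 3, 7, 2, PAD],   # two padded toy sentences, shape (B, S) = (2, 5)
                      [5, 4, 2, PAD, PAD]])
x, y = batch[:, :-1], batch[:, 1:]         # inputs and shifted gold words, both (B, S-1)

vocab_size = 8
# stand-in for the language model output: one score per vocabulary word at each position
scores = torch.randn(x.size(0), x.size(1), vocab_size)   # (B, S-1, V)

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
# flatten to (B*(S-1), V) and (B*(S-1),) as expected by CrossEntropyLoss
loss = loss_fn(scores.reshape(-1, vocab_size), y.reshape(-1))
print(loss.item())
```

`reshape` is used instead of `view` because `y` is a slice of `batch` and therefore not contiguous in memory.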

[12 marks]

In [9]:
# you can change these numbers to suit your needs as before
lm_hyperparameters = {'epochs':3,
                      'batch_size':16,
                      'lr':0.001, # originally named 'learning_rate'; renamed to 'lr' to match the code below
                      'embedding_dim':300,
                      'hidden_dim':128} # originally 'output_dim', but the output size usually equals vocab_size

In [10]:
lm_Batch = namedtuple('Batch', 'sentence')

In [11]:
data_path = 'data/wiki-corpus.50000.txt'

PAD_IDX = 0 
def get_data(path, batch_size = 16, min_freq = 4):
    # roughly the same as the word2vec dataloader
    all_sents = []
    vocab = []
    with open(path, encoding = 'utf-8') as f:
        # go over the lines (sentences in the file)
        for line in f:
            # split sentences into tokens
            tokens = line.strip().split(' ')
            
            # insert '<start>' and '<end>' respectively to compose a standard token list
            tokens_std = ['<start>']
            tokens_std.extend(tokens)
            tokens_std.append('<end>')
            for token in tokens_std:
                vocab.append(token)

            all_sents.append(tokens_std) 

    # print("vocab:",vocab)
        
    # filter out words that do not occur often
    word_counts = Counter(vocab)
    
    filtered_vocab = {word for word, count in word_counts.items() if count >= min_freq}
    # print("filtered_vocab:",filtered_vocab)

    # create a mapping from words to integers.
    # each word should have a unique integer mapped to it.
    # use a dictionary for this.
    word_to_idx = {word: idx for idx, word in enumerate(filtered_vocab)}

    # Convert sentences to indices (words that were filtered out above are dropped from the sentence)
    all_sents_idx = []
    for sent in all_sents:
        sent_idx = [word_to_idx[word] for word in sent if word in word_to_idx]
        all_sents_idx.append(sent_idx)

    # Generate batches
    batches = []
    for i in range(0, len(all_sents_idx), batch_size):
        batch = all_sents_idx[i:i+batch_size]
        max_len = max(len(sent) for sent in batch)
        padded_batch = [sent + [PAD_IDX] * (max_len - len(sent)) for sent in batch]
        batches.append(lm_Batch(sentence=torch.tensor(padded_batch)))
    
    return batches, word_to_idx
  

In [12]:
dataset, vocab = get_data(data_path, batch_size = lm_hyperparameters['batch_size'], min_freq = MIN_FREQ)

In [22]:
dataset[:10]

[Batch(sentence=tensor([[10882,  9647,  2422, 20108, 16669,  6066, 16128,  4059,  5053,  3105,
          14811, 12765, 16069, 18333, 18548, 14000, 19867,  9503, 18875, 16645,
          16719, 12765, 19621, 14811, 12765, 10315, 14125, 14405,  9503,  7806,
          12765,  3769, 18640, 18835, 20559, 17499, 12393, 11545,  9503, 10105,
           8318,  9096,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [10882, 17401, 11709,  8142,  6250,  1486, 11349, 13469,  9941, 17098,
          12765,  8465, 14811, 12765,  9161, 14125,  4638, 11349, 18017,  8318,
           9096,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [10882, 16821, 12765, 17563,     4, 14125,  6130,  5838,  4425,  9355,
          12765, 11413, 14405, 14320,  3721,  8318,  90

In [24]:
# vocab

{'flew': 0,
 'Corners': 1,
 'Atkinson': 2,
 'Pat': 3,
 'panic': 4,
 'stir': 5,
 'partial': 6,
 'tsunami': 7,
 'programmers': 8,
 'From': 9,
 'behaviour': 10,
 'Hayek': 11,
 'Twitter': 12,
 'phosphorylation': 13,
 'inadequate': 14,
 'Bonnie': 15,
 'Sheffield': 16,
 'Wight': 17,
 'Apollo': 18,
 'quotas': 19,
 'accused': 20,
 'taxes': 21,
 'lowest': 22,
 'Gates': 23,
 'borrow': 24,
 'Yahoo': 25,
 'scouting': 26,
 'maintained': 27,
 'Lee': 28,
 'owned': 29,
 'could': 30,
 'parody': 31,
 'credits': 32,
 'large-scale': 33,
 'later': 34,
 'converting': 35,
 'successfully': 36,
 'Cox': 37,
 'experimentation': 38,
 'delivery': 39,
 'poll': 40,
 'Word': 41,
 'Holding': 42,
 'renovated': 43,
 'detect': 44,
 '1750': 45,
 'ICE': 46,
 'I-95': 47,
 'crews': 48,
 'Gregorian': 49,
 'baker': 50,
 'Doppler': 51,
 'Senegal': 52,
 'presence': 53,
 'fighting': 54,
 'Morocco': 55,
 '37.5': 56,
 'liberties': 57,
 'curved': 58,
 'communion': 59,
 'al': 60,
 'somewhat': 61,
 'Egg': 62,
 'withdraw': 63,
 'confro

In [14]:
class LM_withLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(LM_withLSTM, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # the batches have shape (batch, sequence), so the LSTM needs batch_first=True
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
        self.predict_word = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, seq):
        # extract embeddings for the sentence
        embedded_seq = self.embeddings(seq)
        # compute contextual representations
        timestep_representation, *_ = self.LSTM(embedded_seq)
        # predict a token from the vocabulary at each timestep
        predicted_words = self.predict_word(timestep_representation)
        
        return predicted_words

In [29]:
!nvidia-smi

Sun May  5 03:11:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:03:00.0 Off |                  N/A |
| 35%   54C    P8              18W / 250W |  11094MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:04:00.0 Off |  

In [30]:
! ps -aux | grep python

root        1645  0.0  0.0 361436 34972 ?        Ssl  Jan08 153:04 /usr/bin/python3 -sP /usr/sbin/firewalld --nofork --nopid
root        2099  0.0  0.0 864440 57848 ?        Ssl  Jan08  46:56 /usr/bin/python3 -sP /usr/bin/fail2ban-server -xf start
gusankb+  483376  0.0  0.9 49632304 2598288 ?    Ssl  Feb17 103:36 /usr/bin/python3 -m ipykernel_launcher -f /home/gusankba@GU.GU.SE/.local/share/jupyter/runtime/kernel-506edec7-8bc7-4d48-aed3-45e9ed3cb413.json
gusankb+ 2234037  0.0  0.0 1075372 21120 ?       Sl   Feb05  24:34 /usr/bin/python3 /usr/local/bin/jupyter-lab --no-browser --port=8137
gusankb+ 2726346  0.0  0.0 1001152 13048 ?       Ssl  Feb11   2:41 /usr/bin/python3 -m ipykernel_launcher -f /home/gusankba@GU.GU.SE/.local/share/jupyter/runtime/kernel-d5226100-067b-4428-9e0d-b629d69ba4ce.json
gusankb+ 2726348  0.0  0.0 1001152 13700 ?       Ssl  Feb11   2:43 /usr/bin/python3 -m ipykernel_launcher -f /home/gusankba@GU.GU.SE/.local/share/jupyter/runtime/kernel-a607c80e-cf9f-4e01-bbd7-d

In [None]:
!pip install torch torchvision torchaudio

In [15]:
# load data
dataset, vocab = get_data(data_path, batch_size = lm_hyperparameters['batch_size'], min_freq = MIN_FREQ)

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['hidden_dim'])
lm_model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['lr'])

# start training loop
for epoch in range(lm_hyperparameters['epochs']):
    # reset the running loss so that the reported average covers one epoch only
    total_loss = 0
    for i, batch in enumerate(dataset):
        
        # the structure of each sentence in the BATCH is:
        # <start>, w0, ..., wn, <end>
        sentence = batch.sentence
        sentence = sentence.to(device)
        
        # when training the model, at each input we predict the *NEXT* token
        # consequently there is nothing to predict when we give the model 
        # <end> as input. 
        # thus, we do not want to give <end> as input to the model, select 
        # from each batch all tokens except the last. 
        # the batch has shape (batch, sequence), so we slice the second dimension.
        # tip: use pytorch indexing/slicing (same as numpy) 
        # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
        # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
        input_sentence = sentence[:, :-1]
        
        # send your batch of sentences to the model
        output = lm_model(input_sentence)
        
        # for each output, the model predicts the NEXT token, so we have to reshape 
        # our dataset again. On timestep t, we evaluate on token t+1. That is,
        # we never predict the <start> token ;) so this time, we select all but the first 
        # token from sentences (that is, all the tokens that we predict)
        gold_data = sentence[:, 1:]
        
        # the shape of the output and sentence variable need to be changed,
        # for the loss function. Details are in the documentation.
        # gold_data is a non-contiguous slice, so we use .reshape(...) rather than .view(...)
        output = output.reshape(-1, len(vocab))
        gold_data = gold_data.reshape(-1)
        
        loss = loss_fn(output, gold_data)
        total_loss += loss.item()
        
        # # print average loss for the epoch
        # print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # reset gradients
        optimizer.zero_grad()
        
    # print average loss for the epoch
    epoch_avg_loss = total_loss / len(dataset)
    print(f"epoch {epoch+1}, average loss:{epoch_avg_loss:.4f}")

RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (8, 9, 2) but found runtime version (8, 2, 2). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.Looks like your LD_LIBRARY_PATH contains incompatible version of cudnnPlease either remove it from the path or install cudnn (8, 9, 2)
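
Independently of the cuDNN error, the input/target shift described in the training-loop comments can be checked on a toy batch. The sketch below uses plain Python lists and a hypothetical `shift_for_next_token` helper. For a tensor of shape (batch, sequence), the equivalent slices are `sentence[:, :-1]` and `sentence[:, 1:]`, whereas `sentence[:-1]` slices the first dimension and therefore drops a whole sentence.

```python
def shift_for_next_token(batch):
    """Split a (batch, sequence) list of token ids into inputs and gold targets."""
    inputs = [sent[:-1] for sent in batch]   # every sentence without its last token
    targets = [sent[1:] for sent in batch]   # every sentence without its first token
    return inputs, targets


# toy ids: 1 = <start>, 2 = <end>, 0 = padding
toy_batch = [[1, 5, 6, 2], [1, 7, 2, 0]]
inputs, targets = shift_for_next_token(toy_batch)

# slicing the outer list instead removes the last sentence, not the last token
rows_only = toy_batch[:-1]

print(inputs)
print(targets)
print(rows_only)
```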

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0


In [3]:
import torch
print(torch.version.cuda)

12.1


In [5]:
!cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#endif /* CUDNN_VERSION_H */


In [26]:
!echo $LD_LIBRARY_PATH

/usr/local/hpc_sdk/Linux_x86_64/23.9/comm_libs/nvshmem/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/comm_libs/nccl/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/math_libs/lib64:/usr/local/hpc_sdk/Linux_x86_64/23.9/compilers/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/compilers/extras/qd/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/cuda/extras/CUPTI/lib64:/usr/local/hpc_sdk/Linux_x86_64/23.9/cuda/lib64:/usr/local/lib/python3.11/site-packages/tensorrt_libs


In [25]:
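# note: `!` runs the command in a temporary subshell, so this export does not
# persist into the notebook kernel; the variable has to be changed in the shell
# that starts the kernel (or the Jupyter server)
!export LD_LIBRARY_PATH=$(echo $LD_LIBRARY_PATH | sed 's|:/usr/local/hpc_sdk/Linux_x86_64/23.9/cuda/lib64||g')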
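
The `sed` expression removes one entry from a colon-separated search path. The same filtering can be sketched in Python, for instance to compute the value to export in the shell that launches Jupyter. The `drop_path_entry` helper is hypothetical, and whether removing this entry actually resolves the cuDNN mismatch is an assumption that the notebook does not confirm.

```python
def drop_path_entry(path_value, entry):
    """Remove every occurrence of `entry` from a colon-separated path string."""
    parts = [p for p in path_value.split(':') if p and p != entry]
    return ':'.join(parts)


# shortened, made-up path used only for illustration
example = '/opt/a/lib:/usr/local/hpc_sdk/Linux_x86_64/23.9/cuda/lib64:/opt/b/lib'
cleaned = drop_path_entry(example, '/usr/local/hpc_sdk/Linux_x86_64/23.9/cuda/lib64')
print(cleaned)
```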

In [None]:
#!conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local/miniconda3

  added / updated specs:
    - pytorch-cuda=12.1
    - pytorch==2.2.2
    - torchaudio==2.2.2
    - torchvision==0.17.2


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blas-1.0                   |              mkl           6 KB
    boltons-23.0.0             |   py39h06a4308_0         426 KB
    bzip2-1.0.8                |       h5eee18b_6         262 KB
    ca-certificates-2024.3.11  |       h06a4308_0         127 KB
    certifi-2024.2.2           |   py39h06a4308_0         159 KB
    conda-23.5.2               |   py39h06a4308_0         1.0 MB
    cuda-cudart-12.1.105       |                0        

### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena. We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigates whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```
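
A minimal sketch of such a reader, assuming one JSON object per line as in the example entry. The `read_blimp_pairs` name is hypothetical, and the function accepts any iterable of lines so that it can be tried without downloading the file.

```python
import json


def read_blimp_pairs(lines):
    """Return a list of (good_sentence, bad_sentence) tuples from JSONL lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines, e.g. a trailing newline at the end of the file
        entry = json.loads(line)
        pairs.append((entry['sentence_good'], entry['sentence_bad']))
    return pairs


sample = [
    '{"sentence_good": "There was a documentary about music irritating Allison.", '
    '"sentence_bad": "There was each documentary about music irritating Allison.", '
    '"pairID": "0"}',
]
pairs = read_blimp_pairs(sample)
print(pairs[0])
```

For the real data the same function can be given an open file handle, e.g. `with open(path) as f: pairs = read_blimp_pairs(f)`.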

Your task now is to compare the probability assigned to the good sentence with the probability assigned to the bad sentence. To compute the probability of a sentence, we take the product of the probabilities assigned to the *gold* tokens. Remember that at timestep ```t``` we're predicting which token comes *next*, i.e. the token at ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```
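
The product of many small probabilities can underflow for long sentences. Since the logarithm is monotonic, comparing sums of log-probabilities gives the same decision as comparing products. The sketch below illustrates the gold-token selection with plain Python lists. The `softmax` and `sentence_log_prob` helpers and the toy scores are assumptions made for illustration, not the model's actual output.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_log_prob(step_scores, token_ids):
    """step_scores[t] holds the scores over the vocabulary after reading token_ids[t]."""
    log_prob = 0.0
    # the prediction made at timestep t is scored against the token at t+1
    for t, gold in enumerate(token_ids[1:]):
        probs = softmax(step_scores[t])
        log_prob += math.log(probs[gold])
    return log_prob


# toy vocabulary of three tokens: 0 = <start>, 1 = some word, 2 = <end>
token_ids = [0, 1, 2]
step_scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform predictions
lp = sentence_log_prob(step_scores, token_ids)
print(lp, math.exp(lp))
```

With uniform predictions over three tokens, each of the two gold tokens gets probability 1/3, so the sentence probability is 1/9.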

[6 marks]

In [None]:
# your code goes here
import json

import numpy as np
import torch
import torch.nn.functional as F

def evaluate_model(path, vocab, model):
    
    accuracy = []
    # switch to evaluation mode and disable gradient tracking
    model.eval()
    with open(path) as f, torch.no_grad():
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is tokenized by whitespace; add the same <start>/<end>
            # markers that get_data adds to the training sentences
            tok_good_s = ['<start>'] + good_s.split() + ['<end>']
            tok_bad_s = ['<start>'] + bad_s.split() + ['<end>']
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # words that are missing from the vocab are skipped, as in get_data
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([vocab[word] for word in tok_good_s if word in vocab], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([vocab[word] for word in tok_bad_s if word in vocab], device=device).unsqueeze(0)
            
            # pass all tokens except the last to the model and predict the next tokens
            good_out = model(enc_good_s[:, :-1])
            bad_out = model(enc_bad_s[:, :-1])
            
            # get probabilities with softmax over the vocabulary (last) dimension
            gs_probs = F.softmax(good_out, dim=-1)
            bs_probs = F.softmax(bad_out, dim=-1)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy
            
def find_token_probs(model_probs, encoded_sentence):
    probs = []

    # the output at timestep t predicts token t+1, so the gold tokens are
    # all tokens of the (1, S) encoded sentence except the first
    gold_tokens = encoded_sentence[0, 1:]
    # iterate over the gold tokens in your encoded sentence
    for timestep, gold_token in enumerate(gold_tokens):
        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        prob = model_probs[0, timestep, gold_token].item()
        probs.append(prob)
    sentence_prob = np.prod(probs)
    return sentence_prob

path = 'data/existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, vocab, lm_model)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))


### Analysis

Our model gets some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

A common baseline model for binary classification tasks is a **random guessing baseline**:

This baseline picks one of the two sentences in each pair at random. Every item in the dataset is a forced choice between a good and a bad sentence, so the baseline assigns each pair a label (e.g., 0 or 1) with equal probability.

Because every pair contains exactly one good and one bad sentence, the task is balanced and the random baseline is expected to reach an accuracy close to 50%. If our model performs clearly better than this random baseline, it suggests that it is learning meaningful patterns in the data. If it performs at or below the random baseline, it indicates that the model is not capturing useful information for this task.

If the dataset were imbalanced, meaning that it contained significantly more instances of one class than of the other, the random baseline accuracy would be different. In such cases, we would need to adjust the baseline accordingly, for example by always predicting the majority class.
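
To judge whether a score such as 55% is clearly above the random baseline, one can compute how likely a coin-flipping guesser is to reach it by luck. The sketch below uses an exact binomial tail. The `chance_tail` helper is hypothetical, and the figure of 1000 sentence pairs is an assumption about the size of the evaluation file.

```python
from math import comb


def chance_tail(n_pairs, n_correct):
    """Probability that random guessing gets at least n_correct of n_pairs right."""
    favourable = sum(comb(n_pairs, k) for k in range(n_correct, n_pairs + 1))
    return favourable / 2 ** n_pairs


# assumption: 1000 pairs, of which the model gets 550 (55%) right
p_lucky = chance_tail(1000, 550)
print(p_lucky)
```

A small value means that the score is unlikely to be the result of guessing alone. With far fewer pairs, the same 55% would be much less convincing.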

Suggest some improvements you could make to your language model.

[3 marks]

Suggest some other metrics we can use to evaluate our system.

[2 marks]

# Literature


Neural architectures:

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)

[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

The assignment is marked on a 7-level scale where 4 is sufficient to complete the assignment; 5 is good solid work; 6 is excellent work that covers most of the assignment; and 7 is creative work. 

This assignment has a total of 63 marks. These translate to grades as follows: 1 = 17%, 2 = 34%, 3 = 50%, 4 = 67%, 5 = 75%, 6 = 84%, 7 = 92%, where the percentages are interpreted as lower bounds to achieve that grade.