    („ÄÉÔø£Ô∏∂Ôø£)‰∫∫(„ÄÉÔø£Ô∏∂Ôø£„ÄÉ)‰∫∫(„ÄÉÔø£Ô∏∂Ôø£„ÄÉ)‰∫∫(„ÄÉÔø£Ô∏∂Ôø£„ÄÉ)‰∫∫(Ôø£Ô∏∂Ôø£„ÄÉ)
      Judit      Minerva       Patri        Ranim       Adam

# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model to find the sense of a word (homonyms of a word-form). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In case a) and b) we can determine that the meaning of "bank" based on the *context*. To utilize context in a semantic model we use *contextualized word representations*. Previously we worked with *static word representations*, i.e. the representation does not depend on the context. To illustrate we can consider sentences (a) and (b), the word **bank** would have the same static representation in both sentences, which means that it becomes difficult for us to predict its sense. What we want is to create representations that depend on the context, i.e. *contextualized embeddings*. 

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural netoworks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 15 different words. 

In [16]:
# first we import some packages that we need
import torch
import torch.nn as nn
import torchtext
import torch.optim as optim
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset
import numpy as np
# our hyperparameters (add more when/if you need them)
device = torch.device('cuda:1')

batch_size = 4
learning_rate = 0.001
epochs = 3
max_grad_norm = 5

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contain different word sense for 15 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context

### Splitting the data

Your first task is to seperate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

In [2]:
from sklearn.model_selection import train_test_split

path = 'wsd-data/wsd_data.txt'

def data_split(path_to_dataset):
    
    with open(path_to_dataset) as d:
        di = d.readlines()
    
        train, test = train_test_split(di, test_size=0.20, random_state=0)

        with open('train.csv', 'w') as g:
            for l in train:
                g.write(l)
        with open('test.csv', 'w') as h:
            for l in test:
                h.write(l)

data_split(path)

### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model, given a very simple heuristic/algorithmic/model solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the most commonly assigned sense to the word, and label a words with that sense. **[2 marks]**

E.g. In a fictional dataset, "bank" have two senses, "financial institution" which occur 5 times and "side of river" 3 times. Thus, all 8 occurences of bank is labeled "financial institution" and this yields an MCS accuracy of 5/8 = 62.5%. If a model obtain a higher score than this, we can conclude that the model *at least* is better than selecting the most frequent word sense.

In [2]:
def mcs_baseline(data):
    result = {}
    with open(data) as file:
        for line in file.readlines():
            split_line = line.split("\t")
            if split_line[1] in result:
                if split_line[0] in result[split_line[1]]:
                    result[split_line[1]][split_line[0]] += 1
                else:
                    result[split_line[1]][split_line[0]] = 1
            else:
                result[split_line[1]] = {split_line[0]: 1}
    
    # select most common sense for words
    for key, value in result.items():
        result[key] = max(value, key=value.get)
        # print(key, len(value.values()))
        
    return result

baseline = mcs_baseline("train.csv") 

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators than can sort and shuffle the examples. 

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is a function which tokenize the input, keep a dictionary of word-to-numbers, and fix paddings. So, we need four fields, one for the word-sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary, for this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, the last step. In the last step we load the data objects given by the ``TabularDataset`` and pass it to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them around (such that for each epoch the model observe the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, then we are ready to train and test our model!

Implement the dataloader. [**2 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 

In [3]:
def dataloader(path):
    
    whitespacer = lambda x: x.split(' ')

    # "fields" that process the different columns in our CSV files
    WORDSENSE = Field(tokenize    = whitespacer,
               lower       = True,
               batch_first = True)

    LEMMA = Field(tokenize    = whitespacer,
                  lower       = True,
                  batch_first = True)
    
    POSITION = Field(tokenize    = whitespacer,
                     sequential = False,
                     use_vocab = False, # To make sure you don't use the vocabulary for this field
                     batch_first = True)
    
    CONTEXT = Field(tokenize    = whitespacer,
                    lower       = True,
                    batch_first = True)
    
    # read the csv files
    train, test = TabularDataset.splits(path   = path,
                                        train  = 'train.csv',
                                        test   = 'test.csv',
                                        format = 'csv',
                                        fields = [('sense', WORDSENSE),
                                                  ('lemma', LEMMA),
                                                 ('position', POSITION),
                                                 ('context', CONTEXT)],
                                        skip_header       = True,
                                        csv_reader_params = {'delimiter':'\t',
                                                             'quotechar':'¬Ω'})
    
    # build vocabularies based on what our csv files contained and create word2id mapping
    WORDSENSE.build_vocab(train) #, min_freq=3) 
    LEMMA.build_vocab(train)
    CONTEXT.build_vocab(train)

    # create batches from our data, and shuffle them for each epoch
    train_iter, test_iter = BucketIterator.splits((train, test),
                                                  batch_size        = batch_size,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.lemma),
                                                  shuffle           = True,
                                                  device            = device)

    return train_iter, test_iter, WORDSENSE, LEMMA, POSITION, CONTEXT

# 2.1 Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long-Short-Term Memory (LSTM) network to create a representation for the sentences and a Linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) A LSTM-module to obtain contextual representations
    3) A classifier that compute scores for each word-sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your tasks will be to create two different models (both follow the two outlines described above), described below:

In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[5 marks]**


Adam: "So in the first approach we want to use the LSTM (contextual) representation of the ambiguous word to predict its sense, so we need to **extract that representation and pass it to our prediction layer**"

In [58]:
class WSDModel_approach1(nn.Module):
    def __init__(self, num_words, num_senses, i_dim, o_dim):
        super(WSDModel_approach1, self).__init__()
        self.embeddings = nn.Embedding(num_words, i_dim) 
        self.rnn = nn.LSTM(i_dim, o_dim, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(o_dim*2, num_senses) # the output of the lstm (TIMES TWO) is the input of the linnear
    
    def forward(self, batch):
        embedded_batch = self.embeddings(batch.context)
        hidden_states, (final_hidden, cell_state) = self.rnn(embedded_batch) # you put the embedding in the context
        # hidden_states => multidim matrix of size [batch_size, sequence_lentght, o_dim]
        # select by index from hidden_states
        word = hidden_states[range(hidden_states.shape[0]), batch.position]
        
        return self.classifier(word)

In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[5 marks]**

In [59]:
class WSDModel_approach2(nn.Module):
    def __init__(self, num_words, num_senses, i_dim, o_dim):
        super(WSDModel_approach2, self).__init__()
        self.embeddings = nn.Embedding(num_words, i_dim)
        self.rnn = nn.LSTM(i_dim, o_dim, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(o_dim*2, num_senses)
    
    def forward(self, batch):
        embedded_batch = self.embeddings(batch.context) 
        hidden_states, (final_hidden, cell_state) = self.rnn(embedded_batch)
        
        not_cool_approach = torch.cat((final_hidden[0], final_hidden[1]), dim=1)
        
        # cool_approach = torch.cat((final_hidden[0], cell_state[0], final_hidden[1], cell_state[1]), dim=1)
        
        # concatenate forward and backward OR concatenate final_hidden with cell_state
        pred = self.classifier(not_cool_approach)
        
        # pred = self.classifier(cool_approach)

        return pred 

### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [60]:
def train(model, train_iter):

    loss = nn.CrossEntropyLoss(reduction='mean')
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    model.train()

    for e in range(epochs):
        epoch_loss = 0


        for i, batch in enumerate(train_iter): 
            sentences = batch.context
            senses    = batch.sense

            # run sentences through the model
            output = model(batch) 

            # compute loss
            # output: from (B, L, C) to (B*L, C)
            # labels: from (B, C) to (B*C)
            # where B = batch size, L = sequence length and C the number of labels

            batch_loss  = loss(output, senses.view(-1))
            epoch_loss += batch_loss.item()

            # report results
            print(e, (i+1)*sentences.size(0), np.round(epoch_loss/(i+1),4),
                  end='\r')

            # calculate gradients
            batch_loss.backward()
            
            # update model weights
            optimizer.step()
            
            # reset gradients
            optimizer.zero_grad()
            
        print()
    return model

def test_model(model, test_iter):
    
    loss = nn.CrossEntropyLoss(reduction='mean')
    
    model.eval()
    # test model after all epochs are completed
    test_loss = 0

    # iterate over the test data and compute the class probabilities, same
    # procedure as before, but now we don't backpropagate
    
    correct_guesses = 0
    
    for i, batch in enumerate(test_iter):
    #     sentences = batch.context
        senses    = batch.sense

        with torch.no_grad(): # don't collect gradients when testing
            output = model(batch)

        batch_loss = loss(output.view(-1,num_senses), senses.view(-1))
        test_loss += batch_loss.item()

        # finding accuracy
        correct_guesses += torch.sum(torch.eq(torch.argmax(output, dim=1), senses.view(-1)).long())
    
    accuracy = int(correct_guesses) / ((i+1) * batch_size)

    print('>', np.round(test_loss/(i+1), 4))
    print('accuracy: ', accuracy)

In [61]:
# In case of a RuntimeError: The NVIDIA driver on your system is too old (found version 10010),
# run the following command: pip install torch==1.3.1+cu100 torchvision==0.4.2+cu100 -f https://download.pytorch.org/whl/torch_stable.html            

path_to_folder = '.'
train_iter, test_iter, sense, lemma, position, context = dataloader(path_to_folder)

num_words  = len(context.vocab)
num_senses = len(sense.vocab)

model1 = WSDModel_approach1(num_words, num_senses, 50, 50).to(device) 
model2 = WSDModel_approach2(num_words, num_senses, 50, 50).to(device) 

In [62]:
m1 = train(model1, train_iter)

0 60840 1.32412.08991.7512 1.3407
1 60840 0.868319776 0.88150.87670.87590.872153428 0.8696
2 60840 0.71080.7050.7065


In [63]:
m2 = train(model2, train_iter)

0 60840 4.8295.12864.959423832 4.95894.8797
1 60840 4.0989.48194.42934.3796
2 30420 2.7269


In [64]:
test_model(m1, test_iter)

> 0.9117
accuracy:  0.6888640546936629


In [65]:
test_model(m2, test_iter)

> 2.2046
accuracy:  0.41171443597160134


# 2.2 Running a transformer for WSD

In this section of the lab you'll try out the transformer, specifically the BERT model. For this we'll use the huggingface library (https://huggingface.co/).

You can find the documentation for the BERT model here (https://huggingface.co/transformers/model_doc/bert.html) and a general usage guide here (https://huggingface.co/transformers/quickstart.html).

What we're going to do is *fine-tune* the BERT model, i.e. update the weights of a pre-trained model. That is, we have a model that is trained on language modeling, but now we apply it to word sense disambiguation with the word representations it learnt from language modeling.

We'll use the same data splits for training and testing as before, but this time you'll not use a torchtext dataloader. Rather now you create an iterator that collects N sentences (where N is the batch size) then use the **BertTokenizer to transform the sentence into integers**. For your dataloader, remember to:
* Shuffle the data in each batch
* Make sure you get a new iterator for each *epoch*
* Create a vocabulary of *sense-labels* so you can calculate accuracy 

We then pass this batch into the BERT model and train as before. The BERT model will encode the sentence, then we send this encoded sentence into a prediction layer (you can either the the sentence-representation from bert, or the ambiguous word) like before and collect sense predictions.

About the hyperparameters and training:
* For BERT, usually a lower learning rate works best, between 0.0001-0.000001.
* BERT takes alot of resources, running it on CPU will take ages, utilize the GPUs :)
* Since BERT takes alot of resources, use a small batch size (4-8)
* Computing the BERT representation, make sure you pass the mask

**[10 marks]**

In [18]:
learning_rate = 0.000001 # ‚îó( TÔπèT )‚îõ

In [19]:
import random
from transformers import BertTokenizer


def yield_batches(lst, batch_size):
    # Shuffle the data
    random.shuffle(lst)
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

def dataloader_for_bert(path_to_file, batch_size):
    with open(path_to_file) as file:
        lines = file.readlines()
        
        #contexts = [ line.split('\t')[3] for line in lines]
        contexts_and_labels = [ (line.split('\t')[3], line.split('\t')[0]) for line in lines]
        
        #Create a vocabulary of sense-labels so you can calculate accuracy
        sense_labels = {line.split('\t')[0] for line in lines}
        
        # create batches
        iterator = yield_batches(contexts_and_labels, batch_size)
        
        # use BertTokenizer to encode sentences 
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        input_ids = []
        attention_masks = []
        context_labels = []
        
        # read batch_size sentences at a time
        for i, batch in enumerate(iterator):
            contexts, labels = zip(*batch)
            tokenized_text = tokenizer.batch_encode_plus(contexts, 
                                                        max_length=128,
                                                        add_special_tokens = True, # add CLS and SEP tokens
                                                        pad_to_max_length=True,
                                                        padding = 'longest',
                                                        truncation=True,
                                                        return_attention_mask=True)
            input_ids.append(tokenized_text['input_ids'])
            attention_masks.append(tokenized_text['attention_mask'])
            # for every batch of tokenized_text we also need to add its labels
            context_labels.append(labels)
       
    return sense_labels, input_ids, attention_masks, context_labels, i #torch.LongTensor(input_ids), torch.FloatTensor(attention_masks)

In [20]:
train_dataloader = dataloader_for_bert("train.csv", batch_size) # (‚à™.‚à™ )...zzz

In [21]:
test_dataloader = dataloader_for_bert("test.csv", batch_size)

In [22]:
from transformers import BertModel

class BERT_WSD(nn.Module):
    def __init__(self, sense_labels_size):
        super(BERT_WSD, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(self.bert.config.hidden_size, sense_labels_size)
    
    def forward(self, batch):
        input_ids = torch.LongTensor(batch[0]).to(device)
        attention_mask = torch.FloatTensor(batch[1]).to(device)
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=False) 
        # select cls tokens -> the first element of last_hidden_state 
        predictions = self.classifier(output[0][:,0,:])
        #predictions = self.classifier(output[1]) # automatically selects cls tokens
        return predictions

Writing the Train function
Now we are all set to train our model. This train function is just like how we process a normal Pytorch model. We first set the mode to training, then we iterate through each batch and transfer it to the GPU. Then we pass the input_ids, attention_mask and input_ids to the model. It gives us the output, which consists of loss, logits, hidden_states_output and attention_mask_output. The loss contains the classification loss value. We call the backward function of the loss to calculate the gradients of the parameters of the BERT model. We then call clip_grad_norm_ to prevent the gradients from getting too high or too low. Then we call the optimizer.step() to update the gradients which are calculated by loss.backward(). scheduler.step() is used to update the learning rate according to the scheduler.

In [23]:
import transformers

sense_labels, input_ids, attention_masks, context_labels, num_batches = train_dataloader
model = BERT_WSD(len(sense_labels)).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
scheduler = transformers.get_linear_schedule_with_warmup(optimizer,
                                                         num_warmup_steps = int((num_batches * epochs) * 0.05),
                                                         num_training_steps = num_batches * epochs)

# convert sense_labels to integers
sense_labels_dict = {label: i for i, label in enumerate(sense_labels)}

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [25]:
# (‚ïØ¬∞‚ñ°¬∞Ôºâ‚ïØÔ∏µ ‚îª‚îÅ‚îª *flips table*

model.train()

for e in range(epochs):
    total_loss = 0
    # Suggestion for a different approach: get batches here instead
    for i, batch_tuple in enumerate(zip(input_ids, attention_masks, context_labels)):
        # output from model
        out = model(batch_tuple)
        
        target_labels = torch.tensor([sense_labels_dict[label] for label in batch_tuple[2]]).to(device)
        loss  = loss_function(out, target_labels)
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r')

        # backpropagation
        loss.backward()
        
        # optimizing
        optimizer.step()
            
        # clear gradients
        optimizer.zero_grad()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        
        scheduler.step()
        
    print()
torch.save(model, 'bert_model_0.5.pt')

2.9145213935462992
2.9122591788207597
2.9129524720177535


  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


Hi Ranim, Minerva and Adam! This is Pat. I screwed up and did a keybord interrupt so I had to restart it. I am SO sorry!!! Please forgive me my dudes. („Å£ ¬∞–î ¬∞;)„Å£ 

In [None]:
# model = torch.load("/scratch/bert_model.pt")

In [27]:
sense_labels, input_ids, attention_masks, context_labels, _ = test_dataloader

model.eval()

test_loss = 0
correct_guesses = 0

for i, batch_tuple in enumerate(zip(input_ids, attention_masks, context_labels)):
    
    with torch.no_grad():
        output = model(batch_tuple)

    target_labels = torch.tensor([sense_labels_dict[label] for label in batch_tuple[2]]).to(device)

    loss  = loss_function(output, target_labels)
    test_loss += loss.item()
    
    #print("output", torch.argmax(output, dim=1))
    #print("gold", target_labels.view(-1))
    
    correct_guesses += torch.sum(torch.eq(torch.argmax(output, dim=1), target_labels.view(-1)).long())

accuracy = int(correct_guesses) / ((i+1) * batch_size)

print('>', np.round(test_loss/(i+1), 4))
print('BERT accuracy: ', accuracy)

> 2.7181
BERT accuracy:  0.5275440441756508


# 3. Evaluation

Explain the difference between the first and second approach. What kind of representations are the different approaches using to predict word-senses? **[4 marks]**

    In the first approach we predicted based on the word selected by the index and in the second approach we predicted based on the final hidden state. The final hidden states contain the memory of the weights of the different layers while the indexed method selects the the hidden state of the word in question.

    The first approach seems to work better based on the lower test loss. We expected the second approach to work better because it contains more information but maybe the first approach is better after all because it is more specific.

Evaluate your model with per-word-form *accuracy* and comment on the results you get, how does the model perform in comparison to the baseline, and how do the models compare to each other? 

Expand on the evaluation by sorting the word-forms by the number of senses they have. Are word-forms with fewer senses easier to predict? Give a short explanation of the results you get based on the number of senses per word.

**[6 marks]**

In [12]:
# baseline accuracy
with open('test.csv') as test:
    t = test.readlines()
    accuracy = 0
    for line in t:
        word, sense = line.split('\t')[1], line.split('\t')[0]
        if sense == baseline[word]:
            accuracy += 1
    print('accuracy of baseline: ', accuracy/len(t))



accuracy of baseline:  0.314266929651545


How does the LSTMs perform in comparison to BERT? What's the difference between representations obtained by the LSTMs and BERT? **[2 marks]**

      The following table shows the accuracy results of these models, here explained:

- Baseline: a model which picks the most common word sense for each word.
- LSTM1: a LSTM model which picks the word sense according to the position of the word.
- LSTM2: a LSTM model which predicts the sense based on the final hidden state.


      As an extra experiment to improve the accuracy of our Bert model, we attempted to fine-tune the scheduler by modifying the number of warmup steps. 


- BERT no_scheduler: BERT with no scheduler.
- BERT scheduler1: BERT with a num_warmup_steps of 2% of the training steps.
- BERT scheduler2: BERT with a num_warmup_steps of 5% of the training steps.
- BERT scheduler2_oops: BERT with a num_warmup_steps of 5% of the training steps after an accidental keyboard interrupt in the third epoch.

|          | Baseline          | LSTM1              | LSTM2               | BERT no_scheduler    | BERT scheduler1   | BERT scheduler2    | BERT scheduler2_oops   |
|----------|-------------------|--------------------|---------------------|----------------------|-------------------|--------------------|--------------------|
| Accuracy | 0.3143 | 0.6889 | 0.4117 | 0.0105 | 0.4538 | 0.4727 | 0.5275 |

    Ranking
        1. LSTM1 üëë
        2. BERT scheduler2_oops ü•à
        3. BERT scheduler2 ü•â
        4. BERT scheduler1 ‚úå
        5. Baseline üëç
        6. BERT no_scheduler üëé

    Due to an unfortunate mistake, we did a keyboard interrupt while in the third epoch in training the model with 5% of warmup steps. After rerunning the training cell, we observed that, at the first epoch, the loss started 1 point lower and the decrease was noticeably slower in the second and third epochs. 

    This unplanned model is the one that gives a higher accuracy (of about 52%) among all of the trained BERT models. This might be due to two reasons: 
    
        1. First of all, the randomization of the data (different for BERT scheduler2 and BERT scheduler2_oops).
        
        2. Second, we believe that it is possible that the data ended up doing five epochs instead of three (2 before the keyboard interrupt and 3 after). This leads us to hypothesize that, despite having to start the training again, part of the information was retained and, therefore, the resulting model was slightly more accurate. This is only a supposition, given that we do not really know what happened when the cell was interrupted or what was saved. If that were true, this would lead us to assume that an increase in the number of epochs might yield even better results, but whether the resulting model would be as good with different testing sets and what the right number of epochs would be is still an open question.

    Surprisingly, the model that yields the best results is the first LSTM, the one that picks the most common sense depending on the position of the word - we were expecting the BERT to work better than the others. Since this is the first time we fine-tune BERT, we believe that this result is due to the our lack of familiarity and intuition fine-tuning it.

    We are especially concerned about the results for the BERT no_scheduler, which are dramatically low. This definitely proves how important it is to use a scheduler and fine-tune it correctly. 
    The baseline model, which is very basic, yield surprisingly good results, especially comparing it to the other models. 

    The first LSTM model represents the target word in the context based on the index. The second one, on the other hand, represents the whole context using the final hidden state to predicts the word sense. BERT is similar to the second LSTM, although it uses pretrained embeddings, which might give it a higher accuracy in correctly fine-tuned models, and attention. 

    Note: the different fine-tunes of BERT (BERT no_scheduler, BERT scheduler 1 and scheduler2) were done in different notebooks and computers to save time (and sanity) (check ../Dancing\ Potatoes/assignment-04/Judit and ../Dancing\ Potatoes/assignment-04/Patricia). This is the reason why the results are not shown in this notebook. 

What could we do to improve our LSTM word sense disambiguation models and our BERT model? **[4 marks]**

    (Partially answered in the previous question)
    
    When it comes to Neural Networks, the data that we use to train, develop and test the model plays a very important role in its performance. As a consequence, we believe that improving the quality and size of data is the most straightforward way to improve the models. For that, we suggest adding more words senses to the dataset. If the data was not revised by human annotators, we suggest that they might be useful to attest to the quality of the data.

    In our second LSTM model approach, we concatenated the hidden states before using the classifier. We could try to concatenate the cell states as well to get a different and perhaps better result (commented out in the building of LSTM2). According to Hadiwinoto et al.(2019) another way to improve LSTM word sense disambiguation models would be "using pre-trained contextualized word representation" i.e., BERT.
    
    To improve BERT, we could try different forms of fine-tuning to see what works best. Since our knowledge of BERT is quite limited, we would have to do trial and error to see which approach works best. We tried different forms of fine-tuning, especially with the warmup steps in the scheduler (see answer above), but did not get optimal results. We suggest trying other configurations to get the better results we were hoping to see for BERT. We could try the approach suggested by Hadiwinoto et al.(2019), to use "linear projection of the hidden vectors, coupled with gating to filter the values".

    Bibliography: 
    Hadiwinoto, C., Ng, H. T., & Gan, W. C. (2019). Improved word sense disambiguation using pre-trained contextualized word representations. arXiv preprint arXiv:1910.00194. https://www.aclweb.org/anthology/D19-1533.pdf 

# Readings:

[1] K√•geb√§ck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf