# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model to find the sense of a word (homonyms of a word-form). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In case a) and b) we can determine that the meaning of "bank" based on the *context*. To utilize context in a semantic model we use *contextualized word representations*. Previously we worked with *static word representations*, i.e. the representation does not depend on the context. To illustrate we can consider sentences (a) and (b), the word **bank** would have the same static representation in both sentences, which means that it becomes difficult for us to predict its sense. What we want is to create representations that depend on the context, i.e. *contextualized embeddings*. 

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural netoworks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 15 different words. 

In [1]:
import numpy as np
import pandas as pd
from scipy import stats as s

import random

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from IPython.core.display import display
from IPython.display import Image

# Reproducing same results
SEED = 2009
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("available device:", device)

available device: cuda


### Documenting you code
**Note:** This lab is focused quite abit on programming and working with neural networs, i.e. writing code. Comment the code that you write and explain what it does, the code documentation will be taken into account when grading. Also it's very beneficial for you to explain to yourself what the code is doing, and it helps me give feedback to you :)

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contain different word sense for 15 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context

### Splitting the data

Your first task is to seperate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

In [4]:
def data_split(path_to_dataset):
    # your code goes here
    with open (path_to_dataset, "r") as fin:
        # data = map(lambda x: x.split("\t"), fin.readlines())
        data = map(lambda x: x.split("\t"), fin.read().splitlines())

    # use sklearn to split data reserving the classes ratio in the split dataset.
    data = np.array(list(data))
    Xtrain, Xtest, ytrain, ytest = train_test_split(
            data[:, 1:], data[:, 0], 
            test_size = 0.2, 
            stratify = data[:, 0], 
            random_state = SEED, 
            shuffle = True
            )
    
    # combine [Xtrain,ytrain] to form train set, the same for Xtest, ytest
    trainset = np.concatenate(
        (Xtrain, np.array([ytrain]).T), axis=1)
    testset = np.concatenate(
        (Xtest, np.array([ytest]).T), axis=1)

    # Use pandas to write to cv | easier to write :)
    colsname = ["form", "id", "tokens", "sense"]
    trainset = pd.DataFrame(trainset, columns=colsname)
    testset = pd.DataFrame(testset, columns=colsname)

    trainset.to_csv("trainset.csv", index=False)
    testset.to_csv("testset.csv", index=False)
    print("Done...")

dataset_path = "./wsd-data/wsd_data.txt"
data_split(dataset_path)

Done...


---
--- AE: Marks=2

---

### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model, given a very simple heuristic/algorithmic/model solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the most commonly assigned sense to the word, and label a words with that sense. **[2 marks]**

E.g. In a fictional dataset, "bank" have two senses, "financial institution" which occur 5 times and "side of river" 3 times. Thus, all 8 occurences of bank is labeled "financial institution" and this yields an MCS accuracy of 5/8 = 62.5%. If a model obtain a higher score than this, we can conclude that the model *at least* is better than selecting the most frequent word sense.

In [5]:
def mcs_baseline(data):
    # import data file using pandas
    # data_imported = pd.read_csv(data, index_col=0, sep=";")
    data_imported = pd.read_csv(data, index_col=0)
    # Group and count data by word sense
    # Then sum words sense counts for each word form = total number of sense labels for each word forms.
    # Then, calculate the percentage for each word sense
    # Then, get the max percentage
    # Finaly, select the word sense that have the max percentage
    groupd_sense = data_imported.groupby(["form","sense"])["sense"].count()
    groupd_form = groupd_sense.groupby(level = 0).transform(sum)
    groupd_perc = groupd_sense / groupd_form
    groupd_perc_max = groupd_perc.groupby(level = 0).transform(max)
    baseline = groupd_perc[groupd_perc==groupd_perc_max].reset_index(name="percentage")
    
    baseline = baseline.set_index(['form', "sense"])   # tranform the "form" column to be the index of dataframe
    return baseline

traindata_path = "./trainset.csv"
baseline = mcs_baseline(traindata_path)
display(baseline)

Unnamed: 0_level_0,Unnamed: 1_level_0,percentage
form,sense,Unnamed: 2_level_1
active.a,active%3:00:03::,0.320596
bad.a,bad%5:00:00:intense:00,0.60722
bring.v,bring%2:38:00::,0.211529
build.v,build%2:36:00::,0.21203
case.n,case%1:11:00::,0.203822
common.a,common%3:00:01::,0.250716
common.a,common%3:00:02::,0.250716
common.a,common%5:00:00:shared:00,0.250716
critical.a,critical%3:00:01::,0.274541
extend.v,extend%2:30:01::,0.180428


---
--- AE: Marks=2
    
---

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators than can sort and shuffle the examples. 

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is a function which tokenize the input, keep a dictionary of word-to-numbers, and fix paddings. So, we need four fields, one for the word-sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary, for this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, the last step. In the last step we load the data objects given by the ``TabularDataset`` and pass it to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them around (such that for each epoch the model observe the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, then we are ready to train and test our model!

Implement the dataloader. [**4 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 

In [63]:
def mytokenizer(text):
    return text.split()


def dataloader(path, batch_sizes):
    print("Loading data ...\n")
    # contexts = Field(tokenize=mytokenizer, eos_token="<eos>",
    #                  init_token="<bos>", batch_first=True, include_lengths=True)
    sentences = Field(  tokenize=mytokenizer,
                        include_lengths=True, 
                        batch_first=True
                      )

    senses = Field( batch_first=True,
                    sequential=False,
                    is_target=True
                    )

    word_id = Field(batch_first=True, 
                    sequential=False,
                    use_vocab=False, 
                    dtype=torch.long)

    word_f = Field( batch_first=True, 
                    sequential=False,
                   )

    # read only required columns – question and label
    fields = [("form", word_f), ("id", word_id), ('tokens', sentences),
              ('senses', senses)]

    train_ds, test_ds = TabularDataset.splits(
        path=path,
        train='trainset.csv', validation="testset.csv",
        format='csv',
        # csv_reader_params={"delimiter": ";"},
        skip_header=True,
        fields=fields)

    sentences.build_vocab(train_ds, test_ds, 
                          min_freq=3, vectors="glove.6B.300d")
    senses.build_vocab(train_ds, test_ds)
    word_id.build_vocab(train_ds, test_ds)
    word_f.build_vocab(train_ds, test_ds)

    train_iterator, test_iterator = BucketIterator.splits(
        (train_ds, test_ds),
        batch_size=batch_sizes,
        sort_key = lambda x: len(x.tokens),
        sort_within_batch=True,
        # sort=False,
        shuffle=True,
        device=device)
    
    print("Loading data done.\n")

    return train_iterator, test_iterator, sentences, senses, word_id, word_f


---
--- AE: Marks=4
    
---

# 2. Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long-Short-Term Memory (LSTM) network to create a representation for the sentences and a Linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) A LSTM-module to obtain contextual representations
    3) A classifier that compute scores for each word-sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your tasks will be to create two different models (both follow the two outlines described above), described below:


We have a question if taking the last hidden layer of a batch of sentences that have variable lengths is an accurate representation of a word. The last set of tokens is padded to equalize the lengths of sentences. We want to examine the performance when we have the nth hidden vectors, where n is the length of the respective sentence in the batch. See the figure below for the architecture of all methods we implement in this assignment. The `pack_padded_sequence` enables us to do this easily as the generated lat hidden layer is equal to the hidden layers that we want to extract. 

<img src="https://raw.githubusercontent.com/juliaklezl/lt2213-lab-1-group-3/master/final/image/Architecture1.PNG?token=AOROR3KJSI3X7FSQXA5C5OK675V6M" width="800" />





In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[8 marks]**


In [15]:
def get_intr_output(tensors, indx, x_len, n_dir):
    # Extract tesnsors at required indecies (indx)
    # the passed tensors dimensions [ batch_size x max_seq_length x n_dir x hidden_dim ]
    # indx is a tensor of size 1 x batch_size

    # Get words positons for backward sequence, if required
    indx_back = (x_len-1) - indx

    hidden_dim = tensors.shape[3]
    # forward output or final output
    # Extract the forward output at first batch then build on it
    tensors_final = tensors[0, indx[0], 0, :].contiguous().view(
        1, hidden_dim)  # batch=0 [1xhidden_dim]

    # Extract backward output at first batch then concatenate it with the forward output @batch=0 [tensors_final]
    if n_dir == 2:
        tnsr_back_b0 = tensors[0, indx_back[0], 1, :].contiguous().view(
            1, hidden_dim)  # [1xhidden_dim]
        tensors_final = torch.cat((tensors_final, tnsr_back_b0), dim=1)

    # build upon output of batch 0, batch_size=tensors.shape[0]
    for i in range(1, tensors.shape[0]):
        # Extract forward output or final output at batch=i
        tnsr_forward = tensors[i, indx[i], 0, :].contiguous().view(
            1, hidden_dim)  # [1xhidden_dim]

        # Extract backward output at batch=i then concatenate it with the forward output
        if n_dir == 2:
            tnsr_back = tensors[0, indx_back[i], 1, :].contiguous().view(
                1, hidden_dim)  # [1xhidden_dim]
            tensor_temp = torch.cat((tnsr_forward, tnsr_back), dim=1)
            # build on old tensors
            tensors_final = torch.cat((tensors_final, tensor_temp), dim=0)

        else:
            tensors_final = torch.cat((tensors_final, tnsr_forward), dim=0)

    return tensors_final

In [44]:
class WSDModel_approach1(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):

        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.n_dir = 2 if bidirectional else 1

        # Constructor
        super().__init__()

        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout,
                           batch_first=True)
        self.classifier = nn.Linear(hidden_dim*self.n_dir, output_dim)

    def forward(self, batch):

        x, x_len = batch.tokens   # Contexts, and lengthes of each sentence
        word_pos = batch.id      # target word indecies
        batch_size = x.size(0)
        
        # embedded size = b x seq x embd
        embedded = self.embeddings(x)
        
#         h_0 = torch.zeros(self.n_layers*self.n_dir, batch_size,
#                           self.hidden_dim).requires_grad_()
#         # # Initialize cell state
#         c_0 = torch.zeros(self.n_layers*self.n_dir, batch_size,
#                           self.hidden_dim).requires_grad_()
#         h_0 = h_0.to(device)
#         c_0 = c_0.to(device)
        
        # hidden size = l*dir x b x h
        # cell   size = l*dir x b x h
        # out    size = b x max(seq) x h*dir
#         lstm_out, (_, _) = self.rnn(embedded, (h_0.detach(), c_0.detach()))
        lstm_out, (_, _) = self.rnn(embedded)
    
        #out = lstm_out[torch.arange(lstm_out.size(0)),word_pos,:]

        # change lstm output dimention to be b x max(seq) x dir x h
        seq_length = lstm_out.shape[1]
        lstm_out = lstm_out.contiguous().view(batch_size, seq_length, 
                                 self.n_dir, self.hidden_dim)
        # get intermediate outputs that corresponding indecies
        out = get_intr_output(lstm_out, word_pos, x_len, self.n_dir)
        
        # Fully connected layer
        predictions = self.classifier(out)

        return predictions

---
--- AE: A tip, you can easily extract the concatenation at the indices with the following line of code: `out = lstm_out[torch.arange(lstm_out.size(0)),word_pos,:]`. So, what this code does is first to select each batch by `torch.arange(lstm_out.size(0))` then, for each batch it selects the index given by `word_pos`. Marks=8
    
---

In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[8 marks]**

In [8]:
class WSDModel_approach2(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):

        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.n_dir = 2 if bidirectional else 1

        # Constructor
        super().__init__()

        # your code goes here
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout,
                           batch_first=True)

        self.classifier = nn.Linear(hidden_dim*self.n_dir, output_dim)

    def forward(self, batch):
        x, x_len = batch.tokens    # Contexts, and lengthes of each sentence
        batch_size = x.size(0)

        # embedded size = b x seq x embd
        embedded = self.embeddings(x)

        
        # hidden size = l*dir x b x h
        # cell   size = l*dir x b x h
        # out    size = b x max(seq) x h*dir
        _, (hidden, _) = self.rnn(embedded)

        # concat final hidden layers
        # either forward or backward final hidden layer
        hidden_final = hidden[-1, :, :]
        # concat the final forward and backward hidden state
        if self.n_dir == 2:
                out = torch.cat((hidden[-2, :, :], hidden_final), dim=1)

        # hidden = [batch size, hid dim * num directions]
        predictions = self.classifier(out)

        return predictions

In [9]:
class WSDModel_approach3(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):

        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.n_dir = 2 if bidirectional else 1

        # Constructor
        super().__init__()

        # your code goes here
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           num_layers=n_layers,
                           bidirectional=bidirectional,
                           dropout=dropout,
                           batch_first=True)

        self.classifier = nn.Linear(hidden_dim*self.n_dir, output_dim)

    def forward(self, batch):
        x, x_len = batch.tokens    # Contexts, and lengthes of each sentence
        batch_size = x.size(0)

        # embedded size = b x seq x embd
        embedded = self.embeddings(x)

        packed_embedded = pack_padded_sequence( embedded,
                                                x_len,
                                                batch_first=True)

        # hidden size = l*dir x b x h
        # cell   size = l*dir x b x h
        # out    size = b x max(seq) x h*dir
        _, (hidden, _) = self.rnn(packed_embedded)

        # concat final hidden layers
        # either forward or backward final hidden layer
        hidden_final = hidden[-1, :, :]
        # concat the final forward and backward hidden state
        if self.n_dir == 2:
                out = torch.cat((hidden[-2, :, :], hidden_final), dim=1)

        # hidden = [batch size, hid dim * num directions]
        predictions = self.classifier(out)

        return predictions

---
--- AE: Marks=8
    
---

### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [77]:
def train(model, iterator, epochs=3, lr=0.01, approach_flag=False):
    # Define Loss, Optimizer
    criterion = nn.CrossEntropyLoss()
    criterion = criterion.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    for epoch in range(epochs):
        running_loss = 0
        model.train()

        for i, batch in enumerate(iterator):
            
            # resets the gradients after every batch
            optimizer.zero_grad()
            
            # forward + backward + optimize
            outputs = model(batch)
            
            # predictions = model(batch)
            loss = criterion(outputs, batch.senses)
            loss.backward()
            optimizer.step()

            # loss
            running_loss += loss.item()

            if i % 300 == 299:
                print('[%d, %5d] loss: %.3f' %
                      (epoch+1, i+1, running_loss / 300))
                running_loss = 0.0
    return model


def test(model, iterator, approach_flag=False):
    predictions = []    # list all predictions
    predictions_forms = {}  # dict to hold predictions and label per word
    labels = []
    
    # deactivating dropout layers
    model.eval()
    # deactivates autograd
    with torch.no_grad():
        for batch in iterator:
            outputs = model(batch)

            # probability dist over word senses
            probabilities_dist = F.softmax(outputs, dim=0)
            # get the word sense prediction
            senses_index = torch.max(probabilities_dist, dim=1)[1]
            
            predictions = predictions + senses_index.tolist()
            labels = labels + batch.senses.tolist()
            
            #print(predictions)
            #print(labels)
            #assert False

            # save prediction for each word form in a dict
            myword_form_ids = batch.form.tolist()
            # get word form
            myword_forms = [word_form.vocab.itos[i] for i in myword_form_ids]

            # update dict
            for i, wf_id in enumerate(myword_forms):
                if wf_id in predictions_forms:
                    predictions_forms.get(wf_id, []).append(
                        [senses_index.tolist()[i], batch.senses.tolist()[i]])
                else:
                    predictions_forms[wf_id] = \
                        [[senses_index.tolist()[i], batch.senses.tolist()[i]]]

    return predictions, labels, predictions_forms


def evaluate(ys, yspredict, ys_forms):
    # Calculate overall accuracies and perfromance measures given predictions and labels
    accuracy = accuracy_score(ys, yspredict)
    report = classification_report(ys, yspredict, output_dict=True)[
        "weighted avg"]
    # store the overall perfromnce report
    summary_o = [accuracy, report["precision"],
               report["recall"], report["f1-score"]]
    
    index_df = []
    summary = []
    # Calculate  accuracies and perfromance measures given predictions and labels for each word form
    for f in ys_forms:
        # extract columns from 2D array [y, y_prediction] --> [y], [y_prediction]
        yspredict = list(map(lambda x, : x[0], ys_forms[f]))
        ys = list(map(lambda x, : x[1], ys_forms[f]))
        
        accuracy = accuracy_score(ys, yspredict)
        report = classification_report(ys, yspredict, output_dict=True)[
            "weighted avg"]
        summary.append([accuracy, report["precision"],
               report["recall"], report["f1-score"]])
        index_df.append(f)
    # append the overall perfromance report
    summary.append(summary_o); index_df.append("Overall")

    # Put performace measure in a dataframe
    columns_name = ["accuracy", "precision", "recall", "F-measure"]
    summary_df = pd.DataFrame(summary, index=index_df, columns=columns_name)

    return summary_df

In [75]:
# loading training and test iterators
path_to_folder = "./"
train_iter, test_iter, mycontexts, mysense_lables, _, word_form = dataloader(
    path_to_folder, 32)

# define hyperparameters
pretrained_embeddings = mycontexts.vocab.vectors
size_of_vocab = len(mycontexts.vocab)          # input size of embedding layer
# output size of the fully connected layer = number of word senses we have
output_size = len(mysense_lables.vocab)
# output size of embedding layer = size of words vector
embedding_dim = pretrained_embeddings.shape[1]
hidden_size = 300#(embedding_dim // 3) + 1         # LSTM hidden layer size
num_layers = 2                                 # Number of stacked lstms
bidirection = True                             # lstm directions
dropout = 0
lr = 0.001
epochs = 3


Loading data ...

Loading data done.



In [78]:
# build, train, then test the models
# Model Aprroach 1
model_1 = WSDModel_approach1(size_of_vocab, embedding_dim, hidden_size,
                             output_size, num_layers, bidirectional=bidirection, dropout=dropout)

print("Approach 1: Predict the word sense based on word position.\n")
print("-"*50, "\n\n")

print(model_1)
print()

print("Training the Model ...\n")
# initialize weights of embeddings layers
model_1.embeddings.weight.data.copy_(pretrained_embeddings)
model_1 = model_1.to(device)    # train the model
model_1 = train(model_1, train_iter, epochs)
print("Training finished.\n\n")

print("Testing the Model ...\n")
# Tetss the NN
results_1, labels_1, results_dict_1 = test(model_1, test_iter)
# Calculate the performance measures
results_df1 = evaluate(labels_1, results_1, results_dict_1)
results_df1 = results_df1*100
print("Testing finished\n\n")

print("{:-^75}".format("Performance Measures"))
# display summary
pd.options.display.float_format = '{:.2f}%'.format
display(results_df1)
# send model_1 to cpu to clear memoru on cuda
model_1 = model_1.cpu()
print()

Approach 1: Predict the word sense based on word position.

-------------------------------------------------- 


WSDModel_approach1(
  (embeddings): Embedding(36442, 300)
  (rnn): LSTM(300, 300, num_layers=2, batch_first=True, bidirectional=True)
  (classifier): Linear(in_features=600, out_features=223, bias=True)
)

Training the Model ...

[1,   300] loss: 2.222
[1,   600] loss: 1.439
[1,   900] loss: 1.301
[1,  1200] loss: 1.298
[1,  1500] loss: 1.287
[1,  1800] loss: 1.273
[2,   300] loss: 1.230
[2,   600] loss: 1.213
[2,   900] loss: 1.206
[2,  1200] loss: 1.211
[2,  1500] loss: 1.218
[2,  1800] loss: 1.182
[3,   300] loss: 1.094
[3,   600] loss: 1.121
[3,   900] loss: 1.135
[3,  1200] loss: 1.130
[3,  1500] loss: 1.107
[3,  1800] loss: 1.148
Training finished.


Testing the Model ...

Testing finished


---------------------------Performance Measures----------------------------


Unnamed: 0,accuracy,precision,recall,F-measure
force.n,72.55%,73.84%,72.55%,72.71%
follow.v,41.58%,42.21%,41.58%,40.98%
place.n,46.89%,59.46%,46.89%,51.93%
physical.a,47.11%,54.71%,47.11%,49.65%
bad.a,55.62%,64.09%,55.62%,58.58%
position.n,39.86%,45.80%,39.86%,40.21%
bring.v,35.67%,46.90%,35.67%,39.09%
keep.v,42.63%,63.79%,42.63%,48.93%
security.n,70.37%,76.07%,70.37%,72.59%
line.n,45.27%,90.28%,45.27%,57.50%





In [59]:
# %%
model_2 = WSDModel_approach2(size_of_vocab, embedding_dim, hidden_size,
                             output_size, num_layers, bidirectional=bidirection, dropout=dropout)

print("Approach 2: Predict the word sense based last hidden layer of lstm.\n")
print("-"*50, "\n\n")

print(model_2)
print()

print("Training the Model ...\n")
# initialize weights of embeddings layers
model_2.embeddings.weight.data.copy_(pretrained_embeddings)
model_2 = model_2.to(device)    # train the model
model_2 = train(model_2, train_iter, epochs)
print("Training finished.\n\n")

print("Testing the Model ...")
# Tetss the NN
results_2, labels_2, results_dict_2 = test(model_2, test_iter)
# Calculate the performance measures
results_df2 = evaluate(labels_2, results_2, results_dict_2)
results_df2 = results_df2*100
print("Testing finished\n\n")

print("{:-^75}".format("Performance Measures"))
# display summary
pd.options.display.float_format = '{:.2f}%'.format
display(results_df2)
# send model_1 to cpu to clear memoru on cuda
model_2 = model_2.cpu()
print()

Approach 2: Predict the word sense based last hidden layer of lstm.

-------------------------------------------------- 


WSDModel_approach2(
  (embeddings): Embedding(36442, 50)
  (rnn): LSTM(50, 50, num_layers=2, batch_first=True, bidirectional=True)
  (classifier): Linear(in_features=100, out_features=223, bias=True)
)

Training the Model ...

[1,   300] loss: 4.991
[1,   600] loss: 4.659
[1,   900] loss: 3.777
[2,   300] loss: 3.043
[2,   600] loss: 2.632
[2,   900] loss: 2.382
[3,   300] loss: 2.039
[3,   600] loss: 1.959
[3,   900] loss: 1.936
Training finished.


Testing the Model ...
Testing finished


---------------------------Performance Measures----------------------------


Unnamed: 0,accuracy,precision,recall,F-measure
common.a,33.05%,45.67%,33.05%,37.80%
positive.a,20.66%,62.26%,20.66%,29.97%
bring.v,19.04%,36.70%,19.04%,22.67%
case.n,12.92%,29.69%,12.92%,16.24%
serve.v,37.69%,42.13%,37.69%,38.44%
line.n,30.39%,89.09%,30.39%,43.23%
follow.v,27.16%,36.94%,27.16%,28.84%
find.v,28.48%,53.38%,28.48%,36.05%
see.v,27.12%,68.56%,27.12%,36.23%
position.n,30.63%,45.48%,30.63%,35.64%





In [61]:
model_3 = WSDModel_approach3(size_of_vocab, embedding_dim, hidden_size,
                             output_size, num_layers, bidirectional=bidirection, dropout=dropout)

print("Approach 3: Predict the word sense based hidden layer of last word of the sequence neglegtin the padding.\n")
print("-"*50, "\n\n")

print(model_3)
print()

print("Training the Model ...\n")
# initialize weights of embeddings layers
model_3.embeddings.weight.data.copy_(pretrained_embeddings)
model_3 = model_3.to(device)    # train the model
train(model_3, train_iter, epochs)
print("Training finished.\n\n")

print("Testing the Model ...")
# Tetss the NN
results_3, labels_3, results_dict_3 = test(model_3, test_iter)
# Calculate the performance measures
results_df3 = evaluate(labels_3, results_3, results_dict_3)
results_df3 = results_df3*100
print("Testing finished\n\n")

print("{:-^75}".format("Performance Measures"))
# display summary
pd.options.display.float_format = '{:.2f}%'.format
display(results_df3)
# send model_1 to cpu to clear memoru on cuda
model_3 = model_3.cpu()
print()



Approach 3: Predict the word sense based hidden layer of last word of the sequence neglegtin the padding.

-------------------------------------------------- 


WSDModel_approach3(
  (embeddings): Embedding(36442, 50)
  (rnn): LSTM(50, 50, num_layers=2, batch_first=True, bidirectional=True)
  (classifier): Linear(in_features=100, out_features=223, bias=True)
)

Training the Model ...

[1,   300] loss: 4.891
[1,   600] loss: 3.876
[1,   900] loss: 3.045
[2,   300] loss: 2.410
[2,   600] loss: 2.158
[2,   900] loss: 1.998
[3,   300] loss: 1.650
[3,   600] loss: 1.629
[3,   900] loss: 1.607
Training finished.


Testing the Model ...
Testing finished


---------------------------Performance Measures----------------------------


Unnamed: 0,accuracy,precision,recall,F-measure
common.a,32.47%,45.63%,32.47%,37.53%
positive.a,48.35%,64.00%,48.35%,54.36%
bring.v,24.85%,40.72%,24.85%,28.91%
case.n,17.16%,31.92%,17.16%,21.45%
serve.v,39.67%,50.19%,39.67%,41.90%
line.n,43.37%,91.52%,43.37%,56.79%
follow.v,26.25%,36.34%,26.25%,29.51%
find.v,31.19%,52.05%,31.19%,38.51%
see.v,30.63%,69.86%,30.63%,41.19%
position.n,36.71%,52.55%,36.71%,42.40%





We tried different sets of **Hyperparameters**; we noticed that the system's overall accuracy decreased with the increase in the embedding layers' output size. We used `Glove` to embed words; the highest accuracy achieved when we used `Glove` with size 50 to embed words while `Glove` with size 300 has the worst accuracy. Also, we noticed that the system accuracy has the same inverse relation with the hidden layer size of LSTM.

It seems that the `Glove` representation has noise data that affect system accuracy. Such an adverse effect may be attenuated when the size of the word embedding decreases in both the embedding layer and the LSTM layer; the LSTM reduces the dimensionality of the word embeddings further.

---
--- AE: Looks good! I liked your third approach, this is generally a good idea to do. As you noticed, it does not typically give a HUGE performance boost (since well, the NN typically learns that the padding is useless), but when optimizing your models this is very recommended!

So, typically, larger dimensions (128-256) in this task would yield better performance, but this does not seem to be the case for your system where your analysis holds!

Marks=4
    
---

# 3. Evaluation

Explain the difference between the first and second approach. What kind of representations are the different approaches using to predict word-senses? **[5 marks]**

**Your answer should go here:**

In the first approach, the representation of the specific word in question is used as input to the final prediction layer. So the word sense is chosen based on information about the ambiguous word; its context is encoded in the embedding. We have used bidirectional LSTM so we can also include the contexts that come after the word in the embeddings.

In the second approach, the final embedding layer takes the final hidden state of the RNN layer as input. This is the LSTM’s long-term memory, based on the entire sequence. However, due to the fact that the context data was padded in the preprocessing step, this final state also includes semantically irrelevant padding data. Therefore, we added a third approach which follows the same principle as the second model, but excludes the padded datapoints. So the final hidden state refers to the state after the last actual word was processed, without including <pad>’s. 

So the first approach relies more on the information about the individual word whereas the second (and third) one focuse more on the sentence overall.

---
--- AE: Good analysis! The second approach is essentially the CBOW approach we did for word2vec in the previous lab. I.e. we look at all the words in a sentence and try to find the best representation given all of them. Marks=5
    
---

Evaluate your model with per-word-form *accuracy* and comment on the results you get, how does the model perform in comparison to the baseline, and how does the first approach compare to the second? Which one is more successful? **[5 marks]**

**Your answer should go here:**


Overall, the first model performs the best (overall 56% accuracy), followed by our third approach (20.6%), then the second approach (10.6%). 

### The comments below are related to the new training and evaluation session:

The figure below shows that the first approach has the height accuracy for all words form.
Only the first model predicts the sense of most of the words with accuracies higher than the baseline, except for one word The other models, model2 and model 3, fail to get accuracies higher than the baseline. See the figure below

Also, the two figures, the old and new training sessions, show that the perfromance for the model-1 does not change a lot (relatively to the other models). This consistency indicates that the first approach has a better capability to prdict the the word sense.

<img src="https://raw.githubusercontent.com/juliaklezl/lt2213-lab-1-group-3/master/final/image/accuraciesperword_new.png?token=AOROR3LIX7VJ2M7CGKLWPUC675XS66" width="800" />

---

### The comments below are related to the old training and evaluation session:

Comparing Model-2 and Model -3, There is a slight increase in precision (2.5%) relative to the increase in accuracy (1.7%). This small increase may be due to that the embeddings do not include the padding sequence.

The figure below shows that the first approach has the height accuracy for all words form. Our three models predict the sense of most of the words with accuracies higher than the baseline, except for two words for model-1 and six words for model2 and model 3.

<img src="https://raw.githubusercontent.com/juliaklezl/lt2213-lab-1-group-3/master/final/image/accuraciesperword.PNG?token=AOROR3MF7NTLJX2W7ZY5D5S675XQ6" width="800" />

---
--- AE: Reasonable! Generally, we would expect model 1 to perform better with *this* setup, which it does for you. The code you have implemented for all models is also correct, so I'm abit puzzled as to why the performance of model 2 and 3 is so low.  

Marks=5
    
---

What could we do to improve our word sense disambiguation model? **[8 marks]**

**Your answer should go here:**

- Add more syntactical data, such as POS or dependency information. This probably gets encoded by the embeddings to some extent, but explicitly adding it might help to disambiguate, since word senses often differ among others in their POS or the POS-tags of words they co-occur with.

- Experiment with different parameters, batch sizes, learning rates, different or added layers, different loss, or optimizer functions. Can evolution algorithms help in optimizing these parameters?

- More preprocessing of the input data (the context column), e.g., removing stopwords

- Investigate applying dimensionality reduction on pre-trained word embeddings, please see our note at the end of question 2

- The accuracies for the models changed as the tarining/evaluation data changed, specially for Model-2 and Model-3. To metigate such effect it is prefered to train and evaluate the models over multiple 3 or 5 folds.   

---
--- AE: Good suggestions! Another thing we should add is some for of regularization to help our model generalize, such as dropout. Marks=7

---

---
--- AE: 44 marks! Very good job, well written code and reasonable analysis. 

Hope you have a nice summer!

Best, Adam
    
---

# Readings:

[1] Kågebäck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf