#  Assignment 4 (part 2)(CPSC 436N): LSTM-Based Part-Of-Speech (POS) Tagger

In Part 2 of this assignment, we again use the text file __cmpt-hw2-3.txt__, this time to train a bidirectional stacked LSTM-based neural sequence labeling model that predicts the part-of-speech tags of unseen input sentences.

## 1. Import the libraries and packages needed for this assignment

In [11]:
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
import math
from sklearn.metrics import f1_score

from tqdm import tqdm

## 2. Load the dataset (cmpt-hw2-3.txt)

We will use the same dataset as in (Part1), namely **cmpt-hw2-3.txt**. Please first upload this text file from your local computer to Colab by following these instructions:

https://colab.research.google.com/notebooks/io.ipynb

(Please note that the uploaded file is deleted once the connection is terminated, so you need to upload it again every time you connect to Google Colab.)

<br>

For convenience, we will load and save each line of the dataset as **a token list** and **a pos list**. For example, one line in our dataset:

*There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._.*

will be saved as:

__token list:__ ['there', 'are', 'also', 'plant', 'and', 'gift', 'shops', '.']

__pos list:__ ['EX', 'VBP', 'RB', 'NN', 'CC', 'NN', 'NNS', '.']

(Please note that we convert each token to lowercase.)

<br>

**Alternatively, you can run the entire notebook on your local machine, where training takes around 2 minutes per epoch on a relatively modern machine (e.g., a MacBook Pro).**

In [2]:
data_path = './cmpt-hw2-3.txt'

dataset = [] #initialize dataset list.
with open(data_path) as f: #load data from the text file; the file is closed automatically afterwards.
    for line in f:
        text_list = []; pos_list = []
        l = line.strip().split(' ') #split each line in the file by space.
        for w in l:
            w = w.split('_') #each token and its pos tag are connected by an underscore, here we split on the underscore.
            text_list.append(w[0].lower()) #add token to the text list for the current line.
            pos_list.append(w[1]) #add pos tag to the pos list for the current line.
        dataset.append((text_list, pos_list)) #add the processed line (text and pos list) into the dataset list.
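
The loader above splits every item on each underscore, so a token that itself contains an underscore would end up with a fragment of the token as its tag. A minimal sketch of a more defensive parser follows, assuming the tag is always the part after the last underscore and that every item contains at least one underscore; `parse_tagged_line` is a hypothetical helper, not part of the assignment code.

```python
def parse_tagged_line(line):
    """Split one 'token_TAG token_TAG ...' line into (tokens, tags)."""
    tokens, tags = [], []
    for item in line.split():  # split() also skips repeated spaces and the trailing newline
        token, _, tag = item.rpartition('_')  # split on the LAST underscore only
        tokens.append(token.lower())
        tags.append(tag)
    return tokens, tags


tokens, tags = parse_tagged_line(
    "There_EX are_VBP also_RB plant_NN and_CC gift_NN shops_NNS ._.")
print(tokens)
print(tags)
```

On the example line from this section, this should reproduce the token list and pos list shown earlier.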

## 3. Split the loaded dataset into training/dev/testing subsets

Here we follow a common scheme and split the dataset into training/dev/test sets with an 80%-10%-10% ratio. After shuffling the data to avoid any possible ordering bias, we take the first 80% of the samples as the training set, the next 10% as the dev set, and the last 10% as the test set.

In [4]:
import random
random.seed(436)
random.shuffle(dataset)

training_data = dataset[0:math.floor(len(dataset)*0.8)]
dev_data = dataset[math.floor(len(dataset)*0.8):math.floor(len(dataset)*0.9)]
test_data = dataset[math.floor(len(dataset)*0.9):]
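
To see how the `math.floor` boundaries behave when the dataset size is not a multiple of 10, here is a small sketch of the same split on a toy list. The function name `split_80_10_10` is hypothetical, and it uses its own `random.Random` instance instead of the global seed used above.

```python
import math
import random


def split_80_10_10(samples, seed=436):
    """Shuffle a copy of `samples` and cut it into train/dev/test parts."""
    samples = list(samples)               # copy so the caller's data is not shuffled in place
    random.Random(seed).shuffle(samples)  # private generator, global random state untouched
    n = len(samples)
    train_end = math.floor(n * 0.8)
    dev_end = math.floor(n * 0.9)
    return samples[:train_end], samples[train_end:dev_end], samples[dev_end:]


train, dev, test = split_80_10_10(range(25))
print(len(train), len(dev), len(test))
```

With 25 samples the three parts have 20, 2 and 3 elements: the rounding remainder always ends up in the test set, and every sample lands in exactly one part.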

## 4. Construct word-to-index and tag-to-index mapping dictionary

We model POS tagging as a sequence labeling task. The pipeline can be described as:

<br>

*a sequence of tokens --(mapping)--> a sequence of token indices --(input)--> POS Tagger --(output)--> a sequence of POS tag indices --(mapping)--> a sequence of POS tags*

<br>

So in this step, we want to construct two mapping dictionaries to assign a unique index to each token and tag respectively (see code below). These two mapping dictionaries will be used in the first and last mapping steps in the above pipeline.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

As in Part 1, we also need to deal with **unknown tokens** at test time. The code below implements one possible way to address this problem. ***Q5: Please add comments to the lines with "COMMENT NEEDED", and describe in your own words the implemented solution.***

In [5]:
class Word_Dictionary(object): #this class is used to generate and return the word-to-index (index-to-word) vocabulary dictionary.
    def __init__(self): #initializing.
        self.word2idx = {'_unk_':0} # "COMMENT NEEDED" #a word for unknown tokens is added to the dictionary word2idx
        self.idx2word = {0:'_unk_'} # "COMMENT NEEDED" #a word for unknown tokens is added to the dictionary idx2word
        self.idx = 1
    
    def add_word(self, word): #add a new word into the dictionary and assign a unique index to it.
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1
        
    def __len__(self): #return the number of words in this dictionary.
        return len(self.word2idx)


class Tag_Dictionary(object): #this class is used to generate and return the tag-to-index (index-to-tag) dictionary.
    def __init__(self): #initializing.
        self.tag2idx = {}
        self.idx2tag = {}
        self.idx = 0
    
    def add_tag(self, tag): #add a new tag into the dictionary and assign a unique index to it.
        if tag not in self.tag2idx:
            self.tag2idx[tag] = self.idx
            self.idx2tag[self.idx] = tag
            self.idx += 1
        
    def __len__(self): #return the number of tags in this dictionary.
        return len(self.tag2idx)


wd_dict = Word_Dictionary() #initialize the word-to-index dictionary.
tag_dict = Tag_Dictionary() #initialize the tag-to-index dictionary.
unknown_threshold = 1 # "COMMENT NEEDED" # set frequency threshold for a word to be included in the dictionary.

word_count = {} #initialize the dictionary to save frequencies of words.

for sample in training_data: #fill the word frequency dictionary with training data.
    text, tags = sample
    for word in text:
      if word not in word_count.keys():
        word_count[word] = 1
      else:
        word_count[word] += 1

for sample in training_data: #fill the word-to-index and tag-to-index dictionary with training data.
    text, tags = sample
    for word in text:
      if word_count[word] > unknown_threshold:  #"COMMENT NEEDED"# only include the words with frequencies over a certain threshold.
        wd_dict.add_word(word)
    for tag in tags:
      tag_dict.add_tag(tag)
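
The effect of the frequency threshold is easiest to see on a toy corpus. The sketch below mirrors the logic above with plain dictionaries; `build_vocab` and `encode` are hypothetical helper names, and the two-sentence corpus is made up for illustration.

```python
from collections import Counter


def build_vocab(sentences, unknown_threshold=1):
    """Index only the words that occur more than `unknown_threshold` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    word2idx = {'_unk_': 0}  # index 0 is reserved for unknown tokens
    for sentence in sentences:
        for word in sentence:
            if counts[word] > unknown_threshold and word not in word2idx:
                word2idx[word] = len(word2idx)
    return word2idx


def encode(sentence, word2idx):
    """Map each word to its index, falling back to the '_unk_' index."""
    return [word2idx.get(word, word2idx['_unk_']) for word in sentence]


toy_train = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
vocab = build_vocab(toy_train)
print(vocab)
print(encode(['the', 'cat', 'sat'], vocab))   # 'cat' occurs once in training
print(encode(['the', 'bird', 'sat'], vocab))  # 'bird' never occurs in training
```

Both sentences are encoded identically, because the rare training word and the unseen word share the `_unk_` index. This is the point of the threshold: the model sees `_unk_` during training and can learn a usable embedding for it.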

## 5. Implement the function for data processing

In this step, we implement a function that converts our data into the format required for training and testing the neural POS tagger.

More specifically, we convert the input token list and POS tag list into tensors (i.e., vectors) containing their corresponding indices.

In [6]:
def data_processing_for_lstm(sample, word_dict, tag_dict, unknown_threshold): #this function converts a sample into the format ready for our neural POS tagger.
    text, tags = sample
    word_ids = []; tag_ids = []

    for word in text:
      if word in word_dict.word2idx: #map the token to its index; word2idx only holds training words whose frequency exceeded the threshold.
        word_ids.append(word_dict.word2idx[word])
      else: #map rare and unseen tokens to the index of the unknown token.
        word_ids.append(word_dict.word2idx['_unk_'])

    for tag in tags: #map pos tags into indices using the tag-to-index dictionary.
      tag_ids.append(tag_dict.tag2idx[tag])

    word_ids = torch.from_numpy(np.array(word_ids)) #convert the list of token ids into tensor format.
    tag_ids = torch.from_numpy(np.array(tag_ids)) #convert the list of tag ids into tensor format.

    return word_ids, tag_ids

## 6. The class of LSTM-based POS tagger

Now let's design the architecture of our bidirectional stacked LSTM-based POS tagger. It should consist of three layers:

* [Embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html): projecting input token ids into their embedding space.
* [(Bi-)LSTM hidden state layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
* [Output layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html): converting the hidden states to POS predictions.

**Q6:** After carefully reading the PyTorch links above, please ***add comments to the lines with "COMMENT NEEDED".***

In [7]:
def zero_state(module, batch_size, device): #the function to initialize the hidden and cell states of the bidirectional LSTM with zeros.
    # * 2 is for the two directions
    shape = (module.num_layers * 2, batch_size, module.hidden)
    return torch.zeros(shape).to(device), torch.zeros(shape).to(device) #plain tensors suffice; the Variable wrapper is deprecated.

class LSTM_tagger(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden, num_layers, tag_size, device):
        super(LSTM_tagger, self).__init__()
        self.device = device #the device we run our model on.
        self.num_layers = num_layers #ADD COMMENT # number of layers of LSTM.
        self.hidden = hidden #the dimension of hidden state.
        self.input_size = embed_size #ADD COMMENT #the size of word embedding (input of LSTM).
        self.embed = nn.Embedding(vocab_size, embed_size) #the word embedding layer (convert words to embeddings).
        self.lstm = nn.LSTM(input_size=self.input_size,
                            hidden_size=self.hidden,
                            num_layers=self.num_layers,
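                            dropout=0, #ADD COMMENT #dropout probability applied to the outputs of every LSTM layer except the last one (0 disables dropout).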
                            bidirectional=True) #ADD COMMENT #the LSTM layer is bidirectional.
        self.h2s = nn.Linear(hidden * 2, tag_size) #the linear output layer (convert from hidden states of LSTM to the list of tag scores).

    def forward(self, x):
        x = self.embed(x) #convert word list to embedding list for the input sample.
        x = x.view(len(x),1,-1) #reshape to (seq_len, batch=1, embed_size), the layout nn.LSTM expects by default.
        s = zero_state(self, 1, self.device) #initialize the states of LSTM.
        lstm_output, _ = self.lstm(x, s) #LSTM.
        outputs = self.h2s(lstm_output.view(len(x),-1)) #ADD COMMENT #convert the output of LSTM into the prediction (sequence of tag score lists).
        tags_probs = F.log_softmax(outputs,dim=1) #ADD COMMENT #for each score list, apply log-softmax to turn the scores into log-probabilities.

        return tags_probs
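
The tensor shapes inside `forward` explain both the `hidden * 2` input size of the output layer and the `num_layers * 2` factor in `zero_state`. The following sketch traces them with made-up toy sizes. It rebuilds the three layers standalone instead of using `LSTM_tagger`, and it omits the initial states, which `nn.LSTM` then fills with zeros.

```python
import torch
import torch.nn as nn

# toy sizes, chosen only for illustration
seq_len, vocab_size, embed_size, hidden, num_layers, tag_size = 5, 10, 8, 16, 2, 4

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden,
               num_layers=num_layers, bidirectional=True)
h2s = nn.Linear(hidden * 2, tag_size)

token_ids = torch.randint(0, vocab_size, (seq_len,))
x = embed(token_ids).view(seq_len, 1, -1)    # (seq_len, batch=1, embed_size)
lstm_output, (h_n, c_n) = lstm(x)            # initial states default to zeros when omitted
scores = h2s(lstm_output.view(seq_len, -1))  # (seq_len, tag_size)

print(x.shape)            # one embedding per token
print(lstm_output.shape)  # forward and backward hidden states concatenated: hidden * 2
print(h_n.shape)          # one final state per layer and direction: num_layers * 2
print(scores.shape)       # one score per tag for every token
```

Each token position receives a vector of size `hidden * 2` because the top layer's forward and backward hidden states are concatenated, which is why `h2s` takes `hidden * 2` input features.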

## 7. Initialize model and define parameters

**ONLY IN SOLUTION =====**
The parameters we need to initialize include:

* embed_size: the dimension of token embeddings.
* intermediate_size: the dimension of the hidden state of LSTM.
* num_layers: the number of stacked LSTM layers.
* vocab_size: the size of vocabulary (in the word-to-index mapping dictionary).
* tag_size: the number of POS tags (in the tag-to-index mapping dictionary).
* num_epochs: the number of passes over the training set.
* learning_rate: the step size used by the optimizer.
* device: the device the model runs on (GPU index 0 if CUDA is available, otherwise the CPU).
* criterion: the loss function, here we choose [cross entropy loss](https://pytorch.org/docs/1.9.1/generated/torch.nn.CrossEntropyLoss.html).
* optimizer: here we use Adam to train the model.

The model we need to initialize is:

* model: the one we implemented above.
**=====**

In [9]:
embed_size = 128
intermediate_size = 256
num_layers = 2
vocab_size = len(wd_dict)
tag_size = len(tag_dict)

num_epochs = 10
learning_rate = 0.002
device = 0 if torch.cuda.is_available() else 'cpu'
criterion = nn.CrossEntropyLoss() #ADD COMMENT #the loss function, here we choose cross entropy loss.

model = LSTM_tagger(vocab_size, embed_size, intermediate_size, num_layers, tag_size, device)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) #ADD COMMENT #the Adam optimizer, which updates all model parameters using the given learning rate.
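
One detail worth checking: `nn.CrossEntropyLoss` applies a log-softmax internally, while `LSTM_tagger.forward` already returns log-probabilities. Applying log-softmax to values that are already log-probabilities leaves them unchanged, so the loss is still correct, only computed with a redundant step. The sketch below verifies this numerically on random toy scores (the sizes and the seed are arbitrary assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
raw_scores = torch.randn(6, 4)        # toy example: 6 tokens, 4 tags
targets = torch.randint(0, 4, (6,))

log_probs = F.log_softmax(raw_scores, dim=1)  # what LSTM_tagger.forward returns

loss_ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)  # pairing used in this notebook
loss_ce_on_raw = nn.CrossEntropyLoss()(raw_scores, targets)       # cross entropy on raw scores
loss_nll = nn.NLLLoss()(log_probs, targets)                       # NLL loss on log-probabilities

same_as_raw = torch.allclose(loss_ce_on_log_probs, loss_ce_on_raw)
same_as_nll = torch.allclose(loss_ce_on_raw, loss_nll)
print(same_as_raw, same_as_nll)
```

Both comparisons should come out `True`. An equivalent design would be to keep the log-softmax in the model and use `nn.NLLLoss`, or to return raw scores and keep `nn.CrossEntropyLoss`.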

## 8. Train the model

In [12]:
for epoch in range(num_epochs):
    avg_loss = 0 #running sum of the per-sample training losses in this epoch.
    for i in tqdm(range(0, len(training_data))): #input one sample a time to train the model.
        inputs, targets = data_processing_for_lstm(training_data[i], wd_dict, tag_dict, unknown_threshold)  #prepare the sample for model training.
        
        #forward pass.
        outputs = model(inputs.to(device)) #ADD COMMENT #get prediction.
        loss = criterion(outputs, targets.reshape(-1).to(device)) #ADD COMMENT #compute the loss between prediction and ground truth.
        avg_loss += loss.item() #ADD COMMENT #add up the loss we get so far.
        
        #backward and optimize.
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
    
    #validation step.
    model.eval() #evaluation mode: disables dropout (only matters when dropout > 0, as in Q7).
    preds = []; targets = [];
    for i in tqdm(range(0, len(dev_data))):
        dev_inputs, dev_targets = data_processing_for_lstm(dev_data[i], wd_dict, tag_dict, unknown_threshold)
        dev_outputs = model(dev_inputs.to(device))
        dev_pred = torch.argmax(dev_outputs, dim=-1)
        preds += dev_pred.detach().cpu().numpy().tolist()
        targets += dev_targets.tolist()
    model.train() #back to training mode for the next epoch.
    
    print ('Epoch [{}/{}], Loss: {:.4f}, F1 score: {:5.2f}'
        .format(epoch+1, num_epochs, avg_loss/len(training_data), f1_score(targets, preds, average='macro')))

100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1776/1776 [02:08<00:00, 13.77it/s]


Epoch [1/10], Loss: 0.4701, F1 score:  0.88


 51%|█████████████████████████████████████████████████▍                                               | 906/1776 [01:05<01:02, 13.83it/s]


KeyboardInterrupt: 

## 9. Test the model

In [15]:
model.eval() #evaluation mode: disables dropout (only matters when dropout > 0, as in Q7).
test_preds = []; test_targets = [];
for i in tqdm(range(0, len(test_data))):
    test_input, test_target = data_processing_for_lstm(test_data[i], wd_dict, tag_dict, unknown_threshold)
    test_output = model(test_input.to(device))
    test_pred = torch.argmax(test_output, dim=-1)
    test_preds += test_pred.detach().cpu().numpy().tolist()
    test_targets += test_target.tolist()
print(f1_score(test_targets, test_preds, average='macro'))

0.9193238354070685
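
The score above is a macro-averaged F1: the F1 of each tag is computed separately and the per-tag scores are averaged with equal weight, so rare tags count as much as frequent ones. A minimal pure-Python sketch of that computation on made-up labels follows; `macro_f1` is a hypothetical helper, and a tag with no true positives, false positives or false negatives is scored as 0 here.

```python
def macro_f1(targets, preds):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(targets) | set(preds))
    f1_per_label = []
    for label in labels:
        tp = sum(1 for t, p in zip(targets, preds) if t == label and p == label)
        fp = sum(1 for t, p in zip(targets, preds) if t != label and p == label)
        fn = sum(1 for t, p in zip(targets, preds) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_per_label.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(f1_per_label) / len(f1_per_label)


toy_targets = [0, 0, 0, 0, 1, 2]
toy_preds = [0, 0, 0, 0, 1, 1]
accuracy = sum(t == p for t, p in zip(toy_targets, toy_preds)) / len(toy_targets)
print(accuracy, macro_f1(toy_targets, toy_preds))
```

In this toy case five of six predictions are correct (accuracy about 0.83), but the macro F1 is only 5/9 (about 0.56), because the single missed instance of tag 2 gives that tag an F1 of 0.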


**Q7:** Now train and test a different model with a dropout rate of 0.05. Report the performance and a plausible explanation for the different performance of the two tested models (if any).

THE FOLLOWING WILL NOT BE PART OF THIS ASSIGNMENT

In [11]:
test_sentence = (['there', 'are', 'also', 'plant', 'and', 'local', 'shops', '.'], ['NN', 'VBZ', 'NN','NN', 'VBZ', 'NN','NN'])

# see what the scores are after training
inputs, targets = data_processing_for_lstm(test_sentence, wd_dict, tag_dict, unknown_threshold)
tag_scores = model(inputs.to(device))

# print the most likely tag index, by grabbing the index with the maximum score!
# recall that these indices map back to tags through tag_dict.idx2tag
_, predicted_tags = torch.max(tag_scores, 1)

print('The predicted indices for tags: \n',predicted_tags)
for idx, word in enumerate(test_sentence[0]): #use idx rather than id to avoid shadowing the built-in id().
  print('Word:{} , Predicted Tag : {} ==> {} '.format(word,predicted_tags[idx], tag_dict.idx2tag[predicted_tags[idx].item()]))

The predicted indices for tags: 
 tensor([31, 25, 21,  2,  4,  1,  9, 13], device='cuda:0')
Word:there , Predicted Tag : 31 ==> EX 
Word:are , Predicted Tag : 25 ==> VBP 
Word:also , Predicted Tag : 21 ==> RB 
Word:plant , Predicted Tag : 2 ==> NN 
Word:and , Predicted Tag : 4 ==> CC 
Word:local , Predicted Tag : 1 ==> JJ 
Word:shops , Predicted Tag : 9 ==> NNS 
Word:. , Predicted Tag : 13 ==> . 
