# HW2 BiLSTM for PoS tagging

## Introduction

In this project, you will use PyTorch to implement a BiLSTM for the PoS tagging task. We use the Universal Dependencies English Web Treebank (UDPOS) dataset, which is provided by the TorchText library.

In [None]:
! pip install torchtext==0.6.0
! pip install Cython
#! pip install -U pip setuptools wheel
#! pip install -U spacy
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
print(torchtext.__version__)
from torchtext import data
from torchtext import datasets

import spacy
import numpy as np

import time
import random
SEED = 6

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

0.6.0


## Data Preprocessing
The key component of TorchText is the `Field`, which defines how each column of your dataset is processed. You can find more details in the official documentation (https://torchtext.readthedocs.io/en/latest/data.html#field).

Here, the `TXT` field defines how the text is processed. Setting `lower = True` converts all text to lowercase.

Next, define the `Field`s for the tags. This dataset actually has two different sets of tags, [universal dependency (UD) tags](https://universaldependencies.org/u/pos/) and [Penn Treebank (PTB) tags](https://www.sketchengine.eu/penn-treebank-tagset/). In this project, you will train the model on the UD tags only. `UD_TAGS` defines how the UD tags are handled. Our `TXT` vocabulary - which we'll build later - will have *unknown* tokens in it, i.e. tokens that are not within our vocabulary. However, we won't have unknown tags, as we are dealing with a finite set of possible tags. TorchText `Field`s initialize a default unknown token, `<unk>`, which we remove by setting `unk_token = None`. `PTB_TAGS` does the same as `UD_TAGS`, but handles the PTB tags instead.

Then, define `fields`, which handles passing our fields to the dataset.

Next, load the UDPOS dataset using the defined fields.

In [None]:
TXT = data.Field(lower = True)
UD_TAGS = data.Field(unk_token = None)
PTB_TAGS = data.Field(unk_token = None)
fields = (("text", TXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


### Statistics of Data

After loading the data, you will do some data analysis.

(1) Get the $10$ most common tokens in the texts.  
(2) Get all the possible UD tags.  
(3) Compute the number of samples for each UD tag.  

Tips:
(1) The built-in `vars` function shows the attributes of a sample as a dictionary, which lets you split a sample into its different parts. You can also print some samples directly.   
(2) For better statistics, we can transform the texts into numeric representations. The `build_vocab` method is used for this purpose (see anie.me/On-Torchtext/). For texts, we want some unknown tokens within our dataset in order to replicate how this model would be used in real life, so we set `min_freq` to 2, which means only tokens that appear at least twice in the training set are added to the vocabulary; the rest are replaced by `<unk>` tokens.

In [None]:
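def data_analysis(TXT, UD_TAGS, train_data, valid_data, test_data):
    # build the text vocabulary from tokens seen at least twice and attach GloVe vectors
    TXT.build_vocab(train_data, min_freq=2, vectors="glove.6B.100d")
    # the tag vocabularies need no frequency cutoff
    UD_TAGS.build_vocab(train_data)
    PTB_TAGS.build_vocab(train_data)
    # (1) the 10 most common tokens are available via TXT.vocab.freqs.most_common(10)
    # (3) number of samples for each of the 10 most frequent UD tags
    print(UD_TAGS.vocab.freqs.most_common(10))
    # (2) all possible UD tags (index 0 is <pad>)
    print(UD_TAGS.vocab.itos)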

data_analysis(TXT, UD_TAGS, train_data, valid_data, test_data)

[('NOUN', 34781), ('PUNCT', 23679), ('VERB', 23081), ('PRON', 18577), ('ADP', 17638), ('DET', 16285), ('PROPN', 12946), ('ADJ', 12477), ('AUX', 12343), ('ADV', 10548)]
['<pad>', 'NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']
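
The effect of `min_freq` described in the tips can be illustrated without TorchText. The following is a minimal pure-Python sketch, not the library's implementation: the helper names `build_vocab` and `numericalize` and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1, specials=("<unk>", "<pad>")):
    """Toy stand-in for a vocabulary builder: returns (itos, stoi)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically so the result is deterministic).
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(sentence, stoi, unk="<unk>"):
    # Tokens missing from the vocabulary map to the <unk> index.
    return [stoi.get(tok, stoi[unk]) for tok in sentence]

train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
itos, stoi = build_vocab(train, min_freq=2)
print(itos)                                       # ['<unk>', '<pad>', 'cat', 'sat', 'the']
print(numericalize(["the", "dog", "ran"], stoi))  # [4, 0, 0]
```

Here "dog" and "ran" occur only once in the toy training set, so they fall below the cutoff and are mapped to `<unk>` even though the model saw them during training.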


The final step of data preparation is to create the iterators, which yield batches of examples. `BucketIterator` groups sentences of similar length into the same batch to minimize the amount of padding.

In [None]:

BATCH_SIZE = 64

device = 'cuda' if torch.cuda.is_available() else 'cpu'

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
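
To see why grouping sentences of similar length reduces padding, here is a minimal pure-Python sketch of the idea. It is not how `BucketIterator` is implemented (the real iterator also shuffles, which this sketch does not attempt); the function name `bucket_batches` and the toy sentences are hypothetical.

```python
def bucket_batches(sentences, batch_size, pad="<pad>"):
    """Group sentences of similar length, then pad each batch to its longest sentence."""
    ordered = sorted(sentences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = max(len(s) for s in chunk)
        batches.append([s + [pad] * (width - len(s)) for s in chunk])
    return batches

sents = [["a"], ["b", "c", "d", "e"], ["f", "g"], ["h", "i", "j"]]
batches = bucket_batches(sents, batch_size=2)
n_pad = sum(row.count("<pad>") for batch in batches for row in batch)
print(batches)
print(n_pad)  # 2
```

Batching the same four sentences in their original order would need 4 padding tokens instead of 2, because the one-token sentence would share a batch with the four-token one.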

## Building the Model


Next, we build our model - a multi-layer bi-directional LSTM. The model takes in a sequence of tokens, $X = \{x_1, x_2,...,x_T\}$, and passes them through an embedding layer, $e$, to get the token embeddings, $e(X) = \{e(x_1), e(x_2), ..., e(x_T)\}$.

These embeddings are processed - one per time-step - by the forward and backward LSTMs. The forward LSTM processes the sequence from left-to-right, whilst the backward LSTM processes the sequence right-to-left, i.e. the first input to the forward LSTM is $x_1$ and the first input to the backward LSTM is $x_T$. The LSTMs also take in the hidden, $h$, and cell, $c$, states from the previous time-step. After the whole sequence has been processed, the hidden and cell states are then passed to the next layer of the LSTM. The initial hidden and cell states, $h_0$ and $c_0$, for each direction and layer are initialized to a tensor full of zeros. We then concatenate both the forward and backward hidden states from the final layer of the LSTM, $H = \{h_1, h_2, ... h_T\}$, where $h_1 = [h^{\rightarrow}_1;h^{\leftarrow}_T]$, $h_2 = [h^{\rightarrow}_2;h^{\leftarrow}_{T-1}]$, etc., and pass them through a linear layer, $f$, which predicts the tag for each token, $\hat{y}_t = f(h_t)$.

We implement the model detailed above in the `BiLSTM` class. You can use the `nn.Embedding`, `nn.LSTM`, and `nn.Linear` modules provided by the PyTorch library.

In [None]:
# TODO
class BiLSTM(nn.Module):
    def __init__(self,
                 input_dim,
                 embedding_dim,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 pad_idx):

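        super(BiLSTM, self).__init__()

        # padding_idx keeps the <pad> embedding out of the gradient updates
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)

        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional)

        # a bidirectional LSTM outputs the forward and backward states concatenated
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)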


    def forward(self, text):
        embeddings = self.embedding(text)

        #embeddings = [sent len, batch size, emb dim]

        #pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embeddings)

        #outputs holds the backward and forward hidden states in the final layer
        #hidden and cell are the backward and forward hidden and cell states at the final time-step

        #output = [sent len, batch size, hid dim * n directions]
        #hidden/cell = [n layers * n directions, batch size, hid dim]

        #use our outputs to make a prediction of what the tag should be
        predictions = self.fc(outputs)

        #predictions = [sent len, batch size, output dim]
        return predictions
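
The two-direction processing described above can be sketched with a toy recurrence in place of an LSTM cell. This is only an illustration of how forward and backward states are paired per token: the recurrence $h_t = 0.5\,h_{t-1} + x_t$ and the names `scan` and `bidirectional` are assumptions of the sketch and have nothing to do with the actual LSTM equations.

```python
def scan(xs, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t, starting from h_0 = 0."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states

def bidirectional(xs):
    forward = scan(xs)                # reads x_1 ... x_T
    backward = scan(xs[::-1])[::-1]   # reads x_T ... x_1, then re-aligned to token positions
    # Each token gets one state from each direction.
    return list(zip(forward, backward))

states = bidirectional([1.0, 0.0, 0.0, 2.0])
print(states)  # [(1.0, 1.25), (0.5, 0.5), (0.25, 1.0), (2.125, 2.0)]
```

Every position carries two states, one summarizing the left context and one summarizing the right context. This is why, with `bidirectional = True`, the output of `nn.LSTM` has a feature size of `2 * hidden_dim`, and any layer that consumes it must accept that size.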

## Training the Model

Next, we instantiate the model. The embedding dimension must match that of the GloVe embeddings loaded earlier (100 for `glove.6B.100d`).

The remaining hyperparameters are reasonable defaults; other combinations may perform better on this model and dataset.

The input and output dimensions are the sizes of the text and tag vocabularies. The padding index is looked up in the text vocabulary using the `pad_token` of the `TXT` field, so that the model can treat padding appropriately.

In [None]:
INPUT_DIM = len(TXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(UD_TAGS.vocab)
N_LAYERS = 1
BIDIRECTIONAL = True
PAD_IDX = TXT.vocab.stoi[TXT.pad_token]

model = BiLSTM(INPUT_DIM,
                EMBEDDING_DIM,
                HIDDEN_DIM,
                OUTPUT_DIM,
                N_LAYERS,
                BIDIRECTIONAL,
                PAD_IDX)

We'll now initialize our model's embedding layer with the pre-trained embedding values we loaded earlier.

This is done by getting them from the vocab's `.vectors` attribute and then calling `.copy_` to overwrite the embedding layer's current weights in place.

Note: the GloVe embeddings were loaded in the data processing step and are stored in `TXT.vocab.vectors`. The embedding layer maps each token index to its vector, which is the input to the LSTM cells that follow.

In [None]:
# TODO
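def apply_embeddings(model):
    # start every parameter from a small random normal
    for name, param in model.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)
    # overwrite the embedding weights with the pre-trained GloVe vectors
    model.embedding.weight.data.copy_(TXT.vocab.vectors)
    # keep the <pad> embedding at zero
    model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
    return model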
model = apply_embeddings(model)
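
The idea of overwriting the embedding table with pre-trained vectors can be sketched in plain Python. This is a toy illustration, not the PyTorch call: the function `init_embeddings`, the two-dimensional vectors, and the tiny vocabulary are all hypothetical. In this sketch, tokens without a pre-trained vector keep their random initialization, which is an assumption of the sketch and may differ from what `vocab.vectors` holds for such tokens.

```python
import random

def init_embeddings(vocab, pretrained, dim, pad_token="<pad>", seed=0):
    """Random table, overwritten row by row where a pre-trained vector exists."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in vocab]
    for i, tok in enumerate(vocab):
        if tok in pretrained:
            table[i][:] = pretrained[tok]   # in-place overwrite of this row
    # The padding row is set to zeros so that padding contributes no signal.
    table[vocab.index(pad_token)][:] = [0.0] * dim
    return table

vocab = ["<unk>", "<pad>", "cat"]
table = init_embeddings(vocab, {"cat": [1.0, 2.0]}, dim=2)
print(table[2])  # [1.0, 2.0]
print(table[1])  # [0.0, 0.0]
```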

Now, let's define the loss function and optimizer for our model.

Our tag vocabulary doesn't contain `<unk>` tokens, but it does contain `<pad>` tokens. These `<pad>` tokens are added so that all sentences within a batch have the same length, which is required to process them as a single tensor. However, we don't want our model to learn to predict `<pad>` tokens, as they carry no meaningful information.

To address this, set the `ignore_index` parameter of the loss function to the index of the `<pad>` token in the tag vocabulary. The loss function then skips every position whose target is `<pad>`, so the model is trained only on the real tags.

In [None]:
TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
model = model.to(device)
criterion = criterion.to(device)
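
To make the effect of `ignore_index` concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips ignored targets. It mirrors the behaviour described above but is not PyTorch's implementation; the function name and the toy logits are assumptions for illustration.

```python
import math

def cross_entropy(logits, targets, ignore_index=None):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_norm - row[target]
        count += 1
    return total / count

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
targets = [1, 1, 0]  # index 0 plays the role of <pad>

masked = cross_entropy(logits, targets, ignore_index=0)
unmasked = cross_entropy(logits, targets)
print(masked, unmasked)
```

In this toy batch the padded position is predicted with high confidence, so including it lowers the mean loss and makes the model look better than it is on the real tags.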

Next, you will implement a function that computes accuracy, which we use to evaluate model performance.

We don't want to calculate accuracy over the `<pad>` tokens, as we aren't interested in predicting them.

The function below only calculates accuracy over non-padded tokens. `non_pad_elements` is a tensor containing the indices of the non-pad tokens within an input batch. We then compare the predictions of those elements with the labels to get a count of how many predictions were correct. We then divide this by the number of non-pad elements to get our accuracy value over the batch.
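
The same computation can be written over plain Python lists, which may help when checking the tensor version by hand. This is a sketch with hypothetical tag indices, where index 0 stands for `<pad>`.

```python
def masked_accuracy(pred_tags, gold_tags, pad_idx):
    """Fraction of non-pad positions where the predicted tag equals the gold tag."""
    pairs = [(p, g) for p, g in zip(pred_tags, gold_tags) if g != pad_idx]
    correct = sum(p == g for p, g in pairs)
    return correct / len(pairs)

pred = [3, 1, 2, 0, 0]
gold = [3, 1, 4, 0, 0]
acc = masked_accuracy(pred, gold, pad_idx=0)
print(acc)
```

Two of the three real tokens are correct, so the masked accuracy is 2/3. Counting the two padded positions as well would report 4/5.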

In [None]:
# TODO
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch as a decimal
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # index of the highest-scoring tag
    non_pad_elements = (y != tag_pad_idx).nonzero()   # positions of the real (non-pad) tags
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])

    # cast to float so the division is never an integer division
    return correct.sum().float() / y[non_pad_elements].shape[0]

Next is the function that handles training our model.

We first set the model to `train` mode, which enables dropout and batch normalization updates if the model uses them. We then iterate over the data iterator, which yields one batch of training examples at a time.

Each iteration of the training loop performs the following steps:

1. Zero the gradients of all model parameters, so that gradients from the previous batch do not accumulate.
2. Feed the batch of text into the model to get predictions.
3. Reshape the predictions to `[sent len * batch size, output dim]` and the tags to `[sent len * batch size]`, the shapes the loss function is given here.
4. Compute the loss and accuracy by comparing the predicted tags with the actual tags.
5. Call `backward` on the loss to compute the gradients of the loss with respect to the model's parameters.
6. Call `optimizer.step()` to update the model's parameters using those gradients.
7. Add the batch loss and accuracy to running totals, which are averaged at the end of the epoch.

Each pass through these steps is one parameter update; one pass over the whole iterator is one epoch.

In [None]:
# TODO
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        text = batch.text
        tags = batch.udtags

        optimizer.zero_grad()

        #text = [sent len, batch size]

        predictions = model(text)

        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]

        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]

        loss = criterion(predictions, tags)
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
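
The reshaping in step 3 works because predictions and tags are flattened in the same order, so row `i` of the flattened predictions still belongs to element `i` of the flattened tags. A minimal sketch with nested lists standing in for tensors (the tag values and the helper name are hypothetical):

```python
def flatten_time_major(grid):
    """[sent len][batch size] -> [sent len * batch size], in row-major order."""
    return [item for step in grid for item in step]

# sent len = 3, batch size = 2; the second sentence is one token shorter
tags = [["DET", "PRON"],
        ["NOUN", "VERB"],
        ["VERB", "<pad>"]]
flat = flatten_time_major(tags)
print(flat)  # ['DET', 'PRON', 'NOUN', 'VERB', 'VERB', '<pad>']

batch_size = 2
t, b = 1, 0  # second time-step of the first sentence
print(flat[t * batch_size + b])  # NOUN
```

The element at time-step `t` of sentence `b` ends up at index `t * batch_size + b`, for both the predictions and the tags.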

The `evaluate` function is similar to `train`, with a few key differences. We set the model to evaluation mode using `model.eval()`, which turns off dropout and makes batch normalization use its running statistics. We wrap the loop in `torch.no_grad()` to avoid computing gradients, and we skip `optimizer.zero_grad()` and `optimizer.step()` because the parameters are not updated during evaluation.

In [None]:
# TODO
def evaluate(model, iterator, criterion, tag_pad_idx):

    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():

        for batch in iterator:
            text = batch.text
            tags = batch.udtags

            predictions = model(text)

            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)

            loss = criterion(predictions, tags)
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
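
Note that `epoch_acc / len(iterator)` is the mean of the per-batch accuracies. Because batches contain different numbers of non-pad tokens, this can differ from the accuracy computed over all tokens at once. The sketch below uses hypothetical `(correct, total)` counts per batch to show the difference; the function names are illustrative.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies (what dividing by len(iterator) gives)."""
    return sum(c / n for c, n in batches) / len(batches)

def token_accuracy(batches):
    """Accuracy over all tokens pooled together."""
    return sum(c for c, _ in batches) / sum(n for _, n in batches)

# (correct, non-pad tokens) for two batches of very different size
batches = [(9, 10), (50, 100)]
print(batch_mean_accuracy(batches))
print(token_accuracy(batches))
```

The first value is (0.9 + 0.5) / 2, while the second is 59 / 110, so a small batch can weigh as much as a large one in the reported figure. With batches of similar size the two numbers are close.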

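Finally, we train the model. After each epoch we evaluate on the validation set and save the model's parameters whenever the validation loss improves on the best value seen so far.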

In [None]:
# TODO
N_EPOCHS = 10

model_save_path = 'hw2_bilstm.pt'
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)

    end_time = time.time()

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_save_path)

    print(f'Epoch: {epoch+1:02} ')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 
	Train Loss: 1.392 | Train Acc: 60.84%
	 Val. Loss: 0.729 |  Val. Acc: 82.18%
Epoch: 02 
	Train Loss: 0.376 | Train Acc: 88.94%
	 Val. Loss: 0.512 |  Val. Acc: 85.75%
Epoch: 03 
	Train Loss: 0.258 | Train Acc: 91.82%
	 Val. Loss: 0.465 |  Val. Acc: 86.28%
Epoch: 04 
	Train Loss: 0.214 | Train Acc: 92.90%
	 Val. Loss: 0.449 |  Val. Acc: 86.58%
Epoch: 05 
	Train Loss: 0.188 | Train Acc: 93.69%
	 Val. Loss: 0.442 |  Val. Acc: 86.82%
Epoch: 06 
	Train Loss: 0.167 | Train Acc: 94.38%
	 Val. Loss: 0.448 |  Val. Acc: 86.59%
Epoch: 07 
	Train Loss: 0.148 | Train Acc: 95.09%
	 Val. Loss: 0.459 |  Val. Acc: 86.14%
Epoch: 08 
	Train Loss: 0.133 | Train Acc: 95.61%
	 Val. Loss: 0.460 |  Val. Acc: 86.15%
Epoch: 09 
	Train Loss: 0.118 | Train Acc: 96.13%
	 Val. Loss: 0.478 |  Val. Acc: 86.04%
Epoch: 10 
	Train Loss: 0.105 | Train Acc: 96.59%
	 Val. Loss: 0.487 |  Val. Acc: 86.39%
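
The checkpoint rule in the loop can be replayed on the validation losses printed above to see which epoch's parameters end up in `hw2_bilstm.pt`. This sketch assumes the values rounded to three decimals in the log order the same way as the unrounded losses.

```python
# validation losses copied from the training log above
valid_losses = [0.729, 0.512, 0.465, 0.449, 0.442, 0.448, 0.459, 0.460, 0.478, 0.487]

best_valid_loss = float("inf")
saved_at = []  # epochs at which the checkpoint would be overwritten
for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best_valid_loss:
        best_valid_loss = loss
        saved_at.append(epoch)

print(saved_at)         # [1, 2, 3, 4, 5]
print(best_valid_loss)  # 0.442
```

The checkpoint is last overwritten at epoch 5. From epoch 6 on the training loss keeps falling while the validation loss rises, which suggests the model is starting to overfit.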


We now load the parameters that achieved the best validation loss and evaluate them on the test set.

In [None]:
model.load_state_dict(torch.load(model_save_path))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.433 |  Test Acc: 86.98%
