# Sequence Tagging

In this lab we will train a part-of-speech (POS) tagger using an HMM and then an RNN.

### Outcomes

- Be able to train and apply an HMM.
- Understand what the steps of Viterbi are doing.
- Recognise how to adapt PyTorch models to use RNN layers and perform sequence tagging with neural networks.

### Overview

The first part of the notebook loads a POS dataset from the NLTK library.
The second part implements and tests an HMM POS tagger.
The third part adapts the neural network code from last week to train the RNN as a POS tagger.


# 1. Preparing the POS Tagging Data

To train our POS tagger, we will use the Brown corpus, which contains many different sources of English text (books, essays, newspaper articles, government documents...) collected and hand-labelled by linguists in 1967.


Next, we split the dataset into train and test, then re-format it so that each split is represented by a list of sentences and a list of tag sequences.
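As a rough illustration of this step, the sketch below shuffles a toy list of tagged sentences, splits it, and separates words from tags. The helper name `split_and_unzip`, the 20% test fraction and the fixed seed are assumptions made for the example. The course code in `dn.sequence_tagging.brown` may do this differently.

```python
import random

# Toy stand-in for the corpus: each sentence is a list of (word, tag) pairs.
tagged_sentences = [
    [("Open", "ADJ"), ("market", "NOUN"), ("policy", "NOUN")],
    [("Mae", "NOUN"), ("entered", "VERB"), ("the", "DET"), ("room", "NOUN")],
    [("This", "DET"), ("will", "VERB"), ("cost", "VERB")],
    [("And", "CONJ"), ("you", "PRON"), ("think", "VERB")],
    [("the", "DET"), ("kitchen", "NOUN")],
]


def split_and_unzip(tagged, test_fraction=0.2, seed=0):
    """Shuffle, split into train/test, and separate words from tags."""
    shuffled = list(tagged)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]

    def unzip(split):
        words = [[word for word, _ in sentence] for sentence in split]
        tags = [[tag for _, tag in sentence] for sentence in split]
        return words, tags

    return unzip(train), unzip(test)


(words_train, tags_train), (words_test, tags_test) = split_and_unzip(tagged_sentences)
print(len(words_train), len(words_test))
```

Each split ends up as two parallel lists: `words_train[i]` and `tags_train[i]` have the same length and describe the same sentence.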


In [3]:
import os
import sys

path = os.path.abspath(os.path.join(".."))

if path not in sys.path:
    sys.path.append(path)

Here we use LabelEncoder to map the tokens to IDs and convert the sentences to sequences of token IDs.
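To make the mapping concrete, here is a dependency-free sketch of the idea: every distinct item gets a consecutive integer ID in sorted order, which is also how scikit-learn's `LabelEncoder` assigns IDs. The helpers `fit_vocabulary` and `encode` are hypothetical names for this example, not part of the course code. The same approach works for the tags.

```python
def fit_vocabulary(sequences):
    """Map each distinct item to an integer ID, assigned in sorted order."""
    items = sorted({item for sequence in sequences for item in sequence})
    return {item: index for index, item in enumerate(items)}


def encode(sequences, vocabulary):
    """Replace every item in every sequence by its integer ID."""
    return [[vocabulary[item] for item in sequence] for sequence in sequences]


sentences = [["the", "room"], ["the", "kitchen"]]
tag_sequences = [["DET", "NOUN"], ["DET", "NOUN"]]

word_to_id = fit_vocabulary(sentences)
tag_to_id = fit_vocabulary(tag_sequences)

encoded_sentences = encode(sentences, word_to_id)
encoded_tags = encode(tag_sequences, tag_to_id)
print(encoded_sentences)  # [[2, 1], [2, 0]]
print(encoded_tags)       # [[0, 1], [0, 1]]
```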


The final preprocessing step is to map the tags (class labels) to numerical IDs:


In [4]:
from dn.sequence_tagging.brown import get_brown_tagged_sentences

brown_tagged_sentences = get_brown_tagged_sentences()

2023-11-15 10:30:42,804 brown downloading brown
2023-11-15 10:30:42,888 brown downloading universal_tagset
2023-11-15 10:30:46,047 brown 57340 sentences
2023-11-15 10:30:54,795 brown 56057 words
2023-11-15 10:30:54,795 brown 12 tags
2023-11-15 10:30:54,826 pickle dumping brown_tagged_sentences.pickle
2023-11-15 10:30:57,081 pickle dumped brown_tagged_sentences.pickle


# 2. Implementing the HMM

Now, we are going to put together an HMM by estimating the different variables in the model from the training set.

**TO-DO 2.1:** Count the state transitions and starting state occurrences in the training set and store the counts in the `transitions` and `start_states` matrices below. In `transitions`, rows correspond to the state at time t-1 and columns to the following state at time t.
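One possible way to do the counting is sketched below, assuming the tag sequences are already encoded as integer IDs. The function name `count_transitions` and the toy data are assumptions for illustration only.

```python
import numpy as np


def count_transitions(tag_sequences, n_tags):
    """Count start states and tag-to-tag transitions over encoded tag sequences."""
    transitions = np.zeros((n_tags, n_tags), dtype=np.int64)
    start_states = np.zeros(n_tags, dtype=np.int64)
    for tags in tag_sequences:
        if len(tags) == 0:
            continue
        start_states[tags[0]] += 1
        # Rows index the previous state (t-1), columns the current state (t).
        for previous, current in zip(tags[:-1], tags[1:]):
            transitions[previous, current] += 1
    return transitions, start_states


toy_tags = [[0, 1, 1], [2, 1], [0, 1]]
transitions, start_states = count_transitions(toy_tags, n_tags=3)
print(start_states)  # [2 0 1]
print(transitions)
```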


**TO-DO 2.2:** Normalise the transition and start state counts to estimate the conditional probabilities in the transition matrix and the initial state distribution $\pi$.
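Normalising means dividing each row of counts by its total, so that each row becomes a probability distribution. The sketch below uses a hypothetical helper `normalise_rows` and leaves all-zero rows as zeros rather than dividing by zero. That choice is an assumption, and smoothing the counts would be an alternative.

```python
import numpy as np


def normalise_rows(counts):
    """Divide each row (or a 1-D vector) by its sum; all-zero rows stay zero."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=-1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


transition_probs = normalise_rows([[1, 3], [0, 0]])
start_probs = normalise_rows([2, 0, 2])
print(transition_probs)  # rows: [0.25, 0.75] and [0, 0]
print(start_probs)       # [0.5, 0, 0.5]
```

The same function can be applied to a matrix of observation counts with one row per tag.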


**TO-DO 2.3:** Count the number of occurrences of each word type given each tag.
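A minimal sketch of the observation counts follows, assuming words and tags are both encoded as integer IDs. It uses one row per tag and one column per word type, which matches the `(12, 56057)` emission matrix shape in the log output below. The function name `count_emissions` is hypothetical.

```python
import numpy as np


def count_emissions(word_sequences, tag_sequences, n_tags, n_words):
    """Count how often each word ID is observed with each tag ID."""
    emissions = np.zeros((n_tags, n_words), dtype=np.int64)
    for words, tags in zip(word_sequences, tag_sequences):
        for word, tag in zip(words, tags):
            emissions[tag, word] += 1
    return emissions


toy_words = [[0, 1], [0, 2]]
toy_tags = [[0, 1], [0, 1]]
emissions = count_emissions(toy_words, toy_tags, n_tags=2, n_words=3)
print(emissions)
```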


In [5]:
from numpy import int64
from dn.sequence_tagging.hmm import HMMTagger

hmm_tagger = HMMTagger[int64](
    brown_tagged_sentences.n_words, brown_tagged_sentences.n_tags
)

hmm_tagger.fit(
    brown_tagged_sentences.words_train_encoded,
    brown_tagged_sentences.tags_train_encoded,
)

2023-11-15 10:30:57,096 hmm __initial_matrix (12,)


2023-11-15 10:30:57,251 hmm __transition_matrix (12, 12)
2023-11-15 10:30:57,424 hmm __emission_matrix (12, 56057)


**TO-DO 2.4:** Normalise the observation counts to obtain the observation probabilities.


**TO-DO 2.5:** Check the implementation of viterbi below for errors!
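When checking a Viterbi implementation, it can help to compare it against a small reference version on an example that can be verified by hand. The sketch below is one such reference, not the course implementation. It works in log space to avoid underflow and assumes that `transition` has rows for the previous state and that `emission` has shape `(n_states, n_observations)`.

```python
import numpy as np


def viterbi(observations, initial, transition, emission):
    """Return the most likely state sequence for a list of observation IDs."""
    n_steps = len(observations)
    n_states = len(initial)
    if n_steps == 0:
        return []
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_initial = np.log(initial)
        log_transition = np.log(transition)
        log_emission = np.log(emission)

    scores = np.empty((n_steps, n_states))
    backpointers = np.zeros((n_steps, n_states), dtype=np.int64)
    scores[0] = log_initial + log_emission[:, observations[0]]
    for t in range(1, n_steps):
        # candidates[i, j]: best score ending in state i at t-1, then moving to j.
        candidates = scores[t - 1][:, None] + log_transition
        backpointers[t] = candidates.argmax(axis=0)
        scores[t] = candidates.max(axis=0) + log_emission[:, observations[t]]

    # Follow the backpointers from the best final state.
    path = [int(scores[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]


initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3], [0.4, 0.6]])
emission = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 1, 1], initial, transition, emission)
print(path)  # [0, 1, 1]
```

Typical mistakes to look for include taking the maximum over the wrong axis, omitting the emission term at a time step, and following the backpointers in the wrong direction.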


**TO-DO 2.6:** Use the viterbi function to estimate the most likely sequence of states on the test set.


The code below will convert the predicted tag IDs to names and print the predictions along with ground truth for selected examples so we can see where it made errors:


In [6]:
# Print the predicted tags for the first N sentences.
N_SENTENCES = 5
for test_words, test_tags in zip(
    brown_tagged_sentences.words_test_encoded[:N_SENTENCES],
    brown_tagged_sentences.tags_test_encoded[:N_SENTENCES],
):
    pred_tags = list(
        hmm_tagger.predict(
            test_words,
        )
    )

    test_words_decoded = brown_tagged_sentences.decode_words(test_words)
    test_tags_decoded = brown_tagged_sentences.decode_tags(test_tags)
    pred_tags_decoded = brown_tagged_sentences.decode_tags(pred_tags)

    print(" ".join(test_words_decoded))
    print(" ".join(test_tags_decoded))
    print(" ".join(pred_tags_decoded))

Open market policy
ADJ NOUN NOUN
NOUN NOUN NOUN
And you think you have language problems .
CONJ PRON VERB PRON VERB NOUN NOUN .
CONJ PRON VERB PRON VERB NOUN NOUN .
Mae entered the room from the hallway to the kitchen .
NOUN VERB DET NOUN ADP DET NOUN ADP DET NOUN .
NOUN VERB DET NOUN ADP DET NOUN PRT DET NOUN .
This will permit you to get a rough estimate of how much the materials for the shell will cost .
DET VERB VERB PRON PRT VERB DET ADJ NOUN ADP ADV ADJ DET NOUN ADP DET NOUN VERB VERB .
DET VERB VERB PRON PRT VERB DET ADJ NOUN ADP ADV ADV DET NOUN ADP DET NOUN VERB NOUN .
the multifigure `` Traveling Carnival '' , in which action is vivified by lighting ; ;
DET NOUN . VERB NOUN . . ADP DET NOUN VERB VERB ADP VERB . .
DET NOUN . VERB NOUN . . ADP DET NOUN VERB NOUN ADP NOUN . .


Let's see how well it did overall by computing performance metrics:


In [7]:
import numpy

# Compute the accuracy for the test set.
correct: int = 0
for test_words, test_tags in zip(
    brown_tagged_sentences.words_test_encoded,
    brown_tagged_sentences.tags_test_encoded,
):
    pred_tags = list(
        hmm_tagger.predict(
            test_words,
        )
    )
    correct += numpy.sum(pred_tags == test_tags)

accuracy = correct / len(brown_tagged_sentences.tags_test_encoded)
print(f"accuracy {accuracy:.3f}")

accuracy 0.417
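Note that the denominator in the cell above is the number of test sentences. A per-token accuracy would instead divide the number of correctly tagged tokens by the total number of tokens. A minimal sketch of that variant, using a hypothetical helper `token_accuracy` and toy data:

```python
import numpy as np


def token_accuracy(predicted_sequences, gold_sequences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for predicted, gold in zip(predicted_sequences, gold_sequences):
        # Converting to arrays makes the comparison element-wise.
        correct += int(np.sum(np.asarray(predicted) == np.asarray(gold)))
        total += len(gold)
    return correct / total if total else 0.0


predicted = [[0, 1, 1], [2, 1]]
gold = [[0, 1, 2], [2, 1]]
print(f"token accuracy {token_accuracy(predicted, gold):.3f}")  # 0.800
```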


# 3. POS Tagging with an RNN

The code below is adapted from last week's text classifier code to first pad the sequences, then format them into DataLoader objects.
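Padding can be sketched as follows: every sequence is truncated or extended to a fixed length with a dedicated padding ID. The helper `pad_sequences` and the padding ID used here are assumptions for illustration. The log output below shows that the course code pads to length 40.

```python
import numpy as np


def pad_sequences(sequences, max_length, pad_id):
    """Truncate or right-pad each sequence of IDs to exactly max_length."""
    padded = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
    for row, sequence in enumerate(sequences):
        truncated = sequence[:max_length]
        padded[row, : len(truncated)] = truncated
    return padded


padded = pad_sequences([[3, 1], [2, 0, 4, 5]], max_length=3, pad_id=9)
print(padded)  # [[3 1 9], [2 0 4]]
```

The padding ID should not collide with any real word or tag ID, which is why the vocabulary size is usually increased by one.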


In [8]:
from dn.sequence_tagging.rnn import get_brown_tagged_sentences_padded

(
    words_train_padded,
    words_test_padded,
    tags_train_padded,
    tags_test_padded,
    n_words,
    n_tags,
) = get_brown_tagged_sentences_padded()

2023-11-15 10:31:04,042 pickle loading brown_tagged_sentences.pickle
2023-11-15 10:31:04,472 pickle loaded brown_tagged_sentences.pickle
2023-11-15 10:31:04,778 rnn words_train_padded (45872, 40)
2023-11-15 10:31:04,873 rnn words_test_padded (11468, 40)
2023-11-15 10:31:05,151 rnn tags_train_padded (45872, 40)
2023-11-15 10:31:05,232 rnn tags_test_padded (11468, 40)
2023-11-15 10:31:05,254 pickle dumping brown_tagged_sentences_padded.pickle
2023-11-15 10:31:05,265 pickle dumped brown_tagged_sentences_padded.pickle


In [9]:
from dn.sequence_tagging.util import to_data_loader

data_loader_train = to_data_loader(words_train_padded, tags_train_padded)
data_loader_test = to_data_loader(words_test_padded, tags_test_padded)

Now, we're going to create a neural network sequence tagger using an RNN layer. This will be based on the code we used last time, with two key differences:

- Including an RNN hidden layer
- The output will have an additional dimension of size `sequence_length`, so that it can provide predictions for every token in the sequence.

**TO-DO 3.1:** Complete the code below to change the hidden layer to a single RNN layer. See [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) for details.
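To show the shapes involved, here is a minimal sketch of an embedding, RNN and linear tagger. The class name `SketchRNNTagger`, the toy vocabulary size and the padding tag ID are assumptions. This is not the `RNNTagger` from `dn.sequence_tagging.rnn`.

```python
import torch
from torch import nn

PAD_TAG_ID = 12  # Assumption: 12 real tags (IDs 0-11) plus one padding tag.


class SketchRNNTagger(nn.Module):
    def __init__(self, n_words, embedding_dim, hidden_dim, n_tags, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.rnn = nn.RNN(
            embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True
        )
        self.output = nn.Linear(hidden_dim, n_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq, embedding_dim)
        hidden_states, _ = self.rnn(embedded)  # (batch, seq, hidden_dim)
        return self.output(hidden_states)      # (batch, seq, n_tags)


model = SketchRNNTagger(n_words=100, embedding_dim=25, hidden_dim=32, n_tags=13)
token_ids = torch.randint(0, 100, (4, 40))
targets = torch.randint(0, 13, (4, 40))

logits = model(token_ids)
# CrossEntropyLoss expects (N, C) logits, so flatten batch and sequence.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TAG_ID)
loss = loss_fn(logits.reshape(-1, 13), targets.reshape(-1))
print(logits.shape, loss.item())
```

With `batch_first=True`, the RNN returns one hidden state per token, so the linear layer produces a score for every tag at every position.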


Now, we can run the code below to train and test the RNN model. This uses basically the same code as last week.

**TO-DO 3.2:** What is wrong with comparing the RNN tagger's performance computed here with that of the HMM? Hint: all the sequences are length 40.

**TO-DO 3.3:** Can you fix the accuracy computations to make them comparable with the accuracy for the HMM?
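The effect of padding on accuracy can be illustrated with a small sketch that compares accuracy over all positions with accuracy over real tokens only. The helper `masked_accuracy` and the padding ID `9` are assumptions for this toy example.

```python
import numpy as np


def masked_accuracy(predicted, gold, pad_id):
    """Accuracy over positions whose gold tag is not the padding ID."""
    predicted = np.asarray(predicted)
    gold = np.asarray(gold)
    mask = gold != pad_id
    n_real = int(mask.sum())
    if n_real == 0:
        return 0.0
    return float(((predicted == gold) & mask).sum() / n_real)


PAD = 9
gold = np.array([[0, 1, PAD, PAD], [2, 1, 0, PAD]])
predicted = np.array([[0, 2, PAD, PAD], [2, 1, 0, PAD]])

unmasked = float((predicted == gold).mean())
masked = masked_accuracy(predicted, gold, PAD)
print(f"unmasked {unmasked:.3f}, masked {masked:.3f}")  # unmasked 0.875, masked 0.800
```

Here the padded positions inflate the unmasked figure. Only the masked figure counts the same tokens as an accuracy computed on the unpadded sequences.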


The code below runs the training process by calling `train_nn()`:


The code below implements a testing or prediction function and computes accuracy.

**TO-DO 3.4:** Adjust the code below to correctly compute the accuracy.


In [10]:
import torch
from dn.sequence_tagging.rnn import RNNTagger

EMBEDDING_DIM = 25
HIDDEN_DIM = 32
HIDDEN_LAYERS = 1
LEARNING_RATE = 0.0005
N_EPOCHS = 10

# Ignore the padding index when computing the loss.
cross_entropy_loss = torch.nn.CrossEntropyLoss(ignore_index=n_words)

rnn_tagger = RNNTagger(
    loss_fn=cross_entropy_loss,
    # Include the padding index in the input.
    n_words=n_words + 1,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    hidden_layers=HIDDEN_LAYERS,
    # Include the padding index in the output.
    output_dim=n_tags + 1,
)

adam_optimizer = torch.optim.Adam(rnn_tagger.parameters(), lr=LEARNING_RATE)

rnn_tagger.train_(
    n_epochs=N_EPOCHS,
    train_loader=data_loader_train,
    val_loader=data_loader_test,
    optimizer=adam_optimizer,
)

100%|██████████| 717/717 [00:03<00:00, 187.61it/s]
2023-11-15 10:31:09,385 rnn epoch 1
2023-11-15 10:31:09,385 rnn training loss 1.412
2023-11-15 10:31:09,386 rnn training accuracy 27.1%
2023-11-15 10:31:09,650 rnn validation loss 0.893
2023-11-15 10:31:09,650 rnn validation accuracy 34.5%
100%|██████████| 717/717 [00:03<00:00, 187.13it/s]
2023-11-15 10:31:13,483 rnn epoch 2
2023-11-15 10:31:13,483 rnn training loss 0.716
2023-11-15 10:31:13,484 rnn training accuracy 37.2%
2023-11-15 10:31:13,727 rnn validation loss 0.592
2023-11-15 10:31:13,728 rnn validation accuracy 39.1%
100%|██████████| 717/717 [00:03<00:00, 190.29it/s]
2023-11-15 10:31:17,497 rnn epoch 3
2023-11-15 10:31:17,497 rnn training loss 0.511
2023-11-15 10:31:17,497 rnn training accuracy 40.4%
2023-11-15 10:31:17,751 rnn validation loss 0.457
2023-11-15 10:31:17,751 rnn validation accuracy 41.2%
100%|██████████| 717/717 [00:03<00:00, 188.85it/s]
2023-11-15 10:31:21,548 rn