# Sequence Tagging

In this lab we will train a part-of-speech (POS) tagger using an HMM and then an RNN.

### Outcomes
* Be able to train and apply an HMM.
* Understand what the steps of Viterbi are doing.
* Recognise how to adapt Pytorch models to use RNN layers and perform sequence tagging with neural networks.

### Overview

The first part of the notebook loads a POS dataset from the NLTK library.
The second part implements and tests an HMM POS tagger.
The third part adapts the neural network code from last week to train the RNN as a POS tagger.

# 1. Preparing the PoS Tagging Data

To train our POS tagger, we will use the Brown corpus, which contains many different sources of English text (books, essays, newspaper articles, government documents...) collected and hand-labelled by linguists in 1967.

In [1]:
import nltk
from nltk.corpus import brown

nltk.download('brown')  # download Brown corpus
nltk.download('universal_tagset')  # download the universal tagset: the 17 PoS tags

[nltk_data] Downloading package brown to /Users/es1595/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/es1595/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

Next, we split the dataset into train and test, then re-format it so that each split is represented by a list of sentences and a list of tag sequences.

In [2]:
from sklearn.model_selection import train_test_split
import numpy as np

nltk_data = list(brown.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(
    nltk_data,
    train_size=0.80,
    test_size=0.20,
    random_state=101
)
print(f'Number of training sentences: {len(train_set)}')
print(f'Number of test sentences: {len(test_set)}')

# Separate the labels from the text
train_toks = []
train_tags = []
for tagged_sentence in train_set:
    sentence_toks = []
    sentence_tags = []
    for token, tag in tagged_sentence:
        sentence_toks.append(token)
        sentence_tags.append(tag)

    train_toks.append(sentence_toks)
    train_tags.append(sentence_tags)

test_toks = []
test_tags = []
for tagged_sentence in test_set:
    sentence_toks = []
    sentence_tags = []
    for token, tag in tagged_sentence:
        sentence_toks.append(token)
        sentence_tags.append(tag)
    test_toks.append(sentence_toks)
    test_tags.append(sentence_tags)

print(f'Number of tagged words in the training set: {len(train_toks)}')
print(f'Number of tagged words in the test set: {len(test_toks)}')

Number of training sentences: 45872
Number of test sentences: 11468
Number of tagged words in the training set: 45872
Number of tagged words in the test set: 11468


Here we use LabelEncoder to map the tokens to IDs and convert the sentences to sequences of token IDs.

In [3]:
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

# Convert the tokens to IDs in a vocabulary ready for input to our models
tok_encoder = LabelEncoder()

all_train_words = [word for sentence in train_toks for word in sentence]
all_test_words = [word for sentence in test_toks for word in sentence]
all_words = all_train_words + all_test_words
tok_encoder.fit(all_words)

all_encoded = tok_encoder.transform(all_train_words)

train_toks_encoded = []
start_idx = 0
for sentence in tqdm(train_toks):
    train_toks_encoded.append(all_encoded[start_idx:start_idx+len(sentence)])
    start_idx += len(sentence)

all_encoded = tok_encoder.transform(all_test_words)

test_toks_encoded = []
start_idx = 0
for sentence in tqdm(test_toks):
    test_toks_encoded.append(all_encoded[start_idx:start_idx+len(sentence)])
    start_idx += len(sentence)

V = len(tok_encoder.classes_)
print(f'Size of vocabulary is {V}')

100%|███████████████████████████████████████████████████| 45872/45872 [00:00<00:00, 1022931.36it/s]
100%|███████████████████████████████████████████████████| 11468/11468 [00:00<00:00, 1146582.40it/s]

Size of vocabulary is 56057





The final preprocessing step is to map the tags (class labels) to numerical IDs:

In [4]:
# Convert the tags from their names to numbers
tag_encoder = LabelEncoder()
tag_encoder.fit([tag for sentence in train_tags for tag in sentence])
train_tags_encoded = [tag_encoder.transform(sentence) for sentence in train_tags]
test_tags_encoded = [tag_encoder.transform(sentence) for sentence in test_tags]

num_tags = len(tag_encoder.classes_)

# 2. Implementing the HMM

Now, we are going to put together an HMM by estimating the different variables in the model from the training set. 

**TO-DO 2.1:** Count the state transitions and starting state occurrences in the training set and store the counts in the `transitions` and `start_states` matrices below. In `transitions`, rows correspond to states at time t-1, the columns to the following state at time t.

In [5]:
transitions = np.zeros((num_tags, num_tags))
start_states = np.zeros(num_tags)

for sentence_tags in tqdm(train_tags_encoded):
    for i, tag in enumerate(sentence_tags):
        if i==0:
            start_states[tag] += 1
            continue
        ### WRITE YOUR OWN CODE HERE
        transitions[tag, sentence_tags[i-1]] += 1


100%|█████████████████████████████████████████████████████| 45872/45872 [00:00<00:00, 80211.48it/s]


**TO-DO 2.2:** Normalise the transition and start state counts to estimate the conditional probabilities in the transition matrix and \pi.

In [6]:
### WRITE YOUR CODE HERE
transitions /= np.sum(transitions, 1)[:, None]
start_states /= np.sum(start_states)

**TO-DO 2.3:** Count the number of occurrences of each word type given each tag.

In [7]:
observations = np.zeros((num_tags, V))

for i, sentence_toks in tqdm(enumerate(train_toks_encoded)):
    sentence_tags = train_tags_encoded[i]
    for j, tok in enumerate(sentence_toks):
        tag = sentence_tags[j]
        # WRITE YOUR OWN CODE HERE
        observations[tag, tok] += 1

45872it [00:00, 74530.45it/s]


**TO-DO 2.4:** Normalise the observation counts to obtain the observation probabilities.

In [8]:
#WRITE YOUR OWN CODE HERE
observations /= np.sum(observations, 1)[:, None]

**TO-DO 2.5:** Check the implementation of viterbi below for errors!

In [9]:
def viterbi(observed_seq, num_tags, start_probs, transition_probs, observation_probs):
    eps = 1e-7

    num_obs = observed_seq.shape[0]

    # Initialise the V and backpointers
    V = np.zeros((num_obs, num_tags))
    backpointer = np.zeros((num_obs, num_tags))

    # For the first data point in the sequence:
    V[0, :] = start_probs * observation_probs[:, observed_seq[0]]

    # Run Viterbi forward for t > 0
    for t in range(1, num_obs):

        for state in range(num_tags):
            # probabilities for all the sequences leading to this state at time t
            seq_prob = V[t-1, :] * transition_probs[:, state]

            # Choose the most likely sequence
            max_seq_prob = np.max(seq_prob)
            best_previous_state = np.argmax(seq_prob)

            # Calculate the probability of the most likely sequence leading to this state at time t, including the current observation.
            # Add eps to help with numerical issues.
            V[t, state] = (max_seq_prob + eps) * (observation_probs[state, observed_seq[t]] + eps)

            backpointer[t, state] = best_previous_state

    t = num_obs - 1

    # Initialise the sequence of predicted states
    state_seq = np.zeros(num_obs, dtype=int)

    # Get the most likely final state:
    state_seq[t] = np.argmax(V[t, :])

    # Backtrack until the first observation
    for t in range(len(observed_seq)-1, 0, -1):
        state_seq[t-1] = backpointer[t, state_seq[t]]

    return state_seq

**TO-DO 2.6:** Use the viterbi function to estimate the most likely sequence of states on the test set.

In [10]:
# WRITE YOUR OWN CODE HERE
predictions = []
for sentence in tqdm(test_toks_encoded):
    predictions.append(viterbi(sentence, num_tags, start_states, transitions, observations))

100%|███████████████████████████████████████████████████████| 11468/11468 [00:31<00:00, 367.85it/s]


The code below will convert the predicted tag IDs to names and print the predictions along with ground truth for selected examples so we can see where it made errors:

In [11]:
# Convert the sequence of tag IDs to tag names
predicted_tags = []
for sequence in tqdm(predictions):
    predicted_tags.append(tag_encoder.inverse_transform(sequence))

# print some examples:
examples = [2, 334, 4983, 2389]
for eg in examples:
    print(test_toks[eg])
    print(test_tags[eg])
    print(predicted_tags[eg])

100%|█████████████████████████████████████████████████████| 11468/11468 [00:00<00:00, 12297.53it/s]

['``', 'My', 'God', ',', "I'm", 'shot', "''", '!', '!']
['.', 'DET', 'NOUN', '.', 'PRT', 'VERB', '.', '.', '.']
['.' 'NOUN' 'NOUN' '.' 'PRT' 'VERB' '.' '.' '.']
['She', 'thought', 'she', 'was', 'bigger', 'than', 'we', 'are', 'because', 'she', 'came', 'from', 'Torino', "''", '.']
['PRON', 'VERB', 'PRON', 'VERB', 'ADJ', 'ADP', 'PRON', 'VERB', 'ADP', 'PRON', 'VERB', 'ADP', 'NOUN', '.', '.']
['PRON' 'VERB' 'PRON' 'VERB' 'ADJ' 'ADP' 'PRON' 'VERB' 'ADV' 'PRON' 'VERB'
 'ADP' 'NOUN' '.' '.']
['Meanwhile', ',', 'I', 'reloaded', 'my', 'gun', ',', 'as', 'the', 'other', 'men', 'were', 'doing', '.']
['ADV', '.', 'PRON', 'VERB', 'DET', 'NOUN', '.', 'ADP', 'DET', 'ADJ', 'NOUN', 'VERB', 'VERB', '.']
['ADV' '.' 'PRON' 'VERB' 'DET' 'NOUN' '.' 'ADV' 'DET' 'ADJ' 'NOUN' 'VERB'
 'VERB' '.']
['The', 'difficulty', 'was', 'that', 'each', 'day', 'seemed', 'to', 'produce', 'its', 'quota', 'of', 'details', 'which', 'must', 'be', 'cleaned', 'up', 'immediately', '.']
['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'V




Let's see how well it did overall by computing performance metrics:

In [12]:
# compute accuracy

from sklearn.metrics import accuracy_score

all_predictions = [tag for sentence in predictions for tag in sentence]
all_targets = [tag for sentence in test_tags_encoded for tag in sentence]

acc = accuracy_score(all_targets, all_predictions)
print(f'Accuracy = {acc}')

Accuracy = 0.9015292609995729


# 3. POS Tagging with an RNN
The code below is adapted from last week's text classifier code to first pad the sequences, then format them into DataLoader objects.

In [13]:
sequence_length = 40  # truncate all docs longer than this. Pad all docs shorter than this.
padding_token_id = V  # one value higher than the other tags

def pad_sequence(sequence):
    # pad with -1s
    if len(sequence) >= sequence_length:
        sequence = sequence[:sequence_length]
    else:
        sequence = np.concatenate((np.zeros(sequence_length-len(sequence)) + padding_token_id, sequence))
    return sequence

padded_train_toks_encoded = [pad_sequence(toks)[None, :] for toks in train_toks_encoded]
padded_train_tags_encoded = [pad_sequence(tags)[None, :] for tags in train_tags_encoded]

padded_test_toks_encoded = [pad_sequence(toks)[None, :] for toks in test_toks_encoded]
padded_test_tags_encoded = [pad_sequence(tags)[None, :] for tags in test_tags_encoded]

In [14]:
from torch.utils.data import DataLoader, TensorDataset
import torch

batch_size = 64

# convert from the Huggingface format to a TensorDataset so we can use the mini-batch sampling functionality
def convert_to_data_loader(inputs, labels):
    inputs = np.concatenate(inputs, axis=0)
    labels = np.concatenate(labels, axis=0)

    # convert from array to tensor
    input_tensor = torch.from_numpy(inputs).long()
    label_tensor = torch.from_numpy(labels).long()
    tensor_dataset = TensorDataset(input_tensor, label_tensor)
    loader = DataLoader(tensor_dataset, batch_size=batch_size, shuffle=True)

    return loader

train_loader = convert_to_data_loader(padded_train_toks_encoded, padded_train_tags_encoded)

In [15]:
test_loader = convert_to_data_loader(padded_test_toks_encoded, padded_test_tags_encoded)

Now, we're going to create a neural network sequence tagger using an RNN layer. This will be based on the code we used last time, with two key differences:
  * Including an RNN hidden layer
  * The output will have an additional dimension of size sequence_length, so that it can provide predictions for every token in the sequence.

**TODO 3.1:** Complete the code below to change the hidden layer to a single RNN layer. See [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) for details.

In [16]:
from torch import nn

class FFTextClassifier(nn.Module):

    def __init__(self, vocab_size, embedding_size, hidden_size, num_classes):
        super(FFTextClassifier, self).__init__()

        self.embedding_size = embedding_size

        # Here we just need to construct the components of our network. We don't need to connect them together yet.
        self.embedding_layer = nn.Embedding(vocab_size, embedding_size) # embedding layer

        ### COMPLETE YOUR CODE HERE
        self.hidden_layer = nn.RNN(embedding_size, hidden_size, num_layers=1, batch_first=True) # Hidden layer
        ###

        self.output_layer = nn.Linear(hidden_size, num_classes) # Full connection layer

    def forward (self, input_words):
        # Input dimensions are:  (batch_size, seq_length)
        embedded_words = self.embedding_layer(input_words)  # (batch_size, seq_length, embedding_size)

        ### COMPLETE THE CODE HERE TO USE THE HIDDEN RNN LAYER
        h, _ = self.hidden_layer(embedded_words)   # (batch_size, seq_length, hidden_size)
        ###

        output = self.output_layer(h)                      # (batch_size, seq_length, num_classes)
        output = torch.transpose(output, 1, 2)              # (batch_size, num_classes, seq_length)
        # Notice we haven't applied a softmax activation to the output layer -- it's not required by Pytorch's loss function.

        return output

Now, we can run the code below to train and test the RNN model. This uses basically the same code as last week.

**TO-DO 3.2:** What is wrong with comparing the RNN tagger's performance computed here with that of the HMM? Hint: all the sequences are length 40.

**TO-DO 3.3:** Can you fix the accuracy computations to make them comparable with the accuracy for the HMM?

In [17]:
embedding_size = 25  # number of dimensions for embeddings
hidden_size = 32 # number of hidden units

ff_classifier_model = FFTextClassifier(V+1, embedding_size, hidden_size, num_tags)

In [18]:
def train_nn(num_epochs, model, train_dataloader, dev_dataloader, loss_fn, optimizer):

    for e in range(num_epochs):
        # Track performance on the training set as we are learning...
        total_correct = 0
        total_trained = 0
        train_losses = []

        model.train()  # Put the model in training mode.

        for i, (batch_input_ids, batch_labels) in enumerate(train_dataloader):
            # Iterate over each batch of data
            # print(f'batch no. = {i}')

            optimizer.zero_grad()  # Reset the optimizer

            # Use the model to perform forward inference on the input data.
            # This will run the forward() function.
            output = model(batch_input_ids)

            # Compute the loss for the current batch of data
            batch_loss = loss_fn(output, batch_labels)

            # Perform back propagation to compute the gradients with respect to each weight
            batch_loss.backward()

            # Update the weights using the compute gradients
            optimizer.step()

            # Record the loss from this sample to keep track of progress.
            train_losses.append(batch_loss.item())

            # Count correct labels so we can compute accuracy on the training set
            predicted_labels = output.argmax(1)
                        
            ### CHANGE CODE HERE
            # n_tokens_in_batch = batch_labels.size(0) * sequence_length
            n_tokens_in_batch = np.sum(batch_input_ids.numpy() != padding_token_id)
            predicted_labels = predicted_labels[batch_input_ids != padding_token_id]
            batch_labels = batch_labels[batch_input_ids != padding_token_id]
            ###
            
            total_correct += (predicted_labels == batch_labels).sum().item()
            total_trained += n_tokens_in_batch

        train_accuracy = total_correct/total_trained*100

        print("Epoch: {}/{}".format((e+1), num_epochs),
              "Training Loss: {:.4f}".format(np.mean(train_losses)),
              "Training Accuracy: {:.4f}%".format(train_accuracy))

        model.eval()  # Switch model to evaluation mode
        total_correct = 0
        total_trained = 0
        dev_losses = []

        for dev_input_ids, dev_labels in dev_dataloader:
            dev_output = model(dev_input_ids)
            dev_loss = loss_fn(dev_output, dev_labels)

            # Save the loss on the dev set
            dev_losses.append(dev_loss.item())

            # Count the number of correct predictions
            predicted_labels = dev_output.argmax(1)
            
            ### CHANGE CODE HERE
            # n_tokens_in_batch = dev_labels.size(0) * sequence_length
            n_tokens_in_batch = np.sum(dev_input_ids.numpy() != padding_token_id)
            predicted_labels = predicted_labels[dev_input_ids != padding_token_id]
            dev_labels = dev_labels[dev_input_ids != padding_token_id]           
            ###
            
            total_correct += (predicted_labels == dev_labels).sum().item()
            total_trained += n_tokens_in_batch

        dev_accuracy = total_correct/total_trained*100

        print("Epoch: {}/{}".format((e+1), num_epochs),
              "Validation Loss: {:.4f}".format(np.mean(dev_losses)),
              "Validation Accuracy: {:.4f}%".format(dev_accuracy))
    return model

The code below runs the trainin process by calling train_nn():

In [19]:
from torch import optim

learning_rate = 0.0005

loss_fn = nn.CrossEntropyLoss(ignore_index=V)
optimizer = optim.Adam(ff_classifier_model.parameters(), lr=learning_rate)

num_epochs = 10
trained_model = train_nn(num_epochs, ff_classifier_model, train_loader, test_loader, loss_fn, optimizer)

Epoch: 1/10 Training Loss: 1.3626 Training Accuracy: 57.0481%
Epoch: 1/10 Validation Loss: 0.8516 Validation Accuracy: 71.8970%
Epoch: 2/10 Training Loss: 0.6870 Training Accuracy: 77.3905%
Epoch: 2/10 Validation Loss: 0.5705 Validation Accuracy: 81.1442%
Epoch: 3/10 Training Loss: 0.4981 Training Accuracy: 83.4649%
Epoch: 3/10 Validation Loss: 0.4466 Validation Accuracy: 85.1668%
Epoch: 4/10 Training Loss: 0.3992 Training Accuracy: 86.7975%
Epoch: 4/10 Validation Loss: 0.3738 Validation Accuracy: 87.6466%
Epoch: 5/10 Training Loss: 0.3348 Training Accuracy: 88.9441%
Epoch: 5/10 Validation Loss: 0.3242 Validation Accuracy: 89.2042%
Epoch: 6/10 Training Loss: 0.2882 Training Accuracy: 90.4683%
Epoch: 6/10 Validation Loss: 0.2880 Validation Accuracy: 90.4718%
Epoch: 7/10 Training Loss: 0.2527 Training Accuracy: 91.6521%
Epoch: 7/10 Validation Loss: 0.2609 Validation Accuracy: 91.3927%
Epoch: 8/10 Training Loss: 0.2251 Training Accuracy: 92.5832%
Epoch: 8/10 Validation Loss: 0.2401 Valida

The code below implements a testing or prediction function and computes accuracy.

**TO-DO 3.4:** Adjust the code below to correctly compute the accuracy.

In [20]:
def test_nn(trained_model, test_loader, loss_fn):

    trained_model.eval()

    test_losses = []
    correct = 0  # count the number of correct classification labels
    total_labels = 0

    for inputs, labels in test_loader:
        test_output = trained_model(inputs)
        loss = loss_fn(test_output, labels)
        test_losses.append(loss.item())
        predicted_labels = test_output.argmax(1)

        ### CHANGE THE CODE BELOW
        # n_tokens_in_batch = labels.size(0) * sequence_length
        n_tokens_in_batch = np.sum(inputs.numpy() != padding_token_id)
        predicted_labels = predicted_labels[inputs != padding_token_id]
        labels = labels[inputs != padding_token_id]   
        ###
        
        count_correct = torch.sum(predicted_labels == labels).item()
        correct += count_correct
        total_labels += n_tokens_in_batch

    accuracy = correct/total_labels * 100
    print("Test Accuracy: {:.2f}%".format(accuracy))
    # print(predicted)

test_nn(trained_model, test_loader, loss_fn)

Test Accuracy: 93.11%
