# Tutorial: Part-of-Speech (POS) Tagging with PyTorch

------
***Session By***:
* `Debajyoti Dasgupta`
* `Somnath Jena`

***Course***: Deep Learning

***Session***: Spring 2022-23
-------

\\

In this tutorial, we will learn how to train a deep learning model for Part-of-Speech (POS) tagging using PyTorch. We will also cover how to use the trained model for inference on new text data.

The tutorial will be divided into the following sections:

1. Introduction
2. Dataset Preparation
3. Feature Extraction
4. Model Definition
5. Training the Model
6. Model Evaluation

## Introduction
We will start by discussing what POS tagging is and why it is important. We will also introduce the PyTorch library and the basic concepts that we will use in this tutorial.

## Dataset Preparation
In this section, we will describe the dataset that we will use for training and testing our model. We will also cover how to prepare the dataset for training.

## Feature Extraction
We will discuss how to extract useful features from the input text, which can be used as input to our model.

## Model Definition
In this section, we will define the architecture of our model using PyTorch. We will discuss different layers of the model and how to connect them.

## Training the Model
We will train the model using the dataset we prepared in section 2. We will define the loss function and optimization algorithm for training the model.

## Model Evaluation
We will evaluate the trained model on a test dataset and report the performance metrics.

Let's get started!

In [2]:
%pip install pytorch-lightning -q -U

Note: you may need to restart the kernel to use updated packages.


# 1. Introduction

## RNN
<div align="center">
  <img src="https://imgur.com/EUw3a0e.png" width="500px"/>
</div>

Recurrent Neural Networks (RNNs) are a class of neural networks that are designed to handle sequential data, such as text or time-series data. Unlike traditional feedforward neural networks, RNNs have a hidden state that allows them to maintain information about previous inputs, which makes them well-suited for tasks such as language modeling, machine translation, and POS tagging.

## Part of Speech Tagging


<div align="center">
  <img src="https://cdn-media-1.freecodecamp.org/images/1*f6e0uf5PX17pTceYU4rbCA.jpeg" width="500px"/>
</div>

\\

In POS tagging, RNNs can be used to model the context of a word in a sentence by considering the surrounding words and their respective POS tags. Specifically, a word in a sentence is represented as an input vector, which is passed through the RNN along with the hidden state from the previous timestep. The output of the RNN at each timestep is then used to predict the corresponding POS tag for the current word.

\\

<div align="center">
  <img src="https://i.imgur.com/IeDJtzW.png" width="1000px"/>
</div>
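
The per-timestep prediction described above can be sketched with PyTorch's built-in `nn.RNN`. This is only an illustration, not the model trained later in this tutorial: all sizes and the example word indices are hypothetical, and the weights are untrained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
vocab_size, embedding_dim, hidden_dim, num_tags = 10, 8, 16, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_tags)

# One sentence of four word indices: shape (batch=1, seq_len=4)
word_ids = torch.tensor([[1, 4, 2, 7]])

# `outputs` holds the hidden state at every timestep,
# `last_hidden` only the hidden state after the final word
outputs, last_hidden = rnn(embedding(word_ids))

# One score vector per word, then the highest-scoring tag per word
tag_scores = classifier(outputs)
predicted_tags = tag_scores.argmax(dim=-1)

print(outputs.shape)         # torch.Size([1, 4, 16])
print(tag_scores.shape)      # torch.Size([1, 4, 5])
print(predicted_tags.shape)  # torch.Size([1, 4])
```

Because the hidden state is carried from one timestep to the next, the score vector for the fourth word depends on the three words before it.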


## LSTM
One common type of RNN used for POS tagging is the `Long Short-Term Memory (LSTM)` network, which is designed to avoid the `vanishing gradient` problem that can occur with traditional RNNs. LSTM networks use `gated units` that allow them to selectively forget or remember information from previous timesteps, which makes them more effective at modeling long-term dependencies in sequential data.


<div align="center">
  <img src="https://imgur.com/EjKEky2.png" width="800px"/>
</div>
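
For reference, the gating mechanism can be written down using the commonly used LSTM formulation. Here $x_t$ is the input at timestep $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function and $\odot$ element-wise multiplication. The symbol names are a notational choice for this sketch and are not taken from the figure above.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive update of $c_t$ is what lets gradients flow across many timesteps: when $f_t$ is close to 1, the previous cell state is carried forward almost unchanged.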

Overall, RNNs, and specifically LSTM networks, have been shown to achieve high accuracy in POS tagging tasks, especially when combined with other techniques such as pre-training and attention mechanisms.

# 2. Dataset Preparation
The dataset that we will use for this tutorial is the Penn Treebank dataset, which is a widely used dataset for training and evaluating POS tagging models. It consists of a set of Wall Street Journal articles from `1989-1992`, which have been annotated with POS tags.

## Dataset Description
The `English Penn Treebank (PTB)` corpus, and in particular the section of the corpus corresponding to the articles of the `Wall Street Journal (WSJ)`, is one of the best-known and most widely used corpora for evaluating sequence labelling models. The task consists of annotating each word with its `Part-of-Speech tag`. In the most common split of this corpus, sections `0 to 18` are used for training `(38 219 sentences, 912 344 tokens)`, sections `19 to 21` for validation `(5 527 sentences, 131 768 tokens)`, and sections `22 to 24` for testing `(5 462 sentences, 129 654 tokens)`. The corpus is also commonly used for character-level and word-level language modelling.


### Treebank Viewers
<div align="center">
  <img src="https://sandiway.arizona.edu/treebankviewer/viewer.png" width="500px"/>
</div>

### Detailed List of Penn Treebank Tags
src: [UPenn Treebank Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

| Number	| Tag	  |Description |
|---------|-------|---------------------- |
| 1.	    | CC	  |Coordinating conjunction |
| 2.	    | CD	  |Cardinal number |
| 3.	    | DT	  |Determiner |
| 4.	    | EX	  |Existential there |
| 5.	    | FW	  |Foreign word |
| 6.	    | IN	  |Preposition or subordinating conjunction |
| 7.	    | JJ	  |Adjective |
| 8.	    | JJR	  |Adjective, comparative |
| 9.	    | JJS	  |Adjective, superlative |
| 10.	    | LS	  |List item marker |
| 11.	    | MD	  |Modal |
| 12.	    | NN	  |Noun, singular or mass |
| 13.	    | NNS	  |Noun, plural |
| 14.	    | NNP	  |Proper noun, singular |
| 15.	    | NNPS	|Proper noun, plural |
| 16.	    | PDT	  |Predeterminer |
| 17.	    | POS	  |Possessive ending |
| 18.	    | PRP	  |Personal pronoun |
| 19.	    | PRP\$	|Possessive pronoun |
| 20.	    | RB	  |Adverb |
| 21.	    | RBR	  |Adverb, comparative |
| 22.	    | RBS	  |Adverb, superlative |
| 23.	    | RP	  |Particle |
| 24.	    | SYM	  |Symbol |
| 25.	    | TO	  |to |
| 26.	    | UH	  |Interjection |
| 27.	    | VB	  |Verb, base form |
| 28.	    | VBD	  |Verb, past tense |
| 29.	    | VBG	  |Verb, gerund or present participle |
| 30.	    | VBN	  |Verb, past participle |
| 31.	    | VBP	  |Verb, non-3rd person singular present |
| 32.	    | VBZ	  |Verb, 3rd person singular present |
| 33.	    | WDT	  |Wh-determiner |
| 34.	    | WP	  |Wh-pronoun |
| 35.	    | WP\$	  |Possessive wh-pronoun |
| 36.	    | WRB	  |Wh-adverb |


## Dataset Preprocessing
We will be using a subset of the `Penn Treebank` dataset that ships with the `nltk` library and can be downloaded by simply calling `nltk.download('treebank')`. For preprocessing, we simply divide the dataset into three parts, namely
* training set (`train`)
* validation set (`dev`)
* test set (`test`)

In a real-world scenario, a more rigorous evaluation protocol such as `k-fold cross-validation` is often used instead of a single `train-valid-test` split. For this tutorial, however, we will use a standard ***80-10-10*** split: `80%` train, `10%` dev and `10%` test.
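
As a quick sanity check of the split arithmetic, the sketch below computes the three sizes for 3,914 sentences, which is the total implied by the example counts printed later in this tutorial. It splits a plain `range` of indices rather than the corpus itself, and the seeded `torch.Generator` is an addition (not used in the tutorial's own cell) that makes the random split reproducible.

```python
import torch
from torch.utils.data import random_split

# Total number of tagged sentences implied by the counts printed later (3131 + 391 + 392)
num_sentences = 3914

train_size = int(0.8 * num_sentences)
dev_size = int(0.1 * num_sentences)
# The test set takes whatever is left, so the three sizes always sum to the total
test_size = num_sentences - train_size - dev_size

# Assumption: a fixed seed, so that the same split is produced on every run
generator = torch.Generator().manual_seed(42)
train_subset, dev_subset, test_subset = random_split(
    range(num_sentences), [train_size, dev_size, test_size], generator=generator
)

print(len(train_subset), len(dev_subset), len(test_subset))  # 3131 391 392
```

Without a fixed seed, each run assigns different sentences to each subset, so losses and metrics will vary slightly between runs.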

In [1]:
# Preprocess the Penn Treebank dataset
import torch
import nltk
nltk.download('treebank')

from nltk.corpus import treebank

# Load the dataset
dataset = treebank.tagged_sents()

# Split into train, validation, and test sets
train_size = int(0.8 * len(dataset))
dev_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - dev_size
train_dataset, dev_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, dev_size, test_size])

print(train_dataset)

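**Note:** if this cell raises `ModuleNotFoundError: No module named 'nltk'`, install the library first (for example with `%pip install nltk`) and re-run the cell.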

# 3. Feature Extraction

In the code below, we use the `Treebank` corpus from the `NLTK` library to extract `sentences` and `tags`. Each sentence in the corpus is a list of `(word, tag)` tuples, where `word` is the token and `tag` is its part-of-speech tag.

\\

The `preprocess` function takes the dataset as an input, and extracts sentences and tags using list comprehension. It converts all the words to `lowercase` to 
* reduce the vocabulary size
* ensure consistency

We also set a maximum sequence length of `25 tokens`: shorter sentences are padded and longer sentences are truncated to this length, which keeps the tensors rectangular and limits the computational cost.

\\

Next, we create two mappings, `word_to_idx` and `tag_to_idx`. These mappings associate each unique word and tag with a unique integer index. The `<PAD>` and `<UNK>` tokens are added to the word_to_idx mapping as the first two indices, respectively. The `<PAD>` token is used to pad sentences shorter than the maximum length, while the `<UNK>` token is used for out-of-vocabulary words.

\\

In the first loop, we append `<PAD>` to every word sequence and tag sequence shorter than `SEQ_LEN`, and truncate longer ones, so that all sequences have exactly `SEQ_LEN` entries. Then, we add each unique tag to the `tag_to_idx` mapping, starting from index 1 (since 0 is reserved for `<PAD>`).

\\

Similarly, we add each unique word to the word_to_idx mapping, starting from index 2 (since 0 is reserved for `<PAD>` and 1 for `<UNK>`).

\\

Finally, we convert each word and tag to its corresponding integer index using the mappings. We create two tensors `X` and `Y`, where X contains the integer indices of the words and Y contains the integer indices of the tags. We also cast the tensors to torch.LongTensor data type.

\\

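To summarize, the `preprocess` function takes in the dataset and returns two tensors `X` and `Y`, which hold the integer indices of the words and tags respectively. It also updates the module-level mappings `word_to_idx` and `tag_to_idx` in place; these associate each unique word and tag with a unique integer index. Note that because the same function is later called on the validation and test sets, their words are added to the vocabulary as well, so the `<UNK>` fallback is never triggered in this tutorial. These tensors and mappings can be used for training a POS tagging model.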
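
The steps above can be traced on a toy example. The sketch below is a simplified variant, not the tutorial's `preprocess` function: the sentences, the short `SEQ_LEN` of 5 and the helper names `pad_or_truncate` and `encode` are all hypothetical, and the vocabulary is built from the training sentence only so that the `<UNK>` fallback becomes visible.

```python
PAD, UNK = "<PAD>", "<UNK>"
SEQ_LEN = 5  # hypothetical short length, chosen so the padding is easy to see


def pad_or_truncate(items, length, pad=PAD):
    # Append enough padding, then cut to exactly `length` entries
    return (items + [pad] * length)[:length]


# A single hypothetical training sentence of (word, tag) tuples
train = [[("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

word_to_idx = {PAD: 0, UNK: 1}
tag_to_idx = {PAD: 0}
for sentence in train:
    for word, tag in sentence:
        # setdefault assigns the next free index only to unseen entries
        word_to_idx.setdefault(word.lower(), len(word_to_idx))
        tag_to_idx.setdefault(tag, len(tag_to_idx))


def encode(sentence):
    words = pad_or_truncate([w.lower() for w, _ in sentence], SEQ_LEN)
    tags = pad_or_truncate([t for _, t in sentence], SEQ_LEN)
    word_ids = [word_to_idx.get(w, word_to_idx[UNK]) for w in words]
    tag_ids = [tag_to_idx[t] for t in tags]
    return word_ids, tag_ids


x, y = encode(train[0])
print(x)  # [2, 3, 4, 0, 0]
print(y)  # [1, 2, 3, 0, 0]

# "dog" was never seen during vocabulary construction, so it maps to <UNK> (index 1)
unseen_x, _ = encode([("The", "DT"), ("dog", "NN")])
print(unseen_x)  # [2, 1, 0, 0, 0]
```

Building the vocabulary from the training data alone mirrors what happens at inference time, when the model meets words it has never seen.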

In [None]:
SEQ_LEN = 25  # every sentence is padded or truncated to this many tokens

# Create word_to_idx and tag_to_idx mappings.
# They are module-level, so every call to preprocess() extends them in place.
word_to_idx = {"<PAD>": 0, "<UNK>": 1}
tag_to_idx = {"<PAD>": 0}


def preprocess(dataset):
    # Extract lowercased tokens and their tags
    sent = [[token.lower() for token, tag in sentence] for sentence in dataset]
    tags = [[tag for token, tag in sentence] for sentence in dataset]

    # Pad short sentences with <PAD> and truncate long ones to SEQ_LEN
    for i in range(len(sent)):
        while len(sent[i]) < SEQ_LEN:
            sent[i].append('<PAD>')
            tags[i].append('<PAD>')

        if len(sent[i]) > SEQ_LEN:
            sent[i] = sent[i][:SEQ_LEN]
            tags[i] = tags[i][:SEQ_LEN]

    # Assign the next free index to every tag not seen before
    for sentence_tags in tags:
        for tag in sentence_tags:
            if tag not in tag_to_idx:
                tag_to_idx[tag] = len(tag_to_idx)

    # Assign the next free index to every word not seen before
    for sentence in sent:
        for word in sentence:
            if word not in word_to_idx:
                word_to_idx[word] = len(word_to_idx)

    # Convert words and tags to indices (unknown words would fall back to <UNK>)
    X = torch.tensor([[word_to_idx.get(word, word_to_idx["<UNK>"]) for word in sentence] for sentence in sent], dtype=torch.long)
    Y = torch.tensor([[tag_to_idx[tag] for tag in sentence] for sentence in tags], dtype=torch.long)

    return X, Y

In [None]:
train_X, train_Y = preprocess(train_dataset)
dev_X, dev_Y = preprocess(dev_dataset)
test_X, test_Y = preprocess(test_dataset)

In [None]:
# Print the sizes of the datasets
print(f"Number of training examples: {len(train_X)}")
print(f"Number of validation examples: {len(dev_X)}")
print(f"Number of testing examples: {len(test_X)}")

Number of training examples: 3131
Number of validation examples: 391
Number of testing examples: 392


# 4. Model Definition
We will define our model using PyTorch Lightning's LightningModule class, which allows us to organize our training logic into separate methods, making our code easier to understand and maintain.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import pytorch_lightning as pl

class POSModel(pl.LightningModule):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim, num_layers=1, bidirectional=False):
        super().__init__()
        # (B, seq_len) -> (B, seq_len, embedding_dim)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # (B, seq_len, embedding_dim) -> (B, seq_len, hidden_dim), or 2 * hidden_dim if bidirectional
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, num_layers=num_layers, bidirectional=bidirectional)
        # Linear layer mapping every LSTM output to one score per tag
        if bidirectional:
            self.fc = nn.Linear(2*hidden_dim, tagset_size)
        else:
            self.fc = nn.Linear(hidden_dim, tagset_size)
        self.loss_fn = nn.CrossEntropyLoss()
    
    def forward(self, x):
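        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.fc(lstm_out)
        # Log-probabilities over tags. nn.CrossEntropyLoss applies log_softmax again,
        # which leaves already-normalized log-probabilities unchanged.
        tag_scores = nn.functional.log_softmax(tag_space, dim=2)
        return tag_scores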
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss_fn(y_hat.view(-1, y_hat.shape[-1]), y.view(-1))
        self.log('train_loss', loss)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss_fn(y_hat.view(-1, y_hat.shape[-1]), y.view(-1))
        self.log('val_loss', loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss_fn(y_hat.view(-1, y_hat.shape[-1]), y.view(-1))
        self.log('test_loss', loss)
        return loss
    
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer

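The `__init__` method defines the architecture of our model. In this case, we use an embedding layer followed by an LSTM layer (unidirectional by default, bidirectional when `bidirectional=True`) and a linear layer that outputs a score for each tag. When the LSTM is bidirectional, the forward and backward hidden states are concatenated, so the linear layer takes `2 * hidden_dim` inputs. The `forward` method takes the input `x` and applies the layers defined in the `__init__` method.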
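
To make the tensor shapes concrete, here is a small stand-alone sketch that pushes a random batch through the same three layer types. The vocabulary size of 1000 is hypothetical, and the tag set size of 47 is an assumption inferred from the parameter counts in the model summaries printed in the next section.

```python
import torch
import torch.nn as nn

batch_size, seq_len = 4, 25
vocab_size = 1000    # hypothetical vocabulary size
tagset_size = 47     # assumption: inferred from the printed model summaries
embedding_dim, hidden_dim = 100, 100

# A random batch of word indices: (batch_size, seq_len)
x = torch.randint(0, vocab_size, (batch_size, seq_len))
embedding = nn.Embedding(vocab_size, embedding_dim)
embeds = embedding(x)  # (4, 25, 100)

shapes = {}
for bidirectional in (False, True):
    lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=bidirectional)
    # A bidirectional LSTM concatenates forward and backward states
    fc = nn.Linear(hidden_dim * (2 if bidirectional else 1), tagset_size)

    lstm_out, _ = lstm(embeds)
    scores = fc(lstm_out)

    n_lstm = sum(p.numel() for p in lstm.parameters())
    n_fc = sum(p.numel() for p in fc.parameters())
    shapes[bidirectional] = (tuple(lstm_out.shape), tuple(scores.shape), n_lstm, n_fc)
    print(bidirectional, shapes[bidirectional])

# False ((4, 25, 100), (4, 25, 47), 80800, 4747)
# True ((4, 25, 200), (4, 25, 47), 161600, 9447)
```

The parameter counts line up with the `80.8 K` / `4.7 K` and `161 K` / `9.4 K` figures that PyTorch Lightning reports for the unidirectional and bidirectional models.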

We also inherit from `pl.LightningModule` to get access to PyTorch Lightning's training loop.
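
One detail of the loss computation in `training_step` is worth a closer look: the `(batch, seq_len, tagset_size)` scores are flattened so that every token becomes one classification example, and `<PAD>` positions count toward the loss like any other tag. The sketch below uses random scores and hypothetical targets to show the flattening, to check that feeding log-probabilities into `nn.CrossEntropyLoss` gives the same value as feeding raw scores, and to show `ignore_index` as one possible way to leave padding out. The tutorial's model does not use `ignore_index`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random scores with shape (batch, seq_len, tagset_size); values are illustrative only
logits = torch.randn(2, 5, 4)
# Hypothetical targets; 0 is the <PAD> tag index
targets = torch.tensor([[1, 2, 3, 0, 0],
                        [2, 1, 0, 0, 0]])

# Flatten so that each token is one classification example
flat_logits = logits.view(-1, logits.shape[-1])  # (10, 4)
flat_targets = targets.view(-1)                  # (10,)

loss_all = nn.CrossEntropyLoss()(flat_logits, flat_targets)

# Applying log_softmax first does not change the result
log_probs = torch.log_softmax(flat_logits, dim=-1)
loss_on_log_probs = nn.CrossEntropyLoss()(log_probs, flat_targets)
print(torch.allclose(loss_all, loss_on_log_probs, atol=1e-6))  # True

# Optional variant: average only over the five non-padding tokens
loss_no_pad = nn.CrossEntropyLoss(ignore_index=0)(flat_logits, flat_targets)
print(loss_all.item(), loss_no_pad.item())
```

Since padding is trivial to predict, including it tends to make the reported loss look lower than the loss on real tokens alone, which is worth keeping in mind when reading the test losses below.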

# 5. Training the Model
We will use PyTorch Lightning's Trainer class to train our model. This class takes care of setting up the training loop, optimizing the model, and handling GPU acceleration.

In [None]:
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

EMBEDDING_DIM = 100
HIDDEN_DIM    = 100
NUM_EPOCHS    = 10 
BATCH_SIZE    = 4

train_dataset = TensorDataset(train_X, train_Y)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

val_dataset = TensorDataset(dev_X, dev_Y)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

test_dataset = TensorDataset(test_X, test_Y)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

In [None]:
model = POSModel(vocab_size=len(word_to_idx), tagset_size=len(tag_to_idx), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, bidirectional=False)
early_stopping = EarlyStopping(monitor="val_loss", patience=3, mode="min")
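# `gpus=1` is deprecated (and removed in recent releases); `accelerator="auto"` uses a GPU when one is available
trainer = pl.Trainer(max_epochs=NUM_EPOCHS, accelerator="auto", devices=1, callbacks=[early_stopping])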
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

trainer.test(dataloaders=test_loader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name      | Type             | Params
-----------------------------------------------
0 | embedding | Embedding        | 1.0 M 
1 | lstm      | LSTM             | 80.8 K
2 | fc        | Linear           | 4.7 K 
3 | loss_fn   | CrossEntropyLoss | 0     
-----------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.361     Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_0/checkpoints/epoch=8-step=7047.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/lightning_logs/version_0/checkpoints/epoch=8-step=7047.ckpt


────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss            0.418576180934906
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.418576180934906}]

In [None]:
model = POSModel(vocab_size=len(word_to_idx), tagset_size=len(tag_to_idx), embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM, bidirectional=True)
early_stopping = EarlyStopping(monitor="val_loss", patience=3, mode="min")
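# `gpus=1` is deprecated (and removed in recent releases); `accelerator="auto"` uses a GPU when one is available
trainer = pl.Trainer(max_epochs=NUM_EPOCHS, accelerator="auto", devices=1, callbacks=[early_stopping])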
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

trainer.test(dataloaders=test_loader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name      | Type             | Params
-----------------------------------------------
0 | embedding | Embedding        | 1.0 M 
1 | lstm      | LSTM             | 161 K 
2 | fc        | Linear           | 9.4 K 
3 | loss_fn   | CrossEntropyLoss | 0     
-----------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.703     Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_1/checkpoints/epoch=7-step=6264.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/lightning_logs/version_1/checkpoints/epoch=7-step=6264.ckpt


────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss           0.3800565004348755
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.3800565004348755}]

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report

# define idx_to_tag
idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}

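# define device: use whichever device the trained model currently lives on,
# so that inputs and model weights are guaranteed to match
device = model.device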

# Create a dataloader for the test set
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

# Set the model to evaluation mode
model.eval()

y_true = []
y_pred = []

with torch.no_grad():
    for x, y in test_loader:
        # Move the data to the device
        x = x.to(device)
        y = y.to(device)

        # Forward pass
        y_hat = model(x)

        # Compute the predicted tags
        y_pred += [idx_to_tag[i] for i in y_hat.argmax(-1).cpu().numpy().flatten().tolist()]

        # Compute the true tags
        y_true += [idx_to_tag[i] for i in y.cpu().numpy().flatten().tolist()]

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           $       0.98      0.97      0.98        64
          ''       1.00      0.98      0.99        42
           ,       1.00      1.00      1.00       382
       -LRB-       1.00      1.00      1.00         8
      -NONE-       0.97      0.96      0.96       506
       -RRB-       1.00      1.00      1.00         4
           .       1.00      1.00      1.00       188
           :       1.00      0.98      0.99        59
       <PAD>       1.00      1.00      1.00      1697
          CC       0.98      0.99      0.99       166
          CD       0.86      0.91      0.88       264
          DT       0.99      1.00      1.00       662
          EX       0.75      1.00      0.86         3
          IN       0.96      0.98      0.97       811
          JJ       0.78      0.74      0.76       487
         JJR       0.85      0.79      0.82        29
         JJS       0.92      0.71      0.80        17
          MD       1.00    
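
Note that the report above includes a `<PAD>` row, and padding positions are trivially easy to predict. A complementary view is token-level accuracy computed only over real tokens. The sketch below shows one way to do this with a boolean mask; `masked_accuracy` is a hypothetical helper, and the two small tensors are made-up examples rather than outputs of the trained model.

```python
import torch

PAD_IDX = 0  # index of "<PAD>" in tag_to_idx


def masked_accuracy(pred, target, pad_idx=PAD_IDX):
    # Only positions whose true tag is not <PAD> count toward the score
    mask = target != pad_idx
    correct = (pred == target) & mask
    return correct.sum().item() / mask.sum().item()


# Made-up example: three real tokens followed by two padding positions
target = torch.tensor([[3, 5, 2, 0, 0]])
pred = torch.tensor([[3, 4, 2, 0, 0]])

acc_with_pad = (pred == target).float().mean().item()
acc_without_pad = masked_accuracy(pred, target)

print(round(acc_with_pad, 3))     # 0.8   (4 of 5 positions match)
print(round(acc_without_pad, 3))  # 0.667 (2 of 3 real tokens match)
```

With the tutorial's evaluation loop, the same idea would apply to `y_hat.argmax(-1)` and `y` for each batch.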



In [None]:
# Set the model to evaluation mode
model.eval()

idx_to_word = {idx: word for word, idx in word_to_idx.items()}

y_true = []
y_pred = []

with torch.no_grad():
    for x, y in test_loader:
        # Move the data to the device
        x = x.to(device)
        y = y.to(device)

        # Forward pass
        y_hat = model(x)

        # Get back the sentence
        x_sent = [idx_to_word[i] for i in x.cpu().numpy().flatten().tolist()]

        # Compute the predicted tags
        y_pred += [idx_to_tag[i] for i in y_hat.argmax(-1).cpu().numpy().flatten().tolist()]

        # Compute the true tags
        y_true += [idx_to_tag[i] for i in y.cpu().numpy().flatten().tolist()]
        print("Sentence")
        print(x_sent)
        print("Predicted tags")
        print(y_pred)
        break

Sentence
['stock-index', 'futures', '--', 'contracts', '*', 'to', 'buy', 'or', 'sell', 'the', 'cash', 'value', 'of', 'a', 'stock', 'index', 'by', 'a', 'certain', 'date', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'j.', 'landis', 'martin', ',', 'nl', 'president', 'and', 'chief', 'executive', 'officer', ',', 'said', '0', 'nl', 'and', 'mr.', 'simmons', 'cut', 'the', 'price', '*ich*-2', '0', 'they', 'were', 'proposing', 'in', 'san', 'francisco', ',', 'its', 'backers', 'concede', '0', 'the', 'ballpark', 'is', 'at', 'best', 'running', 'even', 'in', 'the', 'polls', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'investors', 'took', 'advantage', 'of', 'tuesday', "'s", 'stock', 'rally', '*-1', 'to', 'book', 'some', 'profits', 'yesterday', ',', '*-1', 'leaving', 'stocks', 'up', 'fractionally', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
Predicted tags
['NN', 'NNS', ':', 'NNS', '-NONE-', 'TO', 'VB', 'CC', 'VB', 'DT', 'NN', 'NN', 'IN', 'DT', 'NN', 'NN', 'IN', 'DT', 'JJ', 'NN', '.', '<PAD>',
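
To tag a new sentence rather than a batch from the test set, the tokens have to go through the same lowercasing, index lookup and padding as the training data. The sketch below shows one possible way to wrap this up. The helpers `encode_sentence` and `tag_sentence` are hypothetical (they are not part of the tutorial's code), and the tiny untrained `toy_model` with its toy mappings only stands in for the trained `POSModel` so that the example runs on its own; its predicted tags are therefore meaningless.

```python
import torch
import torch.nn as nn

SEQ_LEN = 25  # same maximum length as used during preprocessing


def encode_sentence(tokens, word_to_idx, seq_len=SEQ_LEN):
    # Lowercase, map unknown words to <UNK>, truncate, then pad to seq_len
    ids = [word_to_idx.get(tok.lower(), word_to_idx["<UNK>"]) for tok in tokens[:seq_len]]
    ids += [word_to_idx["<PAD>"]] * (seq_len - len(ids))
    return torch.tensor([ids], dtype=torch.long)  # shape (1, seq_len)


def tag_sentence(model, tokens, word_to_idx, idx_to_tag, seq_len=SEQ_LEN):
    model.eval()
    # Put the input on the same device as the model weights
    device = next(model.parameters()).device
    with torch.no_grad():
        scores = model(encode_sentence(tokens, word_to_idx, seq_len).to(device))
    # Keep predictions for real tokens only, dropping the padded positions
    pred = scores.argmax(dim=-1)[0, :len(tokens)].tolist()
    return list(zip(tokens, (idx_to_tag[i] for i in pred)))


# Stand-ins so the sketch is self-contained (all hypothetical)
toy_word_to_idx = {"<PAD>": 0, "<UNK>": 1, "stocks": 2, "rose": 3}
toy_idx_to_tag = {0: "<PAD>", 1: "NNS", 2: "VBD"}
toy_model = nn.Sequential(nn.Embedding(4, 8), nn.Linear(8, 3))

tokens = ["Stocks", "rose", "sharply"]
tagged = tag_sentence(toy_model, tokens, toy_word_to_idx, toy_idx_to_tag)
print(tagged)
```

With the objects defined in this tutorial, the call would be `tag_sentence(model, tokens, word_to_idx, idx_to_tag)`, where `tokens` is an already tokenized sentence.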