# BiLSTM-CRF for PoS Tagging

Implementing CRF by [pytorch-crf](https://pytorch-crf.readthedocs.io/en/stable/). Install the package:  
```Python
$ pip install pytorch-crf
``` 

In [1]:
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

SEED = 515
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Preparing Data

The dataset is Universal Dependencies English Web Treebank (UDPOS).  
This dataset actually has two different sets of tags, [universal dependency (UD) tags](https://universaldependencies.org/u/pos/) and [Penn Treebank (PTB) tags](https://www.sketchengine.eu/penn-treebank-tagset/).  

In [2]:
from torchtext.data import Field, BucketIterator

TEXT = Field(lower=True, include_lengths=True)
# Because the set of possible tags is finite, do NOT use unknown token for it. 
UD_TAGS = Field(unk_token=None, include_lengths=True)
PTB_TAGS = Field(unk_token=None, include_lengths=True)

In [3]:
from torchtext.datasets import UDPOS

fields = [('text', TEXT), ('udtags', UD_TAGS), ('ptbtags', PTB_TAGS)]
train_data, valid_data, test_data = UDPOS.splits(fields=fields, root='data/')

In [4]:
print(train_data[0].text)
print(train_data[0].udtags)
print(train_data[0].ptbtags)

['al', '-', 'zaman', ':', 'american', 'forces', 'killed', 'shaikh', 'abdullah', 'al', '-', 'ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'qaim', ',', 'near', 'the', 'syrian', 'border', '.']
['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']
['NNP', 'HYPH', 'NNP', ':', 'JJ', 'NNS', 'VBD', 'NNP', 'NNP', 'NNP', 'HYPH', 'NNP', ',', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', ',', 'IN', 'DT', 'JJ', 'NN', '.']


In [5]:
TEXT.build_vocab(train_data, min_freq=2, 
                 vectors="glove.6B.100d", vectors_cache="vector_cache", 
                 unk_init=torch.Tensor.normal_)

UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)

print(len(TEXT.vocab), len(UD_TAGS.vocab), len(PTB_TAGS.vocab))
print(UD_TAGS.vocab.itos)

8866 18 51
['<pad>', 'NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']


In [6]:
BATCH_SIZE = 128

device = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE, device=device)

In [7]:
for batch in train_iterator:
    batch_text, batch_text_lens = batch.text
    batch_tags, batch_tags_lens = batch.udtags
    break

print(batch_text)
print(batch_text_lens)
print(batch_tags)
print(batch_tags_lens)

tensor([[  27,   56,  116,  ...,  127,    9, 3715],
        [  12,  244,    4,  ...,    4,   76,    1],
        [  73,   13,    1,  ...,    1, 1904,    1],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:3')
tensor([19, 16,  2, 20, 44, 11, 29, 13, 10, 38, 22, 71, 17,  7, 15, 12,  7, 10,
        12, 29, 20,  5, 42, 20, 25, 11, 11,  4, 22, 16, 31, 28,  2, 24, 60, 18,
         4,  7,  4, 17, 26, 38, 34,  5,  2,  6,  1,  4, 23, 24, 33,  9, 16,  1,
        20, 27, 26, 23, 20, 13, 14, 20, 29, 14,  7, 13,  6, 23, 15, 11, 14, 27,
        31, 18,  2, 38, 52,  2,  2,  5,  7, 22,  7, 12, 16, 12,  5, 42, 18, 19,
        15,  8, 11, 13,  3, 33,  7,  4,  7,  1, 25, 48, 20, 11,  2, 26, 22, 19,
        21,  4, 12,  9, 33, 16, 15, 25, 10, 36,  3,  9,  5, 20, 17, 14,  4,  2,
        19,  1], device='cuda:3')
tensor([[14, 13,  8,  ...,  1,  4,  7],
        [ 4,  1,  2,  .

## Building the Model

A Seq2Seq model  
* The elements in two sequences are not matched one by one  
* The two sequences may have different lengths  

A PoS-tagger  
* The elements in two sequences are strictly matched one by one  
* The two sequences have definitely the same length  

### Conditional Random Field (CRF)

In [8]:
VOC_DIM = len(TEXT.vocab)
EMB_DIM = 100
HID_DIM = 128
TAG_DIM = len(UD_TAGS.vocab)

N_LAYERS = 2
BIDIRECT = True
DROPOUT = 0.25
TEXT_PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]


emb = nn.Embedding(VOC_DIM, EMB_DIM, padding_idx=TEXT_PAD_IDX).to(device)
rnn = nn.LSTM(EMB_DIM, HID_DIM, num_layers=N_LAYERS, bidirectional=BIDIRECT, dropout=DROPOUT).to(device)
hid2tag = nn.Linear(HID_DIM*2 if BIDIRECT else HID_DIM, TAG_DIM).to(device)


mask = (batch_text == TEXT_PAD_IDX)
print(mask.size())
embedded = emb(batch_text)
# Pack sequence
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, batch_text_lens, enforce_sorted=False)
# hidden: (num_layers*num_directions, batch, hid_dim)
packed_outs, (hidden, cell) = rnn(packed_embedded)
# Unpack sequence
# outs: (step, batch, hid_dim)
outs, out_lens = nn.utils.rnn.pad_packed_sequence(packed_outs)

# feats: (step, batch, tag_dim)
feats = hid2tag(outs)
print(feats.size())

torch.Size([71, 128])
torch.Size([71, 128, 18])


In [9]:
from torchcrf import CRF
crf = CRF(TAG_DIM).to(device)

# The mask and losses accord to https://pytorch-crf.readthedocs.io/en/stable/
losses = -crf(feats, batch_tags, mask=(~mask).type(torch.uint8), reduction='none')
print(losses.size())

torch.Size([128])


In [10]:
best_paths = crf.decode(feats, mask=(~mask).type(torch.uint8))

torch.tensor([len(path) for path in best_paths], device=device) == batch_text_lens

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True], device='cuda:3')

### BiLSTM-CRF PoS-Tagger

In [11]:
class PoSTagger(nn.Module):
    def __init__(self, voc_dim, emb_dim, hid_dim, tag_dim, n_layers, bidirect, dropout, text_pad_idx):
        super().__init__()
        self.emb = nn.Embedding(voc_dim, emb_dim, padding_idx=text_pad_idx)
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, 
                           bidirectional=bidirect, dropout=dropout)
        self.hid2tag = nn.Linear(hid_dim*2 if bidirect else hid_dim, tag_dim)
        self.crf = CRF(tag_dim)
        self.dropout = nn.Dropout(dropout)

    def _get_rnn_features(self, text: torch.Tensor, seq_lens: torch.Tensor):
        # embedded: (step, batch, emb_dim)
        embedded = self.dropout(self.emb(text))
        # Pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, seq_lens, enforce_sorted=False)
        # hidden: (num_layers*num_directions, batch, hid_dim)
        packed_outs, (hidden, cell) = self.rnn(packed_embedded)
        # Unpack sequence
        # outs: (step, batch, hid_dim)
        outs, out_lens = nn.utils.rnn.pad_packed_sequence(packed_outs)

        # feats: (step, batch, tag_dim)
        feats = self.hid2tag(self.dropout(outs))
        return feats

    def forward(self, text: torch.Tensor, seq_lens: torch.Tensor, tags: torch.Tensor):
        # text/mask: (step, batch)
        mask = (text == self.emb.padding_idx)
        feats = self._get_rnn_features(text, seq_lens)
        
        # losses: (batch)
        losses = -self.crf(feats, tags, mask=(~mask).type(torch.uint8), reduction='none')
        return losses

    def decode(self, text: torch.Tensor, seq_lens: torch.Tensor):
        # text/mask: (step, batch)
        mask = (text == self.emb.padding_idx)
        feats = self._get_rnn_features(text, seq_lens)

        best_paths = self.crf.decode(feats, mask=(~mask).type(torch.uint8))
        return best_paths

In [12]:
tagger = PoSTagger(VOC_DIM, EMB_DIM, HID_DIM, TAG_DIM, N_LAYERS, 
                   BIDIRECT, DROPOUT, TEXT_PAD_IDX).to(device)
losses = tagger(batch_text, batch_text_lens, batch_tags)
print(losses.size())

torch.Size([128])


In [13]:
best_paths = tagger.decode(batch_text, batch_text_lens)

torch.tensor([len(path) for path in best_paths], device=device) == batch_text_lens

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True], device='cuda:3')

In [14]:
best_paths = torch.tensor([path + [TAG_PAD_IDX]*(batch_tags.size(0)-len(path)) for path in best_paths], device=device).T
best_paths.size()

torch.Size([71, 128])

In [15]:
# Check if data are mixed across different samples in a batch.
tagger.eval()
max_len_012 = batch_text_lens[0:3].max()
losses_012 = tagger(batch_text[:max_len_012, 0:3], batch_text_lens[0:3], batch_tags[:max_len_012, 0:3])
max_len_123 = batch_text_lens[1:4].max()
losses_123 = tagger(batch_text[:max_len_123, 1:4], batch_text_lens[1:4], batch_tags[:max_len_123, 1:4])

losses_012[1:] - losses_123[:2]

tensor([0., 0.], device='cuda:3', grad_fn=<SubBackward0>)

## Training the Model

In [16]:
def init_weights(m):
    for name, param in m.named_parameters():
        # NOTE: The CRF parameters have already been initialized. 
        if not name.startswith('crf'):
            nn.init.normal_(param.data, mean=0, std=0.1)

def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


tagger = PoSTagger(VOC_DIM, EMB_DIM, HID_DIM, TAG_DIM, N_LAYERS, 
                   BIDIRECT, DROPOUT, TEXT_PAD_IDX).to(device)

tagger.apply(init_weights)
print(f'The model has {count_parameters(tagger):,} trainable parameters')

The model has 1,522,370 trainable parameters


In [17]:
# Initialize Embeddings with Pre-Trained Vectors
print(TEXT.vocab.vectors.size())
print(tagger.emb.weight.size())

tagger.emb.weight.data.copy_(TEXT.vocab.vectors)

TEXT_UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
tagger.emb.weight.data[TEXT_UNK_IDX].zero_()
tagger.emb.weight.data[TEXT_PAD_IDX].zero_()

print(tagger.emb.weight[:5, :8])

torch.Size([8866, 100])
torch.Size([8866, 100])
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344],
        [-0.3398,  0.2094,  0.4635, -0.6479, -0.3838,  0.0380,  0.1713,  0.1598],
        [-0.1077,  0.1105,  0.5981, -0.5436,  0.6740,  0.1066,  0.0389,  0.3548]],
       device='cuda:3', grad_fn=<SliceBackward>)


In [18]:
optimizer = optim.AdamW(tagger.parameters())

In [19]:
def train_epoch(tagger, iterator, optimizer):
    tagger.train()
    epoch_loss = 0
    epoch_acc = 0
    for batch in iterator:
        # Forward pass & Calculate loss
        text, text_lens = batch.text
        tags, tags_lens = batch.udtags
        losses = tagger(text, text_lens, tags)
        loss = losses.mean()

        # Backward propagation
        optimizer.zero_grad()
        loss.backward()
        # Update weights
        optimizer.step()
        # Accumulate loss and acc
        epoch_loss += loss.item()

        best_paths = tagger.decode(text, text_lens)
        best_paths = torch.tensor([path + [TAG_PAD_IDX]*(tags.size(0)-len(path)) for path in best_paths], device=device).T
        non_padding = (tags != TAG_PAD_IDX)
        epoch_acc += (best_paths == tags)[non_padding].sum().item() / non_padding.sum().item()
    return epoch_loss/len(iterator), epoch_acc/len(iterator)

def eval_epoch(tagger, iterator):
    tagger.eval()
    epoch_loss = 0
    epoch_acc = 0
    with torch.no_grad():
        for batch in iterator:
            # Forward pass & Calculate loss
            text, text_lens = batch.text
            tags, tags_lens = batch.udtags
            losses = tagger(text, text_lens, tags)
            loss = losses.mean()
            
            # Accumulate loss and acc
            epoch_loss += loss.item()

            best_paths = tagger.decode(text, text_lens)
            best_paths = torch.tensor([path + [TAG_PAD_IDX]*(tags.size(0)-len(path)) for path in best_paths], device=device).T
            non_padding = (tags != TAG_PAD_IDX)
            epoch_acc += (best_paths == tags)[non_padding].sum().item() / non_padding.sum().item()
    return epoch_loss/len(iterator), epoch_acc/len(iterator)

In [20]:
import time
N_EPOCHS = 10
best_valid_loss = np.inf

for epoch in range(N_EPOCHS):
    t0 = time.time()
    train_loss, train_acc = train_epoch(tagger, train_iterator, optimizer)
    valid_loss, valid_acc = eval_epoch(tagger, valid_iterator)
    epoch_secs = time.time() - t0

    epoch_mins, epoch_secs = int(epoch_secs // 60), int(epoch_secs % 60)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(tagger.state_dict(), 'models/tut4-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 30s
	Train Loss: 21.066 | Train Acc: 61.07%
	 Val. Loss: 6.924 |  Val. Acc: 81.24%
Epoch: 02 | Epoch Time: 0m 30s
	Train Loss: 7.608 | Train Acc: 85.64%
	 Val. Loss: 4.829 |  Val. Acc: 85.19%
Epoch: 03 | Epoch Time: 0m 29s
	Train Loss: 5.519 | Train Acc: 89.43%
	 Val. Loss: 4.235 |  Val. Acc: 86.47%
Epoch: 04 | Epoch Time: 0m 30s
	Train Loss: 4.522 | Train Acc: 91.25%
	 Val. Loss: 3.815 |  Val. Acc: 87.46%
Epoch: 05 | Epoch Time: 0m 31s
	Train Loss: 3.948 | Train Acc: 92.25%
	 Val. Loss: 3.659 |  Val. Acc: 88.04%
Epoch: 06 | Epoch Time: 0m 29s
	Train Loss: 3.508 | Train Acc: 93.13%
	 Val. Loss: 3.483 |  Val. Acc: 88.44%
Epoch: 07 | Epoch Time: 0m 31s
	Train Loss: 3.200 | Train Acc: 93.67%
	 Val. Loss: 3.346 |  Val. Acc: 88.53%
Epoch: 08 | Epoch Time: 0m 30s
	Train Loss: 2.973 | Train Acc: 94.22%
	 Val. Loss: 3.256 |  Val. Acc: 88.89%
Epoch: 09 | Epoch Time: 0m 29s
	Train Loss: 2.730 | Train Acc: 94.59%
	 Val. Loss: 3.314 |  Val. Acc: 88.86%
Epoch: 10 | Epoch 

In [21]:
tagger.load_state_dict(torch.load('models/tut4-model.pt'))

valid_loss, valid_acc = eval_epoch(tagger, valid_iterator)
test_loss, test_acc = eval_epoch(tagger, test_iterator)

print(f'Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Val. Loss: 3.176 | Val. Acc: 89.17%
Test Loss: 3.305 | Test Acc: 89.02%


## Check Embeddings
* The Embeddings of `<unk>` and `<pad>` tokens
    * Because the `padding_idx` has been passed to `nn.Embedding`, so the `<pad>` embedding will remain zeros throughout training.  
    * While the `<unk>` embedding will be learned.

In [22]:
print(tagger.emb.weight[:5, :8])

tensor([[-0.1086,  0.1216,  0.0154,  0.0478,  0.0463,  0.1025,  0.1454,  0.1421],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-0.2705, -0.3767,  0.7898, -0.4855,  0.1201,  0.2157, -0.5363,  0.5170],
        [-0.5208,  0.2889,  0.5888, -0.7453, -0.4674, -0.0257,  0.3090,  0.2438],
        [-0.2714,  0.1839,  0.7546, -0.6773,  0.4255, -0.0097,  0.2300,  0.4293]],
       device='cuda:3', grad_fn=<SliceBackward>)
