#### BERT-based Sequence Labeller

We have trained HMM (Viterbi) and RNN-based part-of-speech (POS) taggers on tagged sentences from the Penn Treebank sample distributed with NLTK. The HMM tagger reached a validation accuracy of about 90% and the RNN-based tagger about 91%. We will now try a different neural approach. Recall that the RNN used pretrained GloVe embeddings to represent the words in a sentence. A GloVe embedding is fixed: a word gets the same vector in every sentence, even when its sense or part of speech depends on the context. Contextualized representations address this ambiguity because the vector for a word depends on the sentence around it. Pretrained BERT models are well suited to this task because they produce contextualized word embeddings.

In this notebook, we will fine-tune a BERT-style model (the code below uses RoBERTa) on the POS tagging task. Since these models use subword tokenization and POS labels are assigned to whole words, we need a way to assign labels to the subword tokens. A simple approach is to give the first subword of each word that word's POS label and give every remaining subword a special tag 'X', which marks a continuation of the preceding POS label. For example, in BERT's WordPiece notation:

`(spokesman, NN)` --> {`(spokes, NN)`, `(##man, X)`}



In [8]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizerFast, RobertaModel, get_linear_schedule_with_warmup
from nltk.corpus import treebank
from tqdm import tqdm
import psutil

In [26]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

In [50]:
# `corpus` is the tagged corpus loaded with treebank.tagged_sents() in the dataset setup cell
test_sentence = [elem[0] for elem in corpus[0]]
labels = [elem[1] for elem in corpus[0]]
print(test_sentence)
print(labels)
encoding = tokenizer.encode_plus(test_sentence, is_split_into_words=True, return_offsets_mapping=False, padding=False, truncation=False, add_special_tokens=True)
subword_tokens = encoding.tokens()
# word index of each subword; drop the entries for the <s> and </s> special tokens
word_ids = encoding.word_ids()[1:-1]

print(encoding['input_ids'])
print(subword_tokens)
print(word_ids)


['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']
[0, 11300, 468, 4291, 225, 2156, 5659, 107, 793, 2156, 40, 1962, 5, 792, 25, 10, 47554, 3204, 19172, 736, 1442, 4, 1132, 479, 2]
['<s>', 'ĠPierre', 'ĠV', 'ink', 'en', 'Ġ,', 'Ġ61', 'Ġyears', 'Ġold', 'Ġ,', 'Ġwill', 'Ġjoin', 'Ġthe', 'Ġboard', 'Ġas', 'Ġa', 'Ġnonex', 'ec', 'utive', 'Ġdirector', 'ĠNov', '.', 'Ġ29', 'Ġ.', '</s>']
[0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 15, 16, 17]


In [51]:
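# -100 marks the <s> and </s> special tokens
labels_subword = [-100]
for i in range(len(word_ids)):
    # the first subword of a word takes the word's tag; i == 0 is checked explicitly
    # so that word_ids[i-1] never wraps around to the last element
    if i == 0 or word_ids[i] != word_ids[i-1]:
        labels_subword.append(labels[word_ids[i]])
    else:
        labels_subword.append('X')

labels_subword.append(-100)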
print(labels_subword)

[-100, 'NNP', 'NNP', 'X', 'X', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'X', 'X', 'NN', 'NNP', 'X', 'CD', '.', -100]
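
The labeling loop above can also be packaged as a small reusable helper. The following is a minimal sketch, not the notebook's own implementation: the function name `align_labels` and its parameters are hypothetical. It assumes the full `word_ids()` list is passed in, where the tokenizer reports `None` for special tokens, so no slicing is needed. Those positions get `-100` because that is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, continuation_tag="X", ignore_index=-100):
    """Map word-level labels onto subword positions (hypothetical helper).

    word_ids: one entry per subword token, giving the index of the word it
              belongs to, or None for a special token such as <s> or </s>.
    word_labels: one label per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # special token: excluded from the loss
            aligned.append(ignore_index)
        elif word_id != previous:
            # first subword of a word takes the word's label
            aligned.append(word_labels[word_id])
        else:
            # later subwords of the same word take the continuation tag
            aligned.append(continuation_tag)
        previous = word_id
    return aligned


# word_ids as printed above, with None added back for <s> and </s>
example_word_ids = [None, 0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                    13, 13, 13, 14, 15, 15, 16, 17, None]
example_labels = ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT',
                  'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']

aligned = align_labels(example_word_ids, example_labels)
print(aligned)
```

Tracking the previous word index in a variable, instead of indexing `word_ids[i-1]`, avoids the wrap-around to the last element at the first position.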


#### Set up the dataset.

In [17]:
# get the POS tagged corpus, 3914 tagged sentences
corpus = treebank.tagged_sents()
print("Number of sentences: ", len(corpus))
print(f"Longest sentence length: {max([len(s) for s in corpus])}")

        
        

Number of sentences:  3914
Longest sentence length: 271


In [None]:
class Treebank(Dataset):
    def __init__(self, corpus):
        # get the sentences and labels
        self.sentences = [[elem[0] for elem in s] for s in corpus]
        self.pos_labels = [[elem[1] for elem in s] for s in corpus]
        # define special tag
        self.continuation_tag = "X"
        # get tag set, including the continuation tag
        self.tags = sorted([self.continuation_tag] + list({elem[1] for s in corpus for elem in s}))
        # tag to idx mapping
        self.tag2idx = {tag: idx for idx, tag in enumerate(self.tags)}
        self.tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # get sentence and labels
        sentence = self.sentences[idx]
        labels = self.pos_labels[idx]
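        # tokenize the pre-split sentence
        input_encoding = self.tokenizer.encode_plus(sentence, is_split_into_words=True, return_offsets_mapping=False, padding=False, truncation=False, add_special_tokens=True)
        input_idx = input_encoding['input_ids']
        # word index of each subword, excluding the <s> and </s> special tokens
        word_ids = input_encoding.word_ids()[1:-1]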
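        # assign labels to subword tokens; -100 marks the <s> and </s> special tokens
        labels_subword = [-100]
        for i in range(len(word_ids)):
            # the first subword of a word takes the word's tag
            if i == 0 or word_ids[i] != word_ids[i-1]:
                labels_subword.append(self.tag2idx[labels[word_ids[i]]])
            else:
                labels_subword.append(self.tag2idx[self.continuation_tag])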

        labels_subword.append(-100)