# LSTM Tagger

## Sequence Labeling With LSTMs

For sequence labeling, the goal is to predict a label for every token, e.g. `Barack Obama went to DC` -> `B-PER I-PER O O B-LOC`. Two things change relative to sentence classification:

1. **Input**: for sentence classification, we used a single representation of the entire sentence (from a conv net or an LSTM) as the input to the classifier. For sequence labeling, we use the LSTM hidden state of each token.

2. **Loss**: the loss is the sum of the individual token losses (each one a cross-entropy loss). In effect, we perform n classifications per sentence (n = sentence length) instead of a single one.
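
To make the loss concrete, here is a minimal framework-free sketch that sums per-token cross-entropy losses over one sentence. The three-class label set, the probabilities, and the helper names are made up for illustration. Note also that PyTorch's `CrossEntropyLoss` averages over tokens by default rather than summing.

```python
import math

def token_cross_entropy(probs, gold):
    """Negative log-probability assigned to the gold class of one token."""
    return -math.log(probs[gold])

def sentence_loss(token_probs, gold_labels):
    """Sum of the per-token cross-entropy losses over one sentence."""
    return sum(token_cross_entropy(p, y) for p, y in zip(token_probs, gold_labels))

# Hypothetical 3-class tagger (0 = O, 1 = B-PER, 2 = I-PER) on three tokens.
token_probs = [
    [0.10, 0.80, 0.10],  # "Barack"
    [0.10, 0.20, 0.70],  # "Obama"
    [0.90, 0.05, 0.05],  # "went"
]
gold_labels = [1, 2, 0]

loss = sentence_loss(token_probs, gold_labels)
print(f"sentence loss = {loss:.4f}")
```

Each token contributes one classification loss, so a sentence of length n contributes n terms to the sum.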

We will use the ADE (adverse drug event) dataset from assignment 3, where it was introduced. We have two CoNLL-format files: review_train and review_valid. A CoNLL file is a plain text file in which each line (except lines starting with `#`, which are comments) contains a token and its tag (class label), separated by whitespace.

For example, here are two sentences from the train file:

```
# lexapro.Post118.Sentence2
Lexapro	O
keeps	O
me	O
out	O
of	O
deep	B-SSI
depression	I-SSI
.	O
# zoloft.Post150.Sentence3
Works	O
for	O
my	O
form	O
of	O
depression	B-SSI
,	O
however	O
it	O
has	O
destroyed	B-AE
my	I-AE
sleeping	I-AE
patterns	I-AE
.	O
# lexapro.Post175.Sentence6
I	O
started	O
off	O
with	O
```
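
As a rough illustration of this format, the sketch below parses CoNLL-style text into sentences. It is not the reader used later in this notebook: `parse_conll` is a hypothetical helper, and it assumes that a new sentence starts at every `#` comment line or blank line. The sample text is an excerpt of the sentences shown above.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of (sentence_id, tokens, tags)."""
    sentences = []
    sent_id, tokens, tags = None, [], []

    def flush():
        # Store the sentence collected so far, if any, and start a new one.
        if tokens:
            sentences.append((sent_id, list(tokens), list(tags)))
            tokens.clear()
            tags.clear()

    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):      # comment line carrying the sentence id
            flush()
            sent_id = line[1:].strip()
        elif not line:                # blank line also ends a sentence
            flush()
        else:                         # "token<whitespace>tag"
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    flush()
    return sentences

sample = (
    "# lexapro.Post118.Sentence2\n"
    "Lexapro\tO\nkeeps\tO\nme\tO\nout\tO\nof\tO\n"
    "deep\tB-SSI\ndepression\tI-SSI\n.\tO\n"
    "# zoloft.Post150.Sentence3\n"
    "Works\tO\nfor\tO\nmy\tO\n"
)
parsed = parse_conll(sample)
for sid, toks, labs in parsed:
    print(sid, list(zip(toks, labs)))
```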

Let's start by downloading the data.

In [None]:
import os
if not os.path.exists("review_data"):
  !wget https://www.dropbox.com/s/yqgff7de73iwosr/review_data.zip?dl=1 -O review_data.zip
  !unzip review_data.zip
  !ls review_data


Next, we define the model: an LSTM-based tagger. We use the following notation:

1. B = batch size
2. T = time dimension (sequence length, i.e. number of words)
3. e = embedding dimension (300 for word2vec)
4. h = rnn_hidden_dim * 2 if bidirectional, else rnn_hidden_dim

The output of this network will be a (B x T x num_classes) tensor.

What does the input look like? As before, each sentence is a sequence of token ids from the vocabulary (see the first parts of the [previous colab tutorial](https://bit.ly/lhs712w23_feb13)). The data then flows through the network as follows:

1. input = a batch of sentences, a (B x T) tensor.
2. embedding layer (input) -> (B x T x e) tensor.
3. LSTM(embedding output) -> (B x T x h) tensor. Recall that h already includes the factor of 2 for a bidirectional LSTM.
4. Now we want a linear layer that takes the representation of each word and outputs a score for each class. To achieve this, we `reshape` the (B x T x h) tensor to a (B*T x h) tensor.
5. A dense layer (h x num_classes) on top of the reshaped tensor outputs a tensor of dimension (B*T x num_classes).
6. Finally, we reshape that output back into (B x T x num_classes); after a softmax over the last dimension, this gives us the per-token class probabilities we are looking for.
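
Steps 4 to 6 can be traced with plain nested lists, which makes the shapes easy to check without any framework. This is only a shape walk-through under made-up sizes and weights; `linear` is a hypothetical stand-in for the dense layer.

```python
def linear(vec, weight, bias):
    """Dense layer on one vector: weight is (num_classes x h), bias is (num_classes)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

B, T, h, num_classes = 2, 3, 4, 5

# Stand-in for the LSTM output: a B x T x h nested list with arbitrary values.
hidden = [[[float(b + t + k) for k in range(h)] for t in range(T)] for b in range(B)]
weight = [[0.1 * (c + 1)] * h for c in range(num_classes)]
bias = [0.0] * num_classes

# Step 4: flatten (B x T x h) -> (B*T x h)
flat = [hidden[b][t] for b in range(B) for t in range(T)]
# Step 5: dense layer -> (B*T x num_classes)
flat_scores = [linear(v, weight, bias) for v in flat]
# Step 6: reshape back -> (B x T x num_classes)
scores = [flat_scores[b * T:(b + 1) * T] for b in range(B)]

print(len(flat), len(flat[0]))                       # B*T, h
print(len(scores), len(scores[0]), len(scores[0][0]))  # B, T, num_classes
```

Because the dense layer only acts on the last dimension, applying it token by token and applying it to the flattened tensor give the same scores.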



In [None]:
import torch.nn as nn
from torch.nn.functional import softmax
import torch
torch.manual_seed(42)  # fix the random seed for reproducibility

class LSTMTagger(nn.Module):
    def __init__(self, embeddings, num_classes, embed_dim, rnn_hidden_dim, rnn_layers=1, bidirectional=False, dropout=0.5):
        super().__init__()
        self.embeddings = embeddings
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(embed_dim,
                                 rnn_hidden_dim,
                                 rnn_layers,
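                                 # nn.LSTM only applies dropout between stacked layers and warns if rnn_layers == 1
                                 dropout=dropout if rnn_layers > 1 else 0.0,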
                                 bidirectional=bidirectional,
                                 batch_first=True)
        nn.init.orthogonal_(self.rnn.weight_hh_l0)
        nn.init.orthogonal_(self.rnn.weight_ih_l0)
        self.linear_in = rnn_hidden_dim if not bidirectional else rnn_hidden_dim * 2
        self.num_classes = num_classes
        self.dense = nn.Linear(self.linear_in, self.num_classes)
        
    def forward(self, inputs):
        # inputs = (token ids, B x T; true lengths, B), sorted by decreasing length
        token_ids, lengths = inputs
        embed = self.dropout(self.embeddings(token_ids))
        packed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.tolist(),
                                                         batch_first=True)
        h_is, (h_n, c_n) = self.rnn(packed)
        h_is, _ = torch.nn.utils.rnn.pad_packed_sequence(h_is, batch_first=True) # h_is is now (B x l x h) where l = max(sentence lengths in the batch)
        linear = self.dense(h_is)
        # nn.Linear acts on the last dimension, so we can apply the classifier directly
        # to the B x l x h tensor. Equivalently, we could flatten it to a
        # (B * l) x h tensor, apply the classifier, and reshape back:
        # h_is_reshaped = h_is.reshape((h_is.shape[0] * h_is.shape[1]), -1)
        # linear = self.dense(h_is_reshaped)
        # linear = linear.reshape(h_is.shape[0], h_is.shape[1], self.num_classes)
        # NOTE: this returns probabilities. nn.CrossEntropyLoss expects raw scores and
        # applies log-softmax itself, so the loss used later effectively applies softmax twice.
        return softmax(linear, dim=-1)

    def predict(self, inputs):
        # inputs: B x T padded token ids. Nothing is packed here, so positions past a
        # sentence's true length also get predictions; callers must truncate them.
        h_is, (h_n, c_n) = self.rnn(self.embeddings(inputs))
        return self.dense(h_is).argmax(-1)

Before we train this LSTM model, we need to make sure we process the data correctly. A few things to notice:
1. We read a CoNLL file instead of a plain text one.
2. The data is **right padded**; in other words, the `<PAD>` token is added at the end of a sentence to pad it to the max length.
3. We have 6 labels, including the padding label: `PAD`, `B-AE`, `I-AE`, `B-SSI`, `I-SSI`, `O`.
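
Right padding can be sketched in a few lines. `right_pad` is a hypothetical helper, the label ids are illustrative, and this version pads to the longest sequence passed in. Padding id 0 matches the convention that the padding label gets index 0.

```python
PAD_ID = 0  # index reserved for padding

def right_pad(sequences, pad_id=PAD_ID):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    lengths = [len(s) for s in sequences]  # true lengths, needed later for packing
    return padded, lengths

# Two label-id sequences of different lengths (ids are made up).
label_ids = [[3, 3, 4, 5, 3], [1, 2, 3]]
padded, lengths = right_pad(label_ids)
print(padded)
print(lengths)
```

Keeping the true lengths next to the padded ids is what later lets the model ignore the padded positions.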

## Data Loader
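
The data feed below addresses batches by step index and computes the number of steps with a ceiling division. A minimal sketch of that arithmetic follows; `num_steps` and `batch_slice` are hypothetical helpers and the example sizes are made up.

```python
def num_steps(n_examples, batch_size):
    """Number of batches needed to cover all examples (ceiling division)."""
    return (n_examples + batch_size - 1) // batch_size

def batch_slice(examples, step, batch_size):
    """Examples that belong to batch number `step`; the last batch may be shorter."""
    start = step * batch_size
    return examples[start:start + batch_size]

examples = list(range(50))  # stand-ins for 50 vectorized sentences
steps = num_steps(len(examples), 16)
batches = [batch_slice(examples, s, 16) for s in range(steps)]
print(steps, [len(b) for b in batches])
```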

In [None]:
from collections import Counter
import codecs
import re
import numpy as np
import random


class DictExamples(object):
    """This object holds a list of dictionaries, and knows how to shuffle, sort and batch them
    """
    def __init__(self, example_list, do_shuffle=True, sort_key=None):
        """Constructor

        :param example_list:  A list of examples
        :param do_shuffle: (``bool``) Shuffle the data? Defaults to `True`
        :param sort_key: (``str``) If given, sort the examples by this field. Defaults to `None`
        """
        self.example_list = example_list
        if do_shuffle:
            random.shuffle(self.example_list)
        if sort_key is not None:
            self.example_list = sorted(self.example_list, key=lambda x: x[sort_key])
        self.sort_key = sort_key

    def __getitem__(self, i):
        """Get a single example

        :param i: (``int``) simple index
        :return: an example
        """
        return self.example_list[i]

    def __len__(self):
        """Number of examples

        :return: (``int``) length of data
        """
        return len(self.example_list)

    def batch(self, start, batchsz):
        """Get a batch of data
        :param start: (``int``) The step index
        :param batchsz: (``int``) The batch size
        :return batched dictionary
        """
        ex = self.example_list[start]
        keys = ex.keys()
        batch = {}

        for k in keys:
            batch[k] = []
        sz = len(self.example_list)
        idx = start * batchsz
        for i in range(batchsz):
            if idx >= sz:
                break
            ex = self.example_list[idx]
            for k in keys:
                batch[k].append(ex[k])
            idx += 1

        for k in keys:
            batch[k] = np.stack(batch[k])
        return batch


class ExampleDataFeed(object):

    """Abstract base class that works on a list of examples
    This doesn't use any torch abstraction, but you could replace this 
    class with a torch DataLoader if you wanted.
    """
    def __init__(self, examples, batchsz, **kwargs):
        """Constructor from a list of examples

        Use the examples requested to provide data.  Options for batching and shuffling are supported,
        along with some optional processing function pointers

        :param examples: A list of examples
        :param batchsz: Batch size per step
        :param kwargs: See below

        :Keyword Arguments:
            * *shuffle* -- Shuffle the data per epoch? Defaults to `False`
        """
        self.examples = examples
        self.batchsz = batchsz
        self.shuffle = bool(kwargs.get('shuffle', False))
        self.steps = (len(self.examples) + self.batchsz - 1) // self.batchsz

    def _batch(self, i):
        """
        Get a batch of data at step `i`
        :param i: (``int``) step index
        :return: A dictionary mapping each field name to a stacked batch array
        """
        batch = self.examples.batch(i, self.batchsz)
        return batch

    def __getitem__(self, i):
        return self._batch(i)

    def __iter__(self):
        shuffle = np.random.permutation(np.arange(self.steps)) if self.shuffle else np.arange(self.steps)
        for i in range(self.steps):
            si = shuffle[i]
            yield self._batch(si)

    def __len__(self):
        return self.steps


class DictVectorizer:
    def __init__(self, field, transform_fn=None, **kwargs):
        # Parenthesize the identity lambda; otherwise the conditional becomes part of the lambda body.
        self.transform_fn = (lambda x: x) if transform_fn is None else transform_fn
        self.field = field
        self.mxlen = kwargs.get('mxlen', -1)
        self.max_seen = 0

    def iterable(self, tokens):
        for tok in tokens:
            yield self.transform_fn(tok[self.field])

    def _next_element(self, tokens, vocab):
        for atom in self.iterable(tokens):
            yield atom, vocab[atom]

    def count(self, tokens):
        seen = 0
        counter = Counter()
        for tok in self.iterable(tokens):
            counter[tok] += 1
            seen += 1
        self.max_seen = max(self.max_seen, seen)
        return counter

    def run(self, tokens, vocab):
        if self.mxlen < 0:
            self.mxlen = self.max_seen
        vec1d = np.zeros(self.mxlen, dtype=int)
        tok1d = ['<PAD>'] * self.mxlen
        i = 0
        for i, (token, token_index) in enumerate(self._next_element(tokens, vocab)):
            if i == self.mxlen:
                i -= 1
                break
            vec1d[i] = token_index
            tok1d[i] = token
        valid_length = i + 1
        return tok1d, vec1d, valid_length

    def get_length(self):
        return self.mxlen


class CoNLLSeqReader(object):

    def __init__(self, train_file, valid_file, test_file, mxlen=-1):
        self.train_file = train_file
        self.valid_file = valid_file
        self.test_file = test_file
        self.text_vectorizer = DictVectorizer(field='text', mxlen=mxlen)
        self.label_vectorizer = DictVectorizer(field='y', mxlen=mxlen)
        self.vocab = Counter()
        self.label2index = {"PAD": 0}
        self.build_vocab([train_file, valid_file, test_file])

    def build_vocab(self, files):
        labels = Counter()
        for file in files:
            if file is None:
                continue
            examples = self.read_examples(file)
            for example in examples:
                labels.update(self.label_vectorizer.count(example))
                self.vocab.update(self.text_vectorizer.count(example))
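        for i, k in enumerate(labels.keys()):
          self.label2index[k] = i + 1
        # self.vocab holds raw counts at this point, but the vectorizer and the embedding
        # layer need token -> row index. Assign contiguous ids, reserving 0 for padding
        # (so len(self.vocab) becomes the number of unique tokens plus one).
        token2index = {"<PAD>": 0}
        for token in self.vocab:
            if token not in token2index:
                token2index[token] = len(token2index)
        # Kept as a Counter so that unseen tokens still look up as 0 without being inserted.
        self.vocab = Counter(token2index)
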
    @staticmethod
    def read_examples(tsfile: str):
        tokens = []
        examples = []
        sentence_id = None
        with codecs.open(tsfile, encoding='utf-8', mode='r') as f:
            for i, line in enumerate(f):
                if line.startswith('#'):  # The following lines will have this sentence id
                    sentence_id = line.strip().split()[1]
                    continue
                splits = re.split("\\s+", line.strip())
                if len(splits) == 1:
                    if len(tokens) > 0:
                        examples.append(tokens)
                        tokens = []
                    continue
                assert sentence_id is not None, "Sentence id is not set"
                token = {"text": splits[0], "y": splits[1], "sentence_id": sentence_id}
                tokens.append(token)
            if len(tokens) > 0:
                examples.append(tokens)
        return examples

    def load(self, filename, batchsz, shuffle=False):
        ts = []
        texts = self.read_examples(filename)
        sort_key = "text_lengths"
        for i, example_tokens in enumerate(texts):
            example = {}
            example["token"], example["text"], example["text_lengths"] = self.text_vectorizer.run(example_tokens, self.vocab)
            example["label"], example['y'], example["y_lengths"] = self.label_vectorizer.run(example_tokens, self.label2index)
            example['ids'] = example_tokens[0]['sentence_id']
            ts.append(example)
        examples = DictExamples(ts, do_shuffle=shuffle, sort_key=sort_key)
        return ExampleDataFeed(examples, batchsz=batchsz, shuffle=shuffle)

## Embedding Layer
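
The reader below parses the word2vec binary format: a text header with the vocabulary size and dimension, followed by one record per word made of the word, a space, and `dim` 4-byte floats. The sketch writes a tiny file in memory and reads back only the words in a vocabulary. `write_word2vec` and `read_word2vec` are hypothetical helpers, and little-endian float32 values are an assumption.

```python
import io
import struct

def write_word2vec(vectors):
    """Serialize {word: [floats]} in the word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    buf = io.BytesIO()
    buf.write(f"{len(vectors)} {dim}\n".encode("utf-8"))  # header: "<vocab> <dim>\n"
    for word, vec in vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dim}f", *vec))           # dim little-endian float32
        buf.write(b"\n")
    return buf.getvalue()

def read_word2vec(data, vocab):
    """Return {word: vector} for the words in `vocab` that appear in the file."""
    f = io.BytesIO(data)
    n_words, dim = map(int, f.readline().split())
    found = {}
    for _ in range(n_words):
        chars = bytearray()
        ch = f.read(1)
        while ch and ch != b" ":       # the word ends at the first space
            chars.extend(ch)
            ch = f.read(1)
        word = chars.decode("utf-8").strip(" \n")
        raw = f.read(4 * dim)          # always consume the vector, even if unused
        if word in vocab:
            found[word] = list(struct.unpack(f"<{dim}f", raw))
    return found

data = write_word2vec({"sleep": [0.5, -1.0], "zoloft": [2.0, 0.25]})
found = read_word2vec(data, vocab={"zoloft", "depression"})
print(found)
```

Words that are in the vocabulary but not in the file ("depression" here) simply keep whatever initial vector they were given.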

In [None]:
import io
import torch.nn as nn
import numpy as np

def init_embeddings(vocab_size, embed_dim, unif):
    return np.random.uniform(-unif, unif, (vocab_size, embed_dim))
    

class EmbeddingsReader:
    @staticmethod
    def load(filename, vocab, unif=0.25):
        def read_word(f):
            s = bytearray()
            ch = f.read(1)
            while ch != b' ':
                s.extend(ch)
                ch = f.read(1)
            s = s.decode('utf-8')
            # Only strip out normal space and \n not other spaces which are words.
            return s.strip(' \n')

        vocab_size = len(vocab)
        with io.open(filename, "rb") as f:
            header = f.readline()
            file_vocab_size, embed_dim = map(int, header.split())
            weight = init_embeddings(len(vocab), embed_dim, unif)
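            # Zero the padding row if the vocab defines one (this notebook pads with '<PAD>').
            for pad_token in ('<PAD>', '[PAD]'):
                if pad_token in vocab:
                    weight[vocab[pad_token]] = 0.0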
            width = 4 * embed_dim
            for i in range(file_vocab_size):
                word = read_word(f)
                raw = f.read(width)
                if word in vocab:
                    vec = np.frombuffer(raw, dtype=np.float32)
                    weight[vocab[word]] = vec
        embeddings = nn.Embedding(weight.shape[0], weight.shape[1])
        embeddings.weight = nn.Parameter(torch.from_numpy(weight).float())
        return embeddings, embed_dim

## Evaluation Metrics
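
The confusion matrix class below stores true labels along rows and predictions along columns, so recall divides the diagonal by row sums and precision divides it by column sums. Here is a small sketch of those computations on a made-up 3-class matrix; `per_class_metrics` is a hypothetical helper, not part of the class.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of tokens with true class i that were predicted as class j."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[i][c] for i in range(n))  # column sum: everything predicted as c
        actual = sum(cm[c])                          # row sum: everything that truly is c
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        metrics.append((p, r, f1))
    return metrics

cm = [
    [8, 2, 0],
    [1, 3, 1],
    [0, 0, 5],
]
metrics = per_class_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
for c, (p, r, f1) in enumerate(metrics):
    print(f"class {c}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones.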

In [None]:
import csv
from collections import OrderedDict

class ConfusionMatrix:
    """Confusion matrix with metrics

    This class accumulates classification output, and tracks it in a confusion matrix.
    Metrics are available that use the confusion matrix
    """
    def __init__(self, labels):
        """Constructor with input labels

        :param labels: Either a dictionary (`k=int,v=str`) or an array of labels
        """
        if type(labels) is dict:
            self.labels = []
            for i in range(len(labels)):
                self.labels.append(labels[i])
        else:
            self.labels = labels
        self.labels = [str(x) for x in self.labels]
        nc = len(self.labels)
        self._cm = np.zeros((nc, nc), dtype=int)

    def add(self, truth, guess):
        """Add a single value to the confusion matrix based off `truth` and `guess`

        :param truth: The real `y` value (or ground truth label)
        :param guess: The guess for `y` value (or assertion)
        """

        self._cm[truth, guess] += 1

    def __str__(self):
        values = []
        width = max(8, max(len(x) for x in self.labels) + 1)
        for i, label in enumerate([''] + self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
        values += ['\n']
        for i, label in enumerate(self.labels):
            values += ["{:>{width}}".format(label, width=width+1)]
            for j in range(len(self.labels)):
                values += ["{:{width}d}".format(self._cm[i, j], width=width + 1)]
            values += ['\n']
        values += ['\n']
        return ''.join(values)

    def save(self, outfile):
        ordered_fieldnames = OrderedDict([("labels", None)] + [(l, None) for l in self.labels])
        with open(outfile, 'w') as f:
            dw = csv.DictWriter(f, delimiter=',', fieldnames=ordered_fieldnames)
            dw.writeheader()
            for index, row in enumerate(self._cm):
                row_dict = {l: row[i] for i, l in enumerate(self.labels)}
                row_dict.update({"labels": self.labels[index]})
                dw.writerow(row_dict)

    def reset(self):
        """Reset the matrix
        """
        self._cm *= 0

    def get_correct(self):
        """Get the diagonals of the confusion matrix

        :return: (``int``) Number of correct classifications
        """
        return self._cm.diagonal().sum()

    def get_total(self):
        """Get total classifications

        :return: (``int``) total classifications
        """
        return self._cm.sum()

    def get_acc(self):
        """Get the accuracy

        :return: (``float``) accuracy
        """
        return float(self.get_correct())/self.get_total()

    def get_recall(self):
        """Get the recall

        :return: (``float``) recall
        """
        total = np.sum(self._cm, axis=1)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_support(self):
        return np.sum(self._cm, axis=1)

    def get_precision(self):
        """Get the precision
        :return: (``float``) precision
        """

        total = np.sum(self._cm, axis=0)
        total = (total == 0) + total
        return np.diag(self._cm) / total.astype(float)

    def get_mean_precision(self):
        """Get the mean precision across labels

        :return: (``float``) mean precision
        """
        return np.mean(self.get_precision())

    def get_weighted_precision(self):
        return np.sum(self.get_precision() * self.get_support())/float(self.get_total())

    def get_mean_recall(self):
        """Get the mean recall across labels

        :return: (``float``) mean recall
        """
        return np.mean(self.get_recall())

    def get_weighted_recall(self):
        return np.sum(self.get_recall() * self.get_support())/float(self.get_total())

    def get_weighted_f(self, beta=1):
        return np.sum(self.get_class_f(beta) * self.get_support())/float(self.get_total())

    def get_macro_f(self, beta=1):
        """Get the macro F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) macro F_b
        """
        if beta < 0:
            raise Exception('Beta must be greater than 0')
        return np.mean(self.get_class_f(beta))

    def get_class_f(self, beta=1):
        p = self.get_precision()
        r = self.get_recall()

        b = beta*beta
        d = (b * p + r)
        d = (d == 0) + d

        return (b + 1) * p * r / d

    def get_f(self, beta=1):
        """Get 2 class F_b, with adjustable beta (defaulting to F1)

        :param beta: (``float``) defaults to 1 (F1)
        :return: (``float``) 2-class F_b
        """
        p = self.get_precision()[1]
        r = self.get_recall()[1]
        if beta < 0:
            raise Exception('Beta must be greater than 0')
        d = (beta*beta * p + r)
        if d == 0:
            return 0
        return (beta*beta + 1) * p * r / d

    def get_all_metrics(self):
        """Make a map of metrics suitable for reporting, keyed by metric name

        :return: (``dict``) Map of metrics keyed by metric names
        """
        metrics = {'acc': self.get_acc()}
        # If 2 class, assume second class is positive AKA 1
        if len(self.labels) == 2:
            metrics['precision'] = self.get_precision()[1]
            metrics['recall'] = self.get_recall()[1]
            metrics['f1'] = self.get_f(1)
        else:
            metrics['mean_precision'] = self.get_mean_precision()
            metrics['mean_recall'] = self.get_mean_recall()
            metrics['macro_f1'] = self.get_macro_f(1)
            metrics['weighted_precision'] = self.get_weighted_precision()
            metrics['weighted_recall'] = self.get_weighted_recall()
            metrics['weighted_f1'] = self.get_weighted_f(1)
        return metrics

    def add_batch(self, truth, guess):
        """Add a batch of data to the confusion matrix

        :param truth: The truth tensor
        :param guess: The guess tensor
        :return:
        """
        for truth_i, guess_i in zip(truth, guess):
            self.add(truth_i, guess_i)

To see how the data loader works, let's read the ADE files and print the label map and the vocabulary size.

## Reading Data

In [None]:
train_file, valid_file, test_file = ["review_data/review_train.conll", "review_data/review_valid.conll",
                                     "review_data/review_test.conll"]
MAXLEN = -1
ade_reader = CoNLLSeqReader(train_file, valid_file, test_file, mxlen=MAXLEN)
batchsz = 16
ade_train_dl = ade_reader.load(filename=train_file, batchsz=batchsz, shuffle=False)
ade_dev_dl = ade_reader.load(filename=valid_file, batchsz=batchsz, shuffle=False)
# The test loader must read the test file, not the validation file.
ade_test_dl = ade_reader.load(filename=test_file, batchsz=batchsz, shuffle=False)
print(ade_reader.label2index)
print(f"{len(ade_reader.vocab)} tokens in vocab")
# for item in ade_test_dl:
#   if len(set(item['y_lengths'])) > 1:
#     print(item)
#     break

{'PAD': 0, 'B-AE': 1, 'I-AE': 2, 'O': 3, 'B-SSI': 4, 'I-SSI': 5}
9837 tokens in vocab


Now we will have to change the trainer and evaluator a little bit as well.
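
One of those changes is that `pack_padded_sequence` by default expects the batch sorted by decreasing length, so the inputs are sorted and the same permutation is applied to the labels, which are then truncated to the longest sentence in the batch. A plain-Python sketch of that bookkeeping, with a hypothetical helper name and made-up ids:

```python
def sort_batch_by_length(token_ids, labels, lengths):
    """Sort a padded batch by decreasing true length, keeping labels aligned."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    max_len = lengths[order[0]]
    x_sorted = [token_ids[i] for i in order]
    # Labels are cut to the longest sentence so they match the unpacked LSTM output.
    y_sorted = [labels[i][:max_len] for i in order]
    len_sorted = [lengths[i] for i in order]
    return x_sorted, y_sorted, len_sorted, order

token_ids = [[5, 6, 0, 0], [7, 8, 9, 0], [4, 0, 0, 0]]
labels = [[3, 1, 0, 0], [3, 3, 4, 0], [3, 0, 0, 0]]
lengths = [2, 3, 1]

x_sorted, y_sorted, len_sorted, order = sort_batch_by_length(token_ids, labels, lengths)
print(order, len_sorted)
print(y_sorted)
```

Returning `order` makes it possible to map predictions back to the original sentence order if needed.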

## Trainer

In [None]:
import torch
from tqdm import tqdm

class Trainer:
    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optimizer = optimizer

    def run(self, model, labels, data_loader, loss_fn): 
        model.train()
        cm = ConfusionMatrix(labels)
        for batch in tqdm(data_loader):
            loss_value, y_pred, y_actual = self.update(model, loss_fn, batch)
            _, best = y_pred.max(1)
            yt = y_actual.cpu().int().numpy()
            yp = best.cpu().int().numpy()
            cm.add_batch(yt, yp)
        # print(cm)
        print(cm.get_all_metrics())
        print("training loss (last batch):", loss_value)
        print("-"*30)
        return cm
    
    def update(self, model, loss_fn, batch):
        self.optimizer.zero_grad()
        x, lengths, y = torch.LongTensor(batch["text"]), torch.LongTensor(batch["text_lengths"]), torch.LongTensor(batch["y"])
        
        lengths_sorted, perm_idx = lengths.sort(dim=0, descending=True)
        x_sorted = x[perm_idx]
        inputs = (x_sorted.to('cpu'), lengths_sorted)
        
        y_true_sorted = y[perm_idx]
        y_true_sorted = y_true_sorted[:, :max(lengths_sorted)]
        y_pred = model(inputs) # this is B x l x num_classes where l is the max len of the batch
        assert y_true_sorted.shape[0] == y_pred.shape[0] and y_true_sorted.shape[1] == y_pred.shape[1]
        y_true_sorted = y_true_sorted.reshape(y_true_sorted.shape[0]*y_true_sorted.shape[1]).to('cpu')
        y_pred = y_pred.reshape(y_pred.shape[0]*y_pred.shape[1], -1).to('cpu')
        loss_value = loss_fn(y_pred, y_true_sorted)
        loss_value.backward()
        self.optimizer.step()
        return loss_value.item(), y_pred, y_true_sorted

class Evaluator:
    def __init__(self):
        pass

    def run(self, model, labels, data_loader):
        model.eval()
        cm = ConfusionMatrix(labels)
        for batch in tqdm(data_loader):
            y_pred, y_true, _, _, _ = self.inference(model, batch)
            _, yp = y_pred.max(-1)
            yp = yp.flatten().cpu().int().numpy()
            yt = y_true.flatten().cpu().int().numpy()
            cm.add_batch(yt, yp)
        return cm

    def inference(self, model, batch):
        model.eval()
        with torch.no_grad():
            x, lengths, y, token, ids = torch.LongTensor(batch["text"]), \
            torch.LongTensor(batch["text_lengths"]), \
            torch.LongTensor(batch["y"]), \
            batch["token"], \
            batch["ids"]
            lengths_sorted, perm_idx = lengths.sort(0, descending=True)
            x_sorted = x[perm_idx]
            token_sorted = token[perm_idx]
            ids_sorted = ids[perm_idx]
            y_true_sorted = y[perm_idx]        
            y_true_sorted = y_true_sorted[:, :max(lengths_sorted)]
            inputs = (x_sorted.to('cpu'), lengths_sorted)
            y_pred = model(inputs)    
            return y_pred, y_true_sorted, lengths_sorted, token_sorted, ids_sorted


Finally, we tie everything together in a `fit` function.

In [None]:
from torch.nn import CrossEntropyLoss

def fit(model, labels, optimizer, loss_fn, epochs, train_data_loader, dev_data_loader, test_data_loader):
    trainer = Trainer(optimizer)
    evaluator = Evaluator()
    best_macro_f = 0.0
    test = True
    for epoch in range(epochs):
        print('EPOCH {}'.format(epoch + 1))
        print('=================================')
        print('Training Results')
        cm = trainer.run(model, labels, train_data_loader, loss_fn)
        print('Validation Results')
        cm = evaluator.run(model, labels, dev_data_loader)
        # print(cm)
        print(cm.get_all_metrics())
        if cm.get_macro_f() > best_macro_f:
            print('New best model {:.2f}'.format(cm.get_macro_f()))
            best_macro_f = cm.get_macro_f()
            torch.save(model.state_dict(), './checkpoint.pyt')
    if test:
        model.load_state_dict(torch.load('./checkpoint.pyt'))
        cm = evaluator.run(model, labels, test_data_loader)
        print('Final result')
        print(cm)
        print(cm.get_all_metrics())
    return cm.get_macro_f()

In [None]:
import torch.nn as nn
import torch
embed_dim = 300
emb_random = nn.Embedding(len(ade_reader.vocab), embed_dim)
import os
do_emb_word2vec = False
if do_emb_word2vec:
  if not os.path.exists("GoogleNews-vectors-negative300.bin"):
    # download the word2vec file and unzip
    !wget https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz?dl=1
    !mv 'GoogleNews-vectors-negative300.bin.gz?dl=1' GoogleNews-vectors-negative300.bin.gz
    print("file downloaded, unzipping...")
    !gunzip GoogleNews-vectors-negative300.bin.gz
    print("unzipped")


  PRETRAINED_EMBEDDINGS_FILE = 'GoogleNews-vectors-negative300.bin'

  # generate the embeddings for our dataset. The word2vec file has embeddings for
  # 3M words and phrases, but our embedding matrix will only hold the weights for the
  # tokens that are in our vocab (i.e., all unique tokens in train + dev + test data).

  emb_word2vec, embed_dim = EmbeddingsReader.load(PRETRAINED_EMBEDDINGS_FILE, 
                                                        ade_reader.vocab)

In [None]:
rnn_hidden_dim = 200
rnn_layers = 1
bidirectional = True
model  = LSTMTagger(embeddings=emb_random, 
                    num_classes=len(ade_reader.label2index.keys()), 
                    embed_dim=embed_dim, rnn_hidden_dim=rnn_hidden_dim, 
                    bidirectional=bidirectional,
                    rnn_layers=rnn_layers)
model.to('cpu')
loss_fn = torch.nn.CrossEntropyLoss().to('cpu')
learnable_params = [name for name, p in model.named_parameters() if p.requires_grad]
print(learnable_params)
learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(learnable_params, lr=10.0)
fit(model=model, labels=list(ade_reader.label2index.values()), optimizer=optimizer, loss_fn=loss_fn, epochs=50, 
    train_data_loader=ade_train_dl, dev_data_loader=ade_dev_dl, test_data_loader=ade_test_dl)

## Generate Output File

In [None]:
try:
    from baseline.utils import to_chunks
except ImportError:
    !pip install mead-baseline
    from baseline.utils import to_chunks

from tqdm import tqdm
batchsz = 16

def generate_labelseq(sentence_ids, all_sentence_tokens, all_sentence_labels, 
                   output_base):
  assert len(sentence_ids) == len(all_sentence_tokens) == len(all_sentence_labels)
  with open(f"{output_base}.labelseq", "w") as wf:
    wf.write("ID\tTAGSEQ\n")
    for sentence_id, sentence_tokens, sentence_labels in zip(sentence_ids, all_sentence_tokens, all_sentence_labels):
      assert len(sentence_tokens) == len(sentence_labels)
      wf.write(f'{sentence_id}\t{" ".join(sentence_labels)}\n')
  print(f"generated labelseq file {output_base}.labelseq")
  


def predict_tags_for_file(_file, model_state_dict, reader, evaluator, output_base, output_formats=("human_readable", "labelseq")):
  # Only the "labelseq" format is written below; `evaluator` is accepted but unused.
  file_data_loader = reader.load(filename=_file, batchsz=batchsz, shuffle=False)
  index2label = {v:k for k,v in reader.label2index.items()}
  model = LSTMTagger(embeddings=emb_random, 
                    num_classes=len(reader.label2index), 
                    embed_dim=embed_dim, rnn_hidden_dim=rnn_hidden_dim, 
                    bidirectional=bidirectional,
                    rnn_layers=rnn_layers)
  model.load_state_dict(torch.load(model_state_dict))
  model.cpu()
  model.eval()
  all_sentence_ids = []
  all_sentence_tokens = []
  all_sentence_pred_labels = []
  all_sentence_true_labels = []

  for batch in tqdm(file_data_loader):
    y_true_batch, sentence_lengths_batch, sentence_ids_batch, sentence_tokens_batch = batch["y"], batch["text_lengths"], batch["ids"], batch["token"]
    y_pred_batch = model.predict(torch.LongTensor(batch["text"]).cpu())
    for sentence_tokens, sentence_pred_labels, sentence_true_labels, sentence_length, sentence_id in \
        zip(sentence_tokens_batch, y_pred_batch, y_true_batch, sentence_lengths_batch, sentence_ids_batch):
      all_sentence_ids.append(sentence_id)
      sentence_tokens = sentence_tokens[:sentence_length]
      sentence_pred_labels = [index2label[k.item()] for k in sentence_pred_labels[:sentence_length]] 
      sentence_true_labels = [index2label[k.item()] for k in sentence_true_labels[:sentence_length]] 
      all_sentence_tokens.append(sentence_tokens)
      all_sentence_true_labels.append(sentence_true_labels)
      all_sentence_pred_labels.append(sentence_pred_labels)

  if "labelseq" in output_formats:
    generate_labelseq(
        all_sentence_ids, 
        all_sentence_tokens, 
        all_sentence_pred_labels, 
        output_base
    )

# finally, run this method on the test data and look at the generated labelseq file.
test_file="review_data/review_test.conll"
predict_tags_for_file(test_file, model_state_dict="./checkpoint.pyt", 
                      reader=ade_reader, evaluator=Evaluator(), 
                      output_base="test_output", output_formats=["labelseq"])


100%|██████████| 79/79 [00:08<00:00,  9.63it/s]

generated labelseq file test_output.labelseq
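
To inspect the generated file, it can help to turn each tag sequence back into labeled spans. The sketch below parses one labelseq line and extracts BIO spans. `bio_to_spans` is a hypothetical helper (not the `to_chunks` function imported above), it is lenient about an `I-` tag that appears without a preceding `B-`, and the example line reuses the gold tags of a training sentence rather than real model output.

```python
def bio_to_spans(tags):
    """Return (label, start, end) spans, end exclusive, from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside and label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans

# One line in the ID<TAB>TAGSEQ layout written by generate_labelseq.
line = "zoloft.Post150.Sentence3\tO O O O O B-SSI O O O O B-AE I-AE I-AE I-AE O"
sentence_id, tagseq = line.split("\t")
spans = bio_to_spans(tagseq.split(" "))
print(sentence_id, spans)
```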



