<a href="https://colab.research.google.com/github/ronenbendavid/IDC_NLP/blob/master/Ronen_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3
Training a neural named entity recognition (NER) tagger 

In [None]:
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('version: {}, device: {}'.format(torch.__version__, device))

version: 1.5.0+cu101, device: cuda


In this assignment you are required to build a full training and testing pipeline for a neural sequential tagger for named entities, using an LSTM.

The dataset that you will be working on is called ReCoNLL 2003, which is a corrected version of the CoNLL 2003 dataset: https://www.clips.uantwerpen.be/conll2003/ner/

[Train data](https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing)

[Dev data](https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing)

[Test data](https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing)

As you can see, the annotated texts are labeled according to the IOB annotation scheme, for 3 entity types: Person, Organization, Location.
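
To make the IOB scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob_to_spans` and the sample sentence are invented for illustration; they are not part of the dataset or the assignment. A stray `I-` tag with no matching open span is treated leniently as the start of a new span, which is an assumption rather than a rule of the scheme.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type = None
    current_words = []
    for word, tag in zip(words, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if current_words:
            spans.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # "B-X" opens a span; a stray "I-X" is leniently treated the same way.
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []
    # Close a span that runs to the end of the sentence.
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hypothetical example sentence, not taken from ReCoNLL.
words = ["John", "Smith", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = iob_to_spans(words, tags)
print(spans)
```

The `B-` prefix matters when two entities of the same type are adjacent: without it, the two spans would merge into one.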

**Task 1:** Write a function for reading the data from a single file (one of those provided above). The function receives a filepath and encodes every sentence individually as a pair of lists: one list contains the words and the other contains the tags. Each pair is appended to a general list (data), which the function returns.

In [None]:
import requests
import re

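def read_data(filepath):
    data = []

    # Convert a Google Drive "view" link into a direct-download URL.
    result = re.match(r".*drive\.google\.com/file/d/([^/]*)/.*", filepath)
    if result:
        filepath = 'https://docs.google.com/uc?export=download&id={}'.format(result.group(1))
    print(filepath)

    response = requests.get(filepath)
    response.raise_for_status()  # fail early if the download did not succeed
    words = []
    tags = []

    for line in response.text.split('\n'):
        # Strip first so that whitespace-only lines and '\r' endings count as blank.
        fields = line.strip().split()
        if not fields:
            # A blank line marks the end of a sentence.
            if words:
                data.append((words, tags))
            words = []
            tags = []
        else:
            words.append(fields[0])
            tags.append(fields[1])

    # Flush the last sentence if the file does not end with a blank line.
    if words:
        data.append((words, tags))

    return data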

train = read_data('https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing')
dev = read_data('https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing')
test = read_data('https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing')


https://docs.google.com/uc?export=download&id=1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz
https://docs.google.com/uc?export=download&id=1EAF-VygYowU1XknZhvzMi2CID65I127L
https://docs.google.com/uc?export=download&id=16gug5wWnf06JdcBXQbcICOZGZypgr4Iu


The following Vocab class can serve as a dictionary that maps words and tags to IDs. The UNK_TOKEN should be used for words that are not part of the training data.

In [None]:
UNK_TOKEN = 0

class Vocab:
    def __init__(self):
        self.word2id = {"__unk__": UNK_TOKEN}
        self.id2word = {UNK_TOKEN: "__unk__"}
        self.n_words = 1
        
        self.tag2id = {"O":0, "B-PER":1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "B-ORG": 5, "I-ORG": 6}
        self.id2tag = {0:"O", 1:"B-PER", 2:"I-PER", 3:"B-LOC", 4:"I-LOC", 5:"B-ORG", 6:"I-ORG"}
        
    def index_words(self, words):
      word_indexes = [self.index_word(w) for w in words]
      return word_indexes

    def index_tags(self, tags):
      tag_indexes = [self.tag2id[t] for t in tags]
      return tag_indexes
    
    def index_word(self, w):
        if w not in self.word2id:
            self.word2id[w] = self.n_words
            self.id2word[self.n_words] = w
            self.n_words += 1
        return self.word2id[w]
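
Note that `index_word` assigns a fresh ID to every word it has not seen, so words that first appear in the dev or test data would also receive their own IDs instead of UNK_TOKEN. One possible way to follow the UNK rule is a read-only lookup used for dev and test. The sketch below is an assumption, not part of the original class: `lookup_words` is a hypothetical helper, and the toy `word2id` stands in for a vocabulary built from the training set only.

```python
UNK_TOKEN = 0  # same convention as the Vocab class


def lookup_words(word2id, words):
    """Map words to IDs without growing the vocabulary (hypothetical helper)."""
    return [word2id.get(w, UNK_TOKEN) for w in words]


# Toy mapping standing in for a vocabulary built from the training data.
word2id = {"__unk__": UNK_TOKEN, "the": 1, "cat": 2}
ids = lookup_words(word2id, ["the", "dog", "cat"])
print(ids)
```

With this approach the UNK embedding is never updated during training unless some training words are also mapped to UNK_TOKEN (for example, rare words), so that is a design choice worth considering.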
            

**Task 2:** Write a function prepare_data that takes one of [train, dev, test] and the Vocab instance, and converts each (words, tags) pair to a pair of index lists. Each pair should be appended to data_sequences, which the function returns.

In [None]:
vocab = Vocab()

def prepare_data(data, vocab):
    data_sequences = []
    for words, tags in data:
      iwords = vocab.index_words(words)
      itags = vocab.index_tags(tags)
      data_sequences.append((iwords, itags))

    return data_sequences, vocab

train_sequences, vocab = prepare_data(train, vocab)
dev_sequences, vocab = prepare_data(dev, vocab)
test_sequences, vocab = prepare_data(test, vocab)

**Task 3:** Write NERNet, a PyTorch Module for labeling words with NER tags. 

*input_size:* the size of the vocabulary

*embedding_size:* the size of the embeddings

*hidden_size:* the LSTM hidden size

*output_size:* the number of tags we are predicting

*n_layers:* the number of layers we want to use in LSTM

*directions:* can be 1 or 2, indicating a unidirectional or bidirectional LSTM, respectively

The input for your forward function should be a single sentence tensor.
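
The parameters above determine the tensor shapes at each stage of the network. The following sketch is plain arithmetic, not a model: `nernet_shapes` is a hypothetical helper, and it assumes a single sentence fed as a batch of one, an LSTM created with `batch_first=True`, and an output flattened to one row of tag scores per token.

```python
def nernet_shapes(seq_len, embedding_size, hidden_size, output_size, directions):
    """Return the expected tensor shapes for one sentence (batch of 1)."""
    if directions not in (1, 2):
        raise ValueError("directions must be 1 or 2")
    # A bidirectional LSTM concatenates forward and backward hidden states.
    lstm_features = hidden_size * directions
    return {
        "input": (1, seq_len),
        "embedded": (1, seq_len, embedding_size),
        "lstm_output": (1, seq_len, lstm_features),
        "logits": (seq_len, output_size),
    }


shapes = nernet_shapes(seq_len=5, embedding_size=300, hidden_size=500,
                       output_size=7, directions=2)
for name, shape in shapes.items():
    print(name, shape)
```

The number of LSTM layers does not appear here because stacking layers changes the parameter count but not the shape of the final output sequence. This is why the linear layer needs `hidden_size * directions` input features.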

In [None]:
class NERNet(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, n_layers, bidirectional):
        super(NERNet, self).__init__()

        self.input_size = input_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.bidirectional = bidirectional

        in_features = self.hidden_size
        if bidirectional:
            in_features = self.hidden_size * 2

        self.embedding = nn.Embedding(num_embeddings=input_size, embedding_dim=embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, num_layers=n_layers, bidirectional=bidirectional, batch_first=True)
        self.out = nn.Linear(in_features=in_features, out_features=output_size)

    def forward(self, input_sentence):
        embedded = self.embedding(input_sentence)
        output, _ = self.lstm(embedded)

        # flatten to (batch*seq_len, lstm_features) so the linear layer yields one row of tag scores per token
        output = output.view(-1, output.shape[2])
        return self.out(output)


**Task 4:** Write a training loop that takes a model (an instance of NERNet) and the number of epochs to train for. The loss is always CrossEntropyLoss and the optimizer is always Adam.

In [None]:
import torch.optim as optim

def train_loop(model, n_epochs):
    # Loss function
    criterion = nn.CrossEntropyLoss()

    # Optimizer (ADAM is a fancy version of SGD)
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    print_every = 10000
    for e in range(1, n_epochs + 1):

        e_loss = 0
        loss = 0
        for counter, (sequence, labels) in enumerate(train_sequences):

            sequence_tensor = torch.LongTensor(sequence).to(device).view(1, -1)
            labels_tensor = torch.LongTensor(labels).to(device)

            # compute model output and loss
            output_batch = model(sequence_tensor)
            sentence_loss = criterion(output_batch, labels_tensor)

            optimizer.zero_grad()
            sentence_loss.backward()

            # updating weights
            optimizer.step()

            # accumulate the loss; CrossEntropyLoss already averages over the sentence's tokens
            sentence_loss = sentence_loss.item()
            loss += sentence_loss
            e_loss += sentence_loss

            if counter > 0 and counter % print_every == 0:
                loss = loss / print_every
                print('Epoch %d/%d, %d/%d, Current Loss = %.4f' % (e, n_epochs, counter, len(train_sequences), loss))
                loss = 0

        e_loss = e_loss / len(train_sequences)
        print('Epoch %d/%d, Current Loss = %.4f' % (e, n_epochs, e_loss))
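
`nn.CrossEntropyLoss` uses `reduction='mean'` by default, so the value it returns for a sentence is already averaged over that sentence's tokens. The plain-Python sketch below reproduces that average and shows what a second division by the sentence length would do to a logged number. The function name `mean_cross_entropy` and the toy logits are invented for illustration.

```python
import math


def mean_cross_entropy(logits, labels):
    """Average negative log-likelihood over tokens, as reduction='mean' does."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
    return total / len(labels)


# Toy scores for a 4-token sentence with 3 possible tags (hypothetical values).
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
labels = [0, 1, 2, 0]

per_token = mean_cross_entropy(logits, labels)
print(per_token)                 # already a per-token average
print(per_token / len(labels))   # a second division shrinks it by the sentence length
```

A second division only affects the number that is printed, not the gradients, but it makes losses hard to compare between short and long sentences.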



**Task 5:** Write an evaluation loop for a trained model, using the dev and test datasets. This function prints the recall, also known as the true positive rate (TPR), and the precision (labeled "fpr" in this notebook, although precision is not the false positive rate), for each label separately (7 labels in total) and for all 6 labels (except O) together. The caption argument should be used as a prefix when printing the results.

In [None]:
import numpy as np

def evaluate_dataset(model, dataset):
    tp = 0
    fp = 0
    fn = 0
    tn = 0
    for counter, (sequence, labels) in enumerate(dataset):
        sequence_tensor = torch.LongTensor(sequence).to(device).view(1, -1)
        output_labels = model(sequence_tensor)

        labels = np.array(labels)
        pred_labels = torch.argmax(output_labels, dim=1).detach().cpu().numpy()

        tp += np.sum([labels[i[0]] == pred_labels[i[0]] for i in np.argwhere(pred_labels > 0)])
        tn += np.sum([labels[i[0]] == pred_labels[i[0]] for i in np.argwhere(pred_labels == 0)])

        fp += np.sum([labels[i[0]] != pred_labels[i[0]] for i in np.argwhere(pred_labels > 0)])
        fn += np.sum([labels[i[0]] != pred_labels[i[0]] for i in np.argwhere(pred_labels == 0)])

    # Precision = True Positive / (True Positive + False Positive)
    # The variable is named `fpr`, but the value is precision, not the false positive rate (FP / (FP + TN)).
    fpr = tp / (tp + fp)

    # Recall = True Positive Rate = True Positive / (True Positive + False Negative)
    tpr = tp / (tp + fn)

    print("tp: {}, tn: {}, fp: {}, fn: {}, fpr: {}, tpr: {}".format(tp, tn, fp, fn, fpr, tpr))

    return tpr, fpr


def evaluate(model, caption):
  dev_tpr, dev_fpr = evaluate_dataset(model, dev_sequences)
  test_tpr, test_fpr = evaluate_dataset(model, test_sequences)

  # TODO - your code goes here
  print("{} - dev tpr: {}, dev fpr: {}, test tpr: {}, test fpr: {}".format(caption, dev_tpr, dev_fpr, test_tpr, test_fpr))
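
The task also asks for precision and recall of each label separately, which the function above does not report. A minimal sketch of token-level, one-vs-rest counting is shown below. The name `per_label_scores`, the toy label sequences, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
from collections import Counter


def per_label_scores(gold, pred, labels):
    """Token-level precision and recall for each label (one-vs-rest counts)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # label p was predicted, but the gold label differs
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        # Assumed convention: report 0.0 when a label never occurs.
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy tag IDs using the Vocab convention (0 = "O", 1 = "B-PER", ...).
gold = [1, 2, 0, 3, 0, 5]
pred = [1, 0, 0, 3, 3, 5]
scores = per_label_scores(gold, pred, labels=range(7))
for label, (precision, recall) in scores.items():
    print("label {}: precision {:.2f}, recall {:.2f}".format(label, precision, recall))
```

In this scheme a token predicted with the wrong entity label counts as a false positive for the predicted label and as a false negative for the gold label, so both precision and recall are penalized.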


**Task 6:** Train and evaluate a few models, all with embedding_size=300, and with the following hyperparameters (you may use these as captions for the models as well):

Model 1: (hidden_size: 500, n_layers: 1, directions: 1)

Model 2: (hidden_size: 500, n_layers: 2, directions: 1)

Model 3: (hidden_size: 500, n_layers: 3, directions: 1)

Model 4: (hidden_size: 500, n_layers: 1, directions: 2)

Model 5: (hidden_size: 500, n_layers: 2, directions: 2)

Model 6: (hidden_size: 500, n_layers: 3, directions: 2)

Model 7: (hidden_size: 800, n_layers: 1, directions: 2)

Model 8: (hidden_size: 800, n_layers: 2, directions: 2)

Model 9: (hidden_size: 800, n_layers: 3, directions: 2)
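
The nine configurations listed above can also be generated programmatically instead of being typed out. The sketch below is one way to do that; the names `groups` and `configs` are illustrative, and the caption format is an assumption.

```python
from itertools import product

# (hidden_size, directions) groups in the order listed above; n_layers varies fastest.
groups = [(500, 1), (500, 2), (800, 2)]
configs = [
    {"hidden_size": h, "n_layers": n, "directions": d}
    for (h, d), n in product(groups, [1, 2, 3])
]

for cfg in configs:
    caption = "hidden_size: {hidden_size}, n_layers: {n_layers}, directions: {directions}".format(**cfg)
    print(caption)
```

Each dictionary could then be unpacked into whatever training function is used, with `directions == 2` translated to a bidirectional flag.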

In [None]:
def train_and_evaluate(hidden_size, n_layers, bidirectional, embedding_size=300, n_epochs=10):
  # embedding_size sets the embedding dimension (the task asks for 300 unless stated otherwise)
  model = NERNet(vocab.n_words, embedding_size, hidden_size, len(vocab.tag2id), n_layers, bidirectional).to(device)
  train_loop(model, n_epochs)

  caption = "hidden_size: {}, n_layers: {}, bidirectional: {}".format(hidden_size, n_layers, bidirectional)
  evaluate(model, caption)


In [None]:
train_and_evaluate(500, 1, False, n_epochs=10)
train_and_evaluate(500, 2, False, n_epochs=10)
train_and_evaluate(500, 3, False, n_epochs=10)
train_and_evaluate(500, 1, True, n_epochs=10)
train_and_evaluate(500, 2, True, n_epochs=10)
train_and_evaluate(500, 3, True, n_epochs=10)
train_and_evaluate(800, 1, True, n_epochs=10)
train_and_evaluate(800, 2, True, n_epochs=10)
train_and_evaluate(800, 3, True, n_epochs=10)


Epoch 1/10, Current Loss = 0.1179
Epoch 2/10, Current Loss = 0.0560
Epoch 3/10, Current Loss = 0.0267
Epoch 4/10, Current Loss = 0.0131
Epoch 5/10, Current Loss = 0.0063
Epoch 6/10, Current Loss = 0.0031
Epoch 7/10, Current Loss = 0.0017
Epoch 8/10, Current Loss = 0.0009
Epoch 9/10, Current Loss = 0.0008
Epoch 10/10, Current Loss = 0.0005
tp: 537.0, tn: 3023.0, fp: 134.0, fn: 249.0, fpr: 0.8002980625931445, tpr: 0.683206106870229
tp: 1049.0, tn: 6408.0, fp: 309.0, fn: 477.0, fpr: 0.772459499263623, tpr: 0.6874180865006553
hidden_size: 500, n_layers: 1, bidirectional: False - dev tpr: 0.683206106870229, dev fpr: 0.8002980625931445, test tpr: 0.6874180865006553, test fpr: 0.772459499263623
Epoch 1/10, Current Loss = 0.1121
Epoch 2/10, Current Loss = 0.0525
Epoch 3/10, Current Loss = 0.0204
Epoch 4/10, Current Loss = 0.0065
Epoch 5/10, Current Loss = 0.0025
Epoch 6/10, Current Loss = 0.0014
Epoch 7/10, Current Loss = 0.0006
Epoch 8/10, Current Loss = 0.0008
Epoch 9/10, Current Loss = 0.00

In [None]:
train_and_evaluate(500, 3, False, n_epochs=10, embedding_size=50)
train_and_evaluate(500, 3, True, n_epochs=10, embedding_size=50)
train_and_evaluate(800, 3, True, n_epochs=10, embedding_size=50)

train_and_evaluate(500, 3, False, n_epochs=10, embedding_size=500)
train_and_evaluate(500, 3, True, n_epochs=10, embedding_size=500)
train_and_evaluate(800, 3, True, n_epochs=10, embedding_size=500)

Epoch 1/10, Current Loss = 0.1191
Epoch 2/10, Current Loss = 0.0567
Epoch 3/10, Current Loss = 0.0218
Epoch 4/10, Current Loss = 0.0074
Epoch 5/10, Current Loss = 0.0029
Epoch 6/10, Current Loss = 0.0016
Epoch 7/10, Current Loss = 0.0011
Epoch 8/10, Current Loss = 0.0007
Epoch 9/10, Current Loss = 0.0008
Epoch 10/10, Current Loss = 0.0007
tp: 544.0, tn: 3015.0, fp: 176.0, fn: 208.0, fpr: 0.7555555555555555, tpr: 0.723404255319149
tp: 1103.0, tn: 6396.0, fp: 337.0, fn: 407.0, fpr: 0.7659722222222223, tpr: 0.7304635761589404
hidden_size: 500, n_layers: 3, bidirectional: False - dev tpr: 0.723404255319149, dev fpr: 0.7555555555555555, test tpr: 0.7304635761589404, test fpr: 0.7659722222222223
Epoch 1/10, Current Loss = 0.0874
Epoch 2/10, Current Loss = 0.0227
Epoch 3/10, Current Loss = 0.0053
Epoch 4/10, Current Loss = 0.0018
Epoch 5/10, Current Loss = 0.0020
Epoch 6/10, Current Loss = 0.0008
Epoch 7/10, Current Loss = 0.0007
Epoch 8/10, Current Loss = 0.0005
Epoch 9/10, Current Loss = 0.

In [None]:
train_and_evaluate(100, 3, True, n_epochs=20, embedding_size=50)
train_and_evaluate(200, 3, True, n_epochs=20, embedding_size=50)
train_and_evaluate(400, 3, True, n_epochs=20, embedding_size=50)

train_and_evaluate(100, 3, True, n_epochs=20, embedding_size=100)
train_and_evaluate(200, 3, True, n_epochs=20, embedding_size=100)
train_and_evaluate(400, 3, True, n_epochs=20, embedding_size=100)


Epoch 1/10, Current Loss = 0.1356
Epoch 2/10, Current Loss = 0.1017
Epoch 3/10, Current Loss = 0.0706
Epoch 4/10, Current Loss = 0.0480
Epoch 5/10, Current Loss = 0.0325
Epoch 6/10, Current Loss = 0.0212
Epoch 7/10, Current Loss = 0.0133
Epoch 8/10, Current Loss = 0.0080
Epoch 9/10, Current Loss = 0.0048
Epoch 10/10, Current Loss = 0.0029
tp: 503.0, tn: 3042.0, fp: 129.0, fn: 269.0, fpr: 0.7958860759493671, tpr: 0.6515544041450777
tp: 951.0, tn: 6463.0, fp: 278.0, fn: 551.0, fpr: 0.7737998372660699, tpr: 0.6331557922769641
hidden_size: 100, n_layers: 3, bidirectional: True - dev tpr: 0.6515544041450777, dev fpr: 0.7958860759493671, test tpr: 0.6331557922769641, test fpr: 0.7737998372660699
Epoch 1/10, Current Loss = 0.1199
Epoch 2/10, Current Loss = 0.0594
Epoch 3/10, Current Loss = 0.0302
Epoch 4/10, Current Loss = 0.0151
Epoch 5/10, Current Loss = 0.0061
Epoch 6/10, Current Loss = 0.0026
Epoch 7/10, Current Loss = 0.0014
Epoch 8/10, Current Loss = 0.0012
Epoch 9/10, Current Loss = 0.

In [None]:
train_and_evaluate(800, 3, True, n_epochs=10, embedding_size=800)
train_and_evaluate(1000, 3, True, n_epochs=10, embedding_size=1000)
train_and_evaluate(1200, 3, True, n_epochs=10, embedding_size=1200)

Epoch 1/10, Current Loss = 0.0739
Epoch 2/10, Current Loss = 0.0135
Epoch 3/10, Current Loss = 0.0029
Epoch 4/10, Current Loss = 0.0013
Epoch 5/10, Current Loss = 0.0016
Epoch 6/10, Current Loss = 0.0012
Epoch 7/10, Current Loss = 0.0008
Epoch 8/10, Current Loss = 0.0005
Epoch 9/10, Current Loss = 0.0005
Epoch 10/10, Current Loss = 0.0005
tp: 616.0, tn: 3031.0, fp: 134.0, fn: 162.0, fpr: 0.8213333333333334, tpr: 0.7917737789203085
tp: 1224.0, tn: 6389.0, fp: 309.0, fn: 321.0, fpr: 0.7984344422700587, tpr: 0.7922330097087379
hidden_size: 800, n_layers: 3, bidirectional: True - dev tpr: 0.7917737789203085, dev fpr: 0.8213333333333334, test tpr: 0.7922330097087379, test fpr: 0.7984344422700587
Epoch 1/10, Current Loss = 0.0715
Epoch 2/10, Current Loss = 0.0108
Epoch 3/10, Current Loss = 0.0033
Epoch 4/10, Current Loss = 0.0012
Epoch 5/10, Current Loss = 0.0007
Epoch 6/10, Current Loss = 0.0017
Epoch 7/10, Current Loss = 0.0006
Epoch 8/10, Current Loss = 0.0006
Epoch 9/10, Current Loss = 0

In [None]:
train_and_evaluate(800, 4, True, n_epochs=15, embedding_size=500)
train_and_evaluate(800, 5, True, n_epochs=15, embedding_size=500)
train_and_evaluate(800, 6, True, n_epochs=15, embedding_size=500)


Epoch 1/10, Current Loss = 0.0865
Epoch 2/10, Current Loss = 0.0220
Epoch 3/10, Current Loss = 0.0056
Epoch 4/10, Current Loss = 0.0028
Epoch 5/10, Current Loss = 0.0022
Epoch 6/10, Current Loss = 0.0013
Epoch 7/10, Current Loss = 0.0009
Epoch 8/10, Current Loss = 0.0006
Epoch 9/10, Current Loss = 0.0006
Epoch 10/10, Current Loss = 0.0021
tp: 615.0, tn: 3011.0, fp: 177.0, fn: 140.0, fpr: 0.7765151515151515, tpr: 0.8145695364238411
tp: 1226.0, tn: 6377.0, fp: 365.0, fn: 275.0, fpr: 0.7705845380263985, tpr: 0.8167888074616922
hidden_size: 800, n_layers: 4, bidirectional: True - dev tpr: 0.8145695364238411, dev fpr: 0.7765151515151515, test tpr: 0.8167888074616922, test fpr: 0.7705845380263985
Epoch 1/10, Current Loss = 0.0990
Epoch 2/10, Current Loss = 0.0320
Epoch 3/10, Current Loss = 0.0112
Epoch 4/10, Current Loss = 0.0053
Epoch 5/10, Current Loss = 0.0027
Epoch 6/10, Current Loss = 0.0022
Epoch 7/10, Current Loss = 0.0021
Epoch 8/10, Current Loss = 0.0024
Epoch 9/10, Current Loss = 0

In [None]:
train_and_evaluate(800, 3, True, n_epochs=10, embedding_size=500)

Epoch 1/10, Current Loss = 0.0747
Epoch 2/10, Current Loss = 0.0132
Epoch 3/10, Current Loss = 0.0031
Epoch 4/10, Current Loss = 0.0019
Epoch 5/10, Current Loss = 0.0026
Epoch 6/10, Current Loss = 0.0007
Epoch 7/10, Current Loss = 0.0009
Epoch 8/10, Current Loss = 0.0008
Epoch 9/10, Current Loss = 0.0005
Epoch 10/10, Current Loss = 0.0005
tp: 625.0, tn: 3015.0, fp: 150.0, fn: 153.0, fpr: 0.8064516129032258, tpr: 0.8033419023136247
tp: 1215.0, tn: 6404.0, fp: 299.0, fn: 325.0, fpr: 0.8025099075297226, tpr: 0.788961038961039
hidden_size: 800, n_layers: 3, bidirectional: True - dev tpr: 0.8033419023136247, dev fpr: 0.8064516129032258, test tpr: 0.788961038961039, test fpr: 0.8025099075297226


In [None]:
train_and_evaluate(1000, 4, True, n_epochs=15, embedding_size=1000)
train_and_evaluate(1000, 5, True, n_epochs=15, embedding_size=1000)
train_and_evaluate(1000, 6, True, n_epochs=15, embedding_size=1000)


Epoch 1/15, Current Loss = 0.0814
Epoch 2/15, Current Loss = 0.0175
Epoch 3/15, Current Loss = 0.0054
Epoch 4/15, Current Loss = 0.0034
Epoch 5/15, Current Loss = 0.0014
Epoch 6/15, Current Loss = 0.0013
Epoch 7/15, Current Loss = 0.0017
Epoch 8/15, Current Loss = 0.0016
Epoch 9/15, Current Loss = 0.0007
Epoch 10/15, Current Loss = 0.0008
Epoch 11/15, Current Loss = 0.0009
Epoch 12/15, Current Loss = 0.0006
Epoch 13/15, Current Loss = 0.0004
Epoch 14/15, Current Loss = 0.0010
Epoch 15/15, Current Loss = 0.0002
tp: 644.0, tn: 3021.0, fp: 130.0, fn: 148.0, fpr: 0.8320413436692506, tpr: 0.8131313131313131
tp: 1240.0, tn: 6430.0, fp: 264.0, fn: 309.0, fpr: 0.824468085106383, tpr: 0.8005164622336992
hidden_size: 1000, n_layers: 4, bidirectional: True - dev tpr: 0.8131313131313131, dev fpr: 0.8320413436692506, test tpr: 0.8005164622336992, test fpr: 0.824468085106383
Epoch 1/15, Current Loss = 0.0950
Epoch 2/15, Current Loss = 0.0268
Epoch 3/15, Current Loss = 0.0096
Epoch 4/15, Current Loss

**Task 7:** Download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (use the 300-dim vectors from glove.6B.zip). Then initialize the nn.Embedding module in your NERNet with these embeddings, so that you can start your training with pre-trained vectors. Repeat Task 6 and print the results for each model.

Note: make sure that the vectors are aligned with the IDs in your Vocab; in other words, make sure that, for example, the word with ID 0 is the first vector in the GloVe matrix of vectors that you initialize nn.Embedding with. For a discussion on how to do that, see this link:
https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
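
The alignment requirement can be sketched without PyTorch: build a matrix whose row i holds the vector of the word with ID i, and fall back to small random values for words that GloVe does not cover. Everything below is illustrative: `build_embedding_matrix` is a hypothetical helper, the three "GloVe" lines are toy data in the same word-then-floats text format, and the sketch assumes that word IDs are dense (0 to n-1), as they are in the Vocab class.

```python
import random


def build_embedding_matrix(word2id, glove_lines, dim, seed=0):
    """Return (matrix, found), where matrix[i] is the vector for the word with ID i."""
    rng = random.Random(seed)
    # Start from small random vectors so words missing from GloVe still get a row.
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(len(word2id))]
    found = 0
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word2id and len(values) == dim:
            # Overwrite the random row at this word's ID with the pre-trained vector.
            matrix[word2id[word]] = [float(v) for v in values]
            found += 1
    return matrix, found


# Toy stand-ins for the GloVe file and the vocabulary (both hypothetical).
glove_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6", "zebra 0.7 0.8 0.9"]
word2id = {"__unk__": 0, "the": 1, "cat": 2, "dog": 3}

matrix, found = build_embedding_matrix(word2id, glove_lines, dim=3)
print(found, len(matrix))
```

With the real file, `glove_lines` would be an open file handle and `dim` would be 300. The resulting matrix could then be loaded with `nn.Embedding.from_pretrained(torch.FloatTensor(matrix), freeze=False)`. Since the glove.6B vocabulary is lowercased, a real lookup may need to lowercase the words from the NER data before matching.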

In [None]:
# TODO - your code goes here...

**Good luck!**