# Classroom 6 - Training a Named Entity Recognition Model with an LSTM

The classroom today is primarily geared towards preparing you for Assignment 4 which you'll be working on after today. The notebook is split into three main parts to get you thinking. You should work through these sections in groups together in class. 

If you have any questions or things you don't understand, make a note of them so you can remember to ask - or, even better, post them to Slack!

If you get through everything here, make a start on the assignment. If you don't, don't worry about it - but I suggest you finish all of the exercises here before starting the assignment.

## 1. A very short intro to NER
Named entity recognition (NER), also known as named entity extraction or entity identification, is the task of extracting named entities from unstructured text and classifying them into predefined categories such as names, medical codes, quantities or similar.

The most common variant is the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) format, which uses the categories person (PER), organization (ORG), location (LOC) and miscellaneous (MISC); the last of these covers cases such as nationalities. For example:

*Hello my name is $Ross_{PER}$ I live in $Aarhus_{LOC}$ and work at $AU_{ORG}$.*

Let's see how this works with ```spaCy```. NB: you might need to install a ```spaCy``` model first:

```python -m spacy download en_core_web_sm```

In [None]:
#A RANDOM VISUALISATION UNRELATED TO THE PROCESSES

#import spacy

#nlp = spacy.load("en_core_web_sm")
#doc = nlp("Hello my name is Ross. I live in Denmark and work at Aarhus University, I am Scottish and today is Friday 27th.")

In [None]:
#from spacy import displacy
#displacy.render(doc, style="ent")

## Tagging standards
There exist different tag standards for NER. The most used one is the BIO format, which frames the task as token classification, marking each token as being at the beginning of, inside, or outside an entity.

Words marked with *O* are not part of a named entity. Tags which start with *B-\** indicate the start of an entity (e.g. *B-ORG* for the *Aarhus* in *Aarhus University*), while *I-\** tags indicate the continuation of that entity (e.g. *I-ORG* for *University*).

    B = Beginning
    I = Inside
    O = Outside
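
To make the tagging scheme concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name `spans_to_bio` and the span format (start index, exclusive end index, label) are assumptions made for illustration; they are not part of the course code or of any library.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    tokens: list of str
    spans: list of (start, end, label) tuples in token indices, end exclusive
    """
    # every token starts out as "outside"
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # the first token of the entity gets the B- prefix
        tags[start] = f"B-{label}"
        # any remaining tokens of the same entity get the I- prefix
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["I", "work", "at", "Aarhus", "University", "in", "Denmark", "."]
spans = [(3, 5, "ORG"), (6, 7, "LOC")]  # hypothetical gold annotations
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Note that a single-word entity such as *Denmark* still gets a *B-* tag; it simply has no *I-* tags following it.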

<details>
<summary>Q: What other formats and standards are available? What kinds of entities do they make it possible to tag?</summary>
<br>
You can see more examples on the spaCy documentation for their [different models](https://spacy.io/models/en).
</details>


In [1]:
#EXAMPLE OF SOMETHING FROM SPACY, NOT SURE IF PRETAGGED OR GETTING TAGGED IN THE CODE SOMEWHERE
#for t in doc:
#    if t.ent_type:
#        print(t, f"{t.ent_iob_}-{t.ent_type_}")
#    else:
#        print(t, t.ent_iob_)

NameError: name 'doc' is not defined

### Some challenges with NER
While NER is currently framed as above, this formulation does have some limitations.

For instance, the entity Aarhus University refers both to the location Aarhus and to the university within Aarhus. Nested NER (N-NER) therefore argues that it would be more correct to tag it in a nested fashion as \[\[$Aarhus_{LOC}$\] $University$\]$_{ORG}$ (Plank, 2020). 

Other related tasks include named entity linking, which is the task of linking an entity to e.g. a Wikipedia entry. Here you have to know both that something is indeed an entity and which entity it is (if it is indeed a defined entity).

In this assignment, we'll be using Bi-LSTMs to train an NER model on a predefined dataset which uses IOB tags of the kind we outlined above.

## 2. Training in batches

When you trained your document classifier for the last assignment, you probably noticed that the neural network was quite brittle. Small changes in the hyperparameters could cause massive changes in performance. Likewise, you probably noticed that they tend to substantially overfit the training data and underperform on the validation and test data.

One way we can get around this is by processing the data in smaller chunks known as *batches*. 

<details>
<summary>Q: Why might it be a good idea to train on batches, rather than the whole dataset?</summary>
<br>
These batches are usually small (something like 32 instances at a time) but they have a couple of important effects on training:

- The instances in a batch can be processed in parallel, rather than sequentially. This can result in a substantial speed-up from a computational perspective
- Similarly, smaller batch sizes make it easier to fit training data into memory
- Lastly, smaller batch sizes are noisy, meaning that they have a regularizing effect and thus lead to less overfitting.

In this assignment, we're going to be using batches of data to train our NER model. To do that, we first have to prepare our batches for training. You can read more about batching in [this blog post](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/).

</details>



In [2]:
# this allows us to look one step up in the directory
# for importing custom modules from src
import sys
sys.path.append("..")
from src.util import batch
from src.LSTM import RNN
from src.embedding import gensim_to_torch_embedding

# numpy and pytorch
import numpy as np
import torch

# loading data and embeddings
from datasets import load_dataset
import gensim.downloader as api

We can download the dataset using the ```load_dataset()``` function we've already seen. Here we take only the training data.

When you've downloaded the dataset, you're welcome to save a local copy so that we don't need to download it again every time the code runs.

Q: What do the ```train.features``` values refer to?

In [4]:
# DATASET
dataset = load_dataset("conllpp")
train = dataset["train"]

# inspect the dataset
print(train["tokens"][:1])
print(train["ner_tags"][:1])

# get number of classes
num_classes = train.features["ner_tags"].feature.num_classes


[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']]
[[3, 0, 7, 0, 0, 0, 7, 0, 0]]
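
The integers in ```ner_tags``` are class indices, so it can help to decode them back into tag names. The sketch below hard-codes the label order that CoNLL-2003 style datasets on the Hugging Face hub typically use; treat that order as an assumption and verify it in the notebook with ```train.features["ner_tags"].feature.names```.

```python
# Assumed label order; confirm it with train.features["ner_tags"].feature.names
label_names = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]

# the first training example, copied from the output above
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# look up the name for each class index
decoded = [label_names[i] for i in tag_ids]

for token, tag in zip(tokens, decoded):
    print(f"{token:10}{tag}")
```

Under that assumed order, *EU* is tagged as the beginning of an organization, and *German* and *British* as miscellaneous entities.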


We then use ```gensim``` to get some pretrained word embeddings for the input layer to the model. 

In this example, we're going to use a GloVe model pretrained on Wikipedia and Gigaword, with 50 dimensions.

I've provided a helper function to take the ```gensim``` embeddings and prepare them for ```pytorch```.

In [5]:
# CONVERTING EMBEDDINGS
model = api.load("glove-wiki-gigaword-50")

# convert gensim word embedding to torch word embedding
embedding_layer, vocab = gensim_to_torch_embedding(model)

### Preparing a batch

The first thing we want to do is to shuffle our dataset before training. 

Q: Why might it be a good idea to shuffle the data?

In [6]:
# shuffle dataset
shuffled_train = dataset["train"].shuffle(seed=1)


Next, we want to bundle the shuffled training data into smaller batches of predefined size. I've written a small utility function here to help. 

<details>
<summary>Q: Can you explain how the ```batch()``` function works?</summary>
<br>
 Hint: Check out [this link](https://realpython.com/introduction-to-python-generators/).
</details>
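
If generators are new to you, the following is a minimal sketch of how a generator-based batching helper could be written. It is an assumption for illustration only: ```batch_sketch``` is a hypothetical name, and the real ```batch()``` in ```src/util.py``` may differ in its details.

```python
def batch_sketch(items, batch_size):
    """Yield consecutive chunks of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        # yield hands back one chunk and pauses here until the next one is requested
        yield items[start:start + batch_size]


sentences = [["a"], ["b", "c"], ["d"], ["e", "f"], ["g"]]
batches = batch_sketch(sentences, batch_size=2)

first = next(batches)      # a batch is only computed when it is asked for
remaining = list(batches)  # consumes every batch that is left
exhausted = list(batches)  # nothing is left, so this is an empty list
```

The last line shows why re-running cells that call ```next()``` on the same generator gives different results each time: a generator can only be iterated once, and has to be created again to start from the first batch.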



In [7]:
# batch() returns generators, which are consumed as you iterate over them;
# re-run this cell to start again from the first batch
batch_size = 32
batches_tokens = batch(shuffled_train["tokens"], batch_size)
batches_tags = batch(shuffled_train["ner_tags"], batch_size)

In [9]:
batches_tokens

TypeError: 'generator' object is not subscriptable

Next, we want to use the ```tokens_to_idx()``` function below on our batches.

<details>
<summary>Q: What is this function doing? Why is it doing it?</summary>
<br>
We're making everything lowercase and using a new, arbitrary token called ```<UNK>``` that has been added to the vocabulary. This ```<UNK>``` means "unknown" and is used to replace out-of-vocabulary tokens in the data - i.e. tokens that don't appear in the vocabulary of the pretrained word embeddings.
</details>
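
The lookup logic can be tried out on a toy vocabulary. In this sketch, ```toy_vocab``` and ```lookup``` are made-up names for illustration; the real vocabulary comes from the pretrained embeddings and is far larger.

```python
# a tiny, made-up vocabulary with an explicit entry for unknown tokens
toy_vocab = {"the": 0, "eu": 1, "rejects": 2, "UNK": 3}


def lookup(tokens, vocab):
    # dict.get(key, default) returns the default instead of raising a KeyError,
    # so any token missing from the vocabulary falls back to the UNK index
    return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]


ids = lookup(["The", "EU", "rejects", "Brexit"], toy_vocab)
print(ids)
```

Here *The* and *EU* are only found because they are lowercased first, while *Brexit* is not in the toy vocabulary and is mapped to the ```UNK``` index.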


In [10]:
def tokens_to_idx(tokens, vocab=model.key_to_index):
    """
    - Write documentation for this function including type hints for each argument and return statement
    - What does the .get method do?
    - Why lowercase?
    """
    return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]

In [11]:
help(next)  # next() returns the next item from an iterator, e.g. the next batch from a generator

Help on built-in function next in module builtins:

next(...)
    next(iterator[, default])
    
    Return the next item from the iterator. If default is given and the iterator
    is exhausted, it is returned instead of raising StopIteration.



We'll check below that everything is working as expected by testing it on a single batch.

In [12]:
# sample using only the first batch; each call to next() advances the generator,
# so re-running this cell moves on to the following batch
batch_tokens = next(batches_tokens)
batch_tags = next(batches_tags)
batch_tok_idx = [tokens_to_idx(sent) for sent in batch_tokens]
# the list comprehension above is equivalent to:
# batch_tok_idx = []
# for sent in batch_tokens:
#     batch_tok_idx.append(tokens_to_idx(sent))

In [16]:
type(batch_tokens)
batch_tok_idx

[[544, 3535],
 [19605, 947, 790, 176, 421, 46532, 57483, 904],
 [5330, 7764, 1161, 1644, 400000, 21157, 31, 1101, 7, 477, 3, 174, 93, 107797,
  580, 9731, 1, 13, 4970, 5985, 1, 410, 865, 5330, 7764, 1161, 1644, 16, 2],
 [4825, 3195, 4760, 347581, 314, 2639],
 [4179, 400000],
 [65, 93147, 35913, 4762, 926, 8749, 156, 3, 4905, 49, 76, 364, 9, 508, 2],
 [8455, 4628, 23, 19658, 24],
 [2859, 16, 12, 0, 2072, 3, 96951, 78, 1307, 1358, 445, 4, 211, 23837, 2],
 [4020, 22, 614, 1518],
 [8, 41, 54, 835, 4, 2199, 59, 1174, 1, 8, 16, 3701, 1, 38, 3416, 49, 26,
  17544, 4, 596, 0, 8868, 5, 1112, 7, 3667, 6595, 2],
 [0, 30773, 1890, 10811, 284343, 25, 1905, 4, 126, 6, 0, 1250, 85, 366, 6499,
  17, 10974, 226, 34, 73996, 1827, 4, 802, 7, 7657, 410, 17, 17083, 226, 49,
  502, 3, 0, 418, 3460, 2],
 [1193, 43, 30, 7145, 7, 5

As with document classification, our model needs to take input sequences of a fixed length. To get around this we do a couple of different steps.

- Find the length of the longest sequence in the batch
- Pad shorter sequences to the max length using an arbitrary token like ```<PAD>```
- Give the ```<PAD>``` token a new label ```-1``` to differentiate it from the other labels

In [17]:
# compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_tok_idx])

Q: Can you figure out the logic of what is happening in the next two cells?

In [19]:
batch_input = vocab["PAD"] * np.ones((batch_size, batch_max_len))
batch_labels = -1 * np.ones((batch_size, batch_max_len))

In [20]:
# copy the data to the numpy array
for i in range(batch_size):
    tok_idx = batch_tok_idx[i]
    tags = batch_tags[i]
    size = len(tok_idx)

    batch_input[i][:size] = tok_idx
    batch_labels[i][:size] = tags

The last step is to convert the arrays into ```pytorch``` tensors, ready for the NN model.

In [None]:
# since all data are indices, we convert them to torch LongTensors (integers)
batch_input, batch_labels = torch.LongTensor(batch_input), torch.LongTensor(
    batch_labels
)
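
One reason for giving padded positions the label ```-1``` is that a loss function can then skip them. The sketch below shows one possible way to do that, using the ```ignore_index``` argument of ```torch.nn.functional.nll_loss``` on made-up toy tensors. How ```loss_fn``` in ```LSTM.py``` actually handles padding is not shown here, so read this as an illustration rather than a description of the course code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_tags = 3  # toy number of classes
# two sentences padded to length 4; -1 marks the padded positions
labels = torch.LongTensor([[0, 2, 1, -1],
                           [1, -1, -1, -1]])
# stand-in for model output: log-probabilities of shape (batch, seq_len, num_tags)
log_probs = F.log_softmax(torch.randn(2, 4, num_tags), dim=-1)

# flatten so that every token is one row
flat_log_probs = log_probs.reshape(-1, num_tags)
flat_labels = labels.reshape(-1)

# ignore_index makes nll_loss skip every position labelled -1
loss = F.nll_loss(flat_log_probs, flat_labels, ignore_index=-1)

# the same value computed by hand, averaging over the real tokens only
mask = flat_labels >= 0
picked = flat_log_probs[mask].gather(1, flat_labels[mask].unsqueeze(1))
manual = -picked.mean()
```

Both ```loss``` and ```manual``` average over the four real tokens only, so the amount of padding in a batch does not affect the result.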

With our data now batched and processed, we want to run it through our RNN the same way as when we trained a classifier. Note that this cell is incomplete and won't yet run; that's part of the assignment!

Q: Why is ```output_dim = num_classes + 1```?

In [1]:
#I think this means I have to check the previous assignment, I guess there is only one on NNs and that is the one
# CREATE MODEL
model = RNN(
    embedding_layer=embedding_layer, output_dim=num_classes + 1, hidden_dim_size=256
)

# FORWARD PASS
X = batch_input
y = model(X)

loss = model.loss_fn(outputs=y, labels=batch_labels)

# etc, etc

NameError: name 'RNN' is not defined

## 3. Creating an LSTM with ```pytorch```

In the file [LSTM.py](../src/LSTM.py), I've already created an LSTM for you using ```pytorch```. Take some time to read through the code and make sure you understand how it's built up.

Some questions for you to discuss in groups:

- How is an LSTM layer created using ```pytorch```? How does the code compare to the classifier code you wrote last week?
- What's going on with that weird bit that says ```@staticmethod```?
  - [This might help](https://realpython.com/instance-class-and-static-methods-demystified/).
- On the forward pass, we use ```log_softmax()``` to make output predictions. What is this, and how does it relate to the output from the sigmoid function that we used in the document classification?
- How would we make this LSTM model *bidirectional* - i.e. make it a Bi-LSTM? 
  - Hint: Check the documentation for the LSTM layer on the ```pytorch``` website.
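
As a starting point for the last two questions, here is a minimal sketch of a bidirectional LSTM tagger built from standard ```pytorch``` layers. All sizes and variable names are made up for illustration, and this is not the code in ```LSTM.py```; it only shows how ```bidirectional=True``` changes the shapes and what ```log_softmax()``` returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# toy sizes, chosen only for this example
vocab_size, emb_dim, hidden_dim, num_tags = 20, 8, 16, 5

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
# forward and backward hidden states are concatenated, hence 2 * hidden_dim
classifier = nn.Linear(2 * hidden_dim, num_tags)

X = torch.randint(0, vocab_size, (3, 7))  # (batch, seq_len) of token indices
hidden_states, _ = lstm(embedding(X))     # (3, 7, 2 * hidden_dim)
# log-probabilities over the tags for every token: (3, 7, num_tags)
log_probs = F.log_softmax(classifier(hidden_states), dim=-1)

print(hidden_states.shape, log_probs.shape)
```

Exponentiating ```log_probs``` gives a probability distribution over all tags for each token, which sums to 1. A sigmoid, by contrast, produces a single probability for one binary decision.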