## NER - Named Entity Recognition


### Module imports


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

# set random seeds to make this notebook easier to replicate
tf.keras.utils.set_random_seed(33)




### Exploring the data


We will be using a dataset from Kaggle.

The original data consists of four columns:

- the sentence number,
- the word,
- the part-of-speech (POS) tag of the word,
- and the entity tag.

A few tags are:

- geo: geographical entity
- org: organization
- per: person
- gpe: geopolitical entity
- tim: time indicator
- art: artifact
- eve: event
- nat: natural phenomenon
- O: filler word

The prefixes in the tags mean:

- I: Token is inside an entity.
- B: Token begins an entity.

**Example:**

**"Sharon flew to Miami on Friday"**

The tags would look like:

```
Sharon B-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

**"Sharon Floyd flew to Miami on Friday"**

```
Sharon B-per
Floyd  I-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```
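
To make the B/I scheme concrete, here is a minimal pure-Python sketch that groups tagged tokens back into entities. The helper name `extract_entities` is hypothetical and is not part of the notebook or the dataset tooling; it only illustrates how the prefixes are read.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; store the one in progress, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # The token continues the current entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = "Sharon Floyd flew to Miami on Friday".split()
tags = "B-per I-per O O B-geo O B-tim".split()
entities = extract_entities(tokens, tags)
print(entities)
```

For the second example sentence, "Sharon" and "Floyd" are merged into a single `per` entity, while "Miami" and "Friday" each form their own entity.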


In [2]:
# Original data

data = pd.read_csv("data/ner_dataset.csv", encoding="ISO-8859-1")
data.head(25)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


We are using a preprocessed version of the data, generated from the original data: one sentence per line in `sentences.txt` and the matching tags per line in `labels.txt`.
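
The notebook does not show how the preprocessed files were produced. The sketch below is one plausible way to do it, under the assumption that a row with a non-empty `Sentence #` value starts a new sentence. The function name `group_rows` and the tuple layout are hypothetical; the rows are copied from the table and the training labels shown in this notebook.

```python
# Each row is (sentence_marker, word, tag); the marker is None except on
# the first word of a sentence, mirroring the "Sentence #" column.
rows = [
    ("Sentence: 1", "Thousands", "O"),
    (None, "of", "O"),
    (None, "demonstrators", "O"),
    (None, "have", "O"),
    (None, "marched", "O"),
    (None, "through", "O"),
    (None, "London", "B-geo"),
    ("Sentence: 2", "Families", "O"),
    (None, "of", "O"),
    (None, "soldiers", "O"),
]


def group_rows(rows):
    """Turn word-level rows into parallel lists of sentence and label strings."""
    words_per_sentence, tags_per_sentence = [], []
    for marker, word, tag in rows:
        if marker is not None:
            # A marker opens a new sentence.
            words_per_sentence.append([])
            tags_per_sentence.append([])
        words_per_sentence[-1].append(word)
        tags_per_sentence[-1].append(tag)
    sentences = [" ".join(words) for words in words_per_sentence]
    labels = [" ".join(tags) for tags in tags_per_sentence]
    return sentences, labels


sentences, labels = group_rows(rows)
print(sentences[0])
print(labels[0])
```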


In [3]:
# Exploring the preprocessed data (loading from small group)

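# Read only the first line of each file; the context manager closes the file
with open('data/small/train/sentences.txt', 'r') as f:
    train_sents = f.readline()
with open('data/small/train/labels.txt', 'r') as f:
    train_labels = f.readline()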

print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)

print('ORIGINAL DATA:\n', data.head())

del (data, train_sents, train_labels)

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O


Loading the data for training, validation and testing.


In [4]:
def load_data(file_path):
    with open(file_path, 'r', encoding="ISO-8859-1") as file:
        data = np.array([line.strip() for line in file.readlines()])
    return data

Each split has a file of sentences and a matching file of labels.


In [5]:
# Loading the data from the large group

train_sentences = load_data('data/large/train/sentences.txt')
train_labels = load_data('data/large/train/labels.txt')

val_sentences = load_data('data/large/val/sentences.txt')
val_labels = load_data('data/large/val/labels.txt')

test_sentences = load_data('data/large/test/sentences.txt')
test_labels = load_data('data/large/test/labels.txt')

In [6]:
print(len(train_sentences))
print(len(val_sentences))
print(len(test_sentences))

33570
7194
7194


In [7]:
print(train_sentences[:5])
print(train_labels[:5])

['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'
 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "'
 'They marched from the Houses of Parliament to a rally in Hyde Park .'
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .'
 "The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton ."]
['O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O'
 'O O O O O O O O O O O O O O O O O O B-per O O O O O O O O O O O'
 'O O O O O O O O O O O B-geo I-geo O' 'O O O O O O O O O O O O O O O'
 'O O O O O O O O O O O B-geo O O B-org I-org O O O B-gpe O O O B-geo O']


In [8]:
print(val_sentences[:5])
print(val_labels[:5])

["Russia 's victory put the eight-time Olympic champions into the quarterfinals and also clinched a spot for Sweden ."
 'Slovakia advanced with a win over the United States ( 02-Jan ) on Saturday , leaving one remaining spot from Group-B .'
 'China has announced its sixth human bird flu death .'
 'Chinese health officials said Wednesday a 35-year-old woman from Sichuan province died last week from the H5N1 strain .'
 'Meanwhile , Turkish officials say bird flu killed an 11-year-old girl on her way to a hospital Wednesday .']
['B-geo O O O O O O O O O O O O O O O O B-org O'
 'B-geo O O O O O O B-geo I-geo O O O O B-tim O O O O O O B-art O'
 'B-org O O O O O O O O O'
 'B-gpe O O O B-tim O O O O B-geo O O O O O O B-nat B-tim O'
 'O O B-gpe O O O O O O O O O O O O O O B-tim O']


In [9]:
print(test_sentences[:5])
print(test_labels[:5])

['Argentina benefits from rich natural resources , a highly literate population , an export-oriented agricultural sector , and a diversified industrial base .'
 "Although one of the world 's wealthiest countries 100 years ago , Argentina suffered during most of the 20th century from recurring economic crises , persistent fiscal and current account deficits , high inflation , mounting external debt , and capital flight ."
 "A severe depression , growing public and external indebtedness , and a bank run culminated in 2001 in the most serious economic , social , and political crisis in the country 's turbulent history ."
 "Interim President Adolfo RODRIGUEZ SAA declared a default - the largest in history - on the government 's foreign debt in December of that year , and abruptly resigned only a few days after taking office ."
 "His successor , Eduardo DUHALDE , announced an end to the peso 's decade-long 1-to-1 peg to the US dollar in early 2002 ."]
['B-geo O O O O O O O O O O O O O O O O

### Encoding the data


#### Encoding the sentences


We will use `tf.keras.layers.TextVectorization` and explicitly pass `standardize = None`. The default is `standardize = 'lower_and_strip_punctuation'`, which lowercases the text and removes all punctuation.

That default would hurt the NER task: a capital letter in the middle of a sentence often signals an entity, and the punctuation tokens in this dataset carry labels too. The sentences are already tokenized, with every token (punctuation included) separated by whitespace, so the layer only needs to split on whitespace and map each token to an integer index.

`tf.keras.layers.TextVectorization` also pads the sentences to a common length. Padding does not affect the model's predictions for the real tokens, because padded positions are masked later on. Two indices are reserved:

- padding token: "", integer mapped: 0
- unknown token "[UNK]", integer mapped: 1
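
The index mapping can be illustrated without TensorFlow. The following is a pure-Python sketch of the same idea, not the Keras layer itself: `build_vocab` and `vectorize` are hypothetical helpers, and the ordering of equally frequent tokens here may differ from what `TextVectorization` produces.

```python
from collections import Counter


def build_vocab(sentences):
    """Vocabulary sorted by frequency, with '' (padding) and '[UNK]' first."""
    counts = Counter(token for s in sentences for token in s.split())
    return ["", "[UNK]"] + [token for token, _ in counts.most_common()]


def vectorize(sentences, vocab):
    """Map tokens to indices, use 1 for unknown tokens and 0 for padding."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [[index.get(token, 1) for token in s.split()] for s in sentences]
    max_len = max(len(row) for row in ids)
    return [row + [0] * (max_len - len(row)) for row in ids]


vocab = build_vocab(["London is big .", "Paris is big too ."])
vectorized = vectorize(["London is big .", "Berlin is big too ."], vocab)
print(vocab)
print(vectorized)
```

"Berlin" never appeared when the vocabulary was built, so it maps to index 1, and the shorter sentence is padded with 0 at the end. Case is preserved and the final "." keeps its own index, which mirrors the effect of `standardize = None`.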


We will wrap `tf.keras.layers.TextVectorization`, with the appropriate parameters, in a function that takes an array of sentences and returns the adapted sentence vectorizer together with its vocabulary list.


In [10]:
def get_sentence_vectorizer(sentences):
    """
    Create a TextVectorization layer and adapt it to the given sentences.

    Parameters:
    sentences (list of str): Sentences for vocabulary adaptation.

    Returns:
    sentence_vectorizer (tf.keras.layers.TextVectorization): TextVectorization layer for sentence tokenization.
    vocab (list of str): Extracted vocabulary.
    """

    # Setting the random seed
    tf.keras.utils.set_random_seed(33)

    # Define TextVectorization object with the appropriate standardize parameter
    sentence_vectorizer = tf.keras.layers.TextVectorization(standardize=None)

    # Adapt the sentence vectorization object to the given sentences
    sentence_vectorizer.adapt(sentences)

    # Get the vocabulary
    vocab = sentence_vectorizer.get_vocabulary()

    return sentence_vectorizer, vocab

In [27]:
# Using the function to get the adapted vectorizer and vocabulary on a subset of the training data

test_vectorizer, test_vocab = get_sentence_vectorizer(train_sentences[:1000])

print(f"Test vocab size: {len(test_vocab)}")

Test vocab size: 4650


In [28]:
print(test_vocab[:20])

['', '[UNK]', 'the', '.', ',', 'in', 'of', 'to', 'a', 'and', 'The', "'s", 'for', 'is', 'has', 'said', 'on', 'have', 'that', 'from']


In [29]:
# Testing the test vectorizer

sentence = "I like learning new NLP models !"

sentence_vectorized = test_vectorizer(sentence)

print(f"Sentence: {sentence}\nSentence vectorized: {sentence_vectorized}")

Sentence: I like learning new NLP models !
Sentence vectorized: [ 296  314    1   59    1    1 4649]


In [30]:
# Getting the sentence vectorizer adapted on the complete training data

sentence_vectorizer, vocab = get_sentence_vectorizer(train_sentences)

In [31]:
print(f"Vocab size: {len(vocab)}")

Vocab size: 29847


In [32]:
# Testing the complete vectorizer on the same example

sentence = "I like learning new NLP models !"

sentence_vectorized = sentence_vectorizer(sentence)

print(f"Sentence: {sentence}\nSentence vectorized: {sentence_vectorized}")

Sentence: I like learning new NLP models !
Sentence vectorized: [  654  1211  6896    69     1 11044  4126]


#### Encoding the labels


Encoding the labels is simpler than encoding the sentences, because there are only a few tags. We need one extra value to represent the padding added to shorter sentences; it is ignored in the loss and the accuracy, so it does not interfere with the task.

An UNK token makes no sense for labels, and the padding value for labels will not be 0. `TextVectorization` reserves both, so it is not a good fit here.

We will code our own label vectorizer.


In [14]:
print(f"Sentence: {train_sentences[0]}")
print(f"Labels: {train_labels[0]}")

Sentence: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Labels: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O


The following function extracts all the distinct tags from a given set of labels (here, the whole training set) and returns them as a sorted list of unique tags.


In [15]:
def get_tags(labels):
    tag_set = set()  # Define an empty set
    for el in labels:
        for tag in el.split(" "):
            tag_set.add(tag)
    tag_list = list(tag_set)
    tag_list.sort()
    return tag_list

In [16]:
tags = get_tags(train_labels)
print(len(tags))
print(tags)

17
['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim', 'O']


The following function generates a **tag map**, i.e., a mapping from each tag to a **non-negative** integer id (ids start at 0).


In [17]:
def make_tag_map(tags):
    tag_map = {}
    for i, tag in enumerate(tags):
        tag_map[tag] = i
    return tag_map

In [18]:
tag_map = make_tag_map(tags)
print(tag_map)

{'B-art': 0, 'B-eve': 1, 'B-geo': 2, 'B-gpe': 3, 'B-nat': 4, 'B-org': 5, 'B-per': 6, 'B-tim': 7, 'I-art': 8, 'I-eve': 9, 'I-geo': 10, 'I-gpe': 11, 'I-nat': 12, 'I-org': 13, 'I-per': 14, 'I-tim': 15, 'O': 16}
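
When inspecting predictions it is handy to go the other way, from ids back to tags. This is a small sketch, assuming the same 17 sorted tags printed above; `id_to_tag` and `decode` are hypothetical names that do not appear elsewhere in the notebook.

```python
tags = ['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim',
        'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim', 'O']

# Forward map (tag -> id), built the same way as make_tag_map
tag_map = {tag: i for i, tag in enumerate(tags)}

# Inverse map (id -> tag)
id_to_tag = {i: tag for tag, i in tag_map.items()}


def decode(label_ids, pad_value=-1):
    """Convert label ids back to tags, skipping padded positions."""
    return [id_to_tag[i] for i in label_ids if i != pad_value]


decoded = decode([16, 16, 2, 10, 16, -1, -1])
print(decoded)
```

The `pad_value=-1` default anticipates the padding value chosen for the labels in this notebook.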


#### Padding the labels


`TextVectorization` has already padded the sentences, so we must pad the labels to the same length. TensorFlow provides `tf.keras.utils.pad_sequences` for this.

We will pad the vectorized labels with the value -1 rather than 0, which simplifies loss masking and evaluation in later steps.

The reason is that the model classifies each token by applying a log softmax and taking the index with the largest value as the predicted label. Those indices start at 0, so 0 must remain a valid class id (here, `B-art`). Using 0 as the label mask value is possible, but it would require changes to the model architecture or to the loss computation.
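
The effect of post-padding with -1 can be sketched in plain Python. `pad_labels` below is a hypothetical stand-in that mimics what `tf.keras.utils.pad_sequences(..., padding='post', value=-1)` does for lists of equal or shorter length; it is not the library function.

```python
def pad_labels(label_ids, pad_value=-1):
    """Right-pad every row with pad_value up to the length of the longest row."""
    max_len = max(len(row) for row in label_ids)
    return [row + [pad_value] * (max_len - len(row)) for row in label_ids]


label_ids = [[16, 2, 10], [16, 16, 16, 16, 3]]
padded = pad_labels(label_ids)
print(padded)

# With 0 as the padding value, padded positions would be
# indistinguishable from the valid class id 0 ('B-art').
ambiguous = pad_labels([[0, 16], [16, 16, 16]], pad_value=0)
print(ambiguous)
```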


#### Label vectorizer


We will build the label vectorizer.

The function takes a list of label strings and a tag map, looks each tag up in the map, and returns the padded array of label ids.


In [41]:
def label_vectorizer(labels, tag_map, DEBUG_PRINT=False):
    """
    Convert list of label strings to padded label IDs using a tag mapping.

    Parameters:
    labels (list of str): List of label strings.
    tag_map (dict): Dictionary mapping tags to IDs.
    Returns:
    label_ids (numpy.ndarray): Padded array of label IDs.
    """
    label_ids = []  # It can't be a numpy array yet, since each sentence has a different size

    # Each element in labels is a string of space-separated tags, so for each of them:
    for element in labels:
        # Split it into single tags; str.split() with no argument splits on whitespace
        tokens = element.split()

        # DEBUG PRINT
        if DEBUG_PRINT == True:
            print("tokens", tokens, "\n")

        # Use the dictionary tag_map passed as an argument to the label_vectorizer function
        # to map each tag to its integer id.
        element_ids = []

        for token in tokens:
            # Tag map lookup
            # Appending the ids corresponding to the tokens in the label
            element_ids.append(tag_map[token])

        # DEBUG PRINT
        if DEBUG_PRINT == True:
            print("element_ids", element_ids, "\n")

        # Append the ids found for the current element to the label_ids list
        label_ids.append(element_ids)

    # DEBUG PRINT
    if DEBUG_PRINT == True:
        print("label_ids", label_ids, "\n")

    # Pad the elements
    label_ids = tf.keras.utils.pad_sequences(
        label_ids, padding='post', value=-1)

    # DEBUG PRINT
    if DEBUG_PRINT == True:
        print("label_ids", label_ids, "\n")

    return label_ids

Looking at a single example


In [36]:
print(f"Sentence: {train_sentences[2]}")
print(f"Labels: {train_labels[2]}")

Sentence: They marched from the Houses of Parliament to a rally in Hyde Park .
Labels: O O O O O O O O O O O B-geo I-geo O


In [37]:
print(f"Vectorized labels: {label_vectorizer([train_labels[2]], tag_map, DEBUG_PRINT=True)}")

tokens ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O'] 

element_ids [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16] 

label_ids [[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16]] 

label_ids [[16 16 16 16 16 16 16 16 16 16 16  2 10 16]] 

Vectorized labels: [[16 16 16 16 16 16 16 16 16 16 16  2 10 16]]


Looking at two examples together


In [38]:
print(f"Sentence: {train_sentences[2:4]}")

Sentence: ['They marched from the Houses of Parliament to a rally in Hyde Park .'
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .']


In [39]:
print(f"Labels: {train_labels[2:4]}")

Labels: ['O O O O O O O O O O O B-geo I-geo O' 'O O O O O O O O O O O O O O O']


In [40]:
print(f"Vectorized labels: {label_vectorizer(train_labels[2:4], tag_map, DEBUG_PRINT=True)}")

tokens ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O'] 

element_ids [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16] 

tokens ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'] 

element_ids [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16] 

label_ids [[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16], [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]] 

label_ids [[16 16 16 16 16 16 16 16 16 16 16  2 10 16 -1]
 [16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]] 

Vectorized labels: [[16 16 16 16 16 16 16 16 16 16 16  2 10 16 -1]
 [16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]]


### Building the dataset


We will build the datasets for training, validation and testing.

We will use the `tf.data.Dataset` class, which provides an efficient input pipeline (batching, shuffling, prefetching) for feeding data into a TensorFlow model.

We will use `tf.data.Dataset.from_tensor_slices`, which slices its input along the first dimension, so each row becomes one dataset element.

We can pass a tuple `(sentences, labels)` and TensorFlow will pair each sentence with its respective labels. Both arrays must therefore have the same length along the first dimension.


In [42]:
def generate_dataset(sentences, labels, sentence_vectorizer, tag_map):
    sentences_ids = sentence_vectorizer(sentences)
    labels_ids = label_vectorizer(labels, tag_map=tag_map)
    dataset = tf.data.Dataset.from_tensor_slices((sentences_ids, labels_ids))
    return dataset

In [43]:
train_dataset = generate_dataset(
    train_sentences, train_labels, sentence_vectorizer, tag_map)
val_dataset = generate_dataset(
    val_sentences, val_labels,  sentence_vectorizer, tag_map)
test_dataset = generate_dataset(
    test_sentences, test_labels,  sentence_vectorizer, tag_map)

In [47]:
# Exploring information about the training data
# The number of vocabulary tokens (including <PAD>)

g_vocab_size = len(vocab)

print(f"Num of vocabulary words in the training set: {g_vocab_size}")

print()

print('The training size is', len(train_dataset))
print('The validation size is', len(val_dataset))

print()

print('An example of the first sentence is\n\t',
      next(iter(train_dataset))[0].numpy())
print('An example of its corresponding label is\n\t',
      next(iter(train_dataset))[1].numpy())

Num of vocabulary words in the training set: 29847

The training size is 33570
The validation size is 7194

An example of the first sentence is
	 [1046    6 1121   18 1832  232  543    7  528    2  158    5   60    9
  648    2  922    6  192   87   22   16   54    3    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
An example of its corresponding label is
	 [16 16 16 16 16 16  2 16 16 16 16 16  2 16 16 16 16 16  3 16 16 16 16 16
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

#### About RNNs and LSTMs inputs

TensorFlow's RNN layers (LSTMs in particular) accept input sentences of variable length, but not within the same batch: all input tensors in a batch must have the same shape.

The amount of padding should not influence the final result, so it does not matter whether we pad each batch separately or pad the entire dataset at once, as we did above.
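
For comparison, this is a pure-Python sketch of the per-batch alternative, where every batch is padded only up to its own longest sentence. The notebook itself pads the whole dataset at once; `pad_batch` and `make_batches` are hypothetical helpers used only to illustrate the idea.

```python
def pad_batch(batch, pad_value=0):
    """Pad all sequences in one batch to the length of its longest sequence."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def make_batches(sequences, batch_size):
    """Split the sequences into consecutive batches and pad each one separately."""
    return [pad_batch(sequences[start:start + batch_size])
            for start in range(0, len(sequences), batch_size)]


sequences = [[5, 2], [7, 3, 9, 4], [8], [6, 1, 2]]
padded_batches = make_batches(sequences, batch_size=2)
for batch in padded_batches:
    print(batch)
```

Within each batch every row has the same length, but the two batches have different widths (4 and 3), which is what variable-length input across batches means.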


### Model building


The inputs are sentences represented as tensors, fed to a model with:

- an Embedding layer,
- an LSTM layer,
- a Dense layer,
- a log softmax layer.

An LSTM layer can output only its last value for each sentence, or every value along the sentence. We want the latter.

We need every output because the goal is to label each token in the sentence, not to predict the next token or to classify the sentence as a whole.

So when we input a single sentence, such as `[452, 3400, 123, 0, 0, 0]`, the expected output has one array per word ID, each of length equal to the number of tags. Each of these arrays is obtained by applying the LogSoftmax function to the `len(tags)` values the Dense layer produces for that token.

In the case of the example array with a shape of `(6,)`, the output should be an array with a shape of `(6, len(tags))`.

In our case, each sentence in the training set is padded to 104 values, so for a batch of, say, 64 sentences, the model should take a tensor of shape `(64, 104)` and output a tensor of shape `(64, 104, 17)`.


#### About tensorflow layers


**`tf.keras.Sequential`**

- This combinator applies layers serially (by function composition). It is a tensorflow model object.

---

**`tf.keras.layers.Embedding`**

- Initializes the embedding layer `Embedding(input_dim, output_dim, mask_zero = False)`
- `input_dim` is the expected range of integers for each tensor in the batch. Note that the `input_dim` is not related to array size, but to the possible range of integers expected in the input. Usually this is the vocabulary size, but it may differ by 1, depending on further parameters.
- `output_dim` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example). Each word processed will be assigned an array of size `output_dim`. So if one array of shape (3,) is passed (example of such an array `[100,203,204]`), then the Embedding layer should have output shape (3,output_dim).
- `mask_zero` is a boolean telling whether 0 is a mask value or not. If `mask_zero = True`, two things follow:
  - The value 0 is reserved as the mask value and cannot stand for a real token; positions holding it are ignored in training.
  - `input_dim` should be the vocabulary size plus 1, to make room for the reserved index 0.
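
Conceptually, an embedding layer is a lookup table with `input_dim` rows of length `output_dim`. The sketch below imitates that lookup and the zero mask in plain Python; `make_embedding_table` and `embed` are hypothetical helpers, the sizes are toy values, and the random initial values are arbitrary, so this is not how Keras initializes or stores its weights.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Create a table with one row of output_dim random values per index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(token_ids, table):
    """Look up one vector per token id and flag which positions are real tokens."""
    vectors = [table[i] for i in token_ids]
    mask = [i != 0 for i in token_ids]  # index 0 is the padding/mask value
    return vectors, mask


table = make_embedding_table(input_dim=8, output_dim=4)
vectors, mask = embed([5, 2, 3, 0, 0], table)
print(len(vectors), len(vectors[0]))
print(mask)
```

An input of shape `(5,)` yields an output of shape `(5, output_dim)`, and the mask marks the two padded positions as `False` so that later layers can skip them.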

---

**`tf.keras.layers.LSTM`**

- `LSTM(units, return_sequences)` Builds an LSTM layer with hidden state and cell sizes equal to `units`.
- `units`: The dimensionality of the LSTM hidden state (and of its output at each time step). In this case, we set `units` to the Embedding `output_dim`. This is just a choice; nothing prevents choosing a different number of LSTM units.
- `return_sequences`: A boolean, telling whether we want to return every output value from the LSTM cells. If `return_sequences = False`, then the LSTM output shape will be `(batch_size, units)`. Otherwise, it is `(batch_size, sentence_length, units)`, since there will be an output for each word in the sentence.

---

**`tf.keras.layers.Dense`**

- `Dense(units, activation)`
- `units`: It is the number of units chosen for this dense layer, i.e., it is the dimensionality of the output space. In this case, each value passed through the Dense layer must be mapped into a vector with length `num_of_classes` (in this case, `len(tags)`).
- `activation`: The activation applied after computing the values in the Dense layer. Since the Dense layer comes right before the LogSoftmax step, we can pass the LogSoftmax function as the activation here. Its implementation lives under `tf.nn`, so we can refer to it as `tf.nn.log_softmax`.
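
To see what the log softmax does to the Dense layer's scores for one token, here is a pure-Python sketch of the formula `log_softmax(x)_i = x_i - log(sum_j exp(x_j))`. It is only an illustration with a hypothetical helper and three toy scores; the model itself uses `tf.nn.log_softmax`.

```python
import math


def log_softmax(scores):
    """Numerically stable log softmax of a list of scores."""
    highest = max(scores)
    log_sum = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_sum for s in scores]


scores = [0.1, 0.6, 0.3]
log_probs = log_softmax(scores)

# Exponentiating the outputs gives probabilities that sum to 1
probs_sum = sum(math.exp(v) for v in log_probs)

# The predicted class is the index of the largest value
predicted = max(range(len(log_probs)), key=log_probs.__getitem__)

print(log_probs)
print(probs_sum, predicted)
```

The outputs are all negative (they are logarithms of probabilities), and the ordering of the scores is preserved, so the argmax over the log softmax output equals the argmax over the raw scores.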


In [48]:
def NER(len_tags, vocab_size, embedding_dim=50):
    """
    Create a Named Entity Recognition (NER) model.

    Parameters:
    len_tags (int): Number of NER tags (output classes).
    vocab_size (int): Vocabulary size.
    embedding_dim (int, optional): Dimension of embedding and LSTM layers (default is 50).

    Returns:
    model (Sequential): NER model.
    """

    model = tf.keras.Sequential(name='sequential')

    # Add the tf.keras.layers.Embedding layer. Do not forget to mask out the zeros!
    model.add(tf.keras.layers.Embedding(
        input_dim=vocab_size + 1, output_dim=embedding_dim, mask_zero=True))

    # Add the LSTM layer. Make sure we are passing the right dimension (defined in the docstring above)
    # and returning every output for the tf.keras.layers.LSTM layer and not the very last one.
    model.add(tf.keras.layers.LSTM(units=embedding_dim, return_sequences=True))

    # Add the final tf.keras.layers.Dense with the appropriate activation function.
    # We must pass the function itself, not the result of calling it:
    # tf.nn.log_softmax instead of tf.nn.log_softmax().
    model.add(tf.keras.layers.Dense(
        units=len_tags, activation=tf.nn.log_softmax))

    return model

### Masked loss and metrics


**Custom Accuracy Function**

Before training the model, we need to write our own function to compute the accuracy. TensorFlow has built-in accuracy metrics, but they do not let us specify values to ignore, and we must exclude the padded positions from the calculation.

The metric that takes true labels and predicted labels and returns the fraction of positions where they match is usually called `accuracy`.

In some cases, there is one more step before getting the predicted labels: the model outputs a vector of scores or probabilities per position rather than a label.

In that case, an `argmax` over each prediction is needed to find the predicted label. This situation is very common, so TensorFlow provides metrics and losses with the prefix `Sparse` that take integer labels and handle this step internally.

Unfortunately, the built-in accuracy does not offer a way to ignore values. This is what we will work on now.

Note that the model's prediction has 3 axes:

- the number of examples (batch size)
- the number of words in each example (padded to be as long as the longest sentence in the batch)
- the number of possible targets (the 17 named entity tags).


**Custom Loss Function**

Another important function is the loss function. We will use the cross-entropy loss in its multiclass form and, since our labels are integer ids, in its `Sparse` version.

TensorFlow provides it as `tf.keras.losses.SparseCategoricalCrossentropy`.

The arguments we will need:

- `from_logits`: This indicates whether the predictions are raw scores or normalized values (probabilities). The last layer of the model ends with a LogSoftmax call, so its outputs are log-probabilities, **not** probabilities: they do not lie between 0 and 1. We therefore set `from_logits=True`.
- `ignore_class`: This indicates which class should be ignored when computing the cross-entropy. Remember that the padding value for the labels is -1.


#### Custom masked loss function

In [49]:
def masked_loss(y_true, y_pred):
    """
    Calculate the masked sparse categorical cross-entropy loss.

    Parameters:
    y_true (tensor): True labels.
    y_pred (tensor): Predicted logits.

    Returns:
    loss (tensor): Calculated loss.
    """

    # Calculate the loss for each item in the batch. Remember to pass the right arguments, as discussed above!
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, ignore_class=-1)

    # Use the previous defined function to compute the loss
    loss = loss_fn(y_true, y_pred)

    return loss

In [50]:
# Trying out the custom loss function

true_labels = [0, 1, 2, 0]

predicted_logits = [[0.1, 0.6, 0.3],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.5, 0.4],
                    [0.4, 0.4, 0.2]]

print(masked_loss(true_labels, predicted_logits))

tf.Tensor(1.0508584, shape=(), dtype=float32)
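
The value above can be reproduced by hand. The following pure-Python sketch computes the cross-entropy from logits and skips the ignored label, assuming that the mean is taken over the non-ignored positions only (which is consistent with the padding check later in this notebook). `masked_sparse_cross_entropy` is a hypothetical helper, not the Keras implementation.

```python
import math


def masked_sparse_cross_entropy(y_true, logits, ignore_class=-1):
    """Mean of -log softmax(logits)[label] over positions whose label is not ignored."""
    losses = []
    for label, row in zip(y_true, logits):
        if label == ignore_class:
            continue  # padded position: contributes nothing
        highest = max(row)
        log_sum = highest + math.log(sum(math.exp(v - highest) for v in row))
        losses.append(log_sum - row[label])
    return sum(losses) / max(len(losses), 1)


true_labels = [0, 1, 2, 0]
predicted_logits = [[0.1, 0.6, 0.3],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.5, 0.4],
                    [0.4, 0.4, 0.2]]

loss = masked_sparse_cross_entropy(true_labels, predicted_logits)

# Appending a padded position (label -1) leaves the loss unchanged
loss_padded = masked_sparse_cross_entropy(
    true_labels + [-1], predicted_logits + [[0.3, 0.3, 0.4]])

print(round(loss, 6), round(loss_padded, 6))
```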


#### Custom masked accuracy function

We will make a masked version of the accuracy function.

We need to perform an argmax to get the predicted label for each element in the batch, making sure to pass the appropriate axis.

Furthermore, we must use only TensorFlow operations.

NumPy has every function we need, but a function passed as a loss or a metric must be built from TensorFlow operations so that TensorFlow can trace and optimize it during fitting.

The following TensorFlow functions are the ones we need:

- `tf.equal`, equivalent to `np.equal`
- `tf.cast`, equivalent to `ndarray.astype`
- `tf.reduce_sum`, equivalent to `np.sum`
- `tf.math.argmax`, equivalent to `np.argmax`
- We may need `tf.float32` while casting


In [51]:
def masked_accuracy(y_true, y_pred):
    """
    Calculate masked accuracy for predicted labels.

    Parameters:
    y_true (tensor): True labels.
    y_pred (tensor): Predicted logits.

    Returns:
    accuracy (tensor): Masked accuracy.

    """

    # Cast the true labels to float so they can be compared with the predictions.
    # Since we will make divisions, it is safe to use tf.float32 data type.
    y_true = tf.cast(y_true, tf.float32)

    # Create the mask, i.e., the values that will be ignored
    mask = tf.not_equal(y_true, -1)
    mask = tf.cast(mask, tf.float32)

    # Perform argmax to get the predicted values
    y_pred_class = tf.argmax(y_pred, axis=-1)
    y_pred_class = tf.cast(y_pred_class, tf.float32)

    # Compare the true values with the predicted ones
    matches_true_pred = tf.equal(y_true, y_pred_class)
    matches_true_pred = tf.cast(matches_true_pred, tf.float32)

    # Multiply the acc tensor with the masks
    matches_true_pred *= mask

    # Compute masked accuracy
    # quotient between the total matches and the total valid values, i.e., the amount of non-masked values
    masked_acc = tf.reduce_sum(matches_true_pred) / \
        tf.maximum(tf.reduce_sum(mask), 1)

    return masked_acc

In [52]:
true_labels = [0, 1, 2, 0]

predicted_logits = [[0.1, 0.6, 0.3],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.5, 0.4],
                    [0.4, 0.4, 0.2]]

print(masked_accuracy(true_labels, predicted_logits))

tf.Tensor(0.5, shape=(), dtype=float32)
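
The example above contains no padded positions, so the mask has no visible effect. The pure-Python sketch below mirrors the same computation (the hypothetical helper `masked_accuracy_py` is not part of the notebook) and adds one padded label to show the difference from a plain accuracy.

```python
def masked_accuracy_py(y_true, y_pred, pad_value=-1):
    """Fraction of non-padded positions where the argmax matches the label."""
    matches, valid = 0, 0
    for label, scores in zip(y_true, y_pred):
        if label == pad_value:
            continue  # padded position: excluded from numerator and denominator
        predicted = max(range(len(scores)), key=scores.__getitem__)
        matches += int(predicted == label)
        valid += 1
    return matches / max(valid, 1)


true_labels = [0, 1, 2, 0]
predicted_logits = [[0.1, 0.6, 0.3],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.5, 0.4],
                    [0.4, 0.4, 0.2]]

acc = masked_accuracy_py(true_labels, predicted_logits)

# One extra padded position: the masked accuracy does not change
acc_padded = masked_accuracy_py(true_labels + [-1],
                                predicted_logits + [[0.9, 0.05, 0.05]])

# Counting the padded position as a miss would lower the result to 2 / 5
naive = 2 / 5

print(acc, acc_padded, naive)
```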


### The model

In [53]:
model = NER(len(tag_map), len(vocab))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 50)          1492400   
                                                                 
 lstm (LSTM)                 (None, None, 50)          20200     
                                                                 
 dense (Dense)               (None, None, 17)          867       
                                                                 
Total params: 1513467 (5.77 MB)
Trainable params: 1513467 (5.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
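
The parameter counts in the summary can be checked with a little arithmetic. This sketch assumes the standard layouts: one embedding row per index, four LSTM gates each with input weights, recurrent weights and a bias, and a Dense layer with a bias.

```python
vocab_size = 29847   # len(vocab)
embedding_dim = 50   # Embedding output_dim
units = 50           # LSTM units (set equal to embedding_dim)
n_tags = 17          # len(tag_map)

# Embedding: one row of size embedding_dim for each of vocab_size + 1 indices
embedding_params = (vocab_size + 1) * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (embedding_dim * units + units * units + units)

# Dense: weight matrix plus bias
dense_params = units * n_tags + n_tags

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The three numbers match the `Param #` column of the summary, and their sum matches the reported total.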


### Note on padding

We will now check that padding does not affect the model's output.

The output dimension does change: if ten zeros are added at the end of the input tensor, the output has 10 more elements (more specifically, 10 more arrays of length 17 each).

However, those extra elements are excluded from every later calculation, so they have no impact on the model's training or performance.

In [54]:
# Expanding dims is needed to pass it to the model, 
# since it expects batches and not single prediction arrays

x = tf.expand_dims(np.array([545, 467, 896]), axis = 0)
    
x_padded = tf.expand_dims(np.array([545, 467, 896, 0, 0, 0]), axis = 0)

In [55]:
pred_x = model(x)
pred_x_padded = model(x_padded)

print(f'x shape: {pred_x.shape}\nx_padded shape: {pred_x_padded.shape}')

x shape: (1, 3, 17)
x_padded shape: (1, 6, 17)


If the last three time steps of `pred_x_padded` are removed, `pred_x` and `pred_x_padded[:, :3]` must have the same elements.

In [56]:
np.allclose(pred_x, pred_x_padded[:, :3])

True

Now one last check: let's see that `pred_x` and `pred_x_padded` give the same loss and accuracy values.

For that, we need the arrays `y_true` and `y_true_padded`.

In [57]:
y_true = tf.expand_dims([16, 6, 12], axis = 0)

# Remember we mapped the padded values to -1 in the labels
y_true_padded = tf.expand_dims([16,6,12,-1,-1,-1], axis = 0) 

print(f"masked_loss is the same: {np.allclose(masked_loss(y_true,pred_x), masked_loss(y_true_padded,pred_x_padded))}")

print(f"masked_accuracy is the same: {np.allclose(masked_accuracy(y_true,pred_x), masked_accuracy(y_true_padded,pred_x_padded))}")

masked_loss is the same: True
masked_accuracy is the same: True


### Model compilation

We will compile the model as follows:

- Use the Adam optimizer with learning rate 0.01,
- Use `masked_loss` as the loss function,
- Use `masked_accuracy` as the evaluation metric (the loss value is reported during training as well).

In [58]:
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss=masked_loss,
              metrics=[masked_accuracy])