## NER - Named Entity Recognition

### Module imports

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

# set random seeds to make this notebook easier to replicate
tf.keras.utils.set_random_seed(33)




### Exploring the data

We will be using a dataset from Kaggle.

The original data consists of four columns: 
- the sentence number, 
- the word, 
- the part of speech of the word,
- and the entity tag.

A few tags are: 
- geo: geographical entity
- org: organization
- per: person 
- gpe: geopolitical entity
- tim: time indicator
- art: artifact
- eve: event
- nat: natural phenomenon
- O: filler word (the token is outside any entity)

The prefixes in the tags mean:
- I: Token is inside an entity.
- B: Token begins an entity.

**Example:**

**"Sharon flew to Miami on Friday"**

The tags would look like:

```
Sharon B-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

**"Sharon Floyd flew to Miami on Friday"**

```
Sharon B-per
Floyd  I-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```
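To make the B/I convention concrete, here is a minimal sketch that groups tagged tokens back into entities. It assumes the tags follow the scheme described above; `extract_entities` is a helper name introduced here for illustration and is not part of the dataset or of any library.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = "Sharon Floyd flew to Miami on Friday".split()
tags = "B-per I-per O O B-geo O B-tim".split()
entities = extract_entities(tokens, tags)
print(entities)
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart, which a plain per-token type label could not do.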

In [5]:
# Original data

data = pd.read_csv("data/ner_dataset.csv", encoding = "ISO-8859-1")
data.head(25)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


We are using a preprocessed version of the data, generated from the original data, with one sentence per line and a matching line of tags.

In [6]:
# Exploring the preprocessed data (loading from small group)

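# Read only the first line of each file; the context managers close the files
with open('data/small/train/sentences.txt', 'r') as f:
    train_sents = f.readline()
with open('data/small/train/labels.txt', 'r') as f:
    train_labels = f.readline()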

print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)

print('ORIGINAL DATA:\n', data.head())

del(data, train_sents, train_labels)

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O


Loading the data for training, validation, and testing

In [12]:
def load_data(file_path):
    with open(file_path,'r', encoding = "ISO-8859-1") as file:
        data = np.array([line.strip() for line in file.readlines()])
    return data

Data is divided into sentences and labels for each set.

In [13]:
# Loading the data from large group

train_sentences = load_data('data/large/train/sentences.txt')
train_labels = load_data('data/large/train/labels.txt')

val_sentences = load_data('data/large/val/sentences.txt')
val_labels = load_data('data/large/val/labels.txt')

test_sentences = load_data('data/large/test/sentences.txt')
test_labels = load_data('data/large/test/labels.txt')

In [19]:
print(len(train_sentences))
print(len(val_sentences))
print(len(test_sentences))

33570
7194
7194


In [15]:
print(train_sentences[:5])
print(train_labels[:5])

['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'
 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "'
 'They marched from the Houses of Parliament to a rally in Hyde Park .'
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .'
 "The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton ."]
['O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O'
 'O O O O O O O O O O O O O O O O O O B-per O O O O O O O O O O O'
 'O O O O O O O O O O O B-geo I-geo O' 'O O O O O O O O O O O O O O O'
 'O O O O O O O O O O O B-geo O O B-org I-org O O O B-gpe O O O B-geo O']


In [16]:
print(val_sentences[:5])
print(val_labels[:5])

["Russia 's victory put the eight-time Olympic champions into the quarterfinals and also clinched a spot for Sweden ."
 'Slovakia advanced with a win over the United States ( 02-Jan ) on Saturday , leaving one remaining spot from Group-B .'
 'China has announced its sixth human bird flu death .'
 'Chinese health officials said Wednesday a 35-year-old woman from Sichuan province died last week from the H5N1 strain .'
 'Meanwhile , Turkish officials say bird flu killed an 11-year-old girl on her way to a hospital Wednesday .']
['B-geo O O O O O O O O O O O O O O O O B-org O'
 'B-geo O O O O O O B-geo I-geo O O O O B-tim O O O O O O B-art O'
 'B-org O O O O O O O O O'
 'B-gpe O O O B-tim O O O O B-geo O O O O O O B-nat B-tim O'
 'O O B-gpe O O O O O O O O O O O O O O B-tim O']


In [17]:
print(test_sentences[:5])
print(test_labels[:5])

['Argentina benefits from rich natural resources , a highly literate population , an export-oriented agricultural sector , and a diversified industrial base .'
 "Although one of the world 's wealthiest countries 100 years ago , Argentina suffered during most of the 20th century from recurring economic crises , persistent fiscal and current account deficits , high inflation , mounting external debt , and capital flight ."
 "A severe depression , growing public and external indebtedness , and a bank run culminated in 2001 in the most serious economic , social , and political crisis in the country 's turbulent history ."
 "Interim President Adolfo RODRIGUEZ SAA declared a default - the largest in history - on the government 's foreign debt in December of that year , and abruptly resigned only a few days after taking office ."
 "His successor , Eduardo DUHALDE , announced an end to the peso 's decade-long 1-to-1 peg to the US dollar in early 2002 ."]
['B-geo O O O O O O O O O O O O O O O O

### Encoding the data

#### Encoding the sentences

We will use `tf.keras.layers.TextVectorization` and explicitly pass `standardize=None`. The default is `standardize='lower_and_strip_punctuation'`.

With the default, the layer would lowercase the text and strip all punctuation. That can hurt the NER task, since a capitalized word in the middle of a sentence may indicate an entity. The sentences in the dataset are already tokenized: all tokens, including punctuation, are separated by whitespace, and the punctuation tokens are labeled too. So each sentence is simply split into single tokens, and each token is mapped to a positive integer.

`tf.keras.layers.TextVectorization` also pads the sentences to a common length. The padding will not affect the model's output, since padded positions are masked in later steps.

- padding token: "", integer mapped: 0
- unknown token "[UNK]", integer mapped: 1
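To make the mapping concrete, here is a simplified pure-Python imitation of what the layer does with `standardize=None`: split on whitespace, reserve 0 for padding and 1 for `[UNK]`, give more frequent tokens smaller ids, and pad to the longest sentence. It is a sketch for intuition only, not the Keras implementation; `build_vocab` and `vectorize` are hypothetical helper names and the tiny corpus is made up.

```python
from collections import Counter

def build_vocab(sentences):
    counts = Counter(tok for s in sentences for tok in s.split())
    # Ids 0 and 1 are reserved; the remaining tokens are ordered by frequency
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def vectorize(sentences, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    # Tokens missing from the vocabulary fall back to id 1 ([UNK])
    ids = [[index.get(tok, 1) for tok in s.split()] for s in sentences]
    max_len = max(len(row) for row in ids)
    # Pad shorter sentences on the right with 0
    return [row + [0] * (max_len - len(row)) for row in ids]

corpus = ["The war in Iraq .", "The protest in London ."]
vocab = build_vocab(corpus)
vectorized = vectorize(["The war in Paris", "London ."], vocab)
print(vocab)
print(vectorized)
```

Note that `The` keeps its capital letter and `.` stays a token of its own, which is the behavior wanted for NER.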

We will use the object `tf.keras.layers.TextVectorization` and the appropriate parameters to build a function that inputs an array of sentences and outputs an adapted sentence vectorizer and its vocabulary list.

In [18]:
def get_sentence_vectorizer(sentences):
    """
    Parameters:
    sentences (list of str): Sentences for vocabulary adaptation.

    Returns:
    sentence_vectorizer (tf.keras.layers.TextVectorization): TextVectorization layer for sentence tokenization.
    vocab (list of str): Extracted vocabulary.
    """
    # Setting the random seed
    tf.keras.utils.set_random_seed(33)

    # Define TextVectorization object with the appropriate standardize parameter
    sentence_vectorizer = tf.keras.layers.TextVectorization(standardize=None)

    # Adapt the sentence vectorization object to the given sentences
    sentence_vectorizer.adapt(sentences)
    
    # Get the vocabulary
    vocab = sentence_vectorizer.get_vocabulary()
 
    return sentence_vectorizer, vocab

In [20]:
# Using the function to get the adapted vectorizer and vocabulary

test_vectorizer, test_vocab = get_sentence_vectorizer(train_sentences)

print(f"Test vocab size: {len(test_vocab)}")



Test vocab size: 29847


In [23]:
print(test_vocab[:20])

['', '[UNK]', 'the', '.', ',', 'in', 'of', 'to', 'a', 'and', 'The', "'s", 'for', 'has', 'is', 'on', 'that', 'with', 'have', 'said']


In [24]:
# Testing the vectorizer

sentence = "I like learning new NLP models !"

sentence_vectorized = test_vectorizer(sentence)

print(f"Sentence: {sentence}\nSentence vectorized: {sentence_vectorized}")

Sentence: I like learning new NLP models !
Sentence vectorized: [  654  1211  6896    69     1 11044  4126]


#### Encoding the labels

Encoding the labels is a bit simpler than encoding the sentences, because there are only a few tags. One extra value is needed to represent the padding that shorter sentences receive. Padding will not interfere with this task.

An UNK token makes no sense for labels, and the padding value for the labels will be a number other than 0. So `TextVectorization` is not a good choice here.

We will code our own label vectorizer.

In [25]:
print(f"Sentence: {train_sentences[0]}")
print(f"Labels: {train_labels[0]}")

Sentence: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Labels: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O


The following function extracts all the tags that occur in a given set of labels (here, the whole training set) and returns the unique ones as a sorted list.

In [26]:
def get_tags(labels):
    tag_set = set() # Define an empty set
    for el in labels:
        for tag in el.split(" "):
            tag_set.add(tag)
    tag_list = list(tag_set) 
    tag_list.sort()
    return tag_list

In [28]:
tags = get_tags(train_labels)
print(len(tags))
print(tags)

17
['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim', 'O']


The following function generates a **tag map**, i.e., a mapping between the tags and **non-negative** integers (the ids start at 0).

In [29]:
def make_tag_map(tags):
    tag_map = {}
    for i,tag in enumerate(tags):
        tag_map[tag] = i 
    return tag_map

In [30]:
tag_map = make_tag_map(tags)
print(tag_map)

{'B-art': 0, 'B-eve': 1, 'B-geo': 2, 'B-gpe': 3, 'B-nat': 4, 'B-org': 5, 'B-per': 6, 'B-tim': 7, 'I-art': 8, 'I-eve': 9, 'I-geo': 10, 'I-gpe': 11, 'I-nat': 12, 'I-org': 13, 'I-per': 14, 'I-tim': 15, 'O': 16}
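When inspecting ids later, it can help to go back from ids to tags. A minimal sketch, assuming a tag map built by enumerating a sorted tag list as above; the shortened tag list and the names `id_to_tag` and `decode` are introduced here for illustration only.

```python
# Shortened tag list for the example (the real list has 17 tags)
example_tags = ['B-geo', 'B-per', 'I-geo', 'I-per', 'O']
example_tag_map = {tag: i for i, tag in enumerate(example_tags)}

# Invert the mapping: id -> tag
id_to_tag = {i: tag for tag, i in example_tag_map.items()}

def decode(label_ids, id_to_tag):
    return [id_to_tag[i] for i in label_ids]

decoded = decode([4, 4, 0, 2, 4], id_to_tag)
print(decoded)
```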


#### Padding the labels

`TextVectorization` has already padded the sentences, so we must ensure that the labels are padded to match. TensorFlow has a built-in function for padding: `tf.keras.utils.pad_sequences`.

We will pad the vectorized labels with the value -1 rather than 0, which simplifies loss masking and evaluation in later steps.

The reason is that, to classify a token, a log softmax transformation is applied and the index with the greatest value becomes the predicted label. Since indices start at 0, it is better to keep 0 as a valid label index. Using 0 as the mask value for labels is also possible, but it would require some tweaks in the model architecture or in the loss computation.
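To see why a dedicated padding value simplifies evaluation, consider an accuracy that simply skips every position labeled -1. This is a pure-Python sketch under the assumption that the prediction for a token is the argmax of its scores; `masked_accuracy` is a hypothetical helper with made-up scores, not the metric used by the model.

```python
def masked_accuracy(true_ids, scores, pad_value=-1):
    correct, total = 0, 0
    for true_row, score_row in zip(true_ids, scores):
        for true_id, token_scores in zip(true_row, score_row):
            if true_id == pad_value:
                continue  # padded position: ignored entirely
            # Argmax over the class scores of this token
            predicted = max(range(len(token_scores)), key=token_scores.__getitem__)
            correct += int(predicted == true_id)
            total += 1
    return correct / total if total else 0.0

# Two padded label rows; label 0 is a real class, -1 marks padding
true_ids = [[0, 2, -1], [1, -1, -1]]
scores = [
    [[0.9, 0.05, 0.05], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1]],
    [[0.6, 0.3, 0.1], [0.3, 0.3, 0.4], [0.3, 0.3, 0.4]],
]
accuracy = masked_accuracy(true_ids, scores)
print(accuracy)
```

Because -1 can never be an argmax index, a padded position cannot be mistaken for a real class, while label 0 is still counted like any other tag.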

#### Label vectorizer

We will build the label vectorizer.

This function inputs a list of labels and a tag mapping and outputs their respective label ids via a tag map lookup.

In [36]:
def label_vectorizer(labels, tag_map):
    """
    Convert list of label strings to padded label IDs using a tag mapping.

    Parameters:
    labels (list of str): List of label strings.
    tag_map (dict): Dictionary mapping tags to IDs.
    Returns:
    label_ids (numpy.ndarray): Padded array of label IDs.
    """
    label_ids = [] # It can't be a numpy array yet, since each sentence has a different size

    # Each element in labels is a string of tags, so for each of them:
    for element in labels:
        # Split it into single tags (split on whitespace)
        tokens = element.split()
        
        # DEBUG PRINT
        print(tokens,"\n")

        # Use the dictionary tag_map passed as an argument to the label_vectorizer function
        # to make the correspondence between tags and numbers. 
        element_ids = []

        for token in tokens:
            # Tag map lookup
            # Appending the ids corresponding to the tokens in the label
            element_ids.append(tag_map[token])
        
        # DEBUG PRINT
        print(element_ids,"\n")

        # Append the ids found for the current element to the label_ids list
        label_ids.append(element_ids)

    # DEBUG PRINT
    print(label_ids,"\n")
        
    # Pad the elements
    label_ids = tf.keras.utils.pad_sequences(label_ids, padding='post', value=-1)
    
    # DEBUG PRINT
    print(label_ids,"\n")

    return label_ids

Looking at a single example

In [37]:
print(f"Sentence: {train_sentences[2]}")
print(f"Labels: {train_labels[2]}")

Sentence: They marched from the Houses of Parliament to a rally in Hyde Park .
Labels: O O O O O O O O O O O B-geo I-geo O


In [38]:
print(f"Vectorized labels: {label_vectorizer([train_labels[2]], tag_map)}")

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O'] 

[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16] 

[[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 2, 10, 16]] 

[[16 16 16 16 16 16 16 16 16 16 16  2 10 16]] 

Vectorized labels: [[16 16 16 16 16 16 16 16 16 16 16  2 10 16]]


Looking at two examples together

In [40]:
print(f"Sentence: {train_sentences[2:4]}")

Sentence: ['They marched from the Houses of Parliament to a rally in Hyde Park .'
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 .']


In [42]:
print(f"Labels: {train_labels[2:4]}")

Labels: ['O O O O O O O O O O O B-geo I-geo O' 'O O O O O O O O O O O O O O O']


In [43]:
print(f"Vectorized labels: {label_vectorizer([train_labels[2:4]], tag_map)}")

AttributeError: 'numpy.ndarray' object has no attribute 'split'

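The call fails because `train_labels[2:4]` is already an array of label strings: wrapping it in another list hands the function a NumPy array where it expects a string, and arrays have no `.split` method. Passing the slice directly, as in `label_vectorizer(train_labels[2:4], tag_map)`, lets the function process both label strings and pad them to the same length.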
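For reference, the lookup-and-pad logic handles any number of label strings as long as it receives a flat sequence of strings. The sketch below reproduces that logic in plain Python (without TensorFlow or the debug prints) on the two label strings shown above. It uses only the three tag ids needed here, and `vectorize_labels` is a hypothetical name.

```python
def vectorize_labels(labels, tag_map, pad_value=-1):
    # One list of ids per label string
    ids = [[tag_map[tag] for tag in label.split()] for label in labels]
    max_len = max(len(row) for row in ids)
    # Pad on the right ("post" padding) with the padding value
    return [row + [pad_value] * (max_len - len(row)) for row in ids]

# Only the entries needed for this example, with the same ids as the full tag map
small_tag_map = {'B-geo': 2, 'I-geo': 10, 'O': 16}
labels = [
    'O O O O O O O O O O O B-geo I-geo O',  # 14 tags
    ' '.join(['O'] * 15),                   # 15 tags
]
padded = vectorize_labels(labels, small_tag_map)
for row in padded:
    print(row)
```

The shorter row receives a single trailing -1 so that both rows have length 15.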

### Building the dataset

We will build the dataset for training, validation and testing. 

We will use the `tf.data.Dataset` class, which provides an optimized way to feed data into a TensorFlow model. It can load and batch the data incrementally instead of holding everything in memory at once, which helps training run efficiently.