# Named Entity Recognition with TensorFlow and LSTMs

## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import ast

from namedentityrecognition import NamedEntityRecognition

random_state = 42

## About Dataset

### Dataset
We are using the `Named Entity Recognition (NER) Corpus` dataset by `NASER AL-QAYDEH`, available [here](https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus) on Kaggle.

### Task
Named Entity Recognition (NER) is the task of classifying the entities in a text into categories such as names of persons, locations, organizations, etc.

### Dataset Format
Each row in the CSV file contains a complete sentence, a list of POS tags (one per word in the sentence), and a list of NER tags (one per word in the sentence).

We will be using Pandas to read and manipulate this dataset.

### Acknowledgements by Dataset Author
This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.

The Annotated Corpus for Named Entity Recognition is built on the GMB (Groningen Meaning Bank) corpus and is annotated for entity classification, with enhanced and popular features produced by natural language processing applied to the data set.

### Essential Info About Entities in the Dataset

* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon
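
In the `Tag` column these entity types appear with a `B-` or `I-` prefix (for example `B-geo`, `I-geo`), while `O` marks tokens outside any entity. The following is a minimal sketch of how the full label set could be enumerated, assuming every entity type occurs with both prefixes (an assumption, not something checked against the data here). The helper name `build_tag_set` is hypothetical.

```python
# Entity types from the list above.
ENTITY_TYPES = ["geo", "org", "per", "gpe", "tim", "art", "eve", "nat"]


def build_tag_set(entity_types):
    """Return 'O' plus a B- and an I- tag for each entity type."""
    tags = ["O"]  # token outside any entity
    for entity in entity_types:
        tags.append(f"B-{entity}")  # first token of an entity
        tags.append(f"I-{entity}")  # continuation token of the same entity
    return tags


tags = build_tag_set(ENTITY_TYPES)
print(len(tags))  # 17
print(tags[:5])   # ['O', 'B-geo', 'I-geo', 'B-org', 'I-org']
```

Under that assumption the model has to choose among 17 labels for every token.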

## Reading and Preprocessing the Data

In [2]:
df = pd.read_csv("./data/ner.csv")
df.head()

| | Sentence # | Sentence | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands of demonstrators have marched throug... | ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'... | ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '... |
| 1 | Sentence: 2 | Families of soldiers killed in the conflict jo... | ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 2 | Sentence: 3 | They marched from the Houses of Parliament to ... | ['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 3 | Sentence: 4 | Police put the number of marchers at 10,000 wh... | ['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 4 | Sentence: 5 | The protest comes on the eve of the annual con... | ['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |


Dropping the `POS` and `Sentence #` columns:

In [3]:
data = df.drop(columns=["Sentence #", "POS"])  # `columns=` already implies axis=1
data.head()

| | Sentence | Tag |
|---|---|---|
| 0 | Thousands of demonstrators have marched throug... | ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '... |
| 1 | Families of soldiers killed in the conflict jo... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 2 | They marched from the Houses of Parliament to ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 3 | Police put the number of marchers at 10,000 wh... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 4 | The protest comes on the eve of the annual con... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |


Number of rows

In [4]:
data.shape[0]

47959

Let's split the data into training, validation, and test sets (70% / 15% / 15%):

In [5]:
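train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15

# First hold out the test set (15% of all rows).
train_val_df, test_df = train_test_split(data, test_size=test_ratio, shuffle=True, random_state=random_state)
# Then split off the validation set; the ratio is rescaled because it applies to the remaining 85%.
train_df, val_df = train_test_split(train_val_df, test_size=val_ratio/(1-test_ratio), shuffle=True, random_state=random_state)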
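
To see why the second split uses `val_ratio/(1-test_ratio)` rather than `val_ratio`, the arithmetic can be worked through by hand. This is a rough sketch in plain Python, assuming the held-out share is rounded up to a whole number of rows; the helper `split_sizes` is hypothetical and only mimics the size computation, not the shuffling.

```python
import math

n_rows = 47959  # row count reported above
test_ratio = 0.15
val_ratio = 0.15


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up (assumed)."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


# Step 1: hold out 15% of all rows for testing.
n_train_val, n_test = split_sizes(n_rows, test_ratio)
# Step 2: 0.15 / 0.85 of the remainder equals 15% of the original data.
n_train, n_val = split_sizes(n_train_val, val_ratio / (1 - test_ratio))

print(n_train, n_val, n_test)    # 33571 7194 7194
print(math.ceil(n_train / 64))   # 525
```

With a batch size of 64, 33,571 training sentences give 525 steps per epoch, which agrees with the `525/525` shown in the training output.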

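Here is a helper function that extracts the sentences and labels from a DataFrame. The `Tag` column stores each label list as a string, so `ast.literal_eval` is used to parse it back into a Python list: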

In [6]:
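def extract_data(df):
    """Return the sentences and their parsed label lists from a DataFrame."""
    sentences = df["Sentence"].to_list()
    # Each tag cell is a string such as "['O', 'B-geo']"; parse it into a list.
    labels = [ast.literal_eval(label) for label in df["Tag"].to_list()]

    return sentences, labels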
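
As a small illustration of what the helper does to a single row, the sketch below parses one tag string and checks that it lines up with the tokens of its sentence. The sentence and tag string are made-up examples, not rows from the dataset, and the check assumes sentences are tokenized by single spaces.

```python
import ast

# Made-up example row: the tag cell is a string, not a list.
raw_tags = "['O', 'B-geo', 'I-geo', 'O']"
sentence = "Visit New York today"

tags = ast.literal_eval(raw_tags)  # safely parses the Python literal
tokens = sentence.split(" ")       # assumes space-separated tokens

print(type(raw_tags).__name__, "->", type(tags).__name__)  # str -> list
print(len(tokens) == len(tags))                            # True
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is a safer choice for parsing data read from a file.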

In [7]:
train_sentences, train_labels = extract_data(train_df)
val_sentences, val_labels = extract_data(val_df)
test_sentences, test_labels = extract_data(test_df)

## Model Initialization and Training

In [8]:
ner = NamedEntityRecognition(
    embedding_dim=50,
    num_lstm_layers=2,
    bidirectional_lstms=True,
    random_state=random_state,
)

In [9]:
ner.fit(
    sentences=train_sentences,
    labels=train_labels,
    epochs=1,
    validation_data=[val_sentences, val_labels],
    batch_size=64,
)

525/525 ━━━━━━━━━━━━━━━━━━━━ 84s 145ms/step - _masked_accuracy: 0.9013 - loss: 0.4207 - val__masked_accuracy: 0.9661 - val_loss: 0.1111

Model Summary:


<namedentityrecognition.NamedEntityRecognition at 0x25c47d2ae90>
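
The metric in the training log is named `_masked_accuracy`. The internals of `NamedEntityRecognition` are not shown in this notebook, so the following is only a sketch of the general idea such a name usually refers to: accuracy computed over real tokens while skipping padded positions. The function, the integer label ids, and the padding id `-1` are all assumptions made for illustration.

```python
def masked_accuracy(true_ids, pred_ids, pad_id=-1):
    """Fraction of correct labels, ignoring positions where the true id is padding."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(true_ids, pred_ids):
        for true_label, pred_label in zip(true_seq, pred_seq):
            if true_label == pad_id:
                continue  # padded position: not counted at all
            total += 1
            if true_label == pred_label:
                correct += 1
    return correct / total if total else 0.0


# Two padded sequences of length 5 (hypothetical label ids, -1 = padding).
true_ids = [[0, 1, 2, -1, -1], [0, 0, 3, 4, -1]]
pred_ids = [[0, 1, 0, 0, 0], [0, 0, 3, 4, 0]]

print(round(masked_accuracy(true_ids, pred_ids), 4))  # 0.8571
```

Only 7 of the 10 positions are real tokens, and 6 of those are predicted correctly, so padding neither inflates nor deflates the score.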

## Model Testing

If we provide both sentences and labels to the `predict` method of the `NamedEntityRecognition` class, it prints the model's accuracy on that data and returns the predictions for the sentences.

In [10]:
test_predictions = ner.predict(test_sentences, test_labels)

The model's accuracy on the provided test set is: 0.9668


Let's compare the true and predicted labels for one of the sentences. The predictions are truncated to the length of the true labels before printing. Change `sentence_id` to look at other sentences.

In [11]:
sentence_id = 2

print(test_labels[sentence_id])
print(test_predictions[sentence_id][:len(test_labels[sentence_id])])

['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O', 'O', 'B-tim', 'O', 'O']
['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-geo', 'O', 'O', 'B-tim', 'O', 'O']
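
Spotting the difference between two long label lists by eye is error-prone. A small sketch like the one below, using the two lists printed above, reports only the positions where they disagree. The variable names are chosen for illustration.

```python
# True and predicted labels copied from the output above.
true_labels = ['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O', 'O', 'B-tim', 'O', 'O']
pred_labels = ['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-geo', 'O', 'O', 'B-tim', 'O', 'O']

# Collect (position, true, predicted) wherever the two lists differ.
mismatches = [
    (position, true, predicted)
    for position, (true, predicted) in enumerate(zip(true_labels, pred_labels))
    if true != predicted
]

for position, true, predicted in mismatches:
    print(f"position {position}: true={true}, predicted={predicted}")
print(f"{len(mismatches)} mismatch(es) out of {len(true_labels)} tokens")
```

For this sentence the only disagreement is the second `B-geo`, which the model tagged as `O` while still predicting the `I-geo` that follows it.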


We can also use the `predict` method on a single string or on a list of strings. In the first case it returns a list of predicted labels; in the second it returns a list of lists, one for each sentence in the input list.

In [12]:
sentence = "Peter Parker , the White House director of trade and manufacturing policy of U.S , said in an interview on Sunday morning that the White House was working to prepare for the possibility of a second wave of the coronavirus in the fall , though he said it wouldn ’t necessarily come"
predictions = ner.predict(sentence)

for token, label in zip(sentence.split(' '), predictions):
    if label != 'O':
        print(token, label)

Peter B-per
Parker I-per
White B-org
House I-org
Sunday B-tim
morning I-tim
White B-org
House I-org
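
The output above lists one token per line. To read it as whole entities, the `B-` and `I-` tags can be merged into spans. This is a minimal sketch of one way to do that; `group_entities` is a hypothetical helper, not part of the `NamedEntityRecognition` class, and the token list is a shortened version of the sentence above.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Store the span collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)  # continue the open span
        elif prefix in ("B", "I"):
            # Start a new span; a stray I- tag is treated like a B- tag.
            flush()
            current_tokens, current_type = [token], entity_type
        else:
            # 'O' closes any open span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Shortened, illustrative version of the sentence and predictions above.
tokens = ["Peter", "Parker", ",", "the", "White", "House", "on", "Sunday", "morning"]
tags = ["B-per", "I-per", "O", "O", "B-org", "I-org", "O", "B-tim", "I-tim"]

print(group_entities(tokens, tags))
# [('Peter Parker', 'per'), ('White House', 'org'), ('Sunday morning', 'tim')]
```

Treating a stray `I-` tag as the start of a span is a design choice that keeps partially wrong predictions, such as an `I-geo` without its `B-geo`, visible instead of silently dropping them.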
