## Named Entity Recognition and Linking

***Introduction***

In this notebook, we show how to analyze a text to find the mentions of named entities in it and link those entities to their corresponding Wikipedia pages. The notebook is intended as a starting point: it shows examples using two pre-trained models, so that other models can later be swapped in for different analyses of the text.

We will start by installing and importing the required libraries. We will be using the Transformers library and pre-trained models hosted by Hugging Face.

First, we install Transformers.

In [None]:
!pip install transformers seqeval[gpu]

Now we import the libraries we need.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
import helper_functions as hf
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForSeq2SeqLM
from transformers import pipeline

Now we load the models. For Named Entity Recognition, we use "dslim/bert-base-NER", a BERT base model fine-tuned for NER, and for linking we use Facebook's "facebook/mgenre-wiki".

In [None]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
modelNER = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizerNEL = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
modelNEL = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

We now extract the named entities. We first define a default text: the helper function fetches it from the Discovery API using the record ID provided.

In [None]:
record_id = "C17023786"
inp_id = input("Please enter the ID of the record you want to analyse. Leave it blank to use the default record: ").strip()
if inp_id != "":
  record_id = inp_id
sentence = hf.populate_texts(record_id)
print(sentence)

In the form below, enter a text you want to analyze. Leave it blank and press Enter if you want to use the text above.

In [None]:
inp_text = input("Please enter a text you want to analyse. Leave it blank to use the default text: ")
print(inp_text)

In [None]:
if inp_text != "":
  sentence = inp_text
print(sentence)

Now we pass the sentence through our model and print the extracted named entities. Each entity is labelled "PER", "ORG", "LOC", or "MISC" according to its type.

In [None]:
nlp = pipeline("ner", model=modelNER, tokenizer=tokenizer)

ner_results = nlp(sentence)
named_entities = hf.getNE(ner_results)
print(named_entities)
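
The helper `hf.getNE` is not shown in this notebook. As a rough sketch of what such a grouping step might look like, the function below merges the token-level predictions returned by the pipeline into whole entities. It assumes each prediction is a dict with `entity`, `start` and `end` keys (the format the `ner` pipeline returns when no aggregation is requested) and that labels follow the B-/I- scheme. The name `group_entities` and the example data are hypothetical and are not part of `helper_functions`.

```python
def group_entities(text, ner_results):
    """Merge token-level NER predictions into (entity_text, label) pairs.

    Assumes B-/I- style labels and character offsets in each prediction.
    """
    spans = []  # each item is [start, end, label]
    for tok in ner_results:
        prefix, _, label = tok["entity"].partition("-")
        continues = (
            spans
            and prefix == "I"
            and spans[-1][2] == label
            and tok["start"] - spans[-1][1] <= 1  # adjacent, or one space apart
        )
        if continues:
            spans[-1][1] = tok["end"]  # extend the current entity
        else:
            spans.append([tok["start"], tok["end"], label])
    return [(text[s:e], label) for s, e, label in spans]


# Hand-written predictions in the pipeline's output format (not real model output).
example = "Angela Merkel visited Paris"
fake_results = [
    {"entity": "B-PER", "word": "Angela", "start": 0, "end": 6},
    {"entity": "I-PER", "word": "Me", "start": 7, "end": 9},
    {"entity": "I-PER", "word": "##rkel", "start": 9, "end": 13},
    {"entity": "B-LOC", "word": "Paris", "start": 22, "end": 27},
]
grouped = group_entities(example, fake_results)
print(grouped)
```

Slicing the original text by character offsets, rather than joining the `word` fields, avoids having to strip the `##` prefixes that mark word pieces.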

If you have used the default example, you can see that the model has extracted "Konni ZIL" instead of "Konni ZILLIACUS". This is because the tokenizer breaks words into smaller tokens, and the model failed to predict that the token "IACUS" is part of the name.
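
One possible workaround for truncated entities, sketched here as a post-processing heuristic and not something the model or the helper functions do, is to extend the predicted span to the end of the word it stops in. The function name `extend_to_word_end` and the example text are made up for illustration, and the heuristic can over-extend when a predicted entity legitimately ends mid-word.

```python
def extend_to_word_end(text, start, end):
    """Grow a character span so that it ends at the end of the current word."""
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[start:end]


# Illustrative text and offsets, not taken from the record above.
demo_text = "Letter from Konni ZILLIACUS to the committee"
truncated_start = demo_text.index("Konni")
truncated_end = truncated_start + len("Konni ZIL")

truncated = demo_text[truncated_start:truncated_end]
repaired = extend_to_word_end(demo_text, truncated_start, truncated_end)
print(truncated)
print(repaired)
```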

Now we predict the links. Passing the sentence to the linking model returns a list of candidate Wikipedia page titles.

In [None]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentence, return_tensors="pt"),
    num_beams=10,
    num_return_sequences=10,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)

Now, if we want the list for a specific named entity, we pass the sentence with that entity wrapped in [START] and [END] tags. This notebook calls this step padding (it is unrelated to the token padding used to bring sequences to equal length). For this example, we will pad the named entity of a person.
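
To make the tagging step concrete, here is a minimal sketch of how one mention could be wrapped in the markers. It is not the implementation of `hf.get_pad`, which is not shown here; `mark_mention` is a hypothetical name, and the sketch only handles the first occurrence of the mention.

```python
def mark_mention(text, mention):
    """Wrap the first occurrence of `mention` in [START] and [END] markers."""
    start = text.find(mention)
    if start == -1:
        return text  # mention not found: leave the text unchanged
    end = start + len(mention)
    return f"{text[:start]}[START] {mention} [END]{text[end:]}"


marked = mark_mention("Einstein was a German physicist.", "Einstein")
print(marked)
```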

In [None]:
import re
sentence_pad = hf.get_pad(sentence, named_entities)
print(sentence_pad)

You can also pad the text yourself according to your needs. Leave the input blank if you want to use the padding above.

In [None]:
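# Show the padded sentence, then ask for an optional replacement.
text = input(f"{sentence_pad}\nEnter your own padded text, or leave it blank to keep the text above: ")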

In [None]:
print(text)
if text != "":
  sentence_pad = text

In [None]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentence_pad, return_tensors="pt"),
    num_beams=3,
    num_return_sequences=3,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)
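
To go from the decoded candidates to actual pages, the strings can be turned into Wikipedia URLs. The sketch below assumes each candidate follows the "Title >> language code" pattern shown in the examples on the mGENRE model card; if your output looks different, the parsing will need adjusting. `candidate_to_url` is a hypothetical helper, and the candidate strings are written by hand, not produced by the model.

```python
from urllib.parse import quote


def candidate_to_url(candidate):
    """Split a 'Title >> lang' string and build a Wikipedia URL from it."""
    title, sep, lang = candidate.rpartition(" >> ")
    if not sep:
        return candidate, None, None  # unexpected format: no URL can be built
    title, lang = title.strip(), lang.strip()
    # Wikipedia page URLs use underscores in place of spaces.
    url = f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"
    return title, lang, url


# Hand-written candidates in the assumed format.
for cand in ["Albert Einstein >> it", "Einstein >> en"]:
    print(candidate_to_url(cand))
```

In the notebook, the same function could be applied to each string returned by `tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)`.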