## Named Entity Recognition and Linking

***Introduction***

In this notebook, we show how to analyze a text to find mentions of named entities and link each one to its corresponding Wikipedia page. The notebook is intended as a starting point: it demonstrates two pre-trained models, and the same approach can be adapted to other models for different kinds of text analysis.

We start by installing and importing the required libraries. We will use the Transformers library and pre-trained models hosted on Hugging Face.

First, we install Transformers and seqeval.

In [2]:
!pip install transformers seqeval[gpu]

Collecting seqeval[gpu]
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=522d8eb439c4a6e04ef8258afbed674170ee86e20dc14d85a80c804f4ff5daba
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


Now we import the required libraries.

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
import helper_functions as hf
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForSeq2SeqLM
from transformers import pipeline

Now we load the models. For Named Entity Recognition, we use the BERT-based model `dslim/bert-base-NER`; for entity linking, we use Facebook's mGENRE model, `facebook/mgenre-wiki`.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
modelNER = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizerNEL = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
modelNEL = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We now extract the named entities. First we define a default record ID; the helper function `hf.populate_texts` fetches the text of that record from the Discovery API. You can enter a different record ID at the prompt, or leave it blank to use the default.

In [5]:
record_id = "C17023786"
inp_id = input("Please enter the id of the record you want to analyse. Leave it blank if you want to use the default text")
if inp_id != "":
  record_id = inp_id
sentence = hf.populate_texts(record_id)
print(sentence)

Please enter the id of the record you want to analyse. Leave it blank if you want to use the default text
  Konni ZILLIACUS: British. A Russian Finn by birth, he came to notice through his Communist associations while working for the League of Nations in 1932. He was elected Labour MP for Gateshead in the 1945 General Election, but was expelled from the party in 1949 for persistently attacking party policy 


You can also enter your own text in the form below. To keep the text above, leave the input box blank and press Enter.

In [9]:
inp_text = input("Please enter a text you want to analyse. Leave it blank if you want to use the default text")
print(inp_text)

Please enter a text you want to analyse. Leave it blank if you want to use the default text



In [10]:
if inp_text != "":
  sentence = inp_text
print(sentence)

  Konni ZILLIACUS: British. A Russian Finn by birth, he came to notice through his Communist associations while working for the League of Nations in 1932. He was elected Labour MP for Gateshead in the 1945 General Election, but was expelled from the party in 1949 for persistently attacking party policy 


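Now we pass the text through the NER model and print the extracted named entities. Each entity is labelled by type: "PER" (person), "ORG" (organization), "LOC" (location), or "MISC" (miscellaneous). In the output, the "B-" prefix marks the beginning of an entity.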

In [11]:
nlp = pipeline("ner", model=modelNER, tokenizer=tokenizer)

ner_results = nlp(sentence)
named_entities = hf.getNE(ner_results)
print(named_entities)

[('Konni ZIL', 'B-PER'), ('British', 'B-MISC'), ('Russian', 'B-MISC'), ('Finn', 'B-MISC'), ('Communist', 'B-ORG'), ('League of Nations', 'B-ORG'), ('Labour', 'B-LOC'), ('Gateshead', 'B-MISC'), ('General Election', 'B-MISC')]
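
The `(text, label)` pairs printed above can be grouped by entity type to make them easier to read. The following is a minimal sketch that assumes the tuple format shown in the output; `group_by_type` is a hypothetical helper written for illustration and is not part of `helper_functions`.

```python
from collections import defaultdict

def group_by_type(entities):
    """Group (text, label) pairs by entity type, dropping the B-/I- prefix."""
    grouped = defaultdict(list)
    for text, label in entities:
        # "B-PER" -> "PER"; a label without a prefix is kept unchanged
        entity_type = label.split("-", 1)[-1]
        grouped[entity_type].append(text)
    return dict(grouped)

# A few pairs copied from the output printed above
sample = [("Konni ZIL", "B-PER"), ("British", "B-MISC"), ("League of Nations", "B-ORG")]
print(group_by_type(sample))
```

With the real results, `group_by_type(named_entities)` should work the same way, provided `hf.getNE` keeps returning pairs in this form.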


If you used the default example, you will see that the model extracted "Konni ZIL" instead of "Konni ZILLIACUS". This is because the tokenizer splits words into smaller subword tokens, and the model did not predict that the remainder, "IACUS", is part of the name.
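
One way to work around a truncated entity is to merge the token-level predictions using their character offsets and then extend each span to the end of the word. The sketch below assumes that each token prediction is a dict with `"entity"`, `"start"` and `"end"` keys, which is the format the `ner` pipeline is expected to return without aggregation. The functions `extend_to_word_end` and `merge_tokens` are hypothetical and may differ from what `hf.getNE` does; the token list is hand-written for illustration, not real model output.

```python
def extend_to_word_end(text, start, end):
    """Extend the span [start, end) to the right until the current word ends."""
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[start:end]

def merge_tokens(text, tokens):
    """Merge token-level predictions into whole-word entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, etype = tok["entity"].partition("-")
        last = spans[-1] if spans else None
        # Continue the previous span if this is an I- tag of the same type
        # that starts at most one character (a space) after the previous token.
        if last and prefix == "I" and last["type"] == etype and tok["start"] - last["end"] <= 1:
            last["end"] = tok["end"]
        else:
            spans.append({"type": etype, "start": tok["start"], "end": tok["end"]})
    return [(extend_to_word_end(text, s["start"], s["end"]), s["type"]) for s in spans]

text = "Konni ZILLIACUS: British."
# Hand-written token predictions (assumed format), not real model output
tokens = [
    {"entity": "B-PER", "start": 0, "end": 5},     # "Konni"
    {"entity": "I-PER", "start": 6, "end": 9},     # "ZIL"
    {"entity": "B-MISC", "start": 17, "end": 24},  # "British"
]
print(merge_tokens(text, tokens))
```

The Transformers pipeline also accepts an `aggregation_strategy` argument (for example `"first"`), which groups subword tokens by word. Whether that recovers the full surname in this example would need to be checked by running it.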

Now we predict links for the text. Here the whole sentence is passed to the linking model without marking any particular entity, and the model returns a list of candidate Wikipedia page titles, each followed by its language code.

In [12]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentence, return_tensors="pt"),
    num_beams=10,
    num_return_sequences=10,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)



['Conservative Party (UK) >> en',
 'Member of Parliament (United Kingdom) >> en',
 'Gateshead (UK Parliament constituency) >> en',
 'Labour Party (UK) >> en',
 'Politics of the United Kingdom >> en',
 'House of Commons of the United Kingdom >> en',
 'Member of parliament >> en',
 'Parliament of the United Kingdom >> en',
 'United Kingdom >> en',
 'Conservative Party (United Kingdom) >> en']

To get candidates for a specific named entity, we mark that entity in the text with [START] and [END] tags. This notebook calls the step "padding" (not to be confused with padding token sequences to a common length). In this example, we mark the named entity of a person.

In [13]:
import re
sentence_pad = hf.get_pad(sentence, named_entities)
print(sentence_pad)

Konni ZILLIACUS
British
Russian
Finn
Communist
League of Nations
Labour
Gateshead
General Election
  [START]Konni ZILLIACUS[END]: British. A Russian Finn by birth, he came to notice through his Communist associations while working for the League of Nations in 1932. He was elected Labour MP for Gateshead in the 1945 General Election, but was expelled from the party in 1949 for persistently attacking party policy 
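
The marking step above can be sketched with a regular expression. This is only an illustration under assumptions: `mark_mention` is a hypothetical function, and the actual logic of `hf.get_pad` may differ. The match is extended to the end of the word so that a truncated mention such as "Konni ZIL" still covers the full surname.

```python
import re

def mark_mention(sentence, mention):
    """Wrap the first occurrence of `mention` in [START] ... [END] tags."""
    # \w* extends the match to the end of the word the mention stops in
    pattern = re.escape(mention) + r"\w*"
    match = re.search(pattern, sentence)
    if match is None:
        return sentence  # mention not found: leave the sentence unchanged
    return (
        sentence[:match.start()]
        + "[START]" + match.group(0) + "[END]"
        + sentence[match.end():]
    )

example = "Konni ZILLIACUS: British. A Russian Finn by birth."
print(mark_mention(example, "Konni ZIL"))
```

Only the first occurrence is marked here; marking a later occurrence, or several entities at once, would need a different approach.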


You can also mark a different entity yourself: type the text with your own [START] and [END] tags at the prompt, or leave it blank to keep the marking above.

In [14]:
text = input(sentence_pad)

  [START]Konni ZILLIACUS[END]: British. A Russian Finn by birth, he came to notice through his Communist associations while working for the League of Nations in 1932. He was elected Labour MP for Gateshead in the 1945 General Election, but was expelled from the party in 1949 for persistently attacking party policy 


In [15]:
print(text)
if text != "":
  sentence_pad = text




In [None]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentence_pad, return_tensors="pt"),
    num_beams=3,
    num_return_sequences=3,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)

['Konni Zilliacus >> en', 'Konni Zillacus >> en', 'Zilliacus >> en']
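
Each prediction has the form `Title >> language`, so it can be turned into a candidate Wikipedia URL. The sketch below assumes that output format and the usual `https://<lang>.wikipedia.org/wiki/<Title>` URL pattern; `to_wikipedia_url` is a hypothetical helper, not part of the notebook's helper module.

```python
from urllib.parse import quote

def to_wikipedia_url(prediction):
    """Turn a 'Title >> lang' prediction into a candidate Wikipedia URL."""
    title, sep, lang = prediction.rpartition(" >> ")
    if not sep:
        raise ValueError(f"Unexpected prediction format: {prediction!r}")
    # Wikipedia page URLs use underscores in place of spaces
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"

# Predictions copied from the output above
predictions = ['Konni Zilliacus >> en', 'Konni Zillacus >> en', 'Zilliacus >> en']
for prediction in predictions:
    print(to_wikipedia_url(prediction))
```

The model generates titles as text, so a candidate URL is not guaranteed to point to an existing page and should be checked before it is used.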