## Named Entity Recognition and Linking

***Introduction***

In this notebook, we show how to analyze a text to find the mentions of named entities in it and link those entities to their corresponding Wikipedia pages. The notebook is intended as a starting point: it shows examples that use two pretrained models, which can later be swapped for other models to run different analyses of a text.

We start by installing and importing the required libraries. We use the Transformers library and pretrained models hosted on the Hugging Face Hub.

First, we install Transformers together with seqeval.

In [1]:
!pip install transformers seqeval[gpu]

Collecting seqeval[gpu]
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=edb8fa6d9a44cf0d7bea42851931dbf9ae3662d2ade36dab93bdbfb7d1cb530e
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


Now we import the libraries we need.

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForSeq2SeqLM
from transformers import pipeline

Now we load the models. For named entity recognition we use "dslim/bert-base-NER", a BERT base model fine-tuned for NER, and for entity linking we use Facebook's mGENRE model, "facebook/mgenre-wiki".

In [3]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
modelNER = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizerNEL = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
modelNEL = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()



Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Next, we extract the named entities. We start by defining the sentence to analyze.

In [4]:
#sentence = "Konni ZILLIACUS: British. A Russian Finn by birth, he came to notice through his Communist associations while working for the League of Nations in 1932. He was elected Labour MP for Gateshead in the 1945 General Election, but was expelled from the party in 1949 for persistently attacking party policy"
sentence = "Albert Einstein was born in Germany"

Now we pass the sentence through the NER model and print the extracted named entities.

In [5]:
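nlp = pipeline("ner", model=modelNER, tokenizer=tokenizer)

ner_results = nlp(sentence)
seq = ""
for token in ner_results:
  if token["entity"].startswith("B"):
    # A B- tag starts a new entity, so print the one collected so far.
    if seq:
      print(seq)
    seq = token["word"]
  elif token["entity"].startswith("I"):
    if token["word"].startswith("##"):
      # WordPiece continuation: attach it to the previous piece without a space.
      seq += token["word"][2:]
    else:
      seq = (seq + " " + token["word"]).strip()
# Print the last entity, if any was found.
if seq:
  print(seq)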

Albert Einstein
Germany
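
The grouping logic above can also be packaged as a small function that returns the entities instead of printing them. The following is a minimal sketch, not part of the original notebook: `group_entities` is a hypothetical helper name, and it assumes each item is a dict with the `entity` and `word` keys seen in the `ner` pipeline output. The example tokens are written by hand to mimic that output; they are not real model predictions.

```python
def group_entities(ner_results):
    """Merge token-level B-/I- predictions into whole entity strings.

    Hypothetical helper: assumes each item has "entity" (e.g. "B-PER")
    and "word" keys, as in the pipeline output above.
    """
    entities = []
    current = ""
    for token in ner_results:
        tag, word = token["entity"], token["word"]
        if word.startswith("##") and current:
            # WordPiece continuation: glue it to the previous piece.
            current += word[2:]
        elif tag.startswith("B"):
            # A B- tag opens a new entity; store the one in progress.
            if current:
                entities.append(current)
            current = word
        elif tag.startswith("I"):
            current = f"{current} {word}".strip()
    if current:
        entities.append(current)
    return entities


# Hand-written tokens that mimic the pipeline output (not real model output).
example_tokens = [
    {"entity": "B-PER", "word": "Albert"},
    {"entity": "I-PER", "word": "Einstein"},
    {"entity": "B-LOC", "word": "Germany"},
]
print(group_entities(example_tokens))  # ['Albert Einstein', 'Germany']
```

In the notebook, the same function could be called as `group_entities(ner_results)`.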


Now we predict the links. Here we pass the whole sentence to the linking model without marking any particular mention. It returns a list of candidate Wikipedia page titles, each followed by a language code (here `en`).

In [6]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentence, return_tensors="pt"),
    num_beams=6,
    num_return_sequences=3,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)



['Einstein family >> en', 'Albert Einstein >> en', 'Einsteinism >> en']
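
Since the goal is to link entities to Wikipedia pages, the decoded strings can be turned into URLs. The sketch below assumes every candidate has the `Title >> lang` shape shown in the output above and that the page lives at the usual `https://<lang>.wikipedia.org/wiki/<Title>` address. The function names `parse_candidate` and `wikipedia_url` are hypothetical, and the code does not check that the page actually exists.

```python
from urllib.parse import quote


def parse_candidate(candidate):
    """Split a 'Title >> lang' string into (title, lang)."""
    title, _, lang = candidate.rpartition(" >> ")
    return title, lang


def wikipedia_url(candidate):
    """Build a Wikipedia URL from a 'Title >> lang' candidate string."""
    title, lang = parse_candidate(candidate)
    # Wikipedia page URLs use underscores in place of spaces.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


# Candidates copied from the output above.
candidates = ['Einstein family >> en', 'Albert Einstein >> en', 'Einsteinism >> en']
for candidate in candidates:
    print(wikipedia_url(candidate))
```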

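If we want the candidate list for one specific named entity, we wrap that entity in [START] and [END] tags before passing the sentence to the model. These tags mark the mention to be linked; this is different from padding, which in the tokenizer context means extending sequences to a common length.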

In [11]:
sentencePad = "Albert Einstein was born in [START]Germany[END]"

In [12]:
outputs = modelNEL.generate(
    **tokenizerNEL(sentencePad, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=3,
)

tokenizerNEL.batch_decode(outputs, skip_special_tokens=True)

['Germany >> en',
 'Weimar Republic >> en',
 'History of Germany (1945–1990) >> en']
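
Writing the tags by hand does not scale to many entities, so the tagged sentence can be built programmatically. The following is a minimal sketch under stated assumptions: `mark_mention` and `mark_first_occurrence` are hypothetical helper names, the tags are inserted without surrounding spaces to match the example above, and only the first occurrence of the mention is tagged.

```python
def mark_mention(text, start, end):
    """Wrap text[start:end] in [START] and [END] tags."""
    return f"{text[:start]}[START]{text[start:end]}[END]{text[end:]}"


def mark_first_occurrence(text, mention):
    """Tag the first occurrence of `mention`, or return None if it is absent."""
    start = text.find(mention)
    if start == -1:
        return None
    return mark_mention(text, start, start + len(mention))


sentence = "Albert Einstein was born in Germany"
print(mark_first_occurrence(sentence, "Germany"))
# Albert Einstein was born in [START]Germany[END]
print(mark_first_occurrence(sentence, "Albert Einstein"))
# [START]Albert Einstein[END] was born in Germany
```

Each tagged sentence can then be passed to `modelNEL.generate` in the same way as `sentencePad` above.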