**Named Entity Recognition (NER)**

Named Entity Recognition (NER) is the task of assigning each token in a text to a class, for example identifying a token as part of a person's name, an organisation or a location.

The example below uses a pipeline to perform named entity recognition, identifying each token as belonging to one of 9 classes:

**O**, Outside of a named entity

**B-MISC**, Beginning of a miscellaneous entity right after another miscellaneous entity

**I-MISC**, Miscellaneous entity

**B-PER**, Beginning of a person’s name right after another person’s name

**I-PER**, Person’s name

**B-ORG**, Beginning of an organisation right after another organisation

**I-ORG**, Organisation

**B-LOC**, Beginning of a location right after another location

**I-LOC**, Location
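
In this scheme a `B-` tag appears only when an entity starts directly after another entity of the same type, so most entities consist of `I-` tags alone. As a rough illustration of how such tags can be read back as entities, here is a minimal sketch in plain Python. The helper name `iob1_to_spans` and the example tag sequence are made up for illustration and are not part of the `transformers` library.

```python
def iob1_to_spans(tags):
    """Group a list of tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An open span ends on "O", on an explicit "B-" tag, or when the type changes.
        if current_type is not None and (
            tag == "O" or prefix == "B" or etype != current_type
        ):
            spans.append((current_type, start, i))
            current_type = None
        # Any non-"O" tag opens a new span if none is currently open.
        if tag != "O" and current_type is None:
            current_type, start = etype, i
    # Close a span that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hypothetical tag sequence: two person names back to back, then a location and an organisation.
example_tags = ["I-PER", "I-PER", "B-PER", "O", "I-LOC", "I-LOC", "I-ORG"]
print(iob1_to_spans(example_tags))
# [('PER', 0, 2), ('PER', 2, 3), ('LOC', 4, 6), ('ORG', 6, 7)]
```

The `B-PER` tag at position 2 is what separates the two adjacent person names; without it they would be read as a single entity.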

In [1]:
!pip install transformers


In [4]:
from transformers import pipeline

nlp = pipeline("ner")

sequence = "William Henry Gates III, dit Bill Gates [bɪl ɡeɪts]1, né le 28 octobre 1955 à Seattle  (État de Washington), est un informaticien, entrepreneur et milliardaire américain."\
           "l est connu comme le fondateur de Microsoft en 1975 et son principal actionnaire jusqu’en 2014"

print(nlp(sequence))

[{'entity': 'I-PER', 'score': 0.9833834, 'index': 1, 'word': 'William', 'start': 0, 'end': 7}, {'entity': 'I-PER', 'score': 0.9708729, 'index': 2, 'word': 'Henry', 'start': 8, 'end': 13}, {'entity': 'I-PER', 'score': 0.9845228, 'index': 3, 'word': 'Gates', 'start': 14, 'end': 19}, {'entity': 'I-PER', 'score': 0.7777629, 'index': 4, 'word': 'III', 'start': 20, 'end': 23}, {'entity': 'I-PER', 'score': 0.988081, 'index': 8, 'word': 'Bill', 'start': 29, 'end': 33}, {'entity': 'I-PER', 'score': 0.98613477, 'index': 9, 'word': 'Gates', 'start': 34, 'end': 39}, {'entity': 'I-LOC', 'score': 0.9994245, 'index': 31, 'word': 'Seattle', 'start': 78, 'end': 85}, {'entity': 'I-LOC', 'score': 0.88363516, 'index': 33, 'word': 'É', 'start': 88, 'end': 89}, {'entity': 'I-LOC', 'score': 0.978025, 'index': 34, 'word': '##tat', 'start': 89, 'end': 92}, {'entity': 'I-LOC', 'score': 0.9898726, 'index': 35, 'word': 'de', 'start': 93, 'end': 95}, {'entity': 'I-LOC', 'score': 0.99597615, 'index': 36, 'word': 'W
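
The pipeline returns one dictionary per token (or sub-token, such as `É` and `##tat`), with `start` and `end` giving character offsets into the input. One possible way to turn these into whole entities is to merge neighbouring tokens of the same type and slice the original string. The sketch below does this in plain Python on the first few predictions shown above. The helper `group_entities` is a hypothetical name, and taking the minimum score per group is an arbitrary choice made here. Recent versions of `transformers` also appear to offer built-in grouping through the pipeline's `aggregation_strategy` argument, which is probably preferable in practice.

```python
def group_entities(text, predictions):
    """Merge consecutive token predictions of the same type into entity spans."""
    groups = []
    for pred in predictions:
        etype = pred["entity"].split("-", 1)[-1]  # "I-PER" -> "PER"
        last = groups[-1] if groups else None
        continues_last = (
            last is not None
            and last["type"] == etype
            and pred["index"] == last["last_index"] + 1  # adjacent token
            and not pred["entity"].startswith("B-")      # "B-" starts a new entity
        )
        if continues_last:
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
            last["scores"].append(pred["score"])
        else:
            groups.append({
                "type": etype,
                "start": pred["start"],
                "end": pred["end"],
                "last_index": pred["index"],
                "scores": [pred["score"]],
            })
    # Slice the original text so sub-tokens are rejoined without "##" markers.
    return [
        {"type": g["type"], "text": text[g["start"]:g["end"]], "score": min(g["scores"])}
        for g in groups
    ]


# Beginning of the input sentence and the first predictions from the output above.
text = "William Henry Gates III, dit Bill Gates"
predictions = [
    {"entity": "I-PER", "score": 0.9833834, "index": 1, "word": "William", "start": 0, "end": 7},
    {"entity": "I-PER", "score": 0.9708729, "index": 2, "word": "Henry", "start": 8, "end": 13},
    {"entity": "I-PER", "score": 0.9845228, "index": 3, "word": "Gates", "start": 14, "end": 19},
    {"entity": "I-PER", "score": 0.7777629, "index": 4, "word": "III", "start": 20, "end": 23},
    {"entity": "I-PER", "score": 0.988081, "index": 8, "word": "Bill", "start": 29, "end": 33},
    {"entity": "I-PER", "score": 0.98613477, "index": 9, "word": "Gates", "start": 34, "end": 39},
]

for entity in group_entities(text, predictions):
    print(entity["type"], entity["text"])
```

Because the token indices jump from 4 to 8, "William Henry Gates III" and "Bill Gates" come out as two separate person entities.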

**Named entity recognition using a model and a tokenizer**

In [6]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

In [9]:


model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]


sequence = "William Henry Gates , dit Bill Gates, né le 28 octobre 1955 à Seattle  (État de Washington), est un informaticien, entrepreneur et milliardaire américain."\
           "l est connu comme le fondateur de Microsoft en 1975 et son principal actionnaire jusqu’en 2014"

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

[('[CLS]', 'O'), ('William', 'I-PER'), ('Henry', 'I-PER'), ('Gates', 'I-PER'), (',', 'O'), ('di', 'O'), ('##t', 'O'), ('Bill', 'I-PER'), ('Gates', 'I-PER'), (',', 'O'), ('n', 'O'), ('##é', 'O'), ('le', 'O'), ('28', 'O'), ('o', 'O'), ('##ct', 'O'), ('##ob', 'O'), ('##re', 'O'), ('1955', 'O'), ('à', 'O'), ('Seattle', 'I-LOC'), ('(', 'O'), ('É', 'I-LOC'), ('##tat', 'I-LOC'), ('de', 'I-LOC'), ('Washington', 'I-LOC'), (')', 'O'), (',', 'O'), ('est', 'O'), ('un', 'O'), ('inform', 'O'), ('##atic', 'O'), ('##ien', 'O'), (',', 'O'), ('entrepreneur', 'O'), ('et', 'O'), ('mill', 'O'), ('##iard', 'O'), ('##aire', 'O'), ('am', 'O'), ('##é', 'O'), ('##rica', 'I-MISC'), ('##in', 'O'), ('.', 'O'), ('l', 'O'), ('est', 'O'), ('con', 'O'), ('##nu', 'O'), ('com', 'O'), ('##me', 'O'), ('le', 'O'), ('fond', 'O'), ('##ate', 'O'), ('##ur', 'O'), ('de', 'O'), ('Microsoft', 'I-ORG'), ('en', 'O'), ('1975', 'O'), ('et', 'O'), ('son', 'O'), ('principal', 'O'), ('action', 'O'), ('##naire', 'O'), ('j', 'O'), ('##us'
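
The output above is at the WordPiece level: words the tokenizer does not know are split into pieces marked with `##` (for example `di` + `##t`), and each piece gets its own label. A minimal sketch of one way to get back to word-level labels is shown below. It assumes the common convention of keeping the label of the first piece, which is a choice made here and not something the notebook prescribes. The helper name `merge_wordpieces` is hypothetical, and the sample pairs are copied from the output above.

```python
def merge_wordpieces(pairs, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Rejoin '##' continuation pieces, keeping the label of the first piece."""
    words = []
    for token, label in pairs:
        if token in special_tokens:
            continue  # special tokens do not correspond to any input word
        if token.startswith("##") and words:
            # Append the piece (without the "##" marker) to the previous word.
            previous_word, previous_label = words[-1]
            words[-1] = (previous_word + token[2:], previous_label)
        else:
            words.append((token, label))
    return words


# A few (token, label) pairs taken from the output above.
pairs = [
    ("[CLS]", "O"), ("William", "I-PER"),
    ("di", "O"), ("##t", "O"),
    ("É", "I-LOC"), ("##tat", "I-LOC"),
    ("am", "O"), ("##é", "O"), ("##rica", "I-MISC"), ("##in", "O"),
]
print(merge_wordpieces(pairs))
# [('William', 'I-PER'), ('dit', 'O'), ('État', 'I-LOC'), ('américain', 'O')]
```

With the first-piece convention, the stray `I-MISC` prediction on `##rica` is discarded and `américain` is labelled `O`; other strategies, such as a majority vote over the pieces, would be equally reasonable.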