# Named-entity recognition (NER)

+ The goal of NER is to extract the spans of named entities from a text
+ The classic entity types are persons, locations, and organizations
+ In practice there are usually more types: dates, monetary amounts, brand names, etc.
+ Why? NER helps with problems of reference, referential choice and coreference, and metonymy, which are central to search, question-answering systems, text coherence, syntactic and morphological parsing, etc.
+ Difficulties:
    - homonymy: "Washington" -- city, state, last name, giraffe name, company name?
    - technical: what tags? where are the entity boundaries?
+ BIOES scheme: a prefix is added to the entity label (e.g. PER for persons or ORG for organizations) to mark the position of the token in the entity span:
    - B – beginning – the first token of a multi-token entity span;
    - I – inside – a token inside a multi-token span (neither first nor last);
    - O – outside – the token does not belong to any entity;
    - E – ending – the last token of a multi-token entity span;
    - S – single – the entity consists of one token.
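
The tagging rules above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `spans_to_bioes`, the example sentence, and the span format (token indices with an exclusive end) are assumptions made for this sketch, not part of any library used below.

```python
def spans_to_bioes(n_tokens, spans):
    """Convert entity spans to one BIOES tag per token.

    spans: list of (start, end, label) using token indices, end exclusive.
    Assumes spans do not overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # one-token entity
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Hypothetical example sentence, tokenized by hand
tokens = ["Ursula", "von", "der", "Leyen", "visited", "Brüssel", "."]
spans = [(0, 4, "PER"), (5, 6, "LOC")]

for token, tag in zip(tokens, spans_to_bioes(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

The four-token name gets `B-PER`, `I-PER`, `I-PER`, `E-PER`, while the one-token location gets `S-LOC`.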

In [4]:
ru_text = '''Вторая нитка "Северного потока - 2" заполнена газом, и теперь газопровод полностью готов к эксплуатации. Об этом доложил в среду глава "Газпрома" Алексей Миллер президенту РФ Владимиру Путину.

Однако это не означает, что газопровод будет запущен в ближайшие дни и даже месяцы, - ранее вице-премьер Александр Новак выражал надежду, что его сертификация завершится к концу первой половины 2022 года. В самой Германии заявляли, что соответствующее решение не будет принято в первом полугодии. '''

In [5]:
eng_text = '''MOSCOW, Dec 29 (Reuters) - Russian President Vladimir Putin said on Wednesday the Nord Stream 2 undersea gas pipeline would help to calm a surge in European gas prices and was ready to start exports now a second stretch of the pipeline has been filled.

Nord Stream 2, completed in September but awaiting regulatory approval from Germany and the European Union, faces resistance from the United States and several countries including Poland and Ukraine, which say it will increase Russia's leverage over Europe.

The pipeline had been scheduled to be completed in 2019, but construction was suspended following the threat of sanctions by the U.S. administration of Donald Trump and the subsequent withdrawal of the Swiss-Dutch company Allseas from pipe-laying.'''

In [6]:
de_text = '''Der neu gewählte Bundeskanzler Olaf Scholz (SPD) traf am Freitag in Brüssel mit Kommissionspräsidentin Ursula von der Leyen zusammen. Beide richteten warnende Worte an Russland. Von der Leyen sagte: „Wir erwarten, dass Russland deeskaliert und jegliche Art von Aggression gegenüber seinen Nachbarn unterlässt und die Rechte souveräner Staaten achtet. Andernfalls ist die Europäische Union bereit, nicht nur die bestehenden Sanktionen zu verschärfen, sondern auch neue, spürbare Maßnahmen zu ergreifen.“ Die Kommissionspräsidentin nannte die Felder Wirtschaft und Finanzen. '''

## Stanza

+ https://stanfordnlp.github.io/stanza/
+ library for NLP
+ 66 languages, including Russian
+ tokenization, lemmatization, morphological and syntactic parsing, NER

In [1]:
!pip install stanza


In [1]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-11-27 13:28:43 INFO: Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/default.zip:   0%|          | 0…

2024-11-27 13:29:10 INFO: Finished downloading models and saved to C:\Users\Xiaomi\stanza_resources.


NER operates on tokens, so the text must be tokenized first: the pipeline needs the `tokenize` processor in addition to `ner`.

In [2]:
def stanza_nlp(text):
    nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
    doc = nlp(text)
    print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

In [7]:
stanza_nlp(eng_text)

2024-11-27 13:29:19 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-11-27 13:29:20 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2024-11-27 13:29:20 INFO: Using device: cpu
2024-11-27 13:29:20 INFO: Loading: tokenize
2024-11-27 13:29:20 INFO: Loading: ner
2024-11-27 13:29:20 INFO: Done loading processors!


entity: MOSCOW	type: GPE
entity: Dec 29 (Reuters) -	type: DATE
entity: Russian	type: NORP
entity: Vladimir Putin	type: PERSON
entity: Wednesday	type: DATE
entity: 2	type: CARDINAL
entity: European	type: NORP
entity: second	type: ORDINAL
entity: Nord Stream 2	type: ORG
entity: September	type: DATE
entity: Germany	type: GPE
entity: the European Union	type: ORG
entity: the United States	type: GPE
entity: Poland	type: GPE
entity: Ukraine	type: GPE
entity: Russia	type: GPE
entity: Europe	type: LOC
entity: 2019	type: DATE
entity: U.S.	type: GPE
entity: Donald Trump	type: PERSON
entity: Swiss	type: NORP
entity: Allseas	type: ORG


To get BIOES NER tags for each token:

In [8]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp(eng_text)
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens][10:50], sep='\n')

2024-11-27 13:29:44 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-11-27 13:29:45 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2024-11-27 13:29:45 INFO: Using device: cpu
2024-11-27 13:29:45 INFO: Loading: tokenize
2024-11-27 13:29:45 INFO: Loading: ner
2024-11-27 13:29:46 INFO: Done loading processors!


token: Vladimir	ner: B-PERSON
token: Putin	ner: E-PERSON
token: said	ner: O
token: on	ner: O
token: Wednesday	ner: S-DATE
token: the	ner: O
token: Nord	ner: O
token: Stream	ner: O
token: 2	ner: S-CARDINAL
token: undersea	ner: O
token: gas	ner: O
token: pipeline	ner: O
token: would	ner: O
token: help	ner: O
token: to	ner: O
token: calm	ner: O
token: a	ner: O
token: surge	ner: O
token: in	ner: O
token: European	ner: S-NORP
token: gas	ner: O
token: prices	ner: O
token: and	ner: O
token: was	ner: O
token: ready	ner: O
token: to	ner: O
token: start	ner: O
token: exports	ner: O
token: now	ner: O
token: a	ner: O
token: second	ner: S-ORDINAL
token: stretch	ner: O
token: of	ner: O
token: the	ner: O
token: pipeline	ner: O
token: has	ner: O
token: been	ner: O
token: filled	ner: O
token: .	ner: O
token: Nord	ner: B-ORG
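
Going the other way, per-token BIOES tags can be folded back into entity spans. The sketch below is a minimal pure-Python decoder; `bioes_to_spans` is a hypothetical helper, and it assumes a well-formed tag sequence (it does not check that the labels inside one span agree).

```python
def bioes_to_spans(tags):
    """Collect (start, end, label) spans from BIOES tags; end is exclusive."""
    spans = []
    start = None  # index of the currently open B- tag, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        # "I-" tags simply continue the open span
    return spans


# First five tokens and tags from the output above
tokens = ["Vladimir", "Putin", "said", "on", "Wednesday"]
tags = ["B-PERSON", "E-PERSON", "O", "O", "S-DATE"]

for start, end, label in bioes_to_spans(tags):
    print(" ".join(tokens[start:end]), label)
```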


Let's test on Russian:

In [10]:
stanza.download('ru')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-11-27 13:31:42 INFO: Downloading default packages for language: ru (Russian) ...
2024-11-27 13:31:45 INFO: File exists: C:\Users\Xiaomi\stanza_resources\ru\default.zip
2024-11-27 13:31:56 INFO: Finished downloading models and saved to C:\Users\Xiaomi\stanza_resources.


In [11]:
def stanza_nlp_ru(text):
    nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
    doc = nlp(text)
    print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

In [12]:
stanza_nlp_ru(ru_text)

2024-11-27 13:31:56 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2024-11-27 13:31:57 INFO: Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |
| ner       | wikiner   |

2024-11-27 13:31:57 INFO: Using device: cpu
2024-11-27 13:31:57 INFO: Loading: tokenize
2024-11-27 13:31:57 INFO: Loading: ner
2024-11-27 13:31:59 INFO: Done loading processors!


entity: Северного потока - 2	type: MISC
entity: Газпрома	type: ORG
entity: Алексей Миллер	type: PER
entity: РФ	type: LOC
entity: Владимиру Путину	type: PER
entity: Александр Новак	type: PER
entity: Германии	type: LOC


Choose a language that interests you and test Stanza on it. Which named-entity recognition problems are specific to the language you have chosen?

In [13]:
def stanza_nlp_de(text):
    nlp = stanza.Pipeline(lang='de', processors='tokenize,ner')
    doc = nlp(text)
    print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')
stanza_nlp_de(de_text)

2024-11-27 13:33:24 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …



Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/tokenize/gsd.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/mwt/gsd.pt:   0%|          | 0.…

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/ner/germeval2014.pt:   0%|     …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/pretrain/fasttextwiki.pt:   0%|…

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/backward_charlm/newswiki.pt:   …

Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.6.0/models/forward_charlm/newswiki.pt:   0…

2024-11-27 13:33:39 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| ner       | germeval2014 |

2024-11-27 13:33:39 INFO: Using device: cpu
2024-11-27 13:33:39 INFO: Loading: tokenize
2024-11-27 13:33:39 INFO: Loading: mwt
2024-11-27 13:33:39 INFO: Loading: ner
2024-11-27 13:33:41 INFO: Done loading processors!


entity: Olaf Scholz	type: PER
entity: SPD	type: ORG
entity: Brüssel	type: LOC
entity: Ursula von der Leyen	type: PER
entity: Russland	type: LOC
entity: Von der Leyen	type: PER
entity: Russland	type: LOC
entity: Europäische Union	type: ORG


## spaCy

+ Library for advanced NLP
+ Supports a number of languages (English, Chinese, German, French, Italian, Polish, Spanish, etc.); models for more languages are being developed
+ About spaCy: https://spacy.io/usage

In [14]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 7.1 MB/s eta 0:00:00
Collecting setuptools
  Downloading setuptools-68.0.0-py3-none-any.whl (804 kB)
     ------------------------------------- 804.0/804.0 kB 16.9 MB/s eta 0:00:00
Collecting wheel
  Downloading wheel-0.42.0-py3-none-any.whl (65 kB)
     ---------------------------------------- 65.4/65.4 kB 3.4 MB/s eta 0:00:00


ERROR: To modify pip, please run the following command:
C:\Users\Xiaomi\Anaconda3\python.exe -m pip install -U pip setuptools wheel


Collecting spacy
  Downloading spacy-3.8.2.tar.gz (1.3 MB)
     ---------------------------------------- 1.3/1.3 MB 6.3 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  pip subprocess to install build dependencies did not run successfully.
  exit code: 1
  
  [107 lines of output]
  Ignoring numpy: markers 'python_version >= "3.9"' don't match your environment
  Collecting setuptools
    Using cached setuptools-68.0.0-py3-none-any.whl (804 kB)
  Collecting cython<3.0,>=0.25
    Downloading Cython-0.29.37-py2.py3-none-any.whl (989 kB)
       -------------------------------------- 989.5/989.5 kB 5.2 MB/s eta 0:00:00
  Collecting cymem<2.1.0,>=2.0.2
    Downloading cymem-2.0.10.tar.gz (10 kB)
    Installing build dependencies: started
    Installing build dependencies: finished with status 'done'
    Getting requirements to build wheel: started
    Getting requirements to build wheel: finished with status 'done'
    Installing backend dependencies: started
    Installing backend dependencies: finished with status 'done'
    Preparing metadata (pyproject.toml): started
    Preparing metadata (pyproject.toml): fin

Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
     --------------------------------------- 11.1/11.1 MB 31.1 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: en_core_web_sm
  Building wheel for en_core_web_sm (setup.py): started
  Building wheel for en_core_web_sm (setup.py): finished with status 'done'
  Created wheel for en_core_web_sm: filename=en_core_web_sm-2.1.0-py3-none-any.whl size=11074412 sha256=4d16352d6a106e4ff00efccf922b77a5065efe0ad18bcdeb6995f276003689f0
  Stored in directory: C:\Users\Xiaomi\AppData\Local\Temp\pip-ephem-wheel-cache-1134zopl\wheels\59\4f\8c\0dbaab09a776d1fa3740e9465078bfd903cc22f3985382b496
Successfully built en_core_web_sm
Installing collected packages: en_core_web_sm
Successfully installed en_core_web_sm-2.1.0
[+] Download and

DEPRECATION: https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0 contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617


In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [None]:
doc = nlp(eng_text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

spaCy's English models offer 18 tags for different types of named entities (see the list [here](https://towardsdatascience.com/named-entity-recognition-ner-using-spacy-nlp-part-4-28da2ece57c6)). You can also add a custom tag:

In [43]:
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp(eng_text)
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model did not recognize "Nord Stream" as an entity

# Span takes token indices, not character offsets: tokens 16 and 17 are "Nord" and "Stream"
ns2_ent = Span(doc, 16, 18, label="OBJ")  # OBJ is a custom label meaning "object"
doc.ents = list(doc.ents) + [ns2_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# ('Nord Stream', 82, 93, 'OBJ') now appears among the entities


Before [('MOSCOW', 0, 6, 'GPE'), ('Dec 29', 8, 14, 'DATE'), ('Reuters', 16, 23, 'ORG'), ('Russian', 27, 34, 'NORP'), ('Vladimir Putin', 45, 59, 'PERSON'), ('Wednesday', 68, 77, 'DATE'), ('2', 94, 95, 'CARDINAL'), ('European', 148, 156, 'NORP'), ('second', 205, 211, 'ORDINAL'), ('September', 282, 291, 'DATE'), ('Germany', 330, 337, 'GPE'), ('the European Union', 342, 360, 'ORG'), ('the United States', 384, 401, 'GPE'), ('Poland', 434, 440, 'GPE'), ('Ukraine', 445, 452, 'GPE'), ('Russia', 481, 487, 'GPE'), ('Europe', 504, 510, 'LOC'), ('2019', 564, 568, 'DATE'), ('U.S.', 642, 646, 'GPE'), ('Donald Trump', 665, 677, 'PERSON'), ('Swiss', 715, 720, 'NORP')]
After [('MOSCOW', 0, 6, 'GPE'), ('Dec 29', 8, 14, 'DATE'), ('Reuters', 16, 23, 'ORG'), ('Russian', 27, 34, 'NORP'), ('Vladimir Putin', 45, 59, 'PERSON'), ('Wednesday', 68, 77, 'DATE'), ('Nord Stream', 82, 93, 'OBJ'), ('2', 94, 95, 'CARDINAL'), ('European', 148, 156, 'NORP'), ('second', 205, 211, 'ORDINAL'), ('September', 282, 291, 'DATE'

In [45]:
len([ent for ent in doc.ents if ent.label_ == 'OBJ'])

1
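
The token indices `16, 18` in the cell above were hard-coded, while the printed entities use character offsets (`82, 93`). The sketch below shows how the two relate, using only the standard library. The regex tokenizer is an assumption made for illustration; it is not spaCy's tokenizer, although it splits this particular fragment the same way. On a real `Doc`, spaCy's own `Doc.char_span` method is the usual way to do this mapping.

```python
import re


def tokenize_with_offsets(text):
    """Naive tokenizer: word runs and single punctuation marks, with char offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(tokens, start_char, end_char):
    """Return (first_token, last_token + 1) for a character range.

    Returns None when the range does not align with token boundaries.
    """
    start_tok = end_tok = None
    for i, (_, tok_start, tok_end) in enumerate(tokens):
        if tok_start == start_char:
            start_tok = i
        if tok_end == end_char:
            end_tok = i + 1
    if start_tok is None or end_tok is None or start_tok >= end_tok:
        return None
    return start_tok, end_tok


# Beginning of eng_text
sample = ("MOSCOW, Dec 29 (Reuters) - Russian President Vladimir Putin "
          "said on Wednesday the Nord Stream 2 undersea gas pipeline")
tokens = tokenize_with_offsets(sample)

start_char = sample.index("Nord Stream")
end_char = start_char + len("Nord Stream")
token_span = char_span_to_token_span(tokens, start_char, end_char)
print((start_char, end_char), "->", token_span)
```

For this fragment the character range `(82, 93)` maps to the token range `(16, 18)`, matching the values used for the `OBJ` span.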

In [47]:
from spacy import displacy

In [49]:
displacy.render(doc, style = "ent", jupyter=True)

Let's test on Russian:

In [None]:
! python -m spacy download ru_core_news_sm

In [51]:
nlp = spacy.load("ru_core_news_sm")

In [52]:
doc = nlp(ru_text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Газпрома 136 144 ORG
Алексей Миллер 146 160 PER
РФ 172 174 LOC
Владимиру Путину 175 191 PER
Александр Новак 299 314 PER
Германии 407 415 LOC


In [53]:
displacy.render(doc, style = "ent", jupyter=True)

We can highlight only selected entity types:

In [54]:
options = {'ents':['ORG','LOC']}
displacy.render(doc, style = "ent", jupyter=True, options=options)


Does spaCy work well on the language of your choice? What NER problems do you see?

In [None]:
! python -m spacy download de_core_news_sm

In [56]:
nlp = spacy.load("de_core_news_sm")

In [57]:
doc = nlp(de_text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Olaf Scholz 31 42 PER
Brüssel 68 75 LOC
Ursula 103 109 PER
Leyen 118 123 LOC
Russland. 168 177 MISC
Leyen 186 191 LOC
Russland 219 227 LOC
Staaten 335 342 LOC
Europäische Union 371 388 ORG
Die Kommissionspräsidentin 503 529 MISC
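
One way to approach the question is to compare the two systems' outputs on the same German text. The sketch below copies the Stanza and spaCy results printed in this notebook and compares them by exact `(text, label)` match. Neither output is treated as a gold standard, and the helper `compare_entities` is a hypothetical name used only here.

```python
# Entities copied from the Stanza output for de_text
stanza_ents = [
    ("Olaf Scholz", "PER"), ("SPD", "ORG"), ("Brüssel", "LOC"),
    ("Ursula von der Leyen", "PER"), ("Russland", "LOC"),
    ("Von der Leyen", "PER"), ("Russland", "LOC"),
    ("Europäische Union", "ORG"),
]

# Entities copied from the spaCy output for de_text
spacy_ents = [
    ("Olaf Scholz", "PER"), ("Brüssel", "LOC"), ("Ursula", "PER"),
    ("Leyen", "LOC"), ("Russland.", "MISC"), ("Leyen", "LOC"),
    ("Russland", "LOC"), ("Staaten", "LOC"), ("Europäische Union", "ORG"),
    ("Die Kommissionspräsidentin", "MISC"),
]


def compare_entities(first, second):
    """Compare two entity lists by exact (text, label) match, ignoring repeats."""
    first, second = set(first), set(second)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


result = compare_entities(stanza_ents, spacy_ents)
for key, ents in result.items():
    print(key, len(ents), ents)
```

The disagreements point at likely problem areas for German in this sample: multi-token surnames with particles ("von der Leyen"), sentence-final punctuation attached to a name ("Russland."), and capitalized common nouns ("Staaten", "Die Kommissionspräsidentin") tagged as entities.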


## Natasha

+ Originally, the Natasha library was a rule-based NER tool for Russian
+ Now it is a full-fledged NLP project for the Russian language
+ Tokenization, lemmatization, parsing, NER tagging, etc.
+ https://github.com/natasha/natasha


In [None]:
!pip install natasha

In [59]:
from natasha import (
    Segmenter,
    MorphVocab,

    NewsEmbedding,
    NewsMorphTagger,
    NewsSyntaxParser,
    NewsNERTagger,

    PER,
    NamesExtractor,

    Doc
)

In [60]:
segmenter = Segmenter()
morph_vocab = MorphVocab()

emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)
syntax_parser = NewsSyntaxParser(emb)
ner_tagger = NewsNERTagger(emb)

names_extractor = NamesExtractor(morph_vocab)
doc = Doc(ru_text)

NER depends on segmentation:

In [62]:
doc.segment(segmenter)
print(doc.tokens[:5])
print(doc.sents[:5])

[DocToken(stop=6, text='Вторая'), DocToken(start=7, stop=12, text='нитка'), DocToken(start=13, stop=14, text='"'), DocToken(start=14, stop=23, text='Северного'), DocToken(start=24, stop=30, text='потока')]
[DocSent(stop=104, text='Вторая нитка "Северного потока - 2" заполнена газ..., tokens=[...]), DocSent(start=105, stop=192, text='Об этом доложил в среду глава "Газпрома" Алексей ..., tokens=[...]), DocSent(start=194, stop=398, text='Однако это не означает, что газопровод будет запу..., tokens=[...]), DocSent(start=399, stop=490, text='В самой Германии заявляли, что соответствующее ре..., tokens=[...])]


In [67]:
doc.tag_ner(ner_tagger)
display(doc.spans)
doc.ner.print()

[DocSpan(start=136, stop=144, type='ORG', text='Газпрома', tokens=[...]),
 DocSpan(start=146, stop=160, type='PER', text='Алексей Миллер', tokens=[...]),
 DocSpan(start=172, stop=174, type='LOC', text='РФ', tokens=[...]),
 DocSpan(start=175, stop=191, type='PER', text='Владимиру Путину', tokens=[...]),
 DocSpan(start=299, stop=314, type='PER', text='Александр Новак', tokens=[...]),
 DocSpan(start=407, stop=415, type='LOC', text='Германии', tokens=[...])]

Вторая нитка "Северного потока - 2" заполнена газом, и теперь 
газопровод полностью готов к эксплуатации. Об этом доложил в среду 
глава "Газпрома" Алексей Миллер президенту РФ Владимиру Путину.
       ORG─────  PER───────────            LO PER───────────── 
Однако это не означает, что газопровод будет запущен в ближайшие дни и
 даже месяцы, - ранее вице-премьер Александр Новак выражал надежду, 
                                   PER────────────                  
что его сертификация завершится к концу первой половины 2022 года. В 
самой Германии заявляли, что соответствующее решение не будет принято 
      LOC─────                                                        
в первом полугодии. 
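
The `start` and `stop` values in the spans above are character offsets into the original text, so each entity can be recovered by slicing. The sketch below checks this on the first paragraph of `ru_text` with the offsets printed above and groups the entities by type. It uses plain tuples instead of Natasha's `DocSpan` objects, which is a simplification made for illustration.

```python
from collections import defaultdict

# First paragraph of ru_text
text = ('Вторая нитка "Северного потока - 2" заполнена газом, и теперь газопровод '
        'полностью готов к эксплуатации. Об этом доложил в среду глава "Газпрома" '
        'Алексей Миллер президенту РФ Владимиру Путину.')

# (start, stop, type) copied from the doc.spans output for this paragraph
spans = [
    (136, 144, "ORG"),
    (146, 160, "PER"),
    (172, 174, "LOC"),
    (175, 191, "PER"),
]

# Slice the text with each span's offsets and group the results by entity type
by_type = defaultdict(list)
for start, stop, label in spans:
    by_type[label].append(text[start:stop])

for label, entities in by_type.items():
    print(label, entities)
```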


Entities can be reduced to their normal (dictionary) form. This requires morphological analysis and lemmatization first (Natasha uses Pymorphy2):

In [71]:
doc.tag_morph(morph_tagger)
doc.sents[0].morph.print()

              Вторая ADJ|Case=Nom|Degree=Pos|Gender=Fem|Number=Sing
               нитка NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing
                   " PUNCT
           Северного ADJ|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing
              потока NOUN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
                   - PUNCT
                   2 NUM
                   " PUNCT
           заполнена VERB|Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass
               газом NOUN|Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing
                   , PUNCT
                   и CCONJ
              теперь ADV|Degree=Pos
          газопровод NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing
           полностью ADV|Degree=Pos
               готов ADJ|Degree=Pos|Gender=Masc|Number=Sing|Variant=Short
                   к ADP
        эксплуатации NOUN|Animacy=Inan|Case=Dat|Gender=Fem|Number=Sing
                   . PUNCT


In [72]:
for token in doc.tokens:
    token.lemmatize(morph_vocab)
    
{_.text: _.lemma for _ in doc.tokens}

{'"': '"',
 ',': ',',
 '-': '-',
 '.': '.',
 '2': '2',
 '2022': '2022',
 'Александр': 'александр',
 'Алексей': 'алексей',
 'В': 'в',
 'Владимиру': 'владимир',
 'Вторая': 'второй',
 'Газпрома': 'газпром',
 'Германии': 'германия',
 'Миллер': 'миллер',
 'Новак': 'новак',
 'Об': 'о',
 'Однако': 'однако',
 'Путину': 'путин',
 'РФ': 'рф',
 'Северного': 'северный',
 'ближайшие': 'близкий',
 'будет': 'быть',
 'в': 'в',
 'вице-премьер': 'вице-премьер',
 'выражал': 'выражать',
 'газом': 'газ',
 'газопровод': 'газопровод',
 'глава': 'глава',
 'года': 'год',
 'готов': 'готовый',
 'даже': 'даже',
 'дни': 'день',
 'доложил': 'доложить',
 'его': 'его',
 'завершится': 'завершиться',
 'заполнена': 'заполнить',
 'запущен': 'запустить',
 'заявляли': 'заявлять',
 'и': 'и',
 'к': 'к',
 'концу': 'конец',
 'месяцы': 'месяц',
 'надежду': 'надежда',
 'не': 'не',
 'нитка': 'нитка',
 'означает': 'означать',
 'первой': 'первый',
 'первом': 'первый',
 'полностью': 'полностью',
 'половины': 'половина',
 'полугодии'

Normalize the entities:

In [73]:
for span in doc.spans:
    span.normalize(morph_vocab)

{_.text: _.normal for _ in doc.spans if _.text != _.normal}

{'Владимиру Путину': 'Владимир Путин',
 'Газпрома': 'Газпром',
 'Германии': 'Германия'}

For the normalized names, first and last names can be extracted separately:

In [78]:
for span in doc.spans:
    if span.type == PER:
        span.extract_fact(names_extractor)

# skip spans for which the extractor found nothing (fact stays unset)
{_.normal: _.fact.as_dict for _ in doc.spans if _.type == PER and _.fact}

{'Александр Новак': {'first': 'Александр'},
 'Алексей Миллер': {'first': 'Алексей', 'last': 'Миллер'},
 'Владимир Путин': {'first': 'Владимир', 'last': 'Путин'}}
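
In the output above the extractor returned only a first name for "Александр Новак". A naive fallback, sketched below, fills in missing parts by splitting a two-word normalized name on whitespace. The helper `split_name` is hypothetical and not part of Natasha, and it assumes "first name, last name" order, which will not hold for every name.

```python
def split_name(normal):
    """Naive fallback: treat a two-word name as 'first last'."""
    parts = normal.split()
    if len(parts) == 2:
        return {"first": parts[0], "last": parts[1]}
    return {}


# Facts copied from the output above
facts = {
    "Александр Новак": {"first": "Александр"},
    "Алексей Миллер": {"first": "Алексей", "last": "Миллер"},
    "Владимир Путин": {"first": "Владимир", "last": "Путин"},
}

# Values found by the extractor take priority over the naive split
completed = {name: {**split_name(name), **fact} for name, fact in facts.items()}
for name, fact in completed.items():
    print(name, fact)
```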

In [82]:
#names_extractor = NamesExtractor(morph_vocab)
#dates_extractor = DatesExtractor(morph_vocab)
#money_extractor = MoneyExtractor(morph_vocab)
#addr_extractor = AddrExtractor(morph_vocab)