# Named Entity Recognition

In [0]:
import nltk
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

In [0]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [0]:
from spacy import displacy

In [0]:
sentences = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

## Using NLTK's `ne_chunk()`
#### `nltk.ne_chunk()` recognizes named entities in a POS-tagged sentence using a pre-trained classifier, which adds category labels such as 'PERSON', 'ORGANIZATION' and 'LOCATION'.

### Tokenizing and finding the POS tags

In [0]:
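# Split the sentence into tokens, then tag each token with its part of speech.
# Assumes the NLTK tokenizer and tagger data have already been fetched with nltk.download().
words = nltk.word_tokenize(sentences)
word_list = pos_tag(words)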

In [56]:
word_list

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

In [0]:
ne=ne_chunk(word_list)

In [58]:
print(ne)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


### NLTK labels 'European' as a GPE (geo-political entity) and 'Google' as a PERSON, which is a poor result: Google is an organization.
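
The bracketed tree printed above is easy to read but awkward to process further. As a minimal sketch, the hypothetical helper `extract_entities` below (not part of NLTK) walks an `ne_chunk`-style tree and collects each labelled chunk as a `(text, label)` pair. It assumes only that chunk subtrees expose `label()` and `leaves()`, as `nltk.Tree` does, and that ordinary tokens are plain `(word, tag)` tuples.

```python
def extract_entities(tree):
    """Return (entity text, label) pairs from an ne_chunk-style tree.

    Hypothetical helper for illustration; not an NLTK function.
    """
    entities = []
    for node in tree:
        # Named-entity chunks are subtrees with a label; ordinary tokens
        # are plain (word, tag) tuples and are skipped.
        if hasattr(node, "label") and hasattr(node, "leaves"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities
```

With the tree built earlier, `extract_entities(ne)` should yield the two chunks visible in the printed output, `('European', 'GPE')` and `('Google', 'PERSON')`.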

## Using spaCy

#### spaCy's English named entity recognizer (as in `en_core_web_sm`, loaded above) has been trained on the OntoNotes 5 corpus and supports the following entity types:

* PERSON - People, including fictional.
* NORP - Nationalities or religious or political groups.
* FAC - Buildings, airports, highways, bridges, etc.
* ORG - Companies, agencies, institutions, etc.
* GPE - Countries, cities, states.
* LOC - Non-GPE locations, mountain ranges, bodies of water.
* PRODUCT - Objects, vehicles, foods, etc. (Not services.)
* EVENT - Named hurricanes, battles, wars, sports events, etc.
* WORK_OF_ART - Titles of books, songs, etc.
* LAW - Named documents made into laws.
* LANGUAGE - Any named language.
* DATE - Absolute or relative dates or periods.
* TIME - Times smaller than a day.
* PERCENT - Percentage, including “%”.
* MONEY - Monetary values, including unit.
* QUANTITY - Measurements, as of weight or distance.
* ORDINAL - “first”, “second”, etc.
* CARDINAL - Numerals that do not fall under another type.
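
Rather than keeping this list by hand, the descriptions can be looked up at runtime. spaCy ships a glossary lookup, `spacy.explain`, and the NER pipe exposes its label set through `nlp.get_pipe("ner").labels`. The sketch below wraps the lookup in a hypothetical helper, `describe_labels`; the `explain` parameter is an assumption added here so that the function can be exercised without spaCy installed.

```python
def describe_labels(labels, explain=None):
    """Map each entity label to a short human-readable description.

    Hypothetical helper. By default it uses spacy.explain, which returns
    None for terms missing from spaCy's glossary.
    """
    if explain is None:
        import spacy  # imported lazily so the helper can be tried with a stub
        explain = spacy.explain
    return {label: explain(label) or "(no description)" for label in labels}
```

In the notebook, `describe_labels(nlp.get_pipe("ner").labels)` would describe every label the loaded model can emit.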

In [0]:
# Calling nlp() on the text runs spaCy's full processing pipeline and returns a Doc object.
doc=nlp(sentences)

In [60]:
for d in doc.ents:
  print(d.text,d.label_)

European NORP
Google ORG
$5.1 billion MONEY
Wednesday DATE


### spaCy labels all four entities in this sentence correctly.

### Trying a longer passage and rendering the entities visually

In [0]:
sentence='''After Mr. Horowitz uncovered the texts, Mr. Mueller, who had by then taken over the investigation, removed Mr. Strzok from his team. He was reassigned to the F.B.I.’s human resources division. Ms. Page, who had left Mr. Mueller’s team before the discovery of the text messages, quit the F.B.I. in May.
The inspector general’s report also took issue with the reaction by Mr. Strzok and other F.B.I. officials to the discovery of possible new evidence in the Clinton investigation, known internally as Midyear Exam, in late September 2016 on a laptop that belonged to the disgraced politician Anthony D. Weiner, the husband of a top Clinton aide.
At the time, Mr. Strzok was in the early stages of investigating whether any Trump associates had conspired with Russia’s interference in the presidential election, and nearly a month passed before agents and analysts began to act on the emails found on Mr. Weiner’s laptop. Mr. Horowitz could not rule out that Mr. Strzok had slow-walked the examination of the new emails to help Mrs. Clinton’s presidential bid.
“Under these circumstances, we did not have confidence that Strzok’s decision to prioritize the Russia investigation over following up on the Midyear-related investigative lead discovered on the Weiner laptop was free from bias,” he wrote.
The delays were merely the “result of bureaucratic snafus,” Mr. Strzok’s lawyer wrote last month in USA Today.
But the justifications for the delay were “unpersuasive” and had “far-reaching consequences,” the inspector general said. James B. Comey, the former F.B.I. director, has told investigators that if he had known about the emails earlier, it might have influenced his decision to alert Congress to their existence days before the election.
In addition, the inspector general said that Mr. Strzok had forwarded a proposed search warrant to his personal email account. The inspector general said the email, which included a draft of the search warrant affidavit, contained information that appeared to be under seal.'''

In [0]:
doc=nlp(sentence)

In [63]:
for d in doc.ents:
  print(d.text,d.label_)

Horowitz PERSON
Mueller PERSON
Strzok PERSON
F.B.I. ORG
Page PERSON
Mueller PERSON
F.B.I. ORG
May. GPE
Strzok PERSON
F.B.I. ORG
Clinton PERSON
Midyear Exam LOC
Anthony D. Weiner PERSON
Clinton PERSON
Strzok PERSON
Trump PERSON
Russia GPE
nearly a month DATE
Weiner PERSON
Horowitz PERSON
Strzok PERSON
Clinton PERSON
Strzok PERSON
Russia GPE
Midyear DATE
Weiner ORG
Strzok PERSON
last month DATE
USA Today ORG
James B. Comey PERSON
F.B.I. ORG
Congress ORG
days DATE
Strzok PERSON


In [0]:
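sent = list(doc.sents)  # sentence spans, kept for inspection

In [72]:
# Render the parsed doc directly; passing str(sent) would add list brackets to the text and parse it a second time.
displacy.render(doc, jupyter=True, style='ent')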
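
The entity visualizer also accepts an `options` dictionary: its `ents` key restricts which labels are highlighted, and `colors` overrides the highlight color per label. The sketch below builds such a dictionary with a hypothetical helper, `ent_options`; the chosen labels and color values are illustrative assumptions.

```python
def ent_options(labels, colors=None):
    """Build an options dict for displacy's 'ent' style.

    Hypothetical helper: 'ents' limits rendering to the given labels,
    'colors' maps labels to CSS color values.
    """
    options = {"ents": list(labels)}
    if colors:
        # Keep only colors for labels that will actually be shown.
        options["colors"] = {k: v for k, v in colors.items() if k in options["ents"]}
    return options

opts = ent_options(["PERSON", "ORG"], colors={"PERSON": "#ffd966", "DATE": "#cccccc"})
print(opts)
```

In the notebook this would be passed along as `displacy.render(doc, style='ent', jupyter=True, options=opts)`.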