### Named Entity Recognizer


###### Named entity recognition (NER) is the first step towards information extraction. It seeks to locate named entities in text and classify them into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values and percentages.
Below are the steps to build a named entity recognizer with NLTK and spaCy that identifies the names of things such as persons, organizations, or locations in raw text.

#### Using NLTK

In [3]:
import nltk
from nltk.tokenize import word_tokenize

In [4]:
from nltk.tag import pos_tag

Working with an example sentence

In [6]:
ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

Applying word tokenization and part-of-speech tagging

In [10]:
def preprocess(sent):
    # Split the raw sentence into word tokens
    sent = word_tokenize(sent)
    # Attach a Penn Treebank part-of-speech tag to every token
    sent = pos_tag(sent)
    return sent

We get a list of tuples containing individual words in the sentence and their associated parts of speech

In [18]:
sent = preprocess(ex)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]
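
To get a feel for the tagged output, the tags can be tallied. A minimal sketch follows. The `tagged` list is copied from the first tokens of the output above, and `PENN_TAGS` is a hand-written glossary for this example, not part of NLTK.

```python
from collections import Counter

# First tokens of the tagged sentence shown above
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'),
          ('$', '$'), ('5.1', 'CD'), ('billion', 'CD')]

# Hand-written glossary of the Penn Treebank tags used here (illustrative helper)
PENN_TAGS = {
    'JJ': 'adjective', 'NNS': 'noun, plural', 'VBD': 'verb, past tense',
    'NNP': 'proper noun, singular', 'DT': 'determiner',
    'NN': 'noun, singular', '$': 'dollar sign', 'CD': 'cardinal number',
}

# Count how often each tag occurs and print it next to its meaning
tag_counts = Counter(tag for _word, tag in tagged)
for tag, count in tag_counts.most_common():
    print(f"{tag:4} {PENN_TAGS[tag]:22} {count}")
```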

Next, we implement noun phrase chunking to identify candidate named entities, using a regular expression made up of rules that indicate how sentences should be chunked.
Our chunk pattern consists of a single rule: a noun phrase, NP, is formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a singular noun, NN. Because the rule matches only the NN tag, plural nouns (NNS) and proper nouns (NNP) such as Google are left outside any chunk.


In [19]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

Chunking: using the pattern, we create a chunk parser and test it on our sentence

In [20]:
cp = nltk.RegexpParser(pattern)

In [22]:
cs = cp.parse(sent)

In [24]:
print(cs)


(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
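
The rule above only matches the singular-noun tag NN, so "authorities" and "Google" stay unchunked. As a hedged variation (not the pattern used in this notebook), the sketch below widens the rule to `<NN.*>+`, which matches one or more tags starting with NN. The tagged tokens are typed in by hand from the output above, so no tokenizer or tagger data is needed to run it.

```python
import nltk

# First tokens of the tagged sentence from above
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN')]

# Assumed variation: <NN.*>+ matches one or more of NN, NNS, NNP, NNPS
broad_pattern = 'NP: {<DT>?<JJ>*<NN.*>+}'
broad_parser = nltk.RegexpParser(broad_pattern)
broad_tree = broad_parser.parse(tagged)

# Collect the words of every NP subtree as a plain string
noun_phrases = [' '.join(word for word, _tag in subtree.leaves())
                for subtree in broad_tree.subtrees()
                if subtree.label() == 'NP']
print(noun_phrases)
```

With the wider rule, plural and proper nouns are chunked as well, which is usually closer to what a named entity step needs.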


Inside-Outside-Beginning (IOB) tags have become the standard way to represent chunk structures in files. The chunk tree converts to that format as follows:

In [25]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

In [26]:
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]


The above output has one token per line, each with its part-of-speech tag and its chunk tag.
B-NP : the token begins a noun phrase.
I-NP : the token is inside the current noun phrase.
O : the token is outside any chunk.
B-VP and I-VP : beginning and inside of a verb phrase (not present in this example).
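
To make the tag meanings concrete, here is a minimal pure-Python sketch that groups IOB triples back into chunks. The helper `iob_to_chunks` is written for this example and is not an NLTK function; the sample triples are taken from the output above.

```python
def iob_to_chunks(iob_triples):
    """Group (word, pos, iob) triples into (label, [words]) chunks."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, _pos, iob in iob_triples:
        if iob.startswith('B-'):
            # A B- tag always opens a new chunk
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith('I-') and current is not None:
            # An I- tag extends the chunk that is currently open
            current[1].append(word)
        else:
            # An O tag closes any open chunk
            current = None
    return chunks

sample = [('in', 'IN', 'O'), ('the', 'DT', 'B-NP'), ('mobile', 'JJ', 'I-NP'),
          ('phone', 'NN', 'I-NP'), ('market', 'NN', 'B-NP'), ('and', 'CC', 'O')]
chunks = iob_to_chunks(sample)
print(chunks)
```

The B- prefix is what keeps "the mobile phone" and "market" apart: two adjacent chunks would merge into one if only inside/outside were recorded.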

The function nltk.ne_chunk() recognizes named entities using a classifier. The classifier adds category labels such as PERSON, ORGANIZATION and GPE (geopolitical entity).

In [38]:
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))

In [39]:
print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


In [82]:
ex1 = "Nice place Better than some reviews give it credit for. Overall, the rooms were a bit small but nice. Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city). Overall, it was a good experience and the staff was quite friendly. "

In [83]:
ne_tree1 = nltk.ne_chunk(pos_tag(word_tokenize(ex1)))
print(ne_tree1)

(S
  Nice/JJ
  place/NN
  Better/NNP
  than/IN
  some/DT
  reviews/NNS
  give/VBP
  it/PRP
  credit/NN
  for/IN
  ./.
  Overall/NNP
  ,/,
  the/DT
  rooms/NNS
  were/VBD
  a/DT
  bit/NN
  small/JJ
  but/CC
  nice/JJ
  ./.
  Everything/NN
  was/VBD
  clean/JJ
  ,/,
  the/DT
  view/NN
  was/VBD
  wonderful/JJ
  and/CC
  it/PRP
  is/VBZ
  very/RB
  well/RB
  located/VBN
  (/(
  the/DT
  (ORGANIZATION Prudential/NNP Center/NNP)
  makes/VBZ
  shopping/NN
  and/CC
  eating/VBG
  easy/JJ
  and/CC
  the/DT
  T/NNP
  is/VBZ
  nearby/JJ
  for/IN
  jaunts/NNS
  out/RP
  and/CC
  about/IN
  the/DT
  city/NN
  )/)
  ./.
  Overall/UH
  ,/,
  it/PRP
  was/VBD
  a/DT
  good/JJ
  experience/NN
  and/CC
  the/DT
  staff/NN
  was/VBD
  quite/RB
  friendly/JJ
  ./.)


In the first example, Google is recognized as a PERSON rather than an ORGANIZATION, which is quite disappointing.
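
The tree returned by nltk.ne_chunk() can be flattened into (text, label) pairs. The sketch below is one possible way to do it. `extract_entities` is a helper written for this example, and the tree is built by hand from the start of the first output above so the snippet runs without the classifier data.

```python
from nltk import Tree

def extract_entities(tree):
    """Return (entity text, label) for every labelled subtree under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entities are nested Tree objects
        if isinstance(node, Tree):
            text = ' '.join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built copy of the first few nodes of the ne_chunk output above
ne_tree_sample = Tree('S', [
    Tree('GPE', [('European', 'JJ')]),
    ('authorities', 'NNS'),
    ('fined', 'VBD'),
    Tree('PERSON', [('Google', 'NNP')]),
])
entities = extract_entities(ne_tree_sample)
print(entities)
```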

#### Using spaCy

In [54]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [55]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')

In [56]:
doc.ents

(European, Google, $5.1 billion, Wednesday)

In [57]:
pprint([(X.text,X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


European is NORP (nationalities or religious or political groups), Google is an organization, $5.1 billion is a monetary value and Wednesday is a date. They are all correct.
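
The label abbreviations do not have to be memorized: spaCy ships a small glossary that can be queried with spacy.explain(). A short sketch, assuming spaCy is installed (no model needs to be loaded for this call); the name `entity_labels` is chosen for this example.

```python
import spacy

# Labels that appeared in the output above
entity_labels = ['NORP', 'ORG', 'MONEY', 'DATE']

# spacy.explain returns a short description, or None for unknown terms
descriptions = {label: spacy.explain(label) for label in entity_labels}
for label, description in descriptions.items():
    print(label, '->', description)
```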

The above example used only the entities, obtained through doc.ents. The next example walks over the entire doc and reads the token-level entity annotation. spaCy describes entity boundaries with the BILUO (BEGIN, IN, LAST, UNIT, OUT) tagging scheme:

    BEGIN --- The first token of a multi-token entity
    IN    --- An inner token of a multi-token entity
    LAST  --- The final token of a multi-token entity
    UNIT  --- A single-token entity
    OUT   --- A non-entity token

The ent_iob_ attribute used below reports the simpler IOB form of these tags.
    

In [58]:
pprint([(X,X.ent_iob_,X.ent_type_) for X in doc])

[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', '')]


"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
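
IOB and BILUO carry the same information, so one can be derived from the other. The following is a minimal pure-Python sketch of that conversion. `iob_to_biluo` here is written for illustration and is not spaCy's own implementation; the input tags combine ent_iob_ and ent_type_ from the first tokens of the output above as `B-NORP`, `I-MONEY` and so on.

```python
def iob_to_biluo(tags):
    """Convert tags such as 'B-ORG', 'I-ORG', 'O' to the BILUO scheme."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            biluo.append('O')
            continue
        prefix, label = tag.split('-', 1)
        # The entity continues only if the next tag is I- with the same label
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            biluo.append(('B-' if continues else 'U-') + label)
        else:
            biluo.append(('I-' if continues else 'L-') + label)
    return biluo

# European authorities fined Google a record $ 5.1 billion on Wednesday
iob_tags = ['B-NORP', 'O', 'O', 'B-ORG', 'O', 'O',
            'B-MONEY', 'I-MONEY', 'I-MONEY', 'O', 'B-DATE']
biluo_tags = iob_to_biluo(iob_tags)
print(biluo_tags)
```

Single-token entities such as Google become UNIT tags, while "$5.1 billion" gets explicit BEGIN, IN and LAST tags.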



### Extracting named entity from an article

In [59]:
from bs4 import BeautifulSoup
import requests
import re

In [60]:
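def url_to_string(url):
    # Download the page and parse it with the html5lib parser
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Drop elements that carry no article text
    for script in soup(["script", "style", "aside"]):
        script.extract()
    # Collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))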
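
The cleaning step can be tried offline on a small HTML string before pointing it at a live page. The sketch below mirrors the function above under two assumptions: it takes HTML text instead of a URL, and it uses Python's built-in `html.parser` so that html5lib is not required. The name `html_to_string` and the sample markup are made up for this example.

```python
import re
from bs4 import BeautifulSoup

def html_to_string(html):
    """Strip script, style and aside elements, then flatten whitespace."""
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup(["script", "style", "aside"]):
        element.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

# Made-up page used only to exercise the helper
sample_html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello</p>\n<aside>sidebar</aside><p>World</p></body></html>"
)
clean_text = html_to_string(sample_html)
print(clean_text)
```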

In [61]:
ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')

In [64]:
ny_bb

"     F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times                                                                            SectionsSEARCHSkip to contentSkip to site indexPoliticsSubscribeLog InLog InToday’s PaperPolitics|F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredAdvertisementSupported byF.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredImagePeter Strzok, a top F.B.I. counterintelligence agent who was taken off the special counsel investigation after his disparaging texts about President Trump were uncovered, was fired.CreditCreditT.J. Kirkpatrick for The New York TimesBy Adam Goldman and Michael S. SchmidtAug. 13, 2018WASHINGTON — Peter Strzok, the F.B.I. senior counterintelligence agent who disparaged President Trump in inflammatory text messages and helped oversee the Hillary Clinton email and Russia investigations, has been fired for violating bureau policies, Mr. Strzok’s lawyer said Monday.Mr. Tr

In [65]:
article = nlp(ny_bb)
article


     F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times                                                                            SectionsSEARCHSkip to contentSkip to site indexPoliticsSubscribeLog InLog InToday’s PaperPolitics|F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredAdvertisementSupported byF.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredImagePeter Strzok, a top F.B.I. counterintelligence agent who was taken off the special counsel investigation after his disparaging texts about President Trump were uncovered, was fired.CreditCreditT.J. Kirkpatrick for The New York TimesBy Adam Goldman and Michael S. SchmidtAug. 13, 2018WASHINGTON — Peter Strzok, the F.B.I. senior counterintelligence agent who disparaged President Trump in inflammatory text messages and helped oversee the Hillary Clinton email and Russia investigations, has been fired for violating bureau policies, Mr. Strzok’s lawyer said Monday.Mr. Tru

In [66]:
len(article.ents)

174

There are 174 entities in the article. Next, we count how often each entity label occurs.

In [68]:
labels = [X.label_ for X in article.ents]
Counter(labels)

Counter({'PERSON': 82,
         'GPE': 16,
         'CARDINAL': 5,
         'ORG': 39,
         'DATE': 24,
         'NORP': 2,
         'ORDINAL': 1,
         'FAC': 1,
         'PRODUCT': 2,
         'LOC': 1,
         'TIME': 1})

In [69]:
len(Counter(labels))

11

There are 11 distinct labels. Now, we can identify the most frequent entities.

In [70]:
items = [x.text for x in article.ents]

In [71]:
Counter(items).most_common(3)  # identifies the three most common entities

[('Strzok', 28), ('F.B.I.', 13), ('Trump', 12)]

Picking one sentence (index 19) to look at more closely

In [72]:
sentences = [x for x in article.sents]

In [74]:
print(sentences[19])

Firing Mr. Strzok, however, removes a favorite target of Mr. Trump from the ranks of the F.B.I. and gives Mr. Bowdich and the F.B.I. director, Christopher A. Wray, a chance to move beyond the president


Running displacy.render to visualize the sentence, first its entities and then its dependency parse

In [75]:
displacy.render(nlp(str(sentences[19])),jupyter = True, style='ent')

In [77]:
displacy.render(nlp(str(sentences[19])), style='dep', jupyter = True, options = {'distance': 120})

In [78]:
[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[19]))
 if not x.is_stop and x.pos_ != 'PUNCT']

[('Firing', 'VERB', 'fire'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Strzok', 'PROPN', 'Strzok'),
 ('removes', 'VERB', 'remove'),
 ('favorite', 'ADJ', 'favorite'),
 ('target', 'NOUN', 'target'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Trump', 'PROPN', 'Trump'),
 ('ranks', 'NOUN', 'rank'),
 ('F.B.I.', 'PROPN', 'F.B.I.'),
 ('gives', 'VERB', 'give'),
 ('Mr.', 'PROPN', 'Mr.'),
 ('Bowdich', 'PROPN', 'Bowdich'),
 ('F.B.I.', 'PROPN', 'F.B.I.'),
 ('director', 'NOUN', 'director'),
 ('Christopher', 'PROPN', 'Christopher'),
 ('A.', 'PROPN', 'A.'),
 ('Wray', 'PROPN', 'Wray'),
 ('chance', 'NOUN', 'chance'),
 ('president', 'NOUN', 'president')]

In [80]:
dict([(str(x), x.label_) for x in nlp(str(sentences[19])).ents])

{'Strzok': 'PERSON',
 'Trump': 'PERSON',
 'F.B.I.': 'ORG',
 'Bowdich': 'PERSON',
 'Christopher A. Wray': 'PERSON'}
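
Note that a dict keeps one entry per distinct entity text, so repeated mentions collapse into a single key. When the counts matter, the entities can be grouped by label instead. A small sketch follows; `entity_pairs` is written out by hand from the outputs for this sentence (F.B.I. is mentioned twice), and the variable names are chosen for this example.

```python
from collections import Counter, defaultdict

# Entities of the sentence in order of appearance, typed in by hand
entity_pairs = [('Strzok', 'PERSON'), ('Trump', 'PERSON'), ('F.B.I.', 'ORG'),
                ('Bowdich', 'PERSON'), ('F.B.I.', 'ORG'),
                ('Christopher A. Wray', 'PERSON')]

# One Counter per label, so repeated mentions are kept as counts
by_label = defaultdict(Counter)
for text, label in entity_pairs:
    by_label[label][text] += 1

for label, counter in by_label.items():
    print(label, counter.most_common())
```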

In [81]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[19]])

[(Firing, 'O', ''), (Mr., 'O', ''), (Strzok, 'B', 'PERSON'), (,, 'O', ''), (however, 'O', ''), (,, 'O', ''), (removes, 'O', ''), (a, 'O', ''), (favorite, 'O', ''), (target, 'O', ''), (of, 'O', ''), (Mr., 'O', ''), (Trump, 'B', 'PERSON'), (from, 'O', ''), (the, 'O', ''), (ranks, 'O', ''), (of, 'O', ''), (the, 'O', ''), (F.B.I., 'B', 'ORG'), (and, 'O', ''), (gives, 'O', ''), (Mr., 'O', ''), (Bowdich, 'B', 'PERSON'), (and, 'O', ''), (the, 'O', ''), (F.B.I., 'B', 'ORG'), (director, 'O', ''), (,, 'O', ''), (Christopher, 'B', 'PERSON'), (A., 'I', 'PERSON'), (Wray, 'I', 'PERSON'), (,, 'O', ''), (a, 'O', ''), (chance, 'O', ''), (to, 'O', ''), (move, 'O', ''), (beyond, 'O', ''), (the, 'O', ''), (president, 'O', '')]
