<a href="https://colab.research.google.com/github/sargupta/freelance-/blob/master/Named_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Load the All Sources Metadata file

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

## Test Sample

In [0]:
ex = "European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices"

## To avoid nltk.download(punkt) ERROR

In [3]:
!python3 -c "import nltk; nltk.download('all')"


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

## Apply word tokenization and part-of-speech tagging

In [0]:
def preprocess(sent):
  sent = nltk.word_tokenize(sent)
  sent = nltk.pos_tag(sent)
  return sent

In [5]:
sent = preprocess(ex)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

We got list of tuples which includes individual words of the sentence with their associated parts-of-speech.

## Now, implement Noun Phrase Chunking to identify Named Entity

**Chunk Pattern Rule:**
A noun phrase, NP, should be formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).

In [0]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'


## Chunking

In [7]:
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


In [8]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]


In [0]:
 from nltk.chunk import ne_chunk


In [10]:
ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


## Google is recognisedc as PERSON
Not good enough. 


# **SpaCy**

## Entity

In [0]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

![Named Entity Annotations](https://drive.google.com/open?id=1rOwyXg9F-S4Wwlc6D3O3FHjIA-doDYv2)

Only need to apply nlp once, the entire background pipeline will return the 

---

objects.

In [12]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]





---



*  NORD : Nationalities or religious or political groups
*  ORG : Organisation
*  MONEY : Monetary Value
*  Date  : Datetime Object
*  GPE. : Geopolitical entity, i.e. countries, cities, states


---






## Token




---

"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

---




In [13]:
from pprint import pprint
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])


[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', '')]


## Extracting Named Entity from an Article

In [14]:
from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.nytimes.com/2013/12/10/world/asia/china-is-tied-to-spying-on-european-diplomats.html')
article = nlp(ny_bb)
len(article.ents)

114

There are 115 named entities in the article and they are represented as 11 unique values

In [15]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 10,
         'DATE': 11,
         'EVENT': 1,
         'GPE': 31,
         'LANGUAGE': 1,
         'NORP': 17,
         'ORDINAL': 1,
         'ORG': 22,
         'PERSON': 18,
         'PRODUCT': 1,
         'WORK_OF_ART': 1})

Most frequent tokens

In [16]:
items = [x.text for x in article.ents]
Counter(items).most_common(4)

[('Chinese', 9), ('FireEye', 9), ('China', 7), ('Villeneuve', 3)]

Let's randomly select one sentence to learn more.



In [17]:
sentences = [x for x in article.sents]
print(sentences[11])

The FireEye report does not link the attacks to a specific group in China, but security experts say the list of victims points to a state-affiliated campaign.


Display Raw Markup

In [18]:
displacy.render(nlp(str(sentences[49])), jupyter=True, style='ent')


Visualize

In [19]:
displacy.render(nlp(str(sentences[49])), style='dep', jupyter = True, options = {'distance': 120})

In [20]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


## Next, extract part-of-speech and lemmatize this sentence.

In [21]:
[(x.orth_,x.pos_, x.lemma_) for x in [y 
                                      for y
                                      in nlp(str(sentences[9])) 
                                      if not y.is_stop and y.pos_ != 'PUNCT']]

[('person', 'NOUN', 'person'),
 ('knowledge', 'NOUN', 'knowledge'),
 ('investigation', 'NOUN', 'investigation'),
 ('authorized', 'VERB', 'authorize'),
 ('speak', 'VERB', 'speak'),
 ('publicly', 'ADV', 'publicly'),
 ('confirmed', 'VERB', 'confirm'),
 ('foreign', 'ADJ', 'foreign'),
 ('ministries', 'NOUN', 'ministry'),
 ('countries', 'NOUN', 'country'),
 ('breached', 'VERB', 'breach')]

In [22]:
dict([(str(x), x.label_) for x in nlp(str(sentences[9])).ents])


{'five': 'CARDINAL'}

In [23]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[9]])


[(A, 'O', ''), (person, 'O', ''), (with, 'O', ''), (knowledge, 'O', ''), (of, 'O', ''), (the, 'O', ''), (investigation, 'O', ''), (,, 'O', ''), (who, 'O', ''), (was, 'O', ''), (not, 'O', ''), (authorized, 'O', ''), (to, 'O', ''), (speak, 'O', ''), (publicly, 'O', ''), (,, 'O', ''), (confirmed, 'O', ''), (that, 'O', ''), (the, 'O', ''), (foreign, 'O', ''), (ministries, 'O', ''), (of, 'O', ''), (the, 'O', ''), (five, 'B', 'CARDINAL'), (countries, 'O', ''), (had, 'O', ''), (been, 'O', ''), (breached, 'O', ''), (., 'O', '')]


# Visualize Entire Article

In [24]:
displacy.render(nlp(str(sentences)), jupyter=True, style='ent')


# Test

In [25]:
spacy.explain("GPE")

'Countries, cities, states'

In [0]:
abstract = nlp("We are ZEHUS and represent an European excellence in the field of human-electric transport vehicles. Where we want to get: BITRIDE BIKE SHARING is our pivotal long-term entrepreneurial initiative, expected to boost our business; we aim at developing a new generation of sustainable, hybrid-bike sharing programs, which are keystone for making available zero emission personal transport worldwide. Existing pain points: sharing services of conventional e-bikes are affected by highest installation costs, due to the need for fixed stations (racks, charging stations and kiosks), ranging from 70% to 80% of direct capital costs required to launch a new bike sharing project. Our solution to pain points: the basic idea is to apply in new bike sharing initiatives our award-winning BIKE\ All In One, the first worldwide powertrain for full hybrid bikes that never needs to be recharged from the grid. We gave to our creation a bridge to the cloud, through a seamless Smartphone interface (Bitride App) that uses Bluetooth connectivity.  BIKE\ AIO is the factor enabling a revolutionary solution: we are the sole company offering technology and services to create a fleet of hybrid bikes not needing fixed stations. Bike collecting areas and electrified kiosks are no longer necessary: we have designed an advanced info-mobility service offering comprehensive information supports users, who avail of the Bitride App to localise and rent the nearest bikes equipped with ad hoc padlocks (Smart Lock) for bicycle pickup and drop-off. Thanks to the SME Instrument we will enable sustainable bike sharing initiatives cutting down direct capital investment by up to 78% in comparison with conventional e-bike sharing schemes. Turnover: we plan to reach â‚¬46,82M of cumulated turnover by 2022 Key partners: Milan and Rozzano municipalities in Italy and Arnhem municipality in the Netherlands have expressed their interest in performing a pilot project and becoming early-adopter of our solution.")

In [27]:
displacy.render(nlp(str(abstract)), jupyter=True, style='ent')


In [28]:
import re

nlp = spacy.load('en')
boundary = re.compile('^[0-9]$')

def custom_seg(doc):
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        if (token.text == '.' and boundary.match(prev) and index!=(length - 1)):
            doc[index+1].sent_start = False
        prev = token.text
    return doc

nlp.add_pipe(custom_seg, before='parser')
#text = u'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
doc = abstract
for sentence in doc.sents:
    print(sentence.text)

We are ZEHUS and represent an European excellence in the field of human-electric transport vehicles.
Where we want to get: BITRIDE BIKE SHARING is our pivotal long-term entrepreneurial initiative, expected to boost our business; we aim at developing a new generation of sustainable, hybrid-bike sharing programs, which are keystone for making available zero emission personal transport worldwide.
Existing pain points:
sharing services of conventional e-bikes are affected by highest installation costs, due to the need for fixed stations (racks, charging stations and kiosks), ranging from 70% to 80% of direct capital costs required to launch a new bike sharing project.
Our solution to pain points: the basic idea is to apply in new bike sharing initiatives our award-winning BIKE\ All
In One, the first worldwide powertrain for full hybrid bikes that never needs to be recharged from the grid.
We gave to our creation a bridge to the cloud, through a seamless Smartphone interface (Bitride App) t

In [0]:
sen = nlp("BlaBla, I love you, love from India")

In [49]:
displacy.render(nlp(str(sen)), style='dep', jupyter = True, options = {'distance': 120})

In [41]:
displacy.render(nlp(str(sen)), jupyter=True, style='ent')


In [0]:
sen_abhi = "I am Abhishek Gupta who is Data Scientist by profession and lives in Lisbon"