# ELM Proof of Concept #3: Named Entity Recognition

* October 30th, 2018
* Ryan Kazmerik, Strategic EIM

## Hypothesis
Named entity recognition can be used to extract meaningful pieces of information (tags) from our articles, which can then be used to create relationships with other topics and/or articles.

We will test this hypothesis using three popular NER implementations and our ~15,000 news articles from the News-API dataset.

### Research
**1. NLTK (Natural Language Toolkit)**
* popular open source library
* uses POS (part-of-speech) tagging to extract entities
* very fast, but not as accurate as other implementations
<br/>[Testing NER taggers for speed/accuracy](https://pythonprogramming.net/testing-stanford-ner-taggers-for-speed/?completed=/testing-stanford-ner-taggers-for-accuracy)<br/><br/>

**2. Stanford NER**
* developed by the NLP group at Stanford
* uses CRF (conditional random fields) to extract entities
* most accurate, often seen as the gold standard
<br/>[NER for Unstructured Documents](https://medium.com/@dudsdu/named-entity-recognition-for-unstructured-documents-c325d47c7e3a)<br/><br/>

**3. SpaCy**
* open source, maintained by Explosion AI
* uses a CNN (convolutional neural network) to extract entities
* balances accuracy and speed, and supports many entity types
<br/>[SpaCy NER - Linguistic Features](https://spacy.io/usage/linguistic-features#section-named-entities)

# Experiments

We will measure the speed and the total number of extracted entities for all three implementations. Let's load our ~15,000 articles from Elasticsearch:

In [62]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

docs = es.search(
    index='articles', 
    doc_type='article',
    filter_path=['hits.hits'],
    _source_include='description,title',
    sort='_id',
    size=20000
)

articles = []

# Keep only the articles that have a non-empty description
for d in docs['hits']['hits']:
    desc = d["_source"]["description"]
    if desc:
        articles.append(desc)
    
print("No. training articles:",len(articles))
print()
print("Sample description:",articles[21])

No. training articles: 14512

Sample description: A private equity firm backed by some of the world’s largest utilities has raised $681 million to finance startups developing clean-energy technology.
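
The filtering step above can be tried without a live cluster. The snippet below is a minimal sketch, assuming the response has the `hits.hits[]._source.description` shape used in the query. The `extract_descriptions` helper and the `sample_response` dictionary are hypothetical and exist only for illustration.

```python
def extract_descriptions(response):
    """Return the non-empty descriptions from an Elasticsearch-style response dict."""
    descriptions = []
    for hit in response.get("hits", {}).get("hits", []):
        # .get() tolerates documents that lack a description field entirely
        desc = hit.get("_source", {}).get("description")
        if desc and desc.strip():
            descriptions.append(desc.strip())
    return descriptions


# Made-up response used only to exercise the helper
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"title": "A", "description": "Wind farm opens."}},
            {"_source": {"title": "B", "description": None}},
            {"_source": {"title": "C"}},
            {"_source": {"title": "D", "description": "  "}},
        ]
    }
}

print(extract_descriptions(sample_response))
```

Unlike the cell above, this version also skips documents with a missing field or a whitespace-only description, which may or may not be wanted for our dataset.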


## NLTK
**Let's start with implementing the NLTK tagger**

First we need to tokenize the text (split each sentence into words) and tag each word with its part of speech (e.g. gained = verb).

In [3]:
import nltk
import os
import numpy as np
from nltk import pos_tag
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

def process_text(article):
    token_text = word_tokenize(article)
    return token_text

# NLTK tagger   
def nltk_tagger(token_text):
    tagged_words = nltk.pos_tag(token_text)
    ne_tagged = nltk.ne_chunk(tagged_words)
    return(ne_tagged)

# Stanford NER tagger    
def stanford_tagger(token_text):
    st = StanfordNERTagger('../classes/english.all.3class.distsim.crf.ser.gz',
                            '../classes/stanford-ner.jar',
                            encoding='utf-8')   
    ne_tagged = st.tag(token_text)
    return(ne_tagged)
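
`nltk.ne_chunk` returns a tree whose children are either plain `(word, pos)` tuples or labelled subtrees, one per entity. One possible way to collect the entities, including multi-word ones, is to walk the children and join the leaves of each subtree. The sketch below is only an illustration: `extract_entities` is a hypothetical helper, and `StubTree` is a hand-built stand-in that mimics the `label()` and `leaves()` methods of `nltk.Tree` so the example runs without NLTK's model downloads.

```python
class StubTree(list):
    """Minimal stand-in for nltk.Tree (hypothetical, for illustration only)."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        # One level of nesting is enough for this sketch
        return list(self)


def extract_entities(chunked):
    """Collect (label, text) pairs from a chunked sentence."""
    entities = []
    for node in chunked:
        # Entity subtrees expose label(); plain tokens are (word, pos) tuples
        if hasattr(node, "label"):
            text = " ".join(word for word, _pos in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built example in the shape ne_chunk produces (not real tagger output)
chunked = StubTree("S", [
    ("In", "IN"),
    StubTree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (",", ","),
    StubTree("ORGANIZATION", [("Acme", "NNP")]),
])

print(extract_entities(chunked))
```

With NLTK installed, the same function should accept the value returned by `nltk_tagger` directly, since `nltk.Tree` is a list subclass with the same two methods.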

**All NER taggers include 3 types of tags: PERSON, LOCATION and ORGANIZATION.**

Let's run the NER tagger for NLTK, Stanford and SpaCy and see an example result:

In [4]:
from time import time

NLTK = nltk_tagger(process_text(articles[22]))
STAN = stanford_tagger(process_text(articles[22]))
SPACY = nlp(articles[22])

print("NLTK results:",NLTK)
print()
print("Stanford results:",STAN)
print()
print("SpaCy results:",SPACY)
print([(X.text, X.label_) for X in SPACY.ents])

NLTK results: (S
  In/IN
  (GPE Oregon/NNP)
  ,/,
  (GPE Illinois/NNP)
  ,/,
  an/DT
  old/JJ
  farmhouse/NN
  just/RB
  outside/IN
  of/IN
  town/NN
  is/VBZ
  a/DT
  hotbed/NN
  of/IN
  alternative/JJ
  energy/NN
  ./.
  It/PRP
  's/VBZ
  primarily/RB
  heated/VBN
  by/IN
  wood/NN
  and/CC
  they/PRP
  have/VBP
  a/DT
  wind/NN
  generator/NN
  ,/,
  although/IN
  it/PRP
  's/VBZ
  mainly/RB
  used/VBN
  for/IN
  educational/JJ
  purposes/NNS
  ./.)

Stanford results: [('In', 'O'), ('Oregon', 'LOCATION'), (',', 'O'), ('Illinois', 'LOCATION'), (',', 'O'), ('an', 'O'), ('old', 'O'), ('farmhouse', 'O'), ('just', 'O'), ('outside', 'O'), ('of', 'O'), ('town', 'O'), ('is', 'O'), ('a', 'O'), ('hotbed', 'O'), ('of', 'O'), ('alternative', 'O'), ('energy', 'O'), ('.', 'O'), ('It', 'O'), ("'s", 'O'), ('primarily', 'O'), ('heated', 'O'), ('by', 'O'), ('wood', 'O'), ('and', 'O'), ('they', 'O'), ('have', 'O'), ('a', 'O'), ('wind', 'O'), ('generator', 'O'), (',', 'O'), ('although', 'O'), ('it', 'O

**All 3 taggers identified the same 2 entities (Oregon and Illinois) as locations; NLTK labels them GPE, while Stanford labels them LOCATION.**
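
The Stanford tagger returns one label per token, so a multi-word name comes back as several tagged tokens. A possible way to turn that output into entities is to merge runs of adjacent tokens that share a label. This is a minimal sketch using `itertools.groupby`; `merge_tagged_tokens` is a hypothetical helper, and the input is the first few tokens of the Stanford output shown above.

```python
from itertools import groupby


def merge_tagged_tokens(tagged, outside="O"):
    """Merge runs of adjacent tokens that share a label into (text, label) entities."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in group), label))
    return entities


tagged = [("In", "O"), ("Oregon", "LOCATION"), (",", "O"),
          ("Illinois", "LOCATION"), (",", "O"), ("an", "O"), ("old", "O")]

print(merge_tagged_tokens(tagged))
```

One limitation of this approach: two distinct entities of the same type that sit directly next to each other would be merged into one.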

Now, let's run all 3 taggers on 5, 10, 50, 100, 500, 1000, 5000 and 10000 articles and compare the number of total entities extracted and performance:

In [60]:
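# These lists are created once and never reset, so the entity counts
# printed for each batch size also include the results of earlier batches
NLTK_results = []
STAN_results = []
SPACY_results = []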

print("Processing.....")

n_articles = [5, 10, 50, 100, 500, 1000, 5000, 10000]

results = []
for n in n_articles:

    t0 = time()
    for i,a in enumerate(articles[:n]):
        NLTK_results.append(nltk_tagger(process_text(a)))
    NLTK_time = time()-t0

    t0 = time()
    for i,a in enumerate(articles[:n]):
        STAN_results.append(stanford_tagger(process_text(a)))
    STAN_time = time()-t0

    t0 = time()
    for i,a in enumerate(articles[:n]):
        for X in nlp(a).ents:
            SPACY_results.append([(X.text, X.label_)])
    SPACY_time = time()-t0

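    # ne_chunk yields (word, tag) tuples (len 2) and labelled entity subtrees;
    # len(tag)==1 therefore keeps single-token entity subtrees only
    NLTK_ents = []
    for tagged in (NLTK_results):
        for tag in tagged:
            if(len(tag)==1):
                NLTK_ents.append(tag)

    # Counts tagged tokens rather than merged entities. The 3-class Stanford model
    # labels places LOCATION (see the sample output above), so "GPE" never matches
    STAN_ents = []
    for tagged in (STAN_results):
        for tag in tagged:
            if(tag[1]=="GPE" or tag[1]=="PERSON" or tag[1]=="ORGANIZATION"):
                STAN_ents.append(tag)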
    
    SPACY_ents = []
    for tagged in (SPACY_results):
        e = tagged[0][1]
        if(e=="GPE" or e=="ORG" or e=="PERSON"):
            SPACY_ents.append(tagged)
            
    results.append([[len(NLTK_ents), len(STAN_ents), len(SPACY_ents)],[NLTK_time, STAN_time, SPACY_time]])
    
    print("Number of articles:", n)
    print("NLTK found entities:", len(NLTK_ents), "in:", round(NLTK_time,2), 'seconds')
    print("STAN found entities:", len(STAN_ents), "in:", round(STAN_time,2), 'seconds')
    print("SpaCy found entities:", len(SPACY_ents), "in:", round(SPACY_time,2), 'seconds')
    print()

Processing.....
Number of articles: 5
NLTK found entities: 6 in: 0.06 seconds
STAN found entities: 10 in: 9.19 seconds
SpaCy found entities: 2 in: 0.08 seconds

Number of articles: 10
NLTK found entities: 19 in: 0.14 seconds
STAN found entities: 30 in: 18.27 seconds
SpaCy found entities: 6 in: 0.16 seconds

Number of articles: 50
NLTK found entities: 101 in: 0.56 seconds
STAN found entities: 157 in: 91.92 seconds
SpaCy found entities: 85 in: 0.78 seconds

Number of articles: 100
NLTK found entities: 274 in: 1.11 seconds
STAN found entities: 463 in: 190.84 seconds
SpaCy found entities: 279 in: 1.67 seconds

Number of articles: 500
NLTK found entities: 1183 in: 4.75 seconds
STAN found entities: 1908 in: 973.96 seconds
SpaCy found entities: 1360 in: 7.82 seconds

Number of articles: 1000
NLTK found entities: 2749 in: 8.05 seconds
STAN found entities: 4142 in: 1909.95 seconds
SpaCy found entities: 3137 in: 14.63 seconds

Number of articles: 5000
NLTK found entities: 13690 in: 41.57 seconds
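
For reference, a timing loop of this kind can be written so that each batch size starts from an empty result list, which keeps the counts independent between batches. The sketch below is an assumption-based illustration, not the code that produced the log above: `benchmark` is a hypothetical helper and `toy_tagger` is a trivial stand-in for a real NER tagger.

```python
from time import perf_counter


def benchmark(tagger, texts, batch_sizes):
    """Time `tagger` on the first n texts for each n, counting entities per batch."""
    rows = []
    for n in batch_sizes:
        found = []  # reset for every batch so counts are not cumulative
        start = perf_counter()
        for text in texts[:n]:
            found.extend(tagger(text))
        elapsed = perf_counter() - start
        rows.append({"n": n, "entities": len(found), "seconds": elapsed})
    return rows


def toy_tagger(text):
    """Hypothetical stand-in: treats capitalised words as entities."""
    return [w for w in text.split() if w[:1].isupper()]


texts = ["Oregon and Illinois", "a wind generator", "GE Renewable Energy"]
for row in benchmark(toy_tagger, texts, batch_sizes=[1, 2, 3]):
    print(row["n"], row["entities"], round(row["seconds"], 4))
```

`perf_counter` is used here instead of `time` because it is intended for measuring short intervals.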

# Results
**Extracting the entities took far longer with the Stanford NER tagger (about 1,910 seconds for 1,000 articles, versus about 8 for NLTK and 15 for SpaCy), which could be a significant factor in how we implement NER within our NLP pipeline.**

Let's have a look at how this performance scales as we process more articles:

In [63]:
import plotly.plotly as py
import plotly.graph_objs as go


nltk_x, nltk_y, stan_x, stan_y, spac_x, spac_y = [],[],[],[],[],[]

for r in results:
    nltk_x.append(r[0][0])
    nltk_y.append(r[1][0])
    stan_x.append(r[0][1])
    stan_y.append(r[1][1])
    spac_x.append(r[0][2])
    spac_y.append(r[1][2])
    
trace1 = go.Scatter(name='NLTK', x=nltk_x, y=nltk_y)
trace2 = go.Scatter(name='Stanford', x=stan_x, y=stan_y)
trace3 = go.Scatter(name='SpaCy', x=spac_x, y=spac_y)

data = [trace1, trace2, trace3]

layout = go.Layout(title = 'No. Entities vs. Time',
    yaxis = dict(title='Time in seconds'),
    xaxis = dict(title='No. Entities'),
)

fig = dict(data=data, layout=layout)
py.iplot(fig)
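
The six parallel lists built above can also be derived in one pass over `results`. The sketch below shows one possible way, assuming each row has the `[[counts], [times]]` shape created in the experiment loop; the two rows are copied from the 5- and 10-article log lines, and the `series` dictionary is a hypothetical structure for illustration.

```python
# Rows in the same shape as `results`:
# [[nltk_count, stan_count, spacy_count], [nltk_time, stan_time, spacy_time]]
results = [
    [[6, 10, 2], [0.06, 9.19, 0.08]],
    [[19, 30, 6], [0.14, 18.27, 0.16]],
]

names = ["NLTK", "Stanford", "SpaCy"]

# One x/y series per tagger: x = entity counts, y = elapsed seconds
series = {}
for idx, name in enumerate(names):
    series[name] = {
        "x": [counts[idx] for counts, _times in results],
        "y": [times[idx] for _counts, times in results],
    }

for name in names:
    print(name, series[name])
```

Each entry could then be passed to `go.Scatter(name=name, x=..., y=...)` as in the cell above.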

# Observations

* The Stanford NER tagger takes *significantly* longer as more documents are analyzed, reaching 18992.25 seconds for 10,000 articles, which is roughly 5 1/4 hours.
* The NLTK tagger had the best runtime performance, approximately 57% better than SpaCy.
* SpaCy retrieved more entities than NLTK, approximately 8.5% more.
<br/><br/>
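
Relative differences like these can be checked directly from the log. The sketch below works through the arithmetic for the 1,000-article batch only, because the log lines for the larger batches are cut off above; the percentages it prints apply to that batch and need not equal the figures quoted for the full run. `pct_more` is a hypothetical helper.

```python
def pct_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# Figures copied from the 1,000-article log lines above
nltk_seconds, spacy_seconds = 8.05, 14.63
nltk_entities, spacy_entities = 2749, 3137

# SpaCy's extra runtime relative to NLTK
print(round(pct_more(spacy_seconds, nltk_seconds), 1))
# SpaCy's extra entities relative to NLTK
print(round(pct_more(spacy_entities, nltk_entities), 1))
# Stanford's 10,000-article runtime converted from seconds to hours
print(round(18992.25 / 3600, 2))
```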

**Because of the dramatic performance cost of the Stanford tagger, we will only consider NLTK and SpaCy.**

Let's have a closer look at the quality of the entities for each method:

In [57]:
k = 60

NLTK = nltk_tagger(process_text(articles[k]))
SPACY = nlp(articles[k])

print("Sample article:", articles[k], end="\n\n")
print("NLTK results:")
for tag in NLTK:
    if(len(tag)==1):
        print(tag)
print()

print("SpaCy results:")
for X in SPACY.ents:
    if(X.label_=="GPE" or X.label_=="ORG" or X.label_=="PERSON"):
        print((X.label_, X.text))

Sample article: Yves Rannou, formerly president and chief executive officer of hydro for GE Renewable Energy, has taken a position as CEO of Senvion S.A., a manufacturer of wind turbines.

NLTK results:
(PERSON Rannou/NNP)
(ORGANIZATION CEO/NNP)
(ORGANIZATION Senvion/NNP)

SpaCy results:
('PERSON', 'Yves Rannou')
('ORG', 'GE Renewable Energy')
('ORG', 'Senvion S.A.')


**The quality of the SpaCy entities seems better than NLTK on this example because:**
* NLTK only identifies the last name of Yves Rannou, whereas SpaCy identifies the whole name.
* NLTK thinks 'CEO' is an organization.
* NLTK only identifies the first part of the company name Senvion S.A., whereas SpaCy identifies the whole name.
* SpaCy identifies GE Renewable Energy as a company, whereas NLTK does not.
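
A comparison like the one above can be partly automated. The sketch below is a rough illustration, assuming each tagger's entities have been reduced to a set of strings (here copied from the sample output). It pairs each shorter entity with a longer one that contains it as whole words; `contains_words` and `partial_matches` are hypothetical helpers.

```python
nltk_entities = {"Rannou", "CEO", "Senvion"}
spacy_entities = {"Yves Rannou", "GE Renewable Energy", "Senvion S.A."}


def contains_words(haystack, needle):
    """True if the word list `needle` occurs contiguously inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def partial_matches(short_set, long_set):
    """Pair each entity in short_set with longer entities that contain it."""
    pairs = []
    for short in sorted(short_set):
        for long_name in sorted(long_set):
            if short != long_name and contains_words(long_name.split(), short.split()):
                pairs.append((short, long_name))
    return pairs


pairs = partial_matches(nltk_entities, spacy_entities)
print("Partial matches:", pairs)

# Entities with no counterpart in the other tagger's output
print("NLTK only:", sorted(nltk_entities - {s for s, _ in pairs}))
print("SpaCy only:", sorted(spacy_entities - {l for _, l in pairs}))
```

String containment is only a heuristic; it says nothing about which tagger is right, so a manual check like the list above is still needed.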

In addition, SpaCy can also identify 18 entity types, only 3 of which are available in NLTK. These entity types include:
1. PERSON - People, including fictional.
2. NORP - Nationalities or religious or political groups.
3. FAC - Buildings, airports, highways, bridges, etc.
4. ORG - Companies, agencies, institutions, etc.
5. GPE - Countries, cities, states.
6. LOC - Non-GPE locations, mountain ranges, bodies of water.
7. PRODUCT - Objects, vehicles, foods, etc. (Not services.)
8. EVENT - Named hurricanes, battles, wars, sports events, etc.
9. WORK_OF_ART - Titles of books, songs, etc.
10. LAW - Named documents made into laws.
11. LANGUAGE - Any named language.
12. DATE - Absolute or relative dates or periods.
13. TIME - Times smaller than a day.
14. PERCENT - Percentage, including "%".
15. MONEY - Monetary values, including unit.
16. QUANTITY - Measurements, as of weight or distance.
17. ORDINAL - "first", "second", etc.
18. CARDINAL - Numerals that do not fall under another type.
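
With this many labels, it helps to see how often each one occurs and to keep only the types of interest. The sketch below assumes the entities have already been collected as `(text, label)` pairs, for example with `[(X.text, X.label_) for X in doc.ents]` as in the cells above. The first three pairs come from the sample output; the last two are made-up examples, and the `KEEP` set is a hypothetical choice.

```python
from collections import Counter

# Entities as (text, label) pairs
entities = [
    ("Yves Rannou", "PERSON"),
    ("GE Renewable Energy", "ORG"),
    ("Senvion S.A.", "ORG"),
    ("$681 million", "MONEY"),  # made-up example
    ("October", "DATE"),        # made-up example
]

# Labels to keep; the three types compared in the experiment
KEEP = {"PERSON", "ORG", "GPE"}

label_counts = Counter(label for _text, label in entities)
kept = [(text, label) for text, label in entities if label in KEEP]

print(label_counts.most_common())
print(kept)
```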

**To get a better sense of the SpaCy entities we can use the built-in DisplaCy visualizer on a few sample articles:**

In [59]:
displacy.render(nlp(articles[30]), jupyter=True, style='ent')
print()
print()
displacy.render(nlp(articles[38]), jupyter=True, style='ent')
print()
print()
displacy.render(nlp(articles[40]), jupyter=True, style='ent')
print()
print()

# Conclusion

* Of the three taggers, SpaCy offers the best balance between entity quality and performance.
<br/><br/>

* Entities could provide useful relationships for building a classifier, but may need to be combined with other metadata (sentiment or topics) to be valuable.
<br/><br/>

* Incorporating the SpaCy NER tagger into our NLP pipeline would be efficient enough to process thousands, if not millions, of articles.
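
To illustrate how extracted entities could link articles together, the sketch below builds an inverted index from entity to article and ranks other articles by the number of entities they share. It is only an illustration under simple assumptions (exact string matches, no weighting); the article ids, the entity sets and both helper functions are hypothetical.

```python
from collections import defaultdict


def build_entity_index(article_entities):
    """Map each entity to the set of article ids that mention it."""
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity].add(article_id)
    return index


def related_articles(article_id, article_entities, index):
    """Rank other articles by the number of entities shared with `article_id`."""
    shared = defaultdict(int)
    for entity in article_entities[article_id]:
        for other in index[entity]:
            if other != article_id:
                shared[other] += 1
    # Most shared entities first; ties broken by article id
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Made-up article ids and entity sets, for illustration only
article_entities = {
    "a1": {"Senvion S.A.", "Yves Rannou"},
    "a2": {"Senvion S.A.", "GE Renewable Energy", "Yves Rannou"},
    "a3": {"GE Renewable Energy", "Oregon"},
}

index = build_entity_index(article_entities)
print(related_articles("a2", article_entities, index))
```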

# Further Improvements
1. We could train new entity types for products (oil, gas, CBM, etc.) to help correlate articles with mentions of price increases or decreases.
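
Before training a model, a simple term list could serve as a baseline for the product idea. The sketch below is a rule-based stand-in and not a trained entity type: it matches a hypothetical list of product terms with a regular expression and reports each hit with its character offsets. `PRODUCT_TERMS` and `find_products` are assumptions for illustration.

```python
import re

# Hypothetical product terms; a real list would come from domain experts
PRODUCT_TERMS = ["oil", "gas", "cbm"]

# Whole-word, case-insensitive match on any of the terms
PRODUCT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in PRODUCT_TERMS) + r")\b",
    flags=re.IGNORECASE,
)


def find_products(text):
    """Return (matched_text, start, end, 'PRODUCT') tuples for each term hit."""
    return [(m.group(0), m.start(), m.end(), "PRODUCT")
            for m in PRODUCT_PATTERN.finditer(text)]


print(find_products("Gas prices rose while oil fell."))
```

A term list cannot tell a product mention from an unrelated use of the same word, which is the gap a trained entity type would be expected to close.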
