### Objective
Identify whether a sentence mentions Donald Trump in any form:

> POTUS, POTUS 45, Trump, US President, Commander-in-Chief

and not these:

> Obama, POTUS 44, Commander-in-Chief of Canada 

Then see whether this can be extended to the local context. For example, Lee Hsien Loong should be identified by

> LHL, PM Lee, Prime Minister of Singapore, Prime Minister of the Republic of Singapore, <BR>
> Prime Minister (for articles written in the context of Singapore)
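
Before involving any model, the objective above can be pinned down as a plain alias table. The following is a minimal sketch only: `PEOPLE`, `mentions` and the regex patterns are hypothetical names and illustrative entries, not a curated list, and the `context` argument is an assumed way of saying "this article is about Singapore".

```python
import re

# Hypothetical alias table; the patterns are illustrative, not a curated list.
PEOPLE = {
    "Donald Trump": {
        "include": [r"\bTrump\b", r"\bPOTUS\b", r"\bUS President\b",
                    r"\bCommander-in-Chief\b"],
        # Phrases that look like a match but refer to someone else.
        "exclude": [r"\bPOTUS\s+(?!45\b)\d+\b",
                    r"\bCommander-in-Chief of Canada\b"],
        "context": {},
    },
    "Lee Hsien Loong": {
        "include": [r"\bLee Hsien Loong\b", r"\bLHL\b", r"\bPM Lee\b",
                    r"\bPrime Minister of (?:the Republic of )?Singapore\b"],
        "exclude": [],
        # Aliases that only count when the article has this context.
        "context": {"SG": [r"\bPrime Minister\b"]},
    },
}


def mentions(sentence, person, context=None):
    """Return True if `sentence` refers to `person` under the alias table."""
    spec = PEOPLE[person]
    cleaned = sentence
    for pattern in spec["exclude"]:
        # Blank out look-alikes first so "POTUS 44" cannot match "POTUS".
        cleaned = re.sub(pattern, " ", cleaned)
    patterns = spec["include"] + spec["context"].get(context, [])
    return any(re.search(p, cleaned) for p in patterns)
```

For instance, `mentions("The Prime Minister spoke.", "Lee Hsien Loong", context="SG")` is true, while the same call without a context is false. A table like this cannot tell "Trump" from "Trump Jr.", which is one reason to look at NER.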

### Installation script
```
conda create -n ner python=3.6
source activate ner
conda install -c conda-forge spacy -y
conda install ipython jupyter nb_conda nltk -y
python -m spacy download en
python -m spacy download en_core_web_sm
```

### Summary
At the very least, it can identify the people involved in each `claimReviewed` and then compare them with every sentence in the article.

I have yet to understand how a model could be trained to suit Singapore's context. See https://spacy.io/usage/training
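
The comparison step could look roughly like the sketch below. It assumes the entities have already been extracted as `(text, start_char, end_char, label)` tuples, the same shape printed by the cells further down. `people`, `same_person` and `matching_sentences` are hypothetical helper names, and the name matching is deliberately crude.

```python
def people(ents):
    """Keep the text of PERSON entities from (text, start, end, label) tuples."""
    return {text for text, _start, _end, label in ents if label == "PERSON"}


def same_person(a, b):
    """Crude match: one name's words are a subset of the other's."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return words_a <= words_b or words_b <= words_a


def matching_sentences(claim_ents, sentences_ents):
    """Return indices of sentences that mention any person from the claim."""
    claim_people = people(claim_ents)
    hits = []
    for index, ents in enumerate(sentences_ents):
        if any(same_person(c, s) for c in claim_people for s in people(ents)):
            hits.append(index)
    return hits
```

With this subset rule, a sentence that only says "Lee" is matched to a claim about "Lee Hsien Loong", but it would also match any other Lee, so it is a starting point rather than a solution.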

In [None]:
# test code from https://spacy.io/usage/linguistic-features#section-named-entities

In [None]:
import spacy

### This identifies entities, their positions in the sentence, and their types.

In [None]:
nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

### Identify where each word sits within its entity (beginning or inside); nothing quite new here

In [35]:
import spacy

nlp = spacy.load('en')
doc = nlp(u'''San Francisco considers banning Facebook, and Lee Hsien Loong who is \
the Prime Minister of Singapore is concerned about PM Lee and Prime Minister Lee Hsien Loong. ''')

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

[('San Francisco', 0, 13, 'GPE'), ('Facebook', 32, 40, 'ORG'), ('Lee Hsien Loong', 46, 61, 'PERSON'), ('Singapore', 91, 100, 'GPE'), ('Lee', 123, 126, 'PERSON'), ('Lee Hsien Loong', 146, 161, 'PERSON')]


In [33]:
import spacy

nlp = spacy.load('en')
doc = nlp(u'''We know that President Trump is evil and Trump Jr. is involved in illegal activites.''')

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

[('Trump', 23, 28, 'PERSON'), ('Trump Jr.', 41, 50, 'ORG')]


In [22]:
# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # [u'San', u'B', u'GPE']
print(ent_francisco)  # [u'Francisco', u'I', u'GPE']

['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


In [17]:
# BEGIN The first token of a multi-token entity.
# IN An inner token of a multi-token entity.
# LAST The final token of a multi-token entity.
# UNIT A single-token entity.
# OUT A non-entity token.

### How to make the algorithm recognise `FB` as an entity

"To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level."

Thoughts - we could perhaps customise this for Singapore: recognise all the notable individuals and their acronyms. Is there a way to equate relationships, such as Prime Minister of Singapore to Lee Hsien Loong?

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"FB is hiring a new Vice President of global policy from San Francisco. FB is evil.")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "FB" as an entity :(

ORG = doc.vocab.strings[u'ORG']  # get hash value of entity label
fb_ent = Span(doc, 0, 1, label=ORG)  # create a Span over token 0, the first "FB"
# append the new span to the doc's existing entities;
# the second "FB" later in the text is left untagged
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# now includes (u'FB', 0, 2, 'ORG') in addition to the entities found before

In [None]:
# PERSON	People, including fictional.
# NORP	Nationalities or religious or political groups.
# FAC	Buildings, airports, highways, bridges, etc.
# ORG	Companies, agencies, institutions, etc.
# GPE	Countries, cities, states.
# LOC	Non-GPE locations, mountain ranges, bodies of water.
# PRODUCT	Objects, vehicles, foods, etc. (Not services.)
# EVENT	Named hurricanes, battles, wars, sports events, etc.
# WORK_OF_ART	Titles of books, songs, etc.
# LAW	Named documents made into laws.
# LANGUAGE	Any named language.
# DATE	Absolute or relative dates or periods.
# TIME	Times smaller than a day.
# PERCENT	Percentage, including "%".
# MONEY	Monetary values, including unit.
# QUANTITY	Measurements, as of weight or distance.
# ORDINAL	"first", "second", etc.
# CARDINAL	Numerals that do not fall under another type.

This looks like what you would do when designing a NER algorithm: build a doc from a full set of tags.
But how would it be used?


In [3]:
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'London is a big city in the San Francisco. London is trashy.')
print('Original', list(doc.ents))

doc = nlp.make_doc(u'London is a big city in the San Francisco. London is trashy.')
print('Before', list(doc.ents))  # []

header = [ENT_IOB, ENT_TYPE]
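attr_array = numpy.zeros((len(doc), len(header)), dtype='uint64')
# one row per token (len(doc) is the number of tokens), one column per attribute

# row 0 is the first token, "London"; column 0 is ENT_IOB, column 1 is ENT_TYPE
attr_array[0, 0] = 3  # 3 is begin, 1 is inside, 2 is outside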
attr_array[0, 1] = doc.vocab.strings[u'GPE']

doc.from_array(header, attr_array)
# load the attribute array to the doc
print('After', list(doc.ents))  # [London]

doc = nlp(u'London is a big city in the San Francisco. London is trashy.')

Original [London, San Francisco, London]
Before []
After [London]
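
To see how the same array would be filled for a multi-token entity, here is a small sketch in plain Python. `iob_rows` is a hypothetical helper, the token list is an assumed tokenization of the sentence above, and labels are kept as strings (for `doc.from_array` they would still need converting with `doc.vocab.strings`).

```python
IOB_BEGIN, IOB_INSIDE, IOB_OUTSIDE = 3, 1, 2  # codes quoted in the cell above


def iob_rows(n_tokens, spans):
    """Return one [iob_code, label] row per token.

    `spans` holds (start_token, end_token, label) with an exclusive end.
    """
    rows = [[IOB_OUTSIDE, ""] for _ in range(n_tokens)]
    for start, end, label in spans:
        for i in range(start, end):
            rows[i] = [IOB_BEGIN if i == start else IOB_INSIDE, label]
    return rows


# Assumed tokenization of the example sentence (14 tokens).
tokens = ["London", "is", "a", "big", "city", "in", "the", "San",
          "Francisco", ".", "London", "is", "trashy", "."]
rows = iob_rows(len(tokens), [(0, 1, "GPE"), (7, 9, "GPE"), (10, 11, "GPE")])
for token, row in zip(tokens, rows):
    print(token, row)
```

"San" gets code 3 and "Francisco" gets code 1, which is the begin/inside pattern that the token-level `ent_iob_` values showed earlier as `B` and `I`.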


In [15]:
import spacy
from spacy import displacy

text = """But Google is starting from behind. The company made a late push \
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa \
software, which runs on its Echo and Dot devices, have clear leads in \
consumer adoption."""

nlp = spacy.load('en')
doc = nlp(text)
displacy.serve(doc, style='ent')


    Serving on port 5000...
    Using the 'ent' visualizer

127.0.0.1 - - [19/Aug/2018 22:49:02] "GET / HTTP/1.1" 200 3336
127.0.0.1 - - [19/Aug/2018 22:49:02] "GET /favicon.ico HTTP/1.1" 200 3336

    Shutting down server on port 5000.

See result on http://localhost:5000

### Something related to training
https://spacy.io/usage/training

In [11]:
# I don't understand what this does
from spacy.gold import biluo_tags_from_offsets

doc = nlp.make_doc(u'I like London. London is amazing.')
print(list(doc.ents))
entities = [(7, 13, 'LOC')]
tags = biluo_tags_from_offsets(doc, entities)
print(tags)

[]
['O', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O']
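
Judging by the output, `biluo_tags_from_offsets` turns character-offset annotations into one BILUO tag per token. The sketch below imitates that idea in plain Python to make the mapping visible. It is not spaCy's implementation: `biluo_from_offsets` is a hypothetical name, the regex tokenizer is a stand-in for spaCy's, and entities that do not line up with token boundaries are handled more loosely than spaCy does.

```python
import re


def biluo_from_offsets(text, entities):
    """Tag regex tokens of `text` with BILUO labels from (start, end, label)."""
    # Tokens are words or single punctuation marks, with character offsets.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of tokens lying fully inside the entity's character span.
        inside = [i for i, (s, e) in enumerate(tokens)
                  if s >= start and e <= end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label  # single-token entity
        else:
            tags[inside[0]] = "B-" + label  # first token
            tags[inside[-1]] = "L-" + label  # last token
            for i in inside[1:-1]:
                tags[i] = "I-" + label  # tokens in between
    return tags


print(biluo_from_offsets("I like London. London is amazing.", [(7, 13, "LOC")]))
```

Only the first "London" is tagged because the annotation covers characters 7 to 13 only. The second occurrence stays `O` unless it gets its own offsets.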


In [None]:
train_data = [('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
              ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])]
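
Counting character offsets by hand is error-prone, so a helper along these lines could build the tuples from the phrases themselves. This is a sketch under assumptions: `make_example` is a hypothetical name and it only looks at the first occurrence of each phrase.

```python
def make_example(text, labelled_phrases):
    """Build (text, [(start, end, label), ...]) from (phrase, label) pairs."""
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError("phrase not found: %r" % phrase)
        entities.append((start, start + len(phrase), label))
    return text, entities


train_data = [
    make_example("Who is Chaka Khan?", [("Chaka Khan", "PERSON")]),
    make_example("I like London and Berlin.", [("London", "LOC"), ("Berlin", "LOC")]),
]
print(train_data)
```

This reproduces the two examples above, and the same helper could take Singapore-specific phrases such as "PM Lee" once there is text to annotate.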

In [7]:
from spacy.tokens import Doc
from spacy.gold import GoldParse
doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, entities=[u'U-ANIMAL', u'O', u'O', u'O'])

In [None]:
import spacy
from spacy import displacy

text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""

nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')