# Named Entity Recognition (NER)

## Tutorial on identifying named entities in text data (NLP)

### Data Science With Raghav
**YouTube channel: https://www.youtube.com/channel/UC86OgfmVfguW69_0uIEffVQ/**

## What is a Named Entity?

**Any word or phrase that represents a person, organization, location, etc. is a named entity.**
**Named entity recognition is a subtask of information extraction: the process of identifying which words in a given text are named entities.**
**It is also called entity identification or entity chunking.**

### Example

**"Apple acquired Zoom in China on Wednesday 6th May 2020"**

- Here the named entities are Apple, Zoom, China and Wednesday 6th May 2020
- Named entity recognition is the task of identifying these words in the text
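To make the task concrete, here is a toy sketch that finds the entities in the example sentence with a hand-written lookup table. It is not one of the approaches covered in this tutorial, and the `gazetteer` contents, the labels and the `find_entities` helper are all assumptions made up for illustration.

```python
# Toy illustration only: a hand-written lookup table (gazetteer), not a real NER model.
sentence = "Apple acquired Zoom in China on Wednesday 6th May 2020"

# Hypothetical gazetteer: the labels are assumptions chosen for this example.
gazetteer = {
    "Apple": "ORGANIZATION",
    "Zoom": "ORGANIZATION",
    "China": "LOCATION",
    "Wednesday 6th May 2020": "DATE",
}

def find_entities(text, lookup):
    """Return (entity, label, start, end) for the first match of each lookup phrase."""
    found = []
    for phrase, label in lookup.items():
        start = text.find(phrase)
        if start != -1:
            found.append((phrase, label, start, start + len(phrase)))
    # Sort by position so the entities come back in reading order
    return sorted(found, key=lambda item: item[2])

for entity, label, start, end in find_entities(sentence, gazetteer):
    print(f"{entity!r:28} {label:13} chars {start}-{end}")
```

A lookup table only finds what it already knows about, which is why real systems rely on trained models instead.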

## Why is it important?

**To understand the meaning of a given text (for example, a tweet or a document), it is important to identify who did what to whom. Named entity recognition is the first step: it identifies the words that may represent the who, what and whom in the text, and so helps pin down the major entities the text is talking about.**

**Any NLP task that involves automatically understanding text and acting on it needs named entity recognition in its pipeline.**

### Caveat

**No algorithm identifies every named entity correctly 100% of the time**
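A common way to quantify how far a tagger is from perfect is to compare its output with a hand-labelled ("gold") set and compute precision and recall. The sketch below uses the entities from the example sentence as the gold set; the `predicted` set is a hypothetical tagger output invented for illustration.

```python
# Entities a human would mark in the example sentence (taken from the Example section)
gold = {"Apple", "Zoom", "China", "Wednesday 6th May 2020"}

# Hypothetical tagger output, invented for illustration: it misses two entities
predicted = {"Apple", "China"}

true_positives = gold & predicted
precision = len(true_positives) / len(predicted)  # share of predictions that are right
recall = len(true_positives) / len(gold)          # share of gold entities that were found

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here everything the tagger returned is correct (high precision), but it found only half of what it should have (low recall).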

### Three approaches

- **Basic NLTK algorithm**
    - with word segmentation
    - with sentence segmentation
- **Stanford NLP NER**
- **Using spaCy**

**Dependencies**

In [71]:
import nltk
import pandas as pd
nltk.download('tagsets')

**Data**

In [72]:
text = "Apple acquired Zoom in China on Wednesday 6th May 2020.\
This news has made Apple and Google stock jump by 5% on Dow Jones Index in the \
United States of America"
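A detail worth noticing in the string above: a backslash at the end of a line inside a string literal joins the next line on directly, with no whitespace added. The first line ends in `2020.` and the next starts with `This`, so the text contains `2020.This`, which shows up as a single token in the outputs that follow. A minimal sketch of the effect with shortened strings (`joined` and `spaced` are names made up for this illustration):

```python
# Backslash continuation inside a string literal: the newline is dropped, nothing is inserted
joined = "May 2020.\
This news"

# Implicit concatenation of adjacent literals, with the space written explicitly
spaced = ("May 2020. "
          "This news")

print(joined.split())  # whitespace split keeps '2020.This' together
print(spaced.split())
```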

**Basic Named Entity (NE) tagging using NLTK - Word based**

In [73]:
#tokenize to words
words = nltk.word_tokenize(text)
words

['Apple',
 'acquired',
 'Zoom',
 'in',
 'China',
 'on',
 'Wednesday',
 '6th',
 'May',
 '2020.This',
 'news',
 'has',
 'made',
 'Apple',
 'and',
 'Google',
 'stock',
 'jump',
 'by',
 '5',
 '%',
 'on',
 'Dow',
 'Jones',
 'Index',
 'in',
 'the',
 'United',
 'States',
 'of',
 'America']

In [74]:
#Part of speech tagging
pos_tags = nltk.pos_tag(words)
pos_tags

[('Apple', 'NNP'),
 ('acquired', 'VBD'),
 ('Zoom', 'NNP'),
 ('in', 'IN'),
 ('China', 'NNP'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('6th', 'CD'),
 ('May', 'NNP'),
 ('2020.This', 'CD'),
 ('news', 'NN'),
 ('has', 'VBZ'),
 ('made', 'VBN'),
 ('Apple', 'NNP'),
 ('and', 'CC'),
 ('Google', 'NNP'),
 ('stock', 'NN'),
 ('jump', 'NN'),
 ('by', 'IN'),
 ('5', 'CD'),
 ('%', 'NN'),
 ('on', 'IN'),
 ('Dow', 'NNP'),
 ('Jones', 'NNP'),
 ('Index', 'NNP'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('of', 'IN'),
 ('America', 'NNP')]

In [75]:
#check nltk help for description of the tag
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
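The proper-noun tags (`NNP`, `NNPS`) are a useful hint for where named entities might be. As a rough sketch, and not what `ne_chunk` does internally, consecutive proper nouns from the tagged output above can be grouped into candidate phrases. The `is_proper_noun` helper and the `candidates` name are assumptions for this example.

```python
from itertools import groupby

# A slice of the tagged output above, copied by hand
pos_tags = [('made', 'VBN'), ('Apple', 'NNP'), ('and', 'CC'), ('Google', 'NNP'),
            ('stock', 'NN'), ('jump', 'NN'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN'),
            ('on', 'IN'), ('Dow', 'NNP'), ('Jones', 'NNP'), ('Index', 'NNP'),
            ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'),
            ('of', 'IN'), ('America', 'NNP')]

def is_proper_noun(pair):
    """True when the (word, tag) pair carries a proper-noun tag."""
    return pair[1] in ("NNP", "NNPS")

# groupby splits the sequence into runs of proper nouns and runs of everything else
candidates = [' '.join(word for word, _ in run)
              for proper, run in groupby(pos_tags, key=is_proper_noun) if proper]
print(candidates)
```

On the full sentence, words such as `Wednesday` and `May` would also pass this filter, which is one reason a trained chunker is needed on top of the POS tags.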


**ne_chunk**

**Binary=True**

In [76]:
chunks = nltk.ne_chunk(pos_tags, binary=True) #either NE or not NE
for chunk in chunks:
    print(chunk)

(NE Apple/NNP)
('acquired', 'VBD')
('Zoom', 'NNP')
('in', 'IN')
(NE China/NNP)
('on', 'IN')
('Wednesday', 'NNP')
('6th', 'CD')
('May', 'NNP')
('2020.This', 'CD')
('news', 'NN')
('has', 'VBZ')
('made', 'VBN')
(NE Apple/NNP)
('and', 'CC')
(NE Google/NNP)
('stock', 'NN')
('jump', 'NN')
('by', 'IN')
('5', 'CD')
('%', 'NN')
('on', 'IN')
('Dow', 'NNP')
('Jones', 'NNP')
('Index', 'NNP')
('in', 'IN')
('the', 'DT')
(NE United/NNP States/NNPS)
('of', 'IN')
(NE America/NNP)


In [77]:
entities =[]
labels =[]
for chunk in chunks:
    if hasattr(chunk,'label'):
        #print(chunk)
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())
        
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Apple | NE |
| 1 | America | NE |
| 2 | United States | NE |
| 3 | China | NE |
| 4 | Google | NE |
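The rows above are not in the order the entities appear in the text because `set` does not preserve order. If reading order matters, one option is `dict.fromkeys`, which removes duplicates while keeping first-seen order. A small sketch with the pairs hard-coded from the chunker output above (`pairs` and `unique_in_order` are illustrative names):

```python
# (entity, label) pairs in the order the chunker produced them, duplicates included
pairs = [("Apple", "NE"), ("China", "NE"), ("Apple", "NE"),
         ("Google", "NE"), ("United States", "NE"), ("America", "NE")]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(pairs))
print(unique_in_order)
```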


**Why did it miss Zoom?**

**Binary = False**

In [78]:
chunks = nltk.ne_chunk(pos_tags, binary=False) #typed labels such as PERSON, ORGANIZATION, GPE
for chunk in chunks:
    print(chunk)
    
entities =[]
labels =[]
for chunk in chunks:
    if hasattr(chunk,'label'):
        #print(chunk)
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())
        
entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df

(PERSON Apple/NNP)
('acquired', 'VBD')
(PERSON Zoom/NNP)
('in', 'IN')
(GPE China/NNP)
('on', 'IN')
('Wednesday', 'NNP')
('6th', 'CD')
('May', 'NNP')
('2020.This', 'CD')
('news', 'NN')
('has', 'VBZ')
('made', 'VBN')
(PERSON Apple/NNP)
('and', 'CC')
(ORGANIZATION Google/NNP)
('stock', 'NN')
('jump', 'NN')
('by', 'IN')
('5', 'CD')
('%', 'NN')
('on', 'IN')
(PERSON Dow/NNP Jones/NNP Index/NNP)
('in', 'IN')
('the', 'DT')
(GPE United/NNP States/NNPS)
('of', 'IN')
(GPE America/NNP)


| | Entities | Labels |
|---|---|---|
| 0 | Apple | PERSON |
| 1 | Google | ORGANIZATION |
| 2 | Zoom | PERSON |
| 3 | China | GPE |
| 4 | Dow Jones Index | PERSON |
| 5 | United States | GPE |
| 6 | America | GPE |
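With `binary=False` each entity carries a type, so the result can also be read by type. The sketch below hard-codes the table above and groups the entities by label; `by_label` is a name chosen for illustration.

```python
from collections import defaultdict

# (entity, label) pairs copied from the table above
entities_labels = [("Apple", "PERSON"), ("Google", "ORGANIZATION"), ("Zoom", "PERSON"),
                   ("China", "GPE"), ("Dow Jones Index", "PERSON"),
                   ("United States", "GPE"), ("America", "GPE")]

# Collect the entity names under each label
by_label = defaultdict(list)
for entity, label in entities_labels:
    by_label[label].append(entity)

for label, names in by_label.items():
    print(f"{label}: {', '.join(names)}")
```

Grouped this way, it is easy to see that Apple, Zoom and Dow Jones Index were all typed as PERSON, which is wrong for all three.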


**Basic Named Entity (NE) tagging using NLTK - Sentence based**

In [79]:
entities = []
labels = []

sentence = nltk.sent_tokenize(text)
for sent in sentence:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)),binary=False):
        if hasattr(chunk,'label'):
            entities.append(' '.join(c[0] for c in chunk))
            labels.append(chunk.label())
            
entities_labels = list(set(zip(entities,labels)))

entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities","Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Apple | PERSON |
| 1 | Google | ORGANIZATION |
| 2 | Zoom | PERSON |
| 3 | China | GPE |
| 4 | Dow Jones Index | PERSON |
| 5 | United States | GPE |
| 6 | America | GPE |


### Stanford NLP NER

**Installation and Configuration:**
https://medium.com/manash-en-blog/configuring-stanford-parser-and-stanford-ner-tagger-with-nltk-in-python-on-windows-f685483c374a

**Stanford link:** https://nlp.stanford.edu/software/CRF-NER.html

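**Stanford NER is a Java-based CRF classifier that NLTK wraps through `StanfordNERTagger`. On this example it labels Apple as ORGANIZATION, where NLTK's own chunker labelled it PERSON**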

In [80]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os

In [81]:
model = 'C:/StanfordNER_Tagger/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz'
jar = 'C:/StanfordNER_Tagger/stanford-ner-2018-10-16/stanford-ner.jar'



st = StanfordNERTagger(model, jar,encoding='utf-8')

In [82]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

classified_text_df = pd.DataFrame(classified_text)

classified_text_df.drop_duplicates(keep='first', inplace=True)
classified_text_df.reset_index(drop=True, inplace=True)
classified_text_df.columns = ["Entities", "Labels"]
classified_text_df

| | Entities | Labels |
|---|---|---|
| 0 | Apple | ORGANIZATION |
| 1 | acquired | O |
| 2 | Zoom | O |
| 3 | in | O |
| 4 | China | LOCATION |
| 5 | on | O |
| 6 | Wednesday | O |
| 7 | 6th | O |
| 8 | May | O |
| 9 | 2020.This | O |


In [89]:
tokenized_text = nltk.word_tokenize(text)
classified_text = st.tag(tokenized_text)

netagged_words = classified_text

entities = []
labels = []

from itertools import groupby
#group consecutive tokens that share a tag so multi-word entities stay together
for tag, chunk in groupby(classified_text, lambda x:x[1]):
    if tag != "O":
        entities.append(' '.join(w for w, t in chunk))
        labels.append(tag)
        
        
entities_all = list(zip(entities, labels))
entities_unique = list(set(zip(entities, labels))) #unique entities   
entities_df = pd.DataFrame(entities_unique)
entities_df.columns = ["Entities", "Labels"]
entities_df

| | Entities | Labels |
|---|---|---|
| 0 | Apple | ORGANIZATION |
| 1 | China | LOCATION |
| 2 | United States of America | LOCATION |
| 3 | Google | ORGANIZATION |
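One thing to be aware of with the `groupby` approach: it merges every run of consecutive tokens that share a tag, so two different entities of the same type that sit next to each other are joined into one. A sketch with a made-up tagged sequence (the tokens and tags below are hypothetical, not Stanford NER output):

```python
from itertools import groupby

# Hypothetical (token, tag) pairs, invented to show the edge case
tagged = [("Apple", "ORGANIZATION"), ("Google", "ORGANIZATION"),
          ("rose", "O"), ("in", "O"),
          ("United", "LOCATION"), ("States", "LOCATION")]

# Join each run of same-tag tokens, skipping the "O" (outside) runs
entities = [(' '.join(word for word, _ in run), tag)
            for tag, run in groupby(tagged, key=lambda pair: pair[1])
            if tag != "O"]
print(entities)
```

"United States" is joined as intended, but "Apple" and "Google" also come back as a single entity.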


## Using spaCy

Link: https://spacy.io/

In [84]:
import spacy 
from spacy import displacy
#spaCy 2.x brought significant speed and accuracy improvements
spacy.__version__

'2.2.4'

In [85]:
#Download spacy models
#!python -m spacy download en_core_web_sm

In [86]:
# Load SpaCy model
nlp = spacy.load("en_core_web_sm")
#nlp = spacy.load("en_core_web_md")
#nlp = spacy.load("en_core_web_lg")

In [87]:
doc = nlp(text)

entities = []
labels = []
position_start = []
position_end = []

for ent in doc.ents:
    entities.append(ent.text) #store the entity string rather than the Span object
    labels.append(ent.label_)
    position_start.append(ent.start_char)
    position_end.append(ent.end_char)
    
df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end})

df

| | Entities | Labels | Position_Start | Position_End |
|---|---|---|---|---|
| 0 | Apple | ORG | 0 | 5 |
| 1 | Zoom | PERSON | 15 | 19 |
| 2 | China | GPE | 23 | 28 |
| 3 | Wednesday 6th | DATE | 32 | 45 |
| 4 | Apple | ORG | 74 | 79 |
| 5 | Google | ORG | 84 | 90 |
| 6 | 5% | PERCENT | 105 | 107 |
| 7 | Dow Jones | ORG | 111 | 120 |
| 8 | the United States of America | GPE | 130 | 158 |
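`start_char` and `end_char` are character offsets into the original string, so slicing `text` with them returns the entity text. The sketch below hard-codes a few rows from the table above and uses them to mark up the sentence. The bracket format and the `spans` and `annotated` names are just an illustration, not something spaCy produces.

```python
# Same string as the Data cell, written with implicit concatenation
text = ("Apple acquired Zoom in China on Wednesday 6th May 2020."
        "This news has made Apple and Google stock jump by 5% on Dow Jones Index in the "
        "United States of America")

# (label, start_char, end_char) rows copied from the table above
spans = [("ORG", 0, 5), ("PERSON", 15, 19), ("GPE", 23, 28), ("DATE", 32, 45),
         ("PERCENT", 105, 107), ("GPE", 130, 158)]

# Slicing with the offsets recovers each entity's text
for label, start, end in spans:
    print(label, repr(text[start:end]))

# Insert markers from right to left so earlier offsets stay valid
annotated = text
for label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
    annotated = f"{annotated[:start]}[{annotated[start:end]}|{label}]{annotated[end:]}"
print(annotated)
```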


In [88]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

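**spaCy works the best on this example: it is the only one of the three to pick up the date and the percentage, although it labels Zoom as PERSON and cuts the date short at "Wednesday 6th"**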

**What can you build with this?**

- A bot that analyzes financial news and extracts the entities mentioned in a given article, along with locations, dates and numeric information. This information can then be used to build algorithmic trading bots.
- A tool that analyzes the research papers published every day on COVID-19 and flags any significant developments.
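As a small taste of the first idea, once entities have been extracted from several articles they can be aggregated, for instance to see which organizations are mentioned most often. The `extracted` data below is hypothetical, written by hand in the same (entity, label) shape the cells above produce.

```python
from collections import Counter

# Hypothetical (entity, label) pairs, as if extracted from three news articles
extracted = [
    [("Apple", "ORG"), ("Zoom", "ORG"), ("China", "GPE")],
    [("Apple", "ORG"), ("Google", "ORG")],
    [("Google", "ORG"), ("Apple", "ORG"), ("United States", "GPE")],
]

# Count how many articles mention each organization
# (set() avoids counting the same entity twice within one article)
org_counts = Counter(
    entity
    for article in extracted
    for entity, label in set(article)
    if label == "ORG"
)
print(org_counts.most_common(2))
```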