### Named Entity Recognition (NER) using spaCy

What is a named entity (NE)?
- A word or phrase that refers to a specific person, organisation, location, etc. is known as a Named Entity (NE).
- Named Entity Recognition (NER) is a subtask of information extraction that identifies the words which are named entities and assigns each one a category.
- NER is also known as entity identification or entity chunking.

Why NER?
- Extract and analyze information about the entities mentioned in an article or tweet, along with locations, dates and numeric information.
- A good way to identify the words that represent the `who`, `what` and `whom` in a text.
- The extracted information can feed an algorithm or model for downstream tasks as needed.
- Can be used to analyze research papers and find out what their main focus is.

In [40]:
%load_ext watermark
%load_ext lab_black

In [44]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.lang.en.examples import sentences

In [43]:
%watermark -iv -v

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.23.0

pandas: 1.2.4
spacy : 3.0.6



In [11]:
# download spaCy models (for named entities, all three models give almost the same result)
#!python3 -m spacy download en_core_web_sm
#!python3 -m spacy download en_core_web_md
#!python3 -m spacy download en_core_web_lg

In [None]:
# load spacy model
nlp = spacy.load("en_core_web_sm")

#### Example 1

In [10]:
# take a sentence from spaCy's bundled examples
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN advcl
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


#### Example 2

In [20]:
# a tweet from Twitter with some dummy text appended
doc1 = nlp("Nokia's Q1 saw strong net sales, operating margin and cash flow. Our CEO Pekka Lundmark discussed the robust set of results which is good for all countries. By 2021 the telecom company will make 2 Billion dollars.")

In [31]:
#dir(nlp)

In [28]:
[[ent.text, ent.label_] for ent in doc1.ents]

[['Nokia', 'ORG'],
 ['Pekka Lundmark', 'PERSON'],
 ['2021', 'DATE'],
 ['2 Billion dollars', 'MONEY']]
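
Once the entities are available as `[text, label]` pairs, a common next step is to group them by label. The sketch below is a minimal, plain-Python illustration that reuses the pairs printed above; `group_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
pairs = [
    ["Nokia", "ORG"],
    ["Pekka Lundmark", "PERSON"],
    ["2021", "DATE"],
    ["2 Billion dollars", "MONEY"],
]


def group_by_label(pairs):
    """Return a dict mapping each entity label to the list of entity texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(pairs)
print(grouped)
```

With a live `Doc`, the same helper could be fed `[(ent.text, ent.label_) for ent in doc1.ents]` directly.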

In [23]:
# check the meaning of a label
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [26]:
# let's use displacy to highlight the entities inline
displacy.render(doc1, style="ent", jupyter=True)

#### Example 3

In [33]:
text = "The news that US telecoms giant AT&T will use Nokia for a major part of its 5G rollout comes as the Finnish vendor tries to give investors hope. AT&T recently announced it expected to drop around $2 billion per year on rolling out kit designed to make use of the mid-band 5G spectrum it won in the recent auction. A big chunk of that now seems destined to end up in Fin nish pockets thanks to this announcement, which will be especially welcome for Nokia, coming as it does on the day it attempts to reassure its investors about the future."

In [57]:
doc = nlp(text)

entities = []
labels = []
position_start = []
position_end = []

for ent in doc.ents:
    # store the entity text rather than the Span object, so the DataFrame holds plain strings
    entities.append(ent.text)
    labels.append(ent.label_)
    position_start.append(ent.start_char)
    position_end.append(ent.end_char)

# collect the results in a DataFrame
df = pd.DataFrame(
    {
        "Entities": entities,
        "Labels": labels,
        "Position_start": position_start,
        "Position_end": position_end,
    }
)
displacy.render(doc, style="ent", jupyter=True)


print("\ndf:")
display(df)


df:


| | Entities | Labels | Position_start | Position_end |
|---|---|---|---|---|
| 0 | US | GPE | 14 | 16 |
| 1 | AT&T | ORG | 32 | 36 |
| 2 | Nokia | ORG | 46 | 51 |
| 3 | 5 | CARDINAL | 76 | 77 |
| 4 | Finnish | NORP | 100 | 107 |
| 5 | AT&T | ORG | 145 | 149 |
| 6 | around $2 billion | MONEY | 189 | 206 |
| 7 | 5 | CARDINAL | 272 | 273 |
| 8 | Nokia | ORG | 449 | 454 |
| 9 | the day | DATE | 477 | 484 |
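
`start_char` and `end_char` are character offsets into the original string, so slicing the text with them should give back the entity. The sketch below checks this for the first four rows of the table. It uses only the opening clause of the Example 3 text (shortened here for brevity) and offsets copied from the output above; `snippet`, `rows` and `recovered` are illustrative names.

```python
# opening clause of the Example 3 text (shortened for brevity)
snippet = (
    "The news that US telecoms giant AT&T will use Nokia "
    "for a major part of its 5G rollout"
)

# (label, start_char, end_char) copied from the first four rows of the table
rows = [
    ("GPE", 14, 16),
    ("ORG", 32, 36),
    ("ORG", 46, 51),
    ("CARDINAL", 76, 77),
]

# end_char is exclusive, so it can be used directly as the slice end
recovered = [(snippet[start:end], label) for label, start, end in rows]
for text, label in recovered:
    print(f"{text!r:10} {label}")
```

The last row shows that the model tagged only the `5` in `5G` as a `CARDINAL`, a reminder that the predictions are worth spot-checking before they are used downstream.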


In [61]:
# list all the valid attributes of the object
dir(ent)

['_',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_fix_dep_copy',
 '_vector',
 '_vector_norm',
 'as_doc',
 'char_span',
 'conjuncts',
 'doc',
 'end',
 'end_char',
 'ent_id',
 'ent_id_',
 'ents',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'kb_id',
 'kb_id_',
 'label',
 'label_',
 'lefts',
 'lemma_',
 'n_lefts',
 'n_rights',
 'noun_chunks',
 'orth_',
 'remove_extension',
 'rights',
 'root',
 'sent',
 'sentiment',
 'set_extension',
 'similarity',
 'start',
 'start_char',
 'subtree',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'vector',
 'vector_norm',
 'vocab']
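
Most of the names in the listing above are dunder or private members. A small filter makes the public attributes easier to scan. The sketch below is plain Python and uses a hypothetical stand-in class (`FakeSpan`) so that it runs without spaCy; with a real entity you would call `public_attrs(ent)` instead.

```python
def public_attrs(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class FakeSpan:
    """Hypothetical stand-in for a spaCy Span, used only for demonstration."""

    def __init__(self, text, label_):
        self.text = text
        self.label_ = label_
        self._private = None  # filtered out by public_attrs


attrs = public_attrs(FakeSpan("Nokia", "ORG"))
print(attrs)
```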