Named Entity Recognition (NER)

In [2]:
#Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
#Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [4]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [5]:
doc = nlp(u'can I please borrow 500 dollars from you to buy some Microsoft stock?')

for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

500 dollars 4 6 20 31 MONEY
Microsoft 11 12 53 62 ORG


In [6]:
tokens = [token.text for token in doc]
print(tokens)

['can', 'I', 'please', 'borrow', '500', 'dollars', 'from', 'you', 'to', 'buy', 'some', 'Microsoft', 'stock', '?']
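
The two pairs of offsets printed for each entity index different things: ent.start and ent.end count tokens, while ent.start_char and ent.end_char count characters in the original string. Both ranges exclude their end position. The sketch below checks this in plain Python against the values shown above; the sentence, token list and offsets are copied from the cells above rather than produced by a spaCy call.

```python
# Sentence and token list copied from the cells above (no spaCy call here).
text = 'can I please borrow 500 dollars from you to buy some Microsoft stock?'
tokens = ['can', 'I', 'please', 'borrow', '500', 'dollars', 'from', 'you',
          'to', 'buy', 'some', 'Microsoft', 'stock', '?']

# (start, end, start_char, end_char, label) as printed for each entity.
entities = [(4, 6, 20, 31, 'MONEY'), (11, 12, 53, 62, 'ORG')]

slices = []
for start, end, start_char, end_char, label in entities:
    by_token = tokens[start:end]          # token-based slice, end exclusive
    by_char = text[start_char:end_char]   # character-based slice, end exclusive
    slices.append((by_token, by_char, label))
    print(by_token, '|', by_char, '|', label)
```

Token offsets are the ones to use when building a Span, while character offsets are the ones to use when highlighting text in the raw string.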


NER Tags
Tags are accessible through the .label_ property of an entity.

In [7]:
doc = nlp(u'Tesla to build a U.K factory for $6 million')

show_ents(doc)

U.K - ORG - Companies, agencies, institutions, etc.
$6 million - MONEY - Monetary values, including unit


Right now, spaCy does not recognize "Tesla" as a company.

In [8]:
from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']

# Create a span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

#Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

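In the code above, the arguments passed to Span() are: doc (the Doc object), 0 (the start token index), 1 (the stop token index, exclusive), and label=ORG (the label assigned to the entity).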

In [9]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K - ORG - Companies, agencies, institutions, etc.
$6 million - MONEY - Monetary values, including unit


In [10]:
#Adding Named Entities to All Matching Spans
#What if we want to tag all occurrences of "Tesla"? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

In [11]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
          u'If successful, the vacuum-cleaner will be our first product.')

show_ents(doc)

first - ORDINAL - "first", "second", etc.


In [12]:
#Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [13]:
#Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list] 

In [14]:
#Apply the patterns to our matcher object
#(spaCy v2 signature; in spaCy v3 the call is matcher.add('newproduct', phrase_patterns)):
matcher.add('newproduct', None, *phrase_patterns)

#apply the matcher to our doc object:
matches = matcher(doc)

#See what matches occur:
matches

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 17)]
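
Each match is a (match_id, start, end) tuple, where start and end are token indices with the end excluded. The sketch below unpacks the matches from the output above against a hand-written token list. That token list is an assumption about how the text was tokenized, chosen to be consistent with the indices in the output; it is not produced by spaCy here.

```python
# Matches copied from the output above: (match_id, start, end).
matches = [(2689272359382549672, 7, 9), (2689272359382549672, 14, 17)]

# Assumed tokenization of the example text (hand-written, not produced by spaCy).
tokens = ['Our', 'company', 'plans', 'to', 'introduce', 'a', 'new', 'vacuum',
          'cleaner', '.', 'If', 'successful', ',', 'the', 'vacuum', '-',
          'cleaner', 'will', 'be', 'our', 'first', 'product', '.']

matched = []
for match_id, start, end in matches:
    # With a real Doc, doc[start:end].text gives the matched text and
    # nlp.vocab.strings[match_id] turns the hash back into 'newproduct'.
    matched.append(tokens[start:end])
    print(match_id, start, end, tokens[start:end])
```

The second match spans three tokens because the hyphen is a token of its own.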

In [15]:
#Here we create spans from each match, and create named entities from them:
from spacy.tokens import Span

PROD = doc.vocab.strings[u'PRODUCT']

new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]

doc.ents = list(doc.ents) + new_ents

In [16]:
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.


In [17]:
#Counting Entities
#While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

In [18]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [19]:
len([ent for ent in doc.ents if ent.label_ == 'MONEY'])

2
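
To count every label at once rather than one label at a time, collections.Counter can tally the label_ strings. Note that the comparison has to use label_ (the string); label without the underscore is an integer hash and never equals 'MONEY'. This is a minimal sketch: the Ent tuple is a hypothetical stand-in for spaCy's entity spans so the snippet runs without a model. With a real Doc the tally would be Counter(ent.label_ for ent in doc.ents).

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy entity spans, holding only text and label_.
Ent = namedtuple('Ent', ['text', 'label_'])
ents = [Ent('29.50', 'MONEY'), Ent('five dollars', 'MONEY')]

# Tally every label in one pass.
label_counts = Counter(ent.label_ for ent in ents)
print(label_counts['MONEY'])   # count for one label
print(label_counts['ORG'])     # labels that never occur count as zero
```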

Problem with Line Breaks

In [21]:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit
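
In this run the line break did not change the entities found. If a newline does interfere with the results in some spaCy version or model, one possible precaution is to collapse whitespace before calling nlp, assuming line breaks carry no meaning for the task. The helper name normalize_whitespace is made up for this sketch.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return ' '.join(text.split())

raw = 'Originally priced at $29.50,\nthe sweater was marked down to five dollars.'
clean = normalize_whitespace(raw)
print(clean)
# The cleaned string can then be passed to nlp(clean) as usual.
```

Collapsing longer runs of whitespace shortens the string, so character offsets from the cleaned text will not always line up with the raw text.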


In [None]:
#noun_chunks components:
 #'text' The original noun chunk text.
 #'root.text' The original text of the word connecting the noun chunk to the rest of the parse.
 #'root.dep_' Dependency relation connecting the root to its head.
 #'root.head.text' The text of the root token's head.

In [23]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward


In [24]:
len(doc.noun_chunks)  # fails because noun_chunks is a generator; a different way to count follows

TypeError: object of type 'generator' has no len()

In [26]:
len(list(doc.noun_chunks))

3
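
doc.noun_chunks is a generator, which is why len() fails on it directly. Converting it to a list works; if the chunks themselves are not needed, sum() can count them without building the list. The generator below is a hypothetical stand-in that yields the three chunk texts, so the snippet runs without spaCy.

```python
def fake_noun_chunks():
    """Hypothetical stand-in generator yielding three noun chunk texts."""
    yield 'Autonomous cars'
    yield 'insurance liability'
    yield 'manufacturers'

# Count items without materializing a list.
count = sum(1 for _ in fake_noun_chunks())
print(count)

# A stored generator is consumed as it is iterated, so keep a list if it is reused.
gen = fake_noun_chunks()
first_pass = list(gen)
second_pass = list(gen)   # already exhausted, so this is empty
print(len(first_pass), len(second_pass))
```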