<a href="https://colab.research.google.com/github/sujayrittikar/NLP_Basics/blob/main/NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition

- Named Entity Recognition (NER) locates named-entity mentions in unstructured text and classifies them into predefined categories such as person, organisation, or location.
- Example: Jim [Person] bought 300 shares of Acme Corp. [Organisation].
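
The annotated sentence above can be written down as labelled character spans, which is roughly the shape of data an NER system hands back. The sketch below is plain Python and labelled by hand, with no model involved; the `Entity` tuple, the `annotate` helper, and the label names `PERSON` and `ORG` are illustrative assumptions chosen to resemble spaCy's label scheme.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A hand-written entity annotation: character offsets plus a label."""
    start: int   # index of the first character of the mention
    end: int     # index one past the last character of the mention
    label: str   # category name, e.g. PERSON or ORG


text = "Jim bought 300 shares of Acme Corp."


def annotate(text, mention, label):
    """Locate the first occurrence of `mention` in `text` and label it."""
    start = text.index(mention)
    return Entity(start, start + len(mention), label)


entities = [
    annotate(text, "Jim", "PERSON"),
    annotate(text, "Acme Corp.", "ORG"),
]

for ent in entities:
    print(f"{text[ent.start:ent.end]} - {ent.label}")
```

Storing offsets rather than copied strings keeps every annotation tied to an exact position in the text, which is the same idea behind the token offsets used later in this notebook.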

In [1]:
import spacy



In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
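def show_ents(doc):
  """Print each entity in doc as: text - label - label description."""
  if doc.ents:
    for ent in doc.ents:
      # spacy.explain returns a short description of the label, or None
      print(f"{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}")
  else:
    print("No entities found")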

In [4]:
doc = nlp(u'Hi how are you?')

In [5]:
show_ents(doc)

No entities found


In [6]:
doc = nlp(u'May I go to Washington, DC next June to see the amazing Washington University?')

In [7]:
show_ents(doc)

Washington - GPE - Countries, cities, states
DC - GPE - Countries, cities, states
next June - DATE - Absolute or relative dates or periods
Washington University - ORG - Companies, agencies, institutions, etc.


In [8]:
doc = nlp(u'Can I please have 500 dollars of Microsoft stock?')

In [9]:
show_ents(doc)

500 dollars - MONEY - Monetary values, including unit
Microsoft - ORG - Companies, agencies, institutions, etc.


In [13]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')

# Creating a new entity

In [10]:
from spacy.tokens import Span

In [11]:
# Look up the hash value that the vocabulary stores for the ORG label
ORG = doc.vocab.strings[u"ORG"]

In [14]:
# Span(doc, start, end) takes token offsets with an exclusive end, so 0-1 covers only "Tesla"
new_ent = Span(doc, 0, 1, label=ORG)

In [15]:
doc.ents = list(doc.ents) + [new_ent]

In [16]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


# Adding Phrases

In [17]:
doc = nlp(u"Our company created a brand new vacuum cleaner. This new vacuum-cleaner is the best in show.")

In [18]:
show_ents(doc)

No entities found


In [19]:
from spacy.matcher import PhraseMatcher

In [20]:
matcher = PhraseMatcher(nlp.vocab)

In [21]:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']

In [22]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [24]:
matcher.add('newproduct', phrase_patterns)

In [25]:
found_matches = matcher(doc)
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]
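
Each match is a `(match_id, start, end)` tuple whose offsets count tokens, not characters, with an exclusive end. To make the offsets above easier to read, here is a minimal plain-Python sketch of the same idea: scan a token list for exact token subsequences. It is not how `PhraseMatcher` is implemented, `find_phrases` is a hypothetical helper, and the token split is written by hand on the assumption that it matches the offsets reported above.

```python
def find_phrases(tokens, patterns):
    """Return (start, end) token offsets of every pattern occurrence (end exclusive)."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            if tokens[start:end] == pattern:
                matches.append((start, end))
    return matches


# Hand-written tokens (an assumption); note that the hyphen is its own token.
tokens = ["Our", "company", "created", "a", "brand", "new", "vacuum", "cleaner", ".",
          "This", "new", "vacuum", "-", "cleaner", "is", "the", "best", "in", "show", "."]
patterns = [["vacuum", "cleaner"], ["vacuum", "-", "cleaner"]]

matches = find_phrases(tokens, patterns)
for start, end in matches:
    print((start, end), tokens[start:end])
```

The hyphenated form spans three tokens, which is why the second match covers offsets 11 to 14 rather than 11 to 13.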

In [26]:
from spacy.tokens import Span

In [27]:
PROD = doc.vocab.strings[u'PRODUCT']

In [28]:
new_ents = [Span(doc, start, end, label=PROD) for _, start, end in found_matches]

In [29]:
doc.ents = list(doc.ents) + new_ents

In [30]:
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
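
One caveat when merging matches into `doc.ents`: spaCy rejects the assignment if two spans share a token, which can happen when a phrase match overlaps an entity the model already found. spaCy provides `spacy.util.filter_spans` for this case. The sketch below shows the same longest-span-first idea in plain Python on `(start, end, label)` tuples; `drop_overlaps` is a hypothetical helper and the colliding `ORG` span is invented for illustration.

```python
def drop_overlaps(spans):
    """Keep the longest spans first, skipping any span that reuses a token.

    `spans` holds (start, end, label) tuples with an exclusive end offset.
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end, label in ordered:
        if used.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            used.update(range(start, end))
    return sorted(kept)


# Hypothetical candidates: the two phrase matches plus an invented
# one-token span that collides with the first match.
candidates = [(6, 8, "PRODUCT"), (11, 14, "PRODUCT"), (7, 8, "ORG")]
print(drop_overlaps(candidates))
```

Preferring the longest span is only one possible policy; a project might instead always prefer rule-based matches or always prefer the model's predictions.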


# Visualizing NER

In [31]:
from spacy import displacy

In [34]:
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. "
        u"By contrast, Sony only sold 8 thousand Walkman music players.")

In [35]:
displacy.render(doc, style='ent', jupyter=True)

In [36]:
for sent in doc.sents:
  displacy.render(nlp(sent.text), style='ent', jupyter=True)

In [37]:
options = {'ents': ['PRODUCT', 'ORG']}

In [38]:
displacy.render(doc, style='ent', jupyter=True, options=options)
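
The same `options` dict also accepts a `colors` mapping from label to CSS colour, which helps when only a few labels are shown. The snippet below is a small configuration sketch; the colour values are arbitrary illustrative choices, and the render call is left as a comment because it needs the `doc` and the notebook environment from the cells above.

```python
# Same entity filter as above, plus one colour per label.
# The colour values are arbitrary illustrative choices.
options = {
    "ents": ["PRODUCT", "ORG"],
    "colors": {"ORG": "#aa9cfc", "PRODUCT": "#fc9ce7"},
}

# In the notebook, pass it exactly as before:
# displacy.render(doc, style='ent', jupyter=True, options=options)
```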

# Sentence Segmentation

In [43]:
from spacy.language import Language

- Adding a segmentation rule: start a new sentence after every semicolon

In [44]:
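@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
  """Mark the token after each semicolon as the start of a new sentence."""
  # Skip the last token so that doc[token.i + 1] always exists
  for token in doc[:-1]:
    if token.text == ';':
      doc[token.i + 1].is_sent_start = True
  return doc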

In [46]:
# Run the rule before the parser so the parser keeps the boundaries it sets
nlp.add_pipe('set_custom_boundaries', before='parser')
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [47]:
doc2 = nlp(u'"Management is doing the right thing; leadership is doing the right things." -Peter Drucker')

In [49]:
for sent in doc2.sents:
  print(sent)

"Management is doing the right thing;
leadership is doing the right things."
-Peter Drucker
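
Seen in isolation, the semicolon rule can be sketched in plain Python on a list of tokens. This is only an illustration of the rule itself: `split_sentences` is a hypothetical helper and the token list is written by hand. In the pipeline the parser still adds its own boundaries on top of the custom ones, which is why the quote above yields three sentences rather than two.

```python
def split_sentences(tokens, boundary=";"):
    """Group tokens into sentences, starting a new one after each `boundary` token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == boundary:
            sentences.append(current)
            current = []
    if current:  # flush whatever follows the last boundary
        sentences.append(current)
    return sentences


tokens = ["Management", "is", "doing", "the", "right", "thing", ";",
          "leadership", "is", "doing", "the", "right", "things", "."]

for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```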
