<a href="https://colab.research.google.com/github/yohanesnuwara/66DaysOfData/blob/main/D12_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition with spaCy

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

from bs4 import BeautifulSoup
import requests
import re

We'll extract the text of a news article from a website: download the page, strip the markup, and collapse the whitespace.

In [2]:
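def url_to_string(url):
  # Download the page and parse its HTML
  res = requests.get(url)
  html = res.text
  soup = BeautifulSoup(html, 'html5lib')
  # Drop elements that carry no article text
  for tag in soup(["script", "style", "aside"]):
    tag.extract()
  # Collapse runs of newlines and tabs into single spaces
  return " ".join(re.split(r'[\n\t]+', soup.get_text()))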
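
To see what the final `re.split`/`join` step of `url_to_string` does, here is a minimal sketch on a made-up string (the sample text is an assumption, not taken from the article). Runs of newlines and tabs collapse into one space, a leading newline leaves a leading space, and ordinary repeated spaces are left untouched.

```python
import re

# Hypothetical page text standing in for soup.get_text()
raw = "\nTitle\n\n\tFirst  line\nSecond\t\tline"

# Same expression as the return statement of url_to_string
clean = " ".join(re.split(r'[\n\t]+', raw))

print(repr(clean))
```

The leading space is why the first sentence printed later in the notebook starts with extra whitespace; calling `.strip()` on the result would remove it if that matters.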

In [4]:
url = 'https://www.energyvoice.com/oilandgas/middle-east/336817/schlumberger-halliburton-adnoc-drilling-sarb-upper-zakum-contract'

ny_bb = url_to_string(url)

nlp = en_core_web_sm.load()
article = nlp(ny_bb)

# Count the entities found in the article
len(article.ents)

98

Count the entities by label.

In [5]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 6,
         'DATE': 2,
         'GPE': 14,
         'LOC': 7,
         'MONEY': 11,
         'NORP': 1,
         'ORG': 30,
         'PERCENT': 2,
         'PERSON': 21,
         'PRODUCT': 1,
         'QUANTITY': 2,
         'TIME': 1})
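
The raw counts are easier to compare as shares of all entities. The sketch below copies the counts from the output above into a new `Counter` (the name `label_counts` is introduced here for illustration) so that it runs without the model or the article.

```python
from collections import Counter

# Label counts copied from the output above
label_counts = Counter({'CARDINAL': 6, 'DATE': 2, 'GPE': 14, 'LOC': 7,
                        'MONEY': 11, 'NORP': 1, 'ORG': 30, 'PERCENT': 2,
                        'PERSON': 21, 'PRODUCT': 1, 'QUANTITY': 2, 'TIME': 1})

# Total number of entities across all labels
total = sum(label_counts.values())

# Share of each label, most frequent first
shares = [(label, count / total) for label, count in label_counts.most_common()]

for label, share in shares:
    print(f'{label:<10}{share:6.1%}')
```

In the notebook itself, `Counter(labels)` can be used directly in place of the copied dictionary.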

Print the three most frequent entities.

In [6]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('ADNOC', 9), ('Halliburton', 7), ('Schlumberger', 4)]

Print all sentences. 

In [7]:
sentences = list(article.sents)

for i, sentence in enumerate(sentences):
  print(f'{i}: {sentence}')

0:  Schlumberger and Halliburton win big as ADNOC dishes out
1: $764m deal   Skip Menu Skip to main content Skip to sidebar Skip to footer Social Media Links For Energy Voice Facebook LinkedIn
2: Twitter Show Links DC Thomson Business VerticalsAdvertise Newsletter Sign Up
3: Account OptionsLogin / Register eEditions Newsletter Preferences
4: Update Profile Search Search Login / Register Menu Menu Main NavigationCoronavirus Oil & Gas Africa Americas Asia Australasia
5: Europe
6: Middle East
7: North Sea Norway Renewables/
8: Energy Transition Nuclear Wind Solar Wave & Tidal Hydro Hydrogen CCS Biofuels Grid & Retail Storage
9: Transport Podcast Opinion Subscribe
10: All Sections  
11: All Sections MenuCoronavirus Oil & Gas Africa Americas Asia Australasia Europe Middle East North Sea Norway
12: Renewables/Energy Transition Nuclear Wind Solar Wave & Tidal Hydro Hydrogen CCS Biofuels Grid & Retail Storage
13: Transport Podcast Opinion Subscribe Close Menu Oil & Gas / Middle East Schlumberg

Take one sentence as an example and extract its entities.

In [14]:
nlp = en_core_web_sm.load()

doc = nlp(str(sentences[14]))
print([(X.text, X.label_) for X in doc.ents])

[('ADNOC', 'ORG'), ('763.7million', 'MONEY'), ('Schlumberger', 'ORG'), ('Halliburton', 'PERSON')]
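
Halliburton comes out as `PERSON` here. One quick way to surface such mislabels is to compare the predictions against a small hand-made lookup of names whose label is known. This is only a sketch: the `expected` dictionary is a hypothetical lookup written for this example, not something spaCy provides.

```python
# (text, label) pairs copied from the output above
predicted = [('ADNOC', 'ORG'), ('763.7million', 'MONEY'),
             ('Schlumberger', 'ORG'), ('Halliburton', 'PERSON')]

# Hypothetical hand-made lookup of labels we expect for known names
expected = {'ADNOC': 'ORG', 'Schlumberger': 'ORG', 'Halliburton': 'ORG'}

# Keep entities whose predicted label disagrees with the lookup
mismatches = [(text, label, expected[text])
              for text, label in predicted
              if text in expected and label != expected[text]]

for text, got, want in mismatches:
    print(f'{text}: predicted {got}, expected {want}')
```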


Render the sentence with displaCy.

In [8]:
displacy.render(nlp(str(sentences[14])), jupyter=True, style='ent')

Render a few other sentences.

In [9]:
displacy.render(nlp(str(sentences[26])), jupyter=True, style='ent')

In [10]:
displacy.render(nlp(str(sentences[33])), jupyter=True, style='ent')

In [11]:
displacy.render(nlp(str(sentences[38])), jupyter=True, style='ent')

Comments on some limitations:
* The model does not reliably distinguish PERSON, GPE, and ORG entities. For example, Halliburton is labeled PERSON (sentence 14) and UAE is labeled ORG (sentence 26).
* The "$764million" in sentence 14 is not recognized as a MONEY entity, probably because of a typo in the source text.
* To recognize "05/07/21" as a date, one can use rule-based entity matching in spaCy; more information [here](https://spacy.io/usage/rule-based-matching/).
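
A minimal sketch of the rule-based approach, assuming spaCy v3 (where `entity_ruler` is added by name). It uses a blank English pipeline so that it runs without downloading a model; the sentence, the patterns, and the names `nlp_rules` and `ruler` are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical NER (assumes spaCy v3)
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

patterns = [
    # A single token shaped like dd/dd/dd, e.g. 05/07/21
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{2}/\d{2}/\d{2}$"}}]},
    # Exact-match names with the label we want them to have
    {"label": "ORG", "pattern": "Halliburton"},
    {"label": "GPE", "pattern": "UAE"},
]
ruler.add_patterns(patterns)

# Made-up sentence for illustration
doc_rules = nlp_rules("Halliburton signed the deal on 05/07/21 in the UAE")
rule_ents = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(rule_ents)
```

With the `en_core_web_sm` pipeline used in this notebook, the ruler would typically be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule matches are set before the statistical model runs.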

References:

* https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da