<h2 align='center'>NLP Tutorial: Named Entity Recognition (NER)</h2>

In [2]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.8.4-cp312-cp312-win_amd64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.12-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp312-cp312-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.4-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp312-cp312-win_amd64

In [3]:
import pydantic
pydantic.__version__

'2.5.3'

In [27]:
!python -m pip uninstall pydantic --yes

Found existing installation: pydantic 2.10.6
Uninstalling pydantic-2.10.6:
  Successfully uninstalled pydantic-2.10.6


In [31]:
!pip install pydantic

Collecting pydantic
  Using cached pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)
Using cached pydantic-2.10.6-py3-none-any.whl (431 kB)
Installing collected packages: pydantic
Successfully installed pydantic-2.10.6


In [1]:
# Run after restarting the kernel so the reinstalled version is picked up
import pydantic
pydantic.__version__

'2.10.6'

In [3]:
import spacy

In [15]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

In [17]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [19]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [21]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [23]:
from spacy import displacy

displacy.render(doc, style="ent")

<h3>List all the entity labels</h3>

In [26]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The entity labels are also documented on this page: https://spacy.io/models/en
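
To read what each label means without leaving the notebook, `spacy.explain` can be applied to every label in turn. This is a minimal sketch: the hypothetical `ner_labels` list is copied by hand from the output above so the cell runs without a downloaded model; with a loaded pipeline, `nlp.pipe_labels['ner']` should supply the same list.

```python
import spacy

# Labels copied from the output above (assumption: hard-coded here so that
# no model download is needed to run this cell).
ner_labels = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]

# spacy.explain returns a short description, or None for an unknown label
descriptions = {label: spacy.explain(label) for label in ner_labels}

for label, description in descriptions.items():
    print(f"{label:<12} | {description}")
```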

In [29]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


In the output above, the model mislabeled the second "Bloomberg" (the company) as GPE instead of ORG. Let's try a Hugging Face model on the same sentence:

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also go through the three sample examples for NER.

In [32]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  PERSON  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
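
The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive, so they behave like a Python slice. A small sketch that checks this with plain string slicing; the hypothetical `reported` list is copied by hand from the output above, and no spaCy call is involved:

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (label, start_char, end_char) copied from the output above
reported = [("ORG", 0, 9), ("PERSON", 30, 41), ("MONEY", 46, 57)]

# start_char is inclusive and end_char is exclusive, like a Python slice
recovered = [text[start:end] for _, start, end in reported]

for (label, start, end), span_text in zip(reported, recovered):
    print(f"{start:>2}-{end:<2} | {label:<6} | {span_text}")
```

Slicing this way is handy when entity offsets need to be stored and mapped back to the raw text later.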


<h3>Setting custom entities</h3>

In [35]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PERSON
$45 billion  |  MONEY

The model labels Twitter as PERSON here. The cells below correct that by building spans by hand and assigning them as entities.


In [36]:
s = doc[2:5]
s

going to acquire

In [37]:
type(s)

spacy.tokens.span.Span

In [38]:
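from spacy.tokens import Span

# Span(doc, start, end) takes token indices; end is exclusive
s1 = Span(doc, 0, 1, label="ORG")  # token 0: "Tesla"
s2 = Span(doc, 5, 6, label="ORG")  # token 5: "Twitter"

# Overwrite only these two spans; default="unmodified" keeps the other entities
doc.set_ents([s1, s2], default="unmodified")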

In [39]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
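
Counting token indices by hand gets error-prone on longer texts. One possible alternative is `doc.char_span`, which builds a span from character offsets. The sketch below is illustrative only: it uses a blank English pipeline (tokenizer only, so no model is required and `doc.ents` starts empty), and `org_span`, `nlp_blank`, and `doc_blank` are hypothetical names that do not appear elsewhere in this notebook.

```python
import spacy

# Assumption: a blank pipeline is used so the cell runs without a model.
# It only tokenizes, so it predicts no entities of its own.
nlp_blank = spacy.blank("en")
text = "Tesla is going to acquire Twitter for $45 billion"
doc_blank = nlp_blank(text)


def org_span(doc, phrase):
    """Hypothetical helper: label the first occurrence of phrase as ORG."""
    start = doc.text.index(phrase)
    # char_span returns None if the offsets do not fall on token boundaries
    return doc.char_span(start, start + len(phrase), label="ORG")


spans = [org_span(doc_blank, "Tesla"), org_span(doc_blank, "Twitter")]
doc_blank.set_ents(spans, default="unmodified")

custom_ents = [(ent.text, ent.label_) for ent in doc_blank.ents]
print(custom_ents)
```

With a trained pipeline such as `en_core_web_sm`, the same two calls should override the model's label for Twitter while leaving the MONEY entity untouched, as in the cell above.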
