
# Named Entity Recognition (NER) with spaCy

In this notebook I will:
1. Use **spaCy's pre-trained statistical NER model** to recognize entities in the news article
   `Problem1Text.txt`.
2. **Inspect and discuss** the entities recognized by spaCy (persons, organizations, locations, dates, etc.).
3. Extend spaCy by **adding a custom entity type** and fine-tuning the NER component on a tiny
   annotated dataset, then test the updated model on the original text.


In [2]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.8.9-cp313-cp313-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.15-cp313-cp313-win_amd64.whl.metadata (2.3 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.13-cp313-cp313-win_amd64.whl.metadata (9.9 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.11-cp313-cp313-win_amd64.whl.metadata (2.6 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.9-cp313-cp313-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp313-cp313-win_amd6

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
# If spaCy is not installed, uncomment and run:
# !pip install -U spacy

# If the English model is not installed, uncomment and run:
# !python -m spacy download en_core_web_sm

import spacy
from pathlib import Path

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Load the same text file as in Problem 1
text_path = Path("Problem1Text.txt")
assert text_path.exists(), "Problem1Text.txt not found in the current directory."

raw_text = text_path.read_text(encoding="utf-8")

print("=== Text preview (first 800 characters) ===")
print(raw_text[:800])

=== Text preview (first 800 characters) ===
Sen. Tim Kaine defends Schumer, says House Democrats 'should focus on their own leadership'
In an exclusive interview on "Meet the Press," the Virginia senator defended his decision to side with Republicans on a funding bill to reopen the government.
Volume Muted Icon
Tim Kaine tells House Democrats calling for Schumer’s ouster to ‘focus on their own leadership’
02:23
Get more news
on


Savewith a NBCUniversal Profile
Create your free profile or log in to save this article
Nov. 16, 2025, 10:17 AM EST
By Megan Lebowitz
WASHINGTON — Sen. Tim Kaine, D-Va., defended Senate Minority Leader Chuck Schumer in an exclusive interview on NBC News’ “Meet the Press,” urging House Democrats to stick to their chamber rather than focus on the party’s Senate leadership.

“I don’t tell Ro Khanna or AOC or a


In [6]:
# Process full document with spaCy
doc = nlp(raw_text)

# Collect entities as (text, label_) pairs
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(f"Total entities found: {len(entities)}\n")

# Show a sample of entities
for ent_text, ent_label in entities[:40]:
    print(f"{ent_text:35s} -> {ent_label}")

# Also show unique labels that appear
unique_labels = sorted(set(label for _, label in entities))
print("\nUnique entity labels in this document:", unique_labels)

Total entities found: 118

Tim Kaine                           -> PERSON
Schumer                             -> PERSON
House                               -> ORG
Democrats                           -> NORP
Meet the Press                      -> WORK_OF_ART
Virginia                            -> GPE
Republicans                         -> NORP
Tim Kaine                           -> PERSON
House                               -> ORG
Democrats                           -> NORP
Schumer                             -> PERSON
Nov. 16, 2025                       -> DATE
10:17 AM EST                        -> TIME
Megan Lebowitz                      -> PERSON
WASHINGTON                          -> GPE
Tim Kaine                           -> PERSON
Senate                              -> ORG
Chuck Schumer                       -> PERSON
NBC News                            -> ORG
Meet the Press                      -> WORK_OF_ART
House                               -> ORG
Democrats                   


## Discussion of spaCy's Pre-Trained NER Results

The cell above prints the recognized entities with their labels (for example, `PERSON`, `ORG`, `GPE`, `DATE`).
When interpreting the output, consider the following points and note your observations in your report:

- **Accuracy and precision**
  - Which entities are recognized correctly (e.g., *Tim Kaine* → `PERSON`, *WASHINGTON* → `GPE`)?
  - Are dates, organizations (e.g., news outlets, parties), and locations identified reliably?

- **Coverage / recall**
  - Are there important entities that spaCy **misses entirely**?
  - Does spaCy merge multi-word names correctly (e.g., *Alexandria Ocasio-Cortez*)?

- **Label granularity**
  - spaCy's built-in labels are generic (`ORG`, `GPE`, `NORP`, etc.).  
    Are there domain-specific distinctions you might want (e.g., `TV_SHOW`, `PROGRAM`, `LAW`)?

- **Potential improvements**
  - Consider a larger, transformer-based model (e.g., `en_core_web_trf`), domain-specific fine-tuning,
    or combining statistical NER with rule-based post-processing.
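One way to ground these observations is to tally the entities by label and list the distinct surface forms under each one. The sketch below is only illustrative: `summarize_entities` and `sample_entities` are hypothetical names, and the sample pairs are copied from the output above. In the notebook, the full `entities` list from the previous cell could be passed in instead.

```python
from collections import Counter

# A few (text, label) pairs copied from the output above (not the full list).
sample_entities = [
    ("Tim Kaine", "PERSON"),
    ("Schumer", "PERSON"),
    ("House", "ORG"),
    ("Democrats", "NORP"),
    ("Meet the Press", "WORK_OF_ART"),
    ("Virginia", "GPE"),
    ("Republicans", "NORP"),
    ("Tim Kaine", "PERSON"),
]


def summarize_entities(pairs):
    """Return (mention counts per label, distinct surface forms per label)."""
    label_counts = Counter(label for _, label in pairs)
    forms_by_label = {}
    for text, label in pairs:
        forms_by_label.setdefault(label, set()).add(text)
    return label_counts, forms_by_label


label_counts, forms_by_label = summarize_entities(sample_entities)

# Print labels from most to least frequent, with their distinct mentions.
for label, count in label_counts.most_common():
    print(f"{label:12s} {count:3d}  {sorted(forms_by_label[label])}")
```

Scanning the distinct forms per label makes questionable assignments easier to spot than reading the raw list in document order.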



## Extending NER with a Custom Entity Type

Next, I will extend spaCy's NER model with a **custom label**.

For this example, I introduce the label **`PROGRAM`** to mark *government programs or laws* mentioned
in the article, such as:

- **"Affordable Care Act"**  
- **"Obamacare"**  
- **"Supplemental Nutrition Assistance Program"** (SNAP)

Steps:

1. Define a tiny training dataset (`TRAIN_DATA`) with text and character-offset annotations.
2. Add the new label to spaCy's NER component.
3. Run a short fine-tuning loop on the small annotated dataset.
4. Re-run NER on the original text and check whether the model can now recognize `PROGRAM` entities.
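Step 1 relies on character offsets, which are easy to get wrong when typed by hand. A small helper can compute them from the phrase itself. This is a minimal sketch: `find_span` and `make_example` are hypothetical helpers, not part of spaCy, and they only handle the first occurrence of each phrase.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


def make_example(text, phrases, label="PROGRAM"):
    """Build one (text, annotations) pair in the TRAIN_DATA format."""
    return (text, {"entities": [find_span(text, p, label) for p in phrases]})


example = make_example(
    "The Affordable Care Act subsidies are central to the debate.",
    ["Affordable Care Act"],
)
print(example)
```

Raising an error on a missing phrase catches typos in the annotation before they turn into misaligned training spans.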


In [12]:
import random
from spacy.util import minibatch

TRAIN_DATA = []

sent1 = "The Affordable Care Act subsidies are central to the debate."
start1 = sent1.index("Affordable Care Act")
end1 = start1 + len("Affordable Care Act")
TRAIN_DATA.append((sent1, {"entities": [(start1, end1, "PROGRAM")]}))

sent2 = "Extending Obamacare subsidies was a key demand for many Democrats."
start2 = sent2.index("Obamacare")
end2 = start2 + len("Obamacare")
TRAIN_DATA.append((sent2, {"entities": [(start2, end2, "PROGRAM")]}))

sent3 = "The Supplemental Nutrition Assistance Program provides SNAP benefits."
start3 = sent3.index("Supplemental Nutrition Assistance Program")
end3 = start3 + len("Supplemental Nutrition Assistance Program")
TRAIN_DATA.append((sent3, {"entities": [(start3, end3, "PROGRAM")]}))

print("TRAIN_DATA examples:")
for text, ann in TRAIN_DATA:
    print(text)
    print(ann)
    print()

# Add the new label to the existing NER component
ner = nlp.get_pipe("ner")
ner.add_label("PROGRAM")


TRAIN_DATA examples:
The Affordable Care Act subsidies are central to the debate.
{'entities': [(4, 23, 'PROGRAM')]}

Extending Obamacare subsidies was a key demand for many Democrats.
{'entities': [(10, 19, 'PROGRAM')]}

The Supplemental Nutrition Assistance Program provides SNAP benefits.
{'entities': [(4, 45, 'PROGRAM')]}



0

In [9]:
# `random` and `minibatch` are already imported in the TRAIN_DATA cell
from spacy.training import Example

In [13]:
# Convert TRAIN_DATA into spaCy Example objects
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]

# Disable all other pipeline components while training NER
# (select_pipes replaces the deprecated disable_pipes in spaCy v3)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
n_iters = 20

with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(n_iters):
        random.shuffle(examples)
        losses = {}

        # simple minibatching over Example objects
        for batch in minibatch(examples, size=2):
            nlp.update(
                batch,
                drop=0.2,
                sgd=optimizer,
                losses=losses
            )

        if (itn + 1) % 5 == 0:
            print(f"Iteration {itn+1}, Losses: {losses}")

print("Finished fine-tuning.")

Iteration 5, Losses: {'ner': np.float32(4.2955832)}
Iteration 10, Losses: {'ner': np.float32(1.0348586)}
Iteration 15, Losses: {'ner': np.float32(0.027100949)}
Iteration 20, Losses: {'ner': np.float32(0.00013697025)}
Finished fine-tuning.


In [11]:
# Run the updated model on the original article
updated_doc = nlp(raw_text)

program_entities = [(ent.text, ent.label_) for ent in updated_doc.ents if ent.label_ == "PROGRAM"]

print("Custom PROGRAM entities found in the original text:\n")
for text, label in program_entities:
    print(f"{text:50s} -> {label}")

# For comparison, also show a few standard entities after fine-tuning
print("\nSample of other entities after fine-tuning:")
for ent in list(updated_doc.ents)[:30]:
    print(f"{ent.text:35s} -> {ent.label_}")

Custom PROGRAM entities found in the original text:

Affordable Care Act subsidies                      -> PROGRAM
Affordable Care Act subsidies                      -> PROGRAM
Supplemental Nutrition Assistance Program (SNAP), federal worker protections and a future Senate vote on health care subsidies.

 -> PROGRAM
Obamacare                                          -> PROGRAM
Obamacare                                          -> PROGRAM

Sample of other entities after fine-tuning:
Tim Kaine                           -> PERSON
Schumer                             -> PERSON
House                               -> ORG
Meet the Press                      -> WORK_OF_ART
Virginia                            -> GPE
Tim Kaine                           -> PERSON
House                               -> ORG
Schumer                             -> PERSON
Nov. 16, 2025                       -> DATE
10:17 AM EST                        -> TIME
Megan Lebowitz                      -> PERSON
Tim Kaine      


## Discussion of Custom Entity Training

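After the short fine-tuning run, the model tags mentions of
**"Affordable Care Act"**, **"Obamacare"**, and **"Supplemental Nutrition Assistance Program"**
in the original article as `PROGRAM` entities, but the span boundaries are unreliable.
In the output above, only *Obamacare* is matched exactly: *Affordable Care Act subsidies* picks up
the trailing word *subsidies*, and the SNAP mention runs to the end of its sentence.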
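To make the boundary check systematic, each predicted `PROGRAM` span can be compared against the intended target phrases. The sketch below is illustrative only: `classify_prediction` and its three category names are assumptions made for this example, and the predicted strings are copied from the `PROGRAM` spans printed above.

```python
TARGET_PHRASES = [
    "Affordable Care Act",
    "Obamacare",
    "Supplemental Nutrition Assistance Program",
]


def classify_prediction(predicted_text, targets=TARGET_PHRASES):
    """Label a predicted span as 'exact', 'over-extended', or 'unrelated'."""
    cleaned = predicted_text.strip()
    if cleaned in targets:
        return "exact"
    if any(target in cleaned for target in targets):
        return "over-extended"
    return "unrelated"


# Predicted PROGRAM spans copied from the notebook output above.
predicted = [
    "Affordable Care Act subsidies",
    "Obamacare",
    "Supplemental Nutrition Assistance Program (SNAP), federal worker "
    "protections and a future Senate vote on health care subsidies.",
]

results = [(text, classify_prediction(text)) for text in predicted]
for text, verdict in results:
    print(f"{verdict:14s} {text[:60]}")
```

In the notebook, the list could instead be built from `updated_doc.ents`, which avoids copying strings by hand.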

Points to discuss in your write-up:

- **Effectiveness**
  - Does the updated model correctly tag the targeted phrases as `PROGRAM`?
  - Does it over-generalize and mislabel unrelated phrases as `PROGRAM`?

- **Impact on existing labels**
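  - Did the performance on standard entities (e.g., `PERSON`, `ORG`, `GPE`, `DATE`)
    remain stable, or did some entities get worse? In the sample above, *Democrats* and
    *Republicans* no longer appear as `NORP` after fine-tuning, which points to
    catastrophic forgetting.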

- **Data requirements**
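  - With only 3 short training examples, the model can learn basic patterns,
    but results will be noisy. The loss falling to about 0.0001 by iteration 20 suggests
    that the model has memorized the three sentences rather than learned a general pattern.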
  - In practice you would collect **dozens or hundreds** of annotated sentences
    for the new label before fine-tuning.

- **Further improvements**
  - Add more diverse training sentences that use the new label in different contexts.
  - Use a development set to monitor overfitting.
  - Consider combining statistical training with **spaCy's `EntityRuler`** to enforce
    high-precision patterns for critical terms.
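The `EntityRuler` suggestion can be sketched as follows. This example assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) so that it runs without a downloaded model. The variable names `nlp_rules` and `doc_rules` and the test sentence are made up for illustration. In the notebook's pipeline, the ruler would more likely be added with `before="ner"`, so that the statistical component predicts around the rule-based spans.

```python
import spacy

# A blank pipeline keeps the sketch independent of en_core_web_sm.
nlp_rules = spacy.blank("en")

# Add a rule-based entity matcher and register exact-phrase patterns.
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PROGRAM", "pattern": "Affordable Care Act"},
    {"label": "PROGRAM", "pattern": "Obamacare"},
    {"label": "PROGRAM", "pattern": "Supplemental Nutrition Assistance Program"},
])

# Hypothetical test sentence, not taken from the article.
doc_rules = nlp_rules(
    "Extending Obamacare subsidies under the Affordable Care Act was a key demand."
)
found = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(found)
```

String patterns like these match the exact token sequence, so they give tight boundaries for known terms. They will not generalize to program names missing from the pattern list.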
