# 🏷️ Homework 5: Named Entity Recognition (NER)
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** See WebCampus for deadline

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier)

---

## What You'll Learn

1. Understand Named Entity Recognition (NER) concepts
2. Use spaCy for entity extraction
3. Extract business-relevant entities from text
4. Analyze entity patterns

---

## Part 1: Setup and Data Loading (3 points)

In [None]:
!pip install spacy datasets pandas matplotlib seaborn -q
!python -m spacy download en_core_web_sm -q

import spacy
from spacy import displacy  # For visual NER rendering
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

nlp = spacy.load("en_core_web_sm")
print("✅ spaCy loaded!")

In [None]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]")
df = dataset.to_pandas()

print(f"✅ Loaded {len(df):,} reviews")

## Part 2: Understanding NER Entity Types (4 points)

In [None]:
# Demo NER on sample text
sample_text = """
Apple CEO Tim Cook announced the new iPhone 15 in San Francisco yesterday.
The product costs $999 and will be available in the United States starting
September 22, 2024. Amazon and Best Buy will also carry the device.
"""

doc = nlp(sample_text)

print("üìä NAMED ENTITIES FOUND")
print("-" * 60)
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_:10} | {spacy.explain(ent.label_)}")

# 🎨 Visual representation: see entities highlighted in context
print("\n" + "=" * 60)
print("🎨 VISUAL NER (entities highlighted in text):")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
# Try another example - a movie review style text
movie_review = """
Christopher Nolan's Inception starring Leonardo DiCaprio was filmed in Los Angeles 
and Paris. Warner Bros released it in July 2010 for $160 million. The film won 
four Academy Awards and grossed over $830 million worldwide.
"""

movie_doc = nlp(movie_review)

print("üé¨ MOVIE REVIEW NER")
print("-" * 60)
for ent in movie_doc.ents:
    print(f"{ent.text:25} | {ent.label_:10} | {spacy.explain(ent.label_)}")

print("\n" + "=" * 60)
print("üé® VISUAL REPRESENTATION:")
displacy.render(movie_doc, style="ent", jupyter=True)
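
When a text contains many entities, it can be easier to read them grouped by type than in document order. The sketch below shows one way to do that with plain Python. The helper name `group_by_label` and the example pairs are assumptions made for illustration; the pairs are typed by hand and are not actual model output.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group (entity_text, label) pairs into {label: [texts]}, keeping input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Hand-typed pairs for illustration only (an assumption, not real model output).
# With a spaCy doc, the pairs could be built as:
#     [(ent.text, ent.label_) for ent in doc.ents]
example_pairs = [
    ("Tim Cook", "PERSON"),
    ("San Francisco", "GPE"),
    ("Amazon", "ORG"),
    ("Best Buy", "ORG"),
]

grouped = group_by_label(example_pairs)
for label, texts in grouped.items():
    print(f"{label:10} | {', '.join(texts)}")
```

Comparing the grouped output against the text is a quick way to spot labels that look wrong, which is useful when answering Q1.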

## Part 3: Extract Entities from Reviews (6 points)

In [None]:
def extract_entities(text):
    """Extract named entities from text."""
    doc = nlp(str(text)[:5000])  # Truncate to the first 5,000 characters for speed
    entities = {
        'ORG': [],
        'PRODUCT': [],
        'GPE': [],
        'PERSON': [],
    }
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)
    return entities

# Extract from all reviews
print("Extracting entities (this takes 1-2 minutes)...")

all_orgs = []
all_products = []
all_locations = []
all_persons = []

for idx, row in df.iterrows():
    if idx % 500 == 0:
        print(f"   Processing {idx}/{len(df)}...")
    entities = extract_entities(row['text'])
    all_orgs.extend(entities['ORG'])
    all_products.extend(entities['PRODUCT'])
    all_locations.extend(entities['GPE'])
    all_persons.extend(entities['PERSON'])

print("\n✅ Extraction complete!")
print(f"   ORG: {len(all_orgs):,}")
print(f"   PRODUCT: {len(all_products):,}")
print(f"   GPE: {len(all_locations):,}")
print(f"   PERSON: {len(all_persons):,}")
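
The loop above calls `nlp` once per review. spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster on a few thousand documents. The following is a minimal sketch of the same extraction using `nlp.pipe`, not a replacement for the required code. The function name `extract_entities_batched` and the `max_chars` and `batch_size` defaults are assumptions chosen for illustration.

```python
def extract_entities_batched(texts, nlp, labels=("ORG", "PRODUCT", "GPE", "PERSON"),
                             max_chars=5000, batch_size=64):
    """Return one {label: [entity texts]} dict per input text, in input order."""
    # Truncate lazily so long reviews do not slow the pipeline down.
    truncated = (str(t)[:max_chars] for t in texts)
    results = []
    # nlp.pipe yields one Doc per input text, processed in batches.
    for doc in nlp.pipe(truncated, batch_size=batch_size):
        found = {label: [] for label in labels}
        for ent in doc.ents:
            if ent.label_ in found:
                found[ent.label_].append(ent.text)
        results.append(found)
    return results

# Possible usage with the objects defined earlier in this notebook:
#     per_review = extract_entities_batched(df["text"], nlp)
#     all_orgs = [org for review in per_review for org in review["ORG"]]
```

Because the result keeps one dictionary per review, it also preserves which entities appeared together in the same review, which the flat `all_orgs` and `all_persons` lists do not.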

## Part 4: Analyze Entity Patterns (4 points)

In [None]:
# Most common entities
org_counts = Counter(all_orgs).most_common(15)
person_counts = Counter(all_persons).most_common(15)
location_counts = Counter(all_locations).most_common(10)

print("üìä TOP ORGANIZATIONS")
print("-" * 40)
for org, count in org_counts:
    print(f"{org:25} | {count}")

print("\nüìä TOP PEOPLE")
print("-" * 40)
for person, count in person_counts:
    print(f"{person:25} | {count}")

print("\nüìä TOP LOCATIONS")
print("-" * 40)
for loc, count in location_counts:
    print(f"{loc:25} | {count}")
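
Q4 asks about co-occurrences, which the frequency tables above cannot show because the flat lists no longer record which review each entity came from. One possible approach, sketched below, is to keep a list of entities per review and count how many reviews mention each pair. The helpers `normalize` and `cooccurrence_counts` and the `reviews` data are hypothetical and typed by hand for illustration; they are not results from the IMDB dataset.

```python
from collections import Counter
from itertools import combinations

def normalize(entity):
    """Collapse whitespace and drop a trailing possessive so variants count together."""
    cleaned = " ".join(entity.split())
    if cleaned.endswith("'s"):
        cleaned = cleaned[:-2]
    return cleaned

def cooccurrence_counts(entities_per_review):
    """Count how many reviews mention each unordered pair of distinct entities."""
    pair_counts = Counter()
    for entities in entities_per_review:
        # A set removes repeats within one review; sorting makes pair order stable.
        unique = sorted({name for name in map(normalize, entities) if name})
        pair_counts.update(combinations(unique, 2))
    return pair_counts

# Hypothetical per-review entity lists (hand-typed, not real extraction output).
reviews = [
    ["Leonardo DiCaprio", "Christopher Nolan", "Leonardo  DiCaprio"],
    ["Christopher Nolan's", "Leonardo DiCaprio"],
    ["Christopher Nolan"],
]

for pair, count in cooccurrence_counts(reviews).most_common(5):
    print(pair, count)
```

The light normalization step is a design choice: without it, variants such as a possessive form would be counted as separate entities, which also affects the top-15 tables.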

## Part 5: Visualization (3 points)

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Organizations
if org_counts:
    orgs, counts = zip(*org_counts[:10])
    axes[0].barh(orgs, counts, color='steelblue')
    axes[0].set_title('Top 10 Organizations')
    axes[0].invert_yaxis()

# People
if person_counts:
    persons, counts = zip(*person_counts[:10])
    axes[1].barh(persons, counts, color='coral')
    axes[1].set_title('Top 10 People')
    axes[1].invert_yaxis()

# Locations
if location_counts:
    locs, counts = zip(*location_counts[:10])
    axes[2].barh(locs, counts, color='seagreen')
    axes[2].set_title('Top 10 Locations')
    axes[2].invert_yaxis()

plt.tight_layout()
plt.savefig('ner_analysis.png', dpi=150)
plt.show()

---

## Questions to Answer

**Q1:** Were the extracted entities accurate? What errors did you observe?

*Your answer:*

**Q2:** What business insights can you derive from entity mentions?

*Your answer:*

**Q3:** How could you improve NER for your specific domain?

*Your answer:*

**Q4:** What patterns did you observe in the entity distributions? Were there any surprising co-occurrences?

*Your answer:*

---

## Submission Checklist

| Item | Points | Done? |
|------|--------|-------|
| Part 1: Setup and data loaded | 3 | ☐ |
| Part 2: NER demo with explanations | 4 | ☐ |
| Part 3: Entity extraction from reviews | 6 | ☐ |
| Part 4: Analysis of patterns | 4 | ☐ |
| Part 5: Visualization | 3 | ☐ |
| **Total** | **20** | |