# 📝 Homework 3: Text Processing Fundamentals
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** Sunday, February 15, 2026 @ 11pm Pacific

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier)

---

## What You'll Learn

1. Install and use NLP libraries (spaCy, NLTK)
2. Understand WHY we preprocess text (not just how)
3. Create domain-specific stopwords for YOUR data
4. Identify cases where cleaning HURTS your analysis

---

## Part 1: Environment Setup (3 points, combined with Part 2)

In [None]:
# Install required packages
!pip install spacy nltk datasets wordcloud -q
!python -m spacy download en_core_web_sm -q

print("✅ Libraries installed!")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# NLTK
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
from nltk.corpus import stopwords

# spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

# Get standard English stopwords
STANDARD_STOPWORDS = set(stopwords.words('english'))
print(f"✅ Loaded {len(STANDARD_STOPWORDS)} standard English stopwords")

## Part 2: Load Your Data (3 points, combined with Part 1)

In [None]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("nvidia/HelpSteer2", split="train")
df = dataset.to_pandas()
text_column = "response"  # Adjust based on your dataset

print(f"✅ Loaded {len(df):,} records")
df.head(3)

## Part 3: Standard Stopword Removal (4 points)

In [None]:
def remove_stopwords(text, stopwords_set):
    """Remove stopwords from text."""
    words = str(text).lower().split()
    filtered = [w for w in words if w not in stopwords_set]
    return ' '.join(filtered)

# Apply to dataset
df['text_original'] = df[text_column].astype(str)
df['text_cleaned'] = df['text_original'].apply(
    lambda x: remove_stopwords(x, STANDARD_STOPWORDS)
)

# Calculate statistics
df['word_count_original'] = df['text_original'].str.split().str.len()
df['word_count_cleaned'] = df['text_cleaned'].str.split().str.len()
df['pct_removed'] = ((df['word_count_original'] - df['word_count_cleaned']) / df['word_count_original'] * 100).round(1)

print("📊 STOPWORD REMOVAL IMPACT")
print("=" * 50)
print(f"Average reduction: {df['pct_removed'].mean():.1f}%")
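
One caveat about `remove_stopwords` above: `str.split()` leaves punctuation attached, so a token such as `it.` or `the,` does not match the stopword set and survives cleaning. The following is a minimal sketch of a punctuation-aware variant, assuming a simple regex tokenizer is good enough for your data. The names `tokenize`, `remove_stopwords_regex`, and `DEMO_STOPWORDS` are hypothetical, and the small set only stands in for `STANDARD_STOPWORDS`.

```python
import re

# Illustrative stand-in for STANDARD_STOPWORDS (assumption: not the full NLTK list)
DEMO_STOPWORDS = {"the", "is", "it", "and", "a", "this"}

# Runs of letters/digits, optionally with internal straight apostrophes (e.g. "don't")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)*")

def tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(str(text).lower())

def remove_stopwords_regex(text, stopwords_set):
    """Remove stopwords after punctuation-aware tokenization."""
    return ' '.join(t for t in tokenize(text) if t not in stopwords_set)

sample = "This is it. The battery, and the screen, don't work!"
print(remove_stopwords_regex(sample, DEMO_STOPWORDS))
```

With whitespace splitting, `it.` in this sample would be kept because of the trailing period. Note that the pattern only handles straight apostrophes; curly quotes would split contractions, so adjust it if your data contains them.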

## Part 4: Create Domain-Specific Stopwords (5 points)

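Review the most common words and identify domain-specific stopwords: words that appear so often in YOUR dataset that they no longer help distinguish one document from another.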

In [None]:
# Get word frequencies
all_words = ' '.join(df['text_cleaned']).split()
word_freq = Counter(all_words)

print("📊 TOP 30 MOST COMMON WORDS (after standard cleaning)")
print("-" * 50)
for i, (word, count) in enumerate(word_freq.most_common(30), 1):
    print(f"{i:2}. {word:15} {count:,}")
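
Raw counts can be dominated by a few long documents that repeat a word many times. As a complementary check, here is a rough sketch that measures document frequency, the share of documents containing each word. The function name `document_frequency`, the toy documents, and the `THRESHOLD` value are all assumptions for illustration; in the notebook you would pass `df['text_cleaned']` instead.

```python
from collections import Counter

def document_frequency(docs):
    """Return {word: fraction of documents that contain the word at least once}."""
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    doc_counts = Counter()
    for doc in docs:
        # set() ensures each word is counted once per document
        doc_counts.update(set(str(doc).split()))
    return {word: count / n_docs for word, count in doc_counts.items()}

# Toy documents standing in for df['text_cleaned'] (hypothetical data)
demo_docs = [
    "sure here answer python loop",
    "sure here answer sql join",
    "sure here recipe pasta sauce",
    "answer depends context budget",
]

doc_freq = document_frequency(demo_docs)
THRESHOLD = 0.5  # assumed cutoff; tune it for your dataset
candidates = sorted(w for w, frac in doc_freq.items() if frac > THRESHOLD)
print(candidates)
```

Words that show up in most documents are candidates for `DOMAIN_STOPWORDS`, but each one still needs a justification: a word can be frequent and still carry meaning for your analysis.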

In [None]:
# YOUR TASK: Add domain-specific stopwords based on the list above
DOMAIN_STOPWORDS = {
    # Add at least 10 domain-specific stopwords with justification
    # Example:
    # 'example',  # appears frequently but adds no meaning
}

print(f"✅ Created {len(DOMAIN_STOPWORDS)} domain-specific stopwords")

## Part 5: The Negation Problem (5 points)

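Standard stopword lists include negation words (NLTK's English list contains not, no, and nor, plus contractions such as don't), and removing them can reverse the meaning of a sentence!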

In [None]:
# Candidate negation words; the intersection below shows which ones are actually in the standard stopwords
negation_words = {'not', 'no', 'never', 'neither', 'nobody', 'none', "n't", 'nor'}
negations_in_stopwords = negation_words & STANDARD_STOPWORDS

print("⚠️ DANGER: These negation words are in standard stopwords:")
print(f"   {negations_in_stopwords}")

# Example of meaning change
examples = [
    "This product is not good at all",
    "I would not recommend this",
]

print("\n📝 NEGATION REMOVAL EXAMPLES")
for text in examples:
    cleaned = remove_stopwords(text, STANDARD_STOPWORDS)
    print(f"\nOriginal: {text}")
    print(f"Cleaned:  {cleaned}")

In [None]:
# Create SMART stopwords that preserve negations
COMBINED_STOPWORDS = STANDARD_STOPWORDS | DOMAIN_STOPWORDS
SMART_STOPWORDS = COMBINED_STOPWORDS - negation_words

print(f"✅ Smart stopwords: {len(SMART_STOPWORDS)} (preserves negation)")
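
One gap worth checking: the `"n't"` entry in `negation_words` only matches when a tokenizer splits contractions into pieces (NLTK's `word_tokenize`, for example, splits "don't" into "do" and "n't"). With whitespace splitting, "don't" stays a single token, and since contractions like "don't" and "isn't" are themselves in NLTK's list, they may still be removed. The sketch below shows one possible way to keep them, assuming any word ending in `n't` should count as a negation. The helper names and the small demo sets are hypothetical.

```python
# Illustrative subset of a stopword list (assumption: the real list is larger)
demo_stopwords = {"i", "this", "is", "at", "all", "not", "no", "nor",
                  "don't", "isn't", "wouldn't", "don", "isn"}
demo_negations = {"not", "no", "never", "nor", "n't"}

def is_negation(word, negations):
    """True for explicit negation words and for contractions ending in n't."""
    return word in negations or word.endswith("n't")

def build_smart_stopwords(stopwords_set, negations):
    """Drop every negation-bearing entry from the stopword set."""
    return {w for w in stopwords_set if not is_negation(w, negations)}

def strip_stopwords(text, stopwords_set):
    """Same whitespace-based logic as remove_stopwords in Part 3."""
    words = str(text).lower().split()
    return ' '.join(w for w in words if w not in stopwords_set)

smart = build_smart_stopwords(demo_stopwords, demo_negations)
for text in ["This is not good at all", "I don't recommend this"]:
    print(f"{text!r} -> {strip_stopwords(text, smart)!r}")
```

In the notebook, the same idea would apply to `COMBINED_STOPWORDS` in place of `demo_stopwords`. Whether contractions deserve this treatment depends on your task, which is part of what Q3 asks you to consider.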

## Part 6: Visualization (3 points)

In [None]:
from wordcloud import WordCloud

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original
text_orig = ' '.join(df['text_original'].sample(500, random_state=103))
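# stopwords=set() turns off WordCloud's built-in stopword filter so the raw text is shown as-is
wc1 = WordCloud(width=400, height=300, background_color='white', max_words=50, stopwords=set())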
wc1.generate(text_orig)
axes[0].imshow(wc1)
axes[0].set_title('Original Text')
axes[0].axis('off')

# Cleaned
text_clean = ' '.join(df['text_cleaned'].sample(500, random_state=103))
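# stopwords=set() keeps WordCloud from applying its own built-in stopword list on top of ours
wc2 = WordCloud(width=400, height=300, background_color='white', max_words=50, stopwords=set())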
wc2.generate(text_clean)
axes[1].imshow(wc2)
axes[1].set_title('After Stopword Removal')
axes[1].axis('off')

plt.tight_layout()
plt.show()
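
Word clouds are qualitative, so it can help to back them up with numbers. The following is a small sketch that lists the most frequent tokens before and after cleaning. `top_words` and the demo lists are hypothetical; in the notebook you would pass `df['text_original']` and `df['text_cleaned']`.

```python
from collections import Counter

def top_words(texts, n=5):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(str(text).lower().split())
    return counts.most_common(n)

# Toy data standing in for the original and cleaned columns (hypothetical)
demo_original = ["the model is good and the answer is clear",
                 "the answer is not clear the end"]
demo_cleaned = ["model good answer clear",
                "answer not clear",
                "answer short"]

before = top_words(demo_original, n=2)
after = top_words(demo_cleaned, n=2)

print("Before:", before)
print("After: ", after)
```

If the "before" list is mostly function words and the "after" list is mostly content words, the cleaning step is doing what the word clouds suggest.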

---

## Questions to Answer

**Q1:** Which removed stopwords might carry meaning in your domain?

*Your answer:*

**Q2:** Why did you choose each domain-specific stopword?

*Your answer:*

**Q3:** When should you preserve vs. remove negations?

*Your answer:*

---

## Submission Checklist

| Item | Points | Done? |
|------|--------|-------|
| Part 1-2: Setup and data loaded | 3 | ☐ |
| Part 3: Standard stopword analysis | 4 | ☐ |
| Part 4: 10+ domain stopwords with justification | 5 | ☐ |
| Part 5: Negation analysis | 5 | ☐ |
| Part 6: Word cloud visualization | 3 | ☐ |
| **Total** | **20** | |