# Lab 02: Basic NLP Preprocessing Techniques

**Course:** ITAI 2373 - Natural Language Processing  
**Module:** 02 - Text Preprocessing  
**Duration:** 2-3 hours  
**Student Name:** Kaden Glover  
**Date:** June 7

---

## 🎯 Learning Objectives

By completing this lab, you will:
1. Understand the critical role of preprocessing in NLP pipelines
2. Master fundamental text preprocessing techniques
3. Compare different libraries and their approaches
4. Analyze the effects of preprocessing on text data
5. Build a complete preprocessing pipeline
6. Load and work with different types of text datasets

## 📖 Introduction to NLP Preprocessing

Natural Language Processing (NLP) preprocessing refers to the initial steps taken to clean and transform raw text data into a format that's more suitable for analysis by machine learning algorithms.

### Why is preprocessing crucial?

1. **Standardization:** Ensures consistent text format across your dataset
2. **Noise Reduction:** Removes irrelevant information that could confuse algorithms
3. **Complexity Reduction:** Simplifies text to focus on meaningful patterns
4. **Performance Enhancement:** Improves the efficiency and accuracy of downstream tasks

### Real-world Impact
Consider searching for "running shoes" vs "Running Shoes!" - without preprocessing, these might be treated as completely different queries. Preprocessing ensures they're recognized as equivalent.
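
As a minimal sketch of that idea, the snippet below normalizes both queries with lowercasing and punctuation stripping. The helper name `normalize_query` is hypothetical and used only for illustration; production search engines apply far more elaborate normalization.

```python
import string

def normalize_query(query: str) -> str:
    """Lowercase a query, drop ASCII punctuation, and collapse whitespace."""
    lowered = query.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

raw_a = "running shoes"
raw_b = "Running Shoes!"

print(raw_a == raw_b)                                    # False: the raw strings differ
print(normalize_query(raw_a) == normalize_query(raw_b))  # True: same normalized form
```

Both queries reduce to `running shoes`, so a system comparing normalized forms would treat them as the same request.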

### 🤔 Conceptual Question 1
**Before we start coding, think about your daily interactions with text processing systems (search engines, chatbots, translation apps). What challenges do you think these systems face when processing human language? List at least 3 specific challenges and explain why each is problematic.**

*Double-click this cell to write your answer:*

**Challenge 1:** Ambiguity:
Human language is full of ambiguity. Words and phrases often have multiple meanings depending on context. Text processing systems struggle to correctly interpret meaning without sufficient context, which can lead to misunderstandings or inaccurate responses.

**Challenge 2:** Idioms and Figurative Language:
Humans often use idioms, metaphors, and other non-literal expressions. These phrases are difficult for systems to interpret because their meanings cannot be derived from the literal definitions of the words. Without cultural or contextual awareness, machines often misinterpret these expressions.



**Challenge 3:** Grammar and Spelling Variations:
Users often input text with grammar mistakes, slang, or typos—especially in informal settings like social media or text messaging. Text processing systems must be able to understand and correct these variations while still preserving the intended meaning, which requires sophisticated language models and error correction mechanisms.



---

## 🛠️ Part 1: Environment Setup

We'll be working with two major NLP libraries:
- **NLTK (Natural Language Toolkit):** Comprehensive NLP library with extensive resources
- **spaCy:** Industrial-strength NLP with pre-trained models

**⚠️ Note:** Installation might take 2-3 minutes to complete.

In [1]:
# Step 1: Install Required Libraries
print("🔧 Installing NLP libraries...")

!pip install -q nltk spacy
!python -m spacy download en_core_web_sm

print("✅ Installation complete!")

🔧 Installing NLP libraries...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 103.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ Installation complete!


### 🤔 Conceptual Question 2
**Why do you think we need to install a separate language model (en_core_web_sm) for spaCy? What components might this model contain that help with text processing? Think about what information a computer needs to understand English text.**

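*Double-click this cell to write your answer:*

I think we need to install a separate language model like en_core_web_sm because the spaCy library by itself provides only the processing framework and does not know much about the English language. The model supplies the trained components, such as the part-of-speech tagger, dependency parser, lemmatizer, and named entity recognizer, along with the data they learned from English text. Basically, the model gives spaCy the "brain" it needs to process and make sense of English text.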

---

In [1]:
# Step 2: Import Libraries and Download NLTK Data
import nltk
import spacy
import string
import re
from collections import Counter

# Download essential NLTK data
print("📦 Downloading NLTK data packages...")
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')    # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging

print("\n✅ All imports and downloads completed!")

📦 Downloading NLTK data packages...


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...



✅ All imports and downloads completed!


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## 📂 Part 2: Sample Text Data

We'll work with different types of text to understand how preprocessing affects various text styles:
- Simple text
- Academic text (with citations, URLs)
- Social media text (with emojis, hashtags)
- News text (formal writing)
- Product reviews (informal, ratings)

In [2]:
# Step 3: Load Sample Texts
simple_text = "Natural Language Processing is a fascinating field of AI. It's amazing!"

academic_text = """
Dr. Smith's research on machine-learning algorithms is groundbreaking!
She published 3 papers in 2023, focusing on deep neural networks (DNNs).
The results were amazing - accuracy improved by 15.7%!
"This is revolutionary," said Prof. Johnson.
Visit https://example.com for more info. #NLP #AI @university
"""

social_text = "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍"

news_text = """
The stock market experienced significant volatility today, with tech stocks leading the decline.
Apple Inc. (AAPL) dropped 3.2%, while Microsoft Corp. fell 2.8%.
"We're seeing a rotation out of growth stocks," said analyst Jane Doe from XYZ Capital.
"""

review_text = """
This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The battery life is incredible - lasts 8-10 hours easily.
Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars.
"""

# Store all texts
sample_texts = {
    "Simple": simple_text,
    "Academic": academic_text.strip(),
    "Social Media": social_text,
    "News": news_text.strip(),
    "Product Review": review_text.strip()
}

print("📄 Sample texts loaded successfully!")
for name, text in sample_texts.items():
    preview = text[:80] + "..." if len(text) > 80 else text
    print(f"\n🏷️ {name}: {preview}")

📄 Sample texts loaded successfully!

🏷️ Simple: Natural Language Processing is a fascinating field of AI. It's amazing!

🏷️ Academic: Dr. Smith's research on machine-learning algorithms is groundbreaking!
She publi...

🏷️ Social Media: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yu...

🏷️ News: The stock market experienced significant volatility today, with tech stocks lead...

🏷️ Product Review: This laptop is absolutely fantastic! I've been using it for 6 months and it's st...


### 🤔 Conceptual Question 3
**Looking at the different text types we've loaded, what preprocessing challenges do you anticipate for each type? For each text type below, identify at least 2 specific preprocessing challenges and explain why they might be problematic for NLP analysis.**

*Double-click this cell to write your answer:*

**Simple text challenges:**
1. Lack of complexity – The text is short and straightforward, so it may not provide enough context for deep analysis or model training.

2. Contractions – Words like "It's" may be split into "It" and "'s" or expanded to "It is" depending on the tool, so they need consistent handling during tokenization and parsing.



**Academic text challenges:**
1. Technical jargon and abbreviations – Terms like "DNNs" or "machine-learning algorithms" might not be recognized without domain-specific models.
2. Complex sentence structures – Long, information-dense sentences with multiple clauses can be harder to parse accurately.

**Social media text challenges:**
1. Emojis and hashtags – Symbols like "☕️", "😍", or "#yum" carry meaning but are difficult for standard models to interpret.
2. Slang and informal language – Words like "OMG" or phrases like "SO GOOD!!!" are non-standard and often require special preprocessing.

**News text challenges:**
1. Named entity recognition (NER) – Identifying companies, stocks, and people correctly (e.g., "Apple Inc.", "Jane Doe") is crucial but can be tricky.
2. Financial data and quotes – Parsing numerical data (e.g., "3.2%") and attributing quotes to the right speakers requires careful structure analysis.



**Product review challenges:**
1. Mixed sentiment – A review may contain both praise and criticism in the same text, making sentiment analysis more difficult.
2. Informal phrasing – Phrases like “super fast” or “could be better” are subjective and can be hard to quantify accurately in analysis.

---

## 🔤 Part 3: Tokenization

### What is Tokenization?
Tokenization is the process of breaking down text into smaller, meaningful units called **tokens**. These tokens are typically words, but can also be sentences, characters, or subwords.

### Why is it Important?
- Most NLP algorithms work with individual tokens, not entire texts
- It's the foundation for all subsequent preprocessing steps
- Different tokenization strategies can significantly impact results

### Common Challenges:
- **Contractions:** "don't" → "do" + "n't" or "don't"?
- **Punctuation:** Keep with words or separate?
- **Special characters:** How to handle @, #, URLs?
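
To make these trade-offs concrete, here is a small standard-library sketch that tokenizes one sentence three ways. The example sentence and the regular expressions are illustrative assumptions, not the rules NLTK or spaCy actually use.

```python
import re

text = "Don't email @sam about #NLP, see https://example.com today!"

# Strategy 1: split on whitespace only. Punctuation stays glued to words.
whitespace_tokens = text.split()

# Strategy 2: runs of word characters, or any single non-space symbol.
# This breaks contractions, hashtags, mentions, and URLs into pieces.
naive_regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# Strategy 3: try URL, mention/hashtag, and contraction patterns first,
# then fall back to single symbols. The order of alternatives matters.
social_pattern = re.compile(
    r"https?://[^\s,]+"   # URLs (stop at whitespace or a comma)
    r"|[@#]\w+"           # @mentions and #hashtags kept whole
    r"|\w+(?:'\w+)?"      # words, optionally with an internal apostrophe
    r"|[^\w\s]"           # any other single symbol
)
social_tokens = social_pattern.findall(text)

print(whitespace_tokens)
print(naive_regex_tokens)
print(social_tokens)
```

The first strategy leaves `#NLP,` with its comma attached, the second shreds the URL into many fragments, and the third keeps `Don't`, `@sam`, `#NLP`, and the URL intact. None of these is universally right; the choice depends on what the downstream task needs.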

In [None]:
# Step 4: Tokenization with NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

# Test on simple text
print("🔍 NLTK Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Word tokenization
nltk_tokens = word_tokenize(simple_text)
print(f"\nWord tokens: {nltk_tokens}")
print(f"Number of tokens: {len(nltk_tokens)}")

# Sentence tokenization
sentences = sent_tokenize(simple_text)
print(f"\nSentences: {sentences}")
print(f"Number of sentences: {len(sentences)}")

### 🤔 Conceptual Question 4
**Examine the NLTK tokenization results above. How did NLTK handle the contraction "It's"? What happened to the punctuation marks? Do you think this approach is appropriate for all NLP tasks? Explain your reasoning.**

*Double-click this cell to write your answer:*

**How "It's" was handled:** NLTK split the contraction "It's" into two tokens: "It" and "'s". This is common in many NLP tokenizers because it helps identify the base word and the contraction separately, which can be useful for grammatical analysis.



**Punctuation treatment:** Punctuation marks like the period and exclamation mark were treated as separate tokens. This allows the tokenizer to preserve sentence boundaries and makes it easier to analyze sentence structure or identify sentence-ending punctuation.



**Appropriateness for different tasks:** This approach is appropriate for many NLP tasks like part-of-speech tagging, dependency parsing, and grammar analysis, where separating contractions and punctuation is helpful. However, for tasks like sentiment analysis or text generation, breaking contractions and punctuation might sometimes lose the natural flow or tone of the text.

---

In [None]:
# Step 5: Tokenization with spaCy
nlp = spacy.load('en_core_web_sm')

print("🔍 spaCy Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Process with spaCy
doc = nlp(simple_text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(f"\nWord tokens: {spacy_tokens}")
print(f"Number of tokens: {len(spacy_tokens)}")

# Show detailed token information
print(f"\n🔬 Detailed Token Analysis:")
print(f"{'Token':<12} {'POS':<8} {'Lemma':<12} {'Is Alpha':<8} {'Is Stop':<8}")
print("-" * 50)
for token in doc:
    # str() keeps True/False readable; a bare bool with a width spec prints as 1/0
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {str(token.is_alpha):<8} {str(token.is_stop):<8}")

### 🤔 Conceptual Question 5
**Compare the NLTK and spaCy tokenization results. What differences do you notice? Which approach do you think would be better for different NLP tasks? Consider specific examples like sentiment analysis vs. information extraction.**

*Double-click this cell to write your answer:*

**Key differences observed:** Both NLTK and spaCy split contractions like “It’s” into two tokens (It and 's). However, spaCy provides much richer information for each token, such as part of speech, lemmas, and stop word flags. NLTK focuses mainly on splitting the text into words and sentences without additional linguistic details. Also, spaCy’s tokenization tends to be more consistent with linguistic rules, especially for complex text.



**Better for sentiment analysis:** NLTK’s simpler tokenization might be sufficient for basic sentiment analysis where the focus is mainly on words and their counts. However, spaCy’s ability to identify lemmas and stop words can improve sentiment models by reducing noise and capturing the true meaning of words. So, spaCy is generally better if a more nuanced understanding is needed.



**Better for information extraction:** spaCy is clearly better for information extraction tasks because it provides part-of-speech tags, lemmas, dependency parsing, and named entity recognition. These features allow it to identify relationships, entities, and roles in the text more accurately than NLTK’s basic tokenization.

**Overall assessment:** NLTK is great for straightforward, lightweight tokenization tasks and quick experiments. spaCy, on the other hand, offers a powerful and linguistically informed pipeline that is better suited for advanced NLP applications like entity recognition, syntactic parsing, and context-aware tasks.



---

In [None]:
# Step 6: Test Tokenization on Complex Text
print("🧪 Testing on Social Media Text")
print("=" * 40)
print(f"Original: {social_text}")

# NLTK approach
social_nltk_tokens = word_tokenize(social_text)
print(f"\nNLTK tokens: {social_nltk_tokens}")

# spaCy approach
social_doc = nlp(social_text)
social_spacy_tokens = [token.text for token in social_doc]
print(f"spaCy tokens: {social_spacy_tokens}")

print(f"\n📊 Comparison:")
print(f"NLTK token count: {len(social_nltk_tokens)}")
print(f"spaCy token count: {len(social_spacy_tokens)}")

### 🤔 Conceptual Question 6
**Looking at how the libraries handled social media text (emojis, hashtags), which library seems more robust for handling "messy" real-world text? What specific advantages do you notice? How might this impact a real-world application like social media sentiment analysis?**

*Double-click this cell to write your answer:*

**More robust library:** spaCy appears more robust for handling "messy" real-world text like social media posts.

**Specific advantages:**

* spaCy treats emojis, hashtags, and mentions as separate, meaningful tokens, which helps capture the full context of social media language.
* It provides detailed linguistic annotations, allowing better understanding of slang and informal expressions.
* spaCy’s tokenizer is designed to handle non-standard text elements common in social media, such as repeated punctuation and emoticons.




**Impact on sentiment analysis:** Because spaCy recognizes emojis and hashtags as distinct tokens, it can better capture the emotional tone and topics discussed in posts. This leads to more accurate sentiment detection and helps identify trends or opinions expressed in informal language, improving real-world social media monitoring and analysis.



---

## 🛑 Part 4: Stop Words Removal

### What are Stop Words?
Stop words are common words that appear frequently in a language but typically don't carry much meaningful information about the content. Examples include "the", "is", "at", "which", "on", etc.

### Why Remove Stop Words?
1. **Reduce noise** in the data
2. **Improve efficiency** by reducing vocabulary size
3. **Focus on content words** that carry semantic meaning

### When NOT to Remove Stop Words?
- **Sentiment analysis:** "not good" vs "good" - the "not" is crucial!
- **Question answering:** "What is the capital?" - "what" and "is" provide context
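
The sentiment case can be sketched with a tiny, hypothetical stop list (`TOY_STOPWORDS` below is an illustrative stand-in, much smaller than the NLTK or spaCy lists, although NLTK's English list does contain "not"). The `keep` parameter is likewise an assumption of this sketch, showing one way to protect negations.

```python
# Hypothetical miniature stop list for illustration only.
TOY_STOPWORDS = {"the", "is", "was", "a", "not", "what", "of"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, stopwords, keep=frozenset()):
    """Drop tokens found in `stopwords` unless they are listed in `keep`."""
    return [t for t in tokens if t.lower() not in stopwords or t.lower() in keep]

positive = "the movie was good".split()
negative = "the movie was not good".split()

# Naive removal: both reviews collapse to the same tokens.
print(remove_stopwords(positive, TOY_STOPWORDS))   # ['movie', 'good']
print(remove_stopwords(negative, TOY_STOPWORDS))   # ['movie', 'good']

# Negation-aware removal keeps "not", so the two reviews stay distinct.
print(remove_stopwords(negative, TOY_STOPWORDS, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

After naive removal a classifier can no longer tell the two reviews apart, which is why sentiment pipelines often exempt negation words or skip stop word removal altogether.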

In [None]:
# Step 7: Explore Stop Words Lists
from nltk.corpus import stopwords

# Get NLTK English stop words
nltk_stopwords = set(stopwords.words('english'))
print(f"📊 NLTK has {len(nltk_stopwords)} English stop words")
print(f"First 20: {sorted(list(nltk_stopwords))[:20]}")

# Get spaCy stop words
spacy_stopwords = nlp.Defaults.stop_words
print(f"\n📊 spaCy has {len(spacy_stopwords)} English stop words")
print(f"First 20: {sorted(list(spacy_stopwords))[:20]}")

# Compare the lists
common_stopwords = nltk_stopwords.intersection(spacy_stopwords)
nltk_only = nltk_stopwords - spacy_stopwords
spacy_only = spacy_stopwords - nltk_stopwords

print(f"\n🔍 Comparison:")
print(f"Common stop words: {len(common_stopwords)}")
print(f"Only in NLTK: {len(nltk_only)} - Examples: {sorted(list(nltk_only))[:5]}")
print(f"Only in spaCy: {len(spacy_only)} - Examples: {sorted(list(spacy_only))[:5]}")

### 🤔 Conceptual Question 7
**Why do you think NLTK and spaCy have different stop word lists? Look at the examples of words that are only in one list - do you agree with these choices? Can you think of scenarios where these differences might significantly impact your NLP results?**

*Double-click this cell to write your answer:*

**Reasons for differences:** NLTK and spaCy have different stop word lists because they were created with different design goals and use cases in mind. NLTK’s list is more traditional and based on classic linguistic resources, while spaCy’s list is designed for modern NLP tasks and includes more contractions, abbreviations, and common informal words. Each toolkit tailors its stop words to what their users typically need.



**Agreement with choices:** Some unique words in each list make sense given their purpose. For example, spaCy including contractions like “n't” helps remove common negations that may not carry strong meaning alone. On the other hand, NLTK’s inclusion of some archaic or less common words may not always be useful for everyday modern text. Overall, both lists have valid choices, but spaCy’s list might better suit contemporary language.


**Scenarios where differences matter:** In sentiment analysis, removing negations like “not” or “n't” as stop words could change the sentiment drastically if not handled carefully.

For topic modeling or information retrieval, missing or including certain stop words could affect keyword frequency and model quality.

In social media analysis, where informal language and contractions are common, spaCy’s stop word list might perform better, whereas NLTK’s could be too broad or miss important tokens.



---

In [None]:
# Step 8: Remove Stop Words with NLTK
# Test on simple text
original_tokens = nltk_tokens  # From earlier tokenization
filtered_tokens = [word for word in original_tokens if word.lower() not in nltk_stopwords]

print("🧪 NLTK Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(original_tokens)}): {original_tokens}")
print(f"After removing stop words ({len(filtered_tokens)}): {filtered_tokens}")

# Show which words were removed
removed_words = [word for word in original_tokens if word.lower() in nltk_stopwords]
print(f"\nRemoved words: {removed_words}")

# Calculate the reduction in token count (tokens, not unique vocabulary)
reduction = (len(original_tokens) - len(filtered_tokens)) / len(original_tokens) * 100
print(f"Token reduction: {reduction:.1f}%")

In [None]:
# Step 9: Remove Stop Words with spaCy
doc = nlp(simple_text)
spacy_filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("🧪 spaCy Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(spacy_tokens)}): {spacy_tokens}")
print(f"After removing stop words & punctuation ({len(spacy_filtered)}): {spacy_filtered}")

# Show which words were removed
spacy_removed = [token.text for token in doc if token.is_stop or token.is_punct]
print(f"\nRemoved words: {spacy_removed}")

# Calculate the reduction in token count (tokens, not unique vocabulary)
spacy_reduction = (len(spacy_tokens) - len(spacy_filtered)) / len(spacy_tokens) * 100
print(f"Token reduction: {spacy_reduction:.1f}%")

### 🤔 Conceptual Question 8
**Compare the NLTK and spaCy stop word removal results. Which approach removed more words? Do you think removing punctuation (as spaCy did) is always a good idea? Give a specific example where keeping punctuation might be important for NLP analysis.**

*Double-click this cell to write your answer:*

**Which removed more:** spaCy removed more words overall because it filters out both stop words and punctuation, while the NLTK example removed only stop words and kept punctuation tokens.



**Punctuation removal assessment:** Removing punctuation is often helpful to reduce noise, but it’s not always the best choice. Punctuation can carry important meaning, especially in informal texts or specialized domains.



**Example where punctuation matters:** In sentiment analysis, exclamation marks can indicate strong emotions or emphasis, so keeping them helps the model understand intensity. For example, “I love this!” vs. “I love this.” convey different levels of enthusiasm.
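
One possible way to keep that signal while still cleaning the text is to record the emphasis before stripping punctuation. The sketch below assumes a hypothetical helper, `clean_with_emphasis`, that returns the cleaned text together with an exclamation count; it is not part of the lab's pipeline.

```python
import string

def clean_with_emphasis(text: str):
    """Return (cleaned_text, exclamation_count) so emphasis is not lost."""
    exclamations = text.count("!")
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split()), exclamations

print(clean_with_emphasis("I love this!"))   # ('i love this', 1)
print(clean_with_emphasis("I love this."))   # ('i love this', 0)
```

The cleaned strings are identical, but the count preserves the difference in intensity as a separate feature.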



---

## 🌱 Part 5: Lemmatization and Stemming

### What is Lemmatization?
Lemmatization reduces words to their base or dictionary form (called a **lemma**). It considers context and part of speech to ensure the result is a valid word.

### What is Stemming?
Stemming reduces words to their root form by removing suffixes. It's faster but less accurate than lemmatization.

### Key Differences:
| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Speed | Fast | Slower |
| Accuracy | Lower | Higher |
| Output | May be non-words | Always valid words |
| Context | Ignores context | Considers context |

### Examples:
- **"running"** → Stem: "run", Lemma: "run"
- **"better"** → Stem: "better", Lemma: "good"
- **"was"** → Stem: "wa", Lemma: "be"
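
The contrast in these examples can be sketched with two toy functions. `toy_stem` is a crude suffix stripper, not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical hand-written lookup with simplified part-of-speech labels; real lemmatizers rely on large dictionaries and trained taggers.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Hypothetical (word, part-of-speech) -> lemma table for illustration.
TOY_LEMMAS = {
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word with a given POS; fall back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

for word in ("was", "better", "running"):
    print(f"{word:<10} stem={toy_stem(word)}")

print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("better", "ADV"))    # well
print(toy_lemmatize("running", "NOUN"))  # running
```

The stripper turns "was" into the non-word "wa" and cannot relate "better" to "good", while the lookup returns a different lemma for the same surface form depending on its part of speech. Note that the toy version yields "runn" for "running", whereas the Porter stemmer has an extra rule that produces "run".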

In [None]:
# Step 10: Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Test words that demonstrate stemming challenges
test_words = ['running', 'runs', 'ran', 'better', 'good', 'best', 'flying', 'flies', 'was', 'were', 'cats', 'dogs']

print("🌿 Stemming Demonstration")
print("=" * 30)
print(f"{'Original':<12} {'Stemmed':<12}")
print("-" * 25)

for word in test_words:
    stemmed = stemmer.stem(word)
    print(f"{word:<12} {stemmed:<12}")

# Apply to our sample text
sample_tokens = [token for token in nltk_tokens if token.isalpha()]
stemmed_tokens = [stemmer.stem(token.lower()) for token in sample_tokens]

print(f"\n🧪 Applied to sample text:")
print(f"Original: {sample_tokens}")
print(f"Stemmed: {stemmed_tokens}")

### 🤔 Conceptual Question 9
**Look at the stemming results above. Can you identify any cases where stemming produced questionable results? For example, how were "better" and "good" handled? Do you think this is problematic for NLP applications? Explain your reasoning.**

*Double-click this cell to write your answer:*

**Questionable results identified:** The stemmer reduced “better” to “better”, but “good” remained “good” as well, showing no common root. Also, words like “running” and “runs” were stemmed correctly to “run,” but “flies” became “fli,” which isn’t a proper root form.



**Assessment of "better" and "good":** Since “better” and “good” are related in meaning, the stemmer’s failure to link them is a limitation. The stemmer treats them as unrelated words because it relies on simple rule-based suffix stripping, not semantic understanding.



**Impact on NLP applications:** This can be problematic for tasks like sentiment analysis or text classification, where recognizing related words matters. Stemming may oversimplify or miss important relationships, leading to less accurate results. In such cases, lemmatization or more advanced methods might be better.



---

In [None]:
# Step 11: Lemmatization with spaCy
print("🌱 spaCy Lemmatization Demonstration")
print("=" * 40)

# Test on a complex sentence
complex_sentence = "The researchers were studying the effects of running and swimming on better performance."
doc = nlp(complex_sentence)

print(f"Original: {complex_sentence}")
print(f"\n{'Token':<15} {'Lemma':<15} {'POS':<10} {'Explanation':<20}")
print("-" * 65)

for token in doc:
    if token.is_alpha:
        explanation = "No change" if token.text.lower() == token.lemma_.lower() else "Lemmatized"
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {explanation:<20}")

# Extract lemmas
lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
print(f"\n🔤 Lemmatized tokens (no stop words): {lemmas}")

In [None]:
# Step 12: Compare Stemming vs Lemmatization
comparison_words = ['better', 'running', 'studies', 'was', 'children', 'feet']

print("⚖️ Stemming vs Lemmatization Comparison")
print("=" * 50)
print(f"{'Original':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 40)

for word in comparison_words:
    # Stemming
    stemmed = stemmer.stem(word)

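    # Lemmatization with spaCy
    # Note: a lone word has no sentence context, so the predicted POS tag
    # (and therefore the lemma) may differ from what spaCy returns in a sentence
    doc = nlp(word)
    lemmatized = doc[0].lemma_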

    print(f"{word:<12} {stemmed:<12} {lemmatized:<12}")

### 🤔 Conceptual Question 10
**Compare the stemming and lemmatization results. Which approach do you think is more suitable for:**
1. **A search engine** (where speed is crucial and you need to match variations of words)?
2. **A sentiment analysis system** (where accuracy and meaning preservation are important)?
3. **A real-time chatbot** (where both speed and accuracy matter)?

**Explain your reasoning for each choice.**

*Double-click this cell to write your answer:*

**1. Search engine:** Stemming is more suitable because it is faster and can quickly reduce words to their root forms, helping match different variations of a word during search queries. Speed is crucial here, and exact linguistic accuracy is less important than broad matching.

**2. Sentiment analysis:** Lemmatization is better because it preserves the true meaning of words by considering context and grammar. This helps the system understand the sentiment more accurately, especially with irregular forms like “better” vs. “good” or “was” vs. “be.”

**3. Real-time chatbot:** A balance is needed. Lemmatization provides better accuracy for understanding user input, but stemming is faster. Using lemmatization with efficient implementation or selective stemming might work best to maintain reasonable speed without losing important meaning.

---

## 🧹 Part 6: Text Cleaning and Normalization

### What is Text Cleaning?
Text cleaning involves removing or standardizing elements that might interfere with analysis:
- **Case normalization** (converting to lowercase)
- **Punctuation removal**
- **Number handling** (remove, replace, or normalize)
- **Special character handling** (URLs, emails, mentions)
- **Whitespace normalization**
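
For number handling, the difference between removing and replacing can be sketched as follows. The `<NUM>` placeholder and the helper names are assumptions made for this illustration; the right strategy depends on whether the task needs to know that a number was present.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")   # integers and simple decimals

def remove_numbers(text: str) -> str:
    """Delete numbers and collapse the whitespace left behind."""
    return " ".join(NUMBER.sub(" ", text).split())

def replace_numbers(text: str, placeholder: str = "<NUM>") -> str:
    """Swap every number for a placeholder so its presence is still visible."""
    return NUMBER.sub(placeholder, text)

sentence = "Accuracy improved by 15.7% after 3 runs"
print(remove_numbers(sentence))    # Accuracy improved by % after runs
print(replace_numbers(sentence))   # Accuracy improved by <NUM>% after <NUM> runs
```

Removal leaves a dangling `%` and loses the fact that a measurement was reported, while the placeholder keeps that information without adding each distinct value to the vocabulary.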

### Why is it Important?
- Ensures consistency across your dataset
- Reduces vocabulary size
- Improves model performance
- Handles edge cases in real-world data
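
The vocabulary-size claim can be checked on a tiny, made-up corpus. The two sentences and the `vocabulary` helper below are illustrative assumptions; the effect is far larger on real datasets.

```python
from collections import Counter

corpus = [
    "Great laptop , GREAT battery !",
    "great screen , Great price !",
]

def vocabulary(docs, lowercase=False):
    """Count whitespace-separated tokens across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split() if lowercase else doc.split()
        counts.update(tokens)
    return counts

raw_vocab = vocabulary(corpus)
norm_vocab = vocabulary(corpus, lowercase=True)

print(len(raw_vocab), len(norm_vocab))   # 9 7
print(norm_vocab["great"])               # 4
```

Without case normalization, "Great", "GREAT", and "great" count as three separate vocabulary entries; after lowercasing they merge into one entry with four occurrences.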

In [None]:
# Step 13: Basic Text Cleaning
def basic_clean_text(text):
    """Apply basic text cleaning operations"""
    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra spaces again
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test basic cleaning
test_text = "   Hello WORLD!!! This has 123 numbers and   extra spaces.   "
cleaned = basic_clean_text(test_text)

print("🧹 Basic Text Cleaning")
print("=" * 30)
print(f"Original: '{test_text}'")
print(f"Cleaned: '{cleaned}'")
print(f"Length reduction: {(len(test_text) - len(cleaned))/len(test_text)*100:.1f}%")

In [None]:
# Step 14: Advanced Cleaning for Social Media
def advanced_clean_text(text):
    """Apply advanced cleaning for social media and web text"""
    # Remove URLs (http://, https://, or www. prefixes)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)

    # Convert hashtags (keep the word, remove #)
    text = re.sub(r'#(\w+)', r'\1', text)

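    # Remove emojis (basic approach; covers common ranges, not every emoji)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               u"\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
                               u"\u2600-\u27BF"          # misc symbols & dingbats (e.g. ☕)
                               u"\uFE0F"                 # variation selector that follows some emojis
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)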

    # Convert to lowercase and normalize whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test on social media text
print("🚀 Advanced Cleaning on Social Media Text")
print("=" * 45)
print(f"Original: {social_text}")

cleaned_social = advanced_clean_text(social_text)
print(f"Cleaned: {cleaned_social}")
print(f"Length reduction: {(len(social_text) - len(cleaned_social))/len(social_text)*100:.1f}%")

### 🤔 Conceptual Question 11
**Look at the advanced cleaning results for the social media text. What information was lost during cleaning? Can you think of scenarios where removing emojis and hashtags might actually hurt your NLP application? What about scenarios where keeping them would be beneficial?**

*Double-click this cell to write your answer:*

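**Information lost:** Emojis expressing emotions (like 😍, 👍) were removed, and the `#` markers were stripped from the hashtags, so "coffee" and "yum" remain as plain words but are no longer marked as topic tags. Lowercasing also erased the emphasis in "OMG" and "SO GOOD". Together, these changes weaken the emotional tone and topical cues.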

**Scenarios where removal hurts:** Sentiment analysis, where emojis often carry strong feelings.

Trend detection or topic modeling, where hashtags identify key subjects or communities.



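**Scenarios where keeping helps:** Keeping emojis and hashtags helps whenever they are the signal itself, for example emotion detection on short posts or hashtag-based trend tracking.

By contrast, removing them is reasonable when focusing on core text content only, as in formal language processing or document classification, or when the model does not handle emojis and hashtags well and they would only add noise.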

---

## 🔧 Part 7: Building a Complete Preprocessing Pipeline

Now let's combine everything into a comprehensive preprocessing pipeline that you can customize based on your needs.

### Pipeline Components:
1. **Text cleaning** (basic or advanced)
2. **Tokenization** (NLTK or spaCy)
3. **Stop word removal** (optional)
4. **Lemmatization/Stemming** (optional)
5. **Additional filtering** (length, etc.)

In [None]:
# Step 15: Complete Preprocessing Pipeline
def preprocess_text(text,
                   clean_level='basic',     # 'basic' or 'advanced'
                   remove_stopwords=True,
                   use_lemmatization=True,
                   use_stemming=False,
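                   min_length=2):
    """
    Complete text preprocessing pipeline.

    clean_level: 'basic' uses basic_clean_text; any other value uses advanced_clean_text.
    remove_stopwords: drop stop words (spaCy list with lemmatization, NLTK list otherwise).
    use_lemmatization: tokenize and lemmatize with spaCy; otherwise tokenize with NLTK.
    use_stemming: apply the Porter stemmer; ignored when use_lemmatization is True.
    min_length: discard tokens shorter than this many characters.
    Returns a list of lowercase tokens.
    """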
    # Step 1: Clean text
    if clean_level == 'basic':
        cleaned_text = basic_clean_text(text)
    else:
        cleaned_text = advanced_clean_text(text)

    # Step 2: Tokenize
    if use_lemmatization:
        # Use spaCy for lemmatization
        doc = nlp(cleaned_text)
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha]
    else:
        # Use NLTK for basic tokenization
        tokens = word_tokenize(cleaned_text)
        tokens = [token for token in tokens if token.isalpha()]

    # Step 3: Remove stop words
    if remove_stopwords:
        if use_lemmatization:
            tokens = [token for token in tokens if token not in spacy_stopwords]
        else:
            tokens = [token.lower() for token in tokens if token.lower() not in nltk_stopwords]

    # Step 4: Apply stemming if requested
    if use_stemming and not use_lemmatization:
        tokens = [stemmer.stem(token.lower()) for token in tokens]

    # Step 5: Filter by length
    tokens = [token for token in tokens if len(token) >= min_length]

    return tokens

print("🔧 Preprocessing Pipeline Created!")
print("✅ Ready to test different configurations.")

In [None]:
# Step 16: Test Different Pipeline Configurations
test_text = sample_texts["Product Review"]
print(f"🎯 Testing on: {test_text[:100]}...")
print("=" * 60)

# Configuration 1: Minimal processing
minimal = preprocess_text(test_text,
                         clean_level='basic',
                         remove_stopwords=False,
                         use_lemmatization=False,
                         use_stemming=False)
print(f"\n1. Minimal processing ({len(minimal)} tokens):")
print(f"   {minimal[:10]}...")

# Configuration 2: Standard processing
standard = preprocess_text(test_text,
                          clean_level='basic',
                          remove_stopwords=True,
                          use_lemmatization=True)
print(f"\n2. Standard processing ({len(standard)} tokens):")
print(f"   {standard[:10]}...")

# Configuration 3: Aggressive processing
aggressive = preprocess_text(test_text,
                            clean_level='advanced',
                            remove_stopwords=True,
                            use_lemmatization=False,
                            use_stemming=True,
                            min_length=3)
print(f"\n3. Aggressive processing ({len(aggressive)} tokens):")
print(f"   {aggressive[:10]}...")

# Show reduction percentages
original_count = len(word_tokenize(test_text))
print(f"\n📊 Token Reduction Summary:")
print(f"   Original: {original_count} tokens")
print(f"   Minimal: {len(minimal)} ({(original_count-len(minimal))/original_count*100:.1f}% reduction)")
print(f"   Standard: {len(standard)} ({(original_count-len(standard))/original_count*100:.1f}% reduction)")
print(f"   Aggressive: {len(aggressive)} ({(original_count-len(aggressive))/original_count*100:.1f}% reduction)")
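
The reduction summary repeats the same percentage arithmetic for every configuration and would raise `ZeroDivisionError` on an empty text. A small helper could centralize that calculation. This is a sketch only: `reduction_percent` and `summarize` are hypothetical names, not functions defined elsewhere in this lab.

```python
def reduction_percent(original_count, processed_count):
    """Return the percentage of tokens removed, or 0.0 when there were no tokens to begin with."""
    if original_count <= 0:
        return 0.0
    return (original_count - processed_count) / original_count * 100


def summarize(original_count, results):
    """Build one summary line per configuration; results maps a label to a token list."""
    lines = [f"Original: {original_count} tokens"]
    for label, tokens in results.items():
        pct = reduction_percent(original_count, len(tokens))
        lines.append(f"{label}: {len(tokens)} ({pct:.1f}% reduction)")
    return lines


# Toy data standing in for real pipeline output
for line in summarize(4, {"Minimal": ["great", "phone", "battery"]}):
    print(line)
```

With the real pipeline, the `results` argument could be a dictionary such as `{"Minimal": minimal, "Standard": standard, "Aggressive": aggressive}`.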

### 🤔 Conceptual Question 12
**Compare the three pipeline configurations (Minimal, Standard, Aggressive). For each configuration, analyze:**
1. **What information was preserved?**
2. **What information was lost?**
3. **What type of NLP task would this configuration be best suited for?**

*Double-click this cell to write your answer:*

**Minimal Processing:**

Preserved:

* All original words, including stop words and other function words.
* Original word forms (e.g., "running" stays distinct from "run"), since no stemming or lemmatization is applied.

Lost:

* Punctuation, numbers, and other non-alphabetic tokens, which the `isalpha()` filter drops.

Best for:

* Tasks requiring maximum context or detailed text understanding, like language modeling, text generation, or syntactic analysis.



**Standard Processing:**

Preserved:

* Core content words with unified lemmas (e.g., “running” → “run”).
* A less noisy token list, since stop words are removed.
* Semantic meaning, with simplified word forms.

Lost:

* Stop words and some function words that might carry subtle meaning.
* Punctuation, numbers, and other non-alphabetic tokens (dropped by `token.is_alpha`). Basic cleaning adds no dedicated handling of emojis, URLs, or mentions.
* Some nuance in original word forms, due to lemmatization.

Best for:

* Sentiment analysis, topic modeling, text classification, or information retrieval where balance between noise reduction and meaning preservation is key.



**Aggressive Processing:**

Preserved:

* The most meaningful content words, after aggressive normalization and cleaning.

Lost:

* Stop words, short words (under 3 characters), and social media signals such as URLs, mentions, hashtags, and emojis.
* Detailed word forms: stemming truncates words to stems that are not always real words.
* Potentially important context carried by those social media tokens.

Best for:

* Large-scale or real-time applications needing speed and compactness, such as spam filtering, large-scale search, or social media monitoring where noise is high and rapid processing is needed.
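
The point about stop words carrying subtle meaning is easiest to see with negation. The sketch below uses a hypothetical mini stop word list (`TOY_STOPWORDS`) and a hypothetical helper (`remove_stopwords_keeping`), both invented for illustration. The NLTK English stop word list also contains "not", so the same effect can appear in a real pipeline.

```python
# Hypothetical mini stop word list for illustration only (an assumption,
# not the NLTK or spaCy list).
TOY_STOPWORDS = {"the", "is", "was", "not", "a", "this", "at", "all"}


def remove_stopwords_keeping(tokens, stopwords, keep=()):
    """Drop stop words from tokens, except those explicitly listed in keep."""
    keep = set(keep)
    return [t for t in tokens if t not in stopwords or t in keep]


tokens = "this phone is not good at all".split()

# Plain removal drops the negation and flips the apparent sentiment
plain = remove_stopwords_keeping(tokens, TOY_STOPWORDS)

# Keeping "not" preserves the negative meaning
negation_aware = remove_stopwords_keeping(tokens, TOY_STOPWORDS, keep={"not"})

print(plain)           # ['phone', 'good']
print(negation_aware)  # ['phone', 'not', 'good']
```

For sentiment-oriented tasks, whitelisting negation words in this way is one option to evaluate against removing no stop words at all.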


In [None]:
# Step 17: Comprehensive Analysis Across Text Types
print("🔬 Comprehensive Preprocessing Analysis")
print("=" * 50)

# Test standard preprocessing on all text types
results = {}
for name, text in sample_texts.items():
    original_tokens = len(word_tokenize(text))
    processed_tokens = preprocess_text(text,
                                      clean_level='basic',
                                      remove_stopwords=True,
                                      use_lemmatization=True)

    reduction = (original_tokens - len(processed_tokens)) / original_tokens * 100
    results[name] = {
        'original': original_tokens,
        'processed': len(processed_tokens),
        'reduction': reduction,
        'sample': processed_tokens[:8]
    }

    print(f"\n📄 {name}:")
    print(f"   Original: {original_tokens} tokens")
    print(f"   Processed: {len(processed_tokens)} tokens ({reduction:.1f}% reduction)")
    print(f"   Sample: {processed_tokens[:8]}")

# Summary table
print(f"\n\n📋 Summary Table")
print(f"{'Text Type':<15} {'Original':<10} {'Processed':<10} {'Reduction':<10}")
print("-" * 50)
for name, data in results.items():
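    # Format the percentage first so the % sign stays attached to the number
    reduction_str = f"{data['reduction']:.1f}%"
    print(f"{name:<15} {data['original']:<10} {data['processed']:<10} {reduction_str:<10}")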

### 🤔 Final Conceptual Question 13
**Looking at the comprehensive analysis results across all text types:**

1. **Which text type was most affected by preprocessing?** Why do you think this happened?

2. **Which text type was least affected?** What does this tell you about the nature of that text?

3. **If you were building an NLP system to analyze customer reviews for a business, which preprocessing approach would you choose and why?**

4. **What are the main trade-offs you need to consider when choosing preprocessing techniques for any NLP project?**

*Double-click this cell to write your answer:*

**1. Most affected text type:**
Social Media text was the most affected by preprocessing, showing the highest token reduction. This likely happened because social media posts often contain a lot of noise such as emojis, hashtags, mentions, URLs, and informal language. Step 17 uses only basic cleaning, but the alphabetic-token filter (`token.is_alpha`) and stop word removal still discard many of these tokens.

**2. Least affected text type:**
Product Reviews or possibly Academic Papers were the least affected. These texts tend to be more formal, structured, and focused on conveying concrete information, so fewer tokens are removed during cleaning and stop word removal. This indicates that the language is more standardized and less noisy.

**3. For customer review analysis:**
A balanced preprocessing approach similar to the Standard Processing configuration would be ideal: clean the text to remove noise and stop words, and use lemmatization to unify word forms while preserving their meaning. This maintains important sentiment-bearing words and key terms without overly aggressive reduction, which is important for accurate sentiment and opinion analysis.

**4. Main trade-offs to consider:**

* **Information retention vs noise reduction:** Aggressive cleaning can remove useful context or subtle cues (e.g., emojis or punctuation), while minimal cleaning risks including irrelevant noise.
* **Speed vs accuracy:** Simpler methods like stemming or minimal cleaning are faster but less accurate; lemmatization and advanced cleaning improve meaning but cost more processing time.
* **Task-specific needs:** Some tasks require preserving certain tokens (e.g., hashtags in social media analysis), while others benefit from generalization (e.g., search engines).
* **Domain and text type:** Informal text requires different cleaning than formal writing, affecting choice of preprocessing steps.
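
One way to make these trade-offs explicit is to record a preprocessing configuration per task. The sketch below assumes the keyword arguments of `preprocess_text` from Part 7. The preset names and values are hypothetical starting points to be validated by experiment, not recommendations, and `PIPELINE_PRESETS` and `get_preset` are names invented for this example.

```python
# Hypothetical presets: keys mirror the keyword arguments of preprocess_text,
# values are assumptions that should be checked against the actual task.
PIPELINE_PRESETS = {
    # Keep stop words so negations such as "not" survive
    "sentiment": {
        "clean_level": "basic",
        "remove_stopwords": False,
        "use_lemmatization": True,
        "use_stemming": False,
        "min_length": 2,
    },
    # Favor generalization and compact tokens for matching queries
    "search": {
        "clean_level": "advanced",
        "remove_stopwords": True,
        "use_lemmatization": False,
        "use_stemming": True,
        "min_length": 2,
    },
    # Favor readable content words for topic labels
    "topic_modeling": {
        "clean_level": "advanced",
        "remove_stopwords": True,
        "use_lemmatization": True,
        "use_stemming": False,
        "min_length": 3,
    },
}


def get_preset(task):
    """Return a copy of the preset for task, or raise ValueError for an unknown task."""
    try:
        return dict(PIPELINE_PRESETS[task])
    except KeyError:
        choices = sorted(PIPELINE_PRESETS)
        raise ValueError(f"Unknown task {task!r}; choose from {choices}") from None


print(get_preset("sentiment"))
```

A preset could then be applied with `preprocess_text(text, **get_preset("sentiment"))`, which keeps each task's choices in one place and makes them easy to compare.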




## 🎯 Lab Summary and Reflection

Congratulations! You've completed a comprehensive exploration of NLP preprocessing techniques.

### 🔑 Key Concepts You've Mastered:

1. **Text Preprocessing Fundamentals** - Understanding why preprocessing is crucial
2. **Tokenization Techniques** - NLTK vs spaCy approaches and their trade-offs
3. **Stop Word Management** - When to remove them and when to keep them
4. **Morphological Processing** - Stemming vs lemmatization for different use cases
5. **Text Cleaning Strategies** - Basic vs advanced cleaning for different text types
6. **Pipeline Design** - Building modular, configurable preprocessing systems

### 🎓 Real-World Applications:
These techniques form the foundation for search engines, chatbots, sentiment analysis, document classification, machine translation, and information extraction systems.

### 💡 Key Insights to Remember:
- **No Universal Solution**: Different NLP tasks require different preprocessing approaches
- **Trade-offs Are Everywhere**: Balance information preservation with noise reduction
- **Context Matters**: The same technique can help or hurt depending on your use case
- **Experimentation Is Key**: Always test and measure impact on your specific task

---

**Excellent work completing Lab 02!** 🎉

For your reflection journal, focus on the insights you gained about when and why to use different techniques, the challenges you encountered, and connections you made to real-world applications.