# Text Preprocessing in NLP

This notebook covers the fundamental text preprocessing techniques essential for Natural Language Processing. We'll explore various preprocessing methods and build a comprehensive pipeline for different NLP tasks.

## Learning Objectives
- Understand the importance of text preprocessing in NLP
- Learn tokenization, normalization, and text cleaning techniques
- Master stemming and lemmatization
- Handle stop words and special characters
- Build reusable preprocessing pipelines
- Compare different preprocessing approaches

In [1]:
# Import required libraries
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# NLP Libraries
from wordcloud import WordCloud
import nltk
import spacy


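# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # recent NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)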

# Load spaCy model (install it first with: python -m spacy download en_core_web_sm)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Keep the name defined so later cells can test `nlp is not None`
    nlp = None
    print("spaCy English model not found. Please install with: python -m spacy download en_core_web_sm")

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. What is Text Preprocessing?

Text preprocessing is the crucial first step in any NLP pipeline. Raw text is messy and inconsistent - it contains:
- **Punctuation and special characters**: "Hello, World!"
- **Different cases**: "Hello" vs "hello"
- **Extra whitespace**: "Hello    World"
- **Common words**: "the", "is", "and" (often not meaningful)
- **Different word forms**: "running", "runs", "ran" (same concept)

**Goal**: Clean and standardize text so algorithms can process it effectively.

### Why is it important?
- **Consistency**: Ensures "Hello" and "hello" are treated the same
- **Efficiency**: Removes noise and focuses on meaningful words
- **Performance**: Better preprocessing often leads to better model results
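
The consistency point can be made concrete with a small count comparison. The snippet below is a minimal sketch using only the standard library; the sample string and variable names are illustrative and not part of the notebook's later cells.

```python
from collections import Counter
import string

raw = "Hello, World! hello world. HELLO   world"

# Naive whitespace split: case and punctuation create spurious variants
naive_counts = Counter(raw.split())

# Lowercase, strip ASCII punctuation, then split on any run of whitespace
punct_table = str.maketrans("", "", string.punctuation)
clean_counts = Counter(raw.lower().translate(punct_table).split())

print(len(naive_counts), dict(naive_counts))
print(len(clean_counts), dict(clean_counts))
```

Without normalization the six tokens are all counted as different words; after lowercasing and punctuation removal they collapse into two vocabulary entries.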

Let's explore each preprocessing technique step by step! 🚀

In [2]:
sample_text = """
    Hello, World! This is an EXAMPLE text with:
    - Different CASES (Upper, lower, MiXeD)
    - Punctuation marks!!! And special characters @#$%
    - Extra    spaces   and    tabs
    - Numbers like 123 and 456
    - URLs like https://example.com
    - Email addresses like user@example.com
    - Common words: the, is, and, of, to, in
    - Different word forms: running, runs, ran, runner
"""

print("📝 Original Text:")
print(repr(sample_text))
print("\n" + "="*50)
print("🔍 Text Visualization:")
print(sample_text)

📝 Original Text:
'\n    Hello, World! This is an EXAMPLE text with:\n    - Different CASES (Upper, lower, MiXeD)\n    - Punctuation marks!!! And special characters @#$%\n    - Extra    spaces   and    tabs\n    - Numbers like 123 and 456\n    - URLs like https://example.com\n    - Email addresses like user@example.com\n    - Common words: the, is, and, of, to, in\n    - Different word forms: running, runs, ran, runner\n'

🔍 Text Visualization:

    Hello, World! This is an EXAMPLE text with:
    - Different CASES (Upper, lower, MiXeD)
    - Punctuation marks!!! And special characters @#$%
    - Extra    spaces   and    tabs
    - Numbers like 123 and 456
    - URLs like https://example.com
    - Email addresses like user@example.com
    - Common words: the, is, and, of, to, in
    - Different word forms: running, runs, ran, runner



## 2. Tokenization - Breaking Text into Pieces

**Tokenization** is the process of splitting text into individual units called tokens (usually words or sentences).

### Types of Tokenization:
1. **Word Tokenization**: Split into words
   - `"I love AI"` → `["I", "love", "AI"]`
2. **Sentence Tokenization**: Split into sentences  
   - `"Hello. How are you?"` → `["Hello.", "How are you?"]`
3. **Subword Tokenization**: Split into smaller units (advanced)
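
To give a feel for subword tokenization, here is a toy greedy longest-match splitter in the spirit of WordPiece-style tokenizers. It is only a sketch: the vocabulary, the `##` continuation marker, and the function name are assumptions made for illustration, and real subword tokenizers learn their vocabulary from data.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "un", "##happi", "##ness"}

print(greedy_subword_tokenize("tokenization", toy_vocab))
print(greedy_subword_tokenize("unhappiness", toy_vocab))
print(greedy_subword_tokenize("xyz", toy_vocab))
```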

### Why Tokenization Matters:
- **Foundation**: All other preprocessing steps work on tokens
- **Challenges**: Handling punctuation, contractions, special cases
- **Example**: Should `"Don't"` become `["Don", "'t"]`, `["Do", "n't"]` (what NLTK's `word_tokenize` produces), or stay as `["Don't"]`?
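
One way to see how a tokenizer's rules decide the contraction question is to write a tiny regex tokenizer. This is a sketch under simple assumptions (ASCII apostrophes, contractions kept as one token); the pattern and function name are illustrative and are not what NLTK or spaCy use internally.

```python
import re

# A word, optionally followed by an apostrophe part, OR a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def regex_tokenize(text):
    """Return word and punctuation tokens, keeping contractions intact."""
    return TOKEN_PATTERN.findall(text)


tokens = regex_tokenize("Don't panic, it's fine!")
print(tokens)
```

Changing the pattern (for example, dropping the optional apostrophe group) changes how `"Don't"` is split, which is exactly the design decision every tokenizer has to make.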

In [3]:
# Tokenization Examples

# 1. Simple word tokenization (split by spaces)
simple_tokens = sample_text.split()
print("🔹 Simple Split by Spaces:")
print(f"Tokens: {simple_tokens[:10]}...")  # Show first 10
print(f"Total tokens: {len(simple_tokens)}\n")

# 2. NLTK word tokenization (handles punctuation better)
from nltk.tokenize import word_tokenize, sent_tokenize

nltk_words = word_tokenize(sample_text)
print("🔹 NLTK Word Tokenization:")
print(f"Tokens: {nltk_words[:15]}...")  # Show first 15
print(f"Total tokens: {len(nltk_words)}\n")

# 3. Sentence tokenization
sentences = sent_tokenize(sample_text)
print("🔹 Sentence Tokenization:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence.strip()}")

# 4. spaCy tokenization (most advanced)
if nlp is not None:
    doc = nlp(sample_text)
    spacy_tokens = [token.text for token in doc]
    print(f"\n🔹 spaCy Tokenization:")
    print(f"Tokens: {spacy_tokens[:15]}...")
    print(f"Total tokens: {len(spacy_tokens)}")
    
    # Show token details
    print("\n🔍 Token Details (first 10):")
    for token in list(doc)[:10]:
        print(f"'{token.text}' -> POS: {token.pos_}, Lemma: {token.lemma_}")

🔹 Simple Split by Spaces:
Tokens: ['Hello,', 'World!', 'This', 'is', 'an', 'EXAMPLE', 'text', 'with:', '-', 'Different']...
Total tokens: 58

🔹 NLTK Word Tokenization:
Tokens: ['Hello', ',', 'World', '!', 'This', 'is', 'an', 'EXAMPLE', 'text', 'with', ':', '-', 'Different', 'CASES', '(']...
Total tokens: 85

🔹 Sentence Tokenization:
Sentence 1: Hello, World!
Sentence 2: This is an EXAMPLE text with:
    - Different CASES (Upper, lower, MiXeD)
    - Punctuation marks!!!
Sentence 3: And special characters @#$%
    - Extra    spaces   and    tabs
    - Numbers like 123 and 456
    - URLs like https://example.com
    - Email addresses like user@example.com
    - Common words: the, is, and, of, to, in
    - Different word forms: running, runs, ran, runner

🔹 spaCy Tokenization:
Tokens: ['\n    ', 'Hello', ',', 'World', '!', 'This', 'is', 'an', 'EXAMPLE', 'text', 'with', ':', '\n    ', '-', 'Different']...
Total tokens: 91

🔍 Token Details (first 10):
'
    ' -> POS: SPACE, Lemma: 
    
'Hel

## 3. Stopword Removal

**What are stopwords?**
Stopwords are common words like "the", "a", "an", "in", "on", "at" that occur frequently but don't carry much semantic meaning for most NLP tasks. Removing them can help focus on the important words and reduce noise in your data.

**When to remove stopwords:**
- Text classification
- Topic modeling
- Information retrieval
- Keyword extraction

**When NOT to remove stopwords:**
- Sentiment analysis (words like "not" are crucial)
- Language modeling
- Machine translation
- Part-of-speech tagging
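
The sentiment-analysis caveat can be illustrated by filtering with and without an exception list for negations. The stopword set below is a small hand-written stand-in (an assumption, not the NLTK list) so the snippet runs without any downloads.

```python
# Tiny illustrative stopword set; real lists are much longer
ILLUSTRATIVE_STOPWORDS = {"this", "is", "was", "the", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, stopwords, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in stopwords or t in keep]


tokens = "this movie was not good".split()

plain = remove_stopwords(tokens, ILLUSTRATIVE_STOPWORDS)
with_negation = remove_stopwords(tokens, ILLUSTRATIVE_STOPWORDS, keep=NEGATIONS)

print(plain)          # the negation is lost
print(with_negation)  # the negation survives
```

With plain removal, "not good" and "good" become indistinguishable, which is why negation words are often kept for sentiment tasks.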

In [4]:
# Stopword Removal Examples

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

# Download NLTK stopwords if not already available
import nltk
nltk.download('stopwords', quiet=True)

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Sample text for demonstration
text = "The quick brown fox jumps over the lazy dog in the beautiful garden."

print("Original text:")
print(text)
print("\n" + "="*60)

# Method 1: Using NLTK
print("\n1. NLTK Stopword Removal:")
stop_words_nltk = set(stopwords.words('english'))
word_tokens = word_tokenize(text.lower())

filtered_nltk = [word for word in word_tokens if word not in stop_words_nltk and word.isalpha()]
print(f"NLTK stopwords count: {len(stop_words_nltk)}")
print(f"Original tokens: {word_tokens}")
print(f"Filtered tokens: {filtered_nltk}")

# Method 2: Using spaCy
print("\n2. spaCy Stopword Removal:")
doc = nlp(text.lower())
filtered_spacy = [token.text for token in doc if not token.is_stop and token.is_alpha]
print(f"spaCy filtered tokens: {filtered_spacy}")

# Method 3: Custom stopwords
print("\n3. Custom Stopword List:")
custom_stopwords = {'the', 'in', 'over', 'a', 'an'}
filtered_custom = [word for word in word_tokens if word.lower() not in custom_stopwords and word.isalpha()]
print(f"Custom stopwords: {custom_stopwords}")
print(f"Custom filtered tokens: {filtered_custom}")

# Comparison
print("\n" + "="*60)
print("COMPARISON:")
print(f"Original word count: {len(word_tokens)}")
print(f"After NLTK filtering: {len(filtered_nltk)}")
print(f"After spaCy filtering: {len(filtered_spacy)}")
print(f"After custom filtering: {len(filtered_custom)}")

Original text:
The quick brown fox jumps over the lazy dog in the beautiful garden.


1. NLTK Stopword Removal:
NLTK stopwords count: 198
Original tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'in', 'the', 'beautiful', 'garden', '.']
Filtered tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'beautiful', 'garden']

2. spaCy Stopword Removal:
spaCy filtered tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'beautiful', 'garden']

3. Custom Stopword List:
Custom stopwords: {'in', 'over', 'a', 'an', 'the'}
Custom filtered tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'beautiful', 'garden']

COMPARISON:
Original word count: 14
After NLTK filtering: 8
After spaCy filtering: 8
After custom filtering: 8


## 4. Stemming and Lemmatization

**What is Stemming?**
Stemming reduces words to a root form by chopping off suffixes with heuristic rules. It is fast but crude, and the result is not always a real word.

**What is Lemmatization?**
Lemmatization is more sophisticated: it finds the dictionary form (lemma) of a word using morphological analysis and the word's part of speech.

**Key Differences:**
- **Stemming**: Fast, rule-based, may produce non-words (e.g., "running" → "run", "studies" → "studi")
- **Lemmatization**: Slower, dictionary-based, produces real words (e.g., "running" → "run"; "better" → "good" when it is tagged as an adjective). The result depends on the part-of-speech tag supplied.

**When to use:**
- **Stemming**: When speed is important and slight accuracy loss is acceptable
- **Lemmatization**: When accuracy is crucial and you have computational resources
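
The contrast between the two approaches can be sketched without any library. The suffix rules and the lookup table below are deliberately tiny and hypothetical; this is not the Porter algorithm or WordNet, only an illustration of "rule-based chopping" versus "dictionary lookup".

```python
# Toy suffix rules, checked in order: (suffix, replacement)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Toy lemma dictionary for irregular forms
LEMMA_TABLE = {"ran": "run", "better": "good", "mice": "mouse"}


def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["studies", "running", "ran", "mice"]:
    print(f"{w:<10} stem={toy_stem(w):<8} lemma={toy_lemmatize(w)}")
```

The toy stemmer turns "running" into the non-word "runn" and cannot touch "ran" or "mice", while the lookup handles the irregular forms but knows nothing about words missing from its table.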

In [5]:
# Stemming and Lemmatization Examples
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
import spacy

# Download required NLTK data
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

# Initialize tools
porter_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
nlp = spacy.load('en_core_web_sm')

# Sample words to demonstrate differences
sample_words = [
    "running", "ran", "runs", "easily", "fairly", 
    "studies", "studying", "studied", "better", "good",
    "children", "feet", "teeth", "geese", "mice"
]

print("STEMMING vs LEMMATIZATION COMPARISON")
print("="*60)
print(f"{'Word':<12} {'Stemming':<12} {'NLTK Lemma':<12} {'spaCy Lemma':<12}")
print("-"*60)

for word in sample_words:
    # Stemming with Porter Stemmer
    stemmed = porter_stemmer.stem(word)
    
    # NLTK Lemmatization (requires POS tag for better results)
    lemmatized_nltk = lemmatizer.lemmatize(word, pos='v')  # trying as verb first
    if lemmatized_nltk == word:  # if no change, try as noun
        lemmatized_nltk = lemmatizer.lemmatize(word, pos='n')
    
    # spaCy Lemmatization
    doc = nlp(word)
    lemmatized_spacy = doc[0].lemma_
    
    print(f"{word:<12} {stemmed:<12} {lemmatized_nltk:<12} {lemmatized_spacy:<12}")

print("\n" + "="*60)
print("PROCESSING A SENTENCE:")

sentence = "The children were running and studying different subjects better than before."
print(f"Original: {sentence}")

# Tokenize
tokens = word_tokenize(sentence.lower())

# Stem all words
stemmed_words = [porter_stemmer.stem(token) for token in tokens if token.isalpha()]
print(f"Stemmed: {' '.join(stemmed_words)}")

# Lemmatize with NLTK: try each token as a verb first, fall back to noun if unchanged
def lemmatize_verb_then_noun(token):
    as_verb = lemmatizer.lemmatize(token, pos='v')
    return as_verb if as_verb != token else lemmatizer.lemmatize(token, pos='n')

lemmatized_nltk_words = [lemmatize_verb_then_noun(token) for token in tokens if token.isalpha()]
print(f"NLTK Lemmatized: {' '.join(lemmatized_nltk_words)}")

# Lemmatize with spaCy
doc = nlp(sentence.lower())
lemmatized_spacy_words = [token.lemma_ for token in doc if token.is_alpha]
print(f"spaCy Lemmatized: {' '.join(lemmatized_spacy_words)}")

STEMMING vs LEMMATIZATION COMPARISON
Word         Stemming     NLTK Lemma   spaCy Lemma 
------------------------------------------------------------
running      run          run          run         
ran          ran          run          run         
runs         run          run          run         
easily       easili       easily       easily      
fairly       fairli       fairly       fairly      
studies      studi        study        study       
studying     studi        study        study       
studied      studi        study        study       
better       better       better       well        
good         good         good         good        
children     children     child        child       
feet         feet         foot         foot        
teeth        teeth        teeth        tooth       
geese        gees         goose        geese       
mice         mice         mouse        mouse       

PROCESSING A SENTENCE:
Original: The children were running and studyi

## 5. Text Cleaning and Normalization

**What is Text Cleaning?**
Text cleaning involves removing or standardizing unwanted characters, formatting, and noise from text data.

**Common cleaning operations:**
- Remove special characters, punctuation
- Handle numbers and digits
- Convert to lowercase
- Remove extra whitespace
- Handle contractions (don't → do not)
- Remove HTML tags, URLs, mentions
- Normalize unicode characters

**Why is it important?**
- Reduces noise in the data
- Ensures consistency
- Improves model performance
- Handles real-world messy text data
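
The order of cleaning operations matters: pattern-based steps such as URL removal need to run before punctuation is stripped. The sketch below shows this with the standard library only; the helper names and the sample sentence are illustrative.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_urls(text):
    return URL_RE.sub("", text)


def strip_punctuation(text):
    return text.translate(PUNCT_TABLE)


def squeeze_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


text = "Read https://example.com/a?id=1 now!"

# URLs first, then punctuation: the URL is removed cleanly
urls_first = squeeze_whitespace(strip_punctuation(strip_urls(text)))

# Punctuation first: the URL loses '://' and '.', so the pattern no longer matches
punct_first = squeeze_whitespace(strip_urls(strip_punctuation(text)))

print(urls_first)
print(punct_first)
```

In the second ordering the mangled URL survives as a single junk token.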

In [6]:
# Text Cleaning and Normalization Examples

import re
import string
import unicodedata

def clean_text(text):
    """
    Comprehensive text cleaning function
    """
    # Original text
    print(f"Original: {text}")
    
    # 1. Convert to lowercase
    text = text.lower()
    print(f"Lowercase: {text}")
    
    # 2. Handle contractions
    contractions = {
        "don't": "do not",
        "won't": "will not", 
        "can't": "cannot",
        "n't": " not",
        "'re": " are",
        "'ve": " have",
        "'ll": " will",
        "'d": " would",
        "'m": " am"
    }
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    print(f"Contractions expanded: {text}")
    
    # 3. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    print(f"URLs removed: {text}")
    
    # 4. Remove mentions and hashtags (social media)
    #    Note: this pattern also clips email addresses (user@example.com -> user.com)
    text = re.sub(r'@\w+|#\w+', '', text)
    print(f"Mentions/hashtags removed: {text}")
    
    # 5. Remove numbers (optional - depends on use case)
    text = re.sub(r'\d+', '', text)
    print(f"Numbers removed: {text}")
    
    # 6. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    print(f"Punctuation removed: {text}")
    
    # 7. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    print(f"Extra whitespace removed: {text}")
    
    return text

# Test with messy text
messy_text = """
Hey @user123! Check this out: https://example.com/article?id=123 
It's AMAZING!!! Don't you think so??? I can't believe it's 2024! 
#NLP #TextProcessing    #DataScience
"""

print("TEXT CLEANING DEMONSTRATION")
print("="*60)
cleaned = clean_text(messy_text)

print("\n" + "="*60)
print("ADDITIONAL CLEANING TECHNIQUES")
print("="*60)

# HTML tag removal example
html_text = "<p>This is a <strong>HTML</strong> text with <a href='#'>links</a>.</p>"
clean_html = re.sub(r'<[^>]+>', '', html_text)
print(f"HTML tags removed: {html_text} → {clean_html}")

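# Unicode normalization example: NFKD splits accented letters into base letter + accent,
# then the ASCII encode step drops the accents (and any other non-ASCII characters)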
unicode_text = "Café naïve résumé"
normalized = unicodedata.normalize('NFKD', unicode_text).encode('ascii', 'ignore').decode('ascii')
print(f"Unicode normalized: {unicode_text} → {normalized}")

# Custom character removal (\w also matches "_", so the underscore is listed explicitly)
special_chars = "Remove these: !@#$%^&*()_+"
clean_special = re.sub(r'[^\w\s]|_', '', special_chars)
print(f"Special chars removed: {special_chars} → {clean_special}")

# Remove single characters (often noise after cleaning)
single_char_text = "This is a b c d example text"
no_single_chars = ' '.join([word for word in single_char_text.split() if len(word) > 1])
print(f"Single chars removed: {single_char_text} → {no_single_chars}")

TEXT CLEANING DEMONSTRATION
Original: 
Hey @user123! Check this out: https://example.com/article?id=123 
It's AMAZING!!! Don't you think so??? I can't believe it's 2024! 
#NLP #TextProcessing    #DataScience

Lowercase: 
hey @user123! check this out: https://example.com/article?id=123 
it's amazing!!! don't you think so??? i can't believe it's 2024! 
#nlp #textprocessing    #datascience

Contractions expanded: 
hey @user123! check this out: https://example.com/article?id=123 
it's amazing!!! do not you think so??? i cannot believe it's 2024! 
#nlp #textprocessing    #datascience

URLs removed: 
hey @user123! check this out:  
it's amazing!!! do not you think so??? i cannot believe it's 2024! 
#nlp #textprocessing    #datascience

Mentions/hashtags removed: 
hey ! check this out:  
it's amazing!!! do not you think so??? i cannot believe it's 2024! 
     

Numbers removed: 
hey ! check this out:  
it's amazing!!! do not you think so??? i cannot believe it's ! 
     

Punctuation removed:

## 6. Complete Preprocessing Pipeline

**Putting it all together:**
Now we'll create a comprehensive preprocessing pipeline that combines all the techniques we've learned. This pipeline will be modular, allowing you to enable/disable specific preprocessing steps based on your needs.

**Pipeline steps:**
1. Text cleaning and normalization
2. Tokenization
3. Stopword removal
4. Stemming or Lemmatization
5. Filtering (remove short words, numbers, etc.)

**Flexibility:**
You can customize the pipeline based on your specific NLP task and requirements.
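
Before the class-based version, the idea of a configurable pipeline can be sketched as plain function composition: each step is a callable, and enabling or disabling a step means adding it to or leaving it out of the list. The `make_pipeline` and `drop_short` names are hypothetical helpers for this illustration.

```python
def make_pipeline(*steps):
    """Return a function that applies each step to the output of the previous one."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def drop_short(tokens, min_length=3):
    """Keep only tokens with at least `min_length` characters."""
    return [t for t in tokens if len(t) >= min_length]


# Lowercase -> whitespace tokenize -> filter short tokens
pipeline = make_pipeline(str.lower, str.split, drop_short)

# Same pipeline with the filtering step switched off
pipeline_no_filter = make_pipeline(str.lower, str.split)

print(pipeline("The Quick brown FOX is up"))
print(pipeline_no_filter("The Quick brown FOX is up"))
```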

In [7]:
# Complete Text Preprocessing Pipeline
import re
import string

import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize


class TextPreprocessor:
    """
    A comprehensive text preprocessing pipeline
    """
    
    def __init__(self, 
                 remove_stopwords=True,
                 use_lemmatization=True,  # If False, will use stemming
                 remove_punctuation=True,
                 to_lowercase=True,
                 remove_numbers=True,
                 min_word_length=2):
        

        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
        nltk.download('punkt', quiet=True)
        
        self.remove_stopwords = remove_stopwords
        self.use_lemmatization = use_lemmatization
        self.remove_punctuation = remove_punctuation
        self.to_lowercase = to_lowercase
        self.remove_numbers = remove_numbers
        self.min_word_length = min_word_length
        
        # Initialize NLTK tools
        if self.remove_stopwords:
            self.stop_words = set(stopwords.words('english'))
        
        if self.use_lemmatization:
            self.lemmatizer = WordNetLemmatizer()
        else:
            self.stemmer = PorterStemmer()
            
        # Try to load spaCy model (fallback to NLTK if not available)
        try:
            self.nlp = spacy.load('en_core_web_sm')
            self.use_spacy = True
        except OSError:
            # spacy.load raises OSError when the model is not installed
            self.use_spacy = False
            print("spaCy model not found, using NLTK for tokenization")
    
    def clean_text(self, text):
        """Clean and normalize text"""
        import re
        import string
        
        if self.to_lowercase:
            text = text.lower()
        
        # Handle contractions
        contractions = {
            "don't": "do not", "won't": "will not", "can't": "cannot",
            "n't": " not", "'re": " are", "'ve": " have", 
            "'ll": " will", "'d": " would", "'m": " am"
        }
        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)
        
        # Remove URLs, mentions, hashtags
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'@\w+|#\w+', '', text)
        
        # Remove numbers if specified
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)
        
        # Remove punctuation if specified
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Clean extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
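    def tokenize(self, text):
        """Tokenize text using spaCy or NLTK.

        Note: the alphabetic filter below drops numeric tokens even when
        remove_numbers=False, so digits never reach the output.
        """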
        if self.use_spacy:
            doc = self.nlp(text)
            return [token.text for token in doc if token.is_alpha and len(token.text) >= self.min_word_length]
        else:
            from nltk.tokenize import word_tokenize
            tokens = word_tokenize(text)
            return [token for token in tokens if token.isalpha() and len(token) >= self.min_word_length]
    
    def remove_stopwords_func(self, tokens):
        """Remove stopwords from token list"""
        if not self.remove_stopwords:
            return tokens
        return [token for token in tokens if token.lower() not in self.stop_words]
    
    def normalize_tokens(self, tokens):
        """Apply stemming or lemmatization"""
        if self.use_lemmatization:
            if self.use_spacy:
                # Use spaCy for lemmatization
                doc = self.nlp(' '.join(tokens))
                return [token.lemma_ for token in doc]
            else:
                # Use NLTK lemmatizer
                return [self.lemmatizer.lemmatize(token, pos='v') for token in tokens]
        else:
            # Use stemming
            return [self.stemmer.stem(token) for token in tokens]
    
    def preprocess(self, text, return_string=True):
        """
        Complete preprocessing pipeline
        
        Args:
            text: Input text to preprocess
            return_string: If True, returns joined string; if False, returns list of tokens
        
        Returns:
            Preprocessed text as string or list of tokens
        """
        # Step 1: Clean text
        cleaned_text = self.clean_text(text)
        
        # Step 2: Tokenize
        tokens = self.tokenize(cleaned_text)
        
        # Step 3: Remove stopwords
        tokens = self.remove_stopwords_func(tokens)
        
        # Step 4: Normalize (stem or lemmatize)
        tokens = self.normalize_tokens(tokens)
        
        if return_string:
            return ' '.join(tokens)
        else:
            return tokens

# Test the pipeline with different configurations
sample_texts = [
    "The quick brown fox jumps over the lazy dog!",
    "I can't believe it's 2024! This is AMAZING!!! Don't you think so?",
    "Check out https://example.com for more information @user123 #NLP",
    "The children were running and studying better than before."
]

print("PREPROCESSING PIPELINE DEMONSTRATION")
print("="*80)

# Configuration 1: Full preprocessing
print("\n1. FULL PREPROCESSING (lemmatization, stopword removal):")
preprocessor1 = TextPreprocessor(
    remove_stopwords=True,
    use_lemmatization=True,
    remove_punctuation=True,
    to_lowercase=True,
    remove_numbers=True
)

for i, text in enumerate(sample_texts, 1):
    result = preprocessor1.preprocess(text)
    print(f"Text {i}:")
    print(f"  Original: {text}")
    print(f"  Processed: {result}")

# Configuration 2: Minimal preprocessing
print("\n2. MINIMAL PREPROCESSING (stemming, keep stopwords):")
preprocessor2 = TextPreprocessor(
    remove_stopwords=False,
    use_lemmatization=False,
    remove_punctuation=True,
    to_lowercase=True,
    remove_numbers=False
)

for i, text in enumerate(sample_texts[:2], 1):  # Just first 2 texts for brevity
    result = preprocessor2.preprocess(text)
    print(f"Text {i}:")
    print(f"  Original: {text}")
    print(f"  Processed: {result}")

# Batch processing example
print("\n3. BATCH PROCESSING EXAMPLE:")
all_texts = [
    "Natural Language Processing is fascinating!",
    "Machine learning models require clean data.",
    "Text preprocessing is a crucial step.",
]

batch_results = [preprocessor1.preprocess(text) for text in all_texts]
print("Batch processed texts:")
for original, processed in zip(all_texts, batch_results):
    print(f"  '{original}' → '{processed}'")

PREPROCESSING PIPELINE DEMONSTRATION

1. FULL PREPROCESSING (lemmatization, stopword removal):
Text 1:
  Original: The quick brown fox jumps over the lazy dog!
  Processed: quick brown fox jump lazy dog
Text 2:
  Original: I can't believe it's 2024! This is AMAZING!!! Don't you think so?
  Processed: believe amazing think
Text 3:
  Original: Check out https://example.com for more information @user123 #NLP
  Processed: check information
Text 4:
  Original: The children were running and studying better than before.
  Processed: child run study well

2. MINIMAL PREPROCESSING (stemming, keep stopwords):
Text 1:
  Original: The quick brown fox jumps over the lazy dog!
  Processed: the quick brown fox jump over the lazi dog
Text 2:
  Original: I can't believe it's 2024! This is AMAZING!!! Don't you think so?
  Processed: can not believ it thi is amaz do not you think so

3. BATCH PROCESSING EXAMPLE:
Batch processed texts:
  'Natural Language Processing is fascinating!' → 'natural language pr

## 7. Key Takeaways and Next Steps

### What we covered:
1. **Tokenization**: Breaking text into individual words or tokens
2. **Stopword Removal**: Filtering out common, low-information words
3. **Stemming vs Lemmatization**: Reducing words to their base forms
4. **Text Cleaning**: Removing noise, special characters, and standardizing format
5. **Complete Pipeline**: Combining all techniques in a modular, configurable way

### When to use each technique:
- **Always use**: Text cleaning and tokenization
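- **Use for many tasks**: Lowercasing, plus stopword removal when function words carry little signal (keep stopwords for sentiment analysis, where words like "not" matter)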
- **Choose based on task**: 
  - Stemming for speed (information retrieval)
  - Lemmatization for accuracy (sentiment analysis, classification)
- **Task-specific**: Number removal, punctuation handling
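
These task-dependent choices can be captured as named presets. The sketch below mirrors the constructor parameters of the `TextPreprocessor` class above; the `PreprocessConfig` dataclass, the task names, and the specific preset values are illustrative assumptions rather than recommendations from the notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreprocessConfig:
    remove_stopwords: bool = True
    use_lemmatization: bool = True  # False means stemming
    remove_punctuation: bool = True
    to_lowercase: bool = True
    remove_numbers: bool = True
    min_word_length: int = 2


# Hypothetical presets following the guidance above
TASK_PRESETS = {
    "information_retrieval": PreprocessConfig(use_lemmatization=False),  # stemming for speed
    "sentiment_analysis": PreprocessConfig(remove_stopwords=False),      # keep "not"
    "topic_modeling": PreprocessConfig(),                                # defaults
}


def config_for(task):
    """Return the preset for `task`, or the defaults for unknown tasks."""
    return TASK_PRESETS.get(task, PreprocessConfig())


kwargs = asdict(config_for("sentiment_analysis"))
print(kwargs)
```

The resulting dictionary has the same keys as the class constructor, so it could be unpacked as `TextPreprocessor(**kwargs)`.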
