# Module 01: Text Preprocessing Pipeline

**Master the Art of Cleaning and Preparing Text Data**

---

## 1. Objectives

By the end of this module, you will:

- ✅ Understand why preprocessing is critical for NLP
- ✅ Master text cleaning techniques (HTML, URLs, special chars)
- ✅ Implement tokenization from scratch (including BPE)
- ✅ Know when to use stemming vs lemmatization
- ✅ Build a complete preprocessing pipeline

## 2. Prerequisites

- [Module 00: NLP Pipeline Overview](../00_nlp_pipeline/00_nlp_pipeline_overview.ipynb)
- Basic Python and regex knowledge

## 3. Intuition & Motivation

### Why Preprocessing Matters

> **"Garbage in, garbage out"**

Real-world text is messy:
- HTML tags: `<p>Hello</p>`
- URLs: `https://example.com`
- Emojis: `I love this! 😍`
- Inconsistent casing: `AMAZING`, `amazing`, `Amazing`
- Contractions: `don't`, `won't`, `I'm`

### Impact on Model Performance

| Preprocessing | Accuracy |
|---------------|----------|
| None | 78.2% |
| Basic cleaning | 82.5% |
| + Proper tokenization | 85.1% |
| + Normalization | 86.3% |

*Results vary by task; sometimes less preprocessing is better!*

In [None]:
# Setup
import re
import string
from collections import Counter, defaultdict
from typing import List, Dict, Tuple

# Install if needed (transformers is only used in the Real-World Usage section)
# !pip install nltk spacy beautifulsoup4 transformers

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data needed by newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("Setup complete!")

## 4. Text Cleaning

### 4.1 Lowercasing

In [None]:
def lowercase(text: str) -> str:
    """Convert text to lowercase.
    
    When to use: Most classification tasks
    When NOT to use: 
        - NER ("Apple" company vs "apple" fruit)
        - Sentiment with emphasis ("AMAZING" vs "amazing")
    """
    return text.lower()

# Example
text = "I LOVE This Product! It's AMAZING!"
print(f"Original: {text}")
print(f"Lowercased: {lowercase(text)}")

### 4.2 HTML & URL Removal

In [None]:
from bs4 import BeautifulSoup

def remove_html(text: str) -> str:
    """Remove HTML tags from text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=' ')

def remove_urls(text: str) -> str:
    """Remove URLs from text."""
    url_pattern = r'https?://\S+|www\.\S+'
    return re.sub(url_pattern, '', text)

def remove_emails(text: str) -> str:
    """Remove email addresses."""
    email_pattern = r'\S+@\S+\.\S+'
    return re.sub(email_pattern, '', text)

# Examples
html_text = "<p>Hello <b>World</b></p> Visit https://example.com"
print(f"Original: {html_text}")
print(f"Clean: {remove_urls(remove_html(html_text))}")
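One caveat with the URL pattern above: `\S+` also consumes punctuation glued to the end of a URL, so `Visit https://example.com.` loses its final period. The sketch below shows one possible refinement. The function name `remove_urls_keep_punct` and the set of trailing characters are illustrative assumptions, not part of the module's pipeline.

```python
import re

# Hypothetical variant of remove_urls: same URL pattern, but sentence
# punctuation stuck to the end of the URL is kept in the text.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')
TRAILING_PUNCT = '.,!?;:)'  # assumed set; adjust for your data

def remove_urls_keep_punct(text: str) -> str:
    def _keep_trailing(match):
        url = match.group(0)
        core = url.rstrip(TRAILING_PUNCT)
        # Return only the punctuation that followed the URL itself.
        return url[len(core):]
    return URL_PATTERN.sub(_keep_trailing, text)

print(remove_urls_keep_punct("Docs at https://example.com/guide. See www.example.org!"))
```

A leftover space before the punctuation remains; the whitespace normalization step in a full pipeline does not remove it, so handle it separately if it matters.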

### 4.3 Special Characters & Contractions

In [None]:
# Contraction mapping
CONTRACTIONS = {
    "won't": "will not", "can't": "cannot", "n't": " not",
    "'re": " are", "'s": " is", "'d": " would",
    "'ll": " will", "'ve": " have", "'m": " am"
}

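def expand_contractions(text: str) -> str:
    """Expand contractions in text.

    Limitations: matching is case-sensitive, relies on dict order
    ("won't" must come before "n't"), and "'s" is always read as "is",
    so possessives such as "John's" are expanded incorrectly.
    """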
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text

def remove_special_chars(text: str, keep_punctuation: bool = False) -> str:
    """Remove special characters."""
    if keep_punctuation:
        pattern = r'[^a-zA-Z0-9\s.,!?\'-]'
    else:
        pattern = r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

# Example
text = "I can't believe it! You're amazing... 😍"
print(f"Original: {text}")
print(f"Expanded: {expand_contractions(text)}")
print(f"Cleaned: {remove_special_chars(expand_contractions(text))}")
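The `str.replace` loop above is case-sensitive (`Can't` becomes `Ca not`) and depends on dictionary order. A minimal sketch of a regex-based alternative follows. The names `CONTRACTION_MAP` and `expand_contractions_regex` are hypothetical, the mapping is deliberately small, and expansions are always lowercase, so original capitalization is lost.

```python
import re

# Illustrative mapping (an assumption). "'s" is left out on purpose because
# it is ambiguous: "it's" (it is) vs. "John's" (possessive).
CONTRACTION_MAP = {
    "won't": "will not", "can't": "cannot", "n't": " not",
    "'re": " are", "'ll": " will", "'ve": " have", "'m": " am",
}

# Longest keys first so "won't" is tried before the generic "n't".
_CONTRACTION_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand_contractions_regex(text: str) -> str:
    # Map the typographic apostrophe to the ASCII one before matching.
    text = text.replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions_regex("Can't stop, WON'T stop. They're sure I'm right."))
```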

### 4.4 Complete Cleaning Pipeline

In [None]:
def clean_text(text: str, 
               lower: bool = True,
               remove_html_tags: bool = True,
               remove_url: bool = True,
               expand_contract: bool = True,
               remove_special: bool = True,
               remove_email: bool = True) -> str:
    """Complete text cleaning pipeline.

    Order matters: URLs and emails are removed before special characters,
    otherwise stripping ':', '/' and '@' would leave fragments behind.
    """
    
    if remove_html_tags:
        text = remove_html(text)
    if remove_url:
        text = remove_urls(text)
    if remove_email:
        text = remove_emails(text)
    if expand_contract:
        text = expand_contractions(text)
    if remove_special:
        text = remove_special_chars(text, keep_punctuation=True)
    if lower:
        text = lowercase(text)
    
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

# Test
messy_text = """
<div>Check out https://example.com! 
I can't believe how AMAZING this is... 😍
Contact: test@email.com</div>
"""

print("Original:")
print(messy_text)
print("\nCleaned:")
print(clean_text(messy_text))

## 5. Tokenization

### 5.1 Types of Tokenization

| Type | Example | Use Case |
|------|---------|----------|
| Whitespace | "hello world" → ["hello", "world"] | Simple baseline |
| Word | "don't" → ["do", "n't"] | Traditional NLP |
| Subword (BPE) | "unhappy" → ["un", "happy"] | **Transformers** |
| Character | "hello" → ["h","e","l","l","o"] | Character-level models |

In [None]:
# Basic tokenizers
def whitespace_tokenize(text: str) -> List[str]:
    """Simple whitespace tokenization."""
    return text.split()

def word_tokenize_simple(text: str) -> List[str]:
    """Word tokenization with punctuation handling."""
    # Match runs of word characters or a single punctuation mark;
    # apostrophes are dropped, so "I'm" becomes ["I", "m"]
    return re.findall(r"\b\w+\b|[.,!?;]", text)

# NLTK tokenizer
from nltk.tokenize import word_tokenize as nltk_tokenize

text = "Hello, I'm learning NLP! It's amazing."
print(f"Whitespace: {whitespace_tokenize(text)}")
print(f"Simple: {word_tokenize_simple(text)}")
print(f"NLTK: {nltk_tokenize(text)}")

### 5.2 BPE Tokenizer from Scratch (CRITICAL FOR TRANSFORMERS)

**Byte Pair Encoding (BPE)** is used by GPT, RoBERTa, and many modern models.

**Algorithm:**
1. Start with character-level tokens
2. Count all adjacent pairs
3. Merge most frequent pair
4. Repeat until vocabulary size reached
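To make steps 2 and 3 concrete, here is a hand-sized sketch of a single merge on a toy word-frequency table. The words and counts are made up for illustration, and the helper names `count_pairs` and `merge_pair` are hypothetical. Words are stored as tuples of symbols, which keeps the merge from accidentally matching across symbol boundaries.

```python
from collections import Counter

# Toy word frequencies (made up); each word is a tuple of symbols
# ending in the end-of-word marker.
word_freqs = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "</w>"): 3,
}

def count_pairs(words):
    """Step 2: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, words):
    """Step 3: replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pairs = count_pairs(word_freqs)
best = max(pairs, key=pairs.get)
print(best, pairs[best])
print(merge_pair(best, word_freqs))
```

Here `('w', '</w>')` wins with a count of 8 (5 from "low" plus 3 from "new"), ahead of `('l', 'o')` and `('o', 'w')` at 7 each.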

In [None]:
class BPETokenizer:
    """Byte Pair Encoding tokenizer from scratch."""
    
    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges = {}  # (pair) -> merged_token
        self.vocab = set()
    
    def _get_stats(self, words: Dict[str, int]) -> Dict[Tuple[str, str], int]:
        """Count frequency of adjacent pairs."""
        pairs = defaultdict(int)
        for word, freq in words.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs
    
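    def _merge_vocab(self, pair: Tuple[str, str], words: Dict[str, int]) -> Dict[str, int]:
        """Merge all occurrences of pair in vocabulary."""
        new_words = {}
        replacement = ''.join(pair)
        # Match the pair only at symbol boundaries. A plain str.replace would
        # also merge across symbols, e.g. ('a', 'b') inside "xa b".
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        
        for word, freq in words.items():
            new_word = pattern.sub(lambda m: replacement, word)
            new_words[new_word] = freq
        return new_words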
    
    def train(self, texts: List[str], num_merges: int = 100):
        """Train BPE on corpus."""
        # Initialize: split words into characters
        word_freqs = Counter()
        for text in texts:
            for word in text.lower().split():
                # Add end-of-word marker
                word_freqs[' '.join(list(word)) + ' </w>'] += 1
        
        # Build initial vocab
        for word in word_freqs:
            for char in word.split():
                self.vocab.add(char)
        
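        # Iteratively merge most frequent pairs; stop early if the target
        # vocabulary size is reached
        for i in range(num_merges):
            if len(self.vocab) >= self.vocab_size:
                break
            pairs = self._get_stats(word_freqs)
            if not pairs:
                break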
            
            best_pair = max(pairs, key=pairs.get)
            word_freqs = self._merge_vocab(best_pair, word_freqs)
            
            merged = ''.join(best_pair)
            self.merges[best_pair] = merged
            self.vocab.add(merged)
            
            if i < 5:  # Show first 5 merges
                print(f"Merge {i+1}: {best_pair} -> {merged}")
        
        print(f"\nVocab size: {len(self.vocab)}")
    
    def tokenize(self, text: str) -> List[str]:
        """Tokenize text using learned merges."""
        tokens = []
        for word in text.lower().split():
            word = ' '.join(list(word)) + ' </w>'
            
            # Apply merges in the order they were learned (dicts preserve
            # insertion order), matching pairs only at symbol boundaries
            for pair, merged in self.merges.items():
                pattern = r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)'
                word = re.sub(pattern, lambda m: merged, word)
            
            tokens.extend(word.split())
        return tokens

# Train BPE
corpus = [
    "the cat sat on the mat",
    "the dog ran in the park", 
    "cats and dogs are pets",
    "the quick brown fox"
]

bpe = BPETokenizer()
bpe.train(corpus, num_merges=20)

# Test
print(f"\nTokenized: {bpe.tokenize('the cat is sitting')}")

## 6. Normalization: Stemming vs Lemmatization

| Technique | "running" | "better" | Speed | Quality |
|-----------|-----------|----------|-------|--------|
| Stemming | "run" | "better" | ⚡ Fast | Lower |
| Lemmatization | "run" | "good" | 🐢 Slow | Higher |

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better", "studies", "studying"]

print(f"{'Word':<12} {'Stem':<12} {'Lemma':<12}")
print("-" * 36)
for word in words:
    stem = stemmer.stem(word)
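    # pos='v' treats every word as a verb, so "better" stays "better" here;
    # it only maps to "good" when lemmatized as an adjective (pos='a')
    lemma = lemmatizer.lemmatize(word, pos='v')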
    print(f"{word:<12} {stem:<12} {lemma:<12}")
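The difference between the two approaches can also be seen without NLTK. The sketch below is a toy only: `SUFFIXES` and `LEMMA_DICT` are hypothetical stand-ins for the much larger rule sets and dictionaries that the Porter stemmer and WordNet lemmatizer use, and the outputs will not match theirs exactly.

```python
# Toy illustration: rule-based suffix stripping vs. dictionary lookup.
SUFFIXES = ("ing", "ies", "ed", "s")   # hypothetical rule list, tried in order
LEMMA_DICT = {                          # hypothetical mini dictionary
    "ran": "run", "better": "good", "studies": "study", "running": "run",
}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["running", "ran", "better", "studies"]:
    print(f"{w:<10} stem={toy_stem(w):<8} lemma={toy_lemmatize(w)}")
```

The stemmer is cheap but produces non-words such as `runn` and `stud`, and it cannot relate `ran` to `run`. The lookup handles irregular forms but only for words the dictionary covers.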

## 7. Stopword Removal

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens: List[str], stop_words: set = stop_words) -> List[str]:
    """Remove stopwords from token list.
    
    WARNING: Don't use for sentiment analysis!
    'not good' → 'good' (loses negation)
    """
    return [t for t in tokens if t.lower() not in stop_words]

# Example
tokens = ["this", "is", "not", "a", "good", "movie"]
print(f"Original: {tokens}")
print(f"Without stopwords: {remove_stopwords(tokens)}")
print("\n⚠️ Notice 'not' was removed - bad for sentiment!")
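If stopword removal is still wanted for a sentiment task, one option is to exclude negation words from the stopword set before filtering. This is a minimal sketch: `BASE_STOPWORDS` is a small made-up list standing in for the NLTK set, and `NEGATIONS` and the function name are assumptions.

```python
# Small illustrative stopword list (an assumption); in the notebook this
# would be the NLTK English set loaded above.
BASE_STOPWORDS = {"this", "is", "a", "the", "not", "no", "never", "and"}
NEGATIONS = {"not", "no", "never", "nor"}  # assumed list of words to protect

def remove_stopwords_keep_negation(tokens, stop_words=BASE_STOPWORDS, keep=NEGATIONS):
    """Filter stopwords, but never drop a token listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in effective]

print(remove_stopwords_keep_negation(["this", "is", "not", "a", "good", "movie"]))
```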

## 8. 🔥 Real-World Usage

### Tool Comparison

| Tool | Speed | Features | Best For |
|------|-------|----------|----------|
| **NLTK** | 🐢 | Educational | Learning |
| **spaCy** | ⚡ | Production-ready | Industry |
| **Hugging Face Tokenizers** | 🚀 | Rust-based | Transformers |

### Production Tips

1. **Always preserve original text** alongside processed
2. **Log preprocessing statistics** (avg length, vocab coverage)
3. **Make preprocessing deterministic** (set random seeds)
4. **Handle edge cases**: empty strings, very long texts
5. **For transformers**: use the model's tokenizer, not custom!
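Tips 1, 2 and 4 can be combined in a thin wrapper around whatever cleaning function is in use. The following is a sketch under stated assumptions: `MAX_CHARS`, `ProcessedText`, `safe_preprocess` and the default `clean_fn` are all hypothetical names and values, not part of any library.

```python
from dataclasses import dataclass

MAX_CHARS = 10_000  # hypothetical limit; choose one that fits your model

@dataclass
class ProcessedText:
    original: str    # tip 1: keep the raw input next to the cleaned text
    cleaned: str
    truncated: bool

def safe_preprocess(text, clean_fn=lambda s: " ".join(s.lower().split())):
    """Apply clean_fn defensively; clean_fn stands in for a real pipeline."""
    # Tip 4: empty or missing input yields an empty result instead of an error.
    if text is None or not text.strip():
        return ProcessedText(original=text or "", cleaned="", truncated=False)
    truncated = len(text) > MAX_CHARS
    return ProcessedText(original=text, cleaned=clean_fn(text[:MAX_CHARS]), truncated=truncated)

docs = ["  Hello   WORLD ", "", "x" * 20_000]
results = [safe_preprocess(d) for d in docs]

# Tip 2: log simple statistics about the batch.
non_empty = [r for r in results if r.cleaned]
avg_len = sum(len(r.cleaned) for r in non_empty) / len(non_empty)
print(f"empty: {len(results) - len(non_empty)}, "
      f"truncated: {sum(r.truncated for r in results)}, avg chars: {avg_len:.1f}")
```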

In [None]:
# Hugging Face tokenizers, loaded through the transformers library
# (requires: pip install transformers; downloads the vocabulary on first use)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "I'm learning about tokenization!"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Decoded: {tokenizer.decode(ids)}")

## 9. Common Mistakes & Debugging

### Mistake 1: Over-preprocessing for Transformers
❌ Stemming + stopword removal + lowercasing before BERT

✅ **BERT has its own tokenizer** - just use it!

### Mistake 2: Removing negations for Sentiment
❌ "This is not good" → "good" (after stopword removal)

✅ Keep negation words for sentiment analysis

### Mistake 3: Inconsistent preprocessing
❌ Different preprocessing for train vs test

✅ Use a pipeline that ensures consistency
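One simple way to avoid Mistake 3 is to put every preprocessing choice in a single immutable config object and route both splits through the same function. The sketch below assumes hypothetical names (`PreprocessConfig`, `preprocess`) and only two options to keep it short.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: settings cannot drift between train and test
class PreprocessConfig:
    lower: bool = True
    strip_urls: bool = True

def preprocess(text: str, cfg: PreprocessConfig) -> str:
    """Single entry point used for every split."""
    if cfg.strip_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
    if cfg.lower:
        text = text.lower()
    return ' '.join(text.split())

CFG = PreprocessConfig()
train_texts = ["Great PRODUCT https://example.com", "Not good"]
test_texts = ["GREAT product"]

train_clean = [preprocess(t, CFG) for t in train_texts]
test_clean = [preprocess(t, CFG) for t in test_texts]
print(train_clean, test_clean)
```

Saving the config alongside the trained model makes it possible to reproduce the same preprocessing at inference time.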

## 10. Interview Questions

**Q1: What's the difference between stemming and lemmatization?**
<details><summary>Answer</summary>

- **Stemming**: Rule-based suffix stripping (fast, crude)
- **Lemmatization**: Dictionary-based, returns valid words (slow, accurate)
</details>

**Q2: How does BPE tokenization work?**
<details><summary>Answer</summary>

1. Start with characters
2. Find most frequent adjacent pair
3. Merge into single token
4. Repeat until vocab size reached
</details>

**Q3: Why do transformers use subword tokenization?**
<details><summary>Answer</summary>

- Handles OOV words ("unhappiness" → "un", "happiness")
- Fixed vocabulary size
- Balances character and word level
</details>
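The OOV point can be illustrated with a fixed vocabulary and a greedy longest-match split. Note this is a simplified, WordPiece-style segmentation chosen for brevity, not the BPE merge procedure from Section 5.2. The vocabulary and the function name `greedy_subword_split` are made up for the example.

```python
# Hypothetical fixed vocabulary; real vocabularies are learned from a corpus.
VOCAB = {"un", "happi", "happy", "happiness", "ness", "ly", "the"}
VOCAB |= set("abcdefghijklmnopqrstuvwxyz")  # single letters as a fallback

def greedy_subword_split(word: str, vocab=VOCAB) -> list:
    """Repeatedly take the longest vocabulary entry that prefixes the rest."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # not even the single character is in the vocabulary
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

print(greedy_subword_split("unhappiness"))
print(greedy_subword_split("unhappily"))
```

Neither word is in the vocabulary as a whole, yet both are represented from known pieces, which is why the vocabulary size can stay fixed.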

## 11. Summary

- **Text cleaning**: HTML, URLs, special chars, contractions
- **Tokenization**: Whitespace → Word → Subword (BPE)
- **Normalization**: Stemming (fast) vs Lemmatization (accurate)
- **Stopwords**: Remove for topic modeling, KEEP for sentiment
- **For transformers**: Use model's tokenizer, minimal preprocessing

## 12. Exercises

### Exercise 1: Build a preprocessing pipeline for tweets
Handle: @mentions, #hashtags, emojis, URLs

### Exercise 2: Implement WordPiece tokenizer
Similar to BPE but uses likelihood instead of frequency

### Exercise 3: Compare preprocessing impact
Train sentiment classifier with different preprocessing levels

## 13. References

- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) - BPE Paper
- [HuggingFace Tokenizers](https://huggingface.co/docs/tokenizers)
- [spaCy Documentation](https://spacy.io/usage/linguistic-features)

---

**Next:** [Module 02: Text Representation](../02_text_representation/02_text_representation.ipynb)