# Data Preprocessing

## Overview

This notebook covers **data preprocessing** for statistical NLP, focusing on the lossy transformation of raw text into clean, normalized tokens suitable for machine learning models. We explore the context in which statistical NLP is used (search, topic discovery, classification), the importance of Exploratory Data Analysis (EDA) before preprocessing, and the various preprocessing methods including cleaning (noise removal) and normalization (standardization). Understanding preprocessing order and trade-offs is crucial for building effective NLP pipelines.


## Learning Objectives

- Recognize the context in which statistical NLP is used
- Learn the function of many lossy preprocessing methods used in statistical NLP
- Recognize why preprocessing order matters and how to determine the correct sequence

## Outline

1. **The Goal of Statistical NLP** - Understanding the "super-fast librarian" approach
2. **Exploratory Data Analysis (EDA) for Text** - Assessing data quality, vocabulary characteristics, and preprocessing needs
3. **Preprocessing Overview** - Why preprocessing is necessary and its two main phases
4. **Cleaning (Noise Removal)** - Removing non-textual elements, URLs, HTML tags, punctuation, stop words
5. **Normalization** - Standardizing text (case, contractions, elongations, diacritics)
6. **Preprocessing Order** - Why order matters and how to determine the correct sequence
7. **Arabic-Specific Preprocessing** - Special considerations for Arabic text
8. **Preprocessing Trade-offs** - Balancing information loss vs. efficiency


## 🎯 The Goal

Think of statistical NLP as a **"super-fast librarian"**. Its goal isn't to understand the emotional nuance of a sentence; its goal is to organize, index, and retrieve documents based on the words they contain.

* **Search:** Finding the one document out of a million that matches your query (e.g., Google Search, Ctrl+F).
* **Topic Discovery:** Scanning huge archives to see what they are talking about (e.g., "70% of these news articles are about 'Sports'").
* **Classification:** Sorting text into predefined buckets based on word statistics (e.g., Spam Filtering: Noticing that emails containing the words "Winner" and "Cash" 50 times are likely junk).

> Example: Routing customer support tickets. If a ticket has the words "refund," "money," and "charge," the math predicts it belongs to the "Billing Department" bucket. If it has "crash," "error," and "screen," it goes to "Tech Support."
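
The routing example above can be sketched as a simple keyword count. This is a minimal illustration, not a production classifier: the `DEPARTMENT_KEYWORDS` table and the `route_ticket` function are hypothetical names introduced here, and real systems learn word weights from labeled data instead of using hand-written lists.

```python
from collections import Counter

# Hypothetical keyword lists for each department (illustrative only).
DEPARTMENT_KEYWORDS = {
    "Billing Department": {"refund", "money", "charge"},
    "Tech Support": {"crash", "error", "screen"},
}

def route_ticket(ticket: str) -> str:
    """Return the department whose keywords occur most often in the ticket."""
    words = Counter(ticket.lower().split())
    scores = {
        dept: sum(words[kw] for kw in keywords)
        for dept, keywords in DEPARTMENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: leave the ticket unrouted.
    return best if scores[best] > 0 else "Unrouted"

print(route_ticket("I want a refund for the extra charge on my card"))
print(route_ticket("The app shows an error and then the screen goes black"))
```

Because the sketch splits on whitespace only, a token such as `refund,` (with a trailing comma) would not match `refund`. Gaps like this are what preprocessing is meant to close.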

## Exploratory Data Analysis (EDA) for Text

Before preprocessing, it's essential to **understand your data**. EDA helps you:

**1. Assess Data Quality**
- **Class distribution**: Are categories balanced or imbalanced? (affects model training)
- **Text length distribution**: Are there unusually long or short texts? (outliers that might need handling)
- **Language detection**: Is the text in the expected language? (mixed languages need special handling)
- **Duplicate detection**: Are there duplicate documents? (can bias training)

**2. Understand Vocabulary Characteristics**
- **Vocabulary size**: How many unique words? (affects vectorization complexity)
- **Word frequency patterns**: Which words are most common? (helps identify stop words)
- **Class-specific patterns**: Do certain words appear more in certain classes? (guides feature selection)

**3. Identify Preprocessing Needs**
- **Noise patterns**: URLs, emails, special characters that need removal
- **Normalization needs**: Case variations, contractions, elongations
- **Outliers**: Very long texts (might be concatenated), very short texts (might be incomplete)

**Why EDA matters:**
- **Informed decisions**: Choose preprocessing steps based on actual data characteristics, not assumptions
- **Quality assurance**: Catch data issues early (wrong language, duplicates, extreme outliers)
- **Baseline understanding**: Know your data before and after preprocessing to validate changes

> **Note**: In the lab, you will practice EDA techniques including class distribution analysis, text length histograms, and vocabulary analysis.
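
As a rough sketch of the checks listed above, the following uses only the standard library on a tiny, made-up dataset. The `dataset` contents and variable names are assumptions for illustration; the lab may use pandas and real data instead.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy dataset of (text, label) pairs.
dataset = [
    ("Win cash now", "spam"),
    ("Win cash now", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Please review the attached report before the meeting", "ham"),
    ("ok", "ham"),
]

texts = [text for text, _ in dataset]

# 1. Data quality: class balance, text lengths (in words), exact duplicates
class_counts = Counter(label for _, label in dataset)
lengths = [len(text.split()) for text in texts]
duplicates = [text for text, n in Counter(texts).items() if n > 1]

# 2. Vocabulary characteristics: unique words and their frequencies
vocabulary = Counter(word for text in texts for word in text.lower().split())

print("Class distribution:", dict(class_counts))
print("Length (min/mean/max):", min(lengths), mean(lengths), max(lengths))
print("Duplicates:", duplicates)
print("Vocabulary size:", len(vocabulary))
print("Most common words:", vocabulary.most_common(3))
```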

---

## Preprocessing

Preprocessing is the critical, intentionally **lossy** process of stripping raw text down to its bare semantic bones (keywords) to prepare it for analysis. It generally consists of two phases: **Cleaning** (removing noise) and **Normalization** (standardizing text).

### Why Preprocess?

Raw data contains variations in case, punctuation, whitespace, and encoding. Preprocessing resolves these to ensure:

* **Accuracy:** Improves matching, comparison, and analysis reliability.
* **Efficiency:** Reduces vocabulary size and computational overhead.
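
A small sketch of the efficiency claim: the same three made-up sentences produce a much smaller vocabulary once case and punctuation are normalized. The `docs` list is a hypothetical example.

```python
import re

# Hypothetical documents that differ only in case, spacing, and punctuation.
docs = [
    "The Cat sat.",
    "the cat   SAT!",
    "A cat sat?",
]

# Vocabulary of the raw text: every surface variant counts as a new word.
raw_vocab = {tok for doc in docs for tok in doc.split()}

# Vocabulary after lowercasing and stripping punctuation.
clean_vocab = {
    tok
    for doc in docs
    for tok in re.sub(r"[^\w\s]", "", doc.lower()).split()
}

print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```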

### The Pipeline: Key Steps

**1. Cleaning (Noise Removal)**
This phase removes non-textual or irrelevant elements to reduce the document to valid tokens.

* Collapse multiple spaces, tabs, and newlines into single spaces.
* Strip URLs, HTML tags, numbers, and special characters.
* Eliminate punctuation (e.g., `.` `?` `!`) and standard "stop words" (filler words like "the", "is", "at") that carry little semantic weight.

**2. Normalization (Canonicalization)**
This phase transforms the remaining text into a single standard form.

* **Case folding:** `"Hello" -> "hello"` (lowercasing, for English)
* **Expansion:** `"don't" -> "do not"`
* **Reduction (stemming/lemmatization):** `"closing", "closed", "closes" -> "close"`
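
The two phases can be sketched with the standard library as follows. This is a minimal illustration under stated assumptions: `STOP_WORDS`, `CONTRACTIONS`, and `BASE_FORMS` are tiny hypothetical tables, and the `BASE_FORMS` lookup stands in for a real stemmer or lemmatizer. Stop word removal is deferred to the end of the second function so that expanded contractions are filtered consistently.

```python
import re

# Hypothetical lookup tables (illustrative only).
STOP_WORDS = {"the", "is", "at", "a", "and"}
CONTRACTIONS = {"don't": "do not", "isn't": "is not"}
BASE_FORMS = {"closing": "close", "closed": "close", "closes": "close"}

def clean(text: str) -> str:
    """Phase 1: remove noise (HTML tags, URLs, digits, punctuation, extra spaces)."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\d+", " ", text)            # numbers
    text = re.sub(r"[^\w\s']", " ", text)       # punctuation, keeping apostrophes
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

def normalize(text: str) -> list[str]:
    """Phase 2: case-fold, expand contractions, reduce words, drop stop words."""
    tokens = []
    for word in text.casefold().split():
        tokens.extend(CONTRACTIONS.get(word, word).split())
    tokens = [BASE_FORMS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]

raw = "<p>The shop isn't   closing at 5! See https://example.com</p>"
print(clean(raw))
print(normalize(clean(raw)))
```

HTML tags are removed before URLs here because the URL pattern would otherwise swallow the closing `</p>` that follows the link.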

### Why Preprocessing Order Matters

**The order of preprocessing steps is critical** because later steps depend on earlier ones. Applying steps in the wrong order can:
- **Lose information**: Removing punctuation before extracting mentions (`@user`) means you can't identify mentions
- **Break patterns**: Lowercasing before removing URLs might break URL detection patterns
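- **Create errors**: Removing stop words before expanding contractions can let contractions slip through: a stop word list that holds only the expanded forms (`"do"`, `"not"`) never matches `"don't"`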

**General order principles:**
1. **Extract structured information first** (URLs, emails, mentions) before removing punctuation
2. **Normalize structure** (whitespace, case) before tokenization
3. **Tokenize** before removing stop words (need word boundaries)
4. **Stem/Lemmatize** after tokenization (operate on individual words)
5. **Remove stop words last** (after all transformations are complete)

**Example of wrong vs. correct order:**
- ❌ Remove punctuation → Remove mentions: `"@user"` becomes `"user"`, so the mention can no longer be detected and survives as an ordinary word
- ✅ Remove mentions → Remove punctuation: `"@user"` → `""`, then punctuation is removed (correct)
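
The effect of the two orders can be checked directly. The helper functions below are hypothetical names written for this sketch; the regular expressions are simple approximations of mention and punctuation patterns.

```python
import re

tweet = "Thanks @user for the help!"

def remove_mentions(text: str) -> str:
    """Delete @handles."""
    return re.sub(r"@\w+", "", text)

def remove_punctuation(text: str) -> str:
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def squeeze(text: str) -> str:
    """Collapse runs of whitespace left behind by the removals."""
    return " ".join(text.split())

wrong = squeeze(remove_mentions(remove_punctuation(tweet)))
right = squeeze(remove_punctuation(remove_mentions(tweet)))

print("Punctuation first:", wrong)  # the handle survives as an ordinary word
print("Mentions first:   ", right)
```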

**Example of correct order:**
1. Remove URLs and emails (they contain punctuation)
2. Remove mentions/handles (they contain `@`)
3. Normalize whitespace
4. Lowercase
5. Tokenize
6. Expand contractions (if needed)
7. Stem/Lemmatize
8. Remove stop words

> **Note**: In the lab, you will see how preprocessing order affects results and learn to design effective preprocessing pipelines.
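
The eight steps above might be chained as in the following sketch. It is one possible arrangement, not a reference implementation: `preprocess`, `CONTRACTIONS`, and `STOP_WORDS` are hypothetical, the regular expressions are simplified, and the `stem` parameter defaults to the identity function so that a real stemmer can be passed in.

```python
import re
from typing import Callable

# Hypothetical, deliberately tiny lookup tables (illustrative only).
CONTRACTIONS = {"don't": ["do", "not"], "it's": ["it", "is"]}
STOP_WORDS = {"i", "the", "a", "it", "is", "do", "to", "for", "me", "at", "or"}

def preprocess(text: str, stem: Callable[[str], str] = lambda w: w) -> list[str]:
    # 1. Remove URLs and emails while their punctuation is still intact.
    text = re.sub(r"https?://\S+|\S+@\S+\.\S+", " ", text)
    # 2. Remove mentions/handles.
    text = re.sub(r"@\w+", " ", text)
    # 3. Normalize whitespace.
    text = " ".join(text.split())
    # 4. Lowercase.
    text = text.lower()
    # 5. Tokenize: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text)
    # 6. Expand contractions (one token may become several).
    tokens = [part for tok in tokens for part in CONTRACTIONS.get(tok, [tok])]
    # 7. Stem/lemmatize (identity unless a stemmer is supplied).
    tokens = [stem(tok) for tok in tokens]
    # 8. Remove stop words last.
    return [tok for tok in tokens if tok not in STOP_WORDS]

ticket = (
    "@support I don't see the refund, "
    "mail me at a.b@example.com or https://example.com/help"
)
print(preprocess(ticket))
```

Note that "not" survives in this sketch only because the hypothetical stop list omits it. Many general-purpose lists include it, which matters for tasks where negation carries meaning.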

In [None]:
# %pip install farasapy==0.1.1 nltk==3.9.2 pandas==2.3.3 pyarabic==0.6.15 qalsadi==0.5.1 --quiet

In [None]:
# Standard library imports
import re
import string

# Third-party imports
import pandas as pd
import nltk

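# NLTK downloads
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize in NLTK 3.9+
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)  # needed by pos_tag in NLTK 3.9+
nltk.download('wordnet', quiet=True)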

# NLTK imports
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.isri import ISRIStemmer
from nltk import pos_tag

# Arabic NLP libraries
from farasa.segmenter import FarasaSegmenter
from farasa.stemmer import FarasaStemmer
from pyarabic import number

In [None]:
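# Initialize Farasa segmenter and stemmer for Arabic text processing
# Note: farasapy wraps the Java-based Farasa toolkit, so a Java runtime must be
# installed; the first call may also download the Farasa binaries
segmenter = FarasaSegmenter()
stemmer = FarasaStemmer()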

In [None]:
# Example 1: Text Normalization - Substitution Approach
# Normalization prefers substitution over removal to preserve information
# This example demonstrates case normalization and whitespace normalization

# Original text with various formatting issues
examples = [
    "  Hello    World  ",
    "HELLO\t\tWORLD\n\n",
    "  hello   world  ",
    "Hello   World"
]

def normalize_text(text):
    """
    Normalize text by substituting variations with canonical forms.
    This preserves information better than removal.
    """
    # Step 1: Case normalization - substitute all cases with lowercase
    normalized = text.lower()
    
    # Step 2: Whitespace normalization - substitute multiple spaces/tabs/newlines with single space
    normalized = re.sub(r'\s+', ' ', normalized)
    
    # Step 3: Remove leading/trailing whitespace
    normalized = normalized.strip()
    
    return normalized

print("Normalization Examples (Substitution Approach):\n")
print("=" * 60)
for i, example in enumerate(examples, 1):
    print(f"\nExample {i}:")
    print(f"Original: '{example}'")
    normalized = normalize_text(example)
    print(f"Normalized: '{normalized}'")
    print("-" * 60)

print("\n" + "=" * 60)
print("Key Takeaway: All variations normalize to the same canonical form!")
print("=" * 60)

In [None]:
# Example 2: Normalization with Substitution (URLs, Numbers, Punctuation)
# Substitution approach: Replace with placeholders or normalized forms instead of removing

# Sample text with URLs, numbers, and punctuation
text = "This is an example sentence with a URL (http://www.example.com) and a number (123)."
print('Original text:', text)

# Step 1: Substitute URLs with a placeholder
# Pattern: https?://[^\s)]+ matches "http://" or "https://" followed by any
# characters other than whitespace or ")", so the closing parenthesis after
# the URL is not swallowed along with it
text_normalized_urls = re.sub(r"https?://[^\s)]+", "[URL]", text)
print('After normalizing URLs:', text_normalized_urls)

# Step 2: Substitute numbers with a placeholder (or convert to words)
# Pattern: \d+ matches one or more digits
# Option 1: Substitute with placeholder
text_normalized_numbers = re.sub(r"\d+", "[NUMBER]", text_normalized_urls)
print('After normalizing numbers (placeholder):', text_normalized_numbers)

# Option 2: Keep numbers but normalize format (e.g., remove leading zeros)
text_normalized_numbers_format = re.sub(r"\b0+(\d+)\b", r"\1", text_normalized_urls)
print('After normalizing number format:', text_normalized_numbers_format)

# Step 3: Normalize punctuation - substitute each run of punctuation with a single space
# Note: this also strips the brackets around the placeholders, leaving the
# plain tokens URL and NUMBER
text_normalized_punct = re.sub(r'[^\w\s]+', ' ', text_normalized_numbers)
print('After normalizing punctuation:', text_normalized_punct)

# Step 4: Clean up extra whitespace (substitute multiple spaces with single space)
final_text = re.sub(r'\s+', ' ', text_normalized_punct).strip()
print('Final normalized text:', final_text)

print("\n" + "="*60)
print("Note: Substitution preserves structure better than removal!")
print("="*60)

In [None]:
# Example 3: Removing stop words
# Stop words are common words that appear frequently but often don't add much semantic meaning
# Examples: "the", "is", "at", "which", "on", etc.

# Important Note: Removing stop words is NOT always beneficial!
# - For tasks like sentiment analysis, stop words might be important (e.g., "not" is a stop word)
# - For tasks like topic modeling or information retrieval, removing stop words can help
# - Always consider your specific use case before removing stop words

# Simple example with a custom stop word list
stop_words = ['is', 'an', 'with', 'a', 'and', 'the', 'to', 'of']
text = "this is is is an example text with a a a lot of stop words that need to be removed"

print('Original text:', text)
print('Number of words before:', len(text.split()))

# Remove stop words
words = text.split()
filtered_words = [word for word in words if word not in stop_words]
filtered_text = ' '.join(filtered_words)

print('After removing stop words:', filtered_text)
print('Number of words after:', len(filtered_words))
print('\nRemoved words:', [w for w in words if w in stop_words])
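
The warning about sentiment analysis can be made concrete with a short sketch. The stop word list below is hypothetical but, like many general-purpose lists, it contains "not"; after filtering, a positive and a negative review become identical.

```python
# Hypothetical stop word list that includes the negation "not".
stop_words = {"this", "is", "not", "a", "the"}

def remove_stop_words(text: str) -> str:
    """Lowercase, split on whitespace, and drop stop words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

positive = remove_stop_words("This movie is good")
negative = remove_stop_words("This movie is not good")

print(positive)
print(negative)
print("Indistinguishable after filtering:", positive == negative)
```

One common workaround, if negation matters for the task, is to remove words such as "not" and "no" from the stop list before filtering.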

In [None]:
# Example 4: Using NLTK's built-in stop word lists
# NLTK provides pre-compiled stop word lists for many languages
# These are more comprehensive than custom lists and are maintained by the community

# English stop words
# NLTK's English stop word list contains 179 common English words
# (the exact count can vary with the installed NLTK data version)
english_stop_words = set(stopwords.words('english'))
print(f"English stop words count: {len(english_stop_words)}")
print(f"Sample English stop words: {sorted(english_stop_words)[:20]}")  # First 20 alphabetically (sets are unordered)

# Arabic stop words
# NLTK's Arabic stop word list contains 701 common Arabic words
# (the exact count can vary with the installed NLTK data version)
arabic_stop_words = set(stopwords.words('arabic'))
print(f"\nArabic stop words count: {len(arabic_stop_words)}")
print(f"Sample Arabic stop words: {sorted(arabic_stop_words)[:20]}")  # First 20 in sorted order

# Example: Using stop words to filter text
sample_text = "The quick brown fox jumps over the lazy dog"
words = sample_text.lower().split()
print(f"\nOriginal text: {sample_text}")
print(f"Words: {words}")

filtered = [w for w in words if w not in english_stop_words]
print(f"After removing stop words: {filtered}")
print(f"Filtered text: {' '.join(filtered)}")

In [None]:
# Example 5: Expanding contractions
# Contractions are shortened forms of words (e.g., "don't" = "do not")
# Expanding contractions can help with:
# - Better word matching (both "don't" and "do not" become the same)
# - More consistent tokenization
# - Better understanding of the actual words used

# Sample text with contractions
sentence_with_contractions = "I ain't going to the store because I ain't got no money."
print('Original text:', sentence_with_contractions)

# Dictionary mapping contractions to their expanded forms
# Note: This is a simplified example. In practice, you might use libraries
# like 'contractions' package which handles edge cases better
# Keys must be lowercase because the lookup below lowercases each word
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he's": "he is",
    "how's": "how is",
    "i'd": "I would",
    "i'll": "I will",
    "i'm": "I am",
    "i've": "I have",
    "isn't": "is not",
    "it's": "it is",
    "let's": "let us",
    "she's": "she is",
    "shouldn't": "should not",
    "that's": "that is",
    "they're": "they are",
    "we're": "we are",
    "won't": "will not",
    "wouldn't": "would not",
    "you're": "you are",
    "you've": "you have"
}

# Expand contractions word by word
words = sentence_with_contractions.split()
expanded_words = []
for word in words:
    # Remove punctuation from word for lookup, but preserve it
    word_clean = word.rstrip('.,!?;:')
    if word_clean.lower() in contractions:
        expanded = contractions[word_clean.lower()]
        # Preserve original punctuation
        if word != word_clean:
            expanded += word[len(word_clean):]
        expanded_words.append(expanded)
    else:
        expanded_words.append(word)

expanded_text = ' '.join(expanded_words)
print('After expanding contractions:', expanded_text)

# Note: This lookup has a limitation - every "ain't" expands to "am not",
# so "I ain't got" becomes "I am not got" instead of "I have not got".
# More sophisticated approaches choose the expansion from context.
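
An alternative to the word-by-word loop is a single regular expression built from the dictionary keys. The sketch below is an assumption-laden variant, not the notebook's method: `CONTRACTIONS` is a shortened hypothetical table and `expand_contractions` is a name introduced here. It also converts the typographic apostrophe (U+2019) to a straight one before matching.

```python
import re

# Shortened hypothetical table; keys are lowercase.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "I am"}

# One alternation of all keys, matched case-insensitively on word boundaries.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Expand known contractions; surrounding punctuation is left untouched."""
    text = text.replace("\u2019", "'")  # typographic apostrophe to straight
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I\u2019m sure you can't, but DON'T worry."))
```

Because the match is made on the whole text, trailing punctuation does not need to be stripped and re-attached. The original capitalization of the contraction is not preserved.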

## Segmentation and Stemming

**Segmentation**: splits a word into its morphological parts, which helps identify prefixes, suffixes, and stems.

**Stemming**: reduces a word to a base form: a stem or, for root-based stemmers, the root.

Both are essential for Arabic NLP tasks such as:

- Search engines
- Topic modeling
- Text classification
- Machine translation
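
To give a feel for what segmentation produces, here is a deliberately naive sketch that peels a few common proclitics off the front of a word. It is a toy under loud assumptions: the `PROCLITICS` list is hypothetical and far from complete, suffixes are ignored, and the length check is a crude guard. Real segmenters such as Farasa rely on trained models, and this toy will mis-split many ordinary words.

```python
# Hypothetical, incomplete clitic list: conjunction "wa", preposition "bi", article "al".
PROCLITICS = ["و", "ب", "ال"]

def toy_segment(word: str, min_stem: int = 3) -> str:
    """Strip leading clitics in a fixed order, joining the parts with '+'."""
    parts = []
    for clitic in PROCLITICS:
        # Strip a clitic only if enough letters remain to form a plausible stem.
        if word.startswith(clitic) and len(word) - len(clitic) >= min_stem:
            parts.append(clitic)
            word = word[len(clitic):]
    return "+".join(parts + [word])

print(toy_segment("الكتاب"))    # the book
print(toy_segment("بالمدرسة"))  # at the school
print(toy_segment("بيت"))       # house: too short to strip anything
```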

### Arabic Text Normalization with PyArabic

[**PyArabic**](https://github.com/linuxscout/pyarabic) is a comprehensive library for Arabic text preprocessing and normalization. It provides many features for handling Arabic text's unique characteristics.

**Key Features:**
- تصنيف الحروف (Character classification)
- تفريق النص إلى وحدات (Text segmentation into sentences or words)
- حذف الحركات (Removing diacritics)
- تنميط الحروف (Character normalization - unifying forms like lam-alef ligatures and hamzas)
- تحويل الأعداد إلى كلمات (Converting numbers to words)
- And many more...

Let's see PyArabic in action with practical examples:

In [None]:
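# Convert numbers to Arabic words
numbers = [123, 4567, 1000000]
print("Converting numbers to Arabic words:")
for num in numbers:
    try:
        # PyArabic number module converts numbers to words
        arabic_words = number.number2text(num)
        print(f"  {num} → {arabic_words}")
    except Exception as e:
        print(f"  {num} → (conversion not available: {e})")

# Convert the digits embedded in a sentence to words
text_with_numbers = "لدي 5 كتب و 10 أقلام"  # "I have 5 books and 10 pens"
print(f"\nOriginal text with numbers: {text_with_numbers}")
# Assumes number2text accepts an int, as in the loop above
text_in_words = re.sub(r"\d+", lambda m: number.number2text(int(m.group())), text_with_numbers)
print(f"Text with numbers as words: {text_in_words}")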
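
Diacritic removal, elongation handling, and letter normalization can also be sketched with the standard library alone. The function below is a stand-in written for illustration, not PyArabic's implementation; `normalize_arabic` and the constant names are hypothetical, and the character ranges cover only the most common cases.

```python
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")   # tanween, short vowels, shadda, sukun
TATWEEL = "\u0640"                                    # elongation character (kashida)
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")   # alef with madda or hamza

def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = ARABIC_DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub("\u0627", text)

for word in ["كَتَبَ", "جمـــيل", "أحمد"]:
    print(word, "→", normalize_arabic(word))
```

Unifying alef variants is itself lossy: it merges spellings that some tasks may need to keep apart, so apply it only when the use case tolerates that.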

### Arabic Segmentation and Stemming

[**FarasaPy**](https://github.com/MagedSaeed/farasapy) is an Arabic NLP toolkit serving the following tasks:

- Segmentation (تقسيم الكلمة إلى أجزاء)
- Stemming (استخراج الجذر)
- Named Entity Recognition (NER)
- Part Of Speech tagging (POS tagging) (وسم الجزء من الكلام)
- Diacritization (تشكيل الكلمات)

In [None]:
print("=" * 70)
print("Arabic Word Segmentation (تقسيم الكلمة)")
print("=" * 70)
print("Segmentation splits Arabic words into their morphological components.")
print("This is crucial because Arabic words often combine multiple morphemes.\n")

# Sample Arabic words and sentences
arabic_words = [
    "الكتاب",           # The book
    "بالمدرسة",         # At the school
    "يذهبون",           # They go
    "كتبتها",           # I wrote it (feminine)
]

print("Word Segmentation Examples:")
for word in arabic_words:
    segmented = segmenter.segment(word)
    print(f"  '{word}' → {segmented}")

In [None]:
print("\n" + "=" * 70)
print("Arabic Stemming (استخراج الجذر)")
print("=" * 70)
print("Stemming reduces Arabic words to a base form.")
print("Farasa returns a stem; root-based stemmers such as ISRI go further,")
print("toward the (typically 3-letter) root (جذر) that conveys core meaning.\n")

# Words to stem (all derived from the root ك-ت-ب, "to write")
words_to_stem = [
    "كاتب",      # writer
    "مكتبة",     # library
    "يكتب",      # he writes
    "كتب",       # books / he wrote
    "الكتابة",   # the writing
]

print("Stemming Examples:")
for word in words_to_stem:
    stem = stemmer.stem(word)
    print(f"  '{word}' → {stem}")

In [None]:
print("\n" + "=" * 70)
print("Combining Segmentation and Stemming")
print("=" * 70)

# Combined workflow
text = "الطلاب يدرسون في المكتبة"  # "The students study in the library"
print(f"Original text: '{text}'")

# Segment the sentence into morphological parts
segmented = segmenter.segment(text)
print(f"After segmentation: {segmented}")

# Stem the original sentence (stemming runs on the raw text, not on the segmented output)
stemmed = stemmer.stem(text)
print(f"After stemming: {stemmed}")

## Key Takeaways

- **Statistical NLP** serves as a "super-fast librarian" for search, topic discovery, and classification tasks, focusing on organizing and retrieving documents based on word statistics.

- **Exploratory Data Analysis (EDA)** is essential before preprocessing to:
  - Assess data quality (class distribution, text length, language, duplicates)
  - Understand vocabulary characteristics (size, frequency patterns, class-specific patterns)
  - Identify preprocessing needs (noise patterns, normalization requirements)

- **Preprocessing is intentionally lossy** - it strips text down to semantic keywords, consisting of two phases:
  - **Cleaning**: Removing noise (URLs, HTML tags, punctuation, stop words)
  - **Normalization**: Standardizing text (case, contractions, elongations, diacritics)

- **Preprocessing order matters critically**:
  1. Extract structured information first (URLs, emails, mentions)
  2. Normalize structure (whitespace, case)
  3. Tokenize
  4. Expand contractions (if needed)
  5. Stem/Lemmatize
  6. Remove stop words last

- Wrong preprocessing order can lose information, break patterns, or create errors.

- **Arabic text** requires special preprocessing considerations including diacritic removal, elongation handling, and Arabic-specific stemming/lemmatization tools (Farasa, PyArabic, Qalsadi).

- Preprocessing involves **trade-offs** between information retention and computational efficiency - choose steps based on your specific use case and data characteristics.

---

## References

- [NLTK Documentation](https://www.nltk.org/)
- [PyArabic Documentation](https://github.com/linuxscout/pyarabic)
- [FarasaPy Documentation](https://github.com/MagedSaeed/farasapy)

### Corpora & Data
- [Brown Corpus Overview](https://en.wikipedia.org/wiki/Brown_Corpus)
- [Reuters-21578 Dataset](https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html)
- [Common Crawl](https://commoncrawl.org/)
- [Arabic Wikipedia Dumps](https://dumps.wikimedia.org/arwiki/)