# Preprocessing Pipeline

This notebook demonstrates the preprocessing pipeline for sentiment analysis.

## Objectives
- Test text cleaning functions
- Compare tokenization methods
- Analyze preprocessing effects
- Create vocabulary for baseline models


In [1]:
import sys
import os
# Make the project root (the parent of the notebook directory) importable
sys.path.append(os.path.dirname(os.getcwd()))

import pandas as pd
import numpy as np
from pathlib import Path

from src.data.preprocess import clean_text, tokenize_texts, create_vocabulary
from src.data.dataset_loader import IMDBDataLoader, YelpDataLoader, SST2Loader, create_train_test_split
from src.utils.seed_everything import seed_everything

# Set seed for reproducibility
seed_everything(42)
print("✅ Seed set to 42 for reproducible data splitting")




✅ Seed set to 42 for reproducible data splitting


## 1. Text Cleaning

### Preprocessing Steps:

The `clean_text()` function performs the following operations:

1. **Remove HTML Tags**: Removes all HTML/XML tags (e.g., `<br>`, `<div>`, `<p>`)
   - Pattern: `r'<[^>]+>'`
   - Example: `"Hello<br>World"` → `"HelloWorld"`

2. **Remove URLs**: Removes web URLs (http/https and www links)
   - Pattern: `r'http\S+|www\.\S+'`
   - Example: `"Check http://example.com"` → `"Check "`

3. **Remove Special Characters**: Removes every character except:
   - Word characters matched by `\w`: letters, digits, and the underscore (Unicode letters included, unless the `re.ASCII` flag is set)
   - Basic punctuation: `. , ! ?`
   - Whitespace
   - Pattern: `r'[^\w\s.,!?]'`
   - Example: `"@user #tag $price"` → `"user tag price"`

4. **Normalize Whitespace**: Replaces each run of whitespace (spaces, tabs, newlines) with a single space
   - Pattern: `r'\s+'`
   - Example: `"Hello    World"` → `"Hello World"`

5. **Strip**: Removes leading and trailing whitespace

### What Gets Removed:
- ❌ HTML/XML tags (`<br>`, `<div>`, etc.)
- ❌ URLs (`http://...`, `www....`)
- ❌ Special symbols (`@`, `#`, `$`, `%`, `&`, `*`, `()`, `[]`, `{}`, etc.), including apostrophes and quotation marks, so `you'll` becomes `youll`
- ❌ Extra whitespace (multiple spaces/tabs/newlines)

### What Gets Kept:
- ✅ Alphanumeric characters
- ✅ Basic punctuation (period, comma, exclamation, question mark)
- ✅ Words and numbers
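
The five steps above can be sketched as a single function. This is a minimal sketch that assumes each pattern is applied with `re.sub` in the listed order and that matches are replaced with an empty string (a single space for whitespace runs). The real `clean_text` in `src.data.preprocess` is not shown in this notebook and may differ; the name `clean_text_sketch` is hypothetical.

```python
import re

# Patterns copied from the step list above.
HTML_TAG = re.compile(r'<[^>]+>')
URL = re.compile(r'http\S+|www\.\S+')
SPECIAL = re.compile(r'[^\w\s.,!?]')
WHITESPACE = re.compile(r'\s+')


def clean_text_sketch(text: str) -> str:
    """Hypothetical re-implementation of the documented cleaning steps."""
    text = HTML_TAG.sub('', text)      # 1. drop HTML/XML tags
    text = URL.sub('', text)           # 2. drop http(s) and www links
    text = SPECIAL.sub('', text)       # 3. drop anything outside \w, \s and . , ! ?
    text = WHITESPACE.sub(' ', text)   # 4. collapse each whitespace run to one space
    return text.strip()                # 5. trim leading/trailing whitespace


sample = "This movie was <br>AWESOME!!! http://example.com  Check it out @movie #excited"
print(clean_text_sketch(sample))
```

Because whitespace is normalized after the removals, the gaps left behind by deleted URLs and symbols collapse into single spaces.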


In [2]:
# Sample dirty text
dirty_text = "This movie was <br>AWESOME!!! http://example.com  Check it out @movie #excited"

print("Original text:")
print(dirty_text)
print("\nCleaned text:")
print(clean_text(dirty_text))


Original text:
This movie was <br>AWESOME!!! http://example.com  Check it out @movie #excited

Cleaned text:
This movie was AWESOME!!! Check it out movie excited


In [3]:
# Check clean_text against hand-written expected outputs
examples = [
    ("This movie <br>is great! http://example.com @movie #excited",
     "This movie is great! movie excited"),
    ("AWESOME!!!  Multiple    spaces   here.",
     "AWESOME!!! Multiple spaces here."),
    ("Check $price & quality @store #sale",
     "Check price quality store sale"),
]

print("Preprocessing Examples:\n")
for original, expected in examples:
    cleaned = clean_text(original)
    # Labels are added at print time so they are not passed through clean_text
    print(f"Original: {original}")
    print(f"Cleaned:  {cleaned}")
    print(f"Matches expected: {cleaned == expected}\n")


Preprocessing Examples:

Original: This movie <br>is great! http://example.com @movie #excited
Cleaned:  This movie is great! movie excited
Matches expected: True

Original: AWESOME!!!  Multiple    spaces   here.
Cleaned:  AWESOME!!! Multiple spaces here.
Matches expected: True

Original: Check $price & quality @store #sale
Cleaned:  Check price quality store sale
Matches expected: True



### 1.1 IMDB Dataset


In [4]:
# Load and clean IMDB data
imdb_loader = IMDBDataLoader('../IMDB Dataset.csv')
imdb_texts, imdb_labels = imdb_loader.load(binary=True)

# Clean a sample
imdb_sample = imdb_texts[:100]
imdb_cleaned = [clean_text(text) for text in imdb_sample]

print(f"IMDB Dataset:")
print(f"Original avg length: {np.mean([len(t.split()) for t in imdb_sample]):.1f} words")
print(f"Cleaned avg length: {np.mean([len(t.split()) for t in imdb_cleaned]):.1f} words")
print(f"Length reduction: {np.mean([len(t.split()) for t in imdb_sample]) - np.mean([len(t.split()) for t in imdb_cleaned]):.1f} words")


IMDB Dataset:
Original avg length: 231.3 words
Cleaned avg length: 225.9 words
Length reduction: 5.4 words


### 1.2 SST-2 Dataset


In [5]:
# Load and clean SST-2 data
sst2_path = Path('../archive (5)/SST2-Data/SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank')
sst2_loader = SST2Loader(sst2_path)
sst2_train_texts, sst2_train_labels, _, _, _, _ = sst2_loader.load()

# Clean a sample
sst2_sample = sst2_train_texts[:500]  # Larger sample than IMDB/Yelp because SST-2 sentences are much shorter
sst2_cleaned = [clean_text(text) for text in sst2_sample]

print(f"SST-2 Dataset:")
print(f"Original avg length: {np.mean([len(t.split()) for t in sst2_sample]):.1f} words")
print(f"Cleaned avg length: {np.mean([len(t.split()) for t in sst2_cleaned]):.1f} words")
print(f"Length reduction: {np.mean([len(t.split()) for t in sst2_sample]) - np.mean([len(t.split()) for t in sst2_cleaned]):.1f} words")


SST-2 Dataset:
Original avg length: 19.9 words
Cleaned avg length: 19.6 words
Length reduction: 0.3 words


### 1.3 Yelp Dataset


In [6]:
# Load and clean Yelp data
yelp_loader = YelpDataLoader('../archive (7)/yelp_academic_dataset_review.json')
yelp_texts, yelp_labels = yelp_loader.load(sample_size=5000, binary=True)

# Clean a sample
yelp_sample = yelp_texts[:100]
yelp_cleaned = [clean_text(text) for text in yelp_sample]

print(f"Yelp Dataset:")
print(f"Original avg length: {np.mean([len(t.split()) for t in yelp_sample]):.1f} words")
print(f"Cleaned avg length: {np.mean([len(t.split()) for t in yelp_cleaned]):.1f} words")
print(f"Length reduction: {np.mean([len(t.split()) for t in yelp_sample]) - np.mean([len(t.split()) for t in yelp_cleaned]):.1f} words")


Yelp Dataset:
Original avg length: 88.8 words
Cleaned avg length: 88.6 words
Length reduction: 0.3 words


### 1.4 Preprocessing Comparison Across Datasets


In [7]:
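# Compare preprocessing effects across datasets, using the first 100 texts of each.
# SST-2 figures therefore differ from section 1.2, which averaged over 500 sentences.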
comparison_data = {
    'Dataset': ['IMDB', 'SST-2', 'Yelp'],
    'Original Avg Length': [
        np.mean([len(t.split()) for t in imdb_sample]),
        np.mean([len(t.split()) for t in sst2_sample[:100]]),
        np.mean([len(t.split()) for t in yelp_sample])
    ],
    'Cleaned Avg Length': [
        np.mean([len(t.split()) for t in imdb_cleaned]),
        np.mean([len(t.split()) for t in sst2_cleaned[:100]]),
        np.mean([len(t.split()) for t in yelp_cleaned])
    ],
    'Reduction': [
        np.mean([len(t.split()) for t in imdb_sample]) - np.mean([len(t.split()) for t in imdb_cleaned]),
        np.mean([len(t.split()) for t in sst2_sample[:100]]) - np.mean([len(t.split()) for t in sst2_cleaned[:100]]),
        np.mean([len(t.split()) for t in yelp_sample]) - np.mean([len(t.split()) for t in yelp_cleaned])
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df['Reduction %'] = (comparison_df['Reduction'] / comparison_df['Original Avg Length'] * 100).round(2)

print("Preprocessing Effects Comparison:\n")
print(comparison_df.to_string(index=False))


Preprocessing Effects Comparison:

Dataset  Original Avg Length  Cleaned Avg Length  Reduction  Reduction %
   IMDB               231.31              225.86       5.45         2.36
  SST-2                21.43               20.70       0.73         3.41
   Yelp                88.85               88.57       0.28         0.32
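
The same word-count arithmetic is repeated for every dataset above. It could be factored into one helper, sketched here with the standard library only. The function name `length_stats` and its dictionary keys are hypothetical and do not exist in the project.

```python
from statistics import mean


def length_stats(original, cleaned):
    """Average whitespace-delimited word counts before and after cleaning."""
    original_avg = mean(len(t.split()) for t in original)
    cleaned_avg = mean(len(t.split()) for t in cleaned)
    reduction = original_avg - cleaned_avg
    return {
        "original_avg": original_avg,
        "cleaned_avg": cleaned_avg,
        "reduction": reduction,
        # Guard against division by zero when every text is empty
        "reduction_pct": 100 * reduction / original_avg if original_avg else 0.0,
    }


# Toy input: the first text loses one token ("@") during cleaning
stats = length_stats(["a b @ c", "d e"], ["a b c", "d e"])
print(stats)
```

Only tokens that disappear entirely lower the count: a symbol attached to a word (such as `@movie`) is stripped without changing the number of words.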


In [8]:
# Show sample before/after for each dataset
print("Sample preprocessing examples:\n")

datasets = [
    ("IMDB", imdb_sample[0]),
    ("SST-2", sst2_sample[0]),
    ("Yelp", yelp_sample[0])
]

for name, original in datasets:
    cleaned = clean_text(original)
    print(f"\n{name}:")
    print(f"  Original: {original[:100]}...")
    print(f"  Cleaned:  {cleaned[:100]}...")


Sample preprocessing examples:


IMDB:
  Original: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The...
  Cleaned:  One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked. They...

SST-2:
  Original: The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash...
  Cleaned:  The Rock is destined to be the 21st Century s new Conan and that he s going to make a splash even gr...

Yelp:
  Original: If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We...
  Cleaned:  If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We...


## 2. Save Preprocessed Data

Save preprocessed datasets for use in subsequent notebooks.


In [9]:
# Save preprocessed IMDB data (with train/val/test split)
# pandas and Path are already imported in the first cell

intermediate_dir = Path('../intermediate/data')
intermediate_dir.mkdir(parents=True, exist_ok=True)

# Load full IMDB dataset
imdb_loader = IMDBDataLoader('../IMDB Dataset.csv')
imdb_texts, imdb_labels = imdb_loader.load(binary=True)

# Clean all texts
print("Cleaning IMDB dataset...")
imdb_cleaned_full = [clean_text(text) for text in imdb_texts]

# Split into 70% train / 10% val / 20% test (random_state=42 is passed explicitly, matching the seed_everything seed)
print("Splitting IMDB into train/val/test sets...")
imdb_train_texts, imdb_val_texts, imdb_test_texts, \
imdb_train_labels, imdb_val_labels, imdb_test_labels = create_train_test_split(
    imdb_cleaned_full, imdb_labels, test_size=0.2, val_size=0.1, random_state=42
)

# Save as separate CSV files
imdb_train_df = pd.DataFrame({'text': imdb_train_texts, 'label': imdb_train_labels})
imdb_val_df = pd.DataFrame({'text': imdb_val_texts, 'label': imdb_val_labels})
imdb_test_df = pd.DataFrame({'text': imdb_test_texts, 'label': imdb_test_labels})

imdb_train_df.to_csv(intermediate_dir / 'imdb_train_preprocessed.csv', index=False)
imdb_val_df.to_csv(intermediate_dir / 'imdb_val_preprocessed.csv', index=False)
imdb_test_df.to_csv(intermediate_dir / 'imdb_test_preprocessed.csv', index=False)

print(f"Saved IMDB: Train={len(imdb_train_texts)}, Val={len(imdb_val_texts)}, Test={len(imdb_test_texts)}")


Cleaning IMDB dataset...
Splitting IMDB into train/val/test sets...
Saved IMDB: Train=35000, Val=5000, Test=10000
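
The sizes above imply that `test_size` and `val_size` are both fractions of the full dataset (20% and 10% of 50,000). The implementation of `create_train_test_split` is not shown in this notebook, so the following is only a hypothetical stand-in that reproduces the same proportions and return order with a seeded shuffle. It uses the standard library and does not stratify by label, which the real function may do.

```python
import random


def split_sketch(texts, labels, test_size=0.2, val_size=0.1, random_state=42):
    """Hypothetical stand-in: seeded shuffle, then slice into test/val/train."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")

    indices = list(range(len(texts)))
    random.Random(random_state).shuffle(indices)  # deterministic for a given seed

    n_test = round(len(indices) * test_size)
    n_val = round(len(indices) * val_size)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as the call in the cell above
    return (pick(texts, train_idx), pick(texts, val_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, val_idx), pick(labels, test_idx))


demo_texts = [f"review {i}" for i in range(100)]
demo_labels = [i % 2 for i in range(100)]
tr_x, va_x, te_x, tr_y, va_y, te_y = split_sketch(demo_texts, demo_labels)
print(len(tr_x), len(va_x), len(te_x))
```

Fixing `random_state` makes the partition repeatable across runs, so later notebooks can rely on the saved CSV files staying consistent with a re-run of this cell.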


In [10]:
# Save preprocessed SST-2 data
sst2_path = Path('../archive (5)/SST2-Data/SST2-Data/stanfordSentimentTreebank/stanfordSentimentTreebank')
sst2_loader = SST2Loader(sst2_path)
sst2_train_texts, sst2_train_labels, sst2_val_texts, sst2_val_labels, sst2_test_texts, sst2_test_labels = sst2_loader.load()

# Clean all texts
print("Cleaning SST-2 dataset...")
sst2_train_cleaned = [clean_text(text) for text in sst2_train_texts]
sst2_val_cleaned = [clean_text(text) for text in sst2_val_texts]
sst2_test_cleaned = [clean_text(text) for text in sst2_test_texts]

# Save as CSV
train_df = pd.DataFrame({'text': sst2_train_cleaned, 'label': sst2_train_labels})
val_df = pd.DataFrame({'text': sst2_val_cleaned, 'label': sst2_val_labels})
test_df = pd.DataFrame({'text': sst2_test_cleaned, 'label': sst2_test_labels})

train_df.to_csv(intermediate_dir / 'sst2_train_preprocessed.csv', index=False)
val_df.to_csv(intermediate_dir / 'sst2_val_preprocessed.csv', index=False)
test_df.to_csv(intermediate_dir / 'sst2_test_preprocessed.csv', index=False)

print(f"Saved SST-2: Train={len(sst2_train_cleaned)}, Val={len(sst2_val_cleaned)}, Test={len(sst2_test_cleaned)}")


Cleaning SST-2 dataset...
Saved SST-2: Train=8544, Val=1101, Test=2210


In [11]:
# Save preprocessed Yelp data (with train/val/test split)
yelp_loader = YelpDataLoader('../archive (7)/yelp_academic_dataset_review.json')
yelp_texts, yelp_labels = yelp_loader.load(sample_size=50000, binary=True)

# Clean all texts
print("Cleaning Yelp dataset...")
yelp_cleaned_full = [clean_text(text) for text in yelp_texts]

# Split into 70% train / 10% val / 20% test (random_state=42 is passed explicitly, matching the seed_everything seed)
print("Splitting Yelp into train/val/test sets...")
yelp_train_texts, yelp_val_texts, yelp_test_texts, \
yelp_train_labels, yelp_val_labels, yelp_test_labels = create_train_test_split(
    yelp_cleaned_full, yelp_labels, test_size=0.2, val_size=0.1, random_state=42
)

# Save as separate CSV files
yelp_train_df = pd.DataFrame({'text': yelp_train_texts, 'label': yelp_train_labels})
yelp_val_df = pd.DataFrame({'text': yelp_val_texts, 'label': yelp_val_labels})
yelp_test_df = pd.DataFrame({'text': yelp_test_texts, 'label': yelp_test_labels})

yelp_train_df.to_csv(intermediate_dir / 'yelp_train_preprocessed.csv', index=False)
yelp_val_df.to_csv(intermediate_dir / 'yelp_val_preprocessed.csv', index=False)
yelp_test_df.to_csv(intermediate_dir / 'yelp_test_preprocessed.csv', index=False)

print(f"Saved Yelp: Train={len(yelp_train_texts)}, Val={len(yelp_val_texts)}, Test={len(yelp_test_texts)}")

print(f"\n✅ All preprocessed data saved to {intermediate_dir}")


Cleaning Yelp dataset...
Splitting Yelp into train/val/test sets...
Saved Yelp: Train=35000, Val=5000, Test=10000

✅ All preprocessed data saved to ../intermediate/data
