## Model Comparison & Method Selection

## Purpose
This notebook documents our experimental process for selecting optimal preprocessing and classification methods. Results from this notebook will inform final implementation decisions in the main analysis notebook.

**Note**: This is an experimental notebook. Only winning approaches will be implemented in the production notebook [sentiment_analysis](/notebooks/sentiment_analysis.ipynb).

1. Examine Dataset for Normalization Decisions
2. spaCy vs NLTK Lemmatization Comparison
3. Spelling Correction Impact Analysis

In [14]:
# Import necessary libraries
from textblob import TextBlob
from collections import Counter
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy
import pandas as pd
import subprocess
import sys
import time

# Load spaCy model; download it silently if it is not installed yet
try:
    nlp = spacy.load('en_core_web_sm')
    print("spaCy model loaded successfully!")
except OSError:
    print("Downloading spaCy model...")
    subprocess.run(
        [sys.executable, "-m", "spacy", "download", "en_core_web_sm"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    nlp = spacy.load('en_core_web_sm')
    print("spaCy model downloaded successfully!")

# Download NLTK resources (silent)
nltk_resources = ['punkt', 'punkt_tab', 'stopwords', 'wordnet', 'omw-1.4']
for resource in nltk_resources:
    nltk.download(resource, quiet=True)
print("NLTK resources ready!")


spaCy model loaded successfully!
NLTK resources ready!


In [15]:
# Load dataset
data = pd.read_csv('../data/customer_sentiment.csv')

## Examine Dataset for Normalization Decisions

In [16]:
# Examine the dataset for normalization decisions
print("\n1. SAMPLE REVIEW TEXTS (first 10):")
for i, review in enumerate(data['review_text'].head(10), 1):
    print(f"\n   {i}. {review[:100]}")

# Platform distribution
print("\n2. PLATFORM DISTRIBUTION:")
print(data['platform'].value_counts())

# Get all words from reviews (lowercased)
all_words = ' '.join(data['review_text'].str.lower()).split()
word_counts = Counter(all_words)

print("\n3. TOP 50 MOST COMMON WORDS IN REVIEWS:")
for word, count in word_counts.most_common(50):
    print(f"   {word.ljust(20)} {count}")

# Check for retail/delivery terms
retail_terms = ['delivery', 'deliver', 'delivered', 'shipping', 'shipped', 'ship',
                'refund', 'return', 'returned', 'money', 'back',
                'order', 'ordered', 'ordering', 'package', 'packaging',
                'quality', 'product', 'service', 'customer']

print("\n4. FREQUENCY OF RETAIL/DELIVERY TERMS:")
for term in retail_terms:
    # Substring match: 'ship' also counts reviews containing 'shipping' or 'shipped'
    count = sum(1 for review in data['review_text'].str.lower() if term in review)
    if count > 0:
        print(f"   {term.ljust(15)} appears in {count} reviews ({count / len(data) * 100:.1f}%)")

# Check platform mentions in text
print("\n5. PLATFORM NAMES MENTIONED IN REVIEWS:")
platforms = data['platform'].unique()
platform_mentions = 0
for platform in platforms:
    count = sum(1 for review in data['review_text'].str.lower() if platform.lower() in review)
    if count > 0:
        print(f"   {platform.ljust(20)}: mentioned {count} times")
        platform_mentions += 1
if platform_mentions == 0:
    print("   No platform names mentioned in reviews")


1. SAMPLE REVIEW TEXTS (first 10):

   1. very disappointed with the quality.

   2. fast delivery and great packaging.

   3. very disappointed with the quality.

   4. product stopped working after few days.

   5. neutral about the quality.

   6. amazing experience, highly recommend!

   7. great value for money.

   8. excellent product! exceeded expectations.

   9. product is okay, nothing special.

   10. great value for money.

2. PLATFORM DISTRIBUTION:
platform
nykaa                   1301
snapdeal                1289
others                  1286
reliance digital        1279
zepto                   1278
facebook marketplace    1272
paytm mall              1271
myntra                  1267
croma                   1266
flipkart                1264
boat                    1257
lenskart                1241
jiomart                 1240
meesho                  1240
ajio                    1234
bigbasket               1230
shopclues               1220
tata cliq               1201
s

> Decision Notes:

* No platform names are mentioned in the review text, so platform normalization is unnecessary.
* Terms are already simple: "delivery", "quality", and "product" are base forms.
* Lemmatization could be considered for words like "delivered" -> "deliver", "ordered" -> "order", and "packaging" -> "package".
* Reviews are short, so complex normalization adds little value.
* Decision: clean + lemmatize + remove stopwords
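
The term check in the cell above counts a review whenever the term appears as a substring, so "deliver" also matches "delivery" and "order" also matches "ordered". To judge how many inflected forms lemmatization would actually merge, a whole-word count may be more informative. The following is a minimal sketch, assuming a hypothetical helper `count_whole_word` and using a few of the sample reviews printed above rather than the full dataset.

```python
import re

def count_whole_word(reviews, term):
    """Count reviews containing `term` as a whole word (hypothetical helper)."""
    pattern = re.compile(rf"\b{re.escape(term)}\b")
    return sum(1 for review in reviews if pattern.search(review.lower()))

# A few sample reviews copied from the output above (not the full dataset)
sample_reviews = [
    "very disappointed with the quality.",
    "fast delivery and great packaging.",
    "product stopped working after few days.",
    "great value for money.",
]

# Compare the notebook's substring count with a whole-word count
for term in ["delivery", "deliver", "packaging", "package", "quality"]:
    substring_hits = sum(1 for r in sample_reviews if term in r.lower())
    whole_word_hits = count_whole_word(sample_reviews, term)
    print(f"{term.ljust(10)} substring={substring_hits} whole_word={whole_word_hits}")
```

On the full data, the same helper could be called with `data['review_text']` in place of `sample_reviews`.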

## spaCy vs NLTK Lemmatization Comparison

In [18]:
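# Initialize tools
# `df` is not defined in the cells shown here; assume it is a working copy of `data`
df = data.copy()
lemmatizer = WordNetLemmatizer()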

# Text Cleaning Function
def clean_text(text):
    """ Clean and Normalize Text Data
     Steps:
      1. Lowercase the text
      2. Remove URLs
      3. Remove email addresses
      4. Remove punctuation
      5. Remove extra whitespace
     """
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\S+@\S+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Tokenization Function
def tokenize_text(text):
    """ Tokenize text into words """
    return word_tokenize(text)

# Stopword Removal Function
def remove_stopwords(tokens):
    """
     Remove common English stopwords, but keep negations.
     Like: not, no, n't, don't, doesn't, didn't, never, none, nobody, nothing, neither, nor
     Negations are crucial for sentiment analysis.
     Note: clean_text() strips punctuation before tokenization, so the
     apostrophe forms listed below will not match text that went through it.
     """
    stop_words = set(stopwords.words('english'))
    negations = {"not", "no", "n't", "don't", "doesn't", "didn't",
                 "never", "none", "nobody", "nothing", "neither", "nor"}
    return [token for token in tokens if token not in stop_words or token in negations]

# Lemmatization Functions
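def lemmatize_tokens_nltk(tokens):
    """ Lemmatize tokens using NLTK's WordNetLemmatizer.
    lemmatize() defaults to pos='n' (noun), so verb forms such as
    "delivered" or "stopped" are left unchanged.
    Example:
        packages -> package
        days -> day
    """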
    return [lemmatizer.lemmatize(token) for token in tokens]

def lemmatize_tokens_spacy(tokens):
    """ Lemmatize tokens using spaCy.
    spaCy tags parts of speech, so it also normalizes verb forms
    that the noun-default NLTK lemmatizer leaves unchanged.
    Example:
      stopped -> stop, working -> work
    """
    text = ' '.join(tokens)
    doc = nlp(text)
    return [token.lemma_ for token in doc]

# Preprocessing Pipeline
def preprocess_pipeline(text, method='nltk'):
    """ Complete Preprocessing Pipeline
     method: 'nltk' or 'spacy' for lemmatization
    """
    text = clean_text(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)

    if method == 'nltk':
        tokens = lemmatize_tokens_nltk(tokens)
    elif method == 'spacy':
        tokens = lemmatize_tokens_spacy(tokens)
    else:
        raise ValueError("Method must be 'nltk' or 'spacy'")

    return ' '.join(tokens)

# Sample Comparison
print("\nPreprocessing Comparison: NLTK vs spaCy (Sample)")
sample_size = 5
for i in range(sample_size):
    original = df.loc[i, 'review_text']
    nltk_result = preprocess_pipeline(original)
    spacy_result = preprocess_pipeline(original, method='spacy')

    print(f"\nSample {i+1}:")
    print(f"Original: {original}")
    print(f"NLTK:     {nltk_result}")
    print(f"spaCy:    {spacy_result}")

# Execution Time Test
print(f"\nExecution Time Comparison - Processing {len(df):,} reviews")

# NLTK timing
print("Testing NLTK...")
start_time = time.time()
df['processed_text_nltk'] = df['review_text'].apply(preprocess_pipeline)
nltk_time = time.time() - start_time
print(f"NLTK Lemmatization: {nltk_time:.2f} seconds")

# spaCy timing
print("Testing spaCy...")
start_time = time.time()
df['processed_text_spacy'] = df['review_text'].apply(lambda x: preprocess_pipeline(x, method='spacy'))
spacy_time = time.time() - start_time
print(f"spaCy Lemmatization: {spacy_time:.2f} seconds")

# Speed comparison
print(f"\nSpeed Comparison:")
print(f"spaCy is {spacy_time/nltk_time:.2f}x slower than NLTK")
print(f"Time difference: {spacy_time - nltk_time:.2f} seconds")


Preprocessing Comparison: NLTK vs spaCy (Sample)

Sample 1:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 2:
Original: fast delivery and great packaging.
NLTK:     fast delivery great packaging
spaCy:    fast delivery great packaging

Sample 3:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 4:
Original: product stopped working after few days.
NLTK:     product stopped working day
spaCy:    product stop work day

Sample 5:
Original: neutral about the quality.
NLTK:     neutral quality
spaCy:    neutral quality

Execution Time Comparison - Processing 25,000 reviews
Testing NLTK...
NLTK Lemmatization: 2.52 seconds
Testing spaCy...
spaCy Lemmatization: 62.05 seconds

Speed Comparison:
spaCy is 24.66x slower than NLTK
Time difference: 59.53 seconds


> spaCy normalizes verbs more thoroughly (for example, "stopped working" -> "stop work"), which should yield a smaller vocabulary. However, it is more aggressive, and in this run it was **~24.66x slower** than NLTK (62.05 s vs 2.52 s). Given our large dataset size **(25,000 reviews)** and the relatively minor differences in lemmatization for our specific use case, we will proceed with NLTK for lemmatization in our final preprocessing pipeline.
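
The vocabulary-size effect mentioned above can be checked directly by counting distinct tokens in each processed column. The snippet below is a minimal sketch, assuming a hypothetical helper `vocabulary_size`; it runs on the five sample outputs shown above, not on the full `processed_text_nltk` and `processed_text_spacy` columns.

```python
def vocabulary_size(texts):
    """Return the number of distinct whitespace-separated tokens (hypothetical helper)."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.split())
    return len(vocabulary)

# Processed outputs copied from the five samples above
nltk_samples = [
    "disappointed quality",
    "fast delivery great packaging",
    "disappointed quality",
    "product stopped working day",
    "neutral quality",
]
spacy_samples = [
    "disappoint quality",
    "fast delivery great packaging",
    "disappoint quality",
    "product stop work day",
    "neutral quality",
]

print(f"NLTK vocabulary size:  {vocabulary_size(nltk_samples)}")
print(f"spaCy vocabulary size: {vocabulary_size(spacy_samples)}")
```

On these five samples both vocabularies contain 11 tokens, because no base form appears alongside its inflected form. A gap would only show up on the full data, for example with `vocabulary_size(df['processed_text_nltk'])` and `vocabulary_size(df['processed_text_spacy'])`.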

## Spelling Correction Impact Analysis

In [19]:
# Function to check if word is misspelled
def is_misspelled(text):
    """ Return the words in text that TextBlob's spelling corrector would change """
    blob = TextBlob(text.lower())
    words = blob.words

    # Count words that might be misspelled
    misspelled_words = []
    for w in words:
        if len(w) > 2:  # Ignore very short words
            corrected_word = TextBlob(w).correct()
            if str(corrected_word).lower() != w.lower():
                misspelled_words.append(w)

    return misspelled_words

# Check misspellings in a sample of reviews
sample_size = 1000
sample_reviews = df['review_text'].sample(n=sample_size, random_state=42)

all_errors = []
reviews_with_errors = 0

print(f"Checking spelling in {sample_size} reviews...")

for review in sample_reviews:
    errors = is_misspelled(review)
    if errors:
        reviews_with_errors += 1
        all_errors.extend(errors)

# Summary statistics
error_rate = (reviews_with_errors / sample_size) * 100
unique_errors = set(all_errors)

print(f"Reviews with potential spelling errors: {reviews_with_errors}/{sample_size} ({error_rate:.1f}%)")
print(f"Total potential errors found: {len(all_errors)}")
print(f"Unique misspelled words: {len(unique_errors)}")

# Display most common misspelled words
error_counts = Counter(all_errors)
print("\nMost Common Potential Misspelled Words:")
for word, count in error_counts.most_common(20):
    print(f"   {word.ljust(15)}: {count} occurrences")
print("\nSpelling check complete!")

Checking spelling in 1000 reviews...
Reviews with potential spelling errors: 236/1000 (23.6%)
Total potential errors found: 236
Unique misspelled words: 2

Most Common Potential Misspelled Words:
   packaging      : 156 occurrences
   unhelpful      : 80 occurrences

Spelling check complete!


> A **23.6%** error rate sounds high, but the flagged "errors" are false positives: both flagged words ("packaging" and "unhelpful") are valid. Spelling correction might introduce noise, and it is slow to run. Additionally, it might flag platform names as misspelled words. Given these factors, we will not include spelling correction in our final preprocessing pipeline.
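
To make the false-positive judgement explicit, the flagged words can be compared against a list of words known to be valid in this domain. This is a minimal sketch, assuming a hand-maintained set (the hypothetical `KNOWN_VALID_WORDS`) and a hypothetical helper `false_positive_share`; the counts are copied from the spelling check output above.

```python
from collections import Counter

# Hypothetical, hand-maintained set of valid domain words (an assumption, not part of the notebook)
KNOWN_VALID_WORDS = {"packaging", "unhelpful"}

# Flagged-word counts copied from the spelling check output above
flagged_counts = Counter({"packaging": 156, "unhelpful": 80})

def false_positive_share(flagged, known_valid):
    """Return the fraction of flagged occurrences that are known-valid words."""
    total = sum(flagged.values())
    if total == 0:
        return 0.0
    false_positives = sum(n for word, n in flagged.items() if word in known_valid)
    return false_positives / total

share = false_positive_share(flagged_counts, KNOWN_VALID_WORDS)
print(f"False positives: {share:.1%} of {sum(flagged_counts.values())} flagged occurrences")
```

In the notebook itself, `error_counts` from the cell above could be passed in place of `flagged_counts`.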