## Model Comparison & Method Selection

## Purpose
This notebook documents our experimental process for selecting preprocessing and classification methods. Results from this notebook will inform the final implementation decisions in the main analysis notebook.

**Note**: This is an experimental notebook. Only winning approaches will be implemented in the production notebook [sentiment_analysis](/notebooks/sentiment_analysis.ipynb).

1. Examine Dataset for Normalization Decisions
2. spaCy vs NLTK Lemmatization Comparison
3. Spelling Correction Impact Analysis

## Examine Dataset for Normalization Decisions

In [5]:
# Examine the dataset to inform normalization decisions
import pandas as pd
from collections import Counter

# Load dataset
df_check = pd.read_csv('../data/customer_sentiment.csv')

# Sample review texts
print("\n1. SAMPLE REVIEW TEXTS (first 10):")
for i, review in enumerate(df_check['review_text'].head(10), 1):
    print(f"\n   {i}. {review}")

# Platform distribution
print("\n2. PLATFORM DISTRIBUTION:")
print(df_check['platform'].value_counts())

# Get all words from reviews (lowercased)
all_words = ' '.join(df_check['review_text'].str.lower()).split()
word_counts = Counter(all_words)

top_n = 30  # Keep the heading and the number of printed words in sync
print(f"\n3. TOP {top_n} MOST COMMON WORDS IN REVIEWS:")
for word, count in word_counts.most_common(top_n):
    print(f"   {word.ljust(20)} {count}")

# Check for retail/delivery terms
retail_terms = ['delivery', 'deliver', 'delivered', 'shipping', 'shipped', 'ship',
                'refund', 'return', 'returned', 'money', 'back',
                'order', 'ordered', 'ordering', 'package', 'packaging',
                'quality', 'product', 'service', 'customer']

print("\n4. FREQUENCY OF RETAIL/DELIVERY TERMS:")
for term in retail_terms:
    count = sum(1 for review in df_check['review_text'].str.lower() if term in review)
    if count > 0:
        print(f"   {term.ljust(15)} appears in {count} reviews ({count / len(df_check) * 100:.1f}%)")

# 5. Check platform mentions in text
print("\n5. PLATFORM NAMES MENTIONED IN REVIEWS:")
platforms = df_check['platform'].unique()
reviews_lower = df_check['review_text'].str.lower()
any_mentioned = False
for platform in platforms:  # Check every platform
    count = sum(1 for review in reviews_lower if platform.lower() in review)
    if count > 0:
        any_mentioned = True
        print(f"   {platform.ljust(20)}: mentioned in {count} reviews")
if not any_mentioned:
    print("   not mentioned")


1. SAMPLE REVIEW TEXTS (first 10):

   1. very disappointed with the quality.

   2. fast delivery and great packaging.

   3. very disappointed with the quality.

   4. product stopped working after few days.

   5. neutral about the quality.

   6. amazing experience, highly recommend!

   7. great value for money.

   8. excellent product! exceeded expectations.

   9. product is okay, nothing special.

   10. great value for money.

2. PLATFORM DISTRIBUTION:
platform
nykaa                   1301
snapdeal                1289
others                  1286
reliance digital        1279
zepto                   1278
facebook marketplace    1272
paytm mall              1271
myntra                  1267
croma                   1266
flipkart                1264
boat                    1257
lenskart                1241
jiomart                 1240
meesho                  1240
ajio                    1234
bigbasket               1230
shopclues               1220
tata cliq               1201
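The retail-term check in the cell above uses substring matching (`term in review`), so a short term such as "ship" is also counted inside "shipping" and "shipped". Below is a minimal sketch of a whole-word alternative, assuming regular-expression word boundaries are adequate for these short reviews. The helper name `count_reviews_with_term` and the toy reviews are hypothetical and not part of the original notebook.

```python
import re

def count_reviews_with_term(reviews, term, whole_word=True):
    """Count reviews containing `term`, as a whole word or as a substring."""
    if whole_word:
        # \b anchors the match to word boundaries; re.escape guards special characters
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        return sum(1 for review in reviews if pattern.search(review))
    return sum(1 for review in reviews if term.lower() in review.lower())

# Toy reviews, made up for illustration only
toy_reviews = [
    "fast shipping and great packaging.",
    "the order shipped late.",
    "will not ship to my city.",
]

print(count_reviews_with_term(toy_reviews, "ship", whole_word=False))  # substring count
print(count_reviews_with_term(toy_reviews, "ship", whole_word=True))   # whole-word count
```

With substring matching, the per-term percentages overlap ("ship", "shipped", and "shipping" all count the same reviews), which is worth keeping in mind when reading the frequency table.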

> Decision Notes:

* No platform names are mentioned in the review text, so platform normalization is unnecessary.
* Most terms are already in base form ("delivery", "quality", "product").
* Lemmatization is still worth considering for inflected forms such as "delivered" -> "deliver", "ordered" -> "order", and "packaging" -> "package".
* Reviews are short, so complex normalization adds little value.
* Decision: clean + lemmatize + remove stopwords.
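The decision above (clean + lemmatize + remove stopwords) can be sketched as a single function. This is an illustrative sketch only, not the production pipeline: the stopword set is a deliberately tiny hypothetical list, and the lemmatizer is passed in as a callable so that a real one (for example, the `lemmatize` method of NLTK's `WordNetLemmatizer`) could be plugged in. Here a toy lookup table stands in for it.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use
# a full list such as NLTK's English stopwords.
TOY_STOPWORDS = {"the", "and", "with", "very", "about", "after", "few", "is", "for"}

def preprocess(text, lemmatize=lambda word: word, stopwords=TOY_STOPWORDS):
    """Clean -> tokenize -> drop stopwords -> lemmatize; returns one string."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace only
    tokens = cleaned.split()
    kept = [lemmatize(token) for token in tokens if token not in stopwords]
    return " ".join(kept)

# Toy lemma lookup standing in for a real lemmatizer (assumption for illustration)
toy_lemmas = {"delivered": "deliver", "ordered": "order", "days": "day"}

print(preprocess("Product stopped working after few days.",
                 lemmatize=lambda word: toy_lemmas.get(word, word)))
```

Passing the lemmatizer as an argument keeps the cleaning and stopword steps identical when two lemmatizers are compared, so any difference in output comes from the lemmatizer alone.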

## spaCy vs NLTK Lemmatization Comparison

In [12]:
# Spelling Quality Check
from textblob import TextBlob
import numpy as np

print("Spelling Quality Analysis")
print(f"Sampling from {len(df)} reviews...\n")

# Function to check if word is likely misspelled
def check_spelling_errors(text):
    """Check for potential spelling errors in text."""
    blob = TextBlob(text.lower())
    words = blob.words

    # Count words that might be misspelled
    # TextBlob suggests corrections for words it thinks are wrong
    misspelled = []
    for word in words:
        if len(word) > 2:  # Skip very short words
            corrected = TextBlob(word).correct()
            if str(corrected) != word:
                misspelled.append((word, str(corrected)))

    return misspelled

# Analyze a sample of reviews
sample_size = 100
sample_reviews = df['review_text'].sample(sample_size, random_state=42)

all_errors = []
reviews_with_errors = 0

for review in sample_reviews:
    errors = check_spelling_errors(review)
    if errors:
        reviews_with_errors += 1
        all_errors.extend(errors)

# Calculate statistics
error_rate = (reviews_with_errors / sample_size) * 100

print(f"Reviews with potential spelling errors: {reviews_with_errors}/{sample_size} ({error_rate:.1f}%)")
print(f"Total potential errors found: {len(all_errors)}")

# Show most common errors
if all_errors:
    from collections import Counter
    common_errors = Counter(all_errors).most_common(10)

    print(f"\nMost Common Potential Errors:")
    for (wrong, correct), count in common_errors:
        print(f"  '{wrong}' -> '{correct}' ({count} times)")

# Show some example reviews with errors
print("\nExample Reviews with Potential Spelling Issues:")
count = 0
for review in sample_reviews:
    if count >= 3:
        break  # Stop once three examples have been shown
    errors = check_spelling_errors(review)
    if errors:
        print(f"\n{count+1}. Original: {review}")
        print(f"   Errors found: {errors}")
        count += 1

print("\nSpelling check complete!")

spaCy model loaded successfully!
NLTK resources ready!
Dataset loaded: 25,000 reviews

Preprocessing Comparison: NLTK vs spaCy (Sample)

Sample 1:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 2:
Original: fast delivery and great packaging.
NLTK:     fast delivery great packaging
spaCy:    fast delivery great packaging

Sample 3:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 4:
Original: product stopped working after few days.
NLTK:     product stopped working day
spaCy:    product stop work day

Sample 5:
Original: neutral about the quality.
NLTK:     neutral quality
spaCy:    neutral quality

Execution Time Comparison - Processing 25,000 reviews
Testing NLTK...
NLTK Lemmatization: 3.76 seconds
Testing spaCy...
spaCy Lemmatization: 85.54 seconds

Speed Comparison:
spaCy is 22.72x slower than NLTK
Time difference: 81.78 seconds


> spaCy normalizes verbs more thoroughly (for example, "stopped working" becomes "stop work"), which tends to produce a smaller vocabulary. However, it is also more aggressive (it reduces "disappointed" to "disappoint"), and it was 22.72x slower than NLTK on this run (85.54 seconds versus 3.76 seconds). Given our dataset size **(25,000 reviews)** and the relatively minor differences in lemmatization for our specific use case, we will proceed with NLTK for lemmatization in our final preprocessing pipeline.
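A comparison like the one above needs two pieces: a timer that runs a lemmatizing function over all texts, and a diff that shows where the two outputs disagree. The sketch below is a hedged reconstruction, not the cell that produced the output; `time_lemmatizer` and `differing_samples` are hypothetical helper names, and the sample strings are copied from the printed comparison.

```python
import time

def time_lemmatizer(lemmatize_fn, texts):
    """Run lemmatize_fn over every text; return (outputs, elapsed_seconds)."""
    start = time.perf_counter()
    outputs = [lemmatize_fn(text) for text in texts]
    elapsed = time.perf_counter() - start
    return outputs, elapsed

def differing_samples(originals, outputs_a, outputs_b):
    """Return (original, a, b) triples where the two outputs disagree."""
    return [(o, a, b) for o, a, b in zip(originals, outputs_a, outputs_b) if a != b]

# Sample texts and outputs copied from the comparison output above
originals = [
    "very disappointed with the quality.",
    "fast delivery and great packaging.",
    "very disappointed with the quality.",
    "product stopped working after few days.",
    "neutral about the quality.",
]
nltk_out = ["disappointed quality", "fast delivery great packaging",
            "disappointed quality", "product stopped working day", "neutral quality"]
spacy_out = ["disappoint quality", "fast delivery great packaging",
             "disappoint quality", "product stop work day", "neutral quality"]

for original, a, b in differing_samples(originals, nltk_out, spacy_out):
    print(f"{original!r}: NLTK={a!r} | spaCy={b!r}")

# Timing a stand-in callable; a real run would pass the NLTK or spaCy wrapper instead
_, seconds = time_lemmatizer(str.lower, originals)
print(f"Stand-in run took {seconds:.6f} seconds")
```

Only the samples containing inflected verbs or participles differ, which is consistent with the note that the lemmatization differences are minor for this dataset.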

## Spelling Correction Impact Analysis

In [7]:
from textblob import TextBlob
from collections import Counter
import pandas as pd

df = pd.read_csv('../data/customer_sentiment.csv')

# Function to check if word is misspelled
def is_misspelled(text):
    """ Check for misspelled words in text """
    blob = TextBlob(text.lower())
    words = blob.words

    # Count words that might be misspelled
    misspelled_words = []
    for w in words:
        if len(w) > 2:  # Ignore very short words
            corrected_word = TextBlob(w).correct()
            if str(corrected_word).lower() != w.lower():
                misspelled_words.append(w)

    return misspelled_words

# Check misspellings in a sample of reviews
sample_size = 1000
sample_reviews = df['review_text'].sample(n=sample_size, random_state=42)

all_errors = []
reviews_with_errors = 0

print(f"Checking spelling in {sample_size} reviews...")

for review in sample_reviews:
    errors = is_misspelled(review)
    if errors:
        reviews_with_errors += 1
        all_errors.extend(errors)

# Summary statistics
error_rate = (reviews_with_errors / sample_size) * 100
unique_errors = set(all_errors)

print(f"Reviews with potential spelling errors: {reviews_with_errors}/{sample_size} ({error_rate:.1f}%)")
print(f"Total potential errors found: {len(all_errors)}")
print(f"Unique misspelled words: {len(unique_errors)}")

# Display most common misspelled words
error_counts = Counter(all_errors)
print("\nMost Common Potential Misspelled Words:")
for word, count in error_counts.most_common(20):
    print(f"   {word.ljust(15)}: {count} occurrences")

print("\nSpelling check complete!")

Checking spelling in 1000 reviews...
Reviews with potential spelling errors: 236/1000 (23.6%)
Total potential errors found: 236
Unique misspelled words: 2

Most Common Potential Misspelled Words:
   packaging      : 156 occurrences
   unhelpful      : 80 occurrences

Spelling check complete!


> A 23.6% error rate sounds high, but the flagged "errors" are false positives: both flagged words ("packaging" and "unhelpful") are valid English. Spelling correction could therefore introduce noise, and it is slow to run. It might also flag platform names as misspellings. Given these factors, we will not include spelling correction in our final preprocessing pipeline.
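The false-positive argument can be made explicit by checking the flagged words against a list of words known to be valid. The sketch below assumes such an allowlist is maintained by hand; `false_positive_share` and `known_valid` are hypothetical names, and the counts are copied from the spelling-check output above.

```python
from collections import Counter

def false_positive_share(flagged_counts, known_valid_words):
    """Fraction of flagged occurrences whose word is in the known-valid set."""
    total = sum(flagged_counts.values())
    if total == 0:
        return 0.0  # Nothing was flagged, so there are no false positives
    valid = sum(n for word, n in flagged_counts.items() if word in known_valid_words)
    return valid / total

# Counts copied from the spelling-check output above
flagged = Counter({"packaging": 156, "unhelpful": 80})

# Hypothetical hand-maintained allowlist of valid domain words
known_valid = {"packaging", "unhelpful"}

share = false_positive_share(flagged, known_valid)
print(f"{share:.0%} of flagged occurrences are valid words")
```

The same allowlist could be extended with platform names if spelling correction is ever revisited, so that brand names are not rewritten.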