## Model Comparison & Method Selection

## Purpose
This notebook documents our experimental process for selecting the best-performing preprocessing methods and classification models. Results from this notebook will inform final implementation decisions in the main analysis notebook.

**Note**: This is an experimental notebook. Only winning approaches will be implemented in the production notebook [sentiment_analysis](/notebooks/sentiment_analysis.ipynb).

In [5]:
# Examine the dataset to inform the normalization decision
import pandas as pd
from collections import Counter

# Load dataset
df_check = pd.read_csv('../data/customer_sentiment.csv')

# Sample review texts
print("\n1. SAMPLE REVIEW TEXTS (first 10):")
for i, review in enumerate(df_check['review_text'].head(10), 1):
    print(f"\n   {i}. {review}")

# Platform distribution
print("\n2. PLATFORM DISTRIBUTION:")
print(df_check['platform'].value_counts())

# Get all words from reviews (lowercased)
all_words = ' '.join(df_check['review_text'].str.lower()).split()
word_counts = Counter(all_words)

print("\n3. TOP 30 MOST COMMON WORDS IN REVIEWS:")
for word, count in word_counts.most_common(30):
    print(f"   {word.ljust(20)} {count}")

# Check for retail/delivery terms
retail_terms = ['delivery', 'deliver', 'delivered', 'shipping', 'shipped', 'ship',
                'refund', 'return', 'returned', 'money', 'back',
                'order', 'ordered', 'ordering', 'package', 'packaging',
                'quality', 'product', 'service', 'customer']

print("\n4. FREQUENCY OF RETAIL/DELIVERY TERMS:")
for term in retail_terms:
    count = sum(1 for review in df_check['review_text'].str.lower() if term in review)
    if count > 0:
        print(f"   {term.ljust(15)} appears in {count} reviews ({count / len(df_check) * 100:.1f}%)")

# 5. Check platform mentions in text
print("\n5. PLATFORM NAMES MENTIONED IN REVIEWS:")
platforms = df_check['platform'].unique()
any_mentioned = False
for platform in platforms:  # Check every platform name
    count = sum(1 for review in df_check['review_text'].str.lower() if platform.lower() in review)
    if count > 0:
        any_mentioned = True
        print(f"   {platform.ljust(20)}: mentioned in {count} reviews")
# Only report "not mentioned" when no platform name was found in any review
if not any_mentioned:
    print("   not mentioned")

# Conclusion
# No platform names are mentioned in the text, so platform normalization is unnecessary.
# Most terms are already in base form: "delivery", "quality", "product".
# Lemmatization could still be considered for words like "delivered" -> "deliver",
# "ordered" -> "order", "packing"/"packaging" -> "package".
# Reviews are short, so complex normalization adds little value.
# Decision: clean + lemmatize + remove stopwords


1. SAMPLE REVIEW TEXTS (first 10):

   1. very disappointed with the quality.

   2. fast delivery and great packaging.

   3. very disappointed with the quality.

   4. product stopped working after few days.

   5. neutral about the quality.

   6. amazing experience, highly recommend!

   7. great value for money.

   8. excellent product! exceeded expectations.

   9. product is okay, nothing special.

   10. great value for money.

2. PLATFORM DISTRIBUTION:
platform
nykaa                   1301
snapdeal                1289
others                  1286
reliance digital        1279
zepto                   1278
facebook marketplace    1272
paytm mall              1271
myntra                  1267
croma                   1266
flipkart                1264
boat                    1257
lenskart                1241
jiomart                 1240
meesho                  1240
ajio                    1234
bigbasket               1230
shopclues               1220
tata cliq               1201
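
The term check in step 4 uses substring matching (`term in review`), so a short term such as `ship` is also counted in reviews that only contain `shipping` or `shipped`. A minimal sketch of whole-word counting, assuming a hypothetical helper `count_reviews_with_term` and a made-up mini-sample rather than the real dataset:

```python
import re

def count_reviews_with_term(reviews, term):
    """Count reviews containing `term` as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    return sum(1 for review in reviews if pattern.search(review))

# Hypothetical mini-sample for illustration, not taken from the dataset
sample_reviews = [
    "Fast shipping and great packaging.",
    "The ship date kept changing.",
    "Item was shipped late.",
]

# Substring matching counts all three reviews for "ship"
substring_hits = sum(1 for r in sample_reviews if "ship" in r.lower())
# Whole-word matching counts only the review that contains the word "ship"
whole_word_hits = count_reviews_with_term(sample_reviews, "ship")

print(substring_hits, whole_word_hits)  # 3 1
```

Whether the overlap matters depends on how the counts are used; for a rough frequency check like this one, substring matching may be good enough.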

## spaCy vs NLTK Lemmatization Comparison

In [12]:
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy
import pandas as pd
import subprocess
import sys
import time

# Load the spaCy model, downloading it silently if it is not installed yet
try:
    nlp = spacy.load('en_core_web_sm')
    print("spaCy model loaded successfully!")
except OSError:
    print("Downloading spaCy model...")
    subprocess.run(
        [sys.executable, "-m", "spacy", "download", "en_core_web_sm"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    nlp = spacy.load('en_core_web_sm')
    print("spaCy model downloaded successfully!")

# Download NLTK resources (silent)
nltk_resources = ['punkt', 'punkt_tab', 'stopwords', 'wordnet', 'omw-1.4']
for resource in nltk_resources:
    nltk.download(resource, quiet=True)
print("NLTK resources ready!")

# Load dataset
df = pd.read_csv('../data/customer_sentiment.csv')
print(f"Dataset loaded: {len(df):,} reviews")

# Initialize tools
lemmatizer = WordNetLemmatizer()

# Text Cleaning Function
def clean_text(text):
    """ Clean and Normalize Text Data
     Steps:
      1. Lowercase the text
      2. Remove URLs
      3. Remove email addresses
      4. Remove punctuation
      5. Remove extra whitespace
     """
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # 'http\S+' already covers https URLs
    text = re.sub(r'\S+@\S+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Tokenization Function
def tokenize_text(text):
    """ Tokenize text into words """
    return word_tokenize(text)

# Stopword Removal Function
def remove_stopwords(tokens):
    """
     Remove common English stopwords, but keep negations
     (not, no, never, none, nobody, nothing, neither, nor, and the
     contractions n't, don't, doesn't, didn't).
     Negations are crucial for sentiment analysis.
     Note: in preprocess_pipeline, clean_text strips apostrophes first, so
     contractions arrive here without them (e.g. "dont").
     """
    stop_words = set(stopwords.words('english'))
    negations = {"not", "no", "n't", "don't", "doesn't", "didn't",
                 "never", "none", "nobody", "nothing", "neither", "nor"}
    return [word for word in tokens if word not in stop_words or word in negations]

# Lemmatization Functions
def lemmatize_tokens_nltk(tokens):
    """ Lemmatize text using NLTK's WordNetLemmatizer
    Called without a POS tag, so every token is treated as a noun.
    Example:
        packages -> package
        delivered -> delivered (verb forms are left unchanged)
    """
    return [lemmatizer.lemmatize(token) for token in tokens]

def lemmatize_tokens_spacy(tokens):
    """ Lemmatize text using spaCy
    Example:
      running -> run, better -> good
      (POS-aware; NLTK's default noun-mode lemmatizer may not handle these well)
    """
    text = ' '.join(tokens)
    doc = nlp(text)
    return [token.lemma_ for token in doc]

# Preprocessing Pipeline
def preprocess_pipeline(text, method='nltk'):
    """ Complete Preprocessing Pipeline
     method: 'nltk' or 'spacy' for lemmatization
    """
    text = clean_text(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)

    if method == 'nltk':
        tokens = lemmatize_tokens_nltk(tokens)
    elif method == 'spacy':
        tokens = lemmatize_tokens_spacy(tokens)
    else:
        raise ValueError("Method must be 'nltk' or 'spacy'")

    return ' '.join(tokens)

# Sample Comparison
print("\nPreprocessing Comparison: NLTK vs spaCy (Sample)")
sample_size = 5
for i in range(sample_size):
    original = df.loc[i, 'review_text']
    nltk_result = preprocess_pipeline(original)
    spacy_result = preprocess_pipeline(original, method='spacy')

    print(f"\nSample {i+1}:")
    print(f"Original: {original}")
    print(f"NLTK:     {nltk_result}")
    print(f"spaCy:    {spacy_result}")

# Execution Time Test
print(f"\nExecution Time Comparison - Processing {len(df):,} reviews")

# NLTK timing
print("Testing NLTK...")
start_time = time.time()
df['processed_text_nltk'] = df['review_text'].apply(preprocess_pipeline)
nltk_time = time.time() - start_time
print(f"NLTK Lemmatization: {nltk_time:.2f} seconds")

# spaCy timing
print("Testing spaCy...")
start_time = time.time()
df['processed_text_spacy'] = df['review_text'].apply(lambda x: preprocess_pipeline(x, method='spacy'))
spacy_time = time.time() - start_time
print(f"spaCy Lemmatization: {spacy_time:.2f} seconds")

# Speed comparison
print("\nSpeed Comparison:")
print(f"spaCy is {spacy_time/nltk_time:.2f}x slower than NLTK")
print(f"Time difference: {spacy_time - nltk_time:.2f} seconds")

spaCy model loaded successfully!
NLTK resources ready!
Dataset loaded: 25,000 reviews

Preprocessing Comparison: NLTK vs spaCy (Sample)

Sample 1:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 2:
Original: fast delivery and great packaging.
NLTK:     fast delivery great packaging
spaCy:    fast delivery great packaging

Sample 3:
Original: very disappointed with the quality.
NLTK:     disappointed quality
spaCy:    disappoint quality

Sample 4:
Original: product stopped working after few days.
NLTK:     product stopped working day
spaCy:    product stop work day

Sample 5:
Original: neutral about the quality.
NLTK:     neutral quality
spaCy:    neutral quality

Execution Time Comparison - Processing 25,000 reviews
Testing NLTK...
NLTK Lemmatization: 3.76 seconds
Testing spaCy...
spaCy Lemmatization: 85.54 seconds

Speed Comparison:
spaCy is 22.72x slower than NLTK
Time difference: 81.78 seconds
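
Each timing above comes from a single `time.time()` measurement, which can be noisy. A minimal sketch of a repeatable alternative using `time.perf_counter`, assuming a hypothetical helper `best_of` and stand-in workloads rather than the real pipeline:

```python
import time

def best_of(func, data, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` passes of func over data."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for item in data:
            func(item)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workloads (hypothetical); in the notebook these would be the
# NLTK and spaCy variants of preprocess_pipeline
texts = ["very disappointed with the quality."] * 1000
fast_time = best_of(str.lower, texts)
slow_time = best_of(lambda t: " ".join(sorted(t.split())), texts)

print(f"fast: {fast_time:.4f}s, slow: {slow_time:.4f}s")
```

Taking the minimum over several passes reduces the influence of one-off delays such as lazy resource loading on the first call.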


>We observe that spaCy provides better verb normalization (for example, "stopped working" becomes "stop work"), which yields a smaller vocabulary. However, it is also more aggressive (for example, "disappointed" becomes "disappoint") and 22.72x slower than NLTK on this dataset. Given our large dataset size **(25,000 reviews)** and the relatively minor differences in lemmatization for our specific use case, we will proceed with NLTK for lemmatization in our final preprocessing pipeline.
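
The vocabulary-size effect mentioned above can be checked directly. A minimal sketch, assuming a hypothetical helper `vocabulary` and made-up processed strings rather than the real columns:

```python
def vocabulary(processed_texts):
    """Return the set of unique whitespace-separated tokens across all texts."""
    return {token for text in processed_texts for token in text.split()}

# Hypothetical processed outputs for illustration (not from the dataset)
nltk_like = ["product stopped working", "product stop working", "stopped work"]
spacy_like = ["product stop work", "product stop work", "stop work"]

nltk_vocab = vocabulary(nltk_like)
spacy_vocab = vocabulary(spacy_like)

print(len(nltk_vocab), len(spacy_vocab))  # 5 3
print(sorted(nltk_vocab - spacy_vocab))   # ['stopped', 'working']
```

In this notebook the same helper could be applied to `df['processed_text_nltk']` and `df['processed_text_spacy']`, since iterating a pandas Series yields its values.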