# Amazon Product Review Sentiment Analysis

This notebook demonstrates sentiment analysis on real Amazon product reviews using a Neural Bag of Words model with proper validation techniques.

## Overview
- **Objective**: Classify Amazon product reviews as positive or negative
- **Dataset**: Real Amazon product reviews (2,000 balanced samples from 25,000 total)
- **Model**: Neural Bag of Words (NBoW) with dropout and regularization
- **Performance**: 79.8% accuracy with proper train/validation/test splits
- **Framework**: PyTorch with modern NLP techniques

## Table of Contents
1. [Setup and Data Loading](#setup)
2. [Amazon Dataset Integration](#amazon)
3. [Data Preprocessing](#preprocessing)
4. [Model Architecture](#model)
5. [Training with Early Stopping](#training)
6. [Model Evaluation](#evaluation)
7. [Results and Analysis](#results)

## 1. Setup and Data Loading {#setup}

Import necessary libraries and setup the environment for Amazon product review sentiment analysis.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Import our custom modules
import sys
import os
sys.path.append('..')

try:
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    print(f"PyTorch version: {torch.__version__}")
    print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")
except ImportError:
    print("PyTorch not installed. Please run: pip install -r requirements.txt")

# Import our modules
from utils.data_loader import ReviewDataset, CustomDataLoader
from utils.preprocessing import TextPreprocessor, VocabularyBuilder
from utils.training import Trainer
from utils.visualization import *
from models.nbow import NBoW
from models.lstm import LSTMModel
from models.cnn import CNNModel
from models.transformer import TransformerModel

PyTorch version: 2.7.1+cpu
Device: CPU


In [None]:
# Import necessary modules from our utils
from utils.preprocessing import TextPreprocessor, VocabularyBuilder
from utils.training import Trainer
from models.nbow import NBoW

# Import sklearn for data splitting and metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

print("✅ All imports successful!")
print(f"Custom modules imported: TextPreprocessor, VocabularyBuilder, Trainer, NBoW")

# Check if data directory exists
data_path = '../data/sample_reviews.csv'
if os.path.exists(data_path):
    print(f"✅ Data file found: {data_path}")
else:
    print(f"⚠️ Data file not found: {data_path}")
    print("We'll load Amazon dataset directly from your provided file.")

Data loaded successfully: 50 reviews
Columns: ['review_text', 'sentiment']


Unnamed: 0,review_text,sentiment
0,This product is amazing! I love it so much.,positive
1,"Terrible quality, waste of money.",negative
2,"Great value for money, highly recommend.",positive
3,"Poor customer service, disappointed.",negative
4,"Excellent product, exceeded my expectations.",positive


## Real Amazon Dataset Integration 📦

Using actual Amazon product reviews dataset to replace synthetic data and get realistic performance metrics.

In [24]:
# Load the real Amazon product reviews dataset
import pandas as pd

print("🔄 LOADING REAL AMAZON DATASET")
print("=" * 50)

# Load the dataset
amazon_df = pd.read_csv(r'c:\Users\moinu\Downloads\Amazon-Product-Reviews-Sentiment-Analysis-in-Python-Dataset.csv')

print(f"Dataset shape: {amazon_df.shape}")
print(f"Columns: {list(amazon_df.columns)}")

# Display basic info
print(f"\nDataset info:")
print(f"Total reviews: {len(amazon_df)}")
print(f"Unique sentiments: {amazon_df['Sentiment'].unique()}")
print(f"Sentiment distribution:")
print(amazon_df['Sentiment'].value_counts())

# Show sample reviews
print(f"\nSample reviews:")
for i in range(3):
    review = amazon_df.iloc[i]
    sentiment_label = "Positive" if review['Sentiment'] == 0 else "Negative"
    print(f"\n{i+1}. [{sentiment_label}] {review['Review'][:100]}...")

# Check for missing values
print(f"\nMissing values:")
print(amazon_df.isnull().sum())

# Basic statistics
print(f"\nReview length statistics:")
amazon_df['review_length'] = amazon_df['Review'].str.len()
print(amazon_df['review_length'].describe())

🔄 LOADING REAL AMAZON DATASET
Dataset shape: (25000, 2)
Columns: ['Review', 'Sentiment']

Dataset info:
Total reviews: 25000
Unique sentiments: [1 2 3 4 5]
Sentiment distribution:
Sentiment
1    5000
2    5000
3    5000
4    5000
5    5000
Name: count, dtype: int64

Sample reviews:

1. [Negative] Fast shipping but this product is very cheaply made I brought this for my grandchild so her IPod wou...

2. [Negative] This case takes so long to ship and it's not even worth it DONT BUY!!!!...

3. [Negative] Good for not droids. Not good for iPhones. You cannot use all the features of the watch if you have ...

Missing values:
Review       1
Sentiment    0
dtype: int64

Review length statistics:
count    24999.000000
mean       369.464419
std        538.745651
min          1.000000
25%        121.000000
50%        210.000000
75%        420.000000
max      15829.000000
Name: review_length, dtype: float64


In [25]:
# Convert to binary classification and preprocess
print("\n🔧 PREPROCESSING AMAZON DATASET")
print("=" * 50)

# Remove rows with missing values
amazon_df_clean = amazon_df.dropna().copy()
print(f"After removing missing values: {len(amazon_df_clean)} reviews")

# Convert 5-star ratings to binary (1-2 = negative, 4-5 = positive, skip 3 = neutral)
def convert_to_binary(rating):
    if rating in [1, 2]:
        return 0  # Negative
    elif rating in [4, 5]:
        return 1  # Positive
    else:
        return None  # Neutral (we'll remove these)

amazon_df_clean['binary_sentiment'] = amazon_df_clean['Sentiment'].apply(convert_to_binary)

# Remove neutral reviews (rating 3)
amazon_df_binary = amazon_df_clean.dropna(subset=['binary_sentiment']).copy()
amazon_df_binary['binary_sentiment'] = amazon_df_binary['binary_sentiment'].astype(int)

print(f"After converting to binary: {len(amazon_df_binary)} reviews")
print(f"Binary sentiment distribution:")
print(amazon_df_binary['binary_sentiment'].value_counts())
print(f"Balance: {amazon_df_binary['binary_sentiment'].value_counts(normalize=True)}")

# Take a balanced subset for training (to avoid memory issues and class imbalance)
# Let's use 2000 samples total (1000 positive, 1000 negative)
n_samples_per_class = 1000

positive_samples = amazon_df_binary[amazon_df_binary['binary_sentiment'] == 1].sample(n=n_samples_per_class, random_state=42)
negative_samples = amazon_df_binary[amazon_df_binary['binary_sentiment'] == 0].sample(n=n_samples_per_class, random_state=42)

# Combine and shuffle
amazon_balanced = pd.concat([positive_samples, negative_samples]).sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nBalanced dataset: {len(amazon_balanced)} reviews")
print(f"Final distribution:")
print(amazon_balanced['binary_sentiment'].value_counts())

# Extract texts and labels
amazon_texts = amazon_balanced['Review'].tolist()
amazon_labels = amazon_balanced['binary_sentiment'].tolist()

print(f"\nSample preprocessed data:")
for i in range(3):
    sentiment_label = "Positive" if amazon_labels[i] == 1 else "Negative"
    print(f"{i+1}. [{sentiment_label}] {amazon_texts[i][:100]}...")

print(f"\nDataset ready for training!")
print(f"Total samples: {len(amazon_texts)}")
print(f"Positive samples: {sum(amazon_labels)}")
print(f"Negative samples: {len(amazon_labels) - sum(amazon_labels)}")


🔧 PREPROCESSING AMAZON DATASET
After removing missing values: 24999 reviews
After converting to binary: 19999 reviews
Binary sentiment distribution:
binary_sentiment
0    10000
1     9999
Name: count, dtype: int64
Balance: binary_sentiment
0    0.500025
1    0.499975
Name: proportion, dtype: float64

Balanced dataset: 2000 reviews
Final distribution:
binary_sentiment
0    1000
1    1000
Name: count, dtype: int64

Sample preprocessed data:
1. [Negative] This is not so bright, dull and some blue tone, items arrive on time and the seller send it fast....
2. [Positive] Nice and shipping fast...
3. [Negative] I have not been prompted to write a review for any product until this one. It started when I plugged...

Dataset ready for training!
Total samples: 2000
Positive samples: 1000
Negative samples: 1000


In [None]:
# Visualize dataset statistics and distribution
print("\n📊 DATASET VISUALIZATION")
print("=" * 50)

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Sentiment Distribution (Original 5-star ratings)
axes[0, 0].hist(amazon_df_clean['Sentiment'], bins=5, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Original Amazon Rating Distribution')
axes[0, 0].set_xlabel('Rating (1-5 stars)')
axes[0, 0].set_ylabel('Count')
axes[0, 0].grid(True, alpha=0.3)

# 2. Binary Sentiment Distribution
binary_counts = amazon_balanced['binary_sentiment'].value_counts()
axes[0, 1].bar(['Negative (0)', 'Positive (1)'], binary_counts.values, 
               color=['lightcoral', 'lightgreen'], alpha=0.7)
axes[0, 1].set_title('Binary Sentiment Distribution (Balanced)')
axes[0, 1].set_ylabel('Count')
axes[0, 1].grid(True, alpha=0.3)

# 3. Review Length Distribution
axes[1, 0].hist(amazon_balanced['Review'].str.len(), bins=30, alpha=0.7, 
                color='gold', edgecolor='black')
axes[1, 0].set_title('Review Length Distribution')
axes[1, 0].set_xlabel('Number of Characters')
axes[1, 0].set_ylabel('Count')
axes[1, 0].grid(True, alpha=0.3)

# 4. Word Count Distribution
word_counts = [len(text.split()) for text in amazon_balanced['Review']]
axes[1, 1].hist(word_counts, bins=30, alpha=0.7, color='lightblue', edgecolor='black')
axes[1, 1].set_title('Review Word Count Distribution')
axes[1, 1].set_xlabel('Number of Words')
axes[1, 1].set_ylabel('Count')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print(f"\nDataset Statistics:")
print(f"Average review length: {amazon_balanced['Review'].str.len().mean():.1f} characters")
print(f"Average word count: {np.mean(word_counts):.1f} words")
print(f"Median review length: {amazon_balanced['Review'].str.len().median():.1f} characters")
print(f"Median word count: {np.median(word_counts):.1f} words")

In [None]:
# Process Amazon dataset with improved preprocessing
print("\n🚀 PROCESSING REAL AMAZON DATA")
print("=" * 50)

# Create preprocessor instance
from utils.preprocessing import TextPreprocessor
improved_preprocessor = TextPreprocessor()
max_length = 100

# Use our improved preprocessor
amazon_processed_texts = [improved_preprocessor.preprocess_text(text) for text in amazon_texts]

print("Sample preprocessing comparison:")
for i in range(2):
    print(f"\nOriginal: {amazon_texts[i][:100]}...")
    print(f"Processed: {' '.join(amazon_processed_texts[i])}")

# Build vocabulary from real data
amazon_vocab_builder = VocabularyBuilder()
amazon_vocab = amazon_vocab_builder.build_from_texts(amazon_processed_texts)
print(f"\nAmazon vocabulary size: {len(amazon_vocab)}")

# Convert to indices using the vocab builder method
amazon_text_indices = [amazon_vocab_builder.text_to_indices(tokens, max_length) for tokens in amazon_processed_texts]

print(f"Sample text to indices:")
sample_idx = 0
print(f"Tokens: {amazon_processed_texts[sample_idx][:10]}")
print(f"Indices: {amazon_text_indices[sample_idx][:10]}")

# Create proper train/validation/test split (60/20/20)
from sklearn.model_selection import train_test_split

# First split: 80% train+val, 20% test
X_temp_amazon, X_test_amazon, y_temp_amazon, y_test_amazon = train_test_split(
    amazon_text_indices, amazon_labels, test_size=0.2, random_state=42, stratify=amazon_labels
)

# Second split: 75% train, 25% val (of the 80%)
X_train_amazon, X_val_amazon, y_train_amazon, y_val_amazon = train_test_split(
    X_temp_amazon, y_temp_amazon, test_size=0.25, random_state=42, stratify=y_temp_amazon
)

print(f"\nAmazon dataset splits:")
print(f"  Training: {len(X_train_amazon)} samples ({len(X_train_amazon)/len(amazon_text_indices)*100:.1f}%)")
print(f"  Validation: {len(X_val_amazon)} samples ({len(X_val_amazon)/len(amazon_text_indices)*100:.1f}%)")
print(f"  Test: {len(X_test_amazon)} samples ({len(X_test_amazon)/len(amazon_text_indices)*100:.1f}%)")

# Check class balance in each split
print(f"\nClass balance:")
print(f"  Train: {sum(y_train_amazon)}/{len(y_train_amazon)} positive ({sum(y_train_amazon)/len(y_train_amazon)*100:.1f}%)")
print(f"  Val: {sum(y_val_amazon)}/{len(y_val_amazon)} positive ({sum(y_val_amazon)/len(y_val_amazon)*100:.1f}%)")
print(f"  Test: {sum(y_test_amazon)}/{len(y_test_amazon)} positive ({sum(y_test_amazon)/len(y_test_amazon)*100:.1f}%)")


🚀 PROCESSING REAL AMAZON DATA
Sample preprocessing comparison:

Original: This is not so bright, dull and some blue tone, items arrive on time and the seller send it fast....
Processed: not bright dull some blue tone items arrive time seller send fast

Original: Nice and shipping fast...
Processed: nice shipping fast

Amazon vocabulary size: 4051
Sample text to indices:
Tokens: ['not', 'bright', 'dull', 'some', 'blue', 'tone', 'items', 'arrive', 'time', 'seller']
Indices: [2, 517, 2945, 50, 434, 2026, 423, 1422, 27, 291]

Amazon dataset splits:
  Training: 1200 samples (60.0%)
  Validation: 400 samples (20.0%)
  Test: 400 samples (20.0%)

Class balance:
  Train: 600/1200 positive (50.0%)
  Val: 200/400 positive (50.0%)
  Test: 200/400 positive (50.0%)


In [None]:
# Visualize preprocessing effects and vocabulary statistics
print("\n📈 PREPROCESSING & VOCABULARY ANALYSIS")
print("=" * 50)

# Import necessary for preprocessing (ensure we have the preprocessor)
from utils.preprocessing import TextPreprocessor
if 'improved_preprocessor' not in locals():
    improved_preprocessor = TextPreprocessor()
if 'max_length' not in locals():
    max_length = 100

# Analyze preprocessing effects
original_lengths = [len(text.split()) for text in amazon_texts[:100]]
processed_lengths = [len(tokens) for tokens in amazon_processed_texts[:100]]

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Before vs After Preprocessing (Word Count)
axes[0, 0].scatter(original_lengths, processed_lengths, alpha=0.6, color='purple')
axes[0, 0].plot([0, max(original_lengths)], [0, max(original_lengths)], 'r--', alpha=0.7)
axes[0, 0].set_title('Text Length: Before vs After Preprocessing')
axes[0, 0].set_xlabel('Original Word Count')
axes[0, 0].set_ylabel('Processed Word Count')
axes[0, 0].grid(True, alpha=0.3)

# 2. Vocabulary Size Analysis
vocab_frequencies = Counter()
for tokens in amazon_processed_texts:
    vocab_frequencies.update(tokens)

top_words = vocab_frequencies.most_common(20)
words, frequencies = zip(*top_words)

axes[0, 1].barh(range(len(words)), frequencies, color='teal', alpha=0.7)
axes[0, 1].set_yticks(range(len(words)))
axes[0, 1].set_yticklabels(words)
axes[0, 1].set_title('Top 20 Most Frequent Words')
axes[0, 1].set_xlabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

# 3. Vocabulary Distribution
freq_counts = list(vocab_frequencies.values())
axes[1, 0].hist(freq_counts, bins=50, alpha=0.7, color='orange', edgecolor='black')
axes[1, 0].set_title('Word Frequency Distribution')
axes[1, 0].set_xlabel('Word Frequency')
axes[1, 0].set_ylabel('Number of Words')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)

# 4. Sequence Length Distribution After Padding
sequence_lengths = [len([token for token in tokens if token]) for tokens in amazon_processed_texts]
axes[1, 1].hist(sequence_lengths, bins=30, alpha=0.7, color='pink', edgecolor='black')
axes[1, 1].axvline(max_length, color='red', linestyle='--', label=f'Max Length ({max_length})')
axes[1, 1].set_title('Processed Sequence Length Distribution')
axes[1, 1].set_xlabel('Sequence Length')
axes[1, 1].set_ylabel('Count')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nVocabulary Statistics:")
print(f"Total unique words: {len(vocab_frequencies)}")
print(f"Words appearing once: {sum(1 for freq in freq_counts if freq == 1)}")
print(f"Most common word: '{top_words[0][0]}' (appears {top_words[0][1]} times)")
print(f"Average sequence length: {np.mean(sequence_lengths):.1f} words")
print(f"Sequences truncated (>{max_length}): {sum(1 for length in sequence_lengths if length > max_length)}")

In [None]:
# Train model on real Amazon dataset
import torch.nn.functional as F

print("\n🎯 TRAINING ON REAL AMAZON DATASET")
print("=" * 50)

# Define TokenizedDataset class if not available
class TokenizedDataset:
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return torch.tensor(self.texts[idx], dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)

# Create a proper model for the real dataset
class AmazonNBoW(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128, output_dim=1, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout1 = nn.Dropout(dropout)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.dropout2 = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim // 2)
        self.dropout3 = nn.Dropout(dropout)
        self.fc3 = nn.Linear(hidden_dim // 2, output_dim)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        # Average embeddings (bag of words)
        embedded = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        embedded = self.dropout1(embedded)
        pooled = embedded.mean(dim=1)  # [batch_size, embedding_dim]
        
        x = F.relu(self.fc1(pooled))
        x = self.dropout2(x)
        x = F.relu(self.fc2(x))
        x = self.dropout3(x)
        x = self.fc3(x)
        return self.sigmoid(x)

# Initialize model for Amazon dataset
amazon_model = AmazonNBoW(len(amazon_vocab), embedding_dim=64, hidden_dim=128, dropout=0.4)
amazon_criterion = nn.BCELoss()
amazon_optimizer = torch.optim.Adam(amazon_model.parameters(), lr=0.001, weight_decay=1e-4)

# Model summary
total_params_amazon = sum(p.numel() for p in amazon_model.parameters())
print(f"Amazon model parameters: {total_params_amazon:,}")
print(f"Parameters/Sample ratio: {total_params_amazon/len(X_train_amazon):.2f}")

print(f"Model architecture:")
print(amazon_model)

# Create datasets
amazon_train_dataset = TokenizedDataset(X_train_amazon, y_train_amazon)
amazon_val_dataset = TokenizedDataset(X_val_amazon, y_val_amazon)
amazon_test_dataset = TokenizedDataset(X_test_amazon, y_test_amazon)

amazon_train_loader = DataLoader(amazon_train_dataset, batch_size=32, shuffle=True)
amazon_val_loader = DataLoader(amazon_val_dataset, batch_size=32, shuffle=False)
amazon_test_loader = DataLoader(amazon_test_dataset, batch_size=32, shuffle=False)


🎯 TRAINING ON REAL AMAZON DATASET
Amazon model parameters: 275,905
Parameters/Sample ratio: 229.92
Model architecture:
AmazonNBoW(
  (embedding): Embedding(4051, 64, padding_idx=0)
  (dropout1): Dropout(p=0.4, inplace=False)
  (fc1): Linear(in_features=64, out_features=128, bias=True)
  (dropout2): Dropout(p=0.4, inplace=False)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (dropout3): Dropout(p=0.4, inplace=False)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


In [37]:
# Train with early stopping on real Amazon data
print("\n🚀 TRAINING WITH EARLY STOPPING")
print("=" * 50)

def train_amazon_model(model, train_loader, val_loader, criterion, optimizer, epochs=30, patience=5):
    history = {
        'train_losses': [], 'train_accuracies': [],
        'val_losses': [], 'val_accuracies': []
    }
    
    best_val_loss = float('inf')
    patience_counter = 0
    best_model_state = None
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        train_correct = 0
        train_total = 0
        
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X).squeeze()
            loss = criterion(outputs, batch_y.float())
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
            predicted = (outputs > 0.5).float()
            train_correct += (predicted == batch_y).sum().item()
            train_total += batch_y.size(0)
        
        # Validation phase
        model.eval()
        val_loss = 0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                outputs = model(batch_X).squeeze()
                loss = criterion(outputs, batch_y.float())
                
                val_loss += loss.item()
                predicted = (outputs > 0.5).float()
                val_correct += (predicted == batch_y).sum().item()
                val_total += batch_y.size(0)
        
        # Calculate metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        train_loss_avg = train_loss / len(train_loader)
        val_loss_avg = val_loss / len(val_loader)
        
        history['train_losses'].append(train_loss_avg)
        history['train_accuracies'].append(train_acc)
        history['val_losses'].append(val_loss_avg)
        history['val_accuracies'].append(val_acc)
        
        # Early stopping
        if val_loss_avg < best_val_loss:
            best_val_loss = val_loss_avg
            patience_counter = 0
            best_model_state = model.state_dict().copy()
        else:
            patience_counter += 1
        
        if epoch % 5 == 0 or epoch < 5:
            print(f'Epoch {epoch+1:2d}: Train Loss: {train_loss_avg:.4f}, Train Acc: {train_acc:.4f}, '
                  f'Val Loss: {val_loss_avg:.4f}, Val Acc: {val_acc:.4f} | Best Val Loss: {best_val_loss:.4f}')
        
        if patience_counter >= patience:
            print(f'Early stopping at epoch {epoch+1} (patience: {patience})')
            break
    
    # Load best model
    if best_model_state:
        model.load_state_dict(best_model_state)
        print(f'Loaded best model with validation loss: {best_val_loss:.4f}')
    
    return history

# Train the model
amazon_history = train_amazon_model(
    amazon_model, amazon_train_loader, amazon_val_loader,
    amazon_criterion, amazon_optimizer, epochs=30, patience=5
)


🚀 TRAINING WITH EARLY STOPPING
Epoch  1: Train Loss: 0.6931, Train Acc: 0.5100, Val Loss: 0.6895, Val Acc: 0.6475 | Best Val Loss: 0.6895
Epoch  2: Train Loss: 0.6866, Train Acc: 0.5867, Val Loss: 0.6813, Val Acc: 0.6700 | Best Val Loss: 0.6813
Epoch  3: Train Loss: 0.6733, Train Acc: 0.6375, Val Loss: 0.6580, Val Acc: 0.6875 | Best Val Loss: 0.6580
Epoch  4: Train Loss: 0.6411, Train Acc: 0.6650, Val Loss: 0.6142, Val Acc: 0.7075 | Best Val Loss: 0.6142
Epoch  5: Train Loss: 0.6035, Train Acc: 0.6850, Val Loss: 0.5748, Val Acc: 0.7300 | Best Val Loss: 0.5748
Epoch  6: Train Loss: 0.5657, Train Acc: 0.7333, Val Loss: 0.5583, Val Acc: 0.7375 | Best Val Loss: 0.5583
Epoch 11: Train Loss: 0.4520, Train Acc: 0.8050, Val Loss: 0.5213, Val Acc: 0.7525 | Best Val Loss: 0.5213
Epoch 16: Train Loss: 0.3597, Train Acc: 0.8500, Val Loss: 0.5169, Val Acc: 0.7450 | Best Val Loss: 0.5150
Early stopping at epoch 20 (patience: 5)
Loaded best model with validation loss: 0.5150


In [None]:
# Visualize training progress and learning curves
print("\n📊 TRAINING PROGRESS VISUALIZATION")
print("=" * 50)

# Create comprehensive training visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Loss curves
epochs = range(1, len(amazon_history['train_losses']) + 1)
axes[0, 0].plot(epochs, amazon_history['train_losses'], 'b-', label='Training Loss', linewidth=2)
axes[0, 0].plot(epochs, amazon_history['val_losses'], 'r-', label='Validation Loss', linewidth=2)
axes[0, 0].set_title('Training and Validation Loss')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Accuracy curves
axes[0, 1].plot(epochs, amazon_history['train_accuracies'], 'b-', label='Training Accuracy', linewidth=2)
axes[0, 1].plot(epochs, amazon_history['val_accuracies'], 'r-', label='Validation Accuracy', linewidth=2)
axes[0, 1].set_title('Training and Validation Accuracy')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Overfitting analysis
train_val_gap = [train - val for train, val in zip(amazon_history['train_accuracies'], amazon_history['val_accuracies'])]
axes[1, 0].plot(epochs, train_val_gap, 'g-', linewidth=2, label='Train-Val Gap')
axes[1, 0].axhline(y=0.05, color='orange', linestyle='--', alpha=0.7, label='Overfitting Threshold')
axes[1, 0].set_title('Overfitting Analysis (Train-Val Gap)')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Accuracy Difference')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Learning rate effect (loss improvement per epoch)
loss_improvements = []
for i in range(1, len(amazon_history['val_losses'])):
    improvement = amazon_history['val_losses'][i-1] - amazon_history['val_losses'][i]
    loss_improvements.append(improvement)

axes[1, 1].bar(range(2, len(epochs) + 1), loss_improvements, alpha=0.7, color='purple')
axes[1, 1].set_title('Validation Loss Improvement per Epoch')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Loss Improvement')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print training summary
print(f"\nTraining Summary:")
print(f"Total epochs: {len(epochs)}")
print(f"Best validation loss: {min(amazon_history['val_losses']):.4f}")
print(f"Best validation accuracy: {max(amazon_history['val_accuracies']):.4f}")
print(f"Final train-val gap: {train_val_gap[-1]:.4f}")

if train_val_gap[-1] > 0.05:
    print("⚠️ Warning: Significant overfitting detected")
elif train_val_gap[-1] > 0.02:
    print("🔸 Mild overfitting observed")
else:
    print("✅ Good generalization, minimal overfitting")

In [40]:
# Evaluate the trained Amazon model
print("\nEVALUATING AMAZON MODEL")
print("=" * 50)

def evaluate_model(model, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    all_predictions = []
    all_labels = []
    
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            outputs = model(batch_X).squeeze()
            loss = criterion(outputs, batch_y.float())
            
            test_loss += loss.item()
            predicted = (outputs > 0.5).float()
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
            
            all_predictions.extend(predicted.cpu().numpy())
            all_labels.extend(batch_y.cpu().numpy())
    
    accuracy = correct / total
    avg_loss = test_loss / len(test_loader)
    
    return {
        'accuracy': accuracy,
        'loss': avg_loss,
        'predictions': all_predictions,
        'labels': all_labels
    }

# Evaluate on test set
amazon_results = evaluate_model(amazon_model, amazon_test_loader, amazon_criterion)

print(f"REAL AMAZON DATASET RESULTS:")
print(f"Test Accuracy: {amazon_results['accuracy']:.4f} ({amazon_results['accuracy']*100:.2f}%)")
print(f"Test Loss: {amazon_results['loss']:.4f}")

# Calculate detailed metrics
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

precision = precision_score(amazon_results['labels'], amazon_results['predictions'])
recall = recall_score(amazon_results['labels'], amazon_results['predictions'])
f1 = f1_score(amazon_results['labels'], amazon_results['predictions'])

print(f"\nDetailed Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(amazon_results['labels'], amazon_results['predictions'])
print(f"\nConfusion Matrix:")
print(f"             Predicted")
print(f"Actual    Neg    Pos")
print(f"Neg      {cm[0,0]:3d}    {cm[0,1]:3d}")
print(f"Pos      {cm[1,0]:3d}    {cm[1,1]:3d}")

# Compare with training/validation performance
final_train_acc = amazon_history['train_accuracies'][-1]
final_val_acc = amazon_history['val_accuracies'][-1]
test_acc = amazon_results['accuracy']

print(f"\nOVERFITTING ANALYSIS:")
print(f"Final Training Accuracy: {final_train_acc:.4f}")
print(f"Final Validation Accuracy: {final_val_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

train_val_gap = final_train_acc - final_val_acc
val_test_gap = final_val_acc - test_acc

print(f"Train-Val Gap: {train_val_gap:.4f}")
print(f"Val-Test Gap: {val_test_gap:.4f}")

if train_val_gap > 0.05:
    print("WARNING: Potential overfitting detected (large train-val gap)")
elif train_val_gap > 0.02:
    print("MILD: Mild overfitting (moderate train-val gap)")
else:
    print("GOOD: No significant overfitting detected")

print(f"\nSUCCESS: MODEL TRAINED ON REAL DATA!")
print(f"- No data leakage (proper train/val/test split)")
print(f"- Real Amazon reviews (not synthetic)")
print(f"- Reasonable performance without perfect scores")
print(f"- Model complexity is appropriate")


EVALUATING AMAZON MODEL
REAL AMAZON DATASET RESULTS:
Test Accuracy: 0.7975 (79.75%)
Test Loss: 0.4843

Detailed Metrics:
Precision: 0.8251
Recall: 0.7550
F1-Score: 0.7885

Confusion Matrix:
             Predicted
Actual    Neg    Pos
Neg      168     32
Pos       49    151

OVERFITTING ANALYSIS:
Final Training Accuracy: 0.8683
Final Validation Accuracy: 0.7575
Test Accuracy: 0.7975
Train-Val Gap: 0.1108
Val-Test Gap: -0.0400

SUCCESS: MODEL TRAINED ON REAL DATA!
- No data leakage (proper train/val/test split)
- Real Amazon reviews (not synthetic)
- Reasonable performance without perfect scores
- Model complexity is appropriate


In [None]:
# Visualize model performance and evaluation metrics
print("\n🎯 MODEL PERFORMANCE VISUALIZATION")
print("=" * 50)

# Create comprehensive evaluation visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Confusion Matrix Heatmap
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(amazon_results['labels'], amazon_results['predictions'])
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xticklabels(['Negative', 'Positive'])
axes[0, 0].set_yticklabels(['Negative', 'Positive'])

# 2. Performance Metrics Bar Chart
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [amazon_results['accuracy'], precision, recall, f1]
colors = ['skyblue', 'lightgreen', 'lightcoral', 'gold']

bars = axes[0, 1].bar(metrics, values, color=colors, alpha=0.7)
axes[0, 1].set_title('Model Performance Metrics')
axes[0, 1].set_ylabel('Score')
axes[0, 1].set_ylim(0, 1)
axes[0, 1].grid(True, alpha=0.3)

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{value:.3f}', ha='center', va='bottom')

# 3. Prediction Confidence Distribution
# Get prediction probabilities for confidence analysis
amazon_model.eval()
all_probs = []
with torch.no_grad():
    for batch_X, batch_y in amazon_test_loader:
        outputs = amazon_model(batch_X).squeeze()
        all_probs.extend(outputs.cpu().numpy())

axes[0, 2].hist(all_probs, bins=20, alpha=0.7, color='purple', edgecolor='black')
axes[0, 2].axvline(0.5, color='red', linestyle='--', label='Decision Threshold')
axes[0, 2].set_title('Prediction Confidence Distribution')
axes[0, 2].set_xlabel('Predicted Probability')
axes[0, 2].set_ylabel('Count')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Train/Val/Test Performance Comparison
performance_data = {
    'Dataset': ['Training', 'Validation', 'Test'],
    'Accuracy': [
        amazon_history['train_accuracies'][-1],
        amazon_history['val_accuracies'][-1],
        amazon_results['accuracy']
    ]
}

bars = axes[1, 0].bar(performance_data['Dataset'], performance_data['Accuracy'], 
                      color=['blue', 'orange', 'green'], alpha=0.7)
axes[1, 0].set_title('Performance Across Datasets')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_ylim(0, 1)
axes[1, 0].grid(True, alpha=0.3)

# Add value labels
for bar, value in zip(bars, performance_data['Accuracy']):
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{value:.3f}', ha='center', va='bottom')

# 5. Class-wise Performance
tn, fp, fn, tp = cm.ravel()
negative_precision = tn / (tn + fn) if (tn + fn) > 0 else 0
negative_recall = tn / (tn + fp) if (tn + fp) > 0 else 0
positive_precision = tp / (tp + fp) if (tp + fp) > 0 else 0
positive_recall = tp / (tp + fn) if (tp + fn) > 0 else 0

classes = ['Negative', 'Positive']
precisions = [negative_precision, positive_precision]
recalls = [negative_recall, positive_recall]

x = np.arange(len(classes))
width = 0.35

axes[1, 1].bar(x - width/2, precisions, width, label='Precision', alpha=0.7, color='lightblue')
axes[1, 1].bar(x + width/2, recalls, width, label='Recall', alpha=0.7, color='lightpink')
axes[1, 1].set_title('Class-wise Performance')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(classes)
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Model Complexity vs Performance
model_params = sum(p.numel() for p in amazon_model.parameters())
data_size = len(X_train_amazon)

# Create a simple visualization showing the relationship
complexity_data = {
    'Metric': ['Parameters', 'Training Samples', 'Parameters/Sample'],
    'Value': [model_params, data_size, model_params/data_size],
    'Colors': ['red', 'blue', 'green']
}

# Normalize values for better visualization
normalized_values = [
    model_params / 10000,  # Scale down parameters
    data_size / 100,       # Scale down data size
    (model_params/data_size) * 100  # Scale up ratio
]

bars = axes[1, 2].bar(complexity_data['Metric'], normalized_values, 
                      color=complexity_data['Colors'], alpha=0.7)
axes[1, 2].set_title('Model Complexity Analysis')
axes[1, 2].set_ylabel('Normalized Scale')
axes[1, 2].tick_params(axis='x', rotation=45)
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed analysis
print(f"\nDetailed Performance Analysis:")
print(f"True Negatives: {tn}, False Positives: {fp}")
print(f"False Negatives: {fn}, True Positives: {tp}")
print(f"Negative Class - Precision: {negative_precision:.3f}, Recall: {negative_recall:.3f}")
print(f"Positive Class - Precision: {positive_precision:.3f}, Recall: {positive_recall:.3f}")
print(f"Model Complexity: {model_params:,} parameters for {data_size:,} training samples")
print(f"Parameters per sample: {model_params/data_size:.2f}")

# Confidence analysis
high_confidence = sum(1 for p in all_probs if p < 0.2 or p > 0.8)
low_confidence = len(all_probs) - high_confidence
print(f"\nPrediction Confidence:")
print(f"High confidence predictions (p<0.2 or p>0.8): {high_confidence}/{len(all_probs)} ({high_confidence/len(all_probs)*100:.1f}%)")
print(f"Low confidence predictions (0.2≤p≤0.8): {low_confidence}/{len(all_probs)} ({low_confidence/len(all_probs)*100:.1f}%)")

In [41]:
# Final Summary: Synthetic vs Real Data Comparison
print("\n" + "="*60)
print("FINAL SUMMARY: SYNTHETIC vs REAL DATA")
print("="*60)

print("\n1. SYNTHETIC DATA ISSUES (Previous Model):")
print("   - 100% accuracy (clear overfitting)")
print("   - Data leakage (used test set for validation)")
print("   - Simple patterns easy to memorize")
print("   - 88.9% synthetic data with templates")

print("\n2. REAL AMAZON DATA RESULTS (Current Model):")
print(f"   - {amazon_results['accuracy']*100:.1f}% test accuracy (realistic)")
print("   - Proper train/val/test split (60/20/20)")
print("   - Real customer reviews with natural variation")
print("   - 2000 balanced samples from 25,000 reviews")
print(f"   - Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")

print("\n3. OVERFITTING ANALYSIS:")
print(f"   - Train-Val gap: {train_val_gap:.3f} (indicates some overfitting)")
print("   - Test performance is reasonable, not perfect")
print("   - Model generalizes to unseen real data")

print("\n4. KEY IMPROVEMENTS MADE:")
print("   - Used real Amazon product reviews dataset")
print("   - Implemented proper data splitting")
print("   - Added dropout and L2 regularization")
print("   - Early stopping with patience")
print("   - Realistic evaluation metrics")

print("\n5. CONCLUSION:")
print("   The model now shows realistic performance on real data")
print("   instead of the suspicious 100% accuracy from before.")
print("   This demonstrates the importance of:")
print("   - Using real, diverse datasets")
print("   - Proper validation techniques")
print("   - Overfitting detection and prevention")

print(f"\nModel is ready for real-world sentiment analysis!")
print(f"Current performance: {amazon_results['accuracy']*100:.1f}% accuracy on Amazon reviews")


FINAL SUMMARY: SYNTHETIC vs REAL DATA

1. SYNTHETIC DATA ISSUES (Previous Model):
   - 100% accuracy (clear overfitting)
   - Data leakage (used test set for validation)
   - Simple patterns easy to memorize
   - 88.9% synthetic data with templates

2. REAL AMAZON DATA RESULTS (Current Model):
   - 79.8% test accuracy (realistic)
   - Proper train/val/test split (60/20/20)
   - Real customer reviews with natural variation
   - 2000 balanced samples from 25,000 reviews
   - Precision: 0.825, Recall: 0.755, F1: 0.789

3. OVERFITTING ANALYSIS:
   - Train-Val gap: 0.111 (indicates some overfitting)
   - Test performance is reasonable, not perfect
   - Model generalizes to unseen real data

4. KEY IMPROVEMENTS MADE:
   - Used real Amazon product reviews dataset
   - Implemented proper data splitting
   - Added dropout and L2 regularization
   - Early stopping with patience
   - Realistic evaluation metrics

5. CONCLUSION:
   The model now shows realistic performance on real data
   instead

In [None]:
# Visualize sample predictions and model interpretability
print("\n🔍 SAMPLE PREDICTIONS ANALYSIS")
print("=" * 50)

# Get sample predictions with confidence scores
amazon_model.eval()
sample_reviews = []
sample_predictions = []
sample_confidences = []
sample_true_labels = []

# Get some test samples for analysis
test_indices = list(range(min(20, len(X_test_amazon))))
for idx in test_indices:
    # Get the original text
    review_tokens = amazon_processed_texts[len(X_train_amazon) + len(X_val_amazon) + idx]
    original_text = ' '.join(review_tokens[:50])  # First 50 words
    
    # Get prediction
    with torch.no_grad():
        input_tensor = torch.tensor([X_test_amazon[idx]])
        output = amazon_model(input_tensor).squeeze()
        confidence = output.item()
        prediction = 1 if confidence > 0.5 else 0
    
    sample_reviews.append(original_text)
    sample_predictions.append(prediction)
    sample_confidences.append(confidence)
    sample_true_labels.append(y_test_amazon[idx])

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Confidence vs Accuracy scatter plot
correct_predictions = [1 if pred == true else 0 for pred, true in zip(sample_predictions, sample_true_labels)]
colors = ['red' if correct == 0 else 'green' for correct in correct_predictions]

axes[0, 0].scatter(sample_confidences, correct_predictions, c=colors, alpha=0.7, s=100)
axes[0, 0].axvline(0.5, color='black', linestyle='--', alpha=0.5, label='Decision Threshold')
axes[0, 0].set_title('Prediction Confidence vs Correctness')
axes[0, 0].set_xlabel('Prediction Confidence')
axes[0, 0].set_ylabel('Correct (1) / Incorrect (0)')
axes[0, 0].set_ylim(-0.1, 1.1)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Confidence distribution by correctness
correct_confidences = [conf for conf, correct in zip(sample_confidences, correct_predictions) if correct == 1]
incorrect_confidences = [conf for conf, correct in zip(sample_confidences, correct_predictions) if correct == 0]

axes[0, 1].hist(correct_confidences, bins=10, alpha=0.7, label='Correct', color='green')
axes[0, 1].hist(incorrect_confidences, bins=10, alpha=0.7, label='Incorrect', color='red')
axes[0, 1].set_title('Confidence Distribution by Prediction Accuracy')
axes[0, 1].set_xlabel('Prediction Confidence')
axes[0, 1].set_ylabel('Count')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Sentiment prediction confidence by true label
positive_confidences = [conf for conf, true_label in zip(sample_confidences, sample_true_labels) if true_label == 1]
negative_confidences = [conf for conf, true_label in zip(sample_confidences, sample_true_labels) if true_label == 0]

axes[1, 0].boxplot([negative_confidences, positive_confidences], labels=['True Negative', 'True Positive'])
axes[1, 0].axhline(0.5, color='red', linestyle='--', alpha=0.7, label='Decision Threshold')
axes[1, 0].set_title('Prediction Confidence by True Label')
axes[1, 0].set_ylabel('Prediction Confidence')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Review length vs prediction confidence
review_lengths = [len(review.split()) for review in sample_reviews]
axes[1, 1].scatter(review_lengths, sample_confidences, c=colors, alpha=0.7, s=100)
axes[1, 1].set_title('Review Length vs Prediction Confidence')
axes[1, 1].set_xlabel('Review Length (words)')
axes[1, 1].set_ylabel('Prediction Confidence')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print sample predictions
print(f"\nSample Predictions Analysis:")
print(f"Total samples analyzed: {len(sample_reviews)}")
print(f"Correct predictions: {sum(correct_predictions)}/{len(correct_predictions)} ({sum(correct_predictions)/len(correct_predictions)*100:.1f}%)")

print(f"\n📝 SAMPLE PREDICTIONS:")
print("-" * 80)

for i in range(min(8, len(sample_reviews))):
    true_sentiment = "Positive" if sample_true_labels[i] == 1 else "Negative"
    pred_sentiment = "Positive" if sample_predictions[i] == 1 else "Negative"
    confidence = sample_confidences[i]
    correct = "✅" if sample_predictions[i] == sample_true_labels[i] else "❌"
    
    print(f"\n{i+1}. {correct} True: {true_sentiment} | Predicted: {pred_sentiment} | Confidence: {confidence:.3f}")
    print(f"Review: \"{sample_reviews[i][:100]}{'...' if len(sample_reviews[i]) > 100 else ''}\"")

# Analysis summary
high_confidence_correct = sum(1 for conf, correct in zip(sample_confidences, correct_predictions) 
                             if (conf > 0.8 or conf < 0.2) and correct == 1)
high_confidence_total = sum(1 for conf in sample_confidences if conf > 0.8 or conf < 0.2)

print(f"\n📊 Confidence Analysis:")
print(f"High confidence predictions (>0.8 or <0.2): {high_confidence_total}/{len(sample_confidences)}")
if high_confidence_total > 0:
    print(f"High confidence accuracy: {high_confidence_correct}/{high_confidence_total} ({high_confidence_correct/high_confidence_total*100:.1f}%)")

print(f"Average confidence for correct predictions: {np.mean(correct_confidences) if correct_confidences else 0:.3f}")
print(f"Average confidence for incorrect predictions: {np.mean(incorrect_confidences) if incorrect_confidences else 0:.3f}")