# Sentiment Analysis - Exploratory Data Analysis

This notebook explores the product review dataset used for sentiment analysis, covering label balance, text characteristics, and the metadata fields that accompany each review.

## Objectives
1. Load and inspect the dataset
2. Analyze sentiment distribution
3. Explore text characteristics
4. Identify data quality issues
5. Generate insights for model development

## 1. Setup and Imports

In [None]:
import sys
import os

# Add parent directory to path
sys.path.append(os.path.dirname(os.getcwd()))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings

warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")

## 2. Load Dataset

In [None]:
# Load the generated review data
df = pd.read_csv('../data/reviews.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# Dataset info
print("Dataset Info:")
df.info()

In [None]:
# Basic statistics
print("Dataset Statistics:")
df.describe(include='all')

## 3. Sentiment Distribution Analysis

In [None]:
# Sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
sentiment_pct = df['sentiment'].value_counts(normalize=True) * 100

print("Sentiment Distribution:")
print(f"\nCounts:\n{sentiment_counts}")
print(f"\nPercentages:\n{sentiment_pct}")

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

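# value_counts() orders labels by frequency, so map each label to its color
# explicitly instead of relying on a positional color list
sentiment_colors = {'negative': 'red', 'neutral': 'gray', 'positive': 'green'}
plot_colors = [sentiment_colors.get(label, 'steelblue') for label in sentiment_counts.index]

# Count plot
sentiment_counts.plot(kind='bar', ax=axes[0], color=plot_colors)
axes[0].set_title('Sentiment Distribution (Counts)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sentiment')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Pie chart
sentiment_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%',
                      colors=plot_colors, startangle=90)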
axes[1].set_title('Sentiment Distribution (Percentage)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

## 4. Rating Analysis

In [None]:
# Rating distribution
rating_counts = df['rating'].value_counts().sort_index()

print("Rating Distribution:")
print(rating_counts)

plt.figure(figsize=(10, 5))
rating_counts.plot(kind='bar', color='steelblue')
plt.title('Rating Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.show()

In [None]:
# Rating vs Sentiment
crosstab = pd.crosstab(df['sentiment'], df['rating'])

print("Sentiment vs Rating Cross-tabulation:")
print(crosstab)

# DataFrame.plot() opens its own figure, so set figsize here instead of calling plt.figure() first
crosstab.T.plot(kind='bar', stacked=False, figsize=(10, 6), color=['red', 'gray', 'green'])
plt.title('Sentiment Distribution by Rating', fontsize=14, fontweight='bold')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.legend(title='Sentiment')
plt.xticks(rotation=0)
plt.show()

## 5. Text Characteristics Analysis

In [None]:
# Calculate text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].apply(lambda x: np.mean([len(word) for word in x.split()]))

print("Text Statistics:")
print(df[['text_length', 'word_count', 'avg_word_length']].describe())

In [None]:
# Text length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Character length
df['text_length'].hist(bins=30, ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Text Length Distribution (Characters)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Text Length')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df['text_length'].mean(), color='red', linestyle='--', label=f'Mean: {df["text_length"].mean():.1f}')
axes[0].legend()

# Word count
df['word_count'].hist(bins=30, ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Word Count Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].axvline(df['word_count'].mean(), color='red', linestyle='--', label=f'Mean: {df["word_count"].mean():.1f}')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Text length by sentiment
plt.figure(figsize=(10, 6))
df.boxplot(column='text_length', by='sentiment', ax=plt.gca())
plt.title('Text Length by Sentiment', fontsize=14, fontweight='bold')
plt.suptitle('')  # Remove default title
plt.xlabel('Sentiment')
plt.ylabel('Text Length (characters)')
plt.show()

print("\nText Length Statistics by Sentiment:")
print(df.groupby('sentiment')['text_length'].describe())

## 6. Product Category Analysis

In [None]:
# Category distribution
category_counts = df['product_category'].value_counts()

print("Product Category Distribution:")
print(category_counts)

plt.figure(figsize=(12, 6))
category_counts.plot(kind='barh', color='teal')
plt.title('Reviews by Product Category', fontsize=14, fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Category')
plt.tight_layout()
plt.show()

In [None]:
# Sentiment by category
category_sentiment = pd.crosstab(df['product_category'], df['sentiment'], normalize='index') * 100

print("Sentiment Distribution by Category (%):\n")
print(category_sentiment.round(1))

category_sentiment.plot(kind='barh', stacked=True, figsize=(12, 8), 
                        color=['red', 'gray', 'green'])
plt.title('Sentiment Distribution by Product Category', fontsize=14, fontweight='bold')
plt.xlabel('Percentage')
plt.ylabel('Category')
plt.legend(title='Sentiment', bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()

## 7. Verified Purchase Analysis

In [None]:
# Verified purchase distribution
verified_counts = df['verified_purchase'].value_counts()

print("Verified Purchase Distribution:")
print(verified_counts)
print(f"\nVerified: {verified_counts.get(True, 0) / len(df) * 100:.1f}%")

# Sentiment by verified purchase
verified_sentiment = pd.crosstab(df['verified_purchase'], df['sentiment'], normalize='index') * 100

print("\nSentiment by Verified Purchase (%):\n")
print(verified_sentiment.round(1))

verified_sentiment.plot(kind='bar', figsize=(10, 6), color=['red', 'gray', 'green'])
plt.title('Sentiment Distribution by Verified Purchase Status', fontsize=14, fontweight='bold')
plt.xlabel('Verified Purchase')
plt.ylabel('Percentage')
plt.legend(title='Sentiment')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 8. Helpful Votes Analysis

In [None]:
# Helpful votes statistics
print("Helpful Votes Statistics:")
print(df['helpful_votes'].describe())

# Helpful votes by sentiment
print("\nHelpful Votes by Sentiment:")
print(df.groupby('sentiment')['helpful_votes'].describe())

plt.figure(figsize=(10, 6))
df.boxplot(column='helpful_votes', by='sentiment', ax=plt.gca())
plt.title('Helpful Votes by Sentiment', fontsize=14, fontweight='bold')
plt.suptitle('')
plt.xlabel('Sentiment')
plt.ylabel('Helpful Votes')
plt.show()

## 9. Temporal Analysis

In [None]:
# Convert review_date to datetime
df['review_date'] = pd.to_datetime(df['review_date'])
df['year_month'] = df['review_date'].dt.to_period('M')

# Reviews over time
reviews_over_time = df.groupby('year_month').size()

plt.figure(figsize=(14, 6))
reviews_over_time.plot(kind='line', marker='o', color='steelblue')
plt.title('Reviews Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Number of Reviews')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Sentiment over time
sentiment_over_time = df.groupby(['year_month', 'sentiment']).size().unstack(fill_value=0)

# DataFrame.plot() opens its own figure, so set figsize here instead of calling plt.figure() first
sentiment_over_time.plot(kind='area', stacked=True, alpha=0.7, figsize=(14, 6),
                         color=['red', 'gray', 'green'])
plt.title('Sentiment Distribution Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Number of Reviews')
plt.legend(title='Sentiment')
plt.tight_layout()
plt.show()

## 10. Sample Reviews

In [None]:
# Sample positive reviews
print("Sample Positive Reviews:\n")
for i, text in enumerate(df[df['sentiment'] == 'positive']['text'].head(3), 1):
    print(f"{i}. {text}")

print("\n" + "="*80 + "\n")

# Sample negative reviews
print("Sample Negative Reviews:\n")
for i, text in enumerate(df[df['sentiment'] == 'negative']['text'].head(3), 1):
    print(f"{i}. {text}")

print("\n" + "="*80 + "\n")

# Sample neutral reviews
print("Sample Neutral Reviews:\n")
for i, text in enumerate(df[df['sentiment'] == 'neutral']['text'].head(3), 1):
    print(f"{i}. {text}")

## 11. Key Insights and Recommendations

### Data Quality
- Dataset contains 1,000 reviews with no missing values
- Sentiment distribution: 60% positive, 25% neutral, 15% negative
- Reviews are distributed across 10 product categories
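
The cells above do not include an explicit check behind the "no missing values" statement. A minimal sketch of one way to verify it is shown here, assuming the `text` column used earlier in the notebook. `quality_report` is a hypothetical helper, and `sample` is a made-up stand-in for `df` so the block runs on its own.

```python
import pandas as pd


def quality_report(frame, text_col="text"):
    """Summarize basic data quality signals for a review DataFrame."""
    text = frame[text_col]
    return {
        "rows": len(frame),
        # Number of missing values in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(frame.duplicated().sum()),
        # Reviews that are missing or contain only whitespace
        "empty_text": int(text.fillna("").str.strip().eq("").sum()),
    }


# Made-up stand-in data; in the notebook, pass `df` instead
sample = pd.DataFrame({
    "text": ["Great product", "Great product", "  ", None],
    "sentiment": ["positive", "positive", "neutral", "negative"],
})

report = quality_report(sample)
for name, value in report.items():
    print(f"{name}: {value}")
```

Running `quality_report(df)` after the load step would confirm or contradict the summary above before any modeling work starts.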

### Text Characteristics
- Average review length: ~40 characters
- Average word count: ~7 words
- Short, concise reviews typical of product feedback

### Recommendations for Model Development
1. **Class Imbalance**: Consider class weighting or sampling techniques for the underrepresented negative class
2. **Text Preprocessing**: Reviews are relatively clean but may benefit from:
   - Lowercasing
   - Punctuation handling
   - URL/HTML removal if present in real data
3. **Feature Engineering**: Consider:
   - Text length as a feature
   - Product category encoding
   - Helpful votes as a quality indicator
4. **Model Selection**: DistilBERT is appropriate for this task given:
   - Short text sequences
   - Need for fast inference
   - Good performance on sentiment analysis
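
As an illustration of the class-weighting option in recommendation 1, the sketch below computes inverse-frequency weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical, and the label list is a stand-in built from the 60/25/15 split reported above rather than from the real `df['sentiment']` column.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Stand-in labels matching the reported split (600 / 250 / 150 of 1,000)
labels = ["positive"] * 600 + ["neutral"] * 250 + ["negative"] * 150

weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

With this split, the negative class receives roughly four times the weight of the positive class. Whether weighting or resampling works better is something to check empirically on a validation set.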

### Next Steps
1. Proceed to model training notebook
2. Implement text preprocessing pipeline
3. Fine-tune DistilBERT on this dataset
4. Evaluate model performance and optimize
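
A possible starting point for the preprocessing pipeline in step 2 is sketched below using only the standard library. It covers the steps listed under Text Preprocessing (lowercasing, punctuation handling, URL/HTML removal). The function name `clean_review` and the regular expressions are assumptions for illustration, not part of the project code. Punctuation is only collapsed rather than removed, on the assumption that a subword tokenizer such as DistilBERT's handles it on its own.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags such as <br/>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare URLs
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")   # runs like "!!!" or "..."
SPACE_RE = re.compile(r"\s+")


def clean_review(text):
    """Normalize a raw review string for downstream tokenization."""
    text = html.unescape(text)             # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)           # drop HTML tags
    text = URL_RE.sub(" ", text)           # drop URLs
    text = text.lower()                    # lowercase
    text = REPEAT_PUNCT_RE.sub(r"\1", text)  # "!!!" -> "!"
    return SPACE_RE.sub(" ", text).strip()   # collapse whitespace


print(clean_review("GREAT product!!! <br/>See https://example.com now"))
print(clean_review("Works &amp; looks good"))
```

In the notebook this could be applied with `df['text'].apply(clean_review)` before tokenization.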