# Sentiment Analysis: VADER vs AWS Comprehend

This notebook compares two approaches to sentiment analysis:
1. **VADER** (Valence Aware Dictionary and sEntiment Reasoner) - Free, local analysis
2. **AWS Comprehend** - Managed ML service, pay-per-use

## Objectives

1. Load and preprocess social media data
2. Perform sentiment analysis with VADER
3. Perform sentiment analysis with AWS Comprehend
4. Compare results and measure agreement between the two methods
5. Visualize sentiment distributions
6. Analyze sentiment trends over time

## When to Use Each Approach

**VADER**:
- ✅ Free (no API costs)
- ✅ Fast for large datasets
- ✅ Optimized for social media text
- ⚠️ Less accurate on complex sentiment
- ⚠️ English-only

**AWS Comprehend**:
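- ✅ Typically higher accuracy on complex sentiment (deep learning-based)
- ✅ Multi-language sentiment support (roughly a dozen languages; the 100-language figure applies to language detection, not sentiment)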
- ✅ Confidence scores for each sentiment
- ⚠️ Costs ~$0.0001 per 100 characters (billed in 100-character units, with a 3-unit minimum per request)
- ⚠️ API rate limits
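
One way to soften the rate-limit concern is to send posts in batches instead of one request per post. The sketch below assumes `client` behaves like a boto3 Comprehend client, whose `batch_detect_sentiment` call accepts up to 25 documents per request. The helper names `chunked` and `batch_sentiment` are hypothetical and are not part of `social_media_analysis`.

```python
from typing import Iterator, List, Sequence

BATCH_LIMIT = 25  # documents accepted per BatchDetectSentiment request


def chunked(items: Sequence[str], size: int = BATCH_LIMIT) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def batch_sentiment(client, texts: Sequence[str], language_code: str = "en") -> List[str]:
    """Return one lower-cased label per text, or "error" where a document failed.

    `client` is assumed to behave like boto3.client("comprehend").
    """
    labels = ["error"] * len(texts)
    offset = 0
    for batch in chunked(texts):
        response = client.batch_detect_sentiment(
            TextList=list(batch), LanguageCode=language_code
        )
        # Each result carries the position of its document within the batch.
        for item in response["ResultList"]:
            labels[offset + item["Index"]] = item["Sentiment"].lower()
        offset += len(batch)
    return labels
```

Batching lowers the number of requests, not the amount of text analysed, so it helps with throttling and leaves the cost trade-off unchanged.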

## 1. Setup

In [None]:
import os
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from tqdm import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sys.path.insert(0, str(Path("..").resolve()))

from social_media_analysis import ComprehendAnalyzer, preprocess_text

plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("Set2")
%matplotlib inline

print("✓ Imports successful")

In [None]:
# Load configuration
load_dotenv(Path("..") / ".env")

DATA_BUCKET = os.getenv("DATA_BUCKET")
RESULTS_BUCKET = os.getenv("RESULTS_BUCKET")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
USE_COMPREHEND = os.getenv("USE_COMPREHEND", "false").lower() == "true"

print("Configuration:")
print(f"  Data Bucket: {DATA_BUCKET}")
print(f"  Use Comprehend: {USE_COMPREHEND}")

## 2. Load Data

In [None]:
# Load sample data
df = pd.read_csv("../../studio-lab/sample_data.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(f"Loaded {len(df)} posts")
df.head()

## 3. Text Preprocessing

In [None]:
# Preprocess text
df["clean_text"] = df["text"].apply(lambda x: preprocess_text(x, remove_stopwords=False))

# Show example
print("Original vs Cleaned Text:")
for idx in range(3):
    print(f"\n{idx + 1}. Original:")
    print(f"   {df.iloc[idx]['text']}")
    print("   Cleaned:")
    print(f"   {df.iloc[idx]['clean_text']}")

## 4. VADER Sentiment Analysis
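
For context on the `compound` score used throughout this section: VADER sums the valence of the words it recognises (after its rule-based adjustments) and squashes that sum into the range [-1, 1]. The sketch below reproduces only that squashing step, using the library's default constant of 15. The name `squash` is hypothetical, and the inputs are made-up valence sums, not scores from real posts.

```python
import math


def squash(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, score))


for total in (-4.0, -0.2, 0.0, 0.2, 4.0):
    print(f"{total:+.1f} -> {squash(total):+.3f}")
```

The curve is steep near zero, so a valence sum of about 0.2 already clears the 0.05 cut-off that separates neutral from positive.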

In [None]:
# Initialize VADER
vader = SentimentIntensityAnalyzer()


# Analyze sentiment
def vader_sentiment(text):
    scores = vader.polarity_scores(text)
    return pd.Series(
        {
            "vader_compound": scores["compound"],
            "vader_pos": scores["pos"],
            "vader_neu": scores["neu"],
            "vader_neg": scores["neg"],
        }
    )


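print("Running VADER sentiment analysis...")
# VADER is applied to the raw text on purpose: it uses capitalization, punctuation,
# and emoticons as intensity cues, which cleaning may strip out.
vader_scores = df["text"].apply(vader_sentiment)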
df = pd.concat([df, vader_scores], axis=1)


# Classify sentiment based on compound score
def classify_vader(compound):
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"


df["vader_sentiment"] = df["vader_compound"].apply(classify_vader)

print("✓ VADER analysis complete")
print("\nVADER Sentiment Distribution:")
print(df["vader_sentiment"].value_counts())

In [None]:
# Visualize VADER results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sentiment distribution
df["vader_sentiment"].value_counts().plot(kind="bar", ax=axes[0], color="steelblue")
axes[0].set_title("VADER Sentiment Distribution")
axes[0].set_xlabel("Sentiment")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis="x", rotation=45)

# Compound score distribution
axes[1].hist(df["vader_compound"], bins=20, color="coral", edgecolor="black")
axes[1].set_title("VADER Compound Score Distribution")
axes[1].set_xlabel("Compound Score")
axes[1].set_ylabel("Frequency")
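# Mark the cut-offs used by classify_vader instead of a single line at zero
axes[1].axvline(-0.05, color="red", linestyle="--", label="Neutral band (±0.05)")
axes[1].axvline(0.05, color="red", linestyle="--")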
axes[1].legend()

plt.tight_layout()
plt.show()

## 5. AWS Comprehend Sentiment Analysis

**Note**: This section calls a paid API (~$0.0001 per 100 characters, with a 300-character minimum charge per request). Set `USE_COMPREHEND=false` in `.env` to skip it.
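
The per-character rate understates the bill for short posts. A rough estimator might look like the sketch below, which assumes Comprehend's published pricing model at the time of writing: units of 100 characters, a 3-unit minimum per request, and $0.0001 per unit at the lowest volume tier. `estimate_cost` is a hypothetical helper, and current prices should be checked before relying on the figure.

```python
import math
from typing import Iterable

PRICE_PER_UNIT = 0.0001  # USD per unit; assumed first-tier rate
UNIT_CHARS = 100         # characters per billing unit
MIN_UNITS = 3            # assumed minimum charge per request


def estimate_cost(texts: Iterable[str]) -> float:
    """Estimate the USD cost of sending one sentiment request per text."""
    units = sum(max(MIN_UNITS, math.ceil(len(t) / UNIT_CHARS)) for t in texts)
    return units * PRICE_PER_UNIT


posts = ["short post", "x" * 280, "y" * 450]
print(f"${estimate_cost(posts):.4f}")
```

Under these assumptions the three sample posts bill as 3, 3, and 5 units, so the 10-character post costs the same as the 280-character one.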

In [None]:
if USE_COMPREHEND:
    # Initialize Comprehend client
    comprehend = ComprehendAnalyzer(region=AWS_REGION)

    def analyze_with_comprehend(text):
        try:
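            # Comprehend limits input by UTF-8 byte size (about 5 KB), not character count
            truncated = text.encode("utf-8")[:5000].decode("utf-8", errors="ignore")
            result = comprehend.analyze_sentiment(truncated)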
            return pd.Series(
                {
                    "comprehend_sentiment": result["Sentiment"].lower(),
                    "comprehend_positive": result["SentimentScore"]["Positive"],
                    "comprehend_negative": result["SentimentScore"]["Negative"],
                    "comprehend_neutral": result["SentimentScore"]["Neutral"],
                    "comprehend_mixed": result["SentimentScore"]["Mixed"],
                }
            )
        except Exception as e:
            print(f"Error: {e}")
            return pd.Series(
                {
                    "comprehend_sentiment": "error",
                    "comprehend_positive": 0,
                    "comprehend_negative": 0,
                    "comprehend_neutral": 0,
                    "comprehend_mixed": 0,
                }
            )

    print("Running AWS Comprehend sentiment analysis...")
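    # Billing is per 100-character unit, with a 3-unit minimum per request
    billable_units = ((df["text"].str.len() + 99) // 100).clip(lower=3).sum()
    print(f"Estimated cost: ${billable_units * 0.0001:.4f}")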

    # Process with progress bar
    tqdm.pandas(desc="Comprehend")
    comprehend_scores = df["text"].progress_apply(analyze_with_comprehend)
    df = pd.concat([df, comprehend_scores], axis=1)

    print("✓ Comprehend analysis complete")
    print("\nComprehend Sentiment Distribution:")
    print(df["comprehend_sentiment"].value_counts())
else:
    print("⚠️ Comprehend disabled. Set USE_COMPREHEND=true in .env to enable.")

In [None]:
if USE_COMPREHEND:
    # Visualize Comprehend results
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Sentiment distribution
    df["comprehend_sentiment"].value_counts().plot(kind="bar", ax=axes[0], color="teal")
    axes[0].set_title("Comprehend Sentiment Distribution")
    axes[0].set_xlabel("Sentiment")
    axes[0].set_ylabel("Count")
    axes[0].tick_params(axis="x", rotation=45)

    # Confidence scores
    score_cols = [
        "comprehend_positive",
        "comprehend_negative",
        "comprehend_neutral",
        "comprehend_mixed",
    ]
    df[score_cols].mean().plot(kind="bar", ax=axes[1], color="salmon")
    axes[1].set_title("Average Comprehend Confidence Scores")
    axes[1].set_xlabel("Sentiment Type")
    axes[1].set_ylabel("Average Confidence")
    axes[1].tick_params(axis="x", rotation=45)

    plt.tight_layout()
    plt.show()

## 6. Compare VADER vs Comprehend
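
Raw percentage agreement can look high simply because one label dominates both outputs. Cohen's kappa is one common way to correct for chance agreement. It is offered here as a supplementary measure, not as something this notebook computes. The sketch is pure Python, `cohen_kappa` is a hypothetical helper, and the labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both sequences must be non-empty and the same length")
    n = len(a)
    # Share of items on which the two labellers agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if each labeller assigned labels independently
    # at their own observed label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


vader_labels = ["positive", "positive", "neutral", "negative"]
comprehend_labels = ["positive", "neutral", "neutral", "negative"]
print(f"kappa = {cohen_kappa(vader_labels, comprehend_labels):.3f}")
```

In this notebook the inputs could be `df["vader_sentiment"].tolist()` and `df["comprehend_sentiment"].tolist()`. Values near 0 indicate agreement no better than chance, and values near 1 indicate strong agreement.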

In [None]:
if USE_COMPREHEND:
    # Comparison matrix
    comparison = pd.crosstab(
        df["vader_sentiment"],
        df["comprehend_sentiment"],
        rownames=["VADER"],
        colnames=["Comprehend"],
    )

    print("Sentiment Agreement Matrix:")
    print(comparison)

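    # Calculate agreement percentage, ignoring posts where the Comprehend call failed
    scored = df[df["comprehend_sentiment"] != "error"]
    agreement = (scored["vader_sentiment"] == scored["comprehend_sentiment"]).sum()
    agreement_pct = (agreement / len(scored)) * 100 if len(scored) else 0.0

    # VADER never emits "mixed", so those posts always count as disagreements
    print(f"\nAgreement: {agreement}/{len(scored)} ({agreement_pct:.1f}%)")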

In [None]:
if USE_COMPREHEND:
    # Visualize comparison
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(comparison, annot=True, fmt="d", cmap="YlOrRd", ax=ax, cbar_kws={"label": "Count"})
    ax.set_title("VADER vs Comprehend Agreement Matrix")
    plt.tight_layout()
    plt.show()

In [None]:
if USE_COMPREHEND:
    # Analyze disagreements
    disagreements = df[df["vader_sentiment"] != df["comprehend_sentiment"]]

    print(f"Found {len(disagreements)} disagreements ({len(disagreements) / len(df) * 100:.1f}%)")
    print("\nExample disagreements:")

    for idx, row in disagreements.head(5).iterrows():
        label = row["comprehend_sentiment"]
        # Failed calls are labelled "error" and have no matching score column
        confidence = row.get(f"comprehend_{label}", float("nan"))
        print(f"\nText: {row['text'][:100]}...")
        print(f"VADER: {row['vader_sentiment']} (compound: {row['vader_compound']:.3f})")
        print(f"Comprehend: {label} (confidence: {confidence:.3f})")

## 7. Sentiment Over Time

In [None]:
# Sentiment trend using VADER
df["date"] = df["timestamp"].dt.date
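# Fix the column order so each colour always maps to the same sentiment,
# even when a class is missing from the data
sentiment_order = ["negative", "neutral", "positive"]
daily_sentiment = (
    df.groupby(["date", "vader_sentiment"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=sentiment_order, fill_value=0)
)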

fig, ax = plt.subplots(figsize=(12, 6))
daily_sentiment.plot(kind="bar", stacked=True, ax=ax, color=["#d62728", "#7f7f7f", "#2ca02c"])
ax.set_title("Sentiment Distribution Over Time (VADER)")
ax.set_xlabel("Date")
ax.set_ylabel("Number of Posts")
ax.legend(title="Sentiment")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Average compound score over time
avg_compound = df.groupby("date")["vader_compound"].mean()

fig, ax = plt.subplots(figsize=(12, 6))
avg_compound.plot(kind="line", marker="o", ax=ax, color="purple", linewidth=2)
ax.axhline(0, color="red", linestyle="--", alpha=0.5, label="Neutral")
ax.fill_between(avg_compound.index, avg_compound.values, 0, alpha=0.3, color="purple")
ax.set_title("Average Sentiment Score Over Time")
ax.set_xlabel("Date")
ax.set_ylabel("Average Compound Score")
ax.legend()
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 8. Sentiment by Platform

In [None]:
# Platform comparison
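platform_sentiment = pd.crosstab(df["platform"], df["vader_sentiment"], normalize="index") * 100
# Fix the column order so each colour always maps to the same sentiment
platform_sentiment = platform_sentiment.reindex(
    columns=["negative", "neutral", "positive"], fill_value=0
)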

fig, ax = plt.subplots(figsize=(10, 6))
platform_sentiment.plot(kind="bar", ax=ax, color=["#d62728", "#7f7f7f", "#2ca02c"])
ax.set_title("Sentiment Distribution by Platform (%)")
ax.set_xlabel("Platform")
ax.set_ylabel("Percentage")
ax.legend(title="Sentiment")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()

## 9. Engagement vs Sentiment

In [None]:
# Calculate total engagement
df["total_engagement"] = df["retweets"] + df["likes"] + df["replies"]

# Engagement by sentiment
sentiment_engagement = df.groupby("vader_sentiment")["total_engagement"].agg(
    ["mean", "median", "std"]
)

print("Engagement Statistics by Sentiment:")
print(sentiment_engagement)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
df.boxplot(column="total_engagement", by="vader_sentiment", ax=axes[0])
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." figure title
axes[0].set_title("Engagement Distribution by Sentiment")
axes[0].set_xlabel("Sentiment")
axes[0].set_ylabel("Total Engagement")
axes[0].tick_params(axis="x", rotation=45)

# Bar plot of means
sentiment_engagement["mean"].plot(kind="bar", ax=axes[1], color="skyblue")
axes[1].set_title("Average Engagement by Sentiment")
axes[1].set_xlabel("Sentiment")
axes[1].set_ylabel("Average Total Engagement")
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

## 10. Save Results

In [None]:
# Save enriched dataset
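output_file = Path("../../results/sentiment_analysis_results.csv")
output_file.parent.mkdir(parents=True, exist_ok=True)  # create the results directory if missing
df.to_csv(output_file, index=False)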
print(f"✓ Results saved to {output_file}")

# Uncomment to save to S3 (SocialMediaDataAccess is not imported in this notebook; import it first)
# data_client = SocialMediaDataAccess(region=AWS_REGION)
# data_client.save_results(df, 'sentiment_analysis_results.csv')

## Key Findings

### VADER Analysis
- **Positive**: X% of posts
- **Negative**: Y% of posts
- **Neutral**: Z% of posts
- **Average compound score**: [value]

### AWS Comprehend Analysis (if enabled)
- **Agreement with VADER**: X%
- **Main differences**: [describe]

### Engagement Insights
- [Sentiment type] posts receive highest engagement
- Average engagement: [value]

### Platform Differences
- [Platform] shows most positive sentiment
- [Platform] shows most negative sentiment

## Recommendations

1. **For most use cases**: Use VADER; it is free and fast.
2. **For high-accuracy needs**: Use Comprehend and budget for the per-request cost.
3. **For non-English text**: Use Comprehend, since VADER is English-only.
4. **For large-scale analysis**: Run VADER on everything first, then run Comprehend on a subset to control cost.
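
Recommendation 4 could be implemented by sending only the posts VADER is least sure about to Comprehend. The sketch below treats a compound score close to zero as "unsure". The 0.3 margin and the name `needs_second_opinion` are assumptions for illustration, not tuned values.

```python
from typing import List, Sequence


def needs_second_opinion(compounds: Sequence[float], margin: float = 0.3) -> List[int]:
    """Return positions whose compound score lies strictly inside (-margin, margin)."""
    return [i for i, c in enumerate(compounds) if abs(c) < margin]


scores = [0.82, 0.04, -0.15, -0.91, 0.29]
print(needs_second_opinion(scores))
```

With the DataFrame used in this notebook, an equivalent filter would be `df[df["vader_compound"].abs() < 0.3]`.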

## Next Steps

1. **Misinformation Detection**: Run `03-misinformation-detection.ipynb`
2. **Network Analysis**: Run `04-network-analysis.ipynb`