# Misinformation Detection and Risk Scoring

This notebook demonstrates pattern-based misinformation detection using a multi-factor risk scoring system.

## Objectives

1. Load social media data with sentiment analysis
2. Detect misinformation patterns (excessive caps, urgency language, vague sources)
3. Calculate risk scores (0-10 scale)
4. Identify high-risk posts
5. Analyze characteristics of potentially misleading content
6. Visualize risk distribution

## Misinformation Indicators

Our risk scoring system evaluates:

1. **Excessive Capitalization**: Words in ALL CAPS can signal emotional manipulation
2. **Excessive Punctuation**: Runs of `!!!` or `???` suggest sensationalism
3. **Urgency Language**: "BREAKING", "URGENT", "ACT NOW" create false urgency
4. **Vague Sources**: "Sources say" or "experts claim" with no named source
5. **Call-to-Action**: "SHARE THIS", "SPREAD THE WORD" push for virality
6. **Conspiracy Language**: "They don't want you to know", "wake up", "cover-up"

**Risk Score**: Weighted sum of the detected indicators on a 0-10 scale, where 0 = lowest risk and 10 = highest risk
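
To make the weighted sum concrete, here is a minimal sketch of how such a score could be computed. It is not the implementation of `detect_misinformation_patterns` (that function lives in `social_media_analysis` and is not shown in this notebook). The regular expressions, the weights in `ASSUMED_WEIGHTS`, the indicator key names, and the function name `sketch_risk_score` are all illustrative assumptions.

```python
import re

# Hypothetical weights, chosen only so that they sum to 10.
ASSUMED_WEIGHTS = {
    "excessive_caps": 2,
    "excessive_punctuation": 1,
    "urgency_language": 2,
    "vague_sources": 2,
    "call_to_action": 1,
    "conspiracy_language": 2,
}

# Hypothetical phrase patterns built from the examples listed above.
ASSUMED_PHRASES = {
    "urgency_language": r"\b(breaking|urgent|act now)\b",
    "vague_sources": r"\b(sources say|experts claim)\b",
    "call_to_action": r"\b(share this|spread the word)\b",
    "conspiracy_language": r"(they don't want you to know|wake up|cover-up)",
}


def sketch_risk_score(text: str) -> dict:
    """Return one boolean flag per indicator plus a weighted 0-10 score."""
    flags = {
        # Two or more all-caps words of at least three letters
        "excessive_caps": len(re.findall(r"\b[A-Z]{3,}\b", text)) >= 2,
        # Three or more consecutive ! or ? characters
        "excessive_punctuation": re.search(r"[!?]{3,}", text) is not None,
    }
    for name, pattern in ASSUMED_PHRASES.items():
        flags[name] = re.search(pattern, text, flags=re.IGNORECASE) is not None
    # Weighted sum over the indicators that fired
    flags["risk_score"] = sum(ASSUMED_WEIGHTS[name] for name, hit in flags.items() if hit)
    return flags


example = sketch_risk_score(
    "BREAKING NEWS!!! Sources say they don't want you to know. SHARE THIS"
)
print(example["risk_score"])
```

Because every indicator fires on the example string, the score reaches the maximum allowed by the assumed weights. A real scorer would likely tune both the patterns and the weights against labeled data.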

## Limitations

⚠️ This is a **pattern-based heuristic**, not a fact-checker:
- It does not verify factual accuracy
- It may flag legitimate urgent news
- It cannot detect sophisticated misinformation that avoids these surface patterns
- Use it as a screening tool, not as a definitive classifier

## 1. Setup

In [None]:
import os
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from wordcloud import WordCloud

sys.path.insert(0, str(Path("..").resolve()))

from social_media_analysis import detect_misinformation_patterns

plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("rocket")
%matplotlib inline

print("✓ Imports successful")

In [None]:
# Load configuration
load_dotenv(Path("..") / ".env")

DATA_BUCKET = os.getenv("DATA_BUCKET")
RESULTS_BUCKET = os.getenv("RESULTS_BUCKET")

print("Configuration loaded")

## 2. Load Data with Sentiment

In [None]:
# Load sample data
df = pd.read_csv("../../studio-lab/sample_data.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(f"Loaded {len(df)} posts")
df.head()

## 3. Detect Misinformation Patterns

In [None]:
# Apply misinformation detection
print("Analyzing misinformation patterns...")
misinfo_results = df["text"].apply(detect_misinformation_patterns)

# Convert results to DataFrame columns (reuse df's index so concat aligns row by row)
misinfo_df = pd.DataFrame(misinfo_results.tolist(), index=df.index)
df = pd.concat([df, misinfo_df], axis=1)

print("✓ Pattern detection complete")
print("\nRisk Score Statistics:")
print(df["risk_score"].describe())
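
Before plotting, it can help to confirm that the detector output has the shape the rest of the notebook expects. The helper below is a small sketch of such a check; `check_risk_columns` is a hypothetical name, and the 0-10 bound comes from the scale described at the top of this notebook rather than from the library itself.

```python
import pandas as pd


def check_risk_columns(frame: pd.DataFrame, score_col: str = "risk_score") -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    if score_col not in frame.columns:
        return [f"missing column: {score_col}"]
    problems = []
    scores = frame[score_col]
    if scores.isna().any():
        problems.append("risk scores contain missing values")
    # between() is inclusive on both ends by default
    if not scores.dropna().between(0, 10).all():
        problems.append("risk scores fall outside the 0-10 range")
    return problems


print(check_risk_columns(pd.DataFrame({"risk_score": [0, 5, 10]})))
```

Running `check_risk_columns(df)` right after the `pd.concat` step would surface misaligned rows (which show up as missing values) before they distort the statistics.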

In [None]:
# Show example detections
print("Example Pattern Detections:\n")
for idx in range(min(5, len(df))):
    row = df.iloc[idx]
    print(f"{idx + 1}. Text: {row['text'][:80]}...")
    print(f"   Risk Score: {row['risk_score']}")
    print("   Indicators:")
    for col in [c for c in misinfo_df.columns if c != "risk_score"]:
        if row[col]:
            print(f"     - {col.replace('_', ' ').title()}")
    print()

## 4. Risk Score Distribution

In [None]:
# Visualize risk score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df["risk_score"], bins=range(0, 12), edgecolor="black", color="crimson", alpha=0.7)
axes[0].set_title("Risk Score Distribution")
axes[0].set_xlabel("Risk Score")
axes[0].set_ylabel("Frequency")
axes[0].axvline(
    df["risk_score"].mean(),
    color="yellow",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {df['risk_score'].mean():.2f}",
)
axes[0].axvline(7, color="red", linestyle="--", linewidth=2, label="High Risk Threshold")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(df["risk_score"], vert=True)
axes[1].set_title("Risk Score Box Plot")
axes[1].set_ylabel("Risk Score")
axes[1].axhline(7, color="red", linestyle="--", alpha=0.5, label="High Risk Threshold")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Identify High-Risk Posts

In [None]:
# Define risk categories
def categorize_risk(score):
    if score >= 7:
        return "High Risk"
    elif score >= 4:
        return "Medium Risk"
    else:
        return "Low Risk"


df["risk_category"] = df["risk_score"].apply(categorize_risk)

print("Risk Category Distribution:")
print(df["risk_category"].value_counts())
print("\nPercentages:")
print((df["risk_category"].value_counts() / len(df) * 100).round(2))
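
The `categorize_risk` function above can also be written with `pd.cut`, which returns an ordered categorical. This is a sketch of an alternative, not a change to the notebook's logic: the thresholds 4 and 7 are taken from `categorize_risk`, while the names `RISK_LABELS` and `categorize_risk_vectorized` are made up for this example.

```python
import numpy as np
import pandas as pd

RISK_LABELS = ["Low Risk", "Medium Risk", "High Risk"]


def categorize_risk_vectorized(scores: pd.Series) -> pd.Series:
    # right=False closes each bin on the left: [-inf, 4), [4, 7), [7, inf)
    return pd.cut(scores, bins=[-np.inf, 4, 7, np.inf], labels=RISK_LABELS, right=False)


demo = categorize_risk_vectorized(pd.Series([0, 3.9, 4, 6.9, 7, 10]))
print(demo.tolist())
# Counts come back in Low/Medium/High order instead of by frequency
print(demo.value_counts(sort=False))
```

One practical benefit of the ordered categorical is that `groupby` and `value_counts(sort=False)` follow the Low/Medium/High order, which keeps fixed color lists aligned with the right category in plots.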

In [None]:
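# Visualize risk categories
fig, ax = plt.subplots(figsize=(8, 6))
# Fix the category order so each slice gets its intended color, and drop
# empty categories so colors/explode always match the number of slices
risk_colors = {"Low Risk": "green", "Medium Risk": "orange", "High Risk": "red"}
risk_counts = df["risk_category"].value_counts().reindex(list(risk_colors), fill_value=0)
risk_counts = risk_counts[risk_counts > 0]
ax.pie(
    risk_counts,
    labels=risk_counts.index,
    autopct="%1.1f%%",
    startangle=90,
    colors=[risk_colors[cat] for cat in risk_counts.index],
    explode=[0.1 if cat == "High Risk" else 0.05 for cat in risk_counts.index],
)
ax.set_title("Risk Category Distribution", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()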

In [None]:
# High-risk posts
high_risk = df[df["risk_category"] == "High Risk"].sort_values("risk_score", ascending=False)

print(f"High-Risk Posts: {len(high_risk)}")
print("\nTop 5 Highest Risk Posts:\n")

for _, row in high_risk.head(5).iterrows():
    print(f"Risk Score: {row['risk_score']}")
    print(f"Text: {row['text']}")
    print(f"Platform: {row['platform']}")
    print(f"Engagement: {row['retweets'] + row['likes'] + row['replies']}")
    print("Indicators detected: ", end="")
    indicators = [
        col.replace("_", " ") for col in misinfo_df.columns if col != "risk_score" and row[col]
    ]
    print(", ".join(indicators))
    print("-" * 80)

## 6. Pattern Analysis

In [None]:
# Frequency of each indicator
indicator_cols = [col for col in misinfo_df.columns if col != "risk_score"]
indicator_freq = df[indicator_cols].sum().sort_values(ascending=False)

print("Indicator Frequency:")
print(indicator_freq)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
indicator_freq.plot(kind="barh", ax=ax, color="darkred")
ax.set_title("Misinformation Indicator Frequency")
ax.set_xlabel("Number of Posts")
ax.set_ylabel("Indicator")
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()

In [None]:
# Average risk score for posts that trigger each indicator
print("Risk Score by Indicator:\n")

for indicator in indicator_cols:
    # Cast to bool so the mask also works if flags are stored as 0/1 integers
    posts_with_indicator = df[df[indicator].astype(bool)]
    if len(posts_with_indicator) > 0:
        avg_risk = posts_with_indicator["risk_score"].mean()
        print(f"{indicator.replace('_', ' ').title()}:")
        print(f"  Posts: {len(posts_with_indicator)}")
        print(f"  Average Risk Score: {avg_risk:.2f}")
        print()
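
The loop above reports, for each indicator, how many posts trigger it and their average score. To see which indicators appear together in the same post, a pairwise count matrix is one option. The sketch below assumes the indicator columns are boolean (or 0/1) flags; the column names in `demo_flags` and the function name `indicator_cooccurrence` are placeholders, not names confirmed by the library.

```python
import pandas as pd


def indicator_cooccurrence(flags: pd.DataFrame) -> pd.DataFrame:
    """Count posts in which each pair of indicators fires together."""
    as_int = flags.astype(bool).astype(int)
    # (indicators x posts) @ (posts x indicators) -> pairwise counts;
    # the diagonal holds each indicator's own frequency
    return as_int.T.dot(as_int)


# Toy flags with hypothetical indicator names
demo_flags = pd.DataFrame(
    {
        "urgency_language": [True, True, False, True],
        "call_to_action": [True, False, False, True],
        "vague_sources": [False, True, False, False],
    }
)
cooc = indicator_cooccurrence(demo_flags)
print(cooc)
```

In the notebook this could be called as `indicator_cooccurrence(df[indicator_cols])`, and the result passed to `sns.heatmap` for a visual overview.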

## 7. Risk by Platform and Sentiment

In [None]:
# Add sentiment analysis (simple VADER)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
df["sentiment_compound"] = df["text"].apply(lambda x: vader.polarity_scores(x)["compound"])
df["sentiment"] = df["sentiment_compound"].apply(
    lambda x: "positive" if x >= 0.05 else ("negative" if x <= -0.05 else "neutral")
)

print("✓ Sentiment added")

In [None]:
# Risk by platform
platform_risk = df.groupby("platform")["risk_score"].agg(["mean", "median", "max"])
print("Risk Score by Platform:")
print(platform_risk)

fig, ax = plt.subplots(figsize=(10, 6))
# Plot mean and median as grouped bars; two separate bar calls would overlap
platform_risk[["mean", "median"]].rename(columns=str.title).plot(
    kind="bar", ax=ax, color=["coral", "steelblue"], alpha=0.7
)
ax.set_title("Mean and Median Risk Score by Platform")
ax.set_xlabel("Platform")
ax.set_ylabel("Risk Score")
ax.legend()
ax.tick_params(axis="x", rotation=45)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

In [None]:
# Risk by sentiment
sentiment_risk = df.groupby("sentiment")["risk_score"].agg(["mean", "median", "max"])
print("Risk Score by Sentiment:")
print(sentiment_risk)

fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(column="risk_score", by="sentiment", ax=ax)
fig.suptitle("")  # drop the automatic "Boxplot grouped by sentiment" title
ax.set_title("Risk Score Distribution by Sentiment")
ax.set_xlabel("Sentiment")
ax.set_ylabel("Risk Score")
plt.sca(ax)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 8. Engagement Analysis

In [None]:
# Calculate total engagement
df["total_engagement"] = df["retweets"] + df["likes"] + df["replies"]

# Correlation between risk and engagement
correlation = df[["risk_score", "total_engagement"]].corr()
print("Correlation Matrix:")
print(correlation)

# Scatter plot
fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(
    df["risk_score"], df["total_engagement"], c=df["risk_score"], cmap="Reds", alpha=0.6, s=100
)
ax.set_title("Risk Score vs Total Engagement")
ax.set_xlabel("Risk Score")
ax.set_ylabel("Total Engagement")
plt.colorbar(scatter, label="Risk Score", ax=ax)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
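
Engagement counts on social media tend to be heavily skewed, and a few viral posts can dominate a Pearson correlation. A rank-based (Spearman) correlation is a common complement. The sketch below computes it as the Pearson correlation of the ranks, which avoids any extra dependency; the toy numbers are invented for illustration and `spearman_via_ranks` is a hypothetical helper name.

```python
import pandas as pd


def spearman_via_ranks(x: pd.Series, y: pd.Series) -> float:
    # Spearman's rho equals the Pearson correlation of the (average) ranks
    return x.rank().corr(y.rank())


# Toy data: engagement grows monotonically with risk, but very non-linearly
demo = pd.DataFrame(
    {
        "risk_score": [1, 2, 3, 4, 5],
        "total_engagement": [2, 9, 30, 400, 90000],
    }
)
rho = spearman_via_ranks(demo["risk_score"], demo["total_engagement"])
pearson = demo["risk_score"].corr(demo["total_engagement"])
print(f"Spearman: {rho:.2f}, Pearson: {pearson:.2f}")
```

On the notebook's data, `spearman_via_ranks(df["risk_score"], df["total_engagement"])` would sit alongside the Pearson matrix printed above; a large gap between the two usually points to outliers or a non-linear relationship.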

In [None]:
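# Engagement by risk category
# groupby sorts categories alphabetically (High, Low, Medium), so reorder them
# explicitly to keep each bar matched with its intended color
risk_colors = {"Low Risk": "green", "Medium Risk": "orange", "High Risk": "red"}
present = set(df["risk_category"])
risk_order = [cat for cat in risk_colors if cat in present]
engagement_by_risk = (
    df.groupby("risk_category")["total_engagement"]
    .agg(["mean", "median", "sum"])
    .reindex(risk_order)
)
bar_colors = [risk_colors[cat] for cat in risk_order]
print("Engagement by Risk Category:")
print(engagement_by_risk)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Average engagement
engagement_by_risk["mean"].plot(kind="bar", ax=axes[0], color=bar_colors)
axes[0].set_title("Average Engagement by Risk Category")
axes[0].set_xlabel("Risk Category")
axes[0].set_ylabel("Average Engagement")
axes[0].tick_params(axis="x", rotation=45)

# Total engagement
engagement_by_risk["sum"].plot(kind="bar", ax=axes[1], color=bar_colors)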
axes[1].set_title("Total Engagement by Risk Category")
axes[1].set_xlabel("Risk Category")
axes[1].set_ylabel("Total Engagement")
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

## 9. Temporal Patterns

In [None]:
# Risk over time
df["date"] = df["timestamp"].dt.date
daily_risk = df.groupby("date")["risk_score"].agg(["mean", "max", "count"])

fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Average risk over time
axes[0].plot(daily_risk.index, daily_risk["mean"], marker="o", linewidth=2, color="crimson")
axes[0].fill_between(daily_risk.index, daily_risk["mean"], alpha=0.3, color="crimson")
axes[0].axhline(7, color="red", linestyle="--", label="High Risk Threshold")
axes[0].set_title("Average Risk Score Over Time")
axes[0].set_ylabel("Average Risk Score")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

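# Post volume by risk category
daily_risk_cat = df.groupby(["date", "risk_category"]).size().unstack(fill_value=0)
# unstack orders columns alphabetically; reorder so colors match Low/Medium/High
risk_colors = {"Low Risk": "green", "Medium Risk": "orange", "High Risk": "red"}
daily_risk_cat = daily_risk_cat.reindex(
    columns=[cat for cat in risk_colors if cat in daily_risk_cat.columns]
)
daily_risk_cat.plot(
    kind="bar", stacked=True, ax=axes[1], color=[risk_colors[cat] for cat in daily_risk_cat.columns]
)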
axes[1].set_title("Post Volume by Risk Category Over Time")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Number of Posts")
axes[1].legend(title="Risk Category")
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
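
Grouping by `dt.date` skips days that have no posts, so the line plot silently connects across gaps. If gaps matter, resampling on the timestamp keeps every calendar day and marks empty ones as missing. This is a sketch under the assumption that `timestamp` is already a datetime column, as it is after the loading step; the dates and scores below are invented, and `daily_mean_risk` is a hypothetical helper.

```python
import pandas as pd


def daily_mean_risk(
    frame: pd.DataFrame, time_col: str = "timestamp", score_col: str = "risk_score"
) -> pd.Series:
    """Mean risk score per calendar day, with NaN for days without posts."""
    series = frame.set_index(time_col)[score_col].sort_index()
    return series.resample("D").mean()


demo = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 08:00", "2024-01-01 20:00", "2024-01-03 12:00"]
        ),
        "risk_score": [2, 6, 8],
    }
)
daily = daily_mean_risk(demo)
print(daily)
```

Here the middle day has no posts, so it appears in the result with a missing value instead of being dropped. Chaining `.rolling(7, min_periods=1).mean()` onto the result would give a smoothed weekly trend.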

## 10. Word Cloud of High-Risk Posts

In [None]:
# Generate word cloud from high-risk posts
if len(high_risk) > 0:
    high_risk_text = " ".join(high_risk["text"].values)

    wordcloud = WordCloud(
        width=800, height=400, background_color="white", colormap="Reds", max_words=100
    ).generate(high_risk_text)

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")
    ax.set_title("Word Cloud: High-Risk Posts", fontsize=16, fontweight="bold")
    plt.tight_layout()
    plt.show()
else:
    print("No high-risk posts to visualize")

## 11. Save Results

In [None]:
# Save full results (create the output directory if it does not exist yet)
results_dir = Path("../../results")
results_dir.mkdir(parents=True, exist_ok=True)

output_file = results_dir / "misinformation_analysis_results.csv"
df.to_csv(output_file, index=False)
print(f"✓ Full results saved to {output_file}")

# Save high-risk posts separately
if len(high_risk) > 0:
    high_risk_file = results_dir / "high_risk_posts.csv"
    high_risk.to_csv(high_risk_file, index=False)
    print(f"✓ High-risk posts saved to {high_risk_file}")

# Uncomment to save to S3
# data_client = SocialMediaDataAccess()
# data_client.save_results(df, 'misinformation_analysis_results.csv')
# data_client.save_results(high_risk, 'high_risk_posts.csv')

## Key Findings

### Risk Distribution
- **High Risk**: X% of posts (score ≥ 7)
- **Medium Risk**: Y% of posts (4 ≤ score < 7)
- **Low Risk**: Z% of posts (score < 4)
- **Average Risk Score**: [value]

### Most Common Indicators
1. [Indicator 1]: X posts
2. [Indicator 2]: Y posts
3. [Indicator 3]: Z posts

### Platform Analysis
- [Platform] has highest average risk score: [value]
- [Platform] has most high-risk posts: [count]

### Sentiment Relationship
- [Sentiment] posts have highest risk scores
- Correlation with sentiment: [value]

### Engagement Patterns
- High-risk posts receive [more/less/similar] engagement than low-risk
- Correlation between risk and engagement: [value]

## Recommendations

1. **Manual Review**: All posts with risk score ≥ 7 should be manually reviewed
2. **Pattern Monitoring**: Track indicator trends over time to detect campaigns
3. **Enhanced Detection**: Consider ML-based fact-checking for high-risk posts
4. **User Education**: Flag high-risk content with educational context
5. **Platform Action**: Share findings with platform moderators
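
For the manual review recommendation, one possible way to order the work is a queue sorted by risk score and then by engagement, so that the most widely seen high-risk posts are reviewed first. The sketch below assumes the `risk_score` and `total_engagement` columns created earlier in this notebook; the function name `build_review_queue` and the `limit` parameter are hypothetical.

```python
import pandas as pd


def build_review_queue(frame: pd.DataFrame, threshold: float = 7, limit: int = 50) -> pd.DataFrame:
    """Return posts at or above the threshold, highest risk and engagement first."""
    flagged = frame[frame["risk_score"] >= threshold]
    ordered = flagged.sort_values(
        ["risk_score", "total_engagement"], ascending=[False, False]
    )
    return ordered.head(limit)


# Toy posts, invented for illustration
demo = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "risk_score": [9, 7, 3, 9],
        "total_engagement": [10, 500, 9000, 200],
    }
)
queue = build_review_queue(demo)
print(queue["text"].tolist())
```

Sorting on engagement as the tie-breaker is a design choice, not a requirement; a team could just as well prioritize by recency or by platform.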

## Next Steps

1. **Network Analysis**: Run `04-network-analysis.ipynb` to understand information spread
2. **Fact-Checking**: Integrate with fact-checking APIs for verification
3. **ML Model**: Train supervised model with labeled misinformation data