# Social Media Data Exploration

This notebook demonstrates how to load and explore social media datasets using the unified-studio package.

## Objectives

1. Configure AWS credentials and environment
2. Initialize data access client
3. Load data from multiple sources (Twitter, Reddit, CSV)
4. Perform data quality validation
5. Explore dataset characteristics
6. Generate summary statistics

## Prerequisites

- CloudFormation stack deployed
- `.env` file configured with bucket names and role ARN
- Python package installed: `pip install -e ..`
- Sample data uploaded to S3

## 1. Setup and Configuration

In [None]:
# Import required libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv
from pathlib import Path

# Add the project root (parent directory) to the path for local development
sys.path.insert(0, str(Path('..').resolve()))

# Import our package
from social_media_analysis import SocialMediaDataAccess

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("✓ Imports successful")

In [None]:
# Load environment variables
load_dotenv(Path('..') / '.env')

# Get configuration
DATA_BUCKET = os.getenv('DATA_BUCKET')
RESULTS_BUCKET = os.getenv('RESULTS_BUCKET')
AWS_REGION = os.getenv('AWS_REGION', 'us-east-1')

print("Configuration:")
print(f"  Data Bucket: {DATA_BUCKET}")
print(f"  Results Bucket: {RESULTS_BUCKET}")
print(f"  Region: {AWS_REGION}")
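If `.env` is missing or incomplete, `os.getenv` returns `None` without complaint, and the problem only surfaces later when an S3 call fails. A minimal sketch of an early check, assuming both bucket variables are required for the S3 steps; the helper name `missing_settings` is hypothetical and not part of the package:

```python
import os

def missing_settings(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumption: both bucket names are needed by the S3 examples later in the notebook.
missing = missing_settings(["DATA_BUCKET", "RESULTS_BUCKET"])
if missing:
    print(f"Warning: missing settings in .env: {', '.join(missing)}")
else:
    print("✓ Required settings found")
```

Printing a warning instead of raising keeps the local CSV walkthrough usable when no buckets are configured.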

## 2. Initialize Data Access Client

In [None]:
# Initialize data access client
data_client = SocialMediaDataAccess(
    use_anon=False,  # Use configured AWS credentials
    region=AWS_REGION
)

print("✓ Data access client initialized")

## 3. Load Sample Data

We start with the sample CSV from the Studio Lab directory (`../../studio-lab/sample_data.csv`) as a quick local test before moving on to S3.

In [None]:
# For this example, we'll use a local CSV file first
# In production, you would load from S3

# Load sample data from studio-lab
sample_df = pd.read_csv('../../studio-lab/sample_data.csv')

print(f"Loaded {len(sample_df)} posts")
print(f"\nDataset shape: {sample_df.shape}")
print(f"\nColumns: {list(sample_df.columns)}")

In [None]:
# Display first few rows
sample_df.head()

## 4. Data Quality Validation

In [None]:
# Basic validation
validation_results = data_client.validate_dataset(sample_df)

print("Validation Results:")
for key, value in validation_results.items():
    print(f"  {key}: {value}")
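The checks performed by `validate_dataset` are not shown in this notebook. As an independent cross-check, a few of the usual quality signals can be computed with pandas alone. This is a sketch, not the package's implementation: the function name `basic_quality_report` and the expected column list are assumptions based on the columns used later in this notebook.

```python
import pandas as pd

# Assumption: these are the columns the rest of this notebook relies on.
EXPECTED_COLUMNS = ["text", "timestamp", "platform", "retweets", "likes", "replies"]

def basic_quality_report(df, expected_columns=EXPECTED_COLUMNS):
    """Summarize row count, missing columns, duplicate rows, and null cells."""
    return {
        "rows": len(df),
        "missing_columns": [c for c in expected_columns if c not in df.columns],
        "duplicate_rows": int(df.duplicated().sum()),
        "null_cells": int(df.isnull().sum().sum()),
    }

# Tiny illustrative frame (made-up rows, not the real sample data).
demo_df = pd.DataFrame({
    "text": ["hello", "hello", None],
    "platform": ["twitter", "twitter", "reddit"],
    "likes": [1, 1, 5],
})
report = basic_quality_report(demo_df)
for key, value in report.items():
    print(f"  {key}: {value}")
```

To run it on the real data, pass `sample_df` in place of `demo_df`.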

In [None]:
# Check for missing values
print("Missing Values:")
print(sample_df.isnull().sum())
print("\nMissing percentage:")
print((sample_df.isnull().sum() / len(sample_df) * 100).round(2))

In [None]:
# Data types
print("Data Types:")
print(sample_df.dtypes)

In [None]:
# Convert timestamp to datetime
sample_df['timestamp'] = pd.to_datetime(sample_df['timestamp'])

print("✓ Timestamp converted to datetime")
print(f"Date range: {sample_df['timestamp'].min()} to {sample_df['timestamp'].max()}")
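By default, `pd.to_datetime` raises on the first value it cannot parse. For messy exports it can help to coerce bad values to `NaT` and count them first. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values for illustration; the middle entry is deliberately invalid.
demo = pd.DataFrame({
    "timestamp": ["2025-11-01 08:30:00", "not a date", "2025-11-02 17:45:00"]
})

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(demo["timestamp"], errors="coerce")
n_bad = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_bad} of {len(demo)}")
```

Rows with `NaT` can then be inspected or dropped before the temporal analysis.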

## 5. Exploratory Data Analysis

In [None]:
# Summary statistics for engagement metrics
engagement_cols = ['retweets', 'likes', 'replies']
print("Engagement Metrics Summary:")
sample_df[engagement_cols].describe()

In [None]:
# Platform distribution
print("Platform Distribution:")
print(sample_df['platform'].value_counts())

# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
sample_df['platform'].value_counts().plot(kind='bar', ax=ax)
ax.set_title('Posts by Platform')
ax.set_xlabel('Platform')
ax.set_ylabel('Number of Posts')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Text length analysis
sample_df['text_length'] = sample_df['text'].str.len()

print("Text Length Statistics:")
print(sample_df['text_length'].describe())

# Visualize distribution
fig, ax = plt.subplots(figsize=(10, 5))
sample_df['text_length'].hist(bins=20, ax=ax, edgecolor='black')
ax.set_title('Distribution of Post Length')
ax.set_xlabel('Text Length (characters)')
ax.set_ylabel('Frequency')
ax.axvline(sample_df['text_length'].mean(), color='red', 
           linestyle='--', label=f'Mean: {sample_df["text_length"].mean():.0f}')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Engagement analysis
sample_df['total_engagement'] = (
    sample_df['retweets'] + 
    sample_df['likes'] + 
    sample_df['replies']
)

print("Total Engagement Statistics:")
print(sample_df['total_engagement'].describe())

In [None]:
# Engagement by platform
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, metric in enumerate(engagement_cols):
    sample_df.groupby('platform')[metric].mean().plot(
        kind='bar', ax=axes[idx], color='skyblue'
    )
    axes[idx].set_title(f'Average {metric.capitalize()} by Platform')
    axes[idx].set_xlabel('Platform')
    axes[idx].set_ylabel(f'Average {metric.capitalize()}')
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Temporal analysis
sample_df['hour'] = sample_df['timestamp'].dt.hour
sample_df['day_of_week'] = sample_df['timestamp'].dt.day_name()

print("Posts by Hour:")
print(sample_df['hour'].value_counts().sort_index())

print("\nPosts by Day of Week:")
print(sample_df['day_of_week'].value_counts())
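`value_counts()` orders its result by frequency, which makes weekday counts hard to scan. A minimal sketch of putting day-of-week counts into calendar order with `reindex`, using a small made-up frame in place of `sample_df`:

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up timestamps for illustration.
demo = pd.DataFrame({"timestamp": pd.to_datetime([
    "2025-11-03 09:00",  # Monday
    "2025-11-03 21:00",  # Monday
    "2025-11-05 12:00",  # Wednesday
])})

# Count posts per weekday, then reorder Monday..Sunday, filling absent days with 0.
posts_by_day = (
    demo["timestamp"].dt.day_name()
    .value_counts()
    .reindex(DAY_ORDER, fill_value=0)
)
print(posts_by_day)
```

Filling absent days with zero keeps the axis complete if the counts are later plotted as a bar chart.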

In [None]:
# Correlation matrix for engagement metrics and text length
fig, ax = plt.subplots(figsize=(8, 6))
correlation = sample_df[engagement_cols + ['text_length']].corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm',
            square=True, ax=ax, cbar_kws={'label': 'Correlation'})
ax.set_title('Correlation Matrix: Engagement Metrics and Text Length')
plt.tight_layout()
plt.show()

## 6. Loading Data from S3 (Production)

In production, load data directly from S3. The cells below are commented-out templates: uncomment one and adjust the bucket, prefix, and filters for your data.

In [None]:
# Example: Load Twitter data from S3
# Uncomment and modify for your data

# twitter_df = data_client.load_twitter_dataset(
#     bucket=DATA_BUCKET,
#     prefix='twitter/2025/11/',
#     date_range=('2025-11-01', '2025-11-07'),
#     sample_size=10000  # Load 10K posts for testing
# )
# 
# print(f"Loaded {len(twitter_df)} tweets")
# twitter_df.head()

In [None]:
# Example: Load Reddit data from S3
# Uncomment and modify for your data

# reddit_df = data_client.load_reddit_dataset(
#     bucket=DATA_BUCKET,
#     prefix='reddit/2025/11/',
#     subreddits=['politics', 'news', 'worldnews'],
#     sample_size=10000
# )
# 
# print(f"Loaded {len(reddit_df)} Reddit posts")
# reddit_df.head()

In [None]:
# Example: Load CSV from S3
# Uncomment and modify for your data

# csv_df = data_client.load_csv_dataset(
#     bucket=DATA_BUCKET,
#     key='datasets/social_media_sample.csv'
# )
# 
# print(f"Loaded {len(csv_df)} posts from CSV")
# csv_df.head()
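For a quick check outside the package, pandas can read an `s3://` URI directly when the optional `s3fs` dependency is installed. The sketch below is an alternative route, not how `load_csv_dataset` is implemented. The helper names `s3_uri` and `read_csv_from_s3` and the bucket name are hypothetical.

```python
import pandas as pd

def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"

def read_csv_from_s3(bucket, key, **read_csv_kwargs):
    """Read a CSV object with pandas; requires the optional s3fs dependency."""
    return pd.read_csv(s3_uri(bucket, key), **read_csv_kwargs)

# Hypothetical bucket name for illustration; no network call is made here.
print(s3_uri("example-data-bucket", "datasets/social_media_sample.csv"))
```

Calling `read_csv_from_s3(DATA_BUCKET, 'datasets/social_media_sample.csv')` would then use whatever AWS credentials are configured in the environment.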

## 7. Save Results

In [None]:
# Save exploration results
exploration_summary = pd.DataFrame({
    'metric': ['total_posts', 'avg_text_length', 'avg_retweets', 
               'avg_likes', 'avg_replies', 'avg_total_engagement'],
    'value': [
        len(sample_df),
        sample_df['text_length'].mean(),
        sample_df['retweets'].mean(),
        sample_df['likes'].mean(),
        sample_df['replies'].mean(),
        sample_df['total_engagement'].mean()
    ]
})

print("Exploration Summary:")
print(exploration_summary)

# Uncomment to save to S3
# data_client.save_results(exploration_summary, 'exploration_summary.csv')
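While the S3 save is commented out, the summary can still be written to a local CSV and read back to confirm it round-trips. A minimal sketch with a made-up two-row summary and a temporary directory, so that nothing is left on disk:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up values standing in for exploration_summary.
summary = pd.DataFrame({
    "metric": ["total_posts", "avg_likes"],
    "value": [3.0, 2.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "exploration_summary.csv"
    summary.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    restored = pd.read_csv(out_path)

print(restored)
```

To keep the file, replace the temporary directory with a path of your choice.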

## 8. Key Findings

Fill in the placeholders below after running the notebook on your data:

1. **Data Quality**: Dataset contains [N] posts with [P]% missing values
2. **Engagement Patterns**: Average total engagement is [E], with [platform] showing the highest activity
3. **Text Characteristics**: Posts average [L] characters, ranging from [min] to [max]
4. **Temporal Patterns**: Peak posting hours are [hours]; the most active day is [day]
5. **Platform Distribution**: [platform] comprises [S]% of the dataset
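Several of the placeholders above can be computed rather than read off the printed output. A minimal sketch, assuming the `text` and `platform` columns used earlier in this notebook; the function name `key_findings` and the demo rows are made up for illustration:

```python
import pandas as pd

def key_findings(df):
    """Compute values for the data quality, text, and platform findings."""
    text_length = df["text"].str.len()
    platform_share = df["platform"].value_counts(normalize=True) * 100
    return {
        "total_posts": len(df),
        # Share of null cells across the whole frame, in percent.
        "missing_pct": round(float(df.isnull().mean().mean() * 100), 2),
        "avg_text_length": float(text_length.mean()),
        "min_text_length": int(text_length.min()),
        "max_text_length": int(text_length.max()),
        "top_platform": platform_share.idxmax(),
        "top_platform_pct": round(float(platform_share.max()), 1),
    }

# Made-up rows for illustration.
demo = pd.DataFrame({
    "text": ["short", "a longer post", "mid post"],
    "platform": ["twitter", "twitter", "reddit"],
})
findings = key_findings(demo)
for name, value in findings.items():
    print(f"{name}: {value}")
```

Passing `sample_df` instead of `demo` yields the numbers for the real dataset.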

## Next Steps

1. **Sentiment Analysis**: Run notebook `02-sentiment-analysis.ipynb`
2. **Misinformation Detection**: Run notebook `03-misinformation-detection.ipynb`
3. **Network Analysis**: Run notebook `04-network-analysis.ipynb`