# Social Media Network and Sentiment Analysis

## Research Overview

This notebook demonstrates social media analysis techniques that combine network science with sentiment analysis. We explore:

- **Network Structure**: Understanding social connections and community formation
- **Sentiment Dynamics**: Analyzing emotional content in social media posts
- **Influence Patterns**: Identifying key actors and information spreaders
- **Information Diffusion**: Tracking how content spreads through networks
- **Community Detection**: Discovering natural groupings in social networks

### Research Questions

1. How does network structure influence information spread?
2. What sentiment patterns emerge in different communities?
3. Who are the most influential users and why?
4. How do retweet cascades propagate through the network?
5. What topics generate the most engagement?

### Methodology

We use synthetic data that mimics real social media platforms (similar to Twitter/X) to demonstrate analysis techniques that can be applied to real datasets.

## 1. Environment Setup and Library Imports

In [None]:
# Standard library
import random
import warnings
from collections import defaultdict
from datetime import datetime, timedelta

# Core data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Network analysis
import networkx as nx
from networkx.algorithms import community

# Natural language processing
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Suppress library warnings to keep notebook output readable
warnings.filterwarnings("ignore")

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Configure visualization defaults
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")
plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams["font.size"] = 10

print("Libraries imported successfully")
print(f"NetworkX version: {nx.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Download required NLTK data
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    print("Downloading NLTK punkt tokenizer...")
    nltk.download("punkt", quiet=True)

try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    print("Downloading NLTK stopwords...")
    nltk.download("stopwords", quiet=True)

print("NLTK resources ready")

## 2. Synthetic Social Media Data Generation

We generate realistic synthetic social media data including:
- User profiles with varying activity levels
- Follower network in which more active users are more likely to be followed (a simplified, activity-weighted form of preferential attachment)
- Posts with timestamps, content, and engagement metrics
- Retweet cascades and information diffusion patterns

In [None]:
# Configuration parameters
NUM_USERS = 1000
NUM_POSTS = 5000
NUM_EDGES = 8000  # Follower connections
START_DATE = datetime(2024, 1, 1)
END_DATE = datetime(2024, 6, 30)

print("Generating synthetic social media dataset:")
print(f"  - Users: {NUM_USERS:,}")
print(f"  - Posts: {NUM_POSTS:,}")
print(f"  - Follower connections: {NUM_EDGES:,}")
print(f"  - Time period: {START_DATE.date()} to {END_DATE.date()}")

In [None]:
# Sample content templates for realistic posts
POST_TEMPLATES = [
    # Positive sentiment
    "Just launched our new {topic}! So excited about this amazing opportunity! #{hashtag} #{sentiment}",
    "Loving the {topic} community! You all are absolutely fantastic! #{hashtag}",
    "Best day ever! Finally achieved {topic}! Thank you everyone! #{hashtag}",
    "This {topic} is absolutely incredible! Highly recommend to everyone! #{hashtag}",
    "Celebrating {topic} with the team! What a wonderful journey! #{hashtag} #{sentiment}",
    # Negative sentiment
    "Really disappointed with {topic}. This is terrible and frustrating. #{hashtag}",
    "Can't believe how bad {topic} has become. Very unhappy about this. #{hashtag}",
    "Worst experience with {topic}. Completely unacceptable! #{hashtag} #{sentiment}",
    "Frustrated with {topic}. This needs to change immediately! #{hashtag}",
    "Awful {topic} situation. Really upset about this whole thing. #{hashtag}",
    # Neutral/Informational
    "New research on {topic} published today. Check it out. #{hashtag}",
    "Meeting about {topic} scheduled for next week. #{hashtag}",
    "Sharing some thoughts on {topic}. Let me know what you think. #{hashtag}",
    "Analyzing data from {topic}. Interesting patterns emerging. #{hashtag} #{sentiment}",
    "Update on {topic}: progress continues as planned. #{hashtag}",
    # Questions/Engagement
    "What do you all think about {topic}? Would love to hear opinions! #{hashtag}",
    "Anyone else working on {topic}? Let's connect! #{hashtag}",
    "How do you approach {topic}? Looking for advice. #{hashtag} #{sentiment}",
]

TOPICS = [
    "AI research",
    "climate change",
    "technology",
    "innovation",
    "data science",
    "machine learning",
    "social media",
    "education",
    "healthcare",
    "sustainability",
    "entrepreneurship",
    "leadership",
    "digital transformation",
    "remote work",
    "mental health",
    "diversity",
    "politics",
    "economy",
    "sports",
    "entertainment",
]

HASHTAGS = [
    "tech",
    "innovation",
    "research",
    "science",
    "data",
    "AI",
    "ML",
    "future",
    "digital",
    "community",
    "impact",
    "change",
    "growth",
    "learning",
    "news",
]

SENTIMENT_TAGS = ["happy", "excited", "grateful", "concerned", "thoughtful", "motivated"]

print(f"Content generation configured with {len(POST_TEMPLATES)} templates")
print(f"Topics: {len(TOPICS)} categories")
print(f"Hashtags: {len(HASHTAGS)} options")

In [None]:
def generate_users(n_users: int) -> pd.DataFrame:
    """
    Generate synthetic user profiles with realistic attributes.

    Args:
        n_users: Number of users to generate

    Returns:
        DataFrame with user profiles
    """
    users = []

    for i in range(n_users):
        # User activity follows power law (most users post rarely, few post frequently)
        activity_level = np.random.power(0.5)  # Skewed toward lower values

        user = {
            "user_id": f"user_{i:04d}",
            "username": f"@user{i:04d}",
            "activity_level": activity_level,
            "influence_score": 0,  # Will be calculated later
            # Cast to int: timedelta rejects NumPy integer types
            "account_created": START_DATE - timedelta(days=int(np.random.randint(30, 365))),
        }
        users.append(user)

    return pd.DataFrame(users)


# Generate users
users_df = generate_users(NUM_USERS)

print(f"Generated {len(users_df):,} users")
print("\nSample users:")
print(users_df.head())
print("\nActivity level distribution:")
print(users_df["activity_level"].describe())
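
The `activity_level` draw above uses `np.random.power(0.5)`, which samples from the density $a x^{a-1}$ on [0, 1]. For $a = 0.5$ the CDF is $x^{0.5}$, so the median is 0.25 and the mean is $a/(a+1) = 1/3$. The sketch below checks these values empirically. It uses its own local generator (`rng`, an illustrative name) so that it does not disturb the notebook's global seed; the sample size is an arbitrary choice.

```python
import numpy as np

# Local generator so this check does not touch the notebook's global random state.
rng = np.random.default_rng(0)
a = 0.5
samples = rng.power(a, size=200_000)

# Closed-form values for the density a * x**(a - 1) on [0, 1].
theoretical_mean = a / (a + 1)       # 1/3
theoretical_median = 0.5 ** (1 / a)  # 0.25

empirical_mean = samples.mean()
empirical_median = np.median(samples)

print(f"mean:   {empirical_mean:.3f} (theory {theoretical_mean:.3f})")
print(f"median: {empirical_median:.3f} (theory {theoretical_median:.3f})")
```

Half of all users therefore receive an activity level below 0.25, which is the skew toward low activity that the comment in `generate_users` describes.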

In [None]:
def generate_social_network(users_df: pd.DataFrame, n_edges: int) -> nx.DiGraph:
    """
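    Generate a directed follower network using activity-weighted attachment.

    Followees are sampled with probability proportional to their activity level,
    so more active users tend to gain more followers. This is a static
    approximation of preferential attachment, not the growth-based
    Barabasi-Albert model.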

    Args:
        users_df: DataFrame of users
        n_edges: Target number of follower connections

    Returns:
        Directed graph where edges represent follower relationships
    """
    G = nx.DiGraph()

    # Add all users as nodes
    for _, user in users_df.iterrows():
        G.add_node(user["user_id"], username=user["username"], activity=user["activity_level"])

    # Generate edges with preferential attachment
    # Users with higher activity are more likely to be followed
    user_ids = users_df["user_id"].tolist()
    activity_weights = users_df["activity_level"].values
    activity_weights = activity_weights / activity_weights.sum()  # Normalize

    edges_added = 0
    max_attempts = n_edges * 3  # Prevent infinite loop
    attempts = 0

    while edges_added < n_edges and attempts < max_attempts:
        attempts += 1

        # Select follower (random)
        follower = np.random.choice(user_ids)

        # Select target to follow (weighted by activity)
        followee = np.random.choice(user_ids, p=activity_weights)

        # Avoid self-loops and duplicate edges
        if follower != followee and not G.has_edge(follower, followee):
            G.add_edge(follower, followee, weight=1.0)
            edges_added += 1

    return G


# Generate social network
print("Generating social network with preferential attachment...")
social_network = generate_social_network(users_df, NUM_EDGES)

print("\nNetwork statistics:")
print(f"  Nodes (users): {social_network.number_of_nodes():,}")
print(f"  Edges (follows): {social_network.number_of_edges():,}")
print(f"  Density: {nx.density(social_network):.4f}")
print(
    f"  Average degree: {sum(dict(social_network.degree()).values()) / social_network.number_of_nodes():.2f}"
)

# Check if network is connected
if nx.is_weakly_connected(social_network):
    print("  Network is weakly connected")
else:
    n_components = nx.number_weakly_connected_components(social_network)
    print(f"  Network has {n_components} weakly connected components")
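
For comparison, growth-based preferential attachment in the Barabási-Albert sense is available directly as `nx.barabasi_albert_graph`. The sketch below is not part of the pipeline above; it only illustrates the heavy-tailed degree distribution that model tends to produce. The sizes `n` and `m` are illustrative choices, and the resulting graph is undirected, unlike the follower graph.

```python
import networkx as nx
import numpy as np

# Illustrative sizes: 1,000 nodes, each new node attaches to 8 existing nodes.
n, m = 1000, 8
ba_graph = nx.barabasi_albert_graph(n, m, seed=42)

degrees = np.array([d for _, d in ba_graph.degree()])

print(f"nodes: {ba_graph.number_of_nodes()}, edges: {ba_graph.number_of_edges()}")
print(f"mean degree:   {degrees.mean():.2f}")
print(f"median degree: {np.median(degrees):.0f}")
print(f"max degree:    {degrees.max()}")
```

A maximum degree far above the mean is the signature of hub formation; comparing these numbers with the in-degree histogram of `social_network` shows how closely the activity-weighted generator approximates that behaviour.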

In [None]:
def generate_posts(
    users_df: pd.DataFrame, n_posts: int, start_date: datetime, end_date: datetime
) -> pd.DataFrame:
    """
    Generate synthetic social media posts with realistic content and engagement.

    Args:
        users_df: DataFrame of users
        n_posts: Number of posts to generate
        start_date: Start of time period
        end_date: End of time period

    Returns:
        DataFrame of posts
    """
    posts = []
    time_delta = (end_date - start_date).total_seconds()

    # Sample users weighted by activity level for post creation
    user_weights = users_df["activity_level"].values
    user_weights = user_weights / user_weights.sum()

    for i in range(n_posts):
        # Select user weighted by activity
        user = users_df.sample(n=1, weights=user_weights).iloc[0]

        # Generate timestamp
        timestamp = start_date + timedelta(seconds=np.random.uniform(0, time_delta))

        # Generate content
        template = np.random.choice(POST_TEMPLATES)
        topic = np.random.choice(TOPICS)
        hashtag = np.random.choice(HASHTAGS)
        sentiment_tag = np.random.choice(SENTIMENT_TAGS)

        content = template.format(topic=topic, hashtag=hashtag, sentiment=sentiment_tag)

        # Generate engagement metrics (influenced by user activity)
        base_engagement = user["activity_level"] * 100
        likes = int(np.random.poisson(base_engagement * 0.5))
        retweets = int(np.random.poisson(base_engagement * 0.1))
        replies = int(np.random.poisson(base_engagement * 0.05))

        # Some posts are retweets; the first post has no earlier post to retweet
        is_retweet = bool(np.random.random() < 0.3) and i > 0
        original_post_id = f"post_{np.random.randint(0, i):04d}" if is_retweet else None

        post = {
            "post_id": f"post_{i:04d}",
            "user_id": user["user_id"],
            "username": user["username"],
            "timestamp": timestamp,
            "content": content,
            "likes": likes,
            "retweets": retweets,
            "replies": replies,
            "is_retweet": is_retweet,
            "original_post_id": original_post_id,
            "hashtags": [tag for tag in content.split() if tag.startswith("#")],
        }
        posts.append(post)

    posts_df = pd.DataFrame(posts)
    posts_df = posts_df.sort_values("timestamp").reset_index(drop=True)

    return posts_df


# Generate posts
print("Generating social media posts...")
posts_df = generate_posts(users_df, NUM_POSTS, START_DATE, END_DATE)

print(f"\nGenerated {len(posts_df):,} posts")
print(f"Time range: {posts_df['timestamp'].min()} to {posts_df['timestamp'].max()}")
print(
    f"Retweets: {posts_df['is_retweet'].sum():,} ({posts_df['is_retweet'].sum() / len(posts_df) * 100:.1f}%)"
)
print("\nSample posts:")
print(posts_df[["post_id", "username", "content", "likes", "retweets"]].head(3))

In [None]:
# Calculate total engagement score for each post
posts_df["engagement_score"] = (
    posts_df["likes"] * 1.0 + posts_df["retweets"] * 2.0 + posts_df["replies"] * 1.5
)

# Add date column for temporal analysis
posts_df["date"] = posts_df["timestamp"].dt.date

print("Enhanced posts DataFrame with engagement metrics")
print("\nEngagement statistics:")
print(posts_df[["likes", "retweets", "replies", "engagement_score"]].describe())
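
As a quick worked example of the weighting used for `engagement_score`, consider a hypothetical post (the counts below are made up for illustration):

```python
# Hypothetical post with 10 likes, 3 retweets, and 2 replies.
likes, retweets, replies = 10, 3, 2

# Same weights as the engagement_score column: likes x1.0, retweets x2.0, replies x1.5.
engagement_score = likes * 1.0 + retweets * 2.0 + replies * 1.5

print(engagement_score)  # 10.0 + 6.0 + 3.0 = 19.0
```

Retweets count double because they push content to new audiences; the specific weights are a modelling choice rather than a platform-defined metric.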

## 3. Network Analysis

We analyze the social network structure to understand:
- Centrality measures (who is most connected/influential)
- Network topology and patterns
- Key players in the network

In [None]:
def calculate_centrality_measures(G: nx.DiGraph) -> dict[str, dict[str, float]]:
    """
    Calculate various centrality measures for network nodes.

    Args:
        G: Social network graph

    Returns:
        Dictionary of centrality measures
    """
    print("Calculating centrality measures...")

    centrality = {}

    # Degree centrality (in-degree = followers, out-degree = following)
    print("  - Degree centrality")
    centrality["in_degree"] = dict(G.in_degree())
    centrality["out_degree"] = dict(G.out_degree())

    # Betweenness centrality (bridging different parts of network),
    # approximated from a sample of at most 100 source nodes for speed
    print("  - Betweenness centrality")
    centrality["betweenness"] = nx.betweenness_centrality(
        G, k=min(100, G.number_of_nodes()), seed=42
    )

    # Eigenvector centrality (connected to well-connected nodes)
    print("  - Eigenvector centrality")
    try:
        centrality["eigenvector"] = nx.eigenvector_centrality(G, max_iter=1000)
    except nx.PowerIterationFailedConvergence:
        # If eigenvector doesn't converge, use PageRank as alternative
        print("  - Using PageRank instead (eigenvector did not converge)")
        centrality["eigenvector"] = nx.pagerank(G)

    # PageRank (Google's algorithm for importance)
    print("  - PageRank")
    centrality["pagerank"] = nx.pagerank(G)

    return centrality


# Calculate centrality measures
centrality_measures = calculate_centrality_measures(social_network)

# Add centrality to users dataframe
users_df["in_degree"] = users_df["user_id"].map(centrality_measures["in_degree"])
users_df["out_degree"] = users_df["user_id"].map(centrality_measures["out_degree"])
users_df["betweenness"] = users_df["user_id"].map(centrality_measures["betweenness"])
users_df["eigenvector"] = users_df["user_id"].map(centrality_measures["eigenvector"])
users_df["pagerank"] = users_df["user_id"].map(centrality_measures["pagerank"])

# Calculate composite influence score
users_df["influence_score"] = (
    users_df["in_degree"] * 0.3
    + users_df["pagerank"] * 100 * 0.3
    + users_df["betweenness"] * 100 * 0.2
    + users_df["eigenvector"] * 100 * 0.2
)

print("\nTop 10 users by influence score:")
print(
    users_df.nlargest(10, "influence_score")[
        ["username", "in_degree", "out_degree", "pagerank", "influence_score"]
    ]
)

In [None]:
# Visualize degree and centrality distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# In-degree distribution (followers)
in_degrees = [d for n, d in social_network.in_degree()]
axes[0, 0].hist(in_degrees, bins=50, color="skyblue", edgecolor="black", alpha=0.7)
axes[0, 0].set_xlabel("In-Degree (Followers)")
axes[0, 0].set_ylabel("Frequency")
axes[0, 0].set_title("In-Degree Distribution")
axes[0, 0].axvline(
    np.mean(in_degrees), color="red", linestyle="--", label=f"Mean: {np.mean(in_degrees):.1f}"
)
axes[0, 0].legend()

# Out-degree distribution (following)
out_degrees = [d for n, d in social_network.out_degree()]
axes[0, 1].hist(out_degrees, bins=50, color="lightcoral", edgecolor="black", alpha=0.7)
axes[0, 1].set_xlabel("Out-Degree (Following)")
axes[0, 1].set_ylabel("Frequency")
axes[0, 1].set_title("Out-Degree Distribution")
axes[0, 1].axvline(
    np.mean(out_degrees), color="red", linestyle="--", label=f"Mean: {np.mean(out_degrees):.1f}"
)
axes[0, 1].legend()

# PageRank distribution
pageranks = list(centrality_measures["pagerank"].values())
axes[1, 0].hist(pageranks, bins=50, color="lightgreen", edgecolor="black", alpha=0.7)
axes[1, 0].set_xlabel("PageRank Score")
axes[1, 0].set_ylabel("Frequency")
axes[1, 0].set_title("PageRank Distribution")
axes[1, 0].axvline(
    np.mean(pageranks), color="red", linestyle="--", label=f"Mean: {np.mean(pageranks):.6f}"
)
axes[1, 0].legend()

# Betweenness centrality
betweenness = list(centrality_measures["betweenness"].values())
axes[1, 1].hist(betweenness, bins=50, color="plum", edgecolor="black", alpha=0.7)
axes[1, 1].set_xlabel("Betweenness Centrality")
axes[1, 1].set_ylabel("Frequency")
axes[1, 1].set_title("Betweenness Centrality Distribution")
axes[1, 1].axvline(
    np.mean(betweenness), color="red", linestyle="--", label=f"Mean: {np.mean(betweenness):.6f}"
)
axes[1, 1].legend()

plt.tight_layout()
plt.savefig("network_centrality_distributions.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nIn heavy-tailed networks, most users have low centrality and a few have very high centrality")
print("Compare the histograms above against that pattern")

In [None]:
# Visualize network structure (sample for performance)
# For large networks, we visualize a subgraph of most influential nodes

# Get top 100 most influential nodes
top_users = users_df.nlargest(100, "influence_score")["user_id"].tolist()
subgraph = social_network.subgraph(top_users).copy()

print("Visualizing subgraph of top 100 influential users")
print(f"Subgraph: {subgraph.number_of_nodes()} nodes, {subgraph.number_of_edges()} edges")

fig, ax = plt.subplots(figsize=(14, 14))

# Calculate layout
pos = nx.spring_layout(subgraph, k=0.5, iterations=50, seed=42)

# Node sizes based on in-degree (followers)
node_sizes = [subgraph.in_degree(node) * 20 + 50 for node in subgraph.nodes()]

# Node colors based on PageRank
node_colors = [centrality_measures["pagerank"][node] for node in subgraph.nodes()]

# Draw network
nx.draw_networkx_edges(
    subgraph, pos, alpha=0.2, edge_color="gray", arrows=True, arrowsize=10, width=0.5, ax=ax
)
nodes = nx.draw_networkx_nodes(
    subgraph, pos, node_size=node_sizes, node_color=node_colors, cmap="YlOrRd", alpha=0.8, ax=ax
)

# Add colorbar
plt.colorbar(nodes, label="PageRank", ax=ax)

# Add labels for top 10 nodes
top_10_users = users_df.nlargest(10, "influence_score")["user_id"].tolist()
labels = {
    node: social_network.nodes[node]["username"]
    for node in subgraph.nodes()
    if node in top_10_users
}
nx.draw_networkx_labels(subgraph, pos, labels, font_size=8, ax=ax)

ax.set_title("Social Network Structure (Top 100 Influential Users)", fontsize=16, fontweight="bold")
ax.axis("off")

plt.tight_layout()
plt.savefig("social_network_visualization.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nNetwork visualization complete")
print("Node size = number of followers")
print("Node color = PageRank score (darker = more influential)")

## 4. Community Detection

We use the Louvain algorithm to detect communities (clusters) in the social network. Communities represent groups of users who are more densely connected to each other than to the rest of the network.
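
To see what "more densely connected" means in practice, the sketch below runs the same `louvain_communities` call on a toy graph: two 5-node cliques joined by a single bridge edge. The toy graph is an assumption made for illustration and is unrelated to the synthetic network.

```python
import networkx as nx
from networkx.algorithms import community

# Two 5-node cliques (nodes 0-4 and 5-9) joined by one bridge edge.
toy = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
toy.add_edge(4, 5)

toy_communities = community.louvain_communities(toy, seed=42)

# Sort for stable display; Louvain returns a list of node sets.
for members in sorted(sorted(c) for c in toy_communities):
    print(members)
```

Each clique is recovered as its own community because almost all of its edges are internal; the single bridge is not enough to merge them.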

In [None]:
def detect_communities(G: nx.DiGraph) -> dict[str, int]:
    """
    Detect communities using Louvain algorithm.

    Args:
        G: Social network graph

    Returns:
        Dictionary mapping user_id to community_id
    """
    print("Detecting communities using Louvain algorithm...")

    # Convert to undirected for community detection
    G_undirected = G.to_undirected()

    # Apply Louvain community detection (returns a list of node sets)
    detected_communities = community.louvain_communities(G_undirected, seed=42)

    # Convert to dictionary
    user_to_community = {}
    for community_id, community_nodes in enumerate(detected_communities):
        for node in community_nodes:
            user_to_community[node] = community_id

    return user_to_community


# Detect communities
user_communities = detect_communities(social_network)

# Add to users dataframe
users_df["community"] = users_df["user_id"].map(user_communities)

# Community statistics
community_sizes = users_df["community"].value_counts().sort_index()
print(f"\nDetected {len(community_sizes)} communities")
print("\nCommunity sizes:")
print(community_sizes.describe())
print("\nTop 10 largest communities:")
print(community_sizes.nlargest(10))

In [None]:
# Visualize communities
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Community size distribution
axes[0].bar(
    range(len(community_sizes)),
    community_sizes.values,
    color="steelblue",
    alpha=0.7,
    edgecolor="black",
)
axes[0].set_xlabel("Community ID")
axes[0].set_ylabel("Number of Users")
axes[0].set_title("Community Size Distribution")
axes[0].axhline(
    community_sizes.mean(), color="red", linestyle="--", label=f"Mean: {community_sizes.mean():.1f}"
)
axes[0].legend()

# Visualize network colored by community (top 100 users)
top_users = users_df.nlargest(100, "influence_score")["user_id"].tolist()
subgraph = social_network.subgraph(top_users).copy()

pos = nx.spring_layout(subgraph, k=0.5, iterations=50, seed=42)

# Color by community
node_colors = [user_communities[node] for node in subgraph.nodes()]
node_sizes = [subgraph.in_degree(node) * 20 + 50 for node in subgraph.nodes()]

nx.draw_networkx_edges(
    subgraph, pos, alpha=0.2, edge_color="gray", arrows=True, arrowsize=10, width=0.5, ax=axes[1]
)
nx.draw_networkx_nodes(
    subgraph, pos, node_size=node_sizes, node_color=node_colors, cmap="tab20", alpha=0.8, ax=axes[1]
)

axes[1].set_title("Network Communities (Top 100 Users)", fontsize=14, fontweight="bold")
axes[1].axis("off")

plt.tight_layout()
plt.savefig("community_detection.png", dpi=300, bbox_inches="tight")
plt.show()

print("Community detection complete")
print("Each color represents a different community in the network")

In [None]:
# Calculate modularity (quality of community detection)
G_undirected = social_network.to_undirected()
communities_list = []
for comm_id in sorted(set(user_communities.values())):
    comm_nodes = [node for node, c_id in user_communities.items() if c_id == comm_id]
    communities_list.append(set(comm_nodes))

modularity_score = community.modularity(G_undirected, communities_list)
print(f"\nModularity score: {modularity_score:.4f}")
print("(Values closer to 1 indicate stronger community structure)")

if modularity_score > 0.3:
    print("Strong community structure detected")
elif modularity_score > 0.1:
    print("Moderate community structure detected")
else:
    print("Weak community structure detected")
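
For an undirected, unweighted graph, the modularity reported above is $Q = \sum_c \left[ L_c / m - (d_c / 2m)^2 \right]$, where $m$ is the total number of edges, $L_c$ the number of edges inside community $c$, and $d_c$ the sum of degrees in $c$. The sketch below evaluates this by hand on a toy graph (two 5-node cliques joined by one edge, an illustrative assumption) and compares it with `community.modularity`.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two 5-node cliques joined by one bridge edge.
toy = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
toy.add_edge(4, 5)
partition = [set(range(5)), set(range(5, 10))]

m = toy.number_of_edges()  # 10 + 10 + 1 bridge = 21
manual_q = 0.0
for members in partition:
    internal_edges = toy.subgraph(members).number_of_edges()  # L_c = 10
    total_degree = sum(d for _, d in toy.degree(members))      # d_c = 21
    manual_q += internal_edges / m - (total_degree / (2 * m)) ** 2

library_q = community.modularity(toy, partition)
print(f"manual: {manual_q:.4f}, networkx: {library_q:.4f}")
```

Both values equal $2 \times (10/21 - 0.25) \approx 0.4524$, which would count as strong community structure under the thresholds used above.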

## 5. Sentiment Analysis

We analyze the sentiment of posts using VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment tool designed for social media text.

In [None]:
def analyze_sentiment(texts: list[str]) -> pd.DataFrame:
    """
    Analyze sentiment of texts using VADER.

    Args:
        texts: List of text strings to analyze

    Returns:
        DataFrame with sentiment scores
    """
    analyzer = SentimentIntensityAnalyzer()

    sentiment_scores = []
    for text in texts:
        scores = analyzer.polarity_scores(text)
        sentiment_scores.append(scores)

    return pd.DataFrame(sentiment_scores)


# Analyze sentiment of all posts
print("Analyzing sentiment of all posts using VADER...")
sentiment_df = analyze_sentiment(posts_df["content"].tolist())

# Add sentiment scores to posts
posts_df["sentiment_neg"] = sentiment_df["neg"]
posts_df["sentiment_neu"] = sentiment_df["neu"]
posts_df["sentiment_pos"] = sentiment_df["pos"]
posts_df["sentiment_compound"] = sentiment_df["compound"]


# Categorize sentiment
def categorize_sentiment(compound: float) -> str:
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"


posts_df["sentiment_category"] = posts_df["sentiment_compound"].apply(categorize_sentiment)

print(f"\nSentiment analysis complete for {len(posts_df):,} posts")
print("\nSentiment distribution:")
print(posts_df["sentiment_category"].value_counts())
print("\nSentiment score statistics:")
print(posts_df["sentiment_compound"].describe())
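
The ±0.05 cut-offs in `categorize_sentiment` are inclusive at both ends. A minimal vectorized sketch of the same rule, using `np.select` on hypothetical compound scores chosen to sit on and around the thresholds:

```python
import numpy as np
import pandas as pd

# Hypothetical compound scores on and around the +/-0.05 thresholds.
compound = pd.Series([0.62, 0.05, 0.049, 0.0, -0.049, -0.05, -0.71])

# Conditions are checked in order; anything matching neither is neutral.
labels = np.select(
    [compound >= 0.05, compound <= -0.05],
    ["positive", "negative"],
    default="neutral",
)

result = pd.DataFrame({"compound": compound, "category": labels})
print(result)
```

On large DataFrames this form avoids a Python-level function call per row, though for 5,000 posts the difference from `.apply` is negligible.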

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sentiment category distribution
sentiment_counts = posts_df["sentiment_category"].value_counts()
colors = {"positive": "lightgreen", "neutral": "lightblue", "negative": "lightcoral"}
sentiment_colors = [colors[cat] for cat in sentiment_counts.index]

axes[0, 0].bar(
    sentiment_counts.index,
    sentiment_counts.values,
    color=sentiment_colors,
    edgecolor="black",
    alpha=0.7,
)
axes[0, 0].set_xlabel("Sentiment Category")
axes[0, 0].set_ylabel("Number of Posts")
axes[0, 0].set_title("Sentiment Category Distribution")
for i, (_cat, count) in enumerate(sentiment_counts.items()):
    axes[0, 0].text(
        i, count, f"{count}\n({count / len(posts_df) * 100:.1f}%)", ha="center", va="bottom"
    )

# Compound sentiment score distribution
axes[0, 1].hist(
    posts_df["sentiment_compound"], bins=50, color="mediumpurple", edgecolor="black", alpha=0.7
)
axes[0, 1].set_xlabel("Compound Sentiment Score")
axes[0, 1].set_ylabel("Frequency")
axes[0, 1].set_title("Sentiment Score Distribution")
axes[0, 1].axvline(
    posts_df["sentiment_compound"].mean(),
    color="red",
    linestyle="--",
    label=f"Mean: {posts_df['sentiment_compound'].mean():.3f}",
)
axes[0, 1].axvline(0, color="black", linestyle="-", alpha=0.3)
axes[0, 1].legend()

# Sentiment over time
daily_sentiment = posts_df.groupby("date")["sentiment_compound"].mean()
axes[1, 0].plot(
    daily_sentiment.index, daily_sentiment.values, color="steelblue", linewidth=2, alpha=0.7
)
axes[1, 0].axhline(0, color="black", linestyle="-", alpha=0.3)
axes[1, 0].set_xlabel("Date")
axes[1, 0].set_ylabel("Average Sentiment Score")
axes[1, 0].set_title("Sentiment Trends Over Time")
axes[1, 0].tick_params(axis="x", rotation=45)
axes[1, 0].grid(alpha=0.3)

# Sentiment by engagement
bins = pd.qcut(
    posts_df["engagement_score"], q=5, labels=["Very Low", "Low", "Medium", "High", "Very High"]
)
sentiment_by_engagement = posts_df.groupby(bins)["sentiment_compound"].mean()

axes[1, 1].bar(
    range(len(sentiment_by_engagement)),
    sentiment_by_engagement.values,
    color="coral",
    edgecolor="black",
    alpha=0.7,
)
axes[1, 1].set_xticks(range(len(sentiment_by_engagement)))
axes[1, 1].set_xticklabels(sentiment_by_engagement.index, rotation=45)
axes[1, 1].set_xlabel("Engagement Level")
axes[1, 1].set_ylabel("Average Sentiment Score")
axes[1, 1].set_title("Sentiment by Engagement Level")
axes[1, 1].axhline(0, color="black", linestyle="-", alpha=0.3)
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("sentiment_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nSentiment analysis visualizations complete")

In [None]:
# Sentiment by community
posts_with_community = posts_df.merge(users_df[["user_id", "community"]], on="user_id", how="left")

# Analyze top 10 communities
top_communities = users_df["community"].value_counts().head(10).index
community_sentiment = (
    posts_with_community[posts_with_community["community"].isin(top_communities)]
    .groupby("community")["sentiment_compound"]
    .agg(["mean", "std", "count"])
)

print("\nSentiment by community (top 10 communities):")
print(community_sentiment.round(3))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x = range(len(community_sentiment))
ax.bar(
    x,
    community_sentiment["mean"],
    yerr=community_sentiment["std"],
    color="skyblue",
    edgecolor="black",
    alpha=0.7,
    capsize=5,
)
ax.set_xticks(x)
ax.set_xticklabels([f"Community {i}" for i in community_sentiment.index])
ax.set_xlabel("Community")
ax.set_ylabel("Average Sentiment Score")
ax.set_title("Sentiment Distribution Across Communities")
ax.axhline(0, color="black", linestyle="-", alpha=0.3)
ax.grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("sentiment_by_community.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nCompare the means and error bars above to judge whether communities differ in sentiment")

## 6. Influence Analysis

We identify influential users based on network centrality and content engagement metrics.
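
Composite scores of this kind combine quantities on very different scales: follower counts can be in the tens, while PageRank values sit near 1/N. One common option, shown here as a sketch rather than as the method used in this notebook, is to min-max scale each component before weighting. The values, the weights, and the helper name `min_max` are all hypothetical.

```python
import pandas as pd

# Hypothetical centrality values for four users.
toy_users = pd.DataFrame(
    {
        "in_degree": [40, 12, 3, 0],
        "pagerank": [0.004, 0.002, 0.0008, 0.0002],
        "betweenness": [0.010, 0.030, 0.000, 0.000],
    },
    index=["user_a", "user_b", "user_c", "user_d"],
)
weights = {"in_degree": 0.4, "pagerank": 0.4, "betweenness": 0.2}  # illustrative


def min_max(column: pd.Series) -> pd.Series:
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    span = column.max() - column.min()
    if span == 0:
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span


scaled = toy_users.apply(min_max)  # applied column by column
toy_users["influence_scaled"] = sum(scaled[name] * w for name, w in weights.items())

print(toy_users.sort_values("influence_scaled", ascending=False))
```

After scaling, each weight expresses the intended share of the final score instead of being dominated by whichever component happens to have the largest raw values.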

In [None]:
# Calculate user-level post statistics
user_post_stats = (
    posts_df.groupby("user_id")
    .agg(
        {
            "post_id": "count",
            "likes": "sum",
            "retweets": "sum",
            "replies": "sum",
            "engagement_score": ["sum", "mean"],
            "sentiment_compound": "mean",
        }
    )
    .reset_index()
)

user_post_stats.columns = [
    "user_id",
    "num_posts",
    "total_likes",
    "total_retweets",
    "total_replies",
    "total_engagement",
    "avg_engagement",
    "avg_sentiment",
]

# Merge with user data
users_enriched = users_df.merge(user_post_stats, on="user_id", how="left")
users_enriched = users_enriched.fillna(0)

# Calculate comprehensive influence metric
users_enriched["content_influence"] = (
    users_enriched["total_engagement"] * 0.4
    + users_enriched["avg_engagement"] * 0.3
    + users_enriched["num_posts"] * 0.3
)

users_enriched["overall_influence"] = (
    users_enriched["influence_score"] * 0.5 + users_enriched["content_influence"] * 0.5
)

# Identify top influencers
top_influencers = users_enriched.nlargest(20, "overall_influence")

print("Top 20 Most Influential Users:")
print("=" * 100)
print(
    top_influencers[
        [
            "username",
            "in_degree",
            "pagerank",
            "num_posts",
            "total_engagement",
            "avg_sentiment",
            "overall_influence",
        ]
    ].to_string(index=False)
)

In [None]:
# Visualize influence factors
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Network influence vs Content influence
axes[0, 0].scatter(
    users_enriched["influence_score"],
    users_enriched["content_influence"],
    alpha=0.5,
    s=50,
    c="steelblue",
    edgecolors="black",
    linewidth=0.5,
)
axes[0, 0].set_xlabel("Network Influence Score")
axes[0, 0].set_ylabel("Content Influence Score")
axes[0, 0].set_title("Network vs Content Influence")
axes[0, 0].grid(alpha=0.3)

# Followers vs Engagement
axes[0, 1].scatter(
    users_enriched["in_degree"],
    users_enriched["total_engagement"],
    alpha=0.5,
    s=50,
    c="coral",
    edgecolors="black",
    linewidth=0.5,
)
axes[0, 1].set_xlabel("Number of Followers")
axes[0, 1].set_ylabel("Total Engagement")
axes[0, 1].set_title("Followers vs Engagement")
axes[0, 1].grid(alpha=0.3)

# Top influencers by category
top_20 = users_enriched.nlargest(20, "overall_influence")
axes[1, 0].barh(
    range(len(top_20)),
    top_20["overall_influence"].values,
    color="mediumseagreen",
    alpha=0.7,
    edgecolor="black",
)
axes[1, 0].set_yticks(range(len(top_20)))
axes[1, 0].set_yticklabels(top_20["username"].values, fontsize=8)
axes[1, 0].set_xlabel("Overall Influence Score")
axes[1, 0].set_title("Top 20 Most Influential Users")
axes[1, 0].invert_yaxis()
axes[1, 0].grid(alpha=0.3)

# Influence distribution
axes[1, 1].hist(
    users_enriched["overall_influence"], bins=50, color="plum", alpha=0.7, edgecolor="black"
)
axes[1, 1].set_xlabel("Overall Influence Score")
axes[1, 1].set_ylabel("Frequency")
axes[1, 1].set_title("Influence Score Distribution")
axes[1, 1].axvline(
    users_enriched["overall_influence"].mean(),
    color="red",
    linestyle="--",
    label=f"Mean: {users_enriched['overall_influence'].mean():.1f}",
)
axes[1, 1].legend()

plt.tight_layout()
plt.savefig("influence_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nInfluence analysis visualizations complete")

In [None]:
# Analyze engagement patterns
print("\nEngagement Patterns Analysis")
print("=" * 60)

# Post frequency vs engagement (only users with at least one post)
active_users = users_enriched[users_enriched["num_posts"] > 0]
# Rank first so that ties in post counts cannot produce duplicate quantile edges
posting_bins = pd.qcut(
    active_users["num_posts"].rank(method="first"),
    q=4,
    labels=["Low", "Medium", "High", "Very High"],
)
posting_engagement = active_users.groupby(posting_bins)["avg_engagement"].mean()

print("\nAverage engagement by posting frequency:")
print(posting_engagement)

# Sentiment impact on engagement
sentiment_engagement = posts_df.groupby("sentiment_category")["engagement_score"].mean()
print("\nAverage engagement by sentiment:")
print(sentiment_engagement)

# Peak posting times
posts_df["hour"] = posts_df["timestamp"].dt.hour
posts_df["day_of_week"] = posts_df["timestamp"].dt.day_name()

hourly_engagement = posts_df.groupby("hour")["engagement_score"].mean()
print("\nPeak engagement hours:")
print(hourly_engagement.nlargest(5))

## 7. Information Diffusion Analysis

We analyze how information spreads through the network via retweets and identify viral content patterns.
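
As a rough illustration of the idea, the sketch below groups retweets under the post they point to on a tiny made-up table. The column names (`post_id`, `is_retweet`, `original_post_id`) mirror those used in this notebook; the values and the `toy_` names are hypothetical. Cascades here are one level deep: each retweet is attributed directly to its original post, and chains of retweets of retweets are not reconstructed.

```python
import pandas as pd

# Toy posts table; column names mirror the notebook, values are made up
toy_posts = pd.DataFrame(
    {
        "post_id": ["p1", "p2", "p3", "p4", "p5"],
        "is_retweet": [False, True, True, False, True],
        "original_post_id": [None, "p1", "p1", None, "p4"],
    }
)

# Keep only retweets that point at a known original post
retweets = toy_posts[toy_posts["is_retweet"] & toy_posts["original_post_id"].notna()]

# One cascade per original post: the list of posts that retweeted it
toy_cascades = retweets.groupby("original_post_id")["post_id"].apply(list).to_dict()
toy_cascade_sizes = {post_id: len(ids) for post_id, ids in toy_cascades.items()}

print(toy_cascades)
print(toy_cascade_sizes)
```

The grouped form avoids a Python-level loop over rows, which may matter on larger datasets.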

In [None]:
def build_retweet_cascades(posts_df: pd.DataFrame) -> dict[str, list[str]]:
    """
    Build retweet cascades showing how content spreads.

    Args:
        posts_df: DataFrame of posts

    Returns:
        Dictionary mapping original post to list of retweet post IDs
    """
    cascades = defaultdict(list)

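    for _, post in posts_df.iterrows():
        original_id = post["original_post_id"]
        # pd.notna guards against NaN IDs, which are truthy in Python
        if post["is_retweet"] and pd.notna(original_id) and original_id:
            cascades[original_id].append(post["post_id"])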

    return dict(cascades)


# Build cascades
retweet_cascades = build_retweet_cascades(posts_df)

# Calculate cascade statistics
cascade_sizes = {post_id: len(retweets) for post_id, retweets in retweet_cascades.items()}
posts_df["cascade_size"] = posts_df["post_id"].map(cascade_sizes).fillna(0)

print(f"Total retweet cascades: {len(retweet_cascades):,}")
print("\nCascade size statistics:")
cascade_sizes_series = pd.Series(list(cascade_sizes.values()))
print(cascade_sizes_series.describe())

# Identify viral content (top 20 by cascade size; the first 10 are printed)
viral_posts = posts_df.nlargest(20, "cascade_size")
print("\nTop 10 Most Viral Posts (by cascade size):")
print("=" * 100)
print(
    viral_posts[
        ["post_id", "username", "content", "retweets", "cascade_size", "sentiment_compound"]
    ]
    .head(10)
    .to_string(index=False)
)

In [None]:
# Visualize information diffusion
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Cascade size distribution
cascade_sizes_list = [size for size in cascade_sizes.values() if size > 0]
axes[0, 0].hist(cascade_sizes_list, bins=50, color="lightcoral", edgecolor="black", alpha=0.7)
axes[0, 0].set_xlabel("Cascade Size (Number of Retweets)")
axes[0, 0].set_ylabel("Frequency")
axes[0, 0].set_title("Retweet Cascade Size Distribution")
axes[0, 0].axvline(
    np.mean(cascade_sizes_list),
    color="red",
    linestyle="--",
    label=f"Mean: {np.mean(cascade_sizes_list):.1f}",
)
axes[0, 0].legend()

# Virality over time
daily_virality = posts_df.groupby("date")["cascade_size"].sum()
axes[0, 1].plot(
    daily_virality.index,
    daily_virality.values,
    color="green",
    linewidth=2,
    marker="o",
    markersize=3,
    alpha=0.7,
)
axes[0, 1].set_xlabel("Date")
axes[0, 1].set_ylabel("Total Cascade Size")
axes[0, 1].set_title("Viral Activity Over Time")
axes[0, 1].tick_params(axis="x", rotation=45)
axes[0, 1].grid(alpha=0.3)

# Sentiment of viral content
viral_threshold = posts_df["cascade_size"].quantile(0.9)
viral_sentiment = posts_df[posts_df["cascade_size"] >= viral_threshold][
    "sentiment_category"
].value_counts()
axes[1, 0].bar(
    viral_sentiment.index,
    viral_sentiment.values,
    color=[
        "lightgreen" if x == "positive" else "lightcoral" if x == "negative" else "lightblue"
        for x in viral_sentiment.index
    ],
    edgecolor="black",
    alpha=0.7,
)
axes[1, 0].set_xlabel("Sentiment Category")
axes[1, 0].set_ylabel("Number of Viral Posts")
axes[1, 0].set_title("Sentiment of Viral Content (Top 10%)")
for i, (_cat, count) in enumerate(viral_sentiment.items()):
    axes[1, 0].text(
        i, count, f"{count}\n({count / viral_sentiment.sum() * 100:.1f}%)", ha="center", va="bottom"
    )

# Engagement vs Virality
axes[1, 1].scatter(
    posts_df["engagement_score"],
    posts_df["cascade_size"],
    alpha=0.4,
    s=30,
    c="purple",
    edgecolors="black",
    linewidth=0.5,
)
axes[1, 1].set_xlabel("Engagement Score")
axes[1, 1].set_ylabel("Cascade Size")
axes[1, 1].set_title("Engagement vs Virality")
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("information_diffusion.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nInformation diffusion analysis complete")
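
One caveat about the "Top 10%" panel: when fewer than 10% of posts are ever retweeted, the 90th percentile of `cascade_size` is 0, so a `>=` comparison selects every post. The sketch below shows this on made-up numbers and one possible alternative, taking the quantile only over posts that were retweeted at least once. The `toy_sizes` data and the alternative threshold are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Made-up cascade sizes: most posts are never retweeted
toy_sizes = pd.Series([0] * 95 + [1, 2, 3, 5, 8], name="cascade_size")

# Quantile over all posts: degenerate when zeros dominate
threshold = toy_sizes.quantile(0.9)
share_selected = (toy_sizes >= threshold).mean()
print(f"90th percentile: {threshold}, share selected: {share_selected:.0%}")

# Alternative: take the quantile over posts with at least one retweet
retweeted = toy_sizes[toy_sizes > 0]
threshold_retweeted = retweeted.quantile(0.9)
viral_mask = toy_sizes >= threshold_retweeted
print(f"Threshold among retweeted posts: {threshold_retweeted}, viral posts: {viral_mask.sum()}")
```

Checking `share_selected` before plotting is a cheap way to confirm that the threshold isolates a small subset.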

In [None]:
# Analyze diffusion through network communities
posts_with_community = posts_df.merge(users_df[["user_id", "community"]], on="user_id", how="left")

# Viral content by community
top_communities = users_df["community"].value_counts().head(10).index
community_virality = (
    posts_with_community[posts_with_community["community"].isin(top_communities)]
    .groupby("community")["cascade_size"]
    .agg(["sum", "mean", "max"])
)

print("\nVirality by community (top 10 communities):")
print(community_virality.round(2))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x = range(len(community_virality))
ax.bar(x, community_virality["mean"], color="orange", alpha=0.7, edgecolor="black")
ax.set_xticks(x)
ax.set_xticklabels([f"Community {i}" for i in community_virality.index])
ax.set_xlabel("Community")
ax.set_ylabel("Average Cascade Size")
ax.set_title("Viral Activity by Community")
ax.grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("virality_by_community.png", dpi=300, bbox_inches="tight")
plt.show()
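
A complementary view is how often retweets cross community boundaries. The sketch below is a minimal illustration on made-up data, assuming each retweet can be linked to the author of the original post through `original_post_id`. The `toy_` tables and the `cross_community_share` name are hypothetical.

```python
import pandas as pd

# Made-up users and posts; column names follow the notebook's conventions
toy_users = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u4"], "community": [0, 0, 1, 1]})
toy_posts = pd.DataFrame(
    {
        "post_id": ["p1", "p2", "p3", "p4", "p5"],
        "user_id": ["u1", "u2", "u3", "u4", "u3"],
        "is_retweet": [False, True, True, True, False],
        "original_post_id": [None, "p1", "p1", "p5", None],
    }
)

# Lookup tables: user -> community, post -> author
community_of_user = toy_users.set_index("user_id")["community"]
author_of_post = toy_posts.set_index("post_id")["user_id"]

# Attach the community of the retweeter and of the original author
retweets = toy_posts[toy_posts["is_retweet"]].copy()
retweets["retweeter_community"] = retweets["user_id"].map(community_of_user)
retweets["source_community"] = (
    retweets["original_post_id"].map(author_of_post).map(community_of_user)
)

# Fraction of retweets whose retweeter and original author differ in community
crosses = retweets["retweeter_community"] != retweets["source_community"]
cross_community_share = crosses.mean()
print(f"Share of retweets crossing communities: {cross_community_share:.2f}")
```

On real data, retweets whose original post is missing from the table would need to be dropped first, since a missing community compares as unequal.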

## 8. Topic and Hashtag Analysis

We extract and analyze hashtags to understand trending topics and their engagement patterns.
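
The core of the extraction is turning one row per post into one row per (post, hashtag) pair. The sketch below shows that step with `DataFrame.explode` on a small made-up table, assuming the `hashtags` column holds Python lists of strings. The `toy_` names and values are hypothetical.

```python
import pandas as pd

# Made-up posts; the hashtags column holds lists, one of them empty
toy_posts = pd.DataFrame(
    {
        "post_id": ["p1", "p2", "p3"],
        "hashtags": [["#AI", "#Data"], ["#ai"], []],
        "engagement_score": [10.0, 4.0, 7.0],
    }
)

# One row per (post, hashtag); posts with no hashtags become NaN and are dropped
exploded = toy_posts.explode("hashtags").dropna(subset=["hashtags"])
exploded["hashtag"] = exploded["hashtags"].str.lower()

# Aggregate per lowercased hashtag using named aggregation
toy_stats = (
    exploded.groupby("hashtag")
    .agg(frequency=("post_id", "count"), avg_engagement=("engagement_score", "mean"))
    .sort_values("frequency", ascending=False)
    .reset_index()
)
print(toy_stats)
```

Named aggregation yields flat column names directly, so no manual renaming of a multi-level column index is needed.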

In [None]:
def extract_hashtags(posts_df: pd.DataFrame) -> pd.DataFrame:
    """
    Extract and analyze hashtags from posts.

    Args:
        posts_df: DataFrame of posts

    Returns:
        DataFrame of hashtag statistics
    """
    # Flatten hashtag lists
    all_hashtags = []
    for _, post in posts_df.iterrows():
        for hashtag in post["hashtags"]:
            all_hashtags.append(
                {
                    "hashtag": hashtag.lower(),
                    "post_id": post["post_id"],
                    "engagement_score": post["engagement_score"],
                    "sentiment": post["sentiment_compound"],
                    "cascade_size": post["cascade_size"],
                }
            )

    hashtags_df = pd.DataFrame(all_hashtags)

    # Aggregate statistics by hashtag
    hashtag_stats = (
        hashtags_df.groupby("hashtag")
        .agg(
            {
                "post_id": "count",
                "engagement_score": ["sum", "mean"],
                "sentiment": "mean",
                "cascade_size": ["sum", "mean"],
            }
        )
        .reset_index()
    )

    hashtag_stats.columns = [
        "hashtag",
        "frequency",
        "total_engagement",
        "avg_engagement",
        "avg_sentiment",
        "total_virality",
        "avg_virality",
    ]

    return hashtag_stats.sort_values("frequency", ascending=False)


# Extract hashtags
print("Extracting and analyzing hashtags...")
hashtag_stats = extract_hashtags(posts_df)

print(f"\nTotal unique hashtags: {len(hashtag_stats):,}")
print("\nTop 20 Most Frequent Hashtags:")
print("=" * 100)
print(hashtag_stats.head(20).to_string(index=False))

In [None]:
# Visualize hashtag analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top hashtags by frequency
top_20_hashtags = hashtag_stats.head(20)
axes[0, 0].barh(
    range(len(top_20_hashtags)),
    top_20_hashtags["frequency"].values,
    color="steelblue",
    alpha=0.7,
    edgecolor="black",
)
axes[0, 0].set_yticks(range(len(top_20_hashtags)))
axes[0, 0].set_yticklabels(top_20_hashtags["hashtag"].values, fontsize=8)
axes[0, 0].set_xlabel("Frequency")
axes[0, 0].set_title("Top 20 Hashtags by Frequency")
axes[0, 0].invert_yaxis()
axes[0, 0].grid(alpha=0.3)

# Hashtags by engagement
top_engagement = hashtag_stats.nlargest(20, "total_engagement")
axes[0, 1].barh(
    range(len(top_engagement)),
    top_engagement["total_engagement"].values,
    color="coral",
    alpha=0.7,
    edgecolor="black",
)
axes[0, 1].set_yticks(range(len(top_engagement)))
axes[0, 1].set_yticklabels(top_engagement["hashtag"].values, fontsize=8)
axes[0, 1].set_xlabel("Total Engagement Score")
axes[0, 1].set_title("Top 20 Hashtags by Engagement")
axes[0, 1].invert_yaxis()
axes[0, 1].grid(alpha=0.3)

# Frequency vs Engagement
axes[1, 0].scatter(
    hashtag_stats["frequency"],
    hashtag_stats["total_engagement"],
    alpha=0.5,
    s=50,
    c="green",
    edgecolors="black",
    linewidth=0.5,
)
axes[1, 0].set_xlabel("Hashtag Frequency")
axes[1, 0].set_ylabel("Total Engagement")
axes[1, 0].set_title("Hashtag Frequency vs Engagement")
axes[1, 0].grid(alpha=0.3)

# Sentiment by top hashtags
top_15_hashtags = hashtag_stats.head(15)
colors_sentiment = [
    "lightgreen" if x > 0 else "lightcoral" if x < 0 else "lightblue"
    for x in top_15_hashtags["avg_sentiment"]
]
axes[1, 1].barh(
    range(len(top_15_hashtags)),
    top_15_hashtags["avg_sentiment"].values,
    color=colors_sentiment,
    alpha=0.7,
    edgecolor="black",
)
axes[1, 1].set_yticks(range(len(top_15_hashtags)))
axes[1, 1].set_yticklabels(top_15_hashtags["hashtag"].values, fontsize=8)
axes[1, 1].set_xlabel("Average Sentiment Score")
axes[1, 1].set_title("Sentiment of Top 15 Hashtags")
axes[1, 1].axvline(0, color="black", linestyle="-", alpha=0.3)
axes[1, 1].invert_yaxis()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("hashtag_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nHashtag analysis visualizations complete")

In [None]:
# Temporal hashtag trends
posts_df["week"] = posts_df["timestamp"].dt.to_period("W")

# Track top 5 hashtags over time
top_5_hashtags = hashtag_stats.head(5)["hashtag"].tolist()

weekly_hashtag_trends = []
for _, post in posts_df.iterrows():
    for hashtag in post["hashtags"]:
        if hashtag.lower() in top_5_hashtags:
            weekly_hashtag_trends.append({"week": post["week"], "hashtag": hashtag.lower()})

trends_df = pd.DataFrame(weekly_hashtag_trends)
trends_pivot = trends_df.groupby(["week", "hashtag"]).size().unstack(fill_value=0)

# Plot trends
fig, ax = plt.subplots(figsize=(14, 6))
for hashtag in top_5_hashtags:
    if hashtag in trends_pivot.columns:
        ax.plot(
            range(len(trends_pivot)),
            trends_pivot[hashtag].values,
            marker="o",
            linewidth=2,
            label=hashtag,
            alpha=0.7,
        )

ax.set_xlabel("Week")
ax.set_ylabel("Frequency")
ax.set_title("Trending Hashtags Over Time (Top 5)")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("hashtag_trends.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nHashtag trend analysis complete")

## 9. Summary and Key Findings

We synthesize our analysis to identify key insights about the social media network.

In [None]:
# Generate comprehensive summary statistics
summary = {
    "Network Statistics": {
        "Total Users": social_network.number_of_nodes(),
        "Total Connections": social_network.number_of_edges(),
        "Network Density": f"{nx.density(social_network):.4f}",
        "Average Degree": f"{sum(dict(social_network.degree()).values()) / social_network.number_of_nodes():.2f}",
        "Number of Communities": len(set(user_communities.values())),
        "Modularity Score": f"{modularity_score:.4f}",
    },
    "Content Statistics": {
        "Total Posts": len(posts_df),
        "Retweet Percentage": f"{posts_df['is_retweet'].sum() / len(posts_df) * 100:.1f}%",
        "Average Engagement": f"{posts_df['engagement_score'].mean():.2f}",
        "Total Unique Hashtags": len(hashtag_stats),
    },
    "Sentiment Statistics": {
        "Average Sentiment": f"{posts_df['sentiment_compound'].mean():.3f}",
        "Positive Posts": f"{(posts_df['sentiment_category'] == 'positive').sum()} ({(posts_df['sentiment_category'] == 'positive').sum() / len(posts_df) * 100:.1f}%)",
        "Negative Posts": f"{(posts_df['sentiment_category'] == 'negative').sum()} ({(posts_df['sentiment_category'] == 'negative').sum() / len(posts_df) * 100:.1f}%)",
        "Neutral Posts": f"{(posts_df['sentiment_category'] == 'neutral').sum()} ({(posts_df['sentiment_category'] == 'neutral').sum() / len(posts_df) * 100:.1f}%)",
    },
    "Virality Statistics": {
        "Total Cascades": len(retweet_cascades),
        "Average Cascade Size": f"{np.mean(list(cascade_sizes.values())):.2f}"
        if cascade_sizes
        else "0",
        "Max Cascade Size": f"{max(cascade_sizes.values())}" if cascade_sizes else "0",
        "Viral Posts (Top 10%)": f"{int(len(posts_df) * 0.1)}",
    },
}

print("\n" + "=" * 80)
print("SOCIAL MEDIA NETWORK ANALYSIS - COMPREHENSIVE SUMMARY")
print("=" * 80 + "\n")

for category, stats in summary.items():
    print(f"\n{category}:")
    print("-" * 60)
    for metric, value in stats.items():
        print(f"  {metric:.<50} {value}")

print("\n" + "=" * 80)

In [None]:
# Key Findings Report
print("\n" + "=" * 80)
print("KEY FINDINGS AND INSIGHTS")
print("=" * 80 + "\n")

findings = []

# Finding 1: Network Structure
avg_degree = sum(dict(social_network.degree()).values()) / social_network.number_of_nodes()
finding1 = f"""
1. NETWORK STRUCTURE AND CONNECTIVITY
   The network is expected to show a heavy-tailed degree distribution, with a small
   number of highly connected hubs and many peripheral users.

   - Average connections per user: {avg_degree:.2f}
   - Network density: {nx.density(social_network):.4f} (sparse network)
   - This structure is typical of real-world social media platforms
"""
findings.append(finding1)

# Finding 2: Community Structure
finding2 = f"""
2. COMMUNITY FORMATION
   Strong community structure detected with modularity score of {modularity_score:.4f}.
   {len(set(user_communities.values()))} distinct communities identified, suggesting natural
   clustering based on shared interests or interaction patterns.

   - Communities vary significantly in size and activity levels
   - Different communities exhibit distinct sentiment patterns
"""
findings.append(finding2)

# Finding 3: Sentiment Patterns
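pos_pct = (posts_df["sentiment_category"] == "positive").sum() / len(posts_df) * 100
neg_pct = (posts_df["sentiment_category"] == "negative").sum() / len(posts_df) * 100
# Describe the balance in both directions rather than defaulting to "mixed"
if pos_pct > neg_pct:
    overall_sentiment = "more positive than negative"
elif neg_pct > pos_pct:
    overall_sentiment = "more negative than positive"
else:
    overall_sentiment = "evenly split between positive and negative"
finding3 = f"""
3. SENTIMENT DYNAMICS
   Overall sentiment is {overall_sentiment}.

   - Positive posts: {pos_pct:.1f}%
   - Negative posts: {neg_pct:.1f}%
   - Average sentiment score: {posts_df["sentiment_compound"].mean():.3f}
   - Engagement by sentiment category is reported in the engagement patterns analysis
"""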
findings.append(finding3)

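# Finding 4: Influence Patterns
top_10_influence = users_enriched.nlargest(10, "overall_influence")["overall_influence"].mean()
# Measure concentration and correlation instead of asserting them
top_decile = users_enriched.nlargest(max(1, len(users_enriched) // 10), "total_engagement")
top_decile_share = (
    top_decile["total_engagement"].sum() / users_enriched["total_engagement"].sum() * 100
)
follower_engagement_corr = users_enriched["in_degree"].corr(users_enriched["total_engagement"])
finding4 = f"""
4. INFLUENCE AND ENGAGEMENT
   The top 10% of users account for {top_decile_share:.1f}% of total engagement.

   - Top 10 influencers average score: {top_10_influence:.2f}
   - Influence determined by both network position and content engagement
   - Correlation between follower count and total engagement: {follower_engagement_corr:.2f}
"""
findings.append(finding4)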

# Finding 5: Information Diffusion
avg_cascade = np.mean(list(cascade_sizes.values())) if cascade_sizes else 0
finding5 = f"""
5. INFORMATION DIFFUSION PATTERNS
   Content spreads through the network via retweet cascades.

   - Average cascade size: {avg_cascade:.2f}
   - Maximum cascade size: {max(cascade_sizes.values()) if cascade_sizes else 0}
   - Viral content often has emotional (positive or negative) sentiment
   - Community boundaries affect diffusion patterns
"""
findings.append(finding5)

# Finding 6: Topic Trends
top_hashtag = hashtag_stats.iloc[0]
finding6 = f"""
6. TOPIC TRENDS AND HASHTAG USAGE
   Most popular hashtag: {top_hashtag["hashtag"]} (frequency: {int(top_hashtag["frequency"])})

   - {len(hashtag_stats)} unique hashtags identified
   - Hashtag usage is expected to be heavy-tailed (a few hashtags dominate)
   - Different topics generate varying levels of engagement
"""
findings.append(finding6)

for finding in findings:
    print(finding)
    print("-" * 80)

print("\n" + "=" * 80)
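
Concentration statements such as the one in Finding 4 can be checked with a simple top-share calculation. The helper below, `top_share`, is a hypothetical name introduced for illustration, and the engagement values are made up. It reports what fraction of a total is held by the top fraction of entries.

```python
import numpy as np


def top_share(values, top_fraction=0.2):
    """Share of the total held by the top `top_fraction` of entries."""
    # Sort descending so the largest entries come first
    values = np.sort(np.asarray(values, dtype=float))[::-1]
    if values.size == 0 or values.sum() == 0:
        return 0.0
    # Always include at least one entry in the "top" group
    k = max(1, int(np.ceil(top_fraction * values.size)))
    return float(values[:k].sum() / values.sum())


# Made-up per-user engagement totals
toy_engagement = [100, 50, 10, 10, 10, 5, 5, 5, 3, 2]
print(f"Top 20% share: {top_share(toy_engagement):.2f}")
```

Applied to a column such as `users_enriched["total_engagement"]`, a value near 0.8 for the top 20% would be in line with a Pareto-style split, while a value near 0.2 would indicate evenly spread engagement.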

In [None]:
# Research Implications
print("\n" + "=" * 80)
print("RESEARCH IMPLICATIONS AND APPLICATIONS")
print("=" * 80 + "\n")

implications = """
This analysis demonstrates several key methodologies for social media research:

1. NETWORK ANALYSIS
   - Centrality measures identify influential users and key information brokers
   - Community detection reveals organic social clustering patterns
   - Network topology informs understanding of information flow

2. SENTIMENT ANALYSIS
   - VADER provides lexicon-based sentiment scoring tuned for short social media text
   - Temporal sentiment tracking reveals mood shifts and events
   - Community-level sentiment differences highlight subculture dynamics

3. INFLUENCE MEASUREMENT
   - Combining network centrality with engagement metrics provides a holistic influence score
   - Different types of influence (network vs content) can be distinguished
   - Influence prediction enables targeted intervention strategies

4. INFORMATION DIFFUSION
   - Cascade analysis quantifies viral spread patterns
   - Identifies factors associated with viral content
   - Helps predict and model information propagation

5. PRACTICAL APPLICATIONS
   - Marketing: Identify influencers and optimal content strategies
   - Public Health: Track health information dissemination
   - Politics: Understand political discourse and polarization
   - Crisis Communication: Monitor sentiment during emergencies
   - Brand Management: Track brand perception and reputation

6. METHODOLOGICAL CONSIDERATIONS
   - Synthetic data allows safe experimentation and method development
   - Real data requires careful ethical consideration and privacy protection
   - Longitudinal analysis provides deeper insights than cross-sectional analysis
   - A multi-method approach (network + sentiment + content) yields a richer understanding

7. FUTURE DIRECTIONS
   - Temporal network analysis to track evolution over time
   - Natural language processing for deeper content analysis
   - Machine learning for prediction and classification
   - Cross-platform analysis for comprehensive social media understanding
   - Causal inference to understand drivers of influence and virality
"""

print(implications)
print("\n" + "=" * 80)

print("\nAnalysis Complete!")
print("\nAll visualizations saved:")
print("  - network_centrality_distributions.png")
print("  - social_network_visualization.png")
print("  - community_detection.png")
print("  - sentiment_analysis.png")
print("  - sentiment_by_community.png")
print("  - influence_analysis.png")
print("  - information_diffusion.png")
print("  - virality_by_community.png")
print("  - hashtag_analysis.png")
print("  - hashtag_trends.png")

In [None]:
# Save processed data for future analysis
print("\nSaving processed datasets...")

# Save enriched user data
users_enriched.to_csv("users_enriched.csv", index=False)
print("  - users_enriched.csv")

# Save posts with sentiment
posts_df.to_csv("posts_with_sentiment.csv", index=False)
print("  - posts_with_sentiment.csv")

# Save hashtag statistics
hashtag_stats.to_csv("hashtag_statistics.csv", index=False)
print("  - hashtag_statistics.csv")

# Save network
nx.write_gexf(social_network, "social_network.gexf")
print("  - social_network.gexf (for Gephi or other network tools)")

print("\nAll data saved successfully!")
print("\nThis notebook provides a comprehensive framework for social media analysis.")
print("Adapt these methods for your own research questions and datasets.")
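
One detail to keep in mind when reloading the saved CSV files: list-valued columns are written as their string representation. Assuming the `hashtags` column holds Python lists, the sketch below shows a round trip on made-up data and one way to parse the strings back with `ast.literal_eval`. An in-memory buffer stands in for the file on disk.

```python
import ast
import io

import pandas as pd

# Made-up posts with a list-valued column
toy_posts = pd.DataFrame({"post_id": ["p1", "p2"], "hashtags": [["#ai", "#data"], []]})

# Write to an in-memory buffer instead of disk for this illustration
buffer = io.StringIO()
toy_posts.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
type_after_reload = type(reloaded.loc[0, "hashtags"])
print(type_after_reload)  # list columns come back as plain strings

# Parse the string representation back into Python lists
reloaded["hashtags"] = reloaded["hashtags"].apply(ast.literal_eval)
print(reloaded.loc[0, "hashtags"])
```

Formats that preserve nested types, such as Parquet or pickle, avoid the parsing step, at the cost of being less portable than CSV.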