# Network Analysis: Information Spread and Influence

This notebook demonstrates social network analysis techniques to understand how information spreads through social media platforms.

## Objectives

1. Build interaction networks from social media data
2. Calculate centrality metrics (degree, betweenness, eigenvector)
3. Identify influential users and information hubs
4. Detect communities and echo chambers
5. Analyze network structure and properties
6. Visualize information flow patterns

## Network Concepts

### Node Centrality Metrics

**Degree Centrality**: Number of direct connections a node has (NetworkX reports it as a fraction of the n - 1 possible connections)
- High degree = many interactions
- Indicates popular/active users

**Betweenness Centrality**: Share of shortest paths between other node pairs that pass through a node
- High betweenness = information bridge
- Indicates users connecting different communities
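
To make degree and betweenness concrete, here is a minimal plain-Python sketch on a hypothetical six-node graph: two triangles joined by a single edge. The `ADJ` dictionary and both helper functions are illustrative assumptions, not part of this notebook's pipeline. The betweenness routine follows Brandes' algorithm for unweighted graphs, and the scaling is intended to match the normalized values NetworkX reports for undirected graphs.

```python
from collections import deque

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the edge c-d.
ADJ = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def betweenness_centrality(adj):
    """Normalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    # Each unordered pair was counted from both endpoints; dividing by
    # (n - 1)(n - 2) scales the scores to the range [0, 1].
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in score.items()}

degree = degree_centrality(ADJ)
betweenness = betweenness_centrality(ADJ)
for node in ADJ:
    print(f"{node}: degree={degree[node]:.2f}, betweenness={betweenness[node]:.2f}")
```

In this toy graph `c` and `d` have only one more connection than the other nodes, yet they are the only nodes with nonzero betweenness, because every shortest path between the two triangles runs through them.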

**Eigenvector Centrality**: A node's score grows with the scores of the nodes it is connected to
- High eigenvector = influence
- Indicates users connected to other influential users
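
The recursive definition of eigenvector centrality can be sketched with power iteration: start every node at the same score, repeatedly replace each score with the sum of its neighbors' scores, and renormalize. The version below is an illustrative, undirected, unweighted simplification. The toy graph `ADJ` and the helper `eigenvector_centrality` are hypothetical names, and the notebook itself relies on `nx.eigenvector_centrality` instead.

```python
import math

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the edge c-d.
ADJ = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Power iteration on an undirected adjacency dict (illustrative sketch)."""
    score = dict.fromkeys(adj, 1.0)
    for _ in range(max_iter):
        # Each node's new score is the sum of its neighbors' current scores.
        new = {v: sum(score[w] for w in adj[v]) for v in adj}
        # Rescale to unit Euclidean length so the values stay bounded.
        norm = math.sqrt(sum(x * x for x in new.values()))
        new = {v: x / norm for v, x in new.items()}
        change = sum(abs(new[v] - score[v]) for v in adj)
        score = new
        if change < tol:
            return score
    raise RuntimeError("power iteration did not converge")

scores = eigenvector_centrality(ADJ)
for node, value in scores.items():
    print(f"{node}: {value:.4f}")
```

The bridge nodes `c` and `d` end up with the highest scores, since each is linked to the other bridge node as well as to its own triangle.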

### Community Detection

Identifies clusters of densely connected users:
- Echo chambers
- Interest groups
- Coordinated campaigns
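
Community detection methods such as Louvain and greedy modularity search for a partition that scores well on modularity: the share of edges that fall inside communities, minus the share expected if edges were placed at random with the same node degrees. Below is a minimal sketch of that score for an unweighted, undirected graph. `EDGES`, `modularity`, and the two example partitions are hypothetical and only illustrate the formula.

```python
from collections import defaultdict

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the edge c-d.
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]

def modularity(edges, partition):
    """Modularity Q = sum over communities c of L_c / m - (d_c / (2m))^2."""
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both endpoints in c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in c
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Partition that splits the graph at the bridge edge c-d.
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Partition that puts every node into one community.
single = dict.fromkeys(split, 0)

q_split = modularity(EDGES, split)
q_single = modularity(EDGES, single)
print(f"Two communities: Q = {q_split:.3f}")
print(f"One community:   Q = {q_single:.3f}")
```

Splitting at the bridge gives Q = 5/14 (about 0.357), while lumping all nodes together gives exactly 0, which is why a modularity-based method would prefer the two-triangle split.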

## Use Cases

- **Misinformation Spread**: Track how false information propagates
- **Influence Mapping**: Identify key opinion leaders
- **Echo Chambers**: Detect polarized communities
- **Coordination Detection**: Find coordinated inauthentic behavior

## 1. Setup

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from dotenv import load_dotenv
from pathlib import Path
from collections import Counter

sys.path.insert(0, str(Path('..').resolve()))

from social_media_analysis import SocialMediaDataAccess
from social_media_analysis.network_analysis import (
    build_interaction_network,
    calculate_centrality,
    detect_communities
)
from social_media_analysis.visualization import plot_network_graph

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('tab10')
%matplotlib inline

print("✓ Imports successful")

In [None]:
# Load configuration
load_dotenv(Path('..') / '.env')

DATA_BUCKET = os.getenv('DATA_BUCKET')
RESULTS_BUCKET = os.getenv('RESULTS_BUCKET')

print("Configuration loaded")

## 2. Load and Prepare Data

In [None]:
# Load sample data
df = pd.read_csv('../../studio-lab/sample_data.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(f"Loaded {len(df)} posts")
df.head()

In [None]:
# Network analysis needs interaction data (who interacted with whom).
# The sample data has no retweet/reply relationships, so we simulate
# random interaction edges between users. In production, you would
# extract actual retweets, mentions, and replies instead.

np.random.seed(42)  # Fixed seed so the synthetic network is reproducible
users = df['user_id'].unique()
n_edges = min(50, len(df) * 2)  # About 2 draws per post, capped at 50

edges = []
for _ in range(n_edges):
    source = np.random.choice(users)
    target = np.random.choice(users)
    if source != target:
        edges.append((source, target))

# Create edge DataFrame
edges_df = pd.DataFrame(edges, columns=['source', 'target'])
edges_df['weight'] = 1  # Each interaction has weight 1

# Aggregate to count multiple interactions
edges_df = edges_df.groupby(['source', 'target']).sum().reset_index()

print(f"Created {len(edges_df)} interaction edges between {len(users)} users")
edges_df.head()
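
For real data, the synthetic edges above would be replaced by observed interactions. As one illustration, the sketch below extracts weighted mention edges from post text. It assumes each post is a dict with hypothetical `user_id` and `text` fields and that handles match `@` followed by word characters; neither assumption is confirmed for the sample dataset, and the simple pattern would also match the domain part of an email address.

```python
import re
from collections import Counter

# Assumed handle syntax: "@" followed by letters, digits, or underscores.
MENTION_RE = re.compile(r"@(\w+)")

def mention_edges(posts):
    """Return sorted (source, target, weight) tuples, one per author/mention pair."""
    counts = Counter()
    for post in posts:
        author = post["user_id"]
        for mentioned in MENTION_RE.findall(post["text"]):
            if mentioned != author:  # skip self-mentions
                counts[(author, mentioned)] += 1
    return [(src, dst, weight) for (src, dst), weight in sorted(counts.items())]

# Hypothetical posts, for illustration only.
posts = [
    {"user_id": "alice", "text": "@bob see this, cc @carol"},
    {"user_id": "alice", "text": "thanks @bob"},
    {"user_id": "bob", "text": "@bob note to self"},
]

edges = mention_edges(posts)
print(edges)  # [('alice', 'bob', 2), ('alice', 'carol', 1)]
```

The resulting tuples have the same `source`, `target`, `weight` shape as `edges_df`, so they could be loaded into a DataFrame with those column names.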

## 3. Build Network Graph

In [None]:
# Create directed graph
G = nx.from_pandas_edgelist(
    edges_df,
    source='source',
    target='target',
    edge_attr='weight',
    create_using=nx.DiGraph()
)

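print("Network Statistics:")
print(f"  Nodes: {G.number_of_nodes()}")
print(f"  Edges: {G.number_of_edges()}")
print(f"  Density: {nx.density(G):.4f}")

is_connected = nx.is_weakly_connected(G)
print(f"  Is Weakly Connected: {is_connected}")

if is_connected:
    # nx.diameter raises on a DiGraph that is not strongly connected, so
    # path metrics are computed on an undirected copy (direction ignored).
    G_paths = G.to_undirected()
    print(f"  Average Shortest Path (undirected): {nx.average_shortest_path_length(G_paths):.2f}")
    print(f"  Diameter (undirected): {nx.diameter(G_paths)}")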

## 4. Calculate Centrality Metrics

In [None]:
# Calculate centrality metrics
print("Calculating centrality metrics...")

# Degree centrality
degree_centrality = nx.degree_centrality(G)
in_degree_centrality = nx.in_degree_centrality(G)
out_degree_centrality = nx.out_degree_centrality(G)

# Betweenness centrality
betweenness_centrality = nx.betweenness_centrality(G)

# Eigenvector centrality (power iteration may not converge for all graphs)
try:
    eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)
except nx.PowerIterationFailedConvergence:
    print("  ⚠️ Eigenvector centrality failed to converge, using PageRank instead")
    eigenvector_centrality = nx.pagerank(G)

# PageRank (Google's algorithm)
pagerank = nx.pagerank(G)

print("✓ Centrality metrics calculated")

In [None]:
# Create centrality DataFrame
centrality_df = pd.DataFrame({
    'user_id': list(G.nodes()),
    'degree': [degree_centrality[node] for node in G.nodes()],
    'in_degree': [in_degree_centrality[node] for node in G.nodes()],
    'out_degree': [out_degree_centrality[node] for node in G.nodes()],
    'betweenness': [betweenness_centrality[node] for node in G.nodes()],
    'eigenvector': [eigenvector_centrality[node] for node in G.nodes()],
    'pagerank': [pagerank[node] for node in G.nodes()]
})

centrality_df = centrality_df.sort_values('pagerank', ascending=False)

print("Top 10 Most Influential Users (by PageRank):")
print(centrality_df.head(10))
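
The ranking above relies on PageRank, which models a reader who follows a random outgoing link with probability `damping` and otherwise jumps to a random node. The following is a minimal, unweighted sketch of that iteration, not the NetworkX implementation: `pagerank_sketch` and the `OUT_LINKS` graph are hypothetical, and `nx.pagerank` additionally takes edge weights into account, which this version ignores.

```python
def pagerank_sketch(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict of node -> list of link targets.

    Every target must also appear as a key. Nodes without outgoing links
    spread their rank evenly over all nodes.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(max_iter):
        # Rank held by dead-end nodes is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(nodes, base)
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical graph: b, c, and d all point at a; d also points at b.
OUT_LINKS = {"a": [], "b": ["a"], "c": ["a"], "d": ["a", "b"]}

ranks = pagerank_sketch(OUT_LINKS)
for node, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}: {value:.4f}")
```

Node `a` ranks first because it receives links from every other node, and the scores sum to 1, so each can be read as a share of total attention.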

## 5. Visualize Centrality Metrics

In [None]:
# Centrality distribution plots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

metrics = ['degree', 'in_degree', 'out_degree', 'betweenness', 'eigenvector', 'pagerank']
colors = ['skyblue', 'lightgreen', 'coral', 'violet', 'gold', 'crimson']

for idx, metric in enumerate(metrics):
    axes[idx].hist(centrality_df[metric], bins=20, color=colors[idx], 
                  edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{metric.replace("_", " ").title()} Distribution')
    axes[idx].set_xlabel(metric.replace('_', ' ').title())
    axes[idx].set_ylabel('Frequency')
    axes[idx].axvline(centrality_df[metric].mean(), color='red', 
                     linestyle='--', label=f'Mean: {centrality_df[metric].mean():.4f}')
    axes[idx].legend(fontsize=8)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation between centrality metrics
correlation = centrality_df[metrics].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation, annot=True, fmt='.3f', cmap='coolwarm', 
            square=True, ax=ax, cbar_kws={'label': 'Correlation'})
ax.set_title('Correlation Matrix: Centrality Metrics', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 6. Identify Key Influencers

In [None]:
# Top influencers by different metrics
top_n = 5

print("=" * 80)
print("KEY INFLUENCERS BY METRIC")
print("=" * 80)

for metric in ['degree', 'betweenness', 'pagerank']:
    top_users = centrality_df.nlargest(top_n, metric)[['user_id', metric]]
    print(f"\nTop {top_n} by {metric.replace('_', ' ').title()}:")
    for idx, row in top_users.iterrows():
        print(f"  {row['user_id']}: {row[metric]:.4f}")

print("=" * 80)

In [None]:
# Categorize users by role
# Compute each 90th-percentile threshold once rather than on every row
thresholds = centrality_df[['pagerank', 'betweenness', 'in_degree', 'out_degree']].quantile(0.9)

def categorize_user(row):
    """Return the first role whose metric exceeds its 90th-percentile threshold."""
    if row['pagerank'] > thresholds['pagerank']:
        return 'Influencer'
    elif row['betweenness'] > thresholds['betweenness']:
        return 'Bridge'
    elif row['in_degree'] > thresholds['in_degree']:
        return 'Popular'
    elif row['out_degree'] > thresholds['out_degree']:
        return 'Active'
    else:
        return 'Regular'

centrality_df['role'] = centrality_df.apply(categorize_user, axis=1)

print("User Role Distribution:")
print(centrality_df['role'].value_counts())

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
role_counts = centrality_df['role'].value_counts()
role_counts.plot(kind='bar', ax=ax, color='teal', alpha=0.7)
ax.set_title('User Role Distribution in Network')
ax.set_xlabel('Role')
ax.set_ylabel('Number of Users')
ax.tick_params(axis='x', rotation=45)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## 7. Community Detection

In [None]:
# Convert to undirected for community detection
G_undirected = G.to_undirected()

# Detect communities with the Louvain method if python-louvain is installed
try:
    import community as community_louvain
    partition = community_louvain.best_partition(G_undirected)
    print("Using Louvain method")
except ImportError:
    # Fallback to greedy modularity
    print("Using greedy modularity method (install python-louvain for better results)")
    communities = list(nx.community.greedy_modularity_communities(G_undirected))
    partition = {}
    for idx, community in enumerate(communities):
        for node in community:
            partition[node] = idx

# Add community to centrality DataFrame
centrality_df['community'] = centrality_df['user_id'].map(partition)

n_communities = len(set(partition.values()))
print(f"\n✓ Detected {n_communities} communities")
print(f"\nCommunity Size Distribution:")
print(centrality_df['community'].value_counts().sort_index())

In [None]:
# Visualize community sizes
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
community_sizes = centrality_df['community'].value_counts().sort_index()
community_sizes.plot(kind='bar', ax=axes[0], color='steelblue', alpha=0.7)
axes[0].set_title('Community Size Distribution')
axes[0].set_xlabel('Community ID')
axes[0].set_ylabel('Number of Users')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Pie chart
axes[1].pie(community_sizes, labels=[f'Community {i}' for i in community_sizes.index],
           autopct='%1.1f%%', startangle=90)
axes[1].set_title('Community Composition')

plt.tight_layout()
plt.show()

In [None]:
# Analyze each community
print("Community Analysis:\n")

for community_id in sorted(centrality_df['community'].unique()):
    community_users = centrality_df[centrality_df['community'] == community_id]
    
    print(f"Community {community_id}:")
    print(f"  Size: {len(community_users)} users")
    print(f"  Average PageRank: {community_users['pagerank'].mean():.4f}")
    print(f"  Average Betweenness: {community_users['betweenness'].mean():.4f}")
    
    # Top influencer in community
    top_user = community_users.nlargest(1, 'pagerank').iloc[0]
    print(f"  Top Influencer: {top_user['user_id']} (PageRank: {top_user['pagerank']:.4f})")
    print()

## 8. Visualize Network

In [None]:
# Network visualization with communities
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Layout
pos = nx.spring_layout(G_undirected, k=0.5, iterations=50, seed=42)

# Left: Color by community
node_colors_community = [partition[node] for node in G_undirected.nodes()]
node_sizes = [pagerank[node] * 3000 for node in G_undirected.nodes()]

nx.draw_networkx(
    G_undirected, pos, ax=axes[0],
    node_color=node_colors_community,
    node_size=node_sizes,
    cmap='tab10',
    with_labels=False,
    edge_color='gray',
    alpha=0.7,
    width=0.5
)
axes[0].set_title('Network Colored by Community\n(Node size = PageRank)', fontsize=12)
axes[0].axis('off')

# Right: Color by PageRank
node_colors_pagerank = [pagerank[node] for node in G_undirected.nodes()]

nx.draw_networkx(
    G_undirected, pos, ax=axes[1],
    node_color=node_colors_pagerank,
    node_size=node_sizes,
    cmap='Reds',
    with_labels=False,
    edge_color='gray',
    alpha=0.7,
    width=0.5
)
axes[1].set_title('Network Colored by PageRank\n(Redder = More Influential)', fontsize=12)
axes[1].axis('off')

plt.tight_layout()
plt.show()

## 9. Degree Distribution (Power Law)

In [None]:
# Degree distribution analysis
degrees = [G.degree(node) for node in G.nodes()]
degree_counts = Counter(degrees)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear scale
axes[0].bar(degree_counts.keys(), degree_counts.values(), color='steelblue', alpha=0.7)
axes[0].set_title('Degree Distribution (Linear Scale)')
axes[0].set_xlabel('Degree')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, alpha=0.3, axis='y')

# Log-log scale (a power-law distribution appears roughly as a straight line)
axes[1].scatter(list(degree_counts.keys()), list(degree_counts.values()), 
               color='crimson', alpha=0.7, s=50)
axes[1].set_title('Degree Distribution (Log-Log Scale)')
axes[1].set_xlabel('Degree (log scale)')
axes[1].set_ylabel('Frequency (log scale)')
axes[1].set_xscale('log')
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3, which='both')

plt.tight_layout()
plt.show()

print(f"Average Degree: {np.mean(degrees):.2f}")
print(f"Max Degree: {max(degrees)}")
print(f"Min Degree: {min(degrees)}")
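
Raw degree counts get noisy in the tail, where most degree values occur once or not at all. A common alternative for log-log plots is the complementary cumulative distribution: the share of nodes with degree at least k. The sketch below assumes a plain list of integer degrees like `degrees` above; `degree_ccdf` and the sample list are hypothetical.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Return (k, share of nodes with degree >= k) for each observed degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # nodes with degree >= the current k
    ccdf = []
    for k in sorted(counts):
        ccdf.append((k, remaining / n))
        remaining -= counts[k]
    return ccdf

# Hypothetical degree sequence, for illustration only.
sample_degrees = [1, 1, 1, 2, 2, 4]

ccdf = degree_ccdf(sample_degrees)
for k, share in ccdf:
    print(f"P(degree >= {k}) = {share:.3f}")
```

Plotting these pairs on log-log axes gives a monotonically decreasing curve, which is usually easier to compare against a straight line than the scattered raw counts.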

## 10. Information Diffusion Simulation

In [None]:
# Simulate information spread from top influencer
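def simulate_diffusion(G, source, steps=5, transmission_prob=0.3):
    """Simulate information diffusion through the network.

    Each newly reached node gets one chance per out-neighbor to pass the
    information on, succeeding with probability `transmission_prob`.
    Returns the set of reached nodes and a dict of node -> step reached.
    """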
    infected = {source}
    newly_infected = {source}
    infection_time = {source: 0}
    
    for step in range(1, steps + 1):
        next_infected = set()
        for node in newly_infected:
            # Try to infect neighbors
            for neighbor in G.neighbors(node):
                if neighbor not in infected and np.random.random() < transmission_prob:
                    next_infected.add(neighbor)
                    infection_time[neighbor] = step
        
        infected.update(next_infected)
        newly_infected = next_infected
        
        if not newly_infected:
            break
    
    return infected, infection_time

# Run simulation from top influencer
top_influencer = centrality_df.nlargest(1, 'pagerank').iloc[0]['user_id']
infected, infection_time = simulate_diffusion(G, top_influencer, steps=10, transmission_prob=0.4)

print(f"Information Diffusion Simulation:")
print(f"  Source: {top_influencer}")
print(f"  Total Reached: {len(infected)} users ({len(infected)/G.number_of_nodes()*100:.1f}%)")
print(f"  Number of Steps: {max(infection_time.values())}")

In [None]:
# Visualize diffusion over time
time_steps = sorted(set(infection_time.values()))
cumulative_infected = [sum(1 for t in infection_time.values() if t <= step) for step in time_steps]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(time_steps, cumulative_infected, marker='o', linewidth=2, markersize=8, color='darkred')
ax.fill_between(time_steps, cumulative_infected, alpha=0.3, color='red')
ax.set_title(f'Information Diffusion from Top Influencer ({top_influencer})', fontsize=14)
ax.set_xlabel('Time Step')
ax.set_ylabel('Cumulative Users Reached')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
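
A single run of the simulation above is one random outcome, so the reported reach can vary considerably from run to run. The process resembles the independent cascade model, in which each newly reached node gets one attempt per neighbor. A more stable estimate averages the reach over many runs, as in the sketch below. The adjacency-dict input, both function names, and the `OUT_LINKS` example are assumptions made to keep the sketch free of dependencies.

```python
import random

def simulate_reach(out_links, source, transmission_prob, rng, steps=10):
    """Run one cascade and return the number of nodes reached."""
    reached = {source}
    frontier = {source}
    for _ in range(steps):
        next_frontier = set()
        # Sort so the sequence of random draws does not depend on set order.
        for node in sorted(frontier):
            for neighbor in out_links.get(node, []):
                if neighbor not in reached and rng.random() < transmission_prob:
                    next_frontier.add(neighbor)
        if not next_frontier:
            break
        reached |= next_frontier
        frontier = next_frontier
    return len(reached)

def mean_reach(out_links, source, transmission_prob, runs=1000, seed=42):
    """Average cascade size over `runs` independent simulations."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    total = sum(
        simulate_reach(out_links, source, transmission_prob, rng)
        for _ in range(runs)
    )
    return total / runs

# Hypothetical directed graph, for illustration only.
OUT_LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}

for prob in (0.2, 0.4, 0.8):
    print(f"p={prob}: mean reach = {mean_reach(OUT_LINKS, 'a', prob):.2f} nodes")
```

With the notebook's graph, an adjacency dict of this shape could be built as `{node: list(G.successors(node)) for node in G}`.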

## 11. Save Results

In [None]:
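# Save centrality analysis
output_file = '../../results/network_centrality.csv'
Path(output_file).parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories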
centrality_df.to_csv(output_file, index=False)
print(f"✓ Centrality analysis saved to {output_file}")

# Save community assignments
community_file = '../../results/community_assignments.csv'
centrality_df[['user_id', 'community', 'role']].to_csv(community_file, index=False)
print(f"✓ Community assignments saved to {community_file}")

# Save network summary
network_summary = pd.DataFrame({
    'metric': [
        'nodes', 'edges', 'density', 'communities', 
        'avg_degree', 'avg_pagerank', 'avg_betweenness'
    ],
    'value': [
        G.number_of_nodes(),
        G.number_of_edges(),
        nx.density(G),
        n_communities,
        np.mean(degrees),
        centrality_df['pagerank'].mean(),
        centrality_df['betweenness'].mean()
    ]
})

summary_file = '../../results/network_summary.csv'
network_summary.to_csv(summary_file, index=False)
print(f"✓ Network summary saved to {summary_file}")

# Uncomment to save to S3
# data_client = SocialMediaDataAccess()
# data_client.save_results(centrality_df, 'network_centrality.csv')
# data_client.save_results(network_summary, 'network_summary.csv')

## Key Findings

### Network Structure
- **Nodes**: [X] users
- **Edges**: [Y] interactions
- **Density**: [Z] (0 = no edges, 1 = every possible edge present)
- **Communities**: [N] detected

### Top Influencers
1. **User [ID]**: PageRank = [value], Role = [Influencer/Bridge/etc.]
2. **User [ID]**: PageRank = [value], Role = [Influencer/Bridge/etc.]
3. **User [ID]**: PageRank = [value], Role = [Influencer/Bridge/etc.]

### Community Insights
- Largest community: [size] users
- Smallest community: [size] users
- Communities show [high/medium/low] modularity

### Information Spread
- From top influencer, reached [X]% of network in [Y] steps
- Average path length: [Z] hops
- Network shows [small-world/scale-free] properties

## Research Applications

### Misinformation Campaigns
- Identify coordinated amplification
- Track narrative spread patterns
- Find inauthentic coordination

### Influence Operations
- Map state-sponsored accounts
- Detect bot networks
- Analyze propaganda diffusion

### Public Health
- Track health misinformation
- Identify trusted health communicators
- Target intervention strategies

### Political Polarization
- Detect echo chambers
- Measure cross-community interaction
- Identify bridging accounts
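
One simple way to measure cross-community interaction is the share of edges whose endpoints belong to different communities. The sketch below assumes an edge list and a node-to-community mapping shaped like the `partition` dict built in the community detection section. `cross_community_share` and the example data are hypothetical.

```python
def cross_community_share(edges, partition):
    """Fraction of edges whose endpoints fall in different communities."""
    if not edges:
        return 0.0
    crossing = sum(1 for u, v in edges if partition[u] != partition[v])
    return crossing / len(edges)

# Hypothetical toy graph: two triangles joined by the edge c-d.
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
PARTITION = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

share = cross_community_share(EDGES, PARTITION)
print(f"Cross-community edges: {share:.1%}")
```

A value near 0 indicates that communities rarely interact, which is consistent with echo chambers, while higher values indicate more mixing. For this notebook's graph, the edge list could be taken from `G.edges()`.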

## Recommendations

1. **Monitor Influencers**: Track top 10 users for narrative changes
2. **Watch Bridge Accounts**: High-betweenness users relay content between communities, so review them when checking for coordination
3. **Community Engagement**: Tailor messaging to community characteristics
4. **Early Detection**: Monitor influencers for misinformation first
5. **Counter-Messaging**: Target bridge users to reach multiple communities

## Further Analysis

1. **Temporal Networks**: Track network evolution over time
2. **Bot Detection**: Identify automated accounts by network behavior
3. **Sentiment Analysis**: Overlay sentiment on network structure
4. **Cross-Platform**: Analyze networks across Twitter, Reddit, Facebook
5. **Predictive Modeling**: Forecast information cascade size