# Sentiment Analysis of Social Media During Health Crises
## Archetype 3: The Socio-behavioral Scientist

This notebook introduces sentiment analysis of social media data to understand public perceptions and behaviors during health crises. Students will analyze simulated tweet data to quantify public opinion over time and identify key themes like trust, fear, and misinformation.

### Learning Objectives:
- Understand how social media reflects and shapes public health behaviors
- Learn natural language processing techniques for sentiment analysis
- Analyze temporal patterns in public opinion during health crises
- Identify factors that influence public trust and compliance
- Connect social media sentiment to real-world health outcomes

### Key Concepts:
- **Risk perception**: How individuals assess and respond to health threats
- **Health communication**: Strategies for conveying health information effectively
- **Social amplification of risk**: How social processes can amplify or attenuate risk perceptions
- **Infodemic**: An overabundance of information, accurate or not, that accompanies a disease outbreak
- **Behavioral change models**: Frameworks for understanding health behavior adoption

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import re
import random
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# For text processing and sentiment analysis
from textblob import TextBlob
import nltk
from wordcloud import WordCloud

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Set up plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

print("📱 Libraries loaded successfully for social media sentiment analysis")

## Part 1: Understanding Social Media and Health Behavior

Social media platforms have become primary sources of health information for many people. During health crises, these platforms can:

1. **Spread accurate information** quickly to large audiences
2. **Amplify misinformation** and conspiracy theories
3. **Reflect public sentiment** and concerns in real time
4. **Influence health behaviors** through social proof and peer pressure
5. **Create echo chambers** that reinforce existing beliefs

Understanding these dynamics is crucial for public health communication and behavior change interventions.

In [None]:
# Generate synthetic tweet data simulating a vaccination campaign
# This represents what you might collect from Twitter's API during a real health crisis

def generate_synthetic_tweets(n_tweets=5000, duration_days=180):
    """
    Generate synthetic tweet data for vaccination campaign analysis
    """
    np.random.seed(42)  # For reproducibility
    
    # Define different phases of the vaccination campaign
    phases = {
        'announcement': {'days': (0, 30), 'sentiment_trend': 0.2, 'volume_multiplier': 1.5},
        'early_rollout': {'days': (31, 90), 'sentiment_trend': -0.1, 'volume_multiplier': 2.0},
        'side_effects_news': {'days': (91, 120), 'sentiment_trend': -0.4, 'volume_multiplier': 2.5},
        'stabilization': {'days': (121, 180), 'sentiment_trend': 0.1, 'volume_multiplier': 1.2}
    }
    
    # Template tweets for different sentiment categories
    tweet_templates = {
        'positive': [
            "Just got my vaccine! Feeling grateful for science and healthcare workers! 💉🙏 #VaccinesWork",
            "Vaccination sites are running smoothly. Great organization! #PublicHealth",
            "So relieved to finally get vaccinated. Looking forward to seeing family again! #Hope",
            "The vaccine gives me hope for the future. Science for the win! 🧬 #TrustScience",
            "Proud to do my part to protect my community. #CommunityFirst #Vaccines",
            "No side effects after my shot. Feel great! #VaccineExperience",
            "Healthcare workers deserve all our gratitude. Thank you! 👩‍⚕️👨‍⚕️ #Heroes"
        ],
        'negative': [
            "I'm worried about the long-term effects. Too rushed? #VaccineHesitancy",
            "My friend had bad side effects. Makes me nervous... #Concerns",
            "Why should I trust something developed so quickly? #Questions",
            "Natural immunity is better than artificial. #NaturalHealth",
            "The government shouldn't mandate medical procedures #Freedom",
            "Too many conflicting messages from experts #Confusion",
            "I'll wait and see what happens to others first #WaitAndSee"
        ],
        'neutral': [
            "Vaccination appointment scheduled for next week. #Update",
            "Looking into vaccine options available in my area #Research",
            "Reading the latest studies on vaccine effectiveness #Science",
            "Discussing vaccines with my doctor tomorrow #Healthcare",
            "Vaccine rollout continues in our state #News",
            "Different vaccines have different efficacy rates #Data",
            "Checking eligibility requirements for vaccination #Info"
        ],
        'fearful': [
            "I'm scared about potential side effects 😰 #Fear",
            "What if something goes wrong? So many unknowns... #Anxiety",
            "Seeing reports of adverse events. Terrifying! #Scared",
            "I don't know what to believe anymore 😢 #Confused",
            "This whole situation is overwhelming #Stress",
            "Afraid of making the wrong choice for my family #ParentWorries"
        ],
        'misinformation': [
            "Vaccines contain microchips for tracking #Conspiracy",
            "Big pharma just wants to make money #FollowTheMoney",
            "Vaccines cause autism - do your research! #Truth",
            "This is all about population control #WakeUp",
            "Natural remedies work better than vaccines #Alternative",
            "The media is hiding vaccine deaths #CoverUp"
        ]
    }
    
    tweets = []
    start_date = datetime(2023, 1, 1)
    
    for day in range(duration_days):
        current_date = start_date + timedelta(days=day)
        
        # Determine current phase
        current_phase = None
        for phase_name, phase_data in phases.items():
            if phase_data['days'][0] <= day <= phase_data['days'][1]:
                current_phase = phase_data
                break
        
        # Base daily tweet volume
        base_volume = int(n_tweets / duration_days)
        daily_volume = int(base_volume * current_phase['volume_multiplier'])
        
        # Add some random variation (randint's upper bound is exclusive, so 6 allows +5)
        daily_volume = max(1, daily_volume + np.random.randint(-5, 6))
        
        for _ in range(daily_volume):
            # Determine sentiment based on phase
            sentiment_bias = current_phase['sentiment_trend']
            
            # Base probabilities for each sentiment
            probs = {
                'positive': 0.3 + sentiment_bias,
                'negative': 0.25 - sentiment_bias/2,
                'neutral': 0.25,
                'fearful': 0.15 - sentiment_bias/3,
                'misinformation': 0.05
            }
            
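            # Clip negative weights to zero first, then normalize so the values sum to 1.
            # Normalizing before clipping leaves a total above 1 whenever a weight is
            # negative (e.g. sentiment_trend = -0.4), which np.random.choice rejects.
            probs = {k: max(0.0, v) for k, v in probs.items()}
            total = sum(probs.values())
            probs = {k: v / total for k, v in probs.items()}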
            
            # Select sentiment category
            sentiment_cat = np.random.choice(list(probs.keys()), p=list(probs.values()))
            
            # Select random tweet from category
            tweet_text = np.random.choice(tweet_templates[sentiment_cat])
            
            # Add some variation to tweet text
            variations = ['', ' 🤔', ' 💭', ' 👍', ' 👎', ' ❤️', ' 😷']
            tweet_text += np.random.choice(variations)
            
            tweets.append({
                'date': current_date,
                'text': tweet_text,
                'sentiment_category': sentiment_cat,
                'day': day,
                'phase': list(phases.keys())[list(phases.values()).index(current_phase)],
                'user_id': f"user_{np.random.randint(1, 1000)}",
                'retweets': np.random.poisson(5),
                'likes': np.random.poisson(15)
            })
    
    return pd.DataFrame(tweets)

# Generate the dataset
tweets_df = generate_synthetic_tweets()
print(f"📊 Generated {len(tweets_df):,} synthetic tweets over {tweets_df['day'].max()+1} days")
print(f"🗓️ Date range: {tweets_df['date'].min().date()} to {tweets_df['date'].max().date()}")
print(f"\n📝 Sample tweets:")
for i, row in tweets_df.sample(3).iterrows():
    print(f"- {row['text']} [{row['sentiment_category']}]")
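
The category probabilities in the generator depend on each phase's `sentiment_trend`, and a strongly negative trend can push a raw weight below zero. The sketch below shows one way to turn such weights into a valid distribution: clip negatives to zero, then renormalize. The helper name `category_probabilities` is hypothetical and the weights simply mirror the generator; it is an illustration, not the only reasonable scheme.

```python
def category_probabilities(sentiment_bias):
    """Turn phase-adjusted weights into a valid probability distribution."""
    # Raw weights mirror the generator; some may be negative for a large negative bias
    raw = {
        'positive': 0.3 + sentiment_bias,
        'negative': 0.25 - sentiment_bias / 2,
        'neutral': 0.25,
        'fearful': 0.15 - sentiment_bias / 3,
        'misinformation': 0.05,
    }
    # Clip first so no category keeps a negative weight
    clipped = {k: max(0.0, v) for k, v in raw.items()}
    # Then renormalize so the values sum to 1
    total = sum(clipped.values())
    return {k: v / total for k, v in clipped.items()}


# The 'side_effects_news' phase uses a sentiment_trend of -0.4
probs = category_probabilities(-0.4)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

With a bias of -0.4 the positive category ends up with probability zero, and the remaining four categories share the full probability mass.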

## Part 2: Basic Text Processing and Sentiment Analysis

In [None]:
# Function to clean and preprocess tweet text
def clean_tweet_text(text):
    """
    Clean tweet text for analysis
    """
    # Remove URLs (http://, https://, or bare www. links)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove user mentions and hashtags for sentiment analysis (keep for keyword analysis)
    text_for_sentiment = re.sub(r'@\w+|#\w+', '', text)
    
    # Remove extra whitespace
    text_for_sentiment = ' '.join(text_for_sentiment.split())
    
    return text_for_sentiment.strip()

# Function to extract hashtags
def extract_hashtags(text):
    """
    Extract hashtags from tweet text
    """
    hashtags = re.findall(r'#\w+', text.lower())
    return hashtags

# Function to calculate sentiment using TextBlob
def analyze_sentiment(text):
    """
    Analyze sentiment using TextBlob
    Returns polarity (-1 to 1) and subjectivity (0 to 1)
    """
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# Apply text processing
tweets_df['cleaned_text'] = tweets_df['text'].apply(clean_tweet_text)
tweets_df['hashtags'] = tweets_df['text'].apply(extract_hashtags)

# Calculate sentiment scores
sentiment_scores = tweets_df['cleaned_text'].apply(analyze_sentiment)
tweets_df['polarity'] = [score[0] for score in sentiment_scores]
tweets_df['subjectivity'] = [score[1] for score in sentiment_scores]

# Categorize sentiment based on polarity
def categorize_sentiment(polarity):
    if polarity > 0.1:
        return 'positive'
    elif polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'

tweets_df['sentiment_score'] = tweets_df['polarity'].apply(categorize_sentiment)

print("✅ Text processing and sentiment analysis completed")
print(f"\n📈 Sentiment Distribution:")
print(tweets_df['sentiment_score'].value_counts())

# Show comparison between true categories and detected sentiment
print(f"\n🎯 True vs Detected Sentiment (sample):")
comparison = tweets_df[['text', 'sentiment_category', 'sentiment_score', 'polarity']].sample(5)
for _, row in comparison.iterrows():
    print(f"Text: {row['text'][:60]}...")
    print(f"True: {row['sentiment_category']} | Detected: {row['sentiment_score']} | Score: {row['polarity']:.2f}\n")
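
The sample printout compares five generator categories against three detected classes, so a summary needs a mapping between the two label sets. The sketch below counts (expected, detected) pairs and reports overall agreement. The mapping `TRUE_TO_POLARITY`, which folds `fearful` and `misinformation` into `negative`, is an assumption made for illustration, and the labels in the example are made up.

```python
from collections import Counter

# Assumed mapping from the five generator categories to three polarity classes
TRUE_TO_POLARITY = {
    'positive': 'positive',
    'neutral': 'neutral',
    'negative': 'negative',
    'fearful': 'negative',
    'misinformation': 'negative',
}


def agreement_table(true_labels, detected_labels):
    """Count (expected, detected) pairs and return them with the overall agreement rate."""
    pairs = Counter()
    matches = 0
    for true_label, detected in zip(true_labels, detected_labels):
        expected = TRUE_TO_POLARITY[true_label]
        pairs[(expected, detected)] += 1
        if expected == detected:
            matches += 1
    total = sum(pairs.values())
    agreement = matches / total if total else 0.0
    return pairs, agreement


# Hypothetical labels for illustration only
true_labels = ['positive', 'fearful', 'neutral', 'misinformation', 'negative']
detected_labels = ['positive', 'negative', 'neutral', 'neutral', 'positive']

table, agreement = agreement_table(true_labels, detected_labels)
print(f"Agreement: {agreement:.0%}")
for (expected, got), count in sorted(table.items()):
    print(f"expected={expected:<8} detected={got:<8} n={count}")
```

In the notebook, the same function could be called with `tweets_df['sentiment_category']` and `tweets_df['sentiment_score']` to see which categories the lexicon-based polarity score tends to miss.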

## Part 3: Temporal Analysis of Public Sentiment

In [None]:
# Aggregate sentiment by date
daily_sentiment = tweets_df.groupby('date').agg({
    'polarity': ['mean', 'std', 'count'],
    'subjectivity': 'mean',
    'sentiment_score': lambda x: (x == 'positive').sum() / len(x),  # Positive sentiment ratio
    'retweets': 'sum',
    'likes': 'sum'
}).reset_index()

# Flatten column names
daily_sentiment.columns = ['date', 'mean_polarity', 'std_polarity', 'tweet_count', 
                          'mean_subjectivity', 'positive_ratio', 'total_retweets', 'total_likes']

# Add rolling averages for smoother trends
daily_sentiment['polarity_7day'] = daily_sentiment['mean_polarity'].rolling(window=7, center=True).mean()
daily_sentiment['positive_ratio_7day'] = daily_sentiment['positive_ratio'].rolling(window=7, center=True).mean()

# Create comprehensive temporal visualization
fig, axes = plt.subplots(3, 2, figsize=(16, 15))

# Plot 1: Daily sentiment polarity
axes[0,0].plot(daily_sentiment['date'], daily_sentiment['mean_polarity'], 
               alpha=0.3, color='blue', label='Daily Average')
axes[0,0].plot(daily_sentiment['date'], daily_sentiment['polarity_7day'], 
               color='red', linewidth=2, label='7-Day Moving Average')
axes[0,0].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[0,0].set_ylabel('Sentiment Polarity')
axes[0,0].set_title('Public Sentiment Over Time')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Plot 2: Tweet volume
axes[0,1].bar(daily_sentiment['date'], daily_sentiment['tweet_count'], 
              alpha=0.7, color='green')
axes[0,1].set_ylabel('Number of Tweets')
axes[0,1].set_title('Daily Tweet Volume')
axes[0,1].grid(True, alpha=0.3)

# Plot 3: Positive sentiment ratio
axes[1,0].plot(daily_sentiment['date'], daily_sentiment['positive_ratio'], 
               alpha=0.3, color='green', label='Daily Ratio')
axes[1,0].plot(daily_sentiment['date'], daily_sentiment['positive_ratio_7day'], 
               color='darkgreen', linewidth=2, label='7-Day Moving Average')
axes[1,0].set_ylabel('Positive Sentiment Ratio')
axes[1,0].set_title('Proportion of Positive Tweets')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# Plot 4: Engagement metrics
ax4_twin = axes[1,1].twinx()
line1 = axes[1,1].plot(daily_sentiment['date'], daily_sentiment['total_retweets'], 
                       color='orange', label='Retweets')
line2 = ax4_twin.plot(daily_sentiment['date'], daily_sentiment['total_likes'], 
                      color='red', label='Likes')
axes[1,1].set_ylabel('Total Retweets', color='orange')
ax4_twin.set_ylabel('Total Likes', color='red')
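axes[1,1].set_title('Daily Engagement Metrics')
# Combine the lines from both y-axes into one legend so the labels are displayed
engagement_lines = line1 + line2
axes[1,1].legend(engagement_lines, [l.get_label() for l in engagement_lines], loc='upper left')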
axes[1,1].grid(True, alpha=0.3)

# Plot 5: Sentiment distribution by phase
phase_sentiment = tweets_df.groupby(['phase', 'sentiment_score']).size().unstack(fill_value=0)
phase_sentiment_norm = phase_sentiment.div(phase_sentiment.sum(axis=1), axis=0)
phase_sentiment_norm.plot(kind='bar', stacked=True, ax=axes[2,0], 
                         color=['red', 'gray', 'green'])
axes[2,0].set_title('Sentiment Distribution by Campaign Phase')
axes[2,0].set_ylabel('Proportion of Tweets')
axes[2,0].tick_params(axis='x', rotation=45)
axes[2,0].legend(title='Sentiment')

# Plot 6: Subjectivity over time
axes[2,1].plot(daily_sentiment['date'], daily_sentiment['mean_subjectivity'], 
               color='purple', linewidth=2)
axes[2,1].set_ylabel('Mean Subjectivity')
axes[2,1].set_title('Opinion vs Fact-based Content Over Time')
axes[2,1].grid(True, alpha=0.3)

# Add phase boundaries (the last day of each of the first three phases)
phase_dates = [datetime(2023, 1, 31), datetime(2023, 4, 1), datetime(2023, 5, 1)]
phase_labels = ['Announcement', 'Early Rollout', 'Side Effects News', 'Stabilization']

# Mark boundaries on every panel with dates on the x-axis.
# axes[2,0] is skipped: its x-axis shows campaign phases, not dates.
time_series_axes = [axes[0,0], axes[0,1], axes[1,0], axes[1,1], axes[2,1]]
for ax in time_series_axes:
    for date in phase_dates:
        ax.axvline(x=date, color='gray', linestyle=':', alpha=0.7)

plt.tight_layout()
plt.show()

# Print key insights
print(f"\n📊 Key Temporal Patterns:")
print(f"Overall average sentiment: {daily_sentiment['mean_polarity'].mean():.3f}")
print(f"Most positive day: {daily_sentiment.loc[daily_sentiment['mean_polarity'].idxmax(), 'date'].date()}")
print(f"Most negative day: {daily_sentiment.loc[daily_sentiment['mean_polarity'].idxmin(), 'date'].date()}")
print(f"Peak tweet volume: {daily_sentiment['tweet_count'].max()} tweets")
print(f"Average positive tweet ratio: {daily_sentiment['positive_ratio'].mean():.1%}")
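
The 7-day curves above come from `rolling(window=7, center=True).mean()`, which averages each day with the three days before and after it. With pandas' default `min_periods`, the first and last three days have no complete window and come out as NaN, so the smoothed lines start late and end early. The dependency-free sketch below mirrors that behavior for odd window sizes; `centered_rolling_mean` is a hypothetical helper and the input values are made up.

```python
def centered_rolling_mean(values, window=7):
    """Mean over a window centered on each position; None where the window is incomplete."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window sizes")
    half = window // 2
    result = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            # Not enough neighbors on one side; pandas reports NaN here
            result.append(None)
        else:
            chunk = values[i - half:i + half + 1]
            result.append(sum(chunk) / window)
    return result


# Hypothetical daily mean polarity values for nine days
daily = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
smoothed = centered_rolling_mean(daily, window=7)
print(smoothed)
```

Only the three middle days receive a value in this nine-day example. A trailing window (`center=False` in pandas) would instead leave the first six days empty and lag behind the daily series.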

## Part 4: Hashtag Analysis and Theme Identification

In [None]:
# Analyze hashtag usage patterns
all_hashtags = []
for hashtag_list in tweets_df['hashtags']:
    all_hashtags.extend(hashtag_list)

hashtag_counts = Counter(all_hashtags)
top_hashtags = hashtag_counts.most_common(20)

print(f"\n🏷️ Most Common Hashtags:")
for hashtag, count in top_hashtags[:10]:
    print(f"{hashtag}: {count:,} times")

# Analyze hashtag sentiment patterns
hashtag_sentiment = {}
for hashtag, count in top_hashtags:
    if count >= 10:  # Only analyze hashtags with sufficient data
        hashtag_tweets = tweets_df[tweets_df['hashtags'].apply(lambda x: hashtag in x)]
        avg_sentiment = hashtag_tweets['polarity'].mean()
        hashtag_sentiment[hashtag] = {
            'count': count,
            'avg_sentiment': avg_sentiment,
            'pos_ratio': (hashtag_tweets['sentiment_score'] == 'positive').mean()
        }

# Create hashtag analysis visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Top hashtags by frequency
hashtags, counts = zip(*top_hashtags[:15])
axes[0,0].barh(range(len(hashtags)), counts, color='skyblue')
axes[0,0].set_yticks(range(len(hashtags)))
axes[0,0].set_yticklabels(hashtags)
axes[0,0].set_xlabel('Frequency')
axes[0,0].set_title('Most Frequently Used Hashtags')
axes[0,0].grid(True, alpha=0.3)

# Plot 2: Hashtag sentiment analysis
sentiment_hashtags = list(hashtag_sentiment.keys())
sentiment_scores = [hashtag_sentiment[h]['avg_sentiment'] for h in sentiment_hashtags]
sentiment_counts = [hashtag_sentiment[h]['count'] for h in sentiment_hashtags]

scatter = axes[0,1].scatter(sentiment_scores, range(len(sentiment_hashtags)), 
                           s=[c/2 for c in sentiment_counts], alpha=0.6, c=sentiment_scores, 
                           cmap='RdYlGn')
axes[0,1].set_yticks(range(len(sentiment_hashtags)))
axes[0,1].set_yticklabels(sentiment_hashtags)
axes[0,1].set_xlabel('Average Sentiment Score')
axes[0,1].set_title('Hashtag Sentiment Analysis')
axes[0,1].axvline(x=0, color='black', linestyle='--', alpha=0.5)
axes[0,1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[0,1])

# Plot 3: Hashtag evolution over time
key_hashtags = [h for h, c in top_hashtags[:5]]
for hashtag in key_hashtags:
    hashtag_data = tweets_df[tweets_df['hashtags'].apply(lambda x: hashtag in x)]
    daily_counts = hashtag_data.groupby(hashtag_data['date'].dt.date).size()
    axes[1,0].plot(daily_counts.index, daily_counts.values, label=hashtag)

axes[1,0].set_xlabel('Date')
axes[1,0].set_ylabel('Daily Frequency')
axes[1,0].set_title('Hashtag Usage Over Time')
axes[1,0].legend()
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].grid(True, alpha=0.3)

# Plot 4: Word cloud of all hashtags
wordcloud = WordCloud(width=800, height=400, background_color='white', 
                      colormap='viridis').generate_from_frequencies(hashtag_counts)
axes[1,1].imshow(wordcloud, interpolation='bilinear')
axes[1,1].set_title('Hashtag Word Cloud')
axes[1,1].axis('off')

plt.tight_layout()
plt.show()
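
The per-hashtag statistics above filter the full DataFrame once per hashtag. As a rough alternative, the sketch below collects the count and mean polarity for every hashtag in a single pass over the tweets. The function name, the `min_count` parameter, and the sample data are hypothetical. It counts each hashtag once per tweet, which can differ slightly from a raw occurrence count.

```python
import re
from collections import defaultdict


def hashtag_polarity_summary(tweets, min_count=2):
    """Aggregate tweet count and mean polarity per hashtag in one pass.

    tweets: iterable of (text, polarity) pairs.
    """
    totals = defaultdict(lambda: [0, 0.0])  # hashtag -> [tweet count, polarity sum]
    for text, polarity in tweets:
        # set() so a hashtag repeated within one tweet is counted once
        for tag in set(re.findall(r'#\w+', text.lower())):
            totals[tag][0] += 1
            totals[tag][1] += polarity
    # Keep only hashtags with enough tweets to make the mean meaningful
    return {
        tag: {'count': n, 'avg_sentiment': s / n}
        for tag, (n, s) in totals.items()
        if n >= min_count
    }


# Hypothetical (text, polarity) pairs for illustration only
sample = [
    ("Got my shot today #Vaccines #Hope", 0.5),
    ("Still unsure about this #Vaccines", -0.1),
    ("Seeing family soon #Hope", 0.7),
    ("Appointment booked #Update", 0.0),
]

summary = hashtag_polarity_summary(sample)
for tag, stats in sorted(summary.items()):
    print(f"{tag}: n={stats['count']}, mean polarity={stats['avg_sentiment']:.2f}")
```

With the notebook's data, the input could be built with `zip(tweets_df['text'], tweets_df['polarity'])`.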