# Reddit Sentiment Analysis Project Demo

This notebook demonstrates the complete Reddit sentiment analysis pipeline. It covers:
- Data collection from Reddit
- Sentiment analysis of posts and comments
- API demonstration
- Model evaluation

**Before running this notebook, make sure you have:**
- A Reddit API account and credentials
- A Python environment set up

**Live Demo:** Check out our web application at: https://zmckinney22.github.io/CS410-Group-Project/

Enter your Reddit API credentials below:

In [79]:
# Enter your Reddit API credentials here
# Get these from: https://www.reddit.com/prefs/apps
REDDIT_CLIENT_ID = "your_client_id_here"  # Replace with your actual client ID
REDDIT_CLIENT_SECRET = "your_client_secret_here"  # Replace with your actual client secret
REDDIT_USER_AGENT = "your_user_agent_here/1.0"  # e.g., "RedditSentimentAnalyzer/1.0"

# Validate that credentials are provided
if REDDIT_CLIENT_ID == "your_client_id_here" or REDDIT_CLIENT_SECRET == "your_client_secret_here":
    print("WARNING: Please replace the placeholder values with your actual Reddit API credentials!")
    print("Get them from: https://www.reddit.com/prefs/apps")
else:
    print("Credentials configured")

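# Locate the notebook's working directory so later cells can run project scripts from it
try:
    notebook_dir = globals()['_dh'][0]
except (KeyError, IndexError):
    # '_dh' is only defined under IPython/Jupyter; fall back to the current directory
    notebook_dir = '.'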

Get them from: https://www.reddit.com/prefs/apps
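Hard-coding secrets in a notebook cell makes it easy to commit them by accident. As an alternative, the credentials could be read from variables that are already set in the shell. The following is a minimal sketch of that idea; the helper name `load_credentials` and the demo values are hypothetical and are not part of the project code.

```python
import os

REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")

def load_credentials(env=None):
    """Return (credentials, missing) read from a mapping such as os.environ."""
    env = os.environ if env is None else env
    # Treat unset and whitespace-only values the same way: as missing.
    credentials = {name: env.get(name, "").strip() for name in REQUIRED_VARS}
    missing = [name for name, value in credentials.items() if not value]
    return credentials, missing

# Demo with an explicit mapping so the example runs without real secrets.
demo_env = {
    "REDDIT_CLIENT_ID": "abc123",  # hypothetical value
    "REDDIT_USER_AGENT": "RedditSentimentAnalyzer/1.0",
}
credentials, missing = load_credentials(demo_env)
print("Missing:", missing)
```

Calling `load_credentials()` with no argument reads from `os.environ`, so the same check works in the notebook and in scripts.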


## Step 1: Install Dependencies

Install all required Python packages for the project.

In [71]:
# Install project dependencies
import sys
import subprocess

def install_dependencies():
    """Install all project dependencies"""
    print("Installing dependencies...")
    
    # Install main requirements
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
            capture_output=True,
            text=True,
            cwd=notebook_dir
        )
        if result.returncode != 0:
            print("ERROR: Failed to install main dependencies")
            print("STDOUT:", result.stdout)
            print("STDERR:", result.stderr)
            return False
        print("Main dependencies installed")
    except Exception as e:
        print(f"ERROR: Exception during main dependencies: {e}")
        return False
    
    # Install API requirements
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", "backend/api_requirements.txt"],
            capture_output=True,
            text=True,
            cwd=notebook_dir
        )
        if result.returncode != 0:
            print("ERROR: Failed to install API dependencies")
            print("STDOUT:", result.stdout)
            print("STDERR:", result.stderr)
            return False
        print("API dependencies installed")
    except Exception as e:
        print(f"ERROR: Exception during API dependencies: {e}")
        return False
    
    # Install additional packages needed for the notebook
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "jupyter", "matplotlib", "pandas", "numpy"],
            capture_output=True,
            text=True,
            cwd=notebook_dir
        )
        if result.returncode != 0:
            print("ERROR: Failed to install notebook dependencies")
            print("STDOUT:", result.stdout)
            print("STDERR:", result.stderr)
            return False
        print("Notebook dependencies installed")
    except Exception as e:
        print(f"ERROR: Exception during notebook dependencies: {e}")
        return False
    
    print("All dependencies installed successfully!")
    return True

# Run installation
install_success = install_dependencies()
if not install_success:
    print("\nPlease fix dependency installation issues before continuing.")

Installing dependencies...
Main dependencies installed
API dependencies installed
Notebook dependencies installed
All dependencies installed successfully!


## Step 2: Set Up Environment Variables

Configure the environment with your Reddit API credentials.

In [72]:
# Set up environment variables for Reddit API
import os

def setup_environment():
    """Set up environment variables for API access"""
    os.environ['REDDIT_CLIENT_ID'] = REDDIT_CLIENT_ID
    os.environ['REDDIT_CLIENT_SECRET'] = REDDIT_CLIENT_SECRET
    os.environ['REDDIT_USER_AGENT'] = REDDIT_USER_AGENT
    
    # Verify environment variables are set
    required_vars = ['REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET', 'REDDIT_USER_AGENT']
    missing = [var for var in required_vars if not os.getenv(var)]
    
    if missing:
        print(f"ERROR: Missing environment variables: {missing}")
        return False
    
    print("Environment variables configured")
    return True

# Set up environment
env_success = setup_environment()
if not env_success:
    print("Please check your Reddit API credentials.")

Environment variables configured


## Step 3: Data Collection

Collect Reddit posts and comments for analysis.

In [73]:
# Data collection from Reddit
import sys
sys.path.append('.')

def collect_reddit_data():
    """Collect Reddit data using the project's data collection module"""
    print("Starting Reddit data collection...")
    
    try:
        from backend.reddit import authenticate_reddit, collect_reddit_data, save_raw_data, validate_data_completeness, preprocess_reddit_data, save_preprocessed_data, generate_eda_report
        
        # Authenticate with Reddit API
        print("Authenticating with Reddit API...")
        reddit = authenticate_reddit()
        print("Authentication successful")
        
        # Define subreddits to collect from (smaller sample for demo)
        subreddits = ['python', 'AskReddit', 'movies']
        posts_per_sub = 3  # Reduced for demo
        comments_per_post = 10  # Reduced for demo
        
        print(f"Collecting data from {len(subreddits)} subreddits...")
        print(f"- Posts per subreddit: {posts_per_sub}")
        print(f"- Comments per post: {comments_per_post}")
        
        # Collect raw data
        raw_data = collect_reddit_data(reddit, subreddits, posts_per_sub, comments_per_post)
        save_raw_data(raw_data)
        print(f"Collected {len(raw_data)} posts")
        
        # Validate and preprocess data
        print("\nValidating and preprocessing data...")
        validated_data = validate_data_completeness(raw_data)
        posts_df, comments_df = preprocess_reddit_data(validated_data)
        save_preprocessed_data(posts_df, comments_df)
        
        # Generate EDA report
        print("\nGenerating exploratory data analysis...")
        eda_report = generate_eda_report(posts_df, comments_df)
        
        print("Data collection and preprocessing complete!")
        return posts_df, comments_df, eda_report
        
    except Exception as e:
        print(f"ERROR: Error in data collection: {e}")
        import traceback
        traceback.print_exc()
        return None, None, None

# Collect data
posts_df, comments_df, eda_report = collect_reddit_data()

# Display basic statistics
if posts_df is not None and comments_df is not None:
    print(f"\nDataset Summary:")
    print(f"- Total posts: {len(posts_df)}")
    print(f"- Total comments: {len(comments_df)}")
    print(f"- Subreddits: {posts_df['subreddit'].nunique()}")
    print(f"- Average comments per post: {len(comments_df) / len(posts_df):.1f}")

Starting Reddit data collection...
Authenticating with Reddit API...
Authentication successful
Collecting data from 3 subreddits...
- Posts per subreddit: 3
- Comments per post: 10
Collecting from r/python...
  Collected post 'Pandas 3.0 release candidate tagged...' with 10 comments
  Collected post 'I built an automated court scraper because finding...' with 10 comments
  Collected post 'My wife was manually copying YouTube comments, so ...' with 10 comments
Collecting from r/AskReddit...
  Collected post 'What's an "Insider's secret" from your profession ...' with 10 comments
  Collected post 'What is a 'Survival Myth' that people believe beca...' with 10 comments
  Collected post 'What do girls “never” tell guys?...' with 10 comments
Collecting from r/movies...
  Collected post 'Hot Fuzz (2007) "what did he say?" Dir. Edgar Wrig...' with 10 comments
  Collected post 'First Image from ‘Super Troopers 3’...' with 10 comments
  Collected post 'Cary-Hiroyuki Tagawa Dies: ‘Mortal Kombat,

## Step 4: Set Up SocialSent Lexicons

Download and set up SocialSent lexicons for enhanced sentiment analysis.

In [74]:
# Set up SocialSent lexicons
def setup_socialsent():
    """Download and set up SocialSent lexicons"""
    print("Setting up SocialSent lexicons...")
    
    try:
        from backend.setup_socialsent import check_installation, setup_socialsent as run_setup
        
        if check_installation():
            print("SocialSent lexicons already installed")
            return True
        else:
            print("Installing SocialSent lexicons (this may take a few minutes)...")
            run_setup()
            return True
            
    except Exception as e:
        print(f"ERROR setting up SocialSent: {e}")
        print("Continuing without SocialSent (using Liu & Hu lexicon only)")
        return False

# Set up SocialSent
socialsent_success = setup_socialsent()

Setting up SocialSent lexicons...
SocialSent lexicons already installed


## Step 5: Sentiment Analysis on Collected Comments

Analyze the sentiment of collected Reddit comments.

In [75]:
# Sentiment analysis demonstration
import pandas as pd

def analyze_sample_comments():
    """Analyze sentiment of a sample of collected comments"""
    print("Analyzing sentiment of collected comments...")
    
    try:
        from backend.sentiment import SentimentAnalyzer
        
        # Load preprocessed comments
        comments_df = pd.read_csv(os.path.join(notebook_dir, 'data/comments_preprocessed.csv'))
        posts_df = pd.read_csv(os.path.join(notebook_dir, 'data/posts_preprocessed.csv'))
        
        # Create sentiment analyzer
        analyzer = SentimentAnalyzer(use_socialsent=socialsent_success)
        print(f"Sentiment analyzer initialized (SocialSent: {'enabled' if socialsent_success else 'disabled'})")
        
        # Analyze a sample of comments
        sample_size = min(50, len(comments_df))  # Analyze up to 50 comments
        sample_comments = comments_df.sample(sample_size, random_state=42)
        
        print(f"\nAnalyzing {sample_size} sample comments...")
        
        results = []
        for _, comment in sample_comments.iterrows():
            sentiment = analyzer.analyze_sentiment(comment['text'])
            results.append({
                'comment_id': comment['comment_id'],
                'text': comment['text'][:100] + '...' if len(comment['text']) > 100 else comment['text'],
                'sentiment': sentiment,
                'score': comment['score']
            })
        
        # Convert to DataFrame for display
        results_df = pd.DataFrame(results)
        
        # Show sentiment distribution
        sentiment_counts = results_df['sentiment'].value_counts()
        print(f"\nSentiment Distribution:")
        for sentiment, count in sentiment_counts.items():
            percentage = (count / len(results_df)) * 100
            print(f"- {sentiment.capitalize()}: {count} ({percentage:.1f}%)")
        
        # Show sample results
        print(f"\nSample Results:")
        display(results_df.head(10))
        
        print("Sentiment analysis complete!")
        return results_df
        
    except Exception as e:
        print(f"ERROR in sentiment analysis: {e}")
        import traceback
        traceback.print_exc()
        return None

# Run sentiment analysis
sentiment_results = analyze_sample_comments()

Analyzing sentiment of collected comments...
Found 3 overlapping words, removing them: ['envious', 'enviously', 'enviousness']
Loaded Liu & Hu: 2004 positive, 4780 negative words
Loaded SocialSent lexicon 'reddit_general': 9836 words
Sentiment analyzer initialized (SocialSent: enabled)

Analyzing 50 sample comments...

Sentiment Distribution:
- Negative: 32 (64.0%)
- Positive: 14 (28.0%)
- Mixed: 3 (6.0%)
- Neutral: 1 (2.0%)

Sample Results:


| | comment_id | text | sentiment | score |
|---|---|---|---|---|
| 0 | nspgqek | Walking anywhere if you're lost in the wildern... | SentimentLabel.NEGATIVE | 6132 |
| 1 | ns9fdod | Im interested in the spam list you created, ar... | SentimentLabel.NEGATIVE | 1 |
| 2 | nsri2i8 | That we lay an egg once a month | SentimentLabel.NEGATIVE | 18373 |
| 3 | nsl37mu | That first film is still fucking hilarious. | SentimentLabel.POSITIVE | 3281 |
| 4 | ns3u61o | Congrats to the pandas devs! I appreciate all ... | SentimentLabel.POSITIVE | 142 |
| 5 | nseljtn | Right now, it's keyword-based, so it catches o... | SentimentLabel.NEGATIVE | 1 |
| 6 | ns2c3z9 | So Im a medic. I dont personally care if you t... | SentimentLabel.NEGATIVE | 3398 |
| 7 | nrwh0dl | The having to get a translator for the transla... | SentimentLabel.MIXED | 346 |
| 8 | nsbyj1z | As someone that has gotten recommended a lawye... | SentimentLabel.POSITIVE | 48 |
| 9 | nspeabj | Pulling out the bullet. They do it in all the ... | SentimentLabel.NEGATIVE | 13789 |


Sentiment analysis complete!
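The log line "Found 3 overlapping words, removing them" suggests that words listed as both positive and negative are dropped before scoring, because their polarity is ambiguous. A minimal sketch of that clean-up step follows. The function name `remove_overlap` and the tiny word sets are illustrative assumptions, and dropping the word from both lists is a guess at the behavior, not the project's confirmed implementation.

```python
def remove_overlap(positive, negative):
    """Drop words that appear in both polarity sets.

    Returns the cleaned positive set, the cleaned negative set, and the
    sorted list of removed words.
    """
    overlap = positive & negative
    return positive - overlap, negative - overlap, sorted(overlap)

# Toy word sets (not the real Liu & Hu lists).
positive = {"good", "great", "envious"}
negative = {"bad", "awful", "envious"}

pos_clean, neg_clean, removed = remove_overlap(positive, negative)
print(f"Found {len(removed)} overlapping words, removing them: {removed}")
```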


## Step 6: Sentiment Analysis Pipeline (Reddit post URL -> Analysis Results)

This demonstrates the complete sentiment analysis pipeline used by our web application.
The web app makes API calls to the backend, which uses these same functions.

Pipeline: Reddit URL → Fetch post/comments → Analyze sentiment → Return results
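The first stage of this pipeline has to turn a post URL into something the Reddit API can look up. One plausible way to do that is to extract the subreddit name and post ID from the URL path. The sketch below is an assumption about how that step could work; `parse_reddit_post_url` is a hypothetical helper, not a function from `backend.reddit`.

```python
import re
from urllib.parse import urlparse

# Matches paths such as /r/<subreddit>/comments/<post_id>/<slug>/
_POST_PATH = re.compile(
    r"^/r/(?P<subreddit>[^/]+)/comments/(?P<post_id>[a-z0-9]+)(?:/|$)",
    re.IGNORECASE,
)

def parse_reddit_post_url(url):
    """Return (subreddit, post_id) for a Reddit post URL, or raise ValueError."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        raise ValueError(f"Not a Reddit URL: {url!r}")
    match = _POST_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"Not a Reddit post URL: {url!r}")
    return match.group("subreddit"), match.group("post_id")

demo_url = "https://www.reddit.com/r/UIUC/comments/1pek4a1/why_are_nighttime_exams_even_allowed/"
subreddit, post_id = parse_reddit_post_url(demo_url)
print(subreddit, post_id)
```

Validating the URL up front lets the pipeline fail with a clear message before any network request is made.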

In [76]:
# Direct Reddit post analysis demonstration
import pandas as pd
import sys
sys.path.append('.')

def analyze_reddit_post():
    """Analyze sentiment of a specific Reddit post directly"""
    print("Analyzing Reddit post sentiment...")
    
    try:
        # Import the backend functions
        from backend.reddit import fetch_post_and_comments
        from backend.sentiment import analyze_post_and_comments
        
        # Reddit post URL to analyze
        reddit_url = "https://www.reddit.com/r/UIUC/comments/1pek4a1/why_are_nighttime_exams_even_allowed/"
        print(f"Fetching and analyzing: {reddit_url}")
        
        # Fetch the post and comments
        post_data = fetch_post_and_comments(reddit_url, max_comments=50)  # Limit comments for demo
        
        if not post_data or not post_data.get('comments'):
            print("ERROR: Could not fetch post data or no comments found")
            return None
        
        print(f"Found {len(post_data['comments'])} comments to analyze")
        
        # Analyze sentiment
        analysis_result = analyze_post_and_comments(post_data)
        
        # Display results
        print(f"\nANALYSIS RESULTS")
        print("=" * 50)
        
        # Post information
        print(f"Post Title: {analysis_result['post_title']}")
        print(f"Overall Sentiment: {analysis_result['overall_sentiment'].upper()}")
        print(f"Controversy Score: {analysis_result['controversy']:.3f}")
        
        # Sentiment distribution
        print(f"\nSentiment Distribution:")
        for group in analysis_result['groups']:
            percentage = group['proportion'] * 100
            print(f"  - {group['label'].capitalize()}: {group['count']} comments ({percentage:.1f}%)")
        
        # Keywords
        if analysis_result['keywords']:
            print(f"\nTop Keywords: {', '.join(analysis_result['keywords'][:10])}")
        
        # Notable comments (show top 3)
        if analysis_result['notable_comments']:
            print(f"\nNotable Comments:")
            for i, comment in enumerate(analysis_result['notable_comments'][:3]):
                snippet = comment['snippet'][:100] + "..." if len(comment['snippet']) > 100 else comment['snippet']
                print(f"  {i+1}. [{comment['sentiment'].upper()}] \"{snippet}\" (Score: {comment['score']})")
        
        print("\n" + "=" * 50)
        print("Post analysis complete")
        
        return analysis_result
        
    except Exception as e:
        print(f"ERROR: Error analyzing post: {e}")
        import traceback
        traceback.print_exc()
        return None

# Analyze the Reddit post
post_analysis = analyze_reddit_post()


Analyzing Reddit post sentiment...
Fetching and analyzing: https://www.reddit.com/r/UIUC/comments/1pek4a1/why_are_nighttime_exams_even_allowed/
Found 25 comments to analyze
Found 3 overlapping words, removing them: ['envious', 'enviously', 'enviousness']
Loaded Liu & Hu: 2004 positive, 4780 negative words
Loaded SocialSent lexicon 'reddit_general': 9836 words

ANALYSIS RESULTS
Post Title: Why are nighttime exams even allowed?
Overall Sentiment: NEGATIVE
Controversy Score: 0.173

Sentiment Distribution:
  - Positive: 6 comments (24.0%)
  - Negative: 18 comments (72.0%)
  - Neutral: 1 comments (4.0%)
  - Mixed: 0 comments (0.0%)

Top Keywords: exams, time, classes, exam, finals, think, work, hours, long, week

Notable Comments:
  1. [POSITIVE] "I loved it. Night is when my brain works best." (Score: 37)
  2. [NEGATIVE] "Exams usually 2-3 hours long. Hard to find a 2-3 hour time block when everyone is available. My gues..." (Score: 122)
  3. [NEUTRAL] "asynchronous" (Score: 7)

Post analy

## Step 7: Download Evaluation Datasets

Download the SST-2 and Sentiment140 datasets for model evaluation.

In [77]:
# Download evaluation datasets
import sys
sys.path.append('.')

def download_datasets():
    """Download evaluation datasets"""
    print("Downloading evaluation datasets...")
    
    try:
        from backend.download_datasets import download_sst2, download_sentiment140
        
        # Download SST-2 dataset (no confirmation needed)
        print("\nDownloading SST-2 dataset...")
        sst2_success = download_sst2()
        if sst2_success:
            print("SST-2 dataset downloaded")
        else:
            print("Failed to download SST-2 dataset")
            
        # Download Sentiment140 dataset with automated confirmation
        print("\nDownloading Sentiment140 dataset...")
        sent140_success = download_sentiment140(confirmation_required=False)
        if sent140_success:
            print("Sentiment140 dataset downloaded")
        else:
            print("Failed to download Sentiment140 dataset")
        
        return sst2_success and sent140_success
        
    except Exception as e:
        print(f"Error downloading datasets: {e}")
        return False

# Download datasets
datasets_success = download_datasets()

Downloading evaluation datasets...

Downloading SST-2 dataset...

DOWNLOADING SST-2 DATASET

Downloading SST-2 dev.tsv...
URL: https://raw.githubusercontent.com/clairett/pytorch-sentiment-classification/master/data/SST2/dev.tsv
Downloaded to d:\Shubhi\COLLEGES\UIUC\Assignments\Fall25\CS410\CS410-Group-Project\data\sst2\dev.tsv

Downloading SST-2 train.tsv...
URL: https://raw.githubusercontent.com/clairett/pytorch-sentiment-classification/master/data/SST2/train.tsv
Downloaded to d:\Shubhi\COLLEGES\UIUC\Assignments\Fall25\CS410\CS410-Group-Project\data\sst2\train.tsv

SST-2 dataset downloaded successfully!
  - dev.tsv: 872 examples
  - train.tsv: 6920 examples
SST-2 dataset downloaded

Downloading Sentiment140 dataset...

DOWNLOADING SENTIMENT140 DATASET

Downloading Sentiment140 dataset...
URL: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Downloaded to d:\Shubhi\COLLEGES\UIUC\Assignments\Fall25\CS410\CS410-Group-Project\data\sentiment140\trainingandtestdata.zip

Extract

## Step 8: Model Evaluation

Evaluate the sentiment analysis model on benchmark datasets.

In [78]:
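# Model evaluation
import subprocess
import sys

# Run the evaluation script with the same interpreter as the notebook
result = subprocess.run(
    [sys.executable, 'test/test_analyzer_comments.py'],
    capture_output=True,
    text=True,
    cwd=notebook_dir
)

print(result.stdout)
if result.returncode != 0:
    # Surface failures instead of silently printing an empty result
    print("STDERR:", result.stderr)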

Found 3 overlapping words, removing them: ['envious', 'enviously', 'enviousness']
Loaded Liu & Hu: 2004 positive, 4780 negative words
Loaded SocialSent lexicon 'reddit_general': 9836 words
Found 3 overlapping words, removing them: ['envious', 'enviously', 'enviousness']
Loaded Liu & Hu: 2004 positive, 4780 negative words
Loaded SocialSent lexicon 'reddit_general': 9836 words
Found 3 overlapping words, removing them: ['envious', 'enviously', 'enviousness']
Loaded Liu & Hu: 2004 positive, 4780 negative words
Loaded SocialSent lexicon 'reddit_general': 9836 words

EVALUATION RESULTS with SocialSent weight = 0.3
Dataset                       Accuracy   Pos/Neg F1       Pos F1       Neg F1       Neu F1     Mixed F1
----------------------------------------------------------------------------------------------------
Reddit Manual                   0.6061       0.5820       0.4255       0.7385       0.2105       0.0000
SST-2                           0.6778       0.6834       0.6675       0.69
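In the Reddit Manual row, the Pos/Neg F1 value equals the unweighted mean of Pos F1 and Neg F1: (0.4255 + 0.7385) / 2 = 0.5820. That reading is inferred from the numbers above and is not confirmed by the evaluation script. Under that assumption, the per-class and averaged scores could be computed as in the sketch below, which uses made-up labels rather than project data.

```python
def f1_for_label(y_true, y_pred, label):
    """One-vs-rest F1 score for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pos_neg_f1(y_true, y_pred):
    """Unweighted mean of the positive and negative F1 scores (assumed definition)."""
    return (f1_for_label(y_true, y_pred, "positive")
            + f1_for_label(y_true, y_pred, "negative")) / 2

# Made-up gold labels and predictions, for illustration only.
y_true = ["positive", "negative", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "negative", "negative", "negative"]

print(f"Pos F1:     {f1_for_label(y_true, y_pred, 'positive'):.4f}")
print(f"Neg F1:     {f1_for_label(y_true, y_pred, 'negative'):.4f}")
print(f"Pos/Neg F1: {pos_neg_f1(y_true, y_pred):.4f}")
```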

## Summary

This notebook has demonstrated the complete Reddit sentiment analysis pipeline:

1. **Dependency Installation**: Installed all required packages
2. **Environment Setup**: Configured Reddit API credentials
3. **Data Collection**: Collected and preprocessed Reddit data
4. **SocialSent Setup**: Downloaded community-specific lexicons
5. **Sample Comments Sentiment Analysis**: Analyzed sentiment of sample collected comments
6. **Sentiment Analysis Given a Reddit URL**: Fetched a Reddit post and analyzed the sentiment of its comments
7. **Download Evaluation Datasets**: Downloaded benchmark datasets
8. **Model Evaluation**: Evaluated model performance

### Key Features Demonstrated:
- **Multi-lexicon sentiment analysis** (Liu & Hu + SocialSent)
- **Reddit-specific text preprocessing** (slang, emojis, etc.)
- **Context-aware analysis** (negation, intensifiers, subreddit-specific)
- **Comprehensive evaluation** on multiple benchmark datasets
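
To make the multi-lexicon and context-aware features concrete, here is a toy scorer that blends two lexicons and handles negation and intensifiers. It is a sketch under stated assumptions, not the project's `SentimentAnalyzer`. The word lists, the blending formula, and every function name are hypothetical. Only the weight of 0.3 comes from the evaluation output, and how the project applies that weight is not shown in this notebook.

```python
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}

def score_text(text, general_lexicon, community_lexicon, community_weight=0.3):
    """Sum blended word polarities, flipping on negation and scaling on intensifiers."""
    total = 0.0
    negate = False
    boost = 1.0
    for token in text.lower().split():
        word = token.strip(".,!?\"'")
        if word in NEGATIONS:
            negate = True  # applies to the next sentiment-bearing word
            continue
        if word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]
            continue
        # Blend the two lexicons; a word missing from a lexicon contributes 0.
        polarity = ((1 - community_weight) * general_lexicon.get(word, 0.0)
                    + community_weight * community_lexicon.get(word, 0.0))
        if polarity != 0.0:
            total += boost * (-polarity if negate else polarity)
            negate = False
            boost = 1.0
    return total

def label_for(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Toy lexicons (assumptions, not the real Liu & Hu or SocialSent data).
general = {"good": 1.0, "bad": -1.0}
community = {"good": 0.5, "fire": 1.0}

for sample in ["good", "not good", "very good", "this is fire", "asynchronous"]:
    print(sample, "->", label_for(score_text(sample, general, community)))
```

In this toy version a community-only word such as "fire" still receives a small positive score, which illustrates why a Reddit-specific lexicon can change results for slang that a general lexicon misses.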