# Reddit Data Collection Script for Candidate Sentiment Analysis

This Python script uses the `praw` library to collect comments about specific candidates from the subreddit `r/politics`. For each comment it records the text, author, timestamp, and engagement metrics (upvotes, reply count, awards), then saves everything to a CSV file.

## Key Features

1. **Reddit API Configuration**:
   - The script authenticates using `praw`, with a `client_id`, `client_secret`, and a custom `user_agent`.
   - Replace the placeholders with your own Reddit API credentials to run the script.

2. **Candidate-Specific Comment Collection**:
   - The `fetch_candidate_comments` function searches `r/politics` for posts matching a candidate's name and collects comments from those posts, up to `num_comments` per candidate.
   - For each comment it extracts the comment ID, user handle, timestamp, comment text, upvotes, number of replies, and awards count, and tags the row with the candidate and party.

3. **Candidate and Party Information**:
   - A dictionary `candidates_info` maps candidates to their respective political parties.
   - The script iterates through this dictionary to fetch data for all listed candidates.

4. **Saving Data**:
   - Collected data from all candidates is combined into a single Pandas DataFrame.
   - The DataFrame is exported to a CSV file named `reddit_candidate_comments_praw.csv`.
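
For the credential step in feature 1, one option is to read the values from environment variables rather than writing them into the notebook. The sketch below is a minimal illustration, not part of the original script: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from a mapping (the process environment by default).

    The variable names are hypothetical; rename them to suit your setup.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of a confusing API error later
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage (requires praw and the three variables to be set):
#   import praw
#   reddit = praw.Reddit(**load_reddit_credentials())
```

Keeping the secret out of the notebook means the file can be shared or committed without exposing the account.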


In [1]:
import praw
import pandas as pd
from datetime import datetime

# Configure Reddit API
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # Replace with your Client ID
    client_secret="YOUR_CLIENT_SECRET",  # Replace with your Secret Key
    user_agent="ytlei2",  # Custom user agent
)

# Fetch comments for a specific candidate
def fetch_candidate_comments(candidate, party, num_comments=10000):
    comments_data = []
    subreddit = reddit.subreddit("politics")
    comment_count = 0  # Track the number of comments fetched

    # Search posts related to the candidate (limit caps the number of posts, not comments)
    for submission in subreddit.search(candidate, limit=num_comments):
        submission.comments.replace_more(limit=0)  # Drop unloaded "MoreComments" placeholders
        for comment in submission.comments.list():  # Flattened list of all loaded comments
            if comment_count >= num_comments:  # Stop if we've collected enough comments
                break
            comments_data.append({
                "comment_id": comment.id,
                "user_handle": comment.author.name if comment.author else "deleted",
                # created_utc is a Unix epoch in UTC; convert without shifting to local time
                "timestamp": pd.to_datetime(comment.created_utc, unit="s"),
                "comment_text": comment.body,
                "candidate": candidate,
                "party": party,
                "upvotes": comment.score,
                "replies": len(comment.replies) if hasattr(comment, "replies") else 0,
                "awards": comment.total_awards_received  # Awards count
            })
            comment_count += 1
        if comment_count >= num_comments:  # Break out of outer loop if limit reached
            break

    return pd.DataFrame(comments_data)

# Define candidates and their parties
candidates_info = {
    "Kamala Harris": "Democratic",
    "Donald Trump": "Republican",
    "Jill Stein": "Green",
    "Robert Kennedy": "Independent",
    "Chase Oliver": "Libertarian"
}

# Fetch data for each candidate
all_comments = []
for candidate, party in candidates_info.items():
    print(f"Fetching comments for {candidate} ({party})...")
    candidate_comments = fetch_candidate_comments(candidate, party, num_comments=1000)
    all_comments.append(candidate_comments)

# Combine all candidates' comments into a single DataFrame
comments_df = pd.concat(all_comments, ignore_index=True)

# Save to CSV
comments_df.to_csv("reddit_candidate_comments_praw.csv", index=False)

print(comments_df.head())

Fetching comments for Kamala Harris (Democratic)...
Fetching comments for Donald Trump (Republican)...
Fetching comments for Jill Stein (Green)...
Fetching comments for Robert Kennedy (Independent)...
Fetching comments for Chase Oliver (Libertarian)...
  comment_id    user_handle           timestamp  \
0    lviwuzp  AutoModerator 2024-11-05 15:19:02   
1    lvj632v      Gymrat777 2024-11-05 16:09:12   
2    lvj74ic       martapap 2024-11-05 16:14:33   
3    lvix1ez    Blackwardz3 2024-11-05 15:20:03   
4    lvjadkj     middlebird 2024-11-05 16:31:05   

                                        comment_text      candidate  \
0  \nAs a reminder, this subreddit [is for civil ...  Kamala Harris   
1  There's another poll going on right now - it's...  Kamala Harris   
2  Just actually vote people. Don't assume anythi...  Kamala Harris   
3  Nice. Don’t get complacent though. Vote if you...  Kamala Harris   
4  Dear college age redditors.\n\nI know you have...  Kamala Harris   

        party

# Filtering High-Quality Reddit Comments for Candidate Analysis

This cell refines the collected Reddit comments to keep only high-quality, relevant entries. It removes bot-generated comments, drops comments that mention more than one candidate, and keeps the 100 most-upvoted comments per candidate.
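
The three filtering criteria can be sketched without any dependencies on a handful of made-up records. This is an illustration under stated assumptions, not the notebook's implementation: the helpers `looks_like_bot`, `mentioned_candidates`, and `top_comments` are hypothetical, and they deliberately swap in stricter matching (handles ending in "bot", whole-word surname matches) where the notebook uses plain substring checks.

```python
import re
from collections import defaultdict


def looks_like_bot(user_handle, comment_text):
    """Heuristic: AutoModerator, a handle ending in 'bot', or a bot disclaimer in the text."""
    handle = (user_handle or "").lower()
    text = (comment_text or "").lower()
    if handle == "automoderator" or handle.endswith("bot"):
        return True
    return "i am a bot" in text or "performed automatically" in text


def mentioned_candidates(comment_text, surnames):
    """Return the surnames that occur as whole words in the comment (case-insensitive)."""
    return [s for s in surnames
            if re.search(rf"\b{re.escape(s)}\b", comment_text, flags=re.IGNORECASE)]


def top_comments(records, surnames, top_n=2):
    """Drop bot comments and comments naming 2+ candidates, then keep top_n by upvotes."""
    by_candidate = defaultdict(list)
    for rec in records:
        if looks_like_bot(rec["user_handle"], rec["comment_text"]):
            continue
        if len(mentioned_candidates(rec["comment_text"], surnames)) > 1:
            continue
        by_candidate[rec["candidate"]].append(rec)
    return {
        cand: sorted(recs, key=lambda r: r["upvotes"], reverse=True)[:top_n]
        for cand, recs in by_candidate.items()
    }


# Toy records, invented for illustration only
records = [
    {"user_handle": "AutoModerator", "comment_text": "As a reminder...", "candidate": "Jill Stein", "upvotes": 1},
    {"user_handle": "user_a", "comment_text": "Stein polled well here.", "candidate": "Jill Stein", "upvotes": 40},
    {"user_handle": "user_b", "comment_text": "Stein vs Trump debate?", "candidate": "Jill Stein", "upvotes": 90},
    {"user_handle": "user_c", "comment_text": "Go vote.", "candidate": "Jill Stein", "upvotes": 15},
]
result = top_comments(records, ["Harris", "Trump", "Stein", "Kennedy", "Oliver"])
print([r["user_handle"] for r in result["Jill Stein"]])  # ['user_a', 'user_c']
```

Note that a comment naming no candidate at all ("Go vote.") still passes, because the rule is "at most one candidate", not "exactly one".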


In [2]:
from textblob import TextBlob


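def is_bot_comment(user_handle, comment_text):
    """Heuristically flag bot comments by handle or by a bot disclaimer in the text."""
    # Note: substring matching on 'bot' also flags human handles that merely contain it
    bot_indicators = ['AutoModerator', 'bot', 'moderator']
    if any(indicator.lower() in (user_handle or '').lower() for indicator in bot_indicators):
        return True
    # Compare case-insensitively so capitalization differences do not hide a disclaimer
    text = (comment_text or '').lower()
    if "i am a bot" in text or "this action was performed automatically" in text:
        return True
    return False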

# Helper (not called below): token-level match, so only single-word keywords can match
def is_relevant_comment(comment_text, keywords):
    blob = TextBlob(comment_text)
    return any(keyword.lower() in blob.words.lower() for keyword in keywords)

def filter_high_quality_comments(df, candidates, short, top_n=100):
    def contains_single_candidate(comment_text, candidates):
        # True when the comment names at most one candidate (zero mentions also pass)
        count = sum(candidate.lower() in comment_text.lower() for candidate in candidates)
        return count < 2

    # Drop comments flagged as bot-generated
    df = df[~df.apply(lambda row: is_bot_comment(row['user_handle'], row['comment_text']), axis=1)]

    # Drop comments naming more than one candidate, by full name and then by surname
    df = df[df['comment_text'].apply(lambda x: contains_single_candidate(x, candidates))]
    df = df[df['comment_text'].apply(lambda x: contains_single_candidate(x, short))]

    # Keep the top_n most-upvoted comments for each candidate
    df = df.sort_values(['candidate', 'upvotes'], ascending=[True, False])
    filtered_df = df.groupby('candidate').head(top_n)

    return filtered_df

candidates_list = ["Kamala Harris", "Donald Trump", "Jill Stein", "Robert Kennedy", "Chase Oliver"]
candidates_short_list = ["Harris", "Trump", "Stein", "Kennedy", "Oliver"]

filtered_comments_df = filter_high_quality_comments(comments_df, candidates=candidates_list, short=candidates_short_list, top_n=100)

filtered_comments_df.to_csv("reddit.csv", index=False)

print(filtered_comments_df.head())

     comment_id    user_handle           timestamp  \
4415    l5tfyku      neuroid99 2024-05-27 02:22:31   
4417    l5t76l5  atomsmasher66 2024-05-27 01:16:06   
4480    l5u3iym    hatrickstar 2024-05-27 05:23:36   
4418    l5t7ooz        deleted 2024-05-27 01:19:53   
4488    l5t7gnp        DJErikD 2024-05-27 01:18:14   

                                           comment_text     candidate  \
4415  While I think the Libertarian party is full of...  Chase Oliver   
4417         But his brain worm is still in the running  Chase Oliver   
4480  True libertarian are loons but they truly aren...  Chase Oliver   
4418                                          [deleted]  Chase Oliver   
4488                               🐛 BRAIN WORM 2024. 🐛  Chase Oliver   

            party  upvotes  replies  awards  
4415  Libertarian     2089        7       0  
4417  Libertarian      964       11       0  
4480  Libertarian      693       19       0  
4418  Libertarian      558        7       0  
4488  