# Reddit Data Collection: Medicare & Medicaid Posts

## Overview
This notebook collects Reddit posts related to Medicare and Medicaid from three subreddits:
- r/Medicare  
- r/Medicaid  
- r/HealthInsurance  

The goal is to construct a clean, reproducible corpus of Reddit posts for downstream text analysis and clustering, with a focus on understanding administrative burden and user experiences with public health insurance programs.

---

## Data Collection Strategy
Posts are retrieved using the Reddit API via the `praw` Python library. For each subreddit, the notebook:
- Iterates backward in time using Reddit's `new` listing
- Applies **case-insensitive keyword filtering** (`"medicare"`, `"medicaid"`)
- Targets up to 1,200 keyword-matching posts per subreddit to give broad temporal and topical coverage
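
The keyword filter described above can be sketched as a small standalone function. This is a minimal illustration, assuming plain substring matching on the lowercased title and body; `matches_keywords` is a hypothetical helper name, not something defined in the notebook.

```python
from typing import Iterable


def matches_keywords(title: str, body: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs in the title or body, ignoring case."""
    haystack = f"{title} {body}".lower()
    return any(keyword.lower() in haystack for keyword in keywords)


KEYWORDS = ["medicare", "medicaid"]

# A post that mentions a keyword in any casing passes the filter.
hit = matches_keywords("MEDICARE Part B question", "", KEYWORDS)

# A post that mentions neither keyword is skipped.
miss = matches_keywords("Employer plan deductible", "HSA question", KEYWORDS)
```

Because this is substring matching, variants such as "Medicaid's" or "Medicare-eligible" also pass, which is usually the desired behavior for a broad corpus.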

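Rate limits are handled with conservative sleep intervals: a short pause (`SLEEP_TIME`, 1.5 seconds) during collection and a 60-second wait whenever the API reports too many requests.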
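
One possible refinement of the sleep-based approach is exponential backoff, which is a swapped-in technique here rather than what the notebook does (the notebook uses fixed sleeps). The sketch below is hypothetical: `call_with_backoff` and `FakeRateLimit` are illustrative names, and the fake exception stands in for a real rate-limit error so the example runs offline.

```python
import time


def call_with_backoff(fn, retry_on, max_retries=3, base_delay=1.5, sleep=time.sleep):
    """Call fn(); on `retry_on`, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


class FakeRateLimit(Exception):
    """Stand-in for a rate-limit error (assumption: not a real library class)."""


calls = {"n": 0}
waits = []  # records requested delays instead of actually sleeping


def flaky_fetch():
    """Fail twice with a rate-limit error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return ["post-1", "post-2"]


result = call_with_backoff(flaky_fetch, FakeRateLimit, sleep=waits.append)
```

In the notebook's setting, `retry_on` would be `prawcore.exceptions.TooManyRequests` and `fn` would wrap the listing request; injecting `sleep` keeps the helper easy to test.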

---

## Output
The notebook produces a single CSV file:

**`raw_reddit_data.csv`**

Each row corresponds to one Reddit post and includes:
- Post metadata (ID, subreddit, score, number of comments, URL, timestamp)
- Raw text (title and body, stored as separate columns)
- Derived text features (character length and word count of the post body)

The text is saved raw; preparing a cleaned text field for TF-IDF vectorization and clustering is not part of this notebook.
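
To make the derived text features concrete, here is a minimal pure-Python sketch of the two measures, assuming character length is the raw string length and word count is the number of whitespace-delimited tokens. `text_features` is a hypothetical helper used only for illustration.

```python
def text_features(body: str) -> dict:
    """Return the character length and whitespace-delimited word count of a post body."""
    return {
        "text_length": len(body),
        "word_count": len(body.split()),
    }


features = text_features("Does Medicare cover dental?")

# Posts with an empty body (for example, link-only posts) get zero for both features.
empty = text_features("")
```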

---

## Reproducibility Notes
- API credentials are **not hard-coded** and are loaded from environment variables
- The notebook can be rerun end-to-end to regenerate the dataset
- All preprocessing steps are explicit and documented to support transparency and replication
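
As a sketch of how the credential loading could fail fast when a variable is missing, the snippet below checks the three environment variable names the notebook reads. `load_reddit_credentials` and `REQUIRED_VARS` are hypothetical names introduced here, not part of the notebook or of `praw`.

```python
import os

REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=os.environ):
    """Return the credentials as keyword arguments, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Example with an explicit mapping instead of the real environment.
example_env = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "xyz",
    "REDDIT_USER_AGENT": "research-script",
}
credentials = load_reddit_credentials(example_env)
```

The returned dictionary uses the same keyword names the notebook passes to `praw.Reddit`, so it could be unpacked as `praw.Reddit(**load_reddit_credentials())` after `load_dotenv()` has run.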


In [1]:
# Import Statements
import praw
import prawcore
import pandas as pd
from datetime import datetime
import time
import re
import os
from tqdm import tqdm
from dotenv import load_dotenv

In [2]:
# Load .env file from project root
load_dotenv()

True

In [3]:
# Reddit API credentials
reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent=os.getenv("REDDIT_USER_AGENT"),
)

In [4]:
# Parameters
SUBREDDITS = ["Medicare", "Medicaid", "HealthInsurance"]
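KEYWORDS = ["medicare", "medicaid"]  # matched case-insensitively against title + body
TARGET_POSTS_PER_SUB = 1200          # maximum matching posts to keep per subreddit
SLEEP_TIME = 1.5                     # seconds to pause after each collected post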

In [6]:
# Scrape posts
posts = []

for sub in SUBREDDITS:
    subreddit = reddit.subreddit(sub)
    collected = 0
    last_fullname = None  # fullname ("t3_<id>") of the oldest post seen so far

    print(f"\n🔍 Scraping r/{sub}...")

    while collected < TARGET_POSTS_PER_SUB:
        # Reddit listings paginate by fullname via "after", not by timestamp
        params = {}
        if last_fullname:
            params["after"] = last_fullname

        try:
            batch = list(subreddit.new(limit=1000, params=params))
        except prawcore.exceptions.TooManyRequests:
            print("⏳ Rate limit hit; sleeping 60s")
            time.sleep(60)
            continue

        if not batch:
            break  # listing exhausted (Reddit caps listings at roughly 1,000 items)

        for submission in tqdm(batch, leave=False):
            last_fullname = submission.fullname

            raw_text = f"{submission.title} {submission.selftext}"
            raw_text_lower = raw_text.lower()

            # keyword filter (case-insensitive)
            if not any(k in raw_text_lower for k in KEYWORDS):
                continue

            posts.append({
                "id": submission.id,
                "title": submission.title,
                "text": submission.selftext,
                "subreddit": sub.lower(),
                "score": submission.score,
                "num_comments": submission.num_comments,
                "url": submission.url,
                "created_utc": submission.created_utc,
                # local time of the machine running the notebook, not UTC
                "date": datetime.fromtimestamp(submission.created_utc),
            })

            collected += 1
            if collected >= TARGET_POSTS_PER_SUB:
                break

            time.sleep(SLEEP_TIME)  # conservative pause after each collected post



🔍 Scraping r/Medicare...

🔍 Scraping r/Medicaid...

🔍 Scraping r/HealthInsurance...

In [7]:
# Create output DataFrame
df = pd.DataFrame(posts).drop_duplicates(subset="id")

# Text features, computed on the post body (`text`) only
df["text_length"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()

# Reorder columns into the output schema
df = df[
    [
        "id",
        "title",
        "text",
        "subreddit",
        "score",
        "num_comments",
        "url",
        "created_utc",
        "date",
        "text_length",
        "word_count",
    ]
]


In [8]:
len(df)

1665

In [10]:
# Examine Distribution Across Subreddits 
print(df["subreddit"].value_counts())

subreddit
medicaid           826
medicare           717
healthinsurance    122
Name: count, dtype: int64


In [11]:
df.head()

| | id | title | text | subreddit | score | num_comments | url | created_utc | date | text_length | word_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1qoihba | CenterWell scam?? | Took my 81 year old mother to a Dr appointment... | medicare | 2 | 0 | https://www.reddit.com/r/medicare/comments/1qo... | 1769531000.0 | 2026-01-27 11:24:04 | 1190 | 215 |
| 1 | 1qohpsb | Maryland, QMB with Medicare Advantage Plan, be... | My mom has a Medicare Advantage plan that cove... | medicare | 3 | 2 | https://www.reddit.com/r/medicare/comments/1qo... | 1769529000.0 | 2026-01-27 10:57:06 | 633 | 111 |
| 2 | 1qog97w | insulin pens now $35 a month, so pen needles w... | Very happy that mom's Humalog pens are now onl... | medicare | 3 | 3 | https://www.reddit.com/r/medicare/comments/1qo... | 1769526000.0 | 2026-01-27 10:03:35 | 481 | 95 |
| 3 | 1qnvkqv | Cigna Medicare Supplement Plan G 2026 Premium ... | There have been a lot of posts regarding the s... | medicare | 5 | 14 | https://www.reddit.com/r/medicare/comments/1qn... | 1769467000.0 | 2026-01-26 17:41:04 | 956 | 175 |
| 4 | 1qnuing | Question about late enrollment penalties for p... | The situation is as follows:\n\nPerson born in... | medicare | 1 | 11 | https://www.reddit.com/r/medicare/comments/1qn... | 1769465000.0 | 2026-01-26 17:02:12 | 1047 | 182 |


In [13]:
# Save file 
df.to_csv("../data/raw_reddit_data.csv", index=False)

print(f"\n✅ Saved raw_reddit_data.csv with {len(df):,} posts")


✅ Saved raw_reddit_data.csv with 1,665 posts
