# Analyzing Online Discussions on Women's Health


## Connect to Reddit API

The Python Reddit API Wrapper (PRAW) provides synchronous access to the Reddit API, including the posts listed under each subreddit.

In [1]:
import praw
import time
import pandas as pd

In [2]:
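import os

# Read credentials from the environment instead of hardcoding secrets in the notebook.
# The environment variable names here are placeholders; use whatever your setup defines.
reddit_api = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="uspicious-Air-8308"
)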

### Example Data Fetch
With the Reddit API, we can choose which subreddit to pull from and read fields from each post, such as its title, score, id, and URL:

In [3]:
subreddit = reddit_api.subreddit("BabyBumps")
for post in subreddit.hot(limit=5):
    print(post.title, post.score, post.id, post.url)

Weekly Reminder: Community Rules 1 1iggeap https://www.reddit.com/r/BabyBumps/comments/1iggeap/weekly_reminder_community_rules/
Introduction and Daily Picture Thread 1 1ihf64l https://www.reddit.com/r/BabyBumps/comments/1ihf64l/introduction_and_daily_picture_thread/
I’m a ‘we’re pregnant ‘ gal. 277 1ihhl8m https://www.reddit.com/r/BabyBumps/comments/1ihhl8m/im_a_were_pregnant_gal/
Anyone else feel *really* unattractive while pregnant? 108 1ihg4bl https://www.reddit.com/r/BabyBumps/comments/1ihg4bl/anyone_else_feel_really_unattractive_while/
So Much For A Late Gender Reveal 28 1ihnb6l https://www.reddit.com/r/BabyBumps/comments/1ihnb6l/so_much_for_a_late_gender_reveal/


### Establishing our Scope

We are interested in the following subreddits:
1. `r/pregnant`
2. `r/babybumps`
3. `r/beyondthebump`
4. `r/pregnancyproblems`
5. `r/pregnancyafterloss`
6. `r/newparents`
7. `r/postpartum_depression`
8. `r/postpartum_anxiety`
9. `r/fitpregnancy`
10. `r/newborns`

In [5]:
PREGNANT = "pregnant"
BABY_BUMPS = "babybumps"
BEYOND_THE_BUMP = "beyondthebump"
PREGNANCY_PROBLEMS = "pregnancyproblems"
PREGNANCY_AFTER_LOSS = "pregnancyafterloss"
NEW_PARENTS = "newparents"
POSTPARTUM_DEPRESSION = "postpartum_depression"
POSTPARTUM_ANXIETY = "postpartum_anxiety"
FIT_PREGNANCY = "fitpregnancy"
NEWBORNS = "newborns"

subreddits = [PREGNANT, BABY_BUMPS, BEYOND_THE_BUMP, PREGNANCY_PROBLEMS, PREGNANCY_AFTER_LOSS, NEW_PARENTS, POSTPARTUM_DEPRESSION,
             POSTPARTUM_ANXIETY, FIT_PREGNANCY, NEWBORNS]

In [6]:
for sub_name in subreddits:
    sub = reddit_api.subreddit(sub_name)
    print(f"Description for 'r/{sub_name}': {sub.public_description}")

Description for 'r/pregnant': A safer space for all pregnant people.

Description for 'r/babybumps': A place for pregnant redditors, those who have been pregnant, those who wish to be in the future, and anyone who supports them. Not the place for bump or ultrasound pics, sorry!
Description for 'r/beyondthebump': A place for new parents, new parents to be, and old parents who want to help out. Posts focusing on the transition into living with your new little one and any issues that may come up. Ranting and gushing is welcome!
Description for 'r/pregnancyproblems': *** PLEASE READ RULES BEFORE POSTING***

Pregnancy can impact dozens of aspects of your life so we made a space where you can rant, ask for advice and find someone to relate to!
Description for 'r/pregnancyafterloss': This sub is for people who are pregnant after any type of pregnancy or infant loss.  

At PAL, the daily and weekly threads act like the main sub in other subreddits: nearly everything gets posted there. Standalo

## Data Collection Strategy

For each post, we are interested in:
- the post title
- the post text body
- the top comments on the post

In [7]:
# we are only interested in posts from the past year
cutoff_timestamp = time.time() - (365 * 24 * 60 * 60)
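
The cutoff is a Unix timestamp in seconds, the same unit as each post's `created_utc`, so the two can be compared directly. As a minimal sketch of the same check using timezone-aware datetimes (the helper name `is_within_past_year` and the sample dates are hypothetical, chosen only for illustration):

```python
from datetime import datetime, timedelta, timezone


def is_within_past_year(created_utc, now=None):
    """Return True if a Unix timestamp falls within the 365 days before `now`."""
    now = now if now is not None else datetime.now(timezone.utc)
    cutoff = now - timedelta(days=365)
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return created >= cutoff


# Hypothetical reference point and timestamps, for illustration only
reference = datetime(2025, 2, 1, tzinfo=timezone.utc)
recent = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()
old = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()

print(is_within_past_year(recent, now=reference))
print(is_within_past_year(old, now=reference))
```

Passing `now` explicitly keeps the check reproducible when the notebook is re-run on a later date.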

### Fetch posts:

In [8]:
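def fetch_subreddit_posts(subreddit_name, limit=500):
    """Return the newest posts (up to `limit`) from a subreddit that were created after the cutoff."""
    subreddit = reddit_api.subreddit(subreddit_name)

    post_data = []
    # .new() yields posts from newest to oldest
    for post in subreddit.new(limit=limit):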
        if post.created_utc >= cutoff_timestamp:
            post_data.append({
                "subreddit": subreddit_name,
                "title": post.title,
                "text": post.selftext,
                "created_utc": post.created_utc,
                "id": post.id
            })
    print(subreddit_name, len(post_data))
    return pd.DataFrame(post_data)

In [9]:
all_posts = pd.concat([fetch_subreddit_posts(sub) for sub in subreddits], ignore_index=True)
all_posts.to_csv("reddit_posts.csv", index=False)
print(f"Saved {len(all_posts)} posts from the past year")

pregnant 500
babybumps 500
beyondthebump 500
pregnancyproblems 500
pregnancyafterloss 500
newparents 500
postpartum_depression 500
postpartum_anxiety 179
fitpregnancy 500
newborns 500
Saved 4679 posts from the past year
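
Nine of the ten subreddits returned exactly 500 posts, which is the `limit` passed to the fetch function. For those subreddits the sample was probably capped by `limit` rather than by the one-year cutoff, so it may span much less than a year. A minimal sketch for checking the date range actually covered, assuming a DataFrame with the same `subreddit` and `created_utc` columns as `all_posts` (the helper name `summarize_coverage` and the sample rows are hypothetical):

```python
import pandas as pd


def summarize_coverage(posts):
    """Per-subreddit post count plus the earliest and latest post date."""
    dated = posts.assign(
        created=pd.to_datetime(posts["created_utc"], unit="s", utc=True)
    )
    return dated.groupby("subreddit")["created"].agg(["count", "min", "max"])


# Hypothetical rows standing in for all_posts
sample = pd.DataFrame({
    "subreddit": ["pregnant", "pregnant", "newborns"],
    "created_utc": [1738368000, 1738454400, 1706745600],
})

coverage = summarize_coverage(sample)
print(coverage)
```

Calling `summarize_coverage(all_posts)` would show, for each subreddit, how far back the collected posts reach.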


### Fetch top comments for each post:

Note that Reddit limits API clients to 60 requests per minute, so the fetching loop has to pause periodically.

In [10]:
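def fetch_post_comments(post_id, limit=50):
    """Return up to `limit` top-level comments for a post as a DataFrame."""
    original_post = reddit_api.submission(id=post_id)
    # limit=0 drops all MoreComments placeholders without fetching them
    original_post.comments.replace_more(limit=0)

    comments = []
    # slicing the comment forest yields top-level comments only
    for comment in original_post.comments[:limit]: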
        comments.append({
            "post_id": post_id,
            "comment": comment.body
        })

    return pd.DataFrame(comments)

In [11]:
post_ids = all_posts["id"].tolist()

In [12]:
all_comments = []

# apply rate limiting with loop
for i, post_id in enumerate(post_ids):
    try:
        comments_df = fetch_post_comments(post_id)
        all_comments.append(comments_df)

        if (i + 1) % 100 == 0:
            print(f"Fetched comments for {i + 1} posts so far")

        if (i + 1) % 60 == 0:  # every 60 requests, wait 60 seconds to avoid hitting rate limits
            print("Rate limit reached. Sleeping for 60 seconds.")
            time.sleep(60)

    except Exception as e:
        print(f"Error fetching comments for post {post_id}: {e}")
        time.sleep(5)

Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 100 posts so far
Rate limit reached. Sleeping for 60 seconds.
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 200 posts so far
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 300 posts so far
Rate limit reached. Sleeping for 60 seconds.
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 400 posts so far
Rate limit reached. Sleeping for 60 seconds.
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 500 posts so far
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 600 posts so far
Rate limit reached. Sleeping for 60 seconds.
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 700 posts so far
Rate limit reached. Sleeping for 60 seconds.
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 800 posts so far
Rate limit reached. Sleeping for 60 seconds.
Fetched comments for 900 posts so far
Rate limit reached. Sleeping
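
The loop above sends requests in bursts of 60 and then sleeps for a full minute. An alternative is to space calls evenly, so that no burst exceeds the limit. The following is a minimal sketch of that idea, not part of the original notebook: the `Throttle` class and the `FakeClock` used to demonstrate it are hypothetical helpers, and it assumes one request per call, which is only an approximation since a single comment fetch can involve more than one request.

```python
import time


class Throttle:
    """Space calls so that at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


class FakeClock:
    """Stand-in clock so the demo runs instantly instead of really sleeping."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeClock()
throttle = Throttle(max_calls=60, period=60.0, clock=fake.time, sleep=fake.sleep)
for _ in range(3):
    throttle.wait()  # in the real loop, a comment fetch would follow each wait
print(fake.slept)
```

With the default arguments, `Throttle()` uses the real clock, and calling `throttle.wait()` before each `fetch_post_comments(post_id)` would replace the modulo-based sleep.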

In [13]:
all_comments_df = pd.concat(all_comments, ignore_index=True)
all_comments_df.to_csv("reddit_comments.csv", index=False)

print(f"Successfully saved {len(all_comments_df)} comments from original posts")

Successfully saved 33822 comments from original posts
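
The two saved tables are linked by post id: `post_id` in the comments table matches `id` in the posts table. As a minimal sketch of how they could be joined for later analysis, assuming the column names used above (the helper name `attach_post_context` and the sample rows are hypothetical):

```python
import pandas as pd


def attach_post_context(posts, comments):
    """Left-join each comment to its parent post's subreddit and title."""
    context = posts[["id", "subreddit", "title"]].rename(columns={"id": "post_id"})
    return comments.merge(context, on="post_id", how="left")


# Hypothetical rows standing in for all_posts and all_comments_df
posts = pd.DataFrame({
    "id": ["a1", "b2"],
    "subreddit": ["pregnant", "newborns"],
    "title": ["example title A", "example title B"],
})
comments = pd.DataFrame({
    "post_id": ["a1", "a1", "b2"],
    "comment": ["first reply", "second reply", "third reply"],
})

merged = attach_post_context(posts, comments)
print(merged.groupby("subreddit")["comment"].count())
```

A left join keeps every comment even if its parent post is missing from the posts table, which makes gaps visible as missing values instead of silently dropping rows.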
