In [1]:
import praw
import pandas as pd
import os
from dotenv import load_dotenv

In [None]:
# Load credentials from .env file
load_dotenv('../.env')  # Adjust path if needed

# We collect data from Reddit through its API. We pass in our credentials here
# and, further down, the subreddit we want to collect posts from.

# Set up Reddit API client
reddit = praw.Reddit(
    client_id=os.getenv('REDDIT_CLIENT_ID'),
    client_secret=os.getenv('REDDIT_CLIENT_SECRET'),
    user_agent='text-mining-project by u/Which-Reference-6224',
    username=os.getenv('REDDIT_USERNAME'),
    password=os.getenv('REDDIT_PASSWORD'))
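
`os.getenv` returns `None` for a variable that is not set, so a wrong `.env` path would pass `None` values on to the client. A minimal sketch of a pre-flight check, assuming the four variable names used above; the helper name `missing_credentials` is hypothetical and not part of PRAW or python-dotenv:

```python
import os

# Names read by the client setup above
REQUIRED_VARS = (
    'REDDIT_CLIENT_ID',
    'REDDIT_CLIENT_SECRET',
    'REDDIT_USERNAME',
    'REDDIT_PASSWORD',
)

def missing_credentials(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Running this right after `load_dotenv` makes a misplaced `.env` file visible before any request is sent.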

In [3]:
# Test Reddit connection
for submission in reddit.subreddit('science').hot(limit=5):
    print(submission.title)

A new study across 11 African reserves found that dehorning rhinos cut poaching by ~78% – far more effective than costly law enforcement alone.
Low-calorie diets might increase risk of depression. Overweight people and men were particularly vulnerable to the mood changes that come with a low-calorie diet. Cutting calories might also rob the brain of nutrients needed to maintain a balanced mood. Any sort of diet at all affected men's moods.
People around the world are more likely to favor dominant, authoritarian leaders during times of intergroup conflict. Drawing on data from 25 countries, the findings support that humans may have a psychological system that evolved to prioritize strong leadership when faced with external threats.
Self-perceived physical attractiveness linked to stronger materialistic values. Research suggests this occurs because people who believe they are attractive are more likely to compare themselves with others in terms of abilities, opinions, and social status, 

In [4]:
print("Logged in as:", reddit.user.me())

Logged in as: Which-Reference-6224


In [None]:
# To stay within API request limits and avoid memory issues, we request at most
# 1,000 posts from each subreddit (2,000 in total). Requesting the same number
# from each subreddit also keeps the classes roughly balanced from the start,
# so no extra rebalancing step is needed.
def fetch_posts(subreddit_name, limit=1000):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    for post in subreddit.hot(limit=limit):
        posts.append({
            'title': post.title,
            'selftext': post.selftext,
            'score': post.score,
            'created_utc': post.created_utc,
            'subreddit': subreddit_name})

    return pd.DataFrame(posts)

# Fetch from both subreddits
science_df = fetch_posts('science', limit=1000)
tech_df = fetch_posts('technology', limit=1000)

# Combine and save
all_posts = pd.concat([science_df, tech_df], ignore_index=True)
all_posts.to_csv('../data/reddit_posts.csv', index=False)

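# Report the actual row count: a listing can return fewer posts than the requested limit
print(f"Saved {len(all_posts):,} combined posts to ../data/reddit_posts.csv")

Saved 1,571 combined posts to ../data/reddit_posts.csv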
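
The number of rows depends on what the listing actually yields, not on the requested `limit`; the value counts at the end of this notebook show fewer than 1,000 rows per subreddit. A minimal sketch of how the row-building step could be separated from the network call so the returned count can be checked. The helpers `post_to_record` and `collect_records` are hypothetical, and the stand-in posts are made up for illustration:

```python
from types import SimpleNamespace

def post_to_record(post, subreddit_name):
    """Build one row from any object exposing the four attributes used above."""
    return {
        'title': post.title,
        'selftext': post.selftext,
        'score': post.score,
        'created_utc': post.created_utc,
        'subreddit': subreddit_name,
    }

def collect_records(listing, subreddit_name):
    """Turn an iterable of posts into a list of rows."""
    return [post_to_record(post, subreddit_name) for post in listing]

# Stand-in posts (illustrative values, not real Reddit data)
fake_listing = [
    SimpleNamespace(title=f"post {i}", selftext="", score=i, created_utc=0.0)
    for i in range(3)
]

requested = 1000
rows = collect_records(fake_listing, 'science')
print(f"Requested {requested}, received {len(rows)}")
```

With a real client, `reddit.subreddit(name).hot(limit=requested)` would take the place of `fake_listing`.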


In [6]:
df = pd.read_csv('../data/reddit_posts.csv')
df.head()

,title,selftext,score,created_utc,subreddit
0,A new study across 11 African reserves found t...,,1010,1749149000.0,science
1,Low-calorie diets might increase risk of depre...,,3031,1749118000.0,science
2,People around the world are more likely to fav...,,957,1749125000.0,science
3,Self-perceived physical attractiveness linked ...,,180,1749150000.0,science
4,Efficient mRNA delivery to resting T cells to ...,,260,1749136000.0,science
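
The `created_utc` column holds Unix timestamps (seconds since 1970-01-01 UTC). A minimal sketch of converting one value to a readable datetime with the standard library; the helper name `to_utc_datetime` is hypothetical, and the sample value is taken from the first row above, where pandas has rounded it for display:

```python
from datetime import datetime, timezone

def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

# Display-rounded value from the first row of df.head()
print(to_utc_datetime(1749149000.0).isoformat())
```

In pandas, the same conversion for the whole column could be done with `pd.to_datetime(df['created_utc'], unit='s', utc=True)`.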


In [None]:
# The two classes are roughly balanced, so no rebalancing is required
df['subreddit'].value_counts()

subreddit
technology    817
science       754
Name: count, dtype: int64
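
To put a number on "roughly balanced", the counts above can be turned into class shares and a minority-to-majority ratio. A minimal sketch using plain Python; the helper names `class_shares` and `balance_ratio` are hypothetical, and the point at which a ratio counts as "balanced enough" is a judgment call rather than a fixed rule:

```python
def class_shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def balance_ratio(counts):
    """Return the minority count divided by the majority count (1.0 means perfectly balanced)."""
    return min(counts.values()) / max(counts.values())

# Counts copied from the value_counts() output above
counts = {'technology': 817, 'science': 754}

for label, share in class_shares(counts).items():
    print(f"{label}: {share:.1%}")
print("minority/majority ratio:", round(balance_ratio(counts), 3))
```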