<a href="https://colab.research.google.com/github/swoodruff-bot/swoodruff-bot.github.io/blob/main/Seamus'_Reddit_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The code below was primarily written by Professor Hall, with modifications and additions by me. It now creates CSVs instead of plain text files, which is helpful for some text analysis code and is nice organizationally if you're trying to manually inspect the results. I also got rid of much of the filler, such as who commented or posted something, and any associated words like "reply" or "comment". This basically front-loads the removal of Reddit-specific stopwords in case you're using boilerplate text analysis code or a program that doesn't let you directly edit the code. It's still possible to visually tell what is a post vs. a comment vs. a reply, since the spacing and indentation formatting is maintained, but that formatting will be ignored by most text analysis.

**IMPORTANT!**
A note about the viability of this scraping method: in October 2025, Reddit made it harder to get an API key, which is required for this method (shout out AI developers, this is why we can't have nice things). Users are now required to submit a formal request through an online form, but turnaround time and approval both seem iffy, so best of luck 🤷. If you want to try requesting a key, here's the post regarding these changes, with the necessary forms linked: https://www.reddit.com/r/redditdev/comments/1oug31u/introducing_the_responsible_builder_policy_new

The code was running correctly as of December 15th, 2025. If you need help with it in the future, feel free to reach out to me; the DCS faculty should have my contact info. -Seamus

In [24]:
# Run this to import the praw library and get the reddit instance
# This is a read-only instance of Reddit (meaning you cannot post or comment using this API; if you are curious about that, look it up or reach out to me :) )
%pip install praw
import praw
import os
import textwrap
import pandas as pd
import logging
from google.colab import userdata

# I have my API key information stored in Colab, so I am using userdata.get to retrieve it. If you don't want to set that up,
# or are using a different environment, delete each userdata.get(...) call and put the actual key there as a quoted string

reddit = praw.Reddit(
    client_id = userdata.get('REDDIT_client_id'),  # FILL IT WITH YOUR CLIENT ID
    client_secret = userdata.get('Reddit_client_secret'),  # FILL IT WITH YOUR CLIENT SECRET
    user_agent = "scraper",  # FILL IT WITH YOUR USER AGENT
)



Should you manage to get an API key, note that Reddit limits the number of requests you can make within a certain time range, so you may need to change the number of requested posts or comments if you hit a 413 error when you try to actually scrape the posts.
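
If lowering the counts isn't enough, another option is to pause and retry when a request fails. Below is a minimal sketch of that idea, not part of the original notebook: `call_with_backoff` and its parameters are hypothetical names, and which exception actually gets raised depends on your PRAW version, so the exception class is passed in rather than assumed. The demo uses a fake function that fails twice so it runs without an API key.

```python
import time


def call_with_backoff(func, retry_on=(Exception,), max_attempts=4,
                      base_delay=2.0, sleep=time.sleep):
    """Call func(); on a listed exception, wait longer each time and retry.

    Hypothetical helper (not part of PRAW). The wait doubles on each attempt:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Demo with a made-up function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, retry_on=(RuntimeError,),
                           base_delay=1.0, sleep=waits.append)
print(result, waits)
```

In the notebook, the scraping call could be wrapped the same way, for example `call_with_backoff(lambda: scraper(SUBREDDIT, QUERIES, DISC_DIR), retry_on=(...))`, with `retry_on` set to whatever exception you see in the traceback.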

In [22]:
#HYPER_PARAMETERS
POSTS_NUMBERS_PER_QUERY = 40 #Choose the number of posts you want to get per query term
COMMENTS_PER_QUERY = 10 #choose the number of comments you want to get per post

#EXAMPLE BELOW
SUBREDDIT = 'apple' # fill it with the name of your subreddit without the r/ (for example, productivity and not r/productivity)
# a list of search queries you want to use
# (for example if you want all posts from R/slash addiction about social media, your que)
# you can specify multiple queries to run at once, but it makes tripping the 413 response from reddit more likely
QUERIES = ['screen time']
# Choose a name for the directory that your files will be saved to
DISC_DIR = 'ScreenTime'

# EXAMPLE
# SUBREDDIT= 'productivity'
# QUERIES = ['social media addiction', 'delete social media', 'quit social media', '"social media addiction" AND help']
# DISC_DIR = "producitvity_subreddit"

In [20]:
def wrap_text(text, width=80, initial_indent='', subsequent_indent=''):
    """Wraps text to a fixed line width for better readability.

    Args:
        text (str): text to wrap
        width (int, optional): Length of line in characters. Defaults to 80.
        initial_indent (str, optional): indent for the first line. Defaults to ''.
        subsequent_indent (str, optional): indent for the remaining lines (used for nested comments). Defaults to ''.

    Returns:
        str: The wrapped text.
    """
    wrapper = textwrap.TextWrapper(width=width, initial_indent=initial_indent, subsequent_indent=subsequent_indent)
    return wrapper.fill(text)

# Collects a top-level comment together with its direct replies
def get_comments_text(comment, width = 110, indent = "    "):
    """Collects a comment's text and the text of its direct replies.

    Args:
        comment (praw.models.Comment): The comment object.
        width (int, optional): Line length in characters for the replies. Defaults to 110.
            The top-level comment itself is wrapped at wrap_text's default width.
        indent (str, optional): The indent applied to each reply. Defaults to four spaces.

    Returns:
        str: A formatted string containing the comment and its direct replies.
    """
    comment_text = " "
    comment_text += f"{wrap_text(comment.body)} \n \n"
    for reply in comment.replies:
        comment_text += " "
        comment_text += wrap_text(reply.body, width=width, initial_indent=indent, subsequent_indent=indent) + "\n\n"
    return comment_text

def scraper(subreddit_name, search_queries, base_dir):
    """Scrapes the subreddit chosen based on the search queries provided and returns a list of dictionaries for each post.

    Args:
        subreddit_name (string): Name of subreddit to include
        search_queries (list): List of search queries
        base_dir (string): name of directory (not used for file writing anymore, but kept for function signature compatibility)

    Returns:
        list: A list of dictionaries, where each dictionary represents a Reddit post with its title, selftext, and combined comments.
    """
    subreddit = reddit.subreddit(subreddit_name)
    all_posts_data = []

    for query in search_queries:
        for i, submission in enumerate(subreddit.search(query, limit = POSTS_NUMBERS_PER_QUERY)):
            post_data = {}
            post_data['title'] = submission.title

            combined_text = []
            # Add the title to the combined text as well
            combined_text.append(f"{submission.title}")
            combined_text.append(wrap_text(submission.selftext))

            # getting comments
            submission.comments.replace_more(limit=None)
            top_comments = submission.comments[:COMMENTS_PER_QUERY]

            for comment in top_comments:
                combined_text.append(get_comments_text(comment))

            post_data['text'] = "\n\n".join(combined_text)
            all_posts_data.append(post_data)

    return all_posts_data
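
As written, get_comments_text picks up a comment and its direct replies, but not replies to those replies. If you want the deeper levels too, here is a rough sketch of a recursive version. It is not the notebook's code: `collect_thread_text` and `FakeComment` are hypothetical names, and `FakeComment` is just a stand-in with the same `body` and `replies` attributes so the example runs without an API key. I'm assuming a real PRAW comment can be iterated the same way after `replace_more` has run.

```python
import textwrap
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Hypothetical stand-in for a PRAW comment: a body plus a list of replies."""
    body: str
    replies: list = field(default_factory=list)


def collect_thread_text(comment, width=110, indent="    ", depth=0):
    """Return the comment and every nested reply, indented one step per level."""
    prefix = indent * depth
    text = textwrap.fill(comment.body, width=width,
                         initial_indent=prefix, subsequent_indent=prefix) + "\n\n"
    for reply in comment.replies:
        # Recurse so replies to replies are included, one indent level deeper
        text += collect_thread_text(reply, width, indent, depth + 1)
    return text


thread = FakeComment("top", [FakeComment("reply", [FakeComment("nested reply")])])
print(collect_thread_text(thread))
```

Each level of nesting gets one more indent, so the post vs. comment vs. reply structure stays visible when you inspect the CSV by hand.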


The next code block actually scrapes the data, tosses it in a dataframe, and saves it as a CSV in the DISC_DIR you specified above. This is where you need to watch out for that 413 error.

In [23]:
# This just gets rid of a super irritating warning about using PRAW instead of Async PRAW in an asynchronous environment.
# It would print out for every post, and I think maybe every comment as well.
logging.getLogger("praw").setLevel(logging.ERROR)

scraped_data = scraper(SUBREDDIT, QUERIES, DISC_DIR)
df = pd.DataFrame(scraped_data)

# Define the output CSV file path
output_csv_path = os.path.join(DISC_DIR, f'{QUERIES}_reddit_posts.csv')

# Create the directory if it doesn't exist
os.makedirs(DISC_DIR, exist_ok=True)

df.to_csv(output_csv_path, index=False, encoding='utf-8')
print(f"Data successfully scraped and saved to {output_csv_path}")
print(df.head())

Data successfully scraped and saved to ios/['screen time']_reddit_posts.csv
                                               title  \
0           My kid managed to pass Screen time limit   
1  9th Circuit Rules Apple Owes Retail Workers fo...   
2  For the first time, my iPhone alerted me that ...   
3  Apple releasing iOS 13.3.1 today with Screen T...   
4  Siri Shortcuts, Screen Time, and other iOS fea...   

                                                text  
0  My kid managed to pass Screen time limit\n\nWh...  
1  9th Circuit Rules Apple Owes Retail Workers fo...  
2  For the first time, my iPhone alerted me that ...  
3  Apple releasing iOS 13.3.1 today with Screen T...  
4  Siri Shortcuts, Screen Time, and other iOS fea...  
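
Because QUERIES is a list, the f-string in the cell above puts the list's brackets and quotes into the filename, as the saved path in the output shows. That works, but it can be awkward to type or open later. Below is a small sketch of one way to build a tidier name. `queries_to_filename` is a hypothetical helper, not something from the original code, and the choice of which characters to keep is just an assumption about what is safe on most systems.

```python
import re


def queries_to_filename(queries, suffix="_reddit_posts.csv"):
    """Join the query terms into one file-system-friendly name (hypothetical helper)."""
    # Replace each run of characters other than letters, digits, '-' or '_'
    # with a single underscore, then trim underscores from the ends
    cleaned = [re.sub(r"[^A-Za-z0-9_-]+", "_", q).strip("_") for q in queries]
    return "-".join(cleaned) + suffix


print(queries_to_filename(['screen time']))  # screen_time_reddit_posts.csv
```

The result could be dropped into the existing path line in place of the f-string, for example `os.path.join(DISC_DIR, queries_to_filename(QUERIES))`.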


**The code below this point combines all of the posts we've scraped so far into one single CSV for certain text analysis programs, so wait until you've scraped everything you want before running it!**
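
One thing to watch for: the combined CSV is saved into the same folder the individual CSVs are read from, so running the combine step a second time would also read the combined file and duplicate every row. Below is a minimal sketch of a file listing that skips it. `list_source_csvs` is a hypothetical helper, and the default `combined_name` is assumed to match the name used when saving the combined file.

```python
import os
import tempfile


def list_source_csvs(folder, combined_name="screentime_combined.csv"):
    """Return the CSV paths in folder, leaving out the combined output file."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".csv") and name != combined_name
    )


# Demo in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["a_reddit_posts.csv", "b_reddit_posts.csv",
                 "screentime_combined.csv", "notes.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    found = [os.path.basename(p) for p in list_source_csvs(demo_dir)]

print(found)
```

The returned list could stand in for `all_files` in the combining code.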

In [None]:
# define the path to your folder containing the CSV files
folder_path = DISC_DIR

#Combine all of the individual CSVs into a pandas data frame
all_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.csv')]

combined_df = pd.DataFrame(columns=['title', 'text'])

for file in all_files:
    df = pd.read_csv(file)
    if 'title' in df.columns and 'text' in df.columns:
        combined_df = pd.concat([combined_df, df[['title', 'text']]], ignore_index=True)
    else:
        print(f"Skipping file {file} as it does not contain 'title' and 'text' columns.")
#display the first couple rows to make sure nothing went really wrong
display(combined_df.head())

Unnamed: 0,title,text
0,A.I. has ruined Pinterest,A.I. has ruined Pinterest \n\nI find that ever...
1,I’m officially done searching for inspiration ...,I’m officially done searching for inspiration ...
2,How do you save inspiration that’s not on Pint...,How do you save inspiration that’s not on Pint...
3,"How to get rid of ""inspired by this board"" col...","How to get rid of ""inspired by this board"" col..."
4,"Come for the Inspiration, Stay for the Bugs an...","Come for the Inspiration, Stay for the Bugs an..."


In [None]:
#create a new csv from our dataframe and save it to the same folder as our other CSVs
output_combined_csv_path = os.path.join(folder_path, 'screentime_combined.csv')
combined_df.to_csv(output_combined_csv_path, index=False, encoding='utf-8')
print(f"All CSV files combined and saved to {output_combined_csv_path}")

All CSV files combined and saved to ios/screentime_combined.csv
