In [20]:
import praw
import pandas as pd
import time

### To use this notebook, create a [Reddit](https://www.reddit.com/register/) account and sign up for an [API](https://www.reddit.com/wiki/api/) key

After obtaining your API key, insert the **client_id**, **client_secret** and **user_agent** values from Reddit into the code cell below.


In [21]:
#Input your API key details here

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID", #input your client_id here
    client_secret="YOUR_CLIENT_SECRET", #input your client_secret here
    user_agent="WebScraper", #input your user_agent here
)
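
Instead of typing the credentials into the notebook, they could be read from environment variables so that they are not shared along with the code. The following is a minimal sketch under that assumption; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical names chosen for this example, not something defined by Reddit or PRAW.

```python
import os

# Hypothetical environment variable names chosen for this sketch
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}

def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    env = os.environ if env is None else env
    # Collect every variable that is absent or empty so the error lists them all
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

With the variables set, the client could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.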

### The function below is responsible for all scraping from Reddit

The function scrapes both posts and comments from the top section of the specified subreddit.

Scraping from top specifically ensures that the dataset is not biased towards current events and that it gathers posts with well-populated comment sections from different time periods.
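
The `time_period` argument of the scraping function is passed to PRAW's `subreddit.top()` as `time_filter`, which only accepts a fixed set of strings. A small check like the one below could catch a typo before any request is made; `check_time_period` is a hypothetical helper for illustration and is not part of the notebook's function.

```python
# Values accepted by the time_filter argument of PRAW's subreddit.top()
VALID_TIME_FILTERS = ("all", "day", "hour", "month", "week", "year")

def check_time_period(time_period):
    """Return time_period unchanged if it is valid, otherwise raise ValueError."""
    if time_period not in VALID_TIME_FILTERS:
        raise ValueError(
            f"time_period must be one of {VALID_TIME_FILTERS}, got {time_period!r}"
        )
    return time_period
```

Using `"all"`, as the cells further down do, ranks posts over the subreddit's whole history, which matches the goal of covering different time periods.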

In [26]:
def scrape_posts_and_comments_from_reddit_top(subreddit_name, num_post, time_period, comments_per_post):
    # Lists to store scraped posts and comments
    scraped_posts = []
    scraped_comments = []
    start_time = time.time() # Start timestamp for progress reporting
    total_posts = num_post*comments_per_post # Maximum number of comments to scrape (posts x comments per post)
    # Function to scrape both post and comments from a subreddit
    def scrape(subreddit_name, num_post, time_period):
        subreddit = reddit.subreddit(subreddit_name)
        num_scraped = 0 # num_scraped represents total number of posts scraped so far
        comment_counter = 0 # comment_counter represents total comments scraped, for progress bar
        # PRAW's listing generator handles pagination itself, so no manual 'after' cursor is needed
        for submission in subreddit.top(time_filter=time_period, limit=None):
            if num_scraped >= num_post: # Breaks for loop once desired post count reached
                break
            comment_per_post_counter = 0 #comment_per_post_counter represents comments scraped for a single post, to ensure comments do not exceed comments_per_post
            scraped_posts.append([submission.id, submission.title, submission.score, submission.num_comments, submission.selftext, submission.created_utc, submission.upvote_ratio, subreddit_name])
            submission.comments.replace_more(limit=10) # Expand at most 10 "load more comments" stubs. limit=None would expand all of them but can trigger HTTP 429 errors
            for comment in submission.comments.list():
                scraped_comments.append([comment.id, comment.parent_id, comment.link_id, comment.is_submitter, comment.body, comment.score, comment.stickied, comment.created_utc, submission.title, subreddit_name])
                time.sleep(1) # Time delay function to prevent HTTP 429 Error
                if comment_counter % max(1, total_posts // 10) == 0: # Print progress at every 10% of the desired total comments; max(1, ...) avoids division by zero when fewer than 10 comments are requested
                    comment_time = time.time() # Timestamp for progress bar
                    print(f'{comment_counter} comments scraped in {round((comment_time-start_time)/60,2)} minutes')
                comment_per_post_counter += 1 
                comment_counter += 1
                if comment_per_post_counter >= comments_per_post:
                    break
            num_scraped += 1
            time.sleep(1)  # Time delay function to prevent HTTP 429 Error
    
    # Scrape posts and comments from the subreddit
    scrape(subreddit_name, num_post, time_period)
    
    # Convert scraped posts and comments to DataFrames
    posts_columns = ['id', 'title', 'score', 'num_comments', 'selftext', 'created_utc', 'upvote_ratio', 'subreddit']
    comments_columns = ['comment_id', 'parent_id', 'post_id', 'is_submitter', 'body', 'score', 'stickied', 'created_utc', 'post_title', 'subreddit']
    df_posts = pd.DataFrame(scraped_posts, columns=posts_columns)
    df_comments = pd.DataFrame(scraped_comments, columns=comments_columns)
    
    # Convert 'created_utc' column to datetime
    df_posts['created_utc'] = pd.to_datetime(df_posts['created_utc'], unit='s')
    df_comments['created_utc'] = pd.to_datetime(df_comments['created_utc'], unit='s')

    
    end_time = time.time()
    print(f"Done in {round((end_time-start_time)/60,2)} minutes")
    return df_posts, df_comments
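
The inner loop of the function walks the flattened comment list and stops once `comments_per_post` comments have been collected. The sketch below imitates that idea on a plain Python tree so it can be run without Reddit access. `Node` and `flatten_capped` are hypothetical names, and the breadth-first order used here is an assumption about how such a flattening behaves, not a copy of PRAW's code.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    """Stand-in for a comment: a body plus a list of replies."""
    body: str
    replies: list = field(default_factory=list)

def flatten_capped(top_level, cap):
    """Flatten a comment tree breadth-first, keeping at most `cap` bodies."""
    flat = []
    queue = deque(top_level)
    while queue and len(flat) < cap:
        node = queue.popleft()
        flat.append(node.body)
        queue.extend(node.replies)  # replies are visited after the current level
    return flat

tree = [
    Node("a", [Node("a1"), Node("a2", [Node("a2x")])]),
    Node("b", [Node("b1")]),
]
print(flatten_capped(tree, cap=4))  # ['a', 'b', 'a1', 'a2']
```

Under this assumption, a small cap favours top-level comments over deeply nested replies, because every top-level comment is visited before any reply.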



### Scraping for r/TheOnion and r/news

The full scrape targeted 200 posts with up to 100 comments each, to ensure a diverse set of comments and prevent overfitting on a small set of posts.
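
Because the scraping function pauses for one second after every comment and after every post, the run time grows with the product of the two settings. The helper below is a rough sketch of that arithmetic; `min_sleep_minutes` is a hypothetical name, and the estimate assumes every post yields the full `comments_per_post` comments and ignores network time.

```python
def min_sleep_minutes(num_post, comments_per_post, delay_seconds=1.0):
    """Minutes spent in time.sleep() if every post yields the full comment quota."""
    total_comments = num_post * comments_per_post
    total_sleeps = total_comments + num_post  # one pause per comment, one per post
    return total_sleeps * delay_seconds / 60

print(round(min_sleep_minutes(200, 100), 1))  # 336.7
print(round(min_sleep_minutes(20, 10), 1))    # 3.7
```

At 200 posts of 100 comments each, the pauses alone add up to more than five and a half hours, before any time spent waiting on Reddit's servers.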

In [24]:
subreddit_name = "TheOnion"

theonion_df_posts, theonion_df_comments = scrape_posts_and_comments_from_reddit_top(subreddit_name=subreddit_name, num_post=20, time_period="all", comments_per_post=10)

theonion_df_comments.to_csv('../data/01_raw_onion_data.csv', index=False)

0 comments scraped in 2.37 minutes
20 comments scraped in 7.06 minutes
40 comments scraped in 9.34 minutes
60 comments scraped in 9.79 minutes
80 comments scraped in 10.24 minutes
100 comments scraped in 13.87 minutes
120 comments scraped in 14.35 minutes
140 comments scraped in 14.76 minutes
160 comments scraped in 15.56 minutes
180 comments scraped in 16.05 minutes
Done in 16.41 minutes


In [27]:
subreddit_name = "news"

news_df_posts, news_df_comments = scrape_posts_and_comments_from_reddit_top(subreddit_name=subreddit_name, num_post=20, time_period="all", comments_per_post=10)

news_df_comments.to_csv('../data/01_raw_news_data.csv', index=False)

0 comments scraped in 0.35 minutes
20 comments scraped in 1.36 minutes
40 comments scraped in 2.14 minutes
60 comments scraped in 3.04 minutes
80 comments scraped in 3.88 minutes
100 comments scraped in 4.7 minutes
120 comments scraped in 5.5 minutes
140 comments scraped in 6.34 minutes
160 comments scraped in 7.26 minutes
180 comments scraped in 8.17 minutes
Done in 8.76 minutes
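
One quick sanity check on an exported file is to confirm that no post contributes more rows than `comments_per_post`. The sketch below does this with the standard library only, grouping rows by the `post_id` column that the scraping function writes; `max_comments_per_post` is a hypothetical helper, and the sample data is made up purely so the example runs on its own.

```python
import csv
import io
from collections import Counter

def max_comments_per_post(csv_file):
    """Return the largest number of comment rows that share one post_id."""
    counts = Counter(row["post_id"] for row in csv.DictReader(csv_file))
    return max(counts.values(), default=0)

# Tiny in-memory stand-in for an exported comments file
sample = io.StringIO(
    "comment_id,post_id,body\n"
    "c1,t3_a,hi\n"
    "c2,t3_a,yo\n"
    "c3,t3_b,ok\n"
)
print(max_comments_per_post(sample))  # 2
```

On a real export, the file could be opened with `open('../data/01_raw_news_data.csv', newline='', encoding='utf-8')` and the result compared against the `comments_per_post` value used for the scrape.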


Note: the print statements above are for illustration only. They re-enact, on a smaller scale, the scraping of the actual dataset of an estimated 20,000 comments per subreddit from r/news and r/TheOnion. The actual scraped datasets use the same file names as the export calls in the code above.

Posts and comments were scraped separately for faster processing in EDA. Because the comments CSV file grew to more than 35,000 rows, it took significantly longer to process than the posts file, which had only 100 rows. However, EDA on the posts-only CSV file was found not to be meaningful because a sample of only 100 posts is too small. The posts-only DataFrame generated above is therefore not exported as a CSV file.