## 2. Data Collection

In [7]:
# install PRAW API
#!pip install praw

Collecting praw
  Downloading praw-7.7.0-py3-none-any.whl (189 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.5.2-py3-none-any.whl (56 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.7.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.5.2


In [1]:
# imports
import requests
import time
import pandas as pd
import praw

### Create function to perform scraping

The Reddit API has a rate limit of 60 requests per minute and returns at most 100 items per request [1]. The code below factors in both limits by fetching posts in batches of 100. It also checks the response status code: if the rate limit has been exceeded (status code 429), it waits for 60 seconds before making the next request.

[1] Jean-Christophe Chouinard, "Reddit API with Python (Complete Guide)", JC Chouinard, 29 April 2023, https://www.jcchouinard.com/reddit-api/
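As a rough illustration of the wait-and-retry idea described above, the sketch below wraps any fetch call so that it pauses and tries again when the rate limit is hit. The names `RateLimitExceeded` and `fetch_with_retry` are hypothetical and are not part of PRAW. The sketch only assumes that a 429 response surfaces as an exception, and it is not a description of how PRAW itself handles rate limits.

```python
import math
import time


class RateLimitExceeded(Exception):
    """Hypothetical stand-in for an HTTP 429 (rate limit exceeded) response."""


def fetch_with_retry(fetch, max_retries=3, wait_seconds=60, sleep=time.sleep):
    """Call fetch(); if the rate limit is hit, wait and try again.

    fetch is any zero-argument callable. After max_retries failed retries,
    the last RateLimitExceeded is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimitExceeded:
            if attempt == max_retries:
                raise
            sleep(wait_seconds)


# Number of requests needed for one subreddit at the limits quoted above
posts_per_request = 100
target_num_posts = 1000
requests_needed = math.ceil(target_num_posts / posts_per_request)
```

With 100 posts per request, 1000 posts take 10 requests per subreddit, which is well under the limit of 60 requests per minute, so the wait should rarely be triggered.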

In [2]:
def scrape_subreddit(subreddit_name):

    # create reddit instance
    reddit = praw.Reddit(
        client_id = '8k7-8aj8Hs7I2Ow1jHGVXQ',
        client_secret = 'sqMDOB0EeXbEIBBGPPDlRWX9upnrTA',
        user_agent = 'Post Scraping',
    )

    # get the subreddit object
    subreddit = reddit.subreddit(subreddit_name)

    # parameters
    posts_per_request = 100  # number of posts to fetch per request
    target_num_posts = 1000  # max limit set by reddit

    # create dictionary to store the data
    content = {'subreddit': [], 'post_content': [], 'upvote_ratio': []}

    num_posts = 0

    # fetch posts
    after = None
    while num_posts < target_num_posts:
        # fetch posts by batch
        posts = subreddit.new(limit = posts_per_request, params = {'after': after})

        # iterate over the posts and populate the dictionary
        for post in posts:
            content['subreddit'].append(str(post.subreddit))  # store the name, not the Subreddit object
            content['post_content'].append(post.selftext)
            content['upvote_ratio'].append(post.upvote_ratio)
            num_posts += 1

            if num_posts >= target_num_posts:
                break

        # get value of 'after' for the next request
        after = posts._listing.__dict__.get('after')

        # wait for 60 seconds if the status code is 429 (rate limit exceeded)
        if posts._listing.__dict__.get('response') and posts._listing.response.status_code == 429:
            print('Rate limit exceeded. Waiting for 60 seconds...')
            time.sleep(60)

        # break the loop if there are no more posts
        if not after:
            break

    # create dataframe
    return pd.DataFrame(content)
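The collection step of the function above can also be sketched independently of the paging logic. The helper below, `collect_posts`, is a hypothetical name introduced for illustration. It accepts any iterable of post-like objects and stops after a target number of posts. The objects in `fake_posts` are made-up stand-ins for PRAW submissions, not real data.

```python
from itertools import islice
from types import SimpleNamespace


def collect_posts(posts, target_num_posts=1000):
    """Collect up to target_num_posts posts into a dictionary of lists."""
    content = {'subreddit': [], 'post_content': [], 'upvote_ratio': []}
    # islice stops after target_num_posts items without an explicit counter
    for post in islice(posts, target_num_posts):
        # str() is assumed to give the subreddit name for a PRAW Subreddit object
        content['subreddit'].append(str(post.subreddit))
        content['post_content'].append(post.selftext)
        content['upvote_ratio'].append(post.upvote_ratio)
    return content


# Made-up stand-ins for PRAW submissions, for illustration only
fake_posts = [
    SimpleNamespace(subreddit='statistics', selftext=f'post {i}', upvote_ratio=0.5)
    for i in range(5)
]
sample = collect_posts(fake_posts, target_num_posts=3)
```

With PRAW, the call might look like `collect_posts(reddit.subreddit('statistics').new(limit=1000))`, on the assumption that the listing generator pages through results in batches of up to 100 on its own. The result can be passed to `pd.DataFrame` as in the function above.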

#### a. Subreddit 1: Statistics (https://www.reddit.com/r/statistics/)

In [3]:
# scrape subreddit 1
stats_content = scrape_subreddit('statistics')

stats_content

,subreddit,post_content,upvote_ratio
0,statistics,"Hello everyone!\n\nFirstly, please let me know...",1.00
1,statistics,It would be great if the numbers are from 2022...,0.33
2,statistics,"Hi, I would like to know what sort of approach...",0.67
3,statistics,I recently got a job offer for a chemistry ass...,0.76
4,statistics,I am a foreign student in the United States of...,0.65
...,...,...,...
994,statistics,"I thought up a small ""language"" for describing...",0.79
995,statistics,"Hello,\n\nI am taking a course in correlation/...",0.50
996,statistics,"Hi all,\n\nI am reviewing a study on gun-relat...",1.00
997,statistics,I started my master's in statistics 6 months a...,1.00


#### b. Subreddit 2: Machine Learning (https://www.reddit.com/r/MachineLearning/)

In [4]:
# scrape subreddit 2
ml_content = scrape_subreddit('machinelearning')

ml_content

,subreddit,post_content,upvote_ratio
0,MachineLearning,I would really appreciate feedback on a versio...,1.00
1,MachineLearning,\n\nIs there any LLM available which can be t...,0.60
2,MachineLearning,I keep reading about open source LLMs that is ...,0.71
3,MachineLearning,Trying to keep up with paper reading but I str...,1.00
4,MachineLearning,I am an experienced software engineer who want...,0.78
...,...,...,...
976,MachineLearning,\n\nI built an app where I recognize the lan...,0.22
977,MachineLearning,Hi everyone\n\nI was reading about data privac...,0.77
978,MachineLearning,"For my project, given *N* steps from time seri...",0.82
979,MachineLearning,The deep ensemble paper [https://arxiv.org/pdf...,1.00


### Combine both sets of data

A total of 999 posts were scraped from r/statistics and 981 posts from r/MachineLearning. The two dataframes are combined to facilitate the exploratory data analysis (EDA) that follows.

In [5]:
# concatenate both df
subreddit_content = pd.concat([stats_content, ml_content], axis = 0)

subreddit_content.head()

,subreddit,post_content,upvote_ratio
0,statistics,"Hello everyone!\n\nFirstly, please let me know...",1.0
1,statistics,It would be great if the numbers are from 2022...,0.33
2,statistics,"Hi, I would like to know what sort of approach...",0.67
3,statistics,I recently got a job offer for a chemistry ass...,0.76
4,statistics,I am a foreign student in the United States of...,0.65


In [6]:
# remove duplicates
subreddit_content_final = subreddit_content.drop_duplicates(subset = ['post_content'], keep = 'first')
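The toy example below, built from made-up rows rather than the scraped data, sketches two side effects of the combine and de-duplicate steps that may be worth checking. First, `pd.concat` keeps each frame's original index unless `ignore_index=True` is passed. Second, `drop_duplicates` on `post_content` treats every post with an empty body as a duplicate of the first one, even across subreddits. The frame names `stats_demo` and `ml_demo` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two scraped frames (made-up rows, not real data)
stats_demo = pd.DataFrame({
    'subreddit': ['statistics', 'statistics', 'statistics'],
    'post_content': ['What is a p-value?', '', 'What is a p-value?'],
    'upvote_ratio': [1.0, 0.5, 0.8],
})
ml_demo = pd.DataFrame({
    'subreddit': ['MachineLearning', 'MachineLearning'],
    'post_content': ['', 'How do I tune an LSTM?'],
    'upvote_ratio': [0.9, 0.7],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([stats_demo, ml_demo], axis=0, ignore_index=True)

# Keep only the first occurrence of each post body
deduped = combined.drop_duplicates(subset=['post_content'], keep='first')

# Count posts with an empty body before and after de-duplication
empty_before = int((combined['post_content'] == '').sum())
empty_after = int((deduped['post_content'] == '').sum())
```

In this toy data, two empty-body posts from different subreddits collapse into one. If that is not intended for the real data, one option is to filter out empty posts explicitly before removing duplicates.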

In [9]:
# export as .csv file
subreddit_content_final.to_csv('../dataset/subreddit_content.csv', index = False)