# Project 3 : Web APIs & NLP - Classification of subreddit posts

## Part 2 - Introduction & data scraping

## Background

The many trading platforms now available make it easy to open a trading account, which has made stocks and cryptocurrencies highly accessible to the average retail investor.

As a trading firm, we need to keep abreast of developments and discussions on platforms like Reddit to understand the prevailing trading sentiment in the current volatile environment. We aim to sieve out trading trends and sentiments by analyzing Reddit posts, in support of two of our most important trading desks: equities and cryptocurrencies.

## Problem Statement

As a new investment firm with two main trading desks (one for traditional securities and one for cryptocurrency), we aim to develop a tool to analyze the top trending topics in equities and cryptocurrencies. In doing so, we can explore whether any specific trending topic correlates with a stock ticker or cryptocurrency, to support the firm in taking data-informed trading positions.

The idea is to develop a classification model that leverages natural language processing (NLP) to classify content into the categories 'investing' vs. 'cryptocurrency'.\
This is no mean feat, as r/investing and r/cryptocurrency are likely to share many parallels since both revolve around trading. What makes each subreddit unique is exactly what we are keen to find out, and is part of the deliverables of this undertaking.

Performance metrics that form part of the success criteria are listed below.\
(In the binary classification, 1 or positive refers to the r/investing category, while 0 or negative refers to the r/cryptocurrency category.)
- accuracy: the proportion of all posts that the model classifies correctly
- sensitivity (true positive rate): the proportion of r/investing posts that the model correctly classifies as r/investing
- specificity (true negative rate): the proportion of r/cryptocurrency posts that the model correctly classifies as r/cryptocurrency
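
As a minimal sketch of how these three metrics relate to the confusion-matrix counts, the snippet below computes them in plain Python. The function name `classification_metrics` and the label lists are made up for illustration and are not part of the project's modelling code; the sketch also assumes both classes are present, otherwise a division by zero would occur.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels.

    Labels follow the convention above: 1 = r/investing, 0 = r/cryptocurrency.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # investing predicted as investing
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # crypto predicted as crypto
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # crypto predicted as investing
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # investing predicted as crypto
    return {
        'accuracy': (tp + tn) / len(pairs),
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
    }

# made-up labels purely for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.625, 'sensitivity': 0.75, 'specificity': 0.5}
```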

## Import Libraries

In [1]:
# import libraries

import requests
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# set config to display all
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Automated data scraping from Reddit via the Pushshift API

The two chosen subreddits are "r/investing" and "r/CryptoCurrency".\
We define a function to extract the requisite data from a given subreddit for a specified timeframe.\
Based on the initial pre-scraping data exploration in the previous notebook, some additional steps help to sieve out posts that were either 'removed' or 'deleted'.

In [3]:
# this function sends a GET request to pushshift and returns the matching posts
# adapted from https://colab.research.google.com/drive/1biLcXeHs8yZD1x9f3gv-cNJXEq7tpyoO?usp=sharing#scrollTo=mDma-H_k0frf

def get_push_shift_data(subreddit, end_date_unix):
    
    # push shift URL endpoints
    url ='https://api.pushshift.io/reddit/search/submission'
    
    # define parameters for the posts
    params = {
        'subreddit':subreddit,
        'size':100,
        'before': end_date_unix,
    }
    
    # get the response
    response = requests.get(url, params)
    
    # in the event of error due to time out, etc, pause for 1 second and make a new request
    while response.status_code != 200:
        time.sleep(1)
        response = requests.get(url, params)
    
    # define the JSON data to a variable
    data = response.json()
    
    # define the posts in the data
    posts = data['data']
    
    # returns the list of posts 
    return posts
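
The retry loop above keeps requesting until a 200 status comes back, so it would not terminate if the endpoint stayed unavailable. As a hedged alternative, here is a minimal sketch of a bounded retry with exponential backoff. The helper `fetch_with_retry`, its parameters and the `FakeResponse` class are hypothetical and exist only for this illustration; the request itself is passed in as a callable so the sketch runs without network access.

```python
import time

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a response with status_code 200, or give up."""
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code == 200:
            return response
        # wait 1s, 2s, 4s, ... before the next attempt (no wait after the final one)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f'no successful response after {max_attempts} attempts')

# stand-in for a requests.Response, for illustration only
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([500, 429, 200])
delays = []  # record the waits instead of actually sleeping
response = fetch_with_retry(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(response.status_code, delays)
# 200 [1.0, 2.0]
```

In the scraping function, the callable could be something like `lambda: requests.get(url, params=params, timeout=30)`; the 30-second timeout is an arbitrary choice for this sketch.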

This function extracts the relevant columns from the Reddit data and omits rows (posts) that have been removed by Reddit or a moderator, or simply deleted.

In [4]:
# define function to extract the columns of interest from each JSON result, while eliminating removed or deleted posts
# returns the extracted post as a dictionary, or None if the post is filtered out

def extract_posts(data_posts):
    
    # generate a dict
    post = {}
    
    # keep only posts that are still live on reddit, i.e. not removed by reddit or a moderator, and not deleted
    # 'is_crosspostable' is used as the proxy here, so the variable is named for what True actually means
    is_live = data_posts['is_crosspostable']
    
    if is_live is True:
        selftext = data_posts['selftext']
        if selftext != '':
    
            # extract columns of interest
            post['title'] = data_posts['title']
            post['create'] = data_posts['created_utc']
            post['subreddit'] = data_posts['subreddit']
            post['body'] = data_posts['selftext']
            post['author'] = data_posts['author']
            post['permalink'] = data_posts['permalink']
            post['num_comments'] = data_posts['num_comments']
            
            # returns a dict of the extracted post
            return post
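
The filter above relies on `is_crosspostable` plus a non-empty body. A complementary check, sketched below, looks at the body text itself. It assumes that removed and deleted posts carry the placeholder strings `'[removed]'` and `'[deleted]'` in `selftext`; that is an assumption about the scraped data, not something this notebook verifies, and `has_usable_body` and the sample records are hypothetical.

```python
# placeholder strings assumed to mark removed or deleted post bodies
PLACEHOLDER_BODIES = {'', '[removed]', '[deleted]'}

def has_usable_body(record):
    """Return True when the record carries real body text."""
    body = record.get('selftext') or ''  # tolerate a missing or None field
    return body.strip() not in PLACEHOLDER_BODIES

# made-up records for illustration
sample_records = [
    {'title': 'kept', 'selftext': 'Some real discussion text'},
    {'title': 'removed by moderator', 'selftext': '[removed]'},
    {'title': 'deleted by author', 'selftext': '[deleted]'},
    {'title': 'link post without body', 'selftext': ''},
    {'title': 'missing field'},
]
kept = [r['title'] for r in sample_records if has_usable_body(r)]
print(kept)
# ['kept']
```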

This function returns the time that the post was created. It is used on the last post of each batch (the 100th when the batch is full), so that its timestamp can serve as the starting point for scraping the next 100 posts.

In [5]:
def post_time(data_posts):
    
    # extract the time created for the post
    # (named 'created' to avoid shadowing the imported time module)
    created = data_posts['created_utc']
    
    return created

Define the period of interest: posts created before 25 Oct 2021, 23:59:59 (Singapore time).

In [6]:
# define the cut-off of interest: 25 Oct 2021 23:59:59 SG time
# equivalent unix epoch timestamp for end: 1635177599
# reference from https://www.epochconverter.com/
end = 1635177599
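
The hard-coded timestamp can also be reproduced in code rather than looked up on a converter site. This is a small sketch using only the standard library; it assumes Singapore time is a fixed UTC+8 offset, and the names `SGT`, `end_dt` and `end_check` are introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Singapore time as a fixed UTC+8 offset (assumption: no daylight saving applies)
SGT = timezone(timedelta(hours=8))

# 25 Oct 2021 23:59:59 in Singapore time
end_dt = datetime(2021, 10, 25, 23, 59, 59, tzinfo=SGT)
end_check = int(end_dt.timestamp())
print(end_check)
# 1635177599
```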

### Scraping subreddit - investing posts

Use a while loop to scrape posts from the subreddit that are not removed or deleted.\
The minimum target is 1000 posts; the loop below is set to collect 5000 so that there remains a buffer for the subsequent data cleaning step.

In [7]:
# start time counter
start_time = time.time()

# initialize with a get request for the first 100 posts
invest_data = get_push_shift_data('investing', end)

# generate an empty list to store the retrieved posts
invest_posts = []
count = 0

# extract a target of 5000 posts that are not removed or deleted
# loop through repeated requests, as each pushshift request is set to return 100 posts
while len(invest_posts) < 5000:
    
    # for each post returned by the get request, extract the relevant columns
    for i in invest_data:
        post = extract_posts(i)
        
        # if the extracted post is not None, i.e. it was not deleted or removed, add it to the list
        if post is not None:
            invest_posts.append(post)
    
    # get the time of the last post in the get request
    last_post = invest_data[-1]
    last_time = post_time(last_post)
    
    # request the next 100 posts via push shift
    invest_data = get_push_shift_data('investing', last_time)
    count+=1
    
    print(f'looping {count} time(s). num of posts: {len(invest_posts)}')

print(f'\nlast post scraped is {(end-last_time)/(60*60*24):.0f} days before 25 Oct.')

# print the execution time
execution_time = (time.time() - start_time)
print(f'code execution time : {execution_time:.2f} seconds')

looping 1 time(s). num of posts: 17
looping 2 time(s). num of posts: 31
looping 3 time(s). num of posts: 49
looping 4 time(s). num of posts: 62
looping 5 time(s). num of posts: 77
looping 6 time(s). num of posts: 91
looping 7 time(s). num of posts: 105
looping 8 time(s). num of posts: 124
looping 9 time(s). num of posts: 139
looping 10 time(s). num of posts: 158
looping 11 time(s). num of posts: 172
looping 12 time(s). num of posts: 189
looping 13 time(s). num of posts: 204
looping 14 time(s). num of posts: 218
looping 15 time(s). num of posts: 236
looping 16 time(s). num of posts: 251
looping 17 time(s). num of posts: 267
looping 18 time(s). num of posts: 285
looping 19 time(s). num of posts: 300
looping 20 time(s). num of posts: 321
looping 21 time(s). num of posts: 336
looping 22 time(s). num of posts: 361
looping 23 time(s). num of posts: 381
looping 24 time(s). num of posts: 394
looping 25 time(s). num of posts: 406
looping 26 time(s). num of posts: 421
looping 27 time(s). num of 

looping 211 time(s). num of posts: 4443
looping 212 time(s). num of posts: 4458
looping 213 time(s). num of posts: 4471
looping 214 time(s). num of posts: 4481
looping 215 time(s). num of posts: 4490
looping 216 time(s). num of posts: 4501
looping 217 time(s). num of posts: 4520
looping 218 time(s). num of posts: 4528
looping 219 time(s). num of posts: 4541
looping 220 time(s). num of posts: 4557
looping 221 time(s). num of posts: 4573
looping 222 time(s). num of posts: 4588
looping 223 time(s). num of posts: 4604
looping 224 time(s). num of posts: 4612
looping 225 time(s). num of posts: 4615
looping 226 time(s). num of posts: 4628
looping 227 time(s). num of posts: 4640
looping 228 time(s). num of posts: 4649
looping 229 time(s). num of posts: 4666
looping 230 time(s). num of posts: 4682
looping 231 time(s). num of posts: 4694
looping 232 time(s). num of posts: 4706
looping 233 time(s). num of posts: 4727
looping 234 time(s). num of posts: 4735
looping 235 time(s). num of posts: 4748
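
The scraping loop assumes every request returns at least one post; an empty page would raise an `IndexError` when the last post is read. Below is a hedged sketch of the same cursor-style pagination with that case handled. The function `collect_posts` and the fake fetch and extract helpers are hypothetical; they stand in for `get_push_shift_data` and `extract_posts` so the sketch runs offline.

```python
def collect_posts(fetch_page, extract, start_before, target):
    """Page backwards in time until `target` posts are kept or no pages remain."""
    kept = []
    before = start_before
    while len(kept) < target:
        page = fetch_page(before)
        if not page:
            break  # nothing older is available, so stop instead of indexing an empty list
        for record in page:
            post = extract(record)
            if post is not None:
                kept.append(post)
        # the last record on the page becomes the next 'before' cursor
        before = page[-1]['created_utc']
    return kept

# fake data source: ten records, newest first, of which only even timestamps are kept
fake_records = [{'created_utc': t, 'keep': t % 2 == 0} for t in range(10, 0, -1)]

def fake_fetch(before, size=3):
    return [r for r in fake_records if r['created_utc'] < before][:size]

def fake_extract(record):
    return record if record['keep'] else None

posts = collect_posts(fake_fetch, fake_extract, start_before=11, target=100)
print([p['created_utc'] for p in posts])
# [10, 8, 6, 4, 2]
```

As in the notebook's loop, a whole page is processed before the target is re-checked, so the final count can slightly exceed the target.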


Convert the extracted data into a dataframe and preview it.

In [8]:
# convert into dataframe
invest = pd.DataFrame(invest_posts)
# preview data
invest.head()

Unnamed: 0,title,create,subreddit,body,author,permalink,num_comments
0,A Great Short Setup on NYSE,1635167989,investing,[Snapshot](https://www.tradingview.com/x/xdfUP...,mildcharts,/r/investing/comments/qfgl43/a_great_short_set...,1
1,A Great Short Setup on NYSE,1635167785,investing,* Double top on ATH level\n* Bullish RSI diver...,mildcharts,/r/investing/comments/qfgirk/a_great_short_set...,1
2,Inherited large amount of equity.. Not sure wh...,1635167216,investing,I'm 25 years old and just inherited a portion ...,EselSchwanz,/r/investing/comments/qfgc31/inherited_large_a...,2
3,Daily General Discussion and spitballin thread...,1635152534,investing,Have a general question? Want to offer some c...,AutoModerator,/r/investing/comments/qfckhu/daily_general_dis...,142
4,Daily Advice Thread - All basic help or advice...,1635152476,investing,"If your question is ""I have $10,000, what do I...",AutoModerator,/r/investing/comments/qfck11/daily_advice_thre...,101


Take a look at the number of rows and columns in the dataframe

In [9]:
# check the number of rows and columns
invest.shape

(5012, 7)

Remove duplicated posts by the same author with the same body content.\
Also remove duplicated posts by the same author with the same title, as these were observed to be duplicates even though their body content might not be exactly the same.

In [10]:
# run two passes of duplicate cleaning
# first, drop rows with the same body and author
# second, drop rows with the same title and author
invest.drop_duplicates(subset=['body','author'], inplace=True)
invest.drop_duplicates(subset=['title','author'], inplace=True)
invest.head()

Unnamed: 0,title,create,subreddit,body,author,permalink,num_comments
0,A Great Short Setup on NYSE,1635167989,investing,[Snapshot](https://www.tradingview.com/x/xdfUP...,mildcharts,/r/investing/comments/qfgl43/a_great_short_set...,1
2,Inherited large amount of equity.. Not sure wh...,1635167216,investing,I'm 25 years old and just inherited a portion ...,EselSchwanz,/r/investing/comments/qfgc31/inherited_large_a...,2
3,Daily General Discussion and spitballin thread...,1635152534,investing,Have a general question? Want to offer some c...,AutoModerator,/r/investing/comments/qfckhu/daily_general_dis...,142
4,Daily Advice Thread - All basic help or advice...,1635152476,investing,"If your question is ""I have $10,000, what do I...",AutoModerator,/r/investing/comments/qfck11/daily_advice_thre...,101
5,Advice on investing and why people overcomplic...,1635140198,investing,Hi guys I'm a 19 year old kid and I've investe...,Worried_individual_,/r/investing/comments/qf9yjv/advice_on_investi...,19
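
To make the effect of the two passes concrete, here is a small sketch on a made-up frame (the `toy` data is invented for illustration). It also resets the index afterwards, which the notebook does not do; that step is optional and only renumbers the remaining rows.

```python
import pandas as pd

# made-up posts: rows 0 and 1 share title and author, rows 2 and 3 share body and author
toy = pd.DataFrame({
    'title':  ['A', 'A', 'B', 'C'],
    'body':   ['x', 'x edited', 'y', 'y'],
    'author': ['u1', 'u1', 'u2', 'u2'],
})

deduped = (
    toy.drop_duplicates(subset=['body', 'author'])    # pass 1: same body and author
       .drop_duplicates(subset=['title', 'author'])   # pass 2: same title and author
       .reset_index(drop=True)                        # renumber rows from 0
)
print(deduped['title'].tolist())
# ['A', 'B']
```

By default `drop_duplicates` keeps the first occurrence, which in this data is the most recent post since rows are ordered newest first.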


Confirm the removal of duplicates. Check the number of remaining rows and columns.

In [11]:
invest.shape

(4533, 7)

In [20]:
invest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4533 entries, 0 to 5011
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         4533 non-null   object
 1   create        4533 non-null   int64 
 2   subreddit     4533 non-null   object
 3   body          4533 non-null   object
 4   author        4533 non-null   object
 5   permalink     4533 non-null   object
 6   num_comments  4533 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 283.3+ KB


Export the data to CSV.

In [12]:
invest.to_csv('../data/invest4533.csv', index=False)

### Scraping subreddit - CryptoCurrency posts

Repeat the process to scrape CryptoCurrency subreddit posts as well.

In [13]:
# start time counter
start_time = time.time()

# initialize with a get request for the first 100 posts
crypto_data = get_push_shift_data('CryptoCurrency', end)

# generate an empty list to store the retrieved posts
crypto_posts = []
count = 0

# extract a target of 5000 posts that are not removed or deleted
# loop through repeated requests, as each pushshift request is set to return 100 posts
while len(crypto_posts) < 5000:
    
    # for each post returned by the get request, extract the relevant columns
    for i in crypto_data:
        post = extract_posts(i)
        
        # if the extracted post is not None, i.e. it was not deleted or removed, add it to the list
        if post is not None:
            crypto_posts.append(post)
    
    # get the time of the last post in the get request
    last_post = crypto_data[-1]
    last_time = post_time(last_post)
    
    # request the next 100 posts via push shift
    crypto_data = get_push_shift_data('CryptoCurrency', last_time)
    count+=1
    
    print(f'looping {count} times. num of posts: {len(crypto_posts)}')

print(f'\nlast post scraped is {(end-last_time)/(60*60*24):.0f} days before 25 Oct.')

# print the execution time
execution_time = (time.time() - start_time)
print(f'code execution time : {execution_time:.2f} seconds')

looping 1 times. num of posts: 25
looping 2 times. num of posts: 58
looping 3 times. num of posts: 92
looping 4 times. num of posts: 116
looping 5 times. num of posts: 147
looping 6 times. num of posts: 171
looping 7 times. num of posts: 192
looping 8 times. num of posts: 219
looping 9 times. num of posts: 250
looping 10 times. num of posts: 272
looping 11 times. num of posts: 295
looping 12 times. num of posts: 321
looping 13 times. num of posts: 342
looping 14 times. num of posts: 371
looping 15 times. num of posts: 400
looping 16 times. num of posts: 426
looping 17 times. num of posts: 454
looping 18 times. num of posts: 490
looping 19 times. num of posts: 524
looping 20 times. num of posts: 557
looping 21 times. num of posts: 585
looping 22 times. num of posts: 613
looping 23 times. num of posts: 641
looping 24 times. num of posts: 670
looping 25 times. num of posts: 700
looping 26 times. num of posts: 735
looping 27 times. num of posts: 751
looping 28 times. num of posts: 771
loop

Convert the extracted data into a dataframe.

In [14]:
crypto = pd.DataFrame(crypto_posts)
crypto.head()

Unnamed: 0,title,create,subreddit,body,author,permalink,num_comments
0,Is Binance afraid of Crypto.com?,1635177574,CryptoCurrency,So I have to assume as the title says above. I...,Markmanus,/r/CryptoCurrency/comments/qfjw21/is_binance_a...,51
1,What's been your worst move in Crypto? (So far),1635177350,CryptoCurrency,We've all made silly mistakes since our incept...,frostybitz,/r/CryptoCurrency/comments/qfjt5m/whats_been_y...,205
2,IRS wants to monitor your bank account flow. T...,1635177349,CryptoCurrency,Ok so I'll start with the article link from NP...,Mediocre-Sale8473,/r/CryptoCurrency/comments/qfjt5b/irs_wants_to...,12
3,Financial Lifehacks To Get Extra Money for Buy...,1635177185,CryptoCurrency,Hello there do you wish you could put some mor...,Many_Arm7466,/r/CryptoCurrency/comments/qfjr0v/financial_li...,17
4,Want to earn some extra Crypto on the side? Ch...,1635177106,CryptoCurrency,Check out the best play-to-earn crypto game on...,silver_sean,/r/CryptoCurrency/comments/qfjq0i/want_to_earn...,7


Take a look at the number of rows and columns in the dataframe

In [15]:
crypto.shape

(5025, 7)

Likewise, remove duplicated posts by the same author with the same body content.\
Also remove duplicated posts by the same author with the same title, as these were observed to be duplicates even though their body content might not be exactly the same.

In [16]:
crypto.drop_duplicates(subset=['body','author'], inplace=True)
crypto.drop_duplicates(subset=['title','author'], inplace=True)

crypto.head()

Unnamed: 0,title,create,subreddit,body,author,permalink,num_comments
0,Is Binance afraid of Crypto.com?,1635177574,CryptoCurrency,So I have to assume as the title says above. I...,Markmanus,/r/CryptoCurrency/comments/qfjw21/is_binance_a...,51
1,What's been your worst move in Crypto? (So far),1635177350,CryptoCurrency,We've all made silly mistakes since our incept...,frostybitz,/r/CryptoCurrency/comments/qfjt5m/whats_been_y...,205
2,IRS wants to monitor your bank account flow. T...,1635177349,CryptoCurrency,Ok so I'll start with the article link from NP...,Mediocre-Sale8473,/r/CryptoCurrency/comments/qfjt5b/irs_wants_to...,12
3,Financial Lifehacks To Get Extra Money for Buy...,1635177185,CryptoCurrency,Hello there do you wish you could put some mor...,Many_Arm7466,/r/CryptoCurrency/comments/qfjr0v/financial_li...,17
4,Want to earn some extra Crypto on the side? Ch...,1635177106,CryptoCurrency,Check out the best play-to-earn crypto game on...,silver_sean,/r/CryptoCurrency/comments/qfjq0i/want_to_earn...,7


Confirm the removal of duplicates. Check the number of remaining rows and columns.

In [17]:
crypto.shape

(4888, 7)

In [18]:
crypto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4888 entries, 0 to 5024
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         4888 non-null   object
 1   create        4888 non-null   int64 
 2   subreddit     4888 non-null   object
 3   body          4888 non-null   object
 4   author        4888 non-null   object
 5   permalink     4888 non-null   object
 6   num_comments  4888 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 305.5+ KB


Export the data to CSV.

In [19]:
crypto.to_csv('../data/crypto4888.csv', index=False)

### Data Dictionary

|  Feature  |  Type  |  Dataset  |  Description  |
|:----------|:------:|:---------:|:--------------|
| title     | object | invest    | The title of the Reddit post |
| create    | int    | invest    | The time the Reddit post was created, as a Unix epoch timestamp in seconds |
| subreddit | object | invest    | The subreddit that the post belongs to |
| body      | object | invest    | The body contents of the Reddit post |
| author    | object | invest    | The author of the Reddit post |
| permalink | object | invest    | The permalink (relative URL path) of the Reddit post |
| num_comments | int | invest    | The number of comments for the Reddit post |
| title     | object | crypto    | The title of the Reddit post |
| create    | int    | crypto    | The time the Reddit post was created, as a Unix epoch timestamp in seconds |
| subreddit | object | crypto    | The subreddit that the post belongs to |
| body      | object | crypto    | The body contents of the Reddit post |
| author    | object | crypto    | The author of the Reddit post |
| permalink | object | crypto    | The permalink (relative URL path) of the Reddit post |
| num_comments | int | crypto    | The number of comments for the Reddit post |
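
Since `create` is stored as raw epoch seconds, a readable timestamp column may be convenient in later analysis. The sketch below shows one way to derive it with pandas; the `sample` frame and the `created_sgt` column name are made up for illustration, and the two values are taken from the cut-off and the first preview row above.

```python
import pandas as pd

# two epoch values from this notebook: the first previewed invest post and the cut-off
sample = pd.DataFrame({'create': [1635167989, 1635177599]})

# interpret the integers as seconds since the Unix epoch (UTC), then convert to Singapore time
sample['created_sgt'] = (
    pd.to_datetime(sample['create'], unit='s', utc=True)
      .dt.tz_convert('Asia/Singapore')
)
print(sample['created_sgt'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist())
# ['2021-10-25 21:19:49', '2021-10-25 23:59:59']
```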
