# Project 3: Web APIs & NLP

**Project Statement:** This project aims to help redditors (Reddit users) decide which of two subreddits, `r/Science` or `r/Philosophy`, a submission belongs in, by building an NLP classification model that distinguishes posts from the two subreddits.

**Summary:** According to the NYTimes, for roughly 98 percent of the last 2,500 years of Western intellectual history, philosophy was considered the mother of all knowledge. Today, science, not philosophy, has taken up the mantle as the world’s de facto source of truth, and some are no longer sure what philosophy is or what it is good for. This leads to confusion, especially online, where like-minded individuals gather to discuss topics: submissions end up in the wrong community, which reduces the effectiveness of these discussions.

This is especially true of forums like [Reddit](https://www.reddit.com/), where a "network of communities" lets people "dive into their interests, hobbies and passions". Such sites are popular places for people from different parts of the world to interact in ways that would not normally happen without the connectivity of the Internet.

So how can we maximise the productivity of these discussions? The best solution is to ensure that topics are posted in the right community. Given the confusion described above, however, some users are bound to have trouble telling the two apart. This project aims to reduce the problem by training a model, using various NLP techniques, that identifies the differences between posts from each subreddit and classifies new posts accordingly. The insights gained from our analysis and processing may be useful both to the moderators of the respective subreddits and to ordinary redditors who want to know which subreddit their post belongs in.


## Contents:
- [Scraping Subreddits](#Scraping-Subreddits)
- [Data Export](#Data-Export)

## Libraries

In [2]:
# Import libraries
import pandas as pd

from psaw import PushshiftAPI
import requests
from bs4 import BeautifulSoup

from pprint import pprint
import time
from datetime import datetime

## Scraping Subreddits

We will be using the [Pushshift](https://github.com/pushshift/api) API to collect posts from the two subreddits. However, the data cannot simply be downloaded in a single request. The API returns at most 100 posts per request, so we have to call it repeatedly, pausing between requests to avoid being locked out by its anti-bot systems.
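
The repeated-call pattern can be tried offline before touching the real API. The sketch below is illustrative only: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names that do not appear elsewhere in this notebook, and the stand-in fetcher assumes that a `before` cutoff excludes posts created exactly at that timestamp.

```python
from typing import Callable, Dict, List

Post = Dict[str, int]


def paginate_before(fetch_page: Callable[[int, int], List[Post]],
                    start_time: int, n_iter: int, size: int = 100) -> List[Post]:
    """Collect posts newest-first by moving a `before` cutoff back in time."""
    collected: List[Post] = []
    before = start_time
    for _ in range(n_iter):
        page = fetch_page(before, size)
        if not page:  # nothing older is left, so stop early
            break
        collected.extend(page)
        # The oldest timestamp on this page becomes the next cutoff.
        before = min(post["created_utc"] for post in page)
    return collected


# Hypothetical stand-in for the API: 250 fake posts, one per second.
FAKE_POSTS = [{"created_utc": t} for t in range(1000, 1250)]


def fake_fetch(before: int, size: int) -> List[Post]:
    """Return up to `size` fake posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return sorted(older, key=lambda p: p["created_utc"], reverse=True)[:size]


posts = paginate_before(fake_fetch, start_time=1250, n_iter=5)
print(len(posts))
```

Passing the fetcher in as an argument keeps the paging logic testable without network access. With real data, several posts can share one `created_utc` value, so a strict cutoff may skip some of them; that may be one reason a scrape returns slightly fewer posts than requested.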

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
url = "https://api.pushshift.io/reddit/search/submission"
url_comments = "https://api.pushshift.io/reddit/search/comment"

In [4]:
params = {
    'subreddit' : 'philosophy', 
    'size' : '100'
}

In [5]:
# for submissions
res = requests.get(url, params)
res.status_code

200

In [6]:
data = res.json()
philo = data['data']

In [29]:
# check the result of a single submission request
df_sub_philo = pd.DataFrame(philo)
print(df_sub_philo.shape)
df_sub_philo.head()

(100, 75)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color,link_flair_css_class,link_flair_template_id,link_flair_text
0,[],False,Ok_Berry_3561,,[],,text,t2_t22zntol,False,False,False,[],False,False,1664849887,self.philosophy,https://www.reddit.com/r/philosophy/comments/x...,{},xv2mzf,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/philosophy/comments/xv2mzf/language_and_phi...,False,6,automod_filtered,1664849897,1,[removed],True,False,False,philosophy,t5_2qh5b,16897685,public,confidence,self,Language and Philosophy,0,[],1.0,https://www.reddit.com/r/philosophy/comments/x...,all_ads,6,,,,,,,,,,,,,,
1,[],False,giggly_gigabyte,,[],,text,t2_9tpdrzr9,False,False,False,[],False,False,1664849725,self.philosophy,https://www.reddit.com/r/philosophy/comments/x...,{},xv2kui,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/philosophy/comments/xv2kui/good_universitie...,False,6,automod_filtered,1664849736,1,[removed],True,False,False,philosophy,t5_2qh5b,16897682,public,confidence,self,good universities to study at in the US,0,[],1.0,https://www.reddit.com/r/philosophy/comments/x...,all_ads,6,,,,,,,,,,,,,,
2,[],False,Logical-Steak4716,,[],,text,t2_hh587ve8,False,False,False,[],False,False,1664849291,self.philosophy,https://www.reddit.com/r/philosophy/comments/x...,{},xv2f6h,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/philosophy/comments/xv2f6h/does_art_have_an...,False,6,moderator,1664849301,1,[removed],True,False,False,philosophy,t5_2qh5b,16897675,public,confidence,self,"Does art have any intrinsic value, or is it ar...",0,[],1.0,https://www.reddit.com/r/philosophy/comments/x...,all_ads,6,,,,,,,,,,,,,,
3,[],False,Snoo_82044,,[],,text,t2_74ih7gr5,False,False,False,[],False,False,1664845756,self.philosophy,https://www.reddit.com/r/philosophy/comments/x...,{},xv14ib,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/philosophy/comments/xv14ib/if_we_were_immor...,False,6,moderator,1664845767,1,[removed],True,False,False,philosophy,t5_2qh5b,16897641,public,confidence,self,"If we were immortal,do you think crime rates w...",0,[],1.0,https://www.reddit.com/r/philosophy/comments/x...,all_ads,6,,,,,,,,,,,,,,
4,[],False,AFITNAFITNA,,[],,text,t2_rryp0wum,False,False,False,[],False,False,1664845581,self.philosophy,https://www.reddit.com/r/philosophy/comments/x...,{},xv126e,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/philosophy/comments/xv126e/what_is_and_what...,False,6,automod_filtered,1664845591,1,[removed],True,False,False,philosophy,t5_2qh5b,16897636,public,confidence,self,"What is and what isn’t leftist, please help an...",0,[],1.0,https://www.reddit.com/r/philosophy/comments/x...,all_ads,6,,,,,,,,,,,,,,


In [14]:
# helper that pages through the Pushshift submission endpoint
def get_submission(subreddit, n_iter):

    # scrape started on Saturday, October 1, 2022 1:19:41 PM (SGT), i.e. epoch 1664601581
    df_list = []
    current_time = 1664601581

    # each iteration requests up to 100 posts older than `current_time`
    for _ in range(n_iter):
        res = requests.get(
            url,
            params={
                "subreddit": subreddit,
                "size": 100,
                "before": current_time
            }
        )
        # wait 3 seconds between requests to avoid being throttled or locked out
        time.sleep(3)
        res.raise_for_status()  # fail loudly on an error response
        df = pd.DataFrame(res.json()['data'])
        if df.empty:  # no older posts are left
            break
        df = df.loc[:, ["subreddit", "title", "created_utc"]]
        df_list.append(df)
        # the oldest post in this batch becomes the cutoff for the next request
        current_time = df.created_utc.min()

    return pd.concat(df_list, axis=0)  # stack the batches into one DataFrame

### r/Philosophy

In [15]:
# 250 requests of up to 100 posts each, for up to 25,000 r/Philosophy posts
df_submission_philo = get_submission('philosophy', 250)

In [30]:
# check the earliest and latest submission times (UTC)
print(datetime.utcfromtimestamp(df_submission_philo['created_utc'].min()).strftime('%Y-%m-%d %H:%M:%S'))
print(datetime.utcfromtimestamp(df_submission_philo['created_utc'].max()).strftime('%Y-%m-%d %H:%M:%S'))

2021-05-20 17:49:19
2022-10-01 04:33:48


In [16]:
#checking
print(df_submission_philo.shape)
df_submission_philo.head()

(24988, 3)


| | subreddit | title | created_utc |
|---|---|---|---|
| 0 | philosophy | Random NDE Finder | 1664598828 |
| 1 | philosophy | Authenticity :) | 1664592382 |
| 2 | philosophy | The Education System Failed Philosophy | 1664591129 |
| 3 | philosophy | anyone care to have an honest open conversatio... | 1664588410 |
| 4 | philosophy | Philosophy has made me question why I study ph... | 1664587740 |


### r/Science

In [21]:
# 250 requests of up to 100 posts each, for up to 25,000 r/Science posts
df_submission_science = get_submission('science', 250)

In [32]:
#checking
print(df_submission_science.shape)
df_submission_science.head()

(24961, 3)


| | subreddit | title | created_utc |
|---|---|---|---|
| 0 | science | New study explores why people drop out or don'... | 1664601279 |
| 1 | science | Dogs can discriminate between human baseline a... | 1664600103 |
| 2 | science | A new look at an extremely rare female infant ... | 1664595769 |
| 3 | science | Concussions are associated with 60% increase i... | 1664579006 |
| 4 | science | Our cities are warming and urban greenery coul... | 1664578793 |


In [31]:
# check the earliest and latest submission times (UTC)
print(datetime.utcfromtimestamp(df_submission_science['created_utc'].min()).strftime('%Y-%m-%d %H:%M:%S'))
print(datetime.utcfromtimestamp(df_submission_science['created_utc'].max()).strftime('%Y-%m-%d %H:%M:%S'))

2021-11-13 00:01:30
2022-10-01 05:14:39


With the function above, which requests up to 100 posts at a time (250 requests per subreddit), we collected close to 25,000 of the most recent submissions from each subreddit: 24,988 from `r/Philosophy` and 24,961 from `r/Science`. The scrape started on Saturday, October 1, 2022 at 1:19:41 PM (SGT).
The earliest submission collected dates from May 20, 2021 for `r/Philosophy` and November 13, 2021 for `r/Science`. Thus, the scraper had to reach further back in time for `r/Philosophy` to gather a comparable number of submissions.
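
The scrape start time quoted above can be checked against the epoch value hard-coded in `get_submission`. This is a minimal sketch using only the standard library; the helper name `format_epoch` and the `SGT` constant are illustrative and not part of the notebook.

```python
from datetime import datetime, timedelta, timezone

# Singapore Time is a fixed UTC+8 offset (no daylight saving).
SGT = timezone(timedelta(hours=8), name="SGT")


def format_epoch(ts: int, tz: timezone = timezone.utc) -> str:
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in the given timezone."""
    return datetime.fromtimestamp(ts, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


scrape_start = 1664601581  # value of `current_time` in get_submission
print(format_epoch(scrape_start))       # 2022-10-01 05:19:41 (UTC)
print(format_epoch(scrape_start, SGT))  # 2022-10-01 13:19:41 (SGT)
```

Timezone-aware conversion makes the UTC and SGT readings explicit, which helps when comparing the UTC timestamps printed in the cells above with the local time of the scrape.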

We focused only on submissions, rather than comments, because both subreddits encourage individuals to start a topic of discussion, so submissions carry a plethora of textual knowledge. Most questions are found in the comments or in other subreddits dedicated to them (for example, `r/askscience` lets people post questions as submissions for others to answer in the comments). Since we are focusing on `r/Science` and `r/Philosophy`, submissions are the better content to scrape for this project.

## Data Export

In [7]:
df_submission_science.to_csv('../datasets/science_25k.csv', index = False)

In [None]:
df_submission_philo.to_csv('../datasets/philo_25k.csv', index = False)