# Project 3: Subreddit Classification
---
Project notebook organisation:<br>
**1 - Webscraping and Data Acquisition** (current notebook)<br>
[2 - Exploratory Data Analysis and Preprocessing](./2_exploratory_data_analysis_and_preprocessing.ipynb)<br>
[3 - Model Tuning and Insights](./3_model_tuning_and_insights.ipynb)<br>
<br>
<br>

In [3]:
import time, warnings
import pandas as pd
import numpy as np

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

%matplotlib inline

## Introduction & problem statement
---

<img src='../graphics/[source_wikipedia]reddit_logo.png' style='float:left; margin:15px; height: 40px'>

Reddit is a social news, content, and discussions website. Posts are organised according to subject into user-created 'subreddits', which cover practically <a href="https://www.reddit.com/r/BreadStapledToTrees/">any</a> <a href="https://www.reddit.com/r/birdswitharms/">topic</a> <a href="https://www.reddit.com/r/CatsStandingUp/">imaginable</a>. Members submit content (such as images, texts, and links) to subreddits, which can then be voted up ('upvote') or down ('downvote') by other members.

In this project, I examined posts from two subreddits - [**r/Singapore**](https://www.reddit.com/r/singapore/) (Fig 1) and [**r/Malaysia**](https://www.reddit.com/r/malaysia/) (Fig 2). Despite their shared heritage and history (Singapore was part of Malaysia until her separation and independence on 9 August 1965<sup>[[1]](http://eresources.nlb.gov.sg/history/events/dc1efe7a-8159-40b2-9244-cdb078755013)</sup>), Singapore has taken on a very different development path since independence. On the surface, the two countries seem very different today - different languages, different racial and religious compositions, etc. However, their citizens may have more in common than people from both countries usually like to admit. The goal of this project is therefore to try and figure out how similar Singaporeans and Malaysians are, by looking at what they talk about on their subreddits. 

To answer this question, a word-frequency based classification model will be developed to predict which subreddit a random post belongs to. To identify a production model, a variety of preliminary models will be tested and evaluated based on their accuracy scores (i.e. the proportion of predictions they get right).
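As a rough illustration of the two ideas in the paragraph above, the sketch below counts word frequencies in a post title and computes an accuracy score. It uses only the standard library; the function names, the example title, and the labels are made up for illustration and are not the models or data used later in the project.

```python
from collections import Counter


def word_counts(title):
    """Lower-case a title and count each whitespace-separated word."""
    return Counter(title.lower().split())


def accuracy(y_true, y_pred):
    """Proportion of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# hypothetical title and labels, for illustration only
counts = word_counts('Train delays again on the line again')
y_true = ['singapore', 'malaysia', 'singapore', 'malaysia']
y_pred = ['singapore', 'singapore', 'singapore', 'malaysia']
score = accuracy(y_true, y_pred)  # 3 of 4 predictions are correct
```

A word-frequency model sees each post as a bag of counts like `counts`, and the accuracy score is what the candidate models are compared on.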

<img src='../graphics/[source_reddit]rsingapore.png' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 1. The frontpage of r/Singapore as of 8pm, 21 October 2019.)</font></center>
<img src='../graphics/[source_reddit]rmalaysia.png' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 2. The frontpage of r/Malaysia as of 8pm, 21 October 2019.)</font></center>

While the goal of this project is to classify posts into subreddits, such classifier models have much wider applications, for example automatically sorting customer requests into different categories (to be forwarded to different departments), and recommending articles from similar categories to readers.

Due to the scale of this project, it is split into three sequential Jupyter notebooks: webscraping and data acquisition, EDA and feature engineering, and model tuning and insights. This is the webscraping and data acquisition notebook.

### Contents

1. [Data dictionary](#Data-dictionary)
2. [Webscraping](#Webscraping)

## Executive summary
---

Reddit is a social news, content, and discussions website. Posts are organised according to subject into user-created 'subreddits'. Members submit content (such as images, texts, and links) to subreddits, which can then be voted up ('upvote') or down ('downvote') by other members.

In this project, I examined posts from two subreddits - [**r/Singapore**](https://www.reddit.com/r/singapore/) and [**r/Malaysia**](https://www.reddit.com/r/malaysia/). Despite their shared heritage and history (Singapore was part of Malaysia until her separation and independence on 9 August 1965 <sup>[[1]](http://eresources.nlb.gov.sg/history/events/dc1efe7a-8159-40b2-9244-cdb078755013)</sup>), Singapore has taken on a very different development path since independence. On the surface, the two countries seem very different today - different languages, different racial and religious composition, different currencies, etc. However, they may have more in common than people from both countries usually like to admit. The goal of this project is therefore to use posts from their respective subreddits to answer the question - how different is Singapore from Malaysia? Specifically, would one be able to tell apart subreddit posts from r/Singapore and r/Malaysia? 

To answer this question, I developed a word-frequency based classification model to predict the subreddit that a random post belongs to. A variety of preliminary models were tested and evaluated based on prediction accuracy, i.e. the proportion of posts they were able to classify correctly. The final production model was a multinomial naive Bayes classifier that makes predictions based on title content and post length, with an accuracy of 71%. This suggests that posts in r/Malaysia and r/Singapore are fairly different, but still have a fair amount in common. The differences may mainly be due to differences in current affairs between Singapore and Malaysia: it is not surprising that the two subreddits are somewhat distinguishable, as current-affairs topics differ from country to country. The similarities behind the model's misclassifications may stem from more generic, day-to-day topics, such as people asking for help or life advice, which are likely to be similar in both countries.

To further improve model accuracy, a bigger corpus that incorporates a larger vocabulary on current affairs in Singapore and Malaysia is needed. As the news is constantly changing, new words are also constantly emerging in these subreddits. Therefore, it would not be enough to train the model only on past subreddit posts. A more useful corpus for model training would be English-language news sites that report on both Singapore and Malaysia, such as Channel News Asia.

Although the goal of this project is to classify subreddits, such a classification model can also be applied elsewhere, such as automating CRM tasks based on topic matching, recommending similar articles to readers, and the ever-useful spam email filtering.

## Data dictionary
---

|Feature|Type|Dataset|Description|
|---|---|---|---|
|title        |str      |sg_posts/ms_posts|title of each reddit post
|id           |str      |sg_posts/ms_posts|id of each reddit post
|date_created |datetime |sg_posts/ms_posts|date and time the post is created
|text         |str      |sg_posts/ms_posts|body text of each reddit post
|distinguished|str      |sg_posts/ms_posts|whether the post is created by a moderator of the subreddit
|score        |int      |sg_posts/ms_posts|number of upvotes a post has
|upvote_ratio |float    |sg_posts/ms_posts|number of upvotes a post has, divided by the total number of votes the post received
|post_id                 |str|sg_comments/ms_comments|id of the parent post of a comment
|comment_text            |str|sg_comments/ms_comments|body text of each comment (top-level comments and replies)
|comment_distinguished   |str|sg_comments/ms_comments|whether the comment is made by a moderator of the subreddit
|comment_score           |int|sg_comments/ms_comments|number of upvotes a comment has

## Webscraping
---

The [Reddit API](https://www.reddit.com/dev/api/) allows one to remotely interact with Reddit, including downloading posts from subreddits (with a cap of 1000 posts due to the [way posts are stored](https://www.reddit.com/r/redditdev/comments/30a7ap/does_reddit_api_limit_total_listings_returned_to/)). The API can be queried directly by appending `.json` to the end of the URL. However, this method requires a custom `User-agent` header and a `time.sleep()` call after scraping each page of data, so that the requests are not flagged as coming from an automated Python script. After attempting this method ([method 1](#Method-1-default-Reddit-API) below), I also realised that getting more information than the basic post title, text, and author (such as number of upvotes and number of comments) requires querying further than the basic `.json` endpoint.
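To make the paging mechanism concrete, here is a minimal sketch of how the URL for each page of the `new` listing could be assembled. `build_listing_url` is a hypothetical helper, not part of any library, and the `after` value is just an example post tag in the `t3_...` form that Reddit returns with each page.

```python
from urllib.parse import urlencode


def build_listing_url(subreddit, after=None):
    """Build the JSON URL for one page of a subreddit's 'new' listing.

    `after` is the tag returned with the previous page; leave it as None
    to request the first page.
    """
    base = f'https://www.reddit.com/r/{subreddit}/new.json'
    if after is None:
        return base
    return f'{base}?{urlencode({"after": after})}'


first_page = build_listing_url('Singapore')
next_page = build_listing_url('Singapore', after='t3_dlydmn')  # example tag
```

Each response carries the `after` tag for the following page, so the scraper only needs to feed that value back into the next request.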

A workaround is to use the [Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/v3.6.0/), which has the APIs built into a Python library for easy interactions. This is the approach I ended up using, as it was able to easily pull information on upvotes and comments, with the downside being that it took much longer (see [method 2](#Method-2-PRAW)). Using PRAW, I collected the following from each subreddit:

- post title
- post text (body)
- post ID
- distinguished posts (i.e. whether or not it is a moderator post)
- post score (i.e. number of upvotes)
- post upvote ratio (i.e. number of upvotes divided by the total number of votes)
- post date
- all comments on each post (top-level comments and replies, since the scraping function flattens the comment tree) and their respective:
    - comment text
    - distinguished comments
    - comment scores
    - parent post ID
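Reddit comments form a tree: top-level comments reply to the post, and each of them can carry its own replies. The toy sketch below shows the difference between keeping only the top level and flattening the whole tree. It uses plain dictionaries with made-up `body` and `replies` keys (a hypothetical structure, not PRAW objects).

```python
from collections import deque


def top_level_bodies(comments):
    """Return the text of the top-level comments only."""
    return [c['body'] for c in comments]


def flattened_bodies(comments):
    """Return the text of every comment in the tree, breadth-first."""
    queue = deque(comments)
    bodies = []
    while queue:
        comment = queue.popleft()
        bodies.append(comment['body'])
        # queue up any replies so they are visited after the current level
        queue.extend(comment.get('replies', []))
    return bodies


# hypothetical comment tree for one post
tree = [
    {'body': 'a', 'replies': [{'body': 'a1', 'replies': [{'body': 'a1x'}]}]},
    {'body': 'b'},
]
top = top_level_bodies(tree)
everything = flattened_bodies(tree)
```

Here `top` holds two comments while `everything` holds four, which gives a sense of how much larger a flattened comment set can be.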

The goal at this point is to gather as much data related to each reddit post as computationally possible. Even though the goal of this project is to classify reddit _posts_, I also wanted to see what the comments are like, and whether including comments in the training data would improve the model predictions.

The data from this section will be explored in the [next notebook](./2_exploratory_data_analysis_and_preprocessing.ipynb).

### Training data

#### Method 1: default Reddit API

In [4]:
import requests, time

posts = []
after = None

for _ in range(40):
    url = 'https://www.reddit.com/r/Singapore/new.json' # download posts sorted by new
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after

    # send request to url
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    # check for errors
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    # get posts and add to [posts]
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    
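    # get tag of last post on the page
    after = current_dict['data']['after']

    # stop when there are no more pages; otherwise the next request would start again from page 1
    if after is None:
        break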
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = np.random.randint(2,6)
    time.sleep(sleep_duration)

df = pd.DataFrame(posts)
# df.to_csv('../data/singapore.csv', index = False)

# check whether all posts are added to df
assert df.shape[0] == len(posts)

# print number of posts saved
print(f'a total of {len(posts)} posts were downloaded.')

df.head()

a total of 982 posts were downloaded.


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,hide_score,name,quarantine,link_flair_text_color,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,thumbnail,edited,author_flair_css_class,steward_reports,author_flair_richtext,gildings,post_hint,content_categories,is_self,mod_note,created,link_flair_type,wls,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,preview,all_awardings,awarders,media_only,link_flair_template_id,can_gild,spoiler,locked,author_flair_text,visited,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,media_metadata
0,,singapore,,t2_jel5b,False,,0,False,End Of The Road For These Beauty World Centre ...,[],r/singapore,False,6,discussion,0,105.0,True,t3_dlzqg7,False,dark,,public,1,0,"{'content': '&lt;iframe width=""600"" height=""33...",140.0,,False,[],"{'type': 'youtube.com', 'oembed': {'provider_u...",False,False,,"{'content': '&lt;iframe width=""600"" height=""33...",Discussion,False,1,,https://b.thumbs.redditmedia.com/bhm8W9oHFfH37...,False,,[],[],{},rich:video,,False,,1571868000.0,text,6,,text,youtube.com,False,,,,,,False,True,False,False,False,{'images': [{'source': {'url': 'https://extern...,[],[],False,cc19a6ee-3023-11e4-bc5b-12313b0b2072,False,False,False,,False,,,t5_2qh8c,,,,dlzqg7,True,,oklos,,1,True,all_ads,False,[],False,,/r/singapore/comments/dlzqg7/end_of_the_road_f...,all_ads,False,https://www.youtube.com/watch?v=p11CESMRLhA,194584,1571839000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,,,
1,,singapore,,t2_3d9ac16q,False,,0,False,Read the last paragraph - this feels really el...,[],r/singapore,False,6,,0,140.0,True,t3_dlzny6,False,dark,,public,4,0,{},140.0,,False,[],,True,False,,{},,False,4,,https://b.thumbs.redditmedia.com/sYy-dBkVwyL0H...,False,,[],[],{},image,,False,,1571868000.0,text,6,,text,i.redd.it,False,,,,,,False,False,False,False,False,{'images': [{'source': {'url': 'https://previe...,[],[],False,,False,False,False,,False,,,t5_2qh8c,,,,dlzny6,True,,YanniCui,,1,True,all_ads,False,[],False,,/r/singapore/comments/dlzny6/read_the_last_par...,all_ads,False,https://i.redd.it/lsc36dgoqau31.png,194584,1571839000.0,0,,False,,,
2,,singapore,,t2_8bt8o,False,,0,False,"OK Google, why you do this?",[],r/singapore,False,6,,0,140.0,True,t3_dlyyf8,False,dark,,public,8,0,{},140.0,,False,[],,True,False,,{},,False,8,,https://b.thumbs.redditmedia.com/zqDGEzI_kkVlL...,False,,[],[],{},image,,False,,1571864000.0,text,6,,text,i.redd.it,False,,,,,,False,False,False,False,False,{'images': [{'source': {'url': 'https://previe...,[],[],False,,False,False,False,,False,,,t5_2qh8c,,,,dlyyf8,True,,v1war,,6,True,all_ads,False,[],False,,/r/singapore/comments/dlyyf8/ok_google_why_you...,all_ads,False,https://i.redd.it/kmeqiawogau31.jpg,194584,1571836000.0,0,,False,,,
3,,singapore,1. Charging bills to credit card(Paying PROMPT...,t2_4ndm6068,False,,0,False,Ways to save money and get cashback in Singapo...,[],r/singapore,False,6,discussion,0,,True,t3_dlyt36,False,dark,,public,0,0,{},,,False,[],,False,False,,{},Discussion,False,0,,self,False,,[],[],{},self,,True,,1571864000.0,text,6,,text,self.singapore,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,{'images': [{'source': {'url': 'https://extern...,[],[],False,cc19a6ee-3023-11e4-bc5b-12313b0b2072,False,False,False,,False,,,t5_2qh8c,,,,dlyt36,True,,undercovermascot,,3,True,all_ads,False,[],False,,/r/singapore/comments/dlyt36/ways_to_save_mone...,all_ads,False,https://www.reddit.com/r/singapore/comments/dl...,194584,1571835000.0,0,,False,,,
4,,singapore,,t2_1nfynjvc,False,,0,False,"Sorry, sushi restaurant broke (313 somerset) D...",[],r/singapore,False,6,,0,140.0,False,t3_dlydmn,False,dark,,public,0,0,{},140.0,6a13139a-ff89-11e0-a912-12313b071981,False,[],{'reddit_video': {'fallback_url': 'https://v.r...,True,False,,{},,False,0,,https://b.thumbs.redditmedia.com/hznjO7qIalnxp...,False,inverted,[],[],{},hosted:video,,False,,1571861000.0,text,6,,text,v.redd.it,False,,,,,,False,True,False,False,False,{'images': [{'source': {'url': 'https://extern...,[],[],False,,False,False,False,sec sch student,False,,,t5_2qh8c,,,,dlydmn,True,,SmokyJosh,,4,True,all_ads,False,[],False,dark,/r/singapore/comments/dlydmn/sorry_sushi_resta...,all_ads,False,https://v.redd.it/tokb50iz7au31,194584,1571833000.0,0,{'reddit_video': {'fallback_url': 'https://v.r...,True,,,


#### Method 2: PRAW

In [2]:
import praw

# create a PRAW Reddit instance using OAuth credentials
reddit = praw.Reddit(client_id='j3IqMqkPYZ89Ew',
                     client_secret='8-OQJWRX_PkajrGk7yOPpfbMk2g',
                     user_agent='my user agent')

# define custom scraping function
def scrape_subreddit(subreddit, postlimit=1000):
    
    subreddit = reddit.subreddit(subreddit)

    post_title = []
    post_text = []
    post_id = []
    post_dist = []
    post_score = []
    post_upvoteratio = []
    post_date = []
    comment_text = []
    comment_dist = []
    comment_score = []
    comment_parentpost_id = []

    # collect from posts sorted by new
    for submission in subreddit.new(limit = postlimit):
        # collect information on post
        post_title.append(submission.title)
        post_text.append(submission.selftext)
        post_id.append(submission.id)
        post_dist.append(submission.distinguished)
        post_score.append(submission.score)
        post_upvoteratio.append(submission.upvote_ratio)
        post_date.append(submission.created_utc)

        # collect all comments on each post
        submission.comments.replace_more(limit = None)
        for comment in submission.comments.list():     
            comment_text.append(comment.body)
            comment_dist.append(comment.distinguished)
            comment_score.append(comment.score)
            comment_parentpost_id.append(submission.id)
 
    # put posts into a df
    df_post = pd.DataFrame({'title': post_title,
                            'id': post_id,
                            'date_created': post_date,
                            'text': post_text,
                            'distinguished': post_dist,
                            'score': post_score,
                            'upvote_ratio': post_upvoteratio})
    df_post['date_created'] = pd.to_datetime(df_post['date_created'], unit = 's')
    
    # put comments into a df
    df_comments = pd.DataFrame({'post_id': comment_parentpost_id,
                                'comment_text': comment_text,
                                'comment_distinguished': comment_dist,
                                'comment_score': comment_score})
    
    return df_post, df_comments
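The `created_utc` value returned for each post is a Unix timestamp (seconds since 1 January 1970, UTC), which is why the function above converts it with `unit = 's'`. As a small standard-library sketch of the same conversion, the snippet below uses the rounded `created_utc` value displayed for the first row of the method 1 output; the exact timestamp of that post may differ slightly because the displayed value is rounded.

```python
from datetime import datetime, timezone

# rounded created_utc value as displayed in the method 1 output (illustrative)
created_utc = 1571839000.0

# interpret the number as seconds since the Unix epoch, in UTC
created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(created.isoformat())  # 2019-10-23T13:56:40+00:00
```

Note that `pd.to_datetime(..., unit = 's')` produces timezone-naive timestamps, so the stored `date_created` values are in UTC rather than local Singapore or Malaysia time.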

In [13]:
%%time
# scrape from subreddits
sg_posts, sg_comments = scrape_subreddit('singapore')
ms_posts, ms_comments = scrape_subreddit('malaysia')

CPU times: user 33.1 s, sys: 3.77 s, total: 36.9 s
Wall time: 44min 32s


#### Export to csv

In [16]:
sg_posts.to_csv('../data/sg_posts.csv')
sg_comments.to_csv('../data/sg_comments.csv')
ms_posts.to_csv('../data/ms_posts.csv')
ms_comments.to_csv('../data/ms_comments.csv')

### Test data

The subreddits were scraped again 5 days after the previous webscrape (17 Oct - 22 Oct 2019). The new data will be used as test data.

In [7]:
%%time
# scrape from subreddits
sg_posts, sg_comments = scrape_subreddit('singapore')
ms_posts, ms_comments = scrape_subreddit('malaysia')

CPU times: user 29.7 s, sys: 1.99 s, total: 31.7 s
Wall time: 44min 22s


#### Export test data to csv

In [8]:
sg_posts.to_csv('../data/sg_posts_test.csv')
sg_comments.to_csv('../data/sg_comments_test.csv')
ms_posts.to_csv('../data/ms_posts_test.csv')
ms_comments.to_csv('../data/ms_comments_test.csv')
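Since both scrapes request up to the 1000 newest posts, some posts could in principle appear in both the training and the test files. A minimal sketch of how this might be checked using post ids is shown below. `split_overlap` is a hypothetical helper, and the ids are illustrative (three taken from the method 1 output, two invented); in practice the `id` columns of the training and test posts could be passed in instead.

```python
def split_overlap(train_ids, test_ids):
    """Split test ids into those already seen in training and those that are new."""
    train_set = set(train_ids)
    overlap = [i for i in test_ids if i in train_set]
    unseen = [i for i in test_ids if i not in train_set]
    return overlap, unseen


# illustrative ids only
train_ids = ['dlzqg7', 'dlzny6', 'dlyyf8']
test_ids = ['dlyyf8', 'hypo01', 'hypo02']

overlap, unseen = split_overlap(train_ids, test_ids)
print(f'{len(overlap)} test posts also appear in the training data.')
```

If the overlap turns out to be non-empty, keeping only the `unseen` ids in the test set would avoid evaluating the model on posts it was trained on.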