<a name="top"></a>
# Predicting Subreddits through Comments

## Data Acquisition through Web Scraping
---
* [Problem Statement](#problemstatement)
* [Executive Summary](#executivesummary)
* [Data Dictionary](#dictionary)
* [Importing Libraries](#importinglibraries)
* [Data Acquisition](#DataAcquisition)
* [Data Cleaning](#DataCleaning)
  * [Saving CSV](#SavingCSV)
  

### Problem Statement: Predicting Subreddits through Comments

### Executive Summary

Reddit is an American social news aggregation, web content rating, and discussion platform. Registered members submit content such as links, text posts, and images, which other members then vote up or down. Posts are organized by subject into user-created boards known as "subreddits". Each subreddit is an online community dedicated to a particular topic and is denoted by /r/ followed by its name, e.g., /r/gaming.

In this project, I focused on two subreddits, [r/singapore](https://www.reddit.com/r/singapore) and [r/unitedkingdom](https://www.reddit.com/r/unitedkingdom). The aim was to see whether the language used in each subreddit is distinctive enough to tell comments from Singapore and the United Kingdom apart. To tackle this problem, classification models such as Naive Bayes and Logistic Regression were applied.

### Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|body|object|sg/uk subreddits|text of the comment|
|created_utc|int|sg/uk subreddits|creation time of the comment as a Unix epoch timestamp (seconds)|
|id|object|sg/uk subreddits|unique id of the comment|
|parent_id|object|sg/uk subreddits|id of the parent: a `t1_` prefix marks a parent comment, a `t3_` prefix marks the submission itself (i.e. a top-level comment)|
|score|int|sg/uk subreddits|net score of the comment (upvotes minus downvotes)|
|subreddit|object|sg/uk subreddits|subreddit the comment was posted in, r/singapore or r/unitedkingdom (target variable)|
|word_length|int|sg/uk subreddits|number of words in the comment|


### Importing Libraries

In [61]:
import requests
import pandas as pd
import time
import random
import numpy as np
import re

# show all columns and rows when displaying data frames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Data Acquisition

The Pushshift API is an alternative to the official Reddit API for retrieving comments.

- The `before` parameter lets us specify the point in time before which comments are pulled.
- Data is returned as a list of dictionaries, which converts easily to a pandas data frame.
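
The query URLs used in this notebook can also be assembled from their parts rather than typed by hand. The following is a minimal sketch using only the standard library; `build_comment_url` is a hypothetical helper name, the endpoint and the `subreddit`, `before`, and `size` parameters are taken from the URLs in this notebook, and the `sort` parameter is left out for brevity.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URLs used in this notebook.
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"

def build_comment_url(subreddit, before=None, size=1000):
    """Return a comment-search URL for one subreddit (hypothetical helper)."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # restrict results to comments created before this epoch timestamp
        params["before"] = before
    return BASE_URL + "?" + urlencode(params)

print(build_comment_url("singapore", before=1589353672))
```

Building the URL this way keeps the cut-off time in one place, so the same value can be reused for both subreddits.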

In [5]:
# the url below requests the 1000 most recent comments from the subreddit r/singapore

url = "https://api.pushshift.io/reddit/search/comment/?subreddit=singapore&sort=desc&size=1000"

In [6]:
# a status code of 200 indicates that the api has been accessed successfully
headers = {'User-agent': 'shandeep'}
res = requests.get(url, headers=headers)
res.status_code

200

In [7]:
json = res.json()
comments = pd.DataFrame(json['data'])

In [8]:
comments.columns

Index(['all_awardings', 'associated_award', 'author',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
       'body', 'collapsed_because_crowd_control', 'created_utc', 'gildings',
       'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id',
       'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied',
       'subreddit', 'subreddit_id', 'total_awards_received', 'treatment_tags',
       'edited', 'distinguished'],
      dtype='object')

There are many features to choose from, but the data is restricted to a small subset of them.

Features of interest:

- **body**: text of the comments
- **created_utc**: timestamp of the comments
- **parent_id**: id of the parent; a `t1_` prefix marks a parent comment, a `t3_` prefix marks the submission itself (i.e. a top-level comment)
- **score**: the number of upvotes
- **subreddit**: the subreddit of interest (target variable) where the comment was located

In [19]:
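# created_utc is in epoch time format; the time at collection was 1589353672.
# a t1_ prefix in parent_id means the parent is another comment;
# a t3_ prefix means the parent is the submission itself (a top-level comment).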
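
The type prefix in `parent_id` can be read programmatically. The sketch below assumes Reddit's usual "fullname" convention, in which `t1` denotes a comment and `t3` a submission; `parent_kind` and `is_top_level` are hypothetical helper names and are not part of the original notebook.

```python
# Reddit "fullname" prefixes that can appear in a comment's parent_id.
KIND_BY_PREFIX = {"t1": "comment", "t3": "submission"}

def parent_kind(parent_id):
    """Classify a parent_id such as 't1_fqg82jh' by its type prefix."""
    prefix, _, _ = parent_id.partition("_")
    return KIND_BY_PREFIX.get(prefix, "unknown")

def is_top_level(parent_id):
    """A comment is top-level when its parent is the submission itself."""
    return parent_kind(parent_id) == "submission"

# two parent ids taken from the sample output in this notebook
for pid in ["t1_fqg82jh", "t3_gipq7q"]:
    print(pid, parent_kind(pid), is_top_level(pid))
```

With pandas, the same check could be applied to a whole column, for example with `comments['parent_id'].map(is_top_level)`.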

In [16]:
# removes all the other features apart from the features of interest
comments = comments[['body','created_utc','parent_id','score','subreddit']]

In [18]:
comments

| |body|created_utc|parent_id|score|subreddit|
|---|---|---|---|---|---|
|0|I checked neighbours too! Lololol|1589344033|t1_fqg82jh|1|singapore|
|1|It’s correct :(|1589344021|t1_fqg82jh|1|singapore|
|2|It's straight up an offence to cuss out a publ...|1589343998|t1_fqgcdh7|9|singapore|
|3|[removed]|1589343973|t3_gipq7q|1|singapore|
|4|Learn to respect all NSFs first lah. Level 1 c...|1589343955|t3_giqsdp|15|singapore|
|5|There are many, UOB one, OCBC 360, DBS Multipl...|1589343937|t1_fqgd8s8|3|singapore|
|6|So a fox hole? Ours were more vertical (US Arm...|1589343910|t1_fqgduy3|3|singapore|
|7|Why never claim mental illness? Maybe won't ge...|1589343888|t3_gir364|3|singapore|
|8|Let us know if you manage to pick anyone up wi...|1589343888|t1_fqg86jh|5|singapore|
|9|I had friends who joined MOE. They said they f...|1589343884|t1_fqg4dnb|0|singapore|


#### /r/Singapore

In [20]:
# repeats the steps from above
# creates a dataframe with the 1000 most recent comments before a fixed cut-off time
# the cut-off is given in epoch time format (1589353672) and keeps the collection
# window consistent between the /r/singapore and /r/unitedkingdom subreddits

url = "https://api.pushshift.io/reddit/search/comment/?subreddit=singapore&before=1589353672&sort=desc&size=1000"
headers = {'User-agent': 'shandeep'}
res = requests.get(url, headers=headers)
json = res.json()
singapore = pd.DataFrame(json['data'])
singapore = singapore[['body','created_utc','id','parent_id','score','subreddit']]
# gets rid of mod-removed comments
singapore = singapore[singapore['body']!='[removed]']

In [21]:
len(singapore)

983

In [38]:
singapore

| |body|created_utc|id|parent_id|score|subreddit|
|---|---|---|---|---|---|---|
|0|Thanks for the advice! Ill try to open it as s...|1589349005|fqgkuus|t1_fqgimf5|1|singapore|
|1|Look at who you are protecting. Almost Half t...|1589348991|fqgku75|t1_fqg20kg|1|singapore|
|2|Sometimes it will only feel wet or damp when t...|1589348893|fqgkpqx|t1_fqgk00m|2|singapore|
|3|21M, sounds like you’re in uni. How are you do...|1589348878|fqgkp1x|t3_giq3ox|1|singapore|
|4|Nick and Amy's r/s so bloody toxic sia|1589348871|fqgkoqt|t1_fqggm9a|2|singapore|
|5|no probs! Okay so my method is using a disposa...|1589348846|fqgkno5|t1_fqgk18a|2|singapore|
|6|Finally some one with some judgement. This is ...|1589348831|fqgkn06|t1_fqgjuvi|1|singapore|
|7|Well, statistics doesn't care whether cases ar...|1589348807|fqgklw9|t1_fqgi58g|3|singapore|
|8|The community is more than HDBs though, so I d...|1589348793|fqgklb7|t1_fqgawjf|3|singapore|
|9|Yeahhhh too sweeping LOL\n\n[https://www.reddi...|1589348781|fqgkkt4|t1_fqejl7t|1|singapore|


#### /r/UnitedKingdom

In [25]:
url = "https://api.pushshift.io/reddit/search/comment/?subreddit=unitedkingdom&before=1589353672&sort=desc&size=1000"
headers = {'User-agent': 'shandeep'}
res = requests.get(url, headers=headers)
json = res.json()
uk = pd.DataFrame(json['data'])
uk = uk[['body','created_utc','id','parent_id','score','subreddit']]
# gets rid of mod-removed comments
uk = uk[uk['body']!='[removed]']

In [26]:
len(uk)

981
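
If more comments were needed than a single request returns, the `before` parameter could be moved back step by step. The sketch below illustrates that idea under stated assumptions: `collect_comments` and `fetch_page` are hypothetical names, `fetch_page(before)` stands for any function that returns a list of comment dictionaries older than `before`, and the API is replaced by an offline stand-in so the example runs without network access.

```python
import time

def collect_comments(fetch_page, before, max_pages=3, pause=0.0):
    """Collect comments page by page, walking backwards in time.

    fetch_page(before) is a caller-supplied function (hypothetical) that
    returns a list of comment dicts created before the given epoch time.
    """
    collected = []
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # the next request starts before the oldest comment seen so far
        before = min(c["created_utc"] for c in page)
        time.sleep(pause)  # optional pause between requests
    return collected

# offline stand-in for the API: ten fake comments with timestamps 10..1
fake_comments = [{"id": f"c{t}", "created_utc": t} for t in range(10, 0, -1)]

def fake_fetch(before, page_size=4):
    return [c for c in fake_comments if c["created_utc"] < before][:page_size]

print(len(collect_comments(fake_fetch, before=11, max_pages=5)))
```

In real use, `fetch_page` would wrap the `requests.get(...).json()['data']` call shown above. Comments that share the oldest timestamp of a page could be missed by this simple stepping, so it is only a starting point.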

In [69]:
# timestamps are in epoch format; the readable dates below were obtained from
# https://www.epochconverter.com/ and are in Singapore time (UTC+8)
print(uk.created_utc.max())
print(uk.created_utc.min())


print('Data collected from Wednesday, 13 May 2020 04:39:01 to Wednesday, 13 May 2020 14:00:18 (UTC+8)')

1589349618
1589315941
Data collected from Wednesday, 13 May 2020 04:39:01 to Wednesday, 13 May 2020 14:00:18 (UTC+8)
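
The epoch timestamps can also be converted in Python instead of on an external website. This is a minimal sketch using the standard library; `epoch_to_text` is a hypothetical helper, and the fixed UTC+8 offset is an assumption chosen because it reproduces the date range printed above (the notebook itself does not state a time zone).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset (Singapore time); an assumption that matches the
# printed date range, since the notebook does not state the time zone.
SGT = timezone(timedelta(hours=8), name="SGT")

def epoch_to_text(epoch_seconds, tz=timezone.utc):
    """Format an epoch timestamp as a readable date in the given time zone."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=tz)
    return moment.strftime("%A, %d %B %Y %H:%M:%S")

# oldest and newest created_utc values from the output above
for ts in (1589315941, 1589349618):
    print(epoch_to_text(ts), "UTC |", epoch_to_text(ts, SGT), "SGT")
```

Passing the time zone explicitly avoids depending on the local settings of the machine that runs the notebook.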


In [39]:
uk

| |body|created_utc|id|parent_id|score|subreddit|
|---|---|---|---|---|---|---|
|0|Ah, so blue energy. That's generating energy f...|1589349618|fqglm41|t1_fqgglcu|4|unitedkingdom|
|3|Kids could be taught in groups of friends, but...|1589349532|fqglicd|t1_fqfi4tk|1|unitedkingdom|
|4|[deleted]|1589349514|fqglhk4|t1_fqg3du5|0|unitedkingdom|
|5|Hum, wind turbines store energy as inertia as ...|1589349489|fqglggz|t1_fqfmxyf|4|unitedkingdom|
|6|Just so you know I’m my hospital we have been ...|1589349475|fqglfvu|t1_fqfklsg|6|unitedkingdom|
|7|Is your plan to announce something without any...|1589349433|fqgle1w|t1_fqgh5r3|8|unitedkingdom|
|8|&gt; Are you a dog, because I think you need t...|1589349397|fqglcej|t1_fq9i2nm|1|unitedkingdom|
|9|The point was that governments can be changed ...|1589349273|fqgl6y7|t1_fqfzvxv|1|unitedkingdom|
|10|What you actually mean is "how can I say this ...|1589349258|fqgl696|t1_fqg8wxf|5|unitedkingdom|
|11|Isn't it usually blackmail? they scam money fr...|1589349248|fqgl5t6|t3_gisrwb|2|unitedkingdom|


### Data Cleaning

Before any form of modelling, it is important to check for missing values and duplicated rows in the data set and to rectify them.
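
Before dropping anything, it can help to count how many rows are affected. The sketch below shows one way to do that with pandas on a small made-up frame; the `toy` data is purely illustrative and is not taken from the scraped comments.

```python
import pandas as pd

# Toy frame standing in for the scraped comments (values are made up).
toy = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3"],
    "body": ["first comment", "second comment", "second comment", None],
})

# count missing values per column before dropping anything
print(toy.isna().sum())

# count rows whose 'id' repeats an earlier row
print(toy.duplicated(subset="id").sum())

# drop rows with missing values, then rows with a repeated 'id'
cleaned = toy.dropna().drop_duplicates(subset="id")
print(len(toy), "->", len(cleaned))
```

Comparing the row counts before and after makes it easy to confirm whether the cleaning step changed the data at all.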

In [32]:
# remove rows with missing values
# (reassigning instead of inplace=True avoids modifying a filtered slice in place)
singapore = singapore.dropna()
uk = uk.dropna()

# remove comments that share the same 'id'
singapore = singapore.drop_duplicates('id')
uk = uk.drop_duplicates('id')

In [37]:
print(len(singapore))
print(len(uk))

983
981


In [48]:
# the body column contains '[removed]'/'[deleted]' placeholders, which need to be dropped
# (the comparison is an exact string match, so the brackets must not be escaped)
placeholders = ['[deleted]', '[removed]']
singapore = singapore[~singapore['body'].isin(placeholders)]
uk = uk[~uk['body'].isin(placeholders)]

In [54]:
# check out shapes of data frames
# data seems to be evenly balanced
print(singapore.shape)
print(uk.shape)

(944, 6)
(956, 6)


In [58]:
comments = pd.concat([singapore,uk])
comments = comments.reset_index(drop=True)
comments

| |body|created_utc|id|parent_id|score|subreddit|
|---|---|---|---|---|---|---|
|0|Thanks for the advice! Ill try to open it as s...|1589349005|fqgkuus|t1_fqgimf5|1|singapore|
|1|Sometimes it will only feel wet or damp when t...|1589348893|fqgkpqx|t1_fqgk00m|2|singapore|
|2|21M, sounds like you’re in uni. How are you do...|1589348878|fqgkp1x|t3_giq3ox|1|singapore|
|3|Nick and Amy's r/s so bloody toxic sia|1589348871|fqgkoqt|t1_fqggm9a|2|singapore|
|4|no probs! Okay so my method is using a disposa...|1589348846|fqgkno5|t1_fqgk18a|2|singapore|
|5|Finally some one with some judgement. This is ...|1589348831|fqgkn06|t1_fqgjuvi|1|singapore|
|6|Well, statistics doesn't care whether cases ar...|1589348807|fqgklw9|t1_fqgi58g|3|singapore|
|7|The community is more than HDBs though, so I d...|1589348793|fqgklb7|t1_fqgawjf|3|singapore|
|8|r/AngryUpvote|1589348705|fqgkhe8|t1_fqgjn6g|8|singapore|
|9|What limited qualifications? She's quite good ...|1589348678|fqgkg3w|t1_fqgfcwe|4|singapore|


In [74]:
# function that selects a combination of \r's and \n's in the comments and replaces them with
# single spaces
def replace_linebreaks_w_space(x):
    return re.sub('([\r\n]+)',' ',x) 

# people might also enter space twice accidentally, convert two spaces to a single space
def replace_multispace_w_space(x):
    return re.sub('([ ]{2,})',' ',x)

# map functions
comments['body'] = comments['body'].map(replace_linebreaks_w_space)
comments['body'] = comments['body'].map(replace_multispace_w_space)

In [76]:
# strip spaces at the beginning and end of every comment, split it into a list of words,
# and return the length of that list (i.e. the number of words in the comment)
comments['word_length'] = comments['body'].map(lambda x: len(x.strip().split(' ')))
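
The effect of the whitespace clean-up and the word count can be checked on a single string. The following standalone sketch mirrors the two regular expressions and the `split(' ')` logic used above; `normalize_whitespace` and `word_count` are hypothetical names, and the sample text is made up.

```python
import re

def normalize_whitespace(text):
    """Replace line breaks with spaces, then collapse repeated spaces."""
    text = re.sub(r"[\r\n]+", " ", text)
    return re.sub(r" {2,}", " ", text)

def word_count(text):
    """Count space-separated tokens after trimming both ends."""
    return len(text.strip().split(" "))

sample = "Yeahhhh too  sweeping LOL\n\nsee the link "
clean = normalize_whitespace(sample)
print(repr(clean), word_count(clean))
```

One detail worth knowing: `''.split(' ')` returns `['']`, so an empty comment is counted as one word, whereas `''.split()` returns an empty list. This does not matter here because comments with fewer than five words are filtered out anyway.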

In [86]:
# keep only comments with at least 5 words, as shorter comments might not have predictive value
comments = comments[comments['word_length']>=5]
comments

| |body|created_utc|id|parent_id|score|subreddit|word_length|
|---|---|---|---|---|---|---|---|
|0|Thanks for the advice! Ill try to open it as s...|1589349005|fqgkuus|t1_fqgimf5|1|singapore|14|
|1|Sometimes it will only feel wet or damp when t...|1589348893|fqgkpqx|t1_fqgk00m|2|singapore|75|
|2|21M, sounds like you’re in uni. How are you do...|1589348878|fqgkp1x|t3_giq3ox|1|singapore|48|
|3|Nick and Amy's r/s so bloody toxic sia|1589348871|fqgkoqt|t1_fqggm9a|2|singapore|8|
|4|no probs! Okay so my method is using a disposa...|1589348846|fqgkno5|t1_fqgk18a|2|singapore|104|
|5|Finally some one with some judgement. This is ...|1589348831|fqgkn06|t1_fqgjuvi|1|singapore|11|
|6|Well, statistics doesn't care whether cases ar...|1589348807|fqgklw9|t1_fqgi58g|3|singapore|15|
|7|The community is more than HDBs though, so I d...|1589348793|fqgklb7|t1_fqgawjf|3|singapore|31|
|9|What limited qualifications? She's quite good ...|1589348678|fqgkg3w|t1_fqgfcwe|4|singapore|97|
|10|The answer is no. I donate whatever I want aft...|1589348650|fqgkev6|t1_fqgica1|0|singapore|56|


In [70]:
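# note: the counts shown below match the shapes before the word-length filter,
# since this cell (In [70]) was executed before the filter cell (In [86])
comments['subreddit'].value_counts()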

unitedkingdom    956
singapore        944
Name: subreddit, dtype: int64

In [88]:
# repeat the steps for the individual subreddit data frames in case they are needed later
singapore['body'] = singapore['body'].map(replace_linebreaks_w_space)
singapore['body'] = singapore['body'].map(replace_multispace_w_space)
uk['body'] = uk['body'].map(replace_linebreaks_w_space)
uk['body'] = uk['body'].map(replace_multispace_w_space)


singapore['word_length'] = singapore['body'].map(lambda x: len(x.strip().split(' ')))
uk['word_length'] = uk['body'].map(lambda x: len(x.strip().split(' ')))

singapore = singapore[singapore['word_length']>=5]
uk = uk[uk['word_length']>=5]

# reset indexes
singapore = singapore.reset_index(drop=True)
uk = uk.reset_index(drop=True)

### Saving CSV

In [87]:
# saving data frame to csv
comments.to_csv('../data/comments.csv', index = False)
singapore.to_csv('../data/singapore.csv', index = False)
uk.to_csv('../data/uk.csv', index=False)