<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: NLP subreddit post classifier (datascrape)

## Problem Statement:

For this project, I want to train a Natural Language Processing (NLP) classifier that distinguishes Reddit posts from the [`Marvel`](https://www.reddit.com/r/Marvel/) and [`DCcomics`](https://www.reddit.com/r/DCcomics/) subreddits. This is a binary classification problem. <br><br>
First, I preview a post from each subreddit to shortlist which fields are common to both subreddits and which fields carry differentiating information.

### Contents:
* [Preview `Marvel`](#Preview-Marvel-subreddit-post:)
* [Preview `DCcomics`](#Preview-DCcomics-subreddit-post:)
* [Time format conversion](#Convert-human-readable-time-to-epoch-time:)
* [`Marvel` data](#Marvel-data:)<br>[data cleaning](#Check-for-null-values:)
* [`DCcomics` data](#DCcomics-data:)<br>[data cleaning](#Check-for-null-values:)
* [Combined data](#Combined-data)

### Datasets:
1. [`marvel.csv`](./data/marvel.csv)
1. [`DCcomics.csv`](./data/DCcomics.csv)

##### Guide from [Primer Video](https://www.youtube.com/watch?v=AcrjEWsMi_E)

In [1]:
import pandas as pd
import requests
import urllib
import time
import json
from datetime import datetime

import re
import numpy as np

##### Set the [PushShift API](https://github.com/pushshift/api) endpoint

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

##### Preview `Marvel` subreddit post: 

In [3]:
# set parameters for Pushshift API url
params = {
    'subreddit':'marvel',
    'size':10 # number of posts to pull; the API returns at most 100 per request
}

In [4]:
# get response content
res = requests.get(url, params)

In [5]:
# check if the url is working properly
res.status_code

200

In [6]:
# json decoding of data from url
data = res.json()

In [7]:
# extract the list of posts stored under the 'data' key of the response
posts = data['data']

In [8]:
# inspect one post to see which fields are useful for the project
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Commercial-Mix-2633',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_lw6ivc8b',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1658198244,
 'domain': 'i.redd.it',
 'full_link': 'https://www.reddit.com/r/Marvel/comments/w2hh67/what_would_your_lineup_for_the_midnight_sons_be/',
 'gildings': {},
 'id': 'w2hh67',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': True,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '#cc3600',
 'link_flair_css_class': 'comics',
 'link_flair_richtext': [{'e': 'text', 't': 'Comics '}],
 'link_flair_template_id': '4cd2464c-0af9-11e4-adf3-12313b

##### Preview `DCcomics` subreddit post: 

In [9]:
# define parameters for PushShift API
params = {
    'subreddit':'DCcomics',
    'size':10 # number of posts to pull; the API returns at most 100 per request
}

In [10]:
# get response
res = requests.get(url, params)

In [11]:
# check if the url is working properly
res.status_code

200

In [12]:
# json decoding of data from url
data = res.json()

In [13]:
# extract the list of posts stored under the 'data' key of the response
posts = data['data']

In [14]:
# inspect one post to see which fields are useful for the project
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'AbelofAurelia',
 'author_flair_background_color': '#59ac44',
 'author_flair_css_class': None,
 'author_flair_richtext': [{'e': 'text', 't': 'Harlivy '},
  {'a': ':HarleyQuinn:',
   'e': 'emoji',
   'u': 'https://emoji.redditmedia.com/2erwj7e9j9x21_t5_2qlmm/HarleyQuinn'}],
 'author_flair_template_id': 'e01b2748-a7e3-11e9-96bd-0e55bf9e0bee',
 'author_flair_text': 'Harlivy :HarleyQuinn:',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'richtext',
 'author_fullname': 't2_bgmm7scu',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1658199408,
 'domain': 'i.redd.it',
 'full_link': 'https://www.reddit.com/r/DCcomics/comments/w2hvin/reminder_this_is_not_the_real_batman_spoiler_dark/',
 'gildings': {},
 'id': 'w2hvin',
 'is_created_from_ads_ui': False,
 'is_crosspostable': False,
 'is_meta': False,
 'is
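
With one post from each subreddit in hand, the fields shared by both can be listed with a set intersection. The sketch below is only an illustration: `marvel_post` and `dc_post` are hypothetical, heavily trimmed stand-ins that copy a few of the keys shown in the previews above, and `compare_fields` is a helper name introduced here. A field missing from a trimmed sample may still exist in the full API response.

```python
# Hypothetical, trimmed stand-ins for posts[0] from each preview above
marvel_post = {
    'author': 'Commercial-Mix-2633',
    'author_flair_text': None,
    'created_utc': 1658198244,
    'id': 'w2hh67',
    'link_flair_css_class': 'comics',
}
dc_post = {
    'author': 'AbelofAurelia',
    'author_flair_background_color': '#59ac44',
    'author_flair_text': 'Harlivy :HarleyQuinn:',
    'created_utc': 1658199408,
    'id': 'w2hvin',
}


def compare_fields(post_a, post_b):
    """Return (fields in both, fields only in a, fields only in b), each sorted."""
    keys_a, keys_b = set(post_a), set(post_b)
    return sorted(keys_a & keys_b), sorted(keys_a - keys_b), sorted(keys_b - keys_a)


common, only_marvel, only_dc = compare_fields(marvel_post, dc_post)
print('common:', common)
print('only in Marvel sample:', only_marvel)
print('only in DCcomics sample:', only_dc)
```

In the notebook itself, the same helper could be called on the two real `posts[0]` dictionaries, provided each is saved under its own name before `posts` is overwritten.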

##### Convert human readable time to epoch time:

* [Reference](https://www.javatpoint.com/python-epoch-to-datetime) 

In [15]:
# parse the lower bound of the pull window (used as min_time)
obj_1 = datetime.strptime('17-4-2022 00:00:00', '%d-%m-%Y %H:%M:%S')
# fix the upper bound (used as max_time) so that the results are reproducible
obj_2 = datetime.strptime('17-7-2022 00:00:00', '%d-%m-%Y %H:%M:%S')
# timestamp() interprets a naive datetime in the machine's local timezone
epoch_time = obj_1.timestamp()
now_time = obj_2.timestamp()
print("epoch time:", epoch_time)
print('present time:', now_time)

epoch time: 1650124800.0
present time: 1657987200.0
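
`datetime.timestamp()` on a naive datetime uses the local timezone of the machine, so the values above depend on where the notebook is run. The printed values are consistent with a clock set to UTC+8. A minimal sketch that pins the offset explicitly, assuming UTC+8 was the intended timezone (the helper `to_epoch` and its `utc_offset_hours` parameter are illustrative names, not part of the notebook):

```python
from datetime import datetime, timedelta, timezone


def to_epoch(date_string, utc_offset_hours=8, fmt='%d-%m-%Y %H:%M:%S'):
    """Convert a date string to epoch seconds using an explicit UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))  # assumed offset: UTC+8
    return datetime.strptime(date_string, fmt).replace(tzinfo=tz).timestamp()


min_epoch = to_epoch('17-4-2022 00:00:00')
max_epoch = to_epoch('17-7-2022 00:00:00')
print("epoch time:", min_epoch)
print("present time:", max_epoch)
```

With the offset fixed in code, the pull window no longer shifts when the notebook is rerun on a machine in another timezone.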


#### Function to loop over the PushShift API and collect about 1000 posts in a single run:

__References:__
1. [Watchful1](https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py)
1. [StackOverFlow](https://stackoverflow.com/questions/66783488/code-efficiency-performance-improvement-in-pushshift-reddit-web-scraping-loop)

In my earlier pulls, the `author` and `post` features contained `[deleted]`, `[removed]` and `np.nan` values, which appear to mark deleted or removed entries. I therefore edited the function below to drop these data points during data extraction.
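
The effect of these filters can be checked on a small frame before running the full pull. This is a sketch only: the rows in `toy` are made up to cover each case, and the conditions mirror the ones used in the function below.

```python
import numpy as np
import pandas as pd

# Made-up rows, one per case the cleaning step should handle
toy = pd.DataFrame({
    'author': ['alice', '[deleted]', 'bob', 'carol', 'dave'],
    'post': ['A real post', 'Text from a deleted account', '[removed]', np.nan, ''],
})

keep = (
    toy['post'].ne('')                # drop blank posts
    & toy['post'].ne('[removed]')     # drop posts removed by a moderator
    & toy['author'].ne('[deleted]')   # drop posts from deleted accounts
    & toy['post'].notna()             # drop posts with missing text
)
cleaned = toy.loc[keep]
print(cleaned)
```

Only the first row survives; every other row trips exactly one of the four conditions.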

In [16]:
def get_data(object_type='submission', subreddit='', max_time=None, min_time=epoch_time):
    # default to the fixed upper bound (now_time) so that repeated runs cover the same window
    if max_time is None:
        max_time = int(now_time)

    # build the subreddit filter; an empty string means no subreddit filter
    filter_string = f"subreddit={subreddit}" if subreddit != '' else ''

    url_format = "https://api.pushshift.io/reddit/search/{}/?size=100&sort=desc&{}&before={}"

    before = max_time
    df = pd.DataFrame()
    df_no_removed = df  # returned unchanged if no submissions are collected

    while before > min_time:
        url = url_format.format(object_type, filter_string, int(before))
        resp = requests.get(url)

        # convert records to flat table format
        dfi = pd.json_normalize(json.loads(resp.text)['data'])
        if dfi.empty:
            break  # no older records left

        # oldest timestamp in this batch, taken before any rows are filtered out
        oldest = dfi['created_utc'].min()

        if object_type == 'submission':
            dfi = dfi.rename(columns={'created_utc': 'epoch_time',
                                      'selftext': 'post'})
            dfi = dfi[dfi['post'].ne('')]  # exclude blank posts
            df = pd.concat([df, dfi[['epoch_time', 'author', 'title', 'post', 'subreddit']]])
            # preliminary data cleaning: drop posts removed by a moderator,
            # posts from deleted accounts, and posts with missing text
            df_no_removed = df.loc[(df['post'] != '[removed]') &
                                   (df['author'] != '[deleted]') &
                                   (df['post'].notna()), :]
            # stop at around 1000 posts to reduce loading time
            if len(df_no_removed) > 1000:
                break

        # move the cursor to the oldest record in this batch so that the next
        # request only returns older records and no duplicates are fetched
        before = oldest

        # pause between requests to avoid overloading the API
        time.sleep(3)

    return df_no_removed
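
The `before` cursor in the loop can be tried offline. The sketch below is not the notebook's code: `fake_fetch` is a hypothetical stand-in for one API request, the timestamps are made up, and `before` is assumed to be exclusive (only records strictly older than the cursor are returned).

```python
def fake_fetch(before, size=3):
    """Hypothetical stand-in for one API call: the newest `size` posts older than `before`."""
    all_posts = [{'created_utc': t} for t in range(100, 0, -10)]  # 100, 90, ..., 10
    older = [p for p in all_posts if p['created_utc'] < before]
    return older[:size]


def paginate(max_time, min_time, fetch=fake_fetch):
    """Walk backwards in time, moving the cursor to the oldest post seen so far."""
    collected = []
    before = max_time
    while before > min_time:
        batch = fetch(before)
        if not batch:  # nothing older is left
            break
        collected.extend(batch)
        before = min(p['created_utc'] for p in batch)
    return [p['created_utc'] for p in collected]


timestamps = paginate(max_time=95, min_time=40)
print(timestamps)
```

Each batch starts where the previous one ended, so no timestamp is fetched twice. As in `get_data`, the last batch is kept whole, so it can contain records at or below `min_time`.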

### `Marvel` data:
---

In [17]:
%%time

# pull data from 'Marvel' subreddit
marvel_df = get_data(object_type='submission', subreddit='marvel', max_time=None, min_time=epoch_time)

CPU times: total: 2.06 s
Wall time: 6min 9s


In [18]:
marvel_df.head(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 15 | 1657980098 | DarkUpquark | Watched "Helstrom" on hulu | Missed that it actually made it to air in 2020... | Marvel |
| 20 | 1657974418 | CEO_of_Redd1t | Explaining why the MCU is Earth-616 now in the... | So there's been a lot of discussion lately abo... | Marvel |
| 28 | 1657963597 | Short-Step-6704 | what should i watch first? | Guys i've never seen any movie from marvel.. s... | Marvel |


In [19]:
# 1008 datapoints
marvel_df.shape

(1008, 5)

In [20]:
marvel_df.reset_index(drop=True, inplace=True)
marvel_df.head(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 0 | 1657980098 | DarkUpquark | Watched "Helstrom" on hulu | Missed that it actually made it to air in 2020... | Marvel |
| 1 | 1657974418 | CEO_of_Redd1t | Explaining why the MCU is Earth-616 now in the... | So there's been a lot of discussion lately abo... | Marvel |
| 2 | 1657963597 | Short-Step-6704 | what should i watch first? | Guys i've never seen any movie from marvel.. s... | Marvel |


##### Check for null values:

In [21]:
marvel_df.isnull().sum()

epoch_time    0
author        0
title         0
post          0
subreddit     0
dtype: int64

##### Function to find values in square brackets / other null data:

In [22]:
def sq_brkets(data):
    cols = ['author', 'title', 'post']
    missing_dict = {}
    for col in cols:
        # make a new key for each column under study
        # find all words in square brackets that might represent 'removed' or 'deleted' entries
        # (raw string so that '\[' and '\w' reach the regex engine unchanged)
        missing_dict[col] = [x for x in data[col].str.findall(r'\[\w+\]') if x]
    return missing_dict

In [23]:
# check whether there are any '[removed]' or '[deleted]' entries in the data pulled
sq_brkets(marvel_df)

{'author': [],
 'title': [['[endgame]'],
  ['[Spoilers]'],
  ['[Comics]'],
  ['[SPOILER]', '[SPOILERS]'],
  ['[Moonknight]'],
  ['[Theory]']],
 'post': [['[here]', '[here]'],
  ['[Spotify]', '[Amazon]'],
  ['[UNFINISHED]', '[UNFINISHED]']]}

There are no `[removed]` or `[deleted]` entries left in the data pulled.
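
Reading the bracketed matches by eye works for a short list, but the check can also be narrowed to the two deletion markers. A minimal sketch on plain strings: the helper `find_deletion_markers` and the sample `titles` are made up for illustration, and the regex is the same one used in `sq_brkets`.

```python
import re

DELETION_MARKERS = {'[removed]', '[deleted]'}


def find_deletion_markers(texts):
    """Return the bracketed tokens in `texts` that are deletion markers."""
    found = []
    for text in texts:
        for token in re.findall(r'\[\w+\]', text):
            if token.lower() in DELETION_MARKERS:
                found.append(token)
    return found


# Made-up titles with ordinary tags, similar to those in the output above
titles = ['[Spoilers] New trailer', 'Thoughts on [Moonknight]?', 'Reading order [Comics]']
print(find_deletion_markers(titles))                  # ordinary tags are ignored
print(find_deletion_markers(titles + ['[removed]']))  # a deletion marker is reported
```

An empty result for every column would confirm the same conclusion without scanning the tag list manually.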

In [24]:
# export data to csv
marvel_df.to_csv('data/marvel.csv', index=False)

### `DCcomics` data:
---

##### Reuse the `min_time` and `max_time` last used for `marvel_df`:

In [25]:
%%time

# pull data from 'DCcomics' subreddit
dcomics_df = get_data(object_type='submission', subreddit='DCcomics', max_time=None, min_time=epoch_time)

CPU times: total: 1.36 s
Wall time: 4min 40s


In [26]:
dcomics_df.head(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 2 | 1657986458 | das_cthulu | How fast is wonder woman compared to Superman ... | What the title says. | DCcomics |
| 3 | 1657985374 | nightwing612 | Excluding Linda Park, who is your best/favorit... | Some of those romances did not happen in comic... | DCcomics |
| 4 | 1657984632 | LilBe | Which Injustice graphic novels are a must-have? | So I want to start reading the Injustice books... | DCcomics |


In [27]:
dcomics_df.shape
# there are 1010 data points

(1010, 5)

In [28]:
dcomics_df.reset_index(drop=True, inplace=True)
dcomics_df.head(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 0 | 1657986458 | das_cthulu | How fast is wonder woman compared to Superman ... | What the title says. | DCcomics |
| 1 | 1657985374 | nightwing612 | Excluding Linda Park, who is your best/favorit... | Some of those romances did not happen in comic... | DCcomics |
| 2 | 1657984632 | LilBe | Which Injustice graphic novels are a must-have? | So I want to start reading the Injustice books... | DCcomics |


##### Check for null values:

In [29]:
dcomics_df.isnull().sum()

epoch_time    0
author        0
title         0
post          0
subreddit     0
dtype: int64

In [30]:
# find if there are any more deleted entries
sq_brkets(dcomics_df)

{'author': [],
 'title': [['[Discussion]'],
  ['[Discussion]'],
  ['[Anime]', '[Manga]'],
  ['[Other]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[Other]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[Question]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Poll]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Cosplay]'],
  ['[Artwork]'],
  ['[Discussion]'],
  ['[poll]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Poll]'],
  ['[Discussion]'],
  ['[COVER]'],
  ['[Discussion]'],
  ['[Cover]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Discussion]'],
  ['[Other]'],
  ['[D

In [31]:
# isolate posts that have been '[removed]'
dcomics_df.loc[dcomics_df['post']=='[removed]',:]

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|


The filter returns no rows, so no `post` in `dcomics_df` consists solely of `[removed]`. Any `[removed]` matches that `sq_brkets` finds inside `post` are therefore likely portions of content that were removed, not entire posts. There are no `[deleted]` or `[removed]` entries for me to drop from `dcomics_df`.

In [32]:
# export csv file
dcomics_df.to_csv('data/DCcomics.csv', index=False)

### Combined data
---

In [33]:
combined_df = pd.concat([marvel_df, dcomics_df], axis=0, ignore_index=True)

In [34]:
# check the concatenation: 1008 'Marvel' rows + 1010 'DCcomics' rows = 2018 rows
combined_df.shape

(2018, 5)
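
Because this is a binary classification problem, it is also worth confirming that the two classes are close to balanced. A small sketch using the row counts reported above; the `labels` list is rebuilt from those counts rather than read from `combined_df`.

```python
from collections import Counter

# Rebuild the label column from the row counts reported in the notebook
labels = ['Marvel'] * 1008 + ['DCcomics'] * 1010

counts = Counter(labels)
total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

print(counts)
print({name: round(share, 3) for name, share in shares.items()})
```

In the notebook, `combined_df['subreddit'].value_counts(normalize=True)` would give the same shares directly from the data.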

In [35]:
# check that the index starts at 0 after concatenation
combined_df.head(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 0 | 1657980098 | DarkUpquark | Watched "Helstrom" on hulu | Missed that it actually made it to air in 2020... | Marvel |
| 1 | 1657974418 | CEO_of_Redd1t | Explaining why the MCU is Earth-616 now in the... | So there's been a lot of discussion lately abo... | Marvel |
| 2 | 1657963597 | Short-Step-6704 | what should i watch first? | Guys i've never seen any movie from marvel.. s... | Marvel |


In [36]:
# check that the index ends at 2017 (2018 rows in total)
combined_df.tail(3)

| | epoch_time | author | title | post | subreddit |
|---|---|---|---|---|---|
| 2015 | 1655084464 | JustNormal141 | What happened to Wally west before Dc rebirth? | I know that everyone doesn’t remember him but ... | DCcomics |
| 2016 | 1655083740 | komayeda1 | Do we take canon a bit too seriously here? | In a world where continuity is a jumbled mess ... | DCcomics |
| 2017 | 1655082493 | JadeBladeGamer22 | When is "Static: Season Two" coming out? | It was said that Static Season Two was slated ... | DCcomics |


In [37]:
# export data to csv
combined_df.to_csv('data/combined.csv', index=False)