<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Using Reddit's API for Predicting Comments

_Authors: Schubert H. Laforest (BOS)_

---
<a id='part0'></a>
# Project Goal
**GA's Directives:**
In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor. For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

**Classification Models Used:** Logistic Regression, CARTs

**Dataset (scraped):** `pop_hot_raw.csv`


# Notebook Guide
- [Scraping Data from Reddit](#scrape)
- [Preprocessing the Data & Feature Engineering](#prepo)
- [Natural Language Processing](#NLP)
- [CART Models](#CART)
- [Logistic Regression](#logit)
- [Executive Summary](#exec)

<a id='scrape'></a>
# Scraping Data from Reddit

For this project, I used PRAW, a Python wrapper for the Reddit API, to scrape posts and build my data set. I limited myself to the `hot` section of the `popular` subreddit for two reasons: 1) every Reddit user is subscribed to it by default, and 2) consequently, I believed that understanding which posts and content performed well there would help me understand the components of a good Reddit post.

**Below is the script I wrote to scrape Reddit posts:**

In [None]:
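# Scrape of Reddit Popular
import os
import praw
import pandas as pd
from datetime import timedelta
import datetime as dt

# Credentials are read from environment variables (the variable names are
# placeholders) so that secrets are not stored in the notebook.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'],
                     user_agent='prawtutorialv2')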

# Creating Popular subreddit dataset
subreddit = reddit.subreddit('popular')

top_subreddit = subreddit.hot(limit=10000)

# for submission in subreddit.hot(limit=1):
#     print(submission.title, submission.id)


topics_dict = { "title":[], \
                "subreddit":[], \
                "is_video":[], \
                "subreddit_subs":[], \
                "thumbnail":[], \
                "score":[], \
                "id":[], "url":[], \
                "comms_num": [], \
                "num_crossposts": [], \
                "created": [], \
                "whitelist_status": [], \
                "thumbnail_ht": [], \
                "thumbnail_wdt": [], \
                "ops_flair": [], \
                "link_flair": [], \
                "body":[]}

for submission in top_subreddit:
    # ignoring posts that are stickied or pinned
    if not (submission.stickied or submission.pinned):
        topics_dict["title"].append(submission.title)
        topics_dict["subreddit"].append(submission.subreddit)
        topics_dict["subreddit_subs"].append(submission.subreddit_subscribers)
        topics_dict["thumbnail"].append(submission.thumbnail)
        topics_dict["thumbnail_ht"].append(submission.thumbnail_height)
        topics_dict["thumbnail_wdt"].append(submission.thumbnail_width)
        topics_dict["is_video"].append(submission.is_video)
        topics_dict["score"].append(submission.score)
        topics_dict["id"].append(submission.id)
        topics_dict["url"].append(submission.url)
        topics_dict["comms_num"].append(submission.num_comments)
        topics_dict["num_crossposts"].append(submission.num_crossposts)
        topics_dict["created"].append(submission.created)
        topics_dict["whitelist_status"].append(submission.whitelist_status)
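        # NOTE: both flair columns store the post's selftext, so they duplicate "body";
        # PRAW exposes the flair text as submission.author_flair_text and submission.link_flair_text.
        topics_dict["ops_flair"].append(submission.selftext)
        topics_dict["link_flair"].append(submission.selftext)
        topics_dict["body"].append(submission.selftext)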


topics_data = pd.DataFrame(topics_dict)

# topics_data.head()

def get_date(created):
    return dt.datetime.fromtimestamp(created)

_timestamp = topics_data["created"].apply(get_date)

topics_data = topics_data.assign(timestamp = _timestamp)

topics_data["time_up"] = dt.datetime.now() - topics_data["timestamp"]

topics_data["thumbnail_size"] = topics_data["thumbnail_ht"] * topics_data["thumbnail_wdt"]

topics_data.drop(["thumbnail_ht", "thumbnail_wdt"], axis=1, inplace=True)

topics_data.shape


topics_data.head()
# topics_data.to_csv('pop_hot_raw.csv', index=False) 


<a id='prepo'></a>
# Preprocessing the Data and Feature Engineering

In [802]:
import pandas as pd
import numpy as np
import re

In [803]:
data = pd.read_csv('pop_hot_raw.csv')
data.head()

Unnamed: 0,body,comms_num,created,id,is_video,link_flair,num_crossposts,ops_flair,score,subreddit,subreddit_subs,thumbnail,title,url,whitelist_status,timestamp,time_up,thumbnail_size
0,,1933,1528019000.0,8o59w7,False,,6,,33177,space,13878786,https://b.thumbs.redditmedia.com/gGiQxb9FE1tNt...,The close-up of the Andromeda Galaxy from the ...,https://i.redd.it/n7lw9vnqwo111.jpg,all_ads,2018-06-03 05:37:12,-1 days +19:37:40.175413000,19600.0
1,,228,1528017000.0,8o534d,False,,10,,21786,gifs,16182968,https://b.thumbs.redditmedia.com/uAhnNdQgpXwXz...,Interesting paintwork,https://i.imgur.com/dQZmzhQ.gifv,all_ads,2018-06-03 05:03:48,-1 days +20:11:04.175413000,19600.0
2,,749,1528011000.0,8o4k7b,False,,2,,40508,todayilearned,18848334,https://a.thumbs.redditmedia.com/TZsCRRxPYvLsb...,TIL Viggo Mortensen purchased the horse he rod...,http://ca.ign.com/articles/2004/03/04/ign-inte...,all_ads,2018-06-03 03:35:57,-1 days +21:38:55.175413000,19600.0
3,,398,1528012000.0,8o4kz1,False,,0,,26798,aww,17226882,https://b.thumbs.redditmedia.com/LMoBkdKMN5gR7...,My dad just got a Facebook account then asked ...,https://i.redd.it/zy13buapbo111.jpg,all_ads,2018-06-03 03:39:12,-1 days +21:35:40.175413000,19600.0
4,,146,1528013000.0,8o4oqk,False,,1,,26000,PrequelMemes,605612,https://b.thumbs.redditmedia.com/od5BlbIXhwZ_S...,How to legalize a Ewan McGregor photo.,https://i.imgur.com/3NzQr7S.gifv,all_ads,2018-06-03 03:56:36,-1 days +21:18:16.175413000,14140.0


In [804]:
data.isnull().sum()

body                5568
comms_num              0
created                0
id                     0
is_video               0
link_flair          5568
num_crossposts         0
ops_flair           5568
score                  0
subreddit              0
subreddit_subs         0
thumbnail              0
title                  0
url                    0
whitelist_status     697
timestamp              0
time_up                0
thumbnail_size       587
dtype: int64

In [805]:
data.shape

(6015, 18)

In [806]:
data.whitelist_status.unique()

array(['all_ads', 'promo_adult_nsfw', 'promo_specified', 'promo_all',
       'house_only', nan, 'no_ads', 'promo_adult'], dtype=object)

In [807]:
len(data.thumbnail.unique())

4821

In [808]:
len(data.url.unique())

5810

In [809]:
data.describe(include = 'all')

Unnamed: 0,body,comms_num,created,id,is_video,link_flair,num_crossposts,ops_flair,score,subreddit,subreddit_subs,thumbnail,title,url,whitelist_status,timestamp,time_up,thumbnail_size
count,447,6015.0,6015.0,6015,6015,447,6015.0,447,6015.0,6015,6015.0,6015,6015,6015,5318,6015,6015,5428.0
unique,442,,,5991,2,442,,442,,1995,,4821,5802,5810,7,5734,5734,
top,"Sup xx'ers! As the title says, I'm a powerlift...",,,8o2keu,False,"Sup xx'ers! As the title says, I'm a powerlift...",,"Sup xx'ers! As the title says, I'm a powerlift...",,aww,,self,hmmm,https://v.redd.it/0n7lczcjhk111,all_ads,2018-06-02 22:03:34,0 days 03:11:18.175413000,
freq,2,,,2,5749,2,,2,,162,,532,29,5,4509,4,4,
mean,,63.510058,1527998000.0,,,,0.212801,,1556.483292,,2566996.0,,,,,,,16479.467576
std,,376.181487,18048.15,,,,0.962691,,5273.889802,,5760837.0,,,,,,,3745.46026
min,,0.0,1527946000.0,,,,0.0,,22.0,,1315.0,,,,,,,1680.0
25%,,6.0,1527984000.0,,,,0.0,,105.0,,49896.0,,,,,,,13160.0
50%,,15.0,1528000000.0,,,,0.0,,262.0,,180070.0,,,,,,,19460.0
75%,,40.0,1528012000.0,,,,0.0,,764.0,,629260.0,,,,,,,19600.0


In [810]:
data.head()

Unnamed: 0,body,comms_num,created,id,is_video,link_flair,num_crossposts,ops_flair,score,subreddit,subreddit_subs,thumbnail,title,url,whitelist_status,timestamp,time_up,thumbnail_size
0,,1933,1528019000.0,8o59w7,False,,6,,33177,space,13878786,https://b.thumbs.redditmedia.com/gGiQxb9FE1tNt...,The close-up of the Andromeda Galaxy from the ...,https://i.redd.it/n7lw9vnqwo111.jpg,all_ads,2018-06-03 05:37:12,-1 days +19:37:40.175413000,19600.0
1,,228,1528017000.0,8o534d,False,,10,,21786,gifs,16182968,https://b.thumbs.redditmedia.com/uAhnNdQgpXwXz...,Interesting paintwork,https://i.imgur.com/dQZmzhQ.gifv,all_ads,2018-06-03 05:03:48,-1 days +20:11:04.175413000,19600.0
2,,749,1528011000.0,8o4k7b,False,,2,,40508,todayilearned,18848334,https://a.thumbs.redditmedia.com/TZsCRRxPYvLsb...,TIL Viggo Mortensen purchased the horse he rod...,http://ca.ign.com/articles/2004/03/04/ign-inte...,all_ads,2018-06-03 03:35:57,-1 days +21:38:55.175413000,19600.0
3,,398,1528012000.0,8o4kz1,False,,0,,26798,aww,17226882,https://b.thumbs.redditmedia.com/LMoBkdKMN5gR7...,My dad just got a Facebook account then asked ...,https://i.redd.it/zy13buapbo111.jpg,all_ads,2018-06-03 03:39:12,-1 days +21:35:40.175413000,19600.0
4,,146,1528013000.0,8o4oqk,False,,1,,26000,PrequelMemes,605612,https://b.thumbs.redditmedia.com/od5BlbIXhwZ_S...,How to legalize a Ewan McGregor photo.,https://i.imgur.com/3NzQr7S.gifv,all_ads,2018-06-03 03:56:36,-1 days +21:18:16.175413000,14140.0


Things to do:
- `time_up`: keep just the hours
- TF-IDF on the titles (keep stop words)
- create a dummy column for posts whose number of comments is above the 75th percentile (40 comments)
- dummy-encode `is_video` and `whitelist_status`
- use a regex to pull the last part of the URL slug to identify the type of content
- light LDA and sentiment analysis on titles and subreddits
- thumbnail size: replace NaN with 0
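
The percentile-based dummy column from the list above can be sketched as follows. The comment counts are made-up toy values rather than the scraped data, and the column name `high_engagement` is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy comment counts (illustrative values only, not the scraped data)
toy = pd.DataFrame({"comms_num": [0, 3, 6, 15, 22, 40, 75, 1933]})

# 75th percentile of the comment counts
threshold = toy["comms_num"].quantile(0.75)

# 1 if a post is strictly above the threshold, 0 otherwise
toy["high_engagement"] = np.where(toy["comms_num"] > threshold, 1, 0)

print(threshold)
print(toy)
```

Computing the threshold from the data, rather than hard-coding it, keeps the label definition valid if the scrape is rerun on a different day.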

In [811]:
type(data["body"].values)

numpy.ndarray

In [812]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6015 entries, 0 to 6014
Data columns (total 18 columns):
body                447 non-null object
comms_num           6015 non-null int64
created             6015 non-null float64
id                  6015 non-null object
is_video            6015 non-null bool
link_flair          447 non-null object
num_crossposts      6015 non-null int64
ops_flair           447 non-null object
score               6015 non-null int64
subreddit           6015 non-null object
subreddit_subs      6015 non-null int64
thumbnail           6015 non-null object
title               6015 non-null object
url                 6015 non-null object
whitelist_status    5318 non-null object
timestamp           6015 non-null object
time_up             6015 non-null object
thumbnail_size      5428 non-null float64
dtypes: bool(1), float64(2), int64(4), object(11)
memory usage: 804.8+ KB


We're going to drop `ops_flair` and `link_flair`: both are non-null for exactly the same 447 rows as `body`, so they tell us nothing beyond the presence of a body. `thumbnail` also gets me to the same place as `url`. Let's drop the features we've pulled but aren't going to use.

In [813]:
data.columns

Index(['body', 'comms_num', 'created', 'id', 'is_video', 'link_flair',
       'num_crossposts', 'ops_flair', 'score', 'subreddit', 'subreddit_subs',
       'thumbnail', 'title', 'url', 'whitelist_status', 'timestamp', 'time_up',
       'thumbnail_size'],
      dtype='object')

In [814]:
data.drop(["link_flair", "ops_flair"], axis=1, inplace=True)

In [815]:
data.whitelist_status.unique()

array(['all_ads', 'promo_adult_nsfw', 'promo_specified', 'promo_all',
       'house_only', nan, 'no_ads', 'promo_adult'], dtype=object)

In [816]:
data.head(100)

Unnamed: 0,body,comms_num,created,id,is_video,num_crossposts,score,subreddit,subreddit_subs,thumbnail,title,url,whitelist_status,timestamp,time_up,thumbnail_size
0,,1933,1.528019e+09,8o59w7,False,6,33177,space,13878786,https://b.thumbs.redditmedia.com/gGiQxb9FE1tNt...,The close-up of the Andromeda Galaxy from the ...,https://i.redd.it/n7lw9vnqwo111.jpg,all_ads,2018-06-03 05:37:12,-1 days +19:37:40.175413000,19600.0
1,,228,1.528017e+09,8o534d,False,10,21786,gifs,16182968,https://b.thumbs.redditmedia.com/uAhnNdQgpXwXz...,Interesting paintwork,https://i.imgur.com/dQZmzhQ.gifv,all_ads,2018-06-03 05:03:48,-1 days +20:11:04.175413000,19600.0
2,,749,1.528011e+09,8o4k7b,False,2,40508,todayilearned,18848334,https://a.thumbs.redditmedia.com/TZsCRRxPYvLsb...,TIL Viggo Mortensen purchased the horse he rod...,http://ca.ign.com/articles/2004/03/04/ign-inte...,all_ads,2018-06-03 03:35:57,-1 days +21:38:55.175413000,19600.0
3,,398,1.528012e+09,8o4kz1,False,0,26798,aww,17226882,https://b.thumbs.redditmedia.com/LMoBkdKMN5gR7...,My dad just got a Facebook account then asked ...,https://i.redd.it/zy13buapbo111.jpg,all_ads,2018-06-03 03:39:12,-1 days +21:35:40.175413000,19600.0
4,,146,1.528013e+09,8o4oqk,False,1,26000,PrequelMemes,605612,https://b.thumbs.redditmedia.com/od5BlbIXhwZ_S...,How to legalize a Ewan McGregor photo.,https://i.imgur.com/3NzQr7S.gifv,all_ads,2018-06-03 03:56:36,-1 days +21:18:16.175413000,14140.0
5,,314,1.528011e+09,8o4j9q,False,1,22579,pics,18708090,nsfw,Dressed In Light,https://i.redd.it/zp8ni61dao111.jpg,promo_adult_nsfw,2018-06-03 03:31:42,-1 days +21:43:10.175413000,19600.0
6,,171,1.528014e+09,8o4v6z,False,0,9375,lifehacks,1236703,https://a.thumbs.redditmedia.com/woZz5ZX5RipLm...,Drained my water heater and the hose I used wa...,https://i.redd.it/8b3f58yzjo111.jpg,all_ads,2018-06-03 04:26:01,-1 days +20:48:51.175413000,14700.0
7,,381,1.528011e+09,8o4hoh,True,2,20620,funny,19639174,https://b.thumbs.redditmedia.com/PkJ-fikMQH732...,Many of us can relate.,https://v.redd.it/dl4v4yv19o111,all_ads,2018-06-03 03:25:00,-1 days +21:49:52.175413000,19600.0
8,,128,1.528013e+09,8o4pb1,False,0,9387,wallstreetbets,261201,https://b.thumbs.redditmedia.com/rmDdsrZUqyPO-...,Investment advice from WSB,https://i.redd.it/onn47j1afo111.jpg,promo_specified,2018-06-03 03:59:19,-1 days +21:15:33.175413000,10780.0
9,,124,1.528017e+09,8o531f,False,0,7555,gaming,18204889,https://b.thumbs.redditmedia.com/C-pS7aj-SbrCH...,The police will go to extreme measures to capt...,https://gfycat.com/TemptingExcellentIchthyosau...,all_ads,2018-06-03 05:03:28,-1 days +20:11:24.175413000,10920.0


In [817]:
gifs = [".gifv", ".gif", "gfycat"]
images = ["i.redd", "https://imgur.com"]

In [818]:
(data["comms_num"]>6).sum()

4422

In [819]:
# the target: 1 if the post has more than 500 comments, 0 otherwise
data["engagement"] = np.where(data["comms_num"]>500, 1, 0)

In [820]:
data.head()

Unnamed: 0,body,comms_num,created,id,is_video,num_crossposts,score,subreddit,subreddit_subs,thumbnail,title,url,whitelist_status,timestamp,time_up,thumbnail_size,engagement
0,,1933,1528019000.0,8o59w7,False,6,33177,space,13878786,https://b.thumbs.redditmedia.com/gGiQxb9FE1tNt...,The close-up of the Andromeda Galaxy from the ...,https://i.redd.it/n7lw9vnqwo111.jpg,all_ads,2018-06-03 05:37:12,-1 days +19:37:40.175413000,19600.0,1
1,,228,1528017000.0,8o534d,False,10,21786,gifs,16182968,https://b.thumbs.redditmedia.com/uAhnNdQgpXwXz...,Interesting paintwork,https://i.imgur.com/dQZmzhQ.gifv,all_ads,2018-06-03 05:03:48,-1 days +20:11:04.175413000,19600.0,0
2,,749,1528011000.0,8o4k7b,False,2,40508,todayilearned,18848334,https://a.thumbs.redditmedia.com/TZsCRRxPYvLsb...,TIL Viggo Mortensen purchased the horse he rod...,http://ca.ign.com/articles/2004/03/04/ign-inte...,all_ads,2018-06-03 03:35:57,-1 days +21:38:55.175413000,19600.0,1
3,,398,1528012000.0,8o4kz1,False,0,26798,aww,17226882,https://b.thumbs.redditmedia.com/LMoBkdKMN5gR7...,My dad just got a Facebook account then asked ...,https://i.redd.it/zy13buapbo111.jpg,all_ads,2018-06-03 03:39:12,-1 days +21:35:40.175413000,19600.0,0
4,,146,1528013000.0,8o4oqk,False,1,26000,PrequelMemes,605612,https://b.thumbs.redditmedia.com/od5BlbIXhwZ_S...,How to legalize a Ewan McGregor photo.,https://i.imgur.com/3NzQr7S.gifv,all_ads,2018-06-03 03:56:36,-1 days +21:18:16.175413000,14140.0,0


In [821]:
urls = data["url"]

In [822]:
data["is_video"] = data["is_video"].astype(str)

In [823]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6015 entries, 0 to 6014
Data columns (total 17 columns):
body                447 non-null object
comms_num           6015 non-null int64
created             6015 non-null float64
id                  6015 non-null object
is_video            6015 non-null object
num_crossposts      6015 non-null int64
score               6015 non-null int64
subreddit           6015 non-null object
subreddit_subs      6015 non-null int64
thumbnail           6015 non-null object
title               6015 non-null object
url                 6015 non-null object
whitelist_status    5318 non-null object
timestamp           6015 non-null object
time_up             6015 non-null object
thumbnail_size      5428 non-null float64
engagement          6015 non-null int64
dtypes: float64(2), int64(5), object(10)
memory usage: 798.9+ KB


In [824]:
gif_1 = data["url"].str.contains(".gifv")

In [825]:
data["media"] = None

In [826]:
# Default every post to an external link; the rules in the next cell overwrite this
data["media"] = 'ext_link'

In [827]:
# Rules are applied in order and later matches overwrite earlier ones,
# so a .gifv hosted on i.imgur.com ends up labelled 'img'.
# regex=False matches each pattern literally ('.' is not a wildcard).
data.loc[data["url"].str.contains(".gifv", regex=False), 'media'] = 'gif'
data.loc[data["url"].str.contains(".gif", regex=False), 'media'] = 'gif'
data.loc[data["url"].str.contains("gfycat", regex=False), 'media'] = 'gif'
data.loc[data["url"].str.contains("i.redd", regex=False), 'media'] = 'img'
data.loc[data["url"].str.contains("https://imgur.com", regex=False), 'media'] = 'img'
data.loc[data["url"].str.contains("http://imgur.com", regex=False), 'media'] = 'img'
data.loc[data["url"].str.contains("https://i.imgur.com", regex=False), 'media'] = 'img'
data.loc[data["url"].str.contains("http://i.imgur.com", regex=False), 'media'] = 'img'
data.loc[data["url"].str.contains("https://www.reddit.com", regex=False), 'media'] = 'body'
data.loc[data["url"].str.contains("http://www.reddit.com", regex=False), 'media'] = 'body'
data.loc[data["is_video"].str.contains("True", regex=False), 'media'] = 'video'
data.loc[data["thumbnail"].str.contains("self", regex=False), 'media'] = 'body'

# data.loc[data["url"].str.contains(".gifv"), 'media'] = 'gif'
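
Because the rules above overwrite one another, the final label depends on their order. As an alternative sketch, a single function can return the first matching label. The function name `classify_media` and its priority order are assumptions made for illustration; they do not reproduce the notebook's labels exactly (for example, a `.gifv` hosted on `i.imgur.com` is labelled `gif` here, while the rules above label it `img`).

```python
from urllib.parse import urlparse

def classify_media(url, is_video=False, thumbnail=""):
    """Return a coarse media label for a post (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()

    if thumbnail == "self" or host.endswith("reddit.com"):
        return "body"        # text post hosted on Reddit itself
    if is_video:
        return "video"
    if path.endswith((".gif", ".gifv")) or "gfycat" in host:
        return "gif"
    if host in ("i.redd.it", "imgur.com", "i.imgur.com"):
        return "img"
    return "ext_link"        # anything else is treated as an external link

print(classify_media("https://i.redd.it/n7lw9vnqwo111.jpg"))
print(classify_media("https://i.imgur.com/dQZmzhQ.gifv"))
```

Applied to the DataFrame, this could be called row by row, for instance with `data.apply(lambda r: classify_media(r["url"], r["is_video"] == "True", r["thumbnail"]), axis=1)`.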

In [828]:
# for m in data.media:
#     if gif_1 is True:
#         data["media"].append('gif')
    

In [829]:
data["time_up_clean"] = data["time_up"]
# data["time_up_clean"] = re.sub('days+-\s','')

In [830]:
def clean_it(time):
    # Strip the leading "-1 days +" or "0 days " prefix from a timedelta string
    return re.sub(r'^-?\d+ days \+?', '', time)

In [831]:
# Remove the letters of "days", the sign characters, and spaces in one pass
data["time_up_clean"] = data["time_up_clean"].str.replace(r'[days+\- ]', '', regex=True)
# Drop the leading day digit, then the trailing fractional seconds (10 characters)
data["time_up_clean"] = data["time_up_clean"].str[1:]
data["time_up_clean"] = data["time_up_clean"].str[:-10]

In [832]:
# parse HH:MM:SS into a datetime (pandas fills in the default date 1900-01-01)
data["time_up_sec"] = pd.to_datetime(data["time_up_clean"], format= "%H:%M:%S")

In [833]:
# pd.to_datetime(data["time_up"])

In [834]:
# print(data["time_up"].dt.total_seconds())

In [835]:
data["hours"] = data["time_up_clean"].astype(str).str[0:2]
data["hours"] = data["hours"].astype(int)
# replacing times under an hour with 1 hour
data.loc[data["hours"] == 0, "hours"] = 1
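
The string clean-up above can also be avoided by parsing `time_up` as a timedelta. The sketch below is one possible alternative, not the method used in this notebook; it takes two `time_up` strings in the format shown earlier and extracts the hour component.

```python
import pandas as pd

# Two time_up strings in the format that appears in the data
time_up = pd.Series(["-1 days +19:37:40.175413000", "0 days 03:11:18.175413000"])

# Parse the strings into Timedelta values
deltas = pd.to_timedelta(time_up)

# .dt.components splits each value into days, hours, minutes, ...
hours = deltas.dt.components["hours"]

# Mirror the notebook's rule: anything under one hour counts as 1 hour
hours = hours.where(hours != 0, 1)
print(hours.tolist())
```

Note that the hour component ignores the day part, which matches what the string slicing above does; `deltas.dt.total_seconds()` would instead give the full signed duration.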

In [836]:
data.head(50)

Unnamed: 0,body,comms_num,created,id,is_video,num_crossposts,score,subreddit,subreddit_subs,thumbnail,...,url,whitelist_status,timestamp,time_up,thumbnail_size,engagement,media,time_up_clean,time_up_sec,hours
0,,1933,1528019000.0,8o59w7,False,6,33177,space,13878786,https://b.thumbs.redditmedia.com/gGiQxb9FE1tNt...,...,https://i.redd.it/n7lw9vnqwo111.jpg,all_ads,2018-06-03 05:37:12,-1 days +19:37:40.175413000,19600.0,1,img,19:37:40,1900-01-01 19:37:40,19
1,,228,1528017000.0,8o534d,False,10,21786,gifs,16182968,https://b.thumbs.redditmedia.com/uAhnNdQgpXwXz...,...,https://i.imgur.com/dQZmzhQ.gifv,all_ads,2018-06-03 05:03:48,-1 days +20:11:04.175413000,19600.0,0,img,20:11:04,1900-01-01 20:11:04,20
2,,749,1528011000.0,8o4k7b,False,2,40508,todayilearned,18848334,https://a.thumbs.redditmedia.com/TZsCRRxPYvLsb...,...,http://ca.ign.com/articles/2004/03/04/ign-inte...,all_ads,2018-06-03 03:35:57,-1 days +21:38:55.175413000,19600.0,1,ext_link,21:38:55,1900-01-01 21:38:55,21
3,,398,1528012000.0,8o4kz1,False,0,26798,aww,17226882,https://b.thumbs.redditmedia.com/LMoBkdKMN5gR7...,...,https://i.redd.it/zy13buapbo111.jpg,all_ads,2018-06-03 03:39:12,-1 days +21:35:40.175413000,19600.0,0,img,21:35:40,1900-01-01 21:35:40,21
4,,146,1528013000.0,8o4oqk,False,1,26000,PrequelMemes,605612,https://b.thumbs.redditmedia.com/od5BlbIXhwZ_S...,...,https://i.imgur.com/3NzQr7S.gifv,all_ads,2018-06-03 03:56:36,-1 days +21:18:16.175413000,14140.0,0,img,21:18:16,1900-01-01 21:18:16,21
5,,314,1528011000.0,8o4j9q,False,1,22579,pics,18708090,nsfw,...,https://i.redd.it/zp8ni61dao111.jpg,promo_adult_nsfw,2018-06-03 03:31:42,-1 days +21:43:10.175413000,19600.0,0,img,21:43:10,1900-01-01 21:43:10,21
6,,171,1528014000.0,8o4v6z,False,0,9375,lifehacks,1236703,https://a.thumbs.redditmedia.com/woZz5ZX5RipLm...,...,https://i.redd.it/8b3f58yzjo111.jpg,all_ads,2018-06-03 04:26:01,-1 days +20:48:51.175413000,14700.0,0,img,20:48:51,1900-01-01 20:48:51,20
7,,381,1528011000.0,8o4hoh,True,2,20620,funny,19639174,https://b.thumbs.redditmedia.com/PkJ-fikMQH732...,...,https://v.redd.it/dl4v4yv19o111,all_ads,2018-06-03 03:25:00,-1 days +21:49:52.175413000,19600.0,0,video,21:49:52,1900-01-01 21:49:52,21
8,,128,1528013000.0,8o4pb1,False,0,9387,wallstreetbets,261201,https://b.thumbs.redditmedia.com/rmDdsrZUqyPO-...,...,https://i.redd.it/onn47j1afo111.jpg,promo_specified,2018-06-03 03:59:19,-1 days +21:15:33.175413000,10780.0,0,img,21:15:33,1900-01-01 21:15:33,21
9,,124,1528017000.0,8o531f,False,0,7555,gaming,18204889,https://b.thumbs.redditmedia.com/C-pS7aj-SbrCH...,...,https://gfycat.com/TemptingExcellentIchthyosau...,all_ads,2018-06-03 05:03:28,-1 days +20:11:24.175413000,10920.0,0,gif,20:11:24,1900-01-01 20:11:24,20


Moving forward, I keep only the core features.

In [837]:
reddit_body = data.drop(['created', 'is_video', 'thumbnail', 'url', 'timestamp',
                    'time_up', 'time_up_sec', 'time_up_clean'], axis=1)

In [838]:
reddit = data.drop(['body','id','created', 'is_video', 'thumbnail', 'url', 'timestamp',
                    'time_up', 'time_up_sec', 'time_up_clean'], axis=1)

In [839]:
reddit.head(50)

Unnamed: 0,comms_num,num_crossposts,score,subreddit,subreddit_subs,title,whitelist_status,thumbnail_size,engagement,media,hours
0,1933,6,33177,space,13878786,The close-up of the Andromeda Galaxy from the ...,all_ads,19600.0,1,img,19
1,228,10,21786,gifs,16182968,Interesting paintwork,all_ads,19600.0,0,img,20
2,749,2,40508,todayilearned,18848334,TIL Viggo Mortensen purchased the horse he rod...,all_ads,19600.0,1,ext_link,21
3,398,0,26798,aww,17226882,My dad just got a Facebook account then asked ...,all_ads,19600.0,0,img,21
4,146,1,26000,PrequelMemes,605612,How to legalize a Ewan McGregor photo.,all_ads,14140.0,0,img,21
5,314,1,22579,pics,18708090,Dressed In Light,promo_adult_nsfw,19600.0,0,img,21
6,171,0,9375,lifehacks,1236703,Drained my water heater and the hose I used wa...,all_ads,14700.0,0,img,20
7,381,2,20620,funny,19639174,Many of us can relate.,all_ads,19600.0,0,video,21
8,128,0,9387,wallstreetbets,261201,Investment advice from WSB,promo_specified,10780.0,0,img,21
9,124,0,7555,gaming,18204889,The police will go to extreme measures to capt...,all_ads,10920.0,0,gif,20


In [840]:
(reddit["engagement"] > 0).sum()

114

Let's create some dummies

In [841]:
reddit["thumbnail_size"] = reddit["thumbnail_size"].fillna(0)

In [842]:
#reddit.to_csv('clean_reddit.csv', index=False)

In [738]:
reddit = pd.get_dummies(reddit, columns=['whitelist_status', 'media'])
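
To make the effect of `pd.get_dummies` concrete, here is a small sketch on toy data (the values are invented for illustration). It also shows that a missing `whitelist_status` produces a row with no whitelist indicator set, which is how the NaN values in this data set are handled by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "score": [10, 20, 30],
    "media": ["img", "gif", "img"],
    "whitelist_status": ["all_ads", None, "no_ads"],
})

# One indicator column per category; the original columns are removed
encoded = pd.get_dummies(toy, columns=["whitelist_status", "media"])
print(encoded.columns.tolist())

# The row with a missing whitelist_status has no whitelist indicator set
wl_cols = [c for c in encoded.columns if c.startswith("whitelist_status_")]
print(int(encoded.loc[1, wl_cols].astype(int).sum()))
```

If missing values should get their own indicator, `pd.get_dummies` accepts `dummy_na=True`.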

In [739]:
reddit.head()

Unnamed: 0,comms_num,num_crossposts,score,subreddit,subreddit_subs,title,thumbnail_size,engagement,hours,whitelist_status_all_ads,...,whitelist_status_no_ads,whitelist_status_promo_adult,whitelist_status_promo_adult_nsfw,whitelist_status_promo_all,whitelist_status_promo_specified,media_body,media_ext_link,media_gif,media_img,media_video
0,1933,6,33177,space,13878786,The close-up of the Andromeda Galaxy from the ...,19600.0,1,19,1,...,0,0,0,0,0,0,0,0,1,0
1,228,10,21786,gifs,16182968,Interesting paintwork,19600.0,0,20,1,...,0,0,0,0,0,0,0,0,1,0
2,749,2,40508,todayilearned,18848334,TIL Viggo Mortensen purchased the horse he rod...,19600.0,1,21,1,...,0,0,0,0,0,0,1,0,0,0
3,398,0,26798,aww,17226882,My dad just got a Facebook account then asked ...,19600.0,0,21,1,...,0,0,0,0,0,0,0,0,1,0
4,146,1,26000,PrequelMemes,605612,How to legalize a Ewan McGregor photo.,14140.0,0,21,1,...,0,0,0,0,0,0,0,0,1,0


<a id='NLP'></a>
# Natural Language Processing 

**Playing with Latent Dirichlet Allocation**

In [740]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [741]:
reddit_titles = reddit["title"]

In [742]:
reddit_titles

0       The close-up of the Andromeda Galaxy from the ...
1                                   Interesting paintwork
2       TIL Viggo Mortensen purchased the horse he rod...
3       My dad just got a Facebook account then asked ...
4                  How to legalize a Ewan McGregor photo.
5                                        Dressed In Light
6       Drained my water heater and the hose I used wa...
7                                  Many of us can relate.
8                              Investment advice from WSB
9       The police will go to extreme measures to capt...
10      [Homemade] Beef Wellington with a blackberry r...
11      The largest wildfire in California's modern hi...
12      Got to the UPS store before they opened, they ...
13                                       Y’all showed him
14                                                   hmmm
15                         I ..uhm.. concluded Rose's arc
16      France warns US it has one week to avoid trade...
17            

In [743]:
# Processor function for tokenizing, removing stop words, and stemming
def process(input_text):
    # Creates a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Creates a Snowball stemmer
    stemmer = SnowballStemmer('english')
    stop_words = stopwords.words('english')
    
    # Tokenizes the input string
    tokens = tokenizer.tokenize(input_text.lower())
    tokens = [x for x in tokens if x not in stop_words]
    
    # Performs stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]
    return tokens_stemmed

In [744]:
# Creates a list for sentence tokens
tokens = reddit.title.apply(process)

In [745]:
tokens

0       [close, andromeda, galaxi, hubbl, space, teles...
1                                   [interest, paintwork]
2       [til, viggo, mortensen, purchas, hors, rode, l...
3       [dad, got, facebook, account, ask, take, pic, ...
4                          [legal, ewan, mcgregor, photo]
5                                          [dress, light]
6       [drain, water, heater, hose, use, littl, short...
7                                       [mani, us, relat]
8                                    [invest, advic, wsb]
9              [polic, go, extrem, measur, captur, fugit]
10      [homemad, beef, wellington, blackberri, red, w...
11      [largest, wildfir, california, modern, histori...
12          [got, up, store, open, miss, fedex, deliveri]
13                                                 [show]
14                                                 [hmmm]
15                              [uhm, conclud, rose, arc]
16        [franc, warn, us, one, week, avoid, trade, war]
17            

In [746]:
# Creates a dictionary based on the sentence tokens
dict_tokens = corpora.Dictionary(tokens)

In [747]:
# Creating a document-term matrix
doc_term_mat = [dict_tokens.doc2bow(token) for token in tokens]
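
For readers unfamiliar with the bag-of-words format: each document becomes a list of `(token_id, count)` pairs. The pure-Python sketch below imitates that idea without gensim. The helper names are hypothetical, and the id assignment is not guaranteed to match what `corpora.Dictionary` produces.

```python
from collections import Counter

docs = [["interest", "paintwork"],
        ["first", "thing", "first"]]

# Assign each distinct token an integer id in order of first appearance
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(bows)
```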

In [1]:
# doc_term_mat

In [749]:
# Defining the number of topics for the LDA model
num_topics = 2

In [750]:
# Generating the LDA model
ldamodel = models.ldamodel.LdaModel(doc_term_mat,
                                    num_topics=num_topics,
                                    id2word=dict_tokens,
                                    passes=25, alpha=1)

In [751]:
num_words = 5
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    
# Print the contributing words along with their relative contributions.
# Note: this block sits outside the loop above, so it runs only for the last topic.
list_of_strings = item[1].split(' + ')
for text in list_of_strings:
    weight = text.split('*')[0]
    word = text.split('*')[1]
    print(word, '==>', str(round(float(weight) * 100, 2)) + '%')


Top 5 contributing words to each topic:

Topic 0

Topic 1
"first" ==> 1.5%
"day" ==> 1.5%
"littl" ==> 1.3%
"one" ==> 1.0%
"time" ==> 0.9%


LDA on posts with bodies 

In [752]:
reddit2 = reddit_body["body"]

In [753]:
reddit2 = reddit2.dropna()

In [754]:
reddit2

19      So if you can't tell from my username, I am a ...
47      FTP. On mobile. Sorry, this involved a lot of ...
59                      Hit r/all. So fuck you johnathon 
70      It might not be much to some people but I had ...
97      I've been seeing these characters :\n\n| ||\n\...
98      I wonder if "stirring music" in between dialog...
108     This will let you know exactly what you're get...
110     EDIT: Never thought it would blow up! Hope it ...
134     In high school I worked at a pizza place for t...
140     CAPS WIN SO LETS ALL CELEBRATE WITH FAKE INTER...
180     I've spent a few hours on it myself and it's a...
193     1) I want my remains spread around Disney Worl...
209     Everytime I find myself on this sub writing an...
240     I have spent the last several years trying to ...
254     **UPDATE:** Jesselyn and Andy out! Thanks a bu...
277     First things first, I want to say thank you to...
283     >Be me\n\n>DnD with the pals\n\n>A lvl 20 one-...
300     So I r

In [755]:
tokens = reddit2.apply(process)
dict_tokens = corpora.Dictionary(tokens)
doc_term_mat = [dict_tokens.doc2bow(token) for token in tokens]

In [756]:
tokens

19      [tell, usernam, huge, fan, booti, specif, fan,...
47      [ftp, mobil, sorri, involv, lot, necessari, de...
59                              [hit, r, fuck, johnathon]
70      [might, much, peopl, hope, 2, year, ago, asham...
97      [see, charact, _, pop, reddit, clue, mean, new...
98      [wonder, stir, music, dialogu, littl, mean, wo...
108                               [let, know, exact, get]
110       [edit, never, thought, would, blow, hope, help]
134     [high, school, work, pizza, place, two, year, ...
140     [cap, win, let, celebr, fake, internet, point,...
180     [spent, hour, uniqu, product, way, learn, help...
193     [1, want, remain, spread, around, disney, worl...
209     [everytim, find, sub, write, anoth, stori, wor...
240     [spent, last, sever, year, tri, continu, broad...
254     [updat, jesselyn, andi, thank, bunch, question...
277     [first, thing, first, want, say, thank, encour...
283     [dnd, pal, lvl, 20, one, shot, colosseum, styl...
300     [recen

In [757]:
num_topics = 1

In [758]:
ldamodel = models.ldamodel.LdaModel(doc_term_mat,
                                    num_topics=num_topics,
                                    id2word=dict_tokens,
                                    passes=50, alpha=1)

num_words = 5
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])

    # Print the contributing words along with their relative contributions
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        weight = text.split('*')[0]
        word = text.split('*')[1]
        print(word, '==>', str(round(float(weight) * 100, 2)) + '%')


Top 5 contributing words to each topic:

Topic 0
"get" ==> 0.6%
"com" ==> 0.6%
"like" ==> 0.5%
"https" ==> 0.5%
"0" ==> 0.5%

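The loop above splits each topic string by hand. As a minimal sketch, assuming the `weight*"word" + weight*"word"` format that the printed output suggests, the same parsing can be wrapped in a small helper. `parse_topic_string` is a hypothetical name (not part of gensim), and the sample string is made up to mirror the output above.

```python
def parse_topic_string(topic_string):
    """Split a 'weight*"word" + weight*"word"' string into (word, weight) pairs.

    Hypothetical helper: the input format is assumed from the printed output.
    """
    pairs = []
    for term in topic_string.split(' + '):
        weight, word = term.split('*', 1)
        # Drop surrounding whitespace and the double quotes around the word
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


# Illustrative string shaped like one entry of print_topics()
sample = '0.006*"get" + 0.006*"com" + 0.005*"like"'
for word, weight in parse_topic_string(sample):
    print(word, '==>', str(round(weight * 100, 2)) + '%')
```

gensim's `LdaModel` also has a `show_topic` method that returns (word, probability) tuples directly, which may be simpler than parsing strings.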

**Sentiment Analysis**

In [759]:
vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores(reddit.title[100]))

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
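To turn a score dictionary like the one above into a single label, one common approach is to threshold the `compound` value. The sketch below assumes the widely cited cutoff of ±0.05; `label_from_compound` is a hypothetical helper and the cutoff is an assumption, not something this notebook tuned.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'.

    The +/-0.05 cutoff is a commonly cited convention, treated here as an assumption.
    """
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# The score dict printed above for reddit.title[100]
example = {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
print(label_from_compound(example))
```

Applied to every title, a label like this could serve as a categorical feature alongside the raw compound score.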


In [760]:
# vader = SentimentIntensityAnalyzer()
# print(vader.polarity_scores(reddit_body["body"]))

In [761]:
# reddit_body.columns

<a id='CART'></a>
# CART Models

## Random Forest 

In [762]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier



In [763]:
reddit['title'].value_counts()/reddit.shape[0]

hmmm            0.004821
2meirl4meirl    0.003824
furry_irl       0.001496
gay_irl

Baseline

In [783]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6015 entries, 0 to 6014
Data columns (total 21 columns):
comms_num                            6015 non-null int64
num_crossposts                       6015 non-null int64
score                                6015 non-null int64
subreddit                            6015 non-null object
subreddit_subs                       6015 non-null int64
title                                6015 non-null object
thumbnail_size                       5428 non-null float64
engagement                           6015 non-null int64
hours                                6015 non-null int64
whitelist_status_all_ads             6015 non-null uint8
whitelist_status_house_only          6015 non-null uint8
whitelist_status_no_ads              6015 non-null uint8
whitelist_status_promo_adult         6015 non-null uint8
whitelist_status_promo_adult_nsfw    6015 non-null uint8
whitelist_status_promo_all           6015 non-null uint8
whitelist_status_promo_specified  

In [789]:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6015 entries, 0 to 6014
Data columns (total 21 columns):
comms_num                            6015 non-null int64
num_crossposts                       6015 non-null int64
score                                6015 non-null int64
subreddit                            6015 non-null object
subreddit_subs                       6015 non-null int64
title                                6015 non-null object
thumbnail_size                       6015 non-null float64
engagement                           6015 non-null int64
hours                                6015 non-null int64
whitelist_status_all_ads             6015 non-null uint8
whitelist_status_house_only          6015 non-null uint8
whitelist_status_no_ads              6015 non-null uint8
whitelist_status_promo_adult         6015 non-null uint8
whitelist_status_promo_adult_nsfw    6015 non-null uint8
whitelist_status_promo_all           6015 non-null uint8
whitelist_status_promo_specified  

In [790]:
reddit['engagement'].value_counts(normalize=True)

0    0.981047
1    0.018953
Name: engagement, dtype: float64
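With classes this imbalanced, the majority-class share is the accuracy any model has to beat. The sketch below rebuilds the label counts from the proportions above (an assumption: 114 of the 6015 rows are class 1, which rounds to 0.018953) and computes that baseline.

```python
from collections import Counter

# Class counts reconstructed from the proportions above (assumption:
# 114 of 6015 rows are class 1, which rounds to 0.018953).
labels = [0] * 5901 + [1] * 114

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print('Majority class:', majority_class)
print('Baseline accuracy:', round(baseline_accuracy, 6))
```

A model that always predicts class 0 would score about 98.1% accuracy, so accuracies near that figure say little on their own.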

In [791]:
features = ['comms_num', 'num_crossposts', 'score', 'subreddit_subs',
            'thumbnail_size', 'hours',
            'whitelist_status_all_ads', 'whitelist_status_house_only',
            'whitelist_status_no_ads', 'whitelist_status_promo_adult',
            'whitelist_status_promo_adult_nsfw', 'whitelist_status_promo_all',
            'whitelist_status_promo_specified', 'media_body', 'media_ext_link',
            'media_gif', 'media_img', 'media_video']

In [792]:
X = reddit[features]
y = reddit["engagement"]

In [793]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [794]:
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

1.0
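A perfect test score of 1.0 usually deserves a second look. One possible explanation, assuming `engagement` was created by thresholding `comms_num` (that step is not shown in this section), is that keeping `comms_num` in the feature list hands the tree the answer. The sketch below uses made-up comment counts and a hypothetical cutoff to show how a single rule on a leaked column reproduces the label exactly, then builds a feature list without that column.

```python
# Made-up comment counts and a hypothetical cutoff, for illustration only
comment_counts = [3, 120, 45, 980, 12, 2500, 67]
cutoff = 500  # assumption: not the threshold used in this project

# Target derived from the same column that is later used as a feature
engagement = [1 if n > cutoff else 0 for n in comment_counts]

# A one-rule "model" on the leaked column recovers the target perfectly
predictions = [1 if n > cutoff else 0 for n in comment_counts]
accuracy = sum(p == t for p, t in zip(predictions, engagement)) / len(engagement)
print('Accuracy with leaked column:', accuracy)

# Shortened copy of the notebook's feature list with the suspected leak removed
features = ['comms_num', 'num_crossposts', 'score', 'subreddit_subs',
            'thumbnail_size', 'hours']
leak_free_features = [f for f in features if f != 'comms_num']
print(leak_free_features)
```

If the assumption holds, refitting the tree-based models on the leak-free list would give a more honest picture of which post characteristics drive engagement.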

In [795]:
rf = RandomForestClassifier()
rf_params = {
    'n_estimators': [10],
    'max_features': [3, 4, 5],
    'max_depth': [None, 2, 3, 4]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.9995566393260917
{'max_depth': None, 'max_features': 5, 'n_estimators': 10}


**Extra Trees**

In [796]:
et = ExtraTreesClassifier()
et.fit(X_train, y_train)
et.score(X_test, y_test)

0.9920212765957447

**Bagging Classifier** 

In [799]:
bag = BaggingClassifier()
bag_params = {
    'n_estimators': range(10, 21)
}
gs = GridSearchCV(bag, param_grid=bag_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.9995566393260917
{'n_estimators': 11}


<a id='logit'></a>
# Logistic Regression

In [770]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [771]:
# Set X and y, then run the train/test split
X = reddit['title'].values
y = reddit['engagement']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [772]:
#Testing tvec
tvec = TfidfVectorizer(stop_words='english')
X_train_counts = tvec.fit_transform(X_train)
X_test_counts = tvec.transform(X_test)

In [773]:
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9753148614609571
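Because about 98% of posts fall in class 0, an accuracy near 0.975 may not mean the model finds any high-engagement posts at all. Precision and recall on the positive class are more informative here. The sketch below uses toy labels (not this dataset) and a hypothetical helper, `precision_recall`, to show how a model that always predicts 0 can look accurate while catching nothing.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 2 positives out of 100, and a model that always predicts 0
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print('accuracy:', accuracy, 'precision:', precision, 'recall:', recall)
```

On the real predictions, `precision_score` and `recall_score` from `sklearn.metrics` compute the same quantities, and passing `class_weight='balanced'` to `LogisticRegression` is one option worth trying for the imbalance.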

In [774]:
print('Logreg intercept:', log_reg.intercept_)
print('Logreg coef(s):', log_reg.coef_)

Logreg intercept: [-4.07824623]
Logreg coef(s): [[-0.02680607  0.08434043 -0.00870372 ... -0.00946206 -0.00946206
  -0.01065016]]


In [775]:
reddit.columns

Index(['comms_num', 'num_crossposts', 'score', 'subreddit', 'subreddit_subs',
       'title', 'thumbnail_size', 'engagement', 'hours',
       'whitelist_status_all_ads', 'whitelist_status_house_only',
       'whitelist_status_no_ads', 'whitelist_status_promo_adult',
       'whitelist_status_promo_adult_nsfw', 'whitelist_status_promo_all',
       'whitelist_status_promo_specified', 'media_body', 'media_ext_link',
       'media_gif', 'media_img', 'media_video'],
      dtype='object')

In [776]:
features = ['comms_num', 'num_crossposts', 'score', 'subreddit_subs',
            'thumbnail_size', 'hours',
            'whitelist_status_all_ads', 'whitelist_status_house_only',
            'whitelist_status_no_ads', 'whitelist_status_promo_adult',
            'whitelist_status_promo_adult_nsfw', 'whitelist_status_promo_all',
            'whitelist_status_promo_specified', 'media_body', 'media_ext_link',
            'media_gif', 'media_img', 'media_video']

In [777]:
X = reddit[features]
y = reddit['engagement']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [778]:
# log_reg = LogisticRegression()
# log_reg.fit(X, y)
# print('Logreg intercept:', log_reg.intercept_)
# print('Logreg coef(s):', log_reg.coef_)
# print('Logreg predicted probabilities:', log_reg.predict_proba(X.head(5)))

In [779]:
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9753148614609571

In [780]:
print('Logreg intercept:', log_reg.intercept_)
print('Logreg coef(s):', log_reg.coef_)

Logreg intercept: [-4.07824623]
Logreg coef(s): [[-0.02680607  0.08434043 -0.00870372 ... -0.00946206 -0.00946206
  -0.01065016]]


In [781]:
coefs = pd.DataFrame(log_reg.coef_)

In [782]:
coefs

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8699,8700,8701,8702,8703,8704,8705,8706,8707,8708
0,-0.026806,0.08434,-0.008704,-0.003118,-0.028163,-0.006881,-0.011064,-0.01537,-0.00573,-0.013036,...,-0.008315,-0.005835,-0.01501,-0.009462,-0.009462,-0.009462,-0.009462,-0.009462,-0.009462,-0.01065


In [689]:
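# The model was fit on TF-IDF features, so pair coefficients with the vectorizer's terms
# (get_feature_names_out() on recent scikit-learn; get_feature_names() on older versions)
# log_coefs = pd.DataFrame({'variable': tvec.get_feature_names_out(),
#                           'coef': log_reg.coef_[0],
#                           'abs_coef': np.abs(log_reg.coef_[0])})

# log_coefs.sort_values('abs_coef', inplace=True, ascending=False)

# log_coefs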
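The raw coefficient row is hard to read without the terms it belongs to. As a minimal sketch of the ranking the commented-out cell is aiming for, the helper below pairs names with weights and sorts by absolute value. `top_coefficients` is a hypothetical name, and the toy vocabulary and weights are made up; in the notebook the names would come from the fitted vectorizer and the weights from the first row of `log_reg.coef_`, which has shape (1, n_features).

```python
def top_coefficients(names, coefs, k=3):
    """Return the k (name, coef) pairs with the largest absolute coefficient."""
    paired = list(zip(names, coefs))
    return sorted(paired, key=lambda pair: abs(pair[1]), reverse=True)[:k]


# Toy vocabulary and weights standing in for the vectorizer terms and log_reg.coef_[0]
names = ['cat', 'meme', 'politics', 'update']
coefs = [-0.0268, 0.0843, -0.0087, 0.0312]

for name, coef in top_coefficients(names, coefs, k=2):
    print(name, round(coef, 4))
```

Sorting by absolute value surfaces both the strongest positive and the strongest negative terms, which is usually what an interpretation of title words needs.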

<a id='exec'></a>
# Executive Summary 

Often referred to as the front page of the internet, Reddit is an online social media and news discussion website. It is the “comments section” for any topic, issue, and interest imaginable. Whether it be news, politics, wholesome memes, cat gifs, or weed whacking, there is a place for everyone.

**Quick Stats on Reddit**
- 1.37M Daily Active Users and 18M Monthly Active Users 
- 82 Billion pages viewed on Reddit each year 
- 8th most popular website in the world
- Average website visit is 15 mins  

**Insights from Modeling**
- From the data we gathered, one of the greatest predictors of engagement was the presence of an image (and its thumbnail size).
- Sentiment analysis indicated that posts whose titles had a neutral or positive sentiment experienced the most engagement, with a fall-off point at a positive sentiment rating of 0.5.
- Best titles: either really short or very long.
- Newer posts fare better than older ones (re: Reddit’s ranking algorithm).
- Relevant (read: topical) posts rise through.
- People love wholesome memes and feel-good stories!