# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
2. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
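
The request above can be wrapped in a small helper. This is only a sketch: the function name `fetch_json` and the optional `get` parameter are assumptions made for illustration (the parameter lets a stub stand in for `requests.get` so the helper can be tried without network access).

```python
def fetch_json(url, user_agent, get=None):
    """Request `url` with a custom User-agent and return the decoded JSON."""
    if get is None:
        import requests  # imported lazily so a stub can be injected instead
        get = requests.get
    res = get(url, headers={'User-agent': user_agent})
    if res.status_code != 200:
        # 429 means the request was throttled; any other failure is surfaced too
        raise RuntimeError(f'request failed with status {res.status_code}')
    return res.json()
```

Calling `fetch_json(URL, 'YOUR NAME Bot 0.1')` would then return the same dictionary as `res.json()`, or raise if Reddit rejects the request.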

In [164]:
import requests
import pandas as pd
import numpy as np
import re
import json
import praw
import pprint                    # from PRAW docs
from psaw import PushshiftAPI    # PSAW recommended by following PRAW errors
import datetime as dt            # PSAW docs
import time

In [165]:
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )

In [166]:
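# Credentials are read from environment variables instead of being hard-coded;
# the variable names below are placeholders, set them in your own shell.
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='ClassProjectBot-PRAW/PSAW',
                     password=os.environ['REDDIT_PASSWORD'],
                     username=os.environ['REDDIT_USERNAME'])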

In [167]:
api = PushshiftAPI()

In [168]:
print(reddit.read_only)
print(reddit.user.me())

False
refused_dev


In [169]:
# assume you have a Reddit instance bound to variable `reddit`
askhist = reddit.subreddit('AskHistorians')
r_all = reddit.subreddit('all')
moto = reddit.subreddit('motorcycles')
    

In [190]:
subs = []
# coms = []
# subcom = []
for sub in r_all.hot(limit=2000):
    _ = {}
    _['submissions'] = sub
#     _['comments'] = sub.comments
    subs.append(_)
#     coms.append(sub.comments)
#     subs.append(sub)

In [191]:
# len(subcom)

# subcom

len(subs)

2000

In [198]:
# df = pd.DataFrame(subcom)

df = pd.DataFrame(subs)
# df.rename(columns={0:'subs'}, inplace=True)

# df.drop(columns='comments', inplace=True)

df.head(), df.shape

(  submissions
 0      8nx8rk
 1      8nxkyp
 2      8nxocr
 3      8nxb6k
 4      8nwxos, (2000, 1))

In [199]:
for x in subs:
    print(type(x))

<class 'dict'>
[... the same line repeats for each remaining element of `subs` ...]

In [193]:
df.to_csv('2000subs.csv')

In [208]:
weds = pd.read_csv('2500subs.csv')
thurs = pd.read_csv('9199subs.csv')
fri = pd.read_csv('9068subs.csv')
sat = pd.read_csv('2000subs.csv')

# rename subs column to submissions
weds.rename(columns={"subs" : "submissions"}, inplace=True)
df = pd.concat([weds, thurs, fri, sat])
df.drop(columns='Unnamed: 0', inplace=True)

In [222]:
df.shape

(22767, 1)

In [223]:
# using this for the loaded csv ids
sublist = []
for x in df['submissions']:
    sub = reddit.submission(id=x)
    subdict = {}
    subdict['title'] = sub.title
    subdict['comments'] = sub.num_comments
    subdict['crossposts'] = sub.num_crossposts
    subdict['score'] = sub.score
    subdict['subreddit'] = sub.subreddit
    subdict['domain'] = sub.domain
    subdict['gilded'] = sub.gilded
    subdict['upvote_ratio'] = sub.upvote_ratio
    subdict['created'] = sub.created
    sublist.append(subdict)

In [224]:
df = pd.DataFrame(sublist)

In [226]:
df.shape

(22767, 9)

In [227]:
df.to_csv('df_w_feats.csv')

In [213]:

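# NOTE: iterating over a DataFrame yields its column labels (plain strings), not
# submission objects, so `c.num_comments` raises the AttributeError shown below.
# Each id has to be turned into a submission with reddit.submission(id=...) first.
sublist = []
for c in df: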
    subdict = {}
    subdict['title'] = c.title
    subdict['comments'] = c.num_comments
    subdict['crossposts'] = c.num_crossposts
    subdict['score'] = c.score
    subdict['subreddit'] = c.subreddit
    subdict['domain'] = c.domain
    subdict['gilded'] = c.gilded
    subdict['upvote_ratio'] = c.upvote_ratio
    subdict['created'] = c.created
    sublist.append(subdict)
df = pd.DataFrame(sublist)
df.to_csv(f'{df}_feats.csv')

AttributeError: 'str' object has no attribute 'num_comments'

In [211]:
getfeats(weds)
getfeats(thurs)
getfeats(fri)
getfeats(sat)

AttributeError: 'str' object has no attribute 'num_comments'

In [176]:
len(fri_sublist)

9068

In [51]:
# for c in df['comments']:
# #     print(c.score)
# #     print(c.num_comments)
# #     print(c.num_crossposts)
#     print(c.list())
# #     print(c.)
#     print('\n'*4)

In [177]:
data = pd.DataFrame(fri_sublist)
data.head()

Unnamed: 0,comments,created,crossposts,domain,gilded,score,subreddit,title,upvote_ratio
0,714,1527898000.0,3,i.redd.it,0,37263,mildlyinteresting,This news paper from the Dominican republic us...,0.92
1,2008,1527900000.0,1,v.redd.it,0,26487,funny,Kim k forgets her baby,0.84
2,285,1527896000.0,1,i.redd.it,0,17846,wholesomememes,I love Mick Jagger,0.97
3,717,1527896000.0,2,washingtonpost.com,0,13257,politics,Trump’s spent far more going to Mar-a-Lago alo...,0.93
4,521,1527897000.0,1,i.redd.it,0,18990,pics,I’m super late to the trend but I too was insp...,0.83


In [178]:
data.shape

(9068, 9)

In [133]:
# # try out created instead of created_utc and add it to the data df

# createdlist = []
# for i in df['submissions']:
#     createdlist.append(i.created)

# len(createdlist) # right number of observations

# createdlist # looks good

# data['created'] = createdlist

# data.head()

In [182]:
# convert the new created column from a Unix timestamp to a datetime

# function to get date
# (without a tz argument, fromtimestamp returns the machine's local time)
def get_date(created):
    return dt.datetime.fromtimestamp(created)
stamp = data['created'].apply(get_date)
data = data.assign(timestamp = stamp)


In [183]:
data.head() 
# we have a proper timestamp, but fromtimestamp returned it in the machine's local
# time zone rather than UTC; converting it explicitly to EST will have
# more relevance for the market 

Unnamed: 0,comments,created,crossposts,domain,gilded,score,subreddit,title,upvote_ratio,timestamp
0,714,1527898000.0,3,i.redd.it,0,37263,mildlyinteresting,This news paper from the Dominican republic us...,0.92,2018-06-01 19:58:46
1,2008,1527900000.0,1,v.redd.it,0,26487,funny,Kim k forgets her baby,0.84,2018-06-01 20:37:07
2,285,1527896000.0,1,i.redd.it,0,17846,wholesomememes,I love Mick Jagger,0.97,2018-06-01 19:40:31
3,717,1527896000.0,2,washingtonpost.com,0,13257,politics,Trump’s spent far more going to Mar-a-Lago alo...,0.93,2018-06-01 19:37:04
4,521,1527897000.0,1,i.redd.it,0,18990,pics,I’m super late to the trend but I too was insp...,0.83,2018-06-01 19:43:01


In [None]:
# # drop the other date and time columns and convert the timestamp to EST
# # EST is 5 hours behind UTC

# data.drop(columns=['created_utc','datetime','year','month','date','created'],
#          inplace = True)

In [184]:
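# get the dates
# NOTE: each assignment below writes one scalar to the entire column, so after the
# loop every row holds the values parsed from the last timestamp (see the `time`
# column in the output further down). The local name `time` also shadows the
# imported `time` module.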
datelist = []
for date in data['timestamp']:
    date = str(date)
    year = int(re.match(r"(\d{4})-(\d{2})-(\d{2})\s(.{8})", date).group(1))
    month = int(re.match(r"(\d{4})-(\d{2})-(\d{2})\s(.{8})", date).group(2))
    day = int(re.match(r"(\d{4})-(\d{2})-(\d{2})\s(.{8})", date).group(3))
    time = re.match(r"(\d{4})-(\d{2})-(\d{2})\s(.{8})", date).group(4)
    
    data['year'] = year
    data['month'] = month
    data['day'] = day
    data['time'] = time
    data['time'] = pd.to_datetime(data['time'])
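
Date parts can also be derived for every row at once, without a loop or a regular expression. The sketch below uses a made-up two-row frame (the name `frame` and its values are assumptions for illustration) and relies on `pd.to_datetime` with `unit='s'` plus the `.dt` accessor.

```python
import pandas as pd

# Toy frame with Unix timestamps in seconds (values are made up)
frame = pd.DataFrame({'created': [1527896000.0, 1527900000.0]})

# Convert the whole column at once; unit='s' reads the values as epoch seconds
frame['timestamp'] = pd.to_datetime(frame['created'], unit='s', utc=True)

# The .dt accessor computes each part per row, so rows keep their own values
frame['year'] = frame['timestamp'].dt.year
frame['month'] = frame['timestamp'].dt.month
frame['day'] = frame['timestamp'].dt.day
frame['time'] = frame['timestamp'].dt.time
```

Because every column is computed from its own row, two posts created on different days end up with different `day` values.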
    

In [185]:
data.dtypes

comments                 int64
created                float64
crossposts               int64
domain                  object
gilded                   int64
score                    int64
subreddit               object
title                   object
upvote_ratio           float64
timestamp       datetime64[ns]
year                     int64
month                    int64
day                      int64
time            datetime64[ns]
dtype: object

In [188]:
data.head()

Unnamed: 0,comments,created,crossposts,domain,gilded,score,subreddit,title,upvote_ratio,timestamp,year,month,day,time
0,714,1527898000.0,3,i.redd.it,0,37263,mildlyinteresting,This news paper from the Dominican republic us...,0.92,2018-06-01 19:58:46,2018,6,1,2018-06-01 20:53:55
1,2008,1527900000.0,1,v.redd.it,0,26487,funny,Kim k forgets her baby,0.84,2018-06-01 20:37:07,2018,6,1,2018-06-01 20:53:55
2,285,1527896000.0,1,i.redd.it,0,17846,wholesomememes,I love Mick Jagger,0.97,2018-06-01 19:40:31,2018,6,1,2018-06-01 20:53:55
3,717,1527896000.0,2,washingtonpost.com,0,13257,politics,Trump’s spent far more going to Mar-a-Lago alo...,0.93,2018-06-01 19:37:04,2018,6,1,2018-06-01 20:53:55
4,521,1527897000.0,1,i.redd.it,0,18990,pics,I’m super late to the trend but I too was insp...,0.83,2018-06-01 19:43:01,2018,6,1,2018-06-01 20:53:55


In [189]:
data.drop(columns='created', inplace=True)
data.to_csv('9068df_with_time.csv')

In [None]:
epoch = 1527898000  # example Unix timestamp in seconds (hypothetical value)
dt.datetime.fromtimestamp(epoch, tz=dt.timezone.utc)  # timezone-aware UTC datetime

In [116]:
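# NOTE: assigning a scalar to data[...] inside the loop overwrites the whole column,
# so every row ends up with the values from the last timestamp (visible below).
for t in data['created_utc']: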
    data['datetime'] = dt.datetime.utcfromtimestamp(t).replace(tzinfo=dt.timezone.utc)
    parsed = dt.datetime.utcfromtimestamp(t)
    data['year'] = parsed.year
    data['month'] = parsed.month
    data['date'] = parsed.day
    

In [117]:
data.head().sort_values('comments', ascending=False)

Unnamed: 0,comments,created_utc,crossposts,domain,gilded,score,subreddit,title,upvote_ratio,datetime,year,month,date
0,3390,1527774000.0,1,cnbc.com,0,18693,worldnews,Trump administration will put steel and alumin...,0.94,2018-05-31 13:12:55+00:00,2018,5,31
9,3299,1527765000.0,1,bbc.co.uk,0,20712,soccer,Zinedine Zidane steps down from Real Madrid,0.92,2018-05-31 13:12:55+00:00,2018,5,31
1,2665,1527773000.0,1,cnbc.com,0,19114,politics,Trump will pardon conservative pundit Dinesh D...,0.95,2018-05-31 13:12:55+00:00,2018,5,31
6,1431,1527775000.0,3,reuters.com,1,6623,news,"U.S. hits EU, Canada and Mexico with steel, al...",0.96,2018-05-31 13:12:55+00:00,2018,5,31
21,883,1527765000.0,0,youtube.com,0,13129,television,"'Fillmore!', a kid-friendly parody of police d...",0.88,2018-05-31 13:12:55+00:00,2018,5,31
7,681,1527768000.0,1,en.wikipedia.org,0,15581,todayilearned,TIL that the song 'Africa' by Toto is actually...,0.92,2018-05-31 13:12:55+00:00,2018,5,31
24,519,1527765000.0,0,i.imgur.com,0,17486,PeopleFuckingDying,pSyChOTIc MAn BuRNS WIFeS fAcE Off WiTh ScAlDi...,0.84,2018-05-31 13:12:55+00:00,2018,5,31
3,511,1527772000.0,2,gfycat.com,0,13917,INEEEEDIT,Simple locking hinge to convert a door corner ...,0.89,2018-05-31 13:12:55+00:00,2018,5,31
8,480,1527768000.0,0,i.redd.it,0,12356,BlackPeopleTwitter,These are his truths,0.94,2018-05-31 13:12:55+00:00,2018,5,31
2,456,1527771000.0,6,i.imgur.com,1,34677,interestingasfuck,Amazing animation art,0.91,2018-05-31 13:12:55+00:00,2018,5,31


In [120]:
data['datetime']

0       2018-05-31 13:12:55+00:00
1       2018-05-31 13:12:55+00:00
2       2018-05-31 13:12:55+00:00
3       2018-05-31 13:12:55+00:00
4       2018-05-31 13:12:55+00:00
5       2018-05-31 13:12:55+00:00
6       2018-05-31 13:12:55+00:00
7       2018-05-31 13:12:55+00:00
8       2018-05-31 13:12:55+00:00
9       2018-05-31 13:12:55+00:00
10      2018-05-31 13:12:55+00:00
11      2018-05-31 13:12:55+00:00
12      2018-05-31 13:12:55+00:00
13      2018-05-31 13:12:55+00:00
14      2018-05-31 13:12:55+00:00
15      2018-05-31 13:12:55+00:00
16      2018-05-31 13:12:55+00:00
17      2018-05-31 13:12:55+00:00
18      2018-05-31 13:12:55+00:00
19      2018-05-31 13:12:55+00:00
20      2018-05-31 13:12:55+00:00
21      2018-05-31 13:12:55+00:00
22      2018-05-31 13:12:55+00:00
23      2018-05-31 13:12:55+00:00
24      2018-05-31 13:12:55+00:00
25      2018-05-31 13:12:55+00:00
26      2018-05-31 13:12:55+00:00
27      2018-05-31 13:12:55+00:00
28      2018-05-31 13:12:55+00:00
29      2018-0

In [48]:
# df.drop(columns='created_utc', inplace=True)
# df.head()

Unnamed: 0,comments,crossposts,domain,gilded,score,subreddit,title,upvote_ratio,year,month,date,time,datetime
0,205,3,i.redd.it,1,10941,aww,My landlord was replacing our sink and sent me...,0.98,2018,5,30,<built-in method timetuple of datetime.datetim...,2018-05-30 14:45:38+00:00
1,555,6,gfycat.com,1,33262,gifs,Built-in Lego Wall,0.94,2018,5,30,<built-in method timetuple of datetime.datetim...,2018-05-30 14:45:38+00:00
2,1560,2,i.imgur.com,1,44266,pics,Nothing about this picture has aged well.,0.88,2018,5,30,<built-in method timetuple of datetime.datetim...,2018-05-30 14:45:38+00:00
3,156,5,i.imgur.com,0,11682,educationalgifs,How to cross street during marathon,0.96,2018,5,30,<built-in method timetuple of datetime.datetim...,2018-05-30 14:45:38+00:00
4,4045,4,alexa.com,7,68139,technology,Reddit just passed Facebook as #3 most popular...,0.9,2018,5,30,<built-in method timetuple of datetime.datetim...,2018-05-30 14:45:38+00:00


In [91]:
rdf = df[['comments','datetime','title','subreddit']]

KeyError: "['datetime'] not in index"

In [31]:
subs=str(subs)

In [37]:
ids = re.findall("id='(.{6})", subs)
ids

['8n25ux',
 '8n14rc',
 '8n0tgf',
 '8n13k3',
 '8n1541',
 '8n0nyy',
 '8n0zsc',
 '8n0nvo',
 '8n136j',
 '8n1q7q',
 '8n07k9',
 '8n04vf',
 '8n00qa',
 '8n073m',
 '8mzsj8',
 '8mzwj0',
 '8n1orp',
 '8n031q',
 '8mzmgw',
 '8mzkfw',
 '8n0cf6',
 '8mzwn9',
 '8mzo3e',
 '8n14ob',
 '8n0com',
 '8n04zm',
 '8n1dbx',
 '8mzfko',
 '8mzt8a',
 '8mzsik',
 '8mzetu',
 '8mzvtc',
 '8n14fw',
 '8mzb7j',
 '8mzfgj',
 '8mzqx3',
 '8mz96h',
 '8mzxu6',
 '8n0f3z',
 '8mz7v0',
 '8mzagh',
 '8mzm4x',
 '8mz3sf',
 '8n02pw',
 '8n0o7h',
 '8n24k8',
 '8mzczb',
 '8n0nwv',
 '8mzdr0',
 '8n106h',
 '8mz919',
 '8myp0j',
 '8myki0',
 '8mzmw9',
 '8n09bd',
 '8myq5x',
 '8mysds',
 '8n015b',
 '8mygfi',
 '8mzgos',
 '8mz7zv',
 '8n0sjj',
 '8mz0ju',
 '8mylc9',
 '8n15di',
 '8myr2b',
 '8n0gpw',
 '8n1zys',
 '8mzr5q',
 '8n1pqa',
 '8myj9e',
 '8mz3rf',
 '8n1iea',
 '8mz8fx',
 '8myoj7',
 '8mzerj',
 '8n0kmx',
 '8n04x1',
 '8myrp7',
 '8myu68',
 '8n1pek',
 '8mz1q6',
 '8n20hu',
 '8n0aam',
 '8n06f8',
 '8mzcy9',
 '8mz15c',
 '8mz6m8',
 '8n0jnf',
 '8mz70o',
 '8myjgi',

In [38]:
# use the list of ids to grab each submission, and then its comments

submission = reddit.submission(id=ids[0])

In [50]:
vars(submission.comments)

{'_comments': [Comment(id='dzs7fcf'),
  Comment(id='dzs7ac7'),
  Comment(id='dzs7glr'),
  Comment(id='dzs88iy'),
  Comment(id='dzs7vin'),
  Comment(id='dzs8cy1'),
  Comment(id='dzs8ogf'),
  Comment(id='dzs76we'),
  Comment(id='dzs8e48'),
  Comment(id='dzs87pc'),
  Comment(id='dzs7516'),
  Comment(id='dzsaox6'),
  Comment(id='dzsaen0'),
  Comment(id='dzsaqzo'),
  Comment(id='dzs7x87'),
  Comment(id='dzs898w'),
  Comment(id='dzsarc2'),
  Comment(id='dzscpmc'),
  Comment(id='dzs7xa6'),
  Comment(id='dzs7shz'),
  Comment(id='dzsaj7r'),
  Comment(id='dzsa2e0'),
  Comment(id='dzsaq9d'),
  Comment(id='dzsdgi6'),
  Comment(id='dzs6h7j'),
  Comment(id='dzsdrzt'),
  Comment(id='dzsbumf'),
  Comment(id='dzse4e3'),
  Comment(id='dzseb0t'),
  Comment(id='dzsedqm'),
  Comment(id='dzsee1u'),
  Comment(id='dzsee97'),
  Comment(id='dzsegdq'),
  Comment(id='dzsfrln'),
  Comment(id='dzs8zew'),
  Comment(id='dzsbcdq'),
  Comment(id='dzse3m7'),
  Comment(id='dzs8pv3'),
  Comment(id='dzsdxcr'),
  Comment(id

In [54]:
comments = []
for x in submission.comments:
    print(x.depth)
#     cd = {}
#     cd['title']
#     cd['subreddit'] = x.subreddit
#     cd['duration']
#     cd['score']
#     cd['thread_size']

0
[... the same value repeats for every remaining comment in the loop ...]


In [46]:
submission = str(subs[0])
# submission.split("',")
submission

'submission(author=\'SlightlyHigh_\', author_flair_css_class=None, author_flair_text=None, brand_safe=True, contest_mode=False, created_utc=1485020019, domain=\'self.television\', edited=1485020204, full_link=\'https://www.reddit.com/r/television/comments/5pbq24/should_i_watch_justified/\', id=\'5pbq24\', is_self=True, locked=False, num_comments=26, over_18=False, permalink=\'/r/television/comments/5pbq24/should_i_watch_justified/\', retrieved_on=1489459057, score=0, selftext="I\'m looking for a great drama to get into. For reference my favorite shows are The Americans, Fargo, Mad Men, GoT, The Sopranos, and Westworld.", spoiler=False, stickied=False, subreddit=\'television\', subreddit_id=\'t5_2qh6e\', suggested_sort=None, thumbnail=\'self\', title=\'Should I watch Justified?\', url=\'https://www.reddit.com/r/television/comments/5pbq24/should_i_watch_justified/\', created=1485034419.0, d_={\'author\': \'SlightlyHigh_\', \'author_flair_css_class\': None, \'author_flair_text\': None, \'

In [25]:
vars(r_all)

{'_reddit': <praw.reddit.Reddit at 0x116f0d358>,
 '_comments': None,
 '_fetched': False,
 '_info_params': {},
 'display_name': 'all',
 '_banned': None,
 '_contributor': None,
 '_filters': None,
 '_flair': None,
 '_emoji': None,
 '_mod': None,
 '_moderator': None,
 '_modmail': None,
 '_muted': None,
 '_quarantine': None,
 '_stream': None,
 '_stylesheet': None,
 '_wiki': None,
 '_path': 'r/all/'}

In [16]:
# using the PSAW wrapper to grab submissions and posts

start_date= int(dt.datetime(2017, 1, 1).timestamp())

comments = list(api.search_comments(after = start_date,
#                         subreddit='AskHistorians',
                        filter = ['author', 
                                  'title', 
                                  'body', 
                                  'subreddit',
                                  'score'],
                        limit = 1000,
                        sort = "desc",
                        sort_type = 'score'
                        )
    )

subs = list(api.search_submissions(after = start_date,
        #                            subreddit = 'r/all',
                                   field = 'title',
                                   sort_type = 'score', 
                                   limit = 10000))

len(subs)

In [23]:
# The title of the thread
# The subreddit that the thread corresponds to
# The length of time it has been up on Reddit
# The number of comments on the thread

for sub in r_all.hot(limit=10):
    title = sub.title
    
    

AttributeError: 'Subreddit' object has no attribute 'get_top_from_month'

In [8]:
for sub in askhist.hot(limit=25):
    print(sub.title)


{'_banned': None,
 '_comments': None,
 '_contributor': None,
 '_emoji': None,
 '_fetched': False,
 '_filters': None,
 '_flair': None,
 '_info_params': {},
 '_mod': None,
 '_moderator': None,
 '_modmail': None,
 '_muted': None,
 '_path': 'r/AskHistorians/',
 '_quarantine': None,
 '_reddit': <praw.reddit.Reddit object at 0x10d15bd68>,
 '_stream': None,
 '_stylesheet': None,
 '_wiki': None,
 'display_name': 'AskHistorians'}


In [12]:
# Get acquainted with the variables associated with each item:

pprint.pprint(vars(askhist))
for sub in r_all.hot(limit=1):
    pprint.pprint(vars(sub))
    print('\n'*4)
    for comment in sub.comments:
        print('\n'*4)
        pprint.pprint(vars(comment))

{'_banned': None,
 '_comments': None,
 '_contributor': None,
 '_emoji': None,
 '_fetched': False,
 '_filters': None,
 '_flair': None,
 '_info_params': {},
 '_mod': None,
 '_moderator': None,
 '_modmail': None,
 '_muted': None,
 '_path': 'r/AskHistorians/',
 '_quarantine': None,
 '_reddit': <praw.reddit.Reddit object at 0x10d15bd68>,
 '_stream': None,
 '_stylesheet': None,
 '_wiki': None,
 'display_name': 'AskHistorians'}


NameError: name 'r_all' is not defined

In [None]:
URL = "http://www.reddit.com/hot.json"

In [None]:
## YOUR CODE HERE

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```
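
Each post sits under `data['data']['children']`, with its fields in a nested `data` dictionary. A minimal sketch of pulling out the four required fields follows; the helper name `extract_post`, the output key names, and the example values are assumptions, while `title`, `subreddit`, `created_utc`, and `num_comments` are keys of Reddit's listing JSON.

```python
import time

def extract_post(child, now=None):
    """Pull the four required fields out of one entry of data['data']['children']."""
    post = child['data']
    now = time.time() if now is None else now
    return {
        'title': post['title'],
        'subreddit': post['subreddit'],
        'age_seconds': now - post['created_utc'],  # how long the thread has been up
        'num_comments': post['num_comments'],
    }

# Synthetic entry shaped like one listing child (values are made up)
example = {'kind': 't3', 'data': {'title': 'Example thread', 'subreddit': 'pics',
                                  'created_utc': 1527896000.0, 'num_comments': 42}}
row = extract_post(example, now=1527899600.0)
```

Applying the helper to every child, for example `[extract_post(c) for c in data['data']['children']]`, gives a list of dictionaries that `pd.DataFrame` accepts directly.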

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do the following:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
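
The three steps and the pause can be combined into one loop. The sketch below is one possible shape, not the required solution: `collect_posts` and `fetch_page` are hypothetical names, and `fetch_page(after)` is assumed to request the `hot.json` URL with the given `after` value (or without it when `after` is `None`) and return the parsed JSON.

```python
import time

def collect_posts(fetch_page, n_posts, pause=3):
    """Follow the `after` token until n_posts are collected or pages run out."""
    posts, after = [], None
    while len(posts) < n_posts:
        page = fetch_page(after)                 # one request per iteration
        children = page['data']['children']
        posts.extend(children)
        after = page['data']['after']
        if after is None or not children:        # nothing further to request
            break
        time.sleep(pause)                        # stay within the rate limit
    return posts[:n_posts]
```

Passing the page fetcher in as an argument keeps the loop easy to try out against canned pages before pointing it at the live site.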

In [24]:
## YOUR CODE HERE


## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [None]:
## YOUR CODE HERE

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
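
One way to save as you go is to append each batch of rows to the same file. The sketch below uses the standard-library `csv` module rather than pandas; the helper name `append_rows` and the column names in the usage note are assumptions.

```python
import csv
import os

def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling `append_rows('redditscrape.csv', batch, ['title', 'subreddit', 'num_comments'])` after each page means a crash loses at most the page in progress.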

In [49]:
# Export to csv
df.to_csv('redditscrape.csv')

## Predicting comments using Random Forests + Another Classifier

#### Load in the data of scraped results

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 
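
As a rough illustration of the median split, the sketch below builds the binary target on a made-up frame. The column names `comments` and `high_comments` are assumptions; swap in whatever your scraped data uses.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are made up)
df = pd.DataFrame({'comments': [3, 10, 25, 40, 500]})

median_comments = df['comments'].median()
# 1 when a thread has more comments than the median, else 0
df['high_comments'] = (df['comments'] > median_comments).astype(int)
```

Splitting on another cut point, such as the 75th percentile, only requires replacing `.median()` with `.quantile(0.75)`.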

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
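
One common definition of baseline accuracy is the score obtained by always predicting the most frequent class. The helper below is a sketch of that definition (the name `baseline_accuracy` is an assumption). With a strict "above the median" split, threads tied at the median fall into the low class, so this baseline is never below 0.5.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels: three LOW (0) and two HIGH (1) threads
example_baseline = baseline_accuracy([0, 0, 0, 1, 1])
```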

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 
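
A minimal sketch of a subreddit-only model, assuming a frame with hypothetical columns `subreddit` and `high_comments` (the toy values are made up). The subreddit is one-hot encoded because scikit-learn's forest expects numeric input.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the scraped threads (values are made up)
df = pd.DataFrame({
    'subreddit': ['pics', 'pics', 'news', 'news', 'aww', 'aww'],
    'high_comments': [1, 1, 0, 0, 1, 0],
})

# One-hot encode the subreddit so the forest receives numeric columns
X = pd.get_dummies(df['subreddit'])
y = df['high_comments']

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
predictions = model.predict(X)
```

Predicting on the training rows only checks that the pipeline runs; a held-out split or cross-validation is needed for an honest score.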

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use `CountVectorizer` to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.
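
The two kinds of title features can be sketched as follows, using three made-up titles. The variable names are assumptions, and `stop_words='english'` is one optional choice rather than a requirement.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

titles = pd.Series(['My cat is funny', 'Funny dog video', 'Election news today'])

# Hand-made indicator features for specific words (substring match, case-insensitive)
has_cat = titles.str.contains('cat', case=False).astype(int)
has_funny = titles.str.contains('funny', case=False).astype(int)

# Bag-of-words features over the words in the titles
vectorizer = CountVectorizer(stop_words='english')
word_counts = vectorizer.fit_transform(titles)   # sparse matrix, one row per title
vocabulary = sorted(vectorizer.vocabulary_)
```

Note that a plain substring match on `'cat'` would also fire on words such as "category", so a word-boundary pattern may be preferable on real data.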

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
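
A sketch of cross-validating a text pipeline with `cross_val_score`, using eight made-up titles and labels. Putting the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary leaks from the held-out fold.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy titles and HIGH/LOW labels (made up for illustration)
titles = ['funny cat video', 'cat does a flip', 'funny dog fails', 'my cat is funny',
          'senate passes bill', 'stock market update', 'new tax policy', 'city council vote']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(CountVectorizer(),
                         RandomForestClassifier(n_estimators=50, random_state=42))

# One accuracy score per fold; other metrics can be requested via `scoring`
scores = cross_val_score(pipeline, titles, labels, cv=4, scoring='accuracy')
mean_accuracy = scores.mean()
```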

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.
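
Multinomial Naive Bayes is one possible non-tree-based choice for word-count features; logistic regression would be another. The sketch below uses made-up titles and labels and is not meant as the expected answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy titles and HIGH/LOW labels (made up for illustration)
titles = ['funny cat video', 'my cat is funny', 'senate passes bill', 'new tax policy']
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(titles, labels)

# Class probabilities for an unseen title, ordered as in model.classes_
probabilities = model.predict_proba(['funny cat'])
```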

In [None]:
## YOUR CODE HERE

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
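
The difference between count and binary features is easiest to see on a tiny example. In the sketch below (titles are made up), `binary=True` records only whether a word appears, not how often.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ['dog dog dog', 'dog cat']

# Columns are ordered alphabetically by word: 'cat', then 'dog'
counts = CountVectorizer().fit_transform(titles).toarray()
flags = CountVectorizer(binary=True).fit_transform(titles).toarray()
```

Here `counts` keeps the three occurrences of "dog" in the first title, while `flags` caps every entry at 1.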

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE

In [3]:
# !conda install -c conda-forge praw -y
