## Pushshift

[Pushshift.io](https://pushshift.io/) is a public archive of all Reddit activity developed by Jason Baumgartner. Its data files are hosted at https://files.pushshift.io/reddit/

* If you go to https://files.pushshift.io/reddit/comments/ you can see all the comment files that Pushshift currently hosts.

* The URL for a comment file is obtained by appending its filename to the comments URL above. For example, 
the Reddit comment file for Feb 2006 is https://files.pushshift.io/reddit/comments/RC_2006-02.bz2
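
The naming rule above can be sketched as a small helper. This is only an illustration: `comment_file_url` and `BASE_URL` are hypothetical names, and the `bz2` extension is an assumption taken from the Feb 2006 example, so check the file listing for the month you need.

```python
BASE_URL = "https://files.pushshift.io/reddit/comments/"


def comment_file_url(year, month, extension="bz2"):
    """Build the URL of a monthly comment file (hypothetical helper).

    The extension defaults to the one seen in the Feb 2006 example;
    it is an assumption for any other month.
    """
    # File names follow the RC_YYYY-MM pattern from the example above.
    return f"{BASE_URL}RC_{year:04d}-{month:02d}.{extension}"


print(comment_file_url(2006, 2))
```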

It's quite simple to download files from Pushshift. This tutorial shows how to download and parse an example Reddit comments file from Pushshift.

* Another fun fact: if you face any problem with the Pushshift data, you can post your question on the Reddit community: https://www.reddit.com/r/pushshift/

This is what makes social media great!! You have strangers helping you with your coding and data issues. Plus, the maintainer of Pushshift (Jason B) is very active on `r/pushshift`.

### Download a file from Pushshift

We are going to fetch the file directly using Python code. 
There are several ways of doing this; we will use `urllib`. Check the [docs](https://docs.python.org/3/howto/urllib2.html) to learn more.

In [1]:
# We are going to fetch the file directly using Python code.
# There are several ways of doing this; we will use urllib
from urllib.request import urlretrieve

# All comment files are listed at "https://files.pushshift.io/reddit/comments/".
# The URL for a comment file is obtained by appending its filename to that URL. For example,
# Reddit comment file for Feb 2006 - https://files.pushshift.io/reddit/comments/RC_2006-02.bz2

# Retrieve the file using the comment file URL
urlretrieve('https://files.pushshift.io/reddit/comments/RC_2006-02.bz2', filename='./example_reddit_comments.bz2')

# Check that 'example_reddit_comments.bz2' is in your current directory

('./example_reddit_comments.bz2', <http.client.HTTPMessage at 0x7ff27854cc40>)
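
Since these downloads can be large, it may help to skip the request when the file is already on disk. The sketch below wraps `urlretrieve` in a hypothetical helper, `download_if_missing`; the name and the "non-empty file means complete" check are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists (hypothetical helper).

    Returns (path, downloaded); downloaded is False when the file was reused.
    Assumption: a non-empty existing file is treated as a complete download.
    """
    path = Path(dest)
    if path.exists() and path.stat().st_size > 0:
        return path, False
    # Same call as in the cell above, only reached when the file is absent.
    urlretrieve(url, filename=str(path))
    return path, True
```

For example, `download_if_missing('https://files.pushshift.io/reddit/comments/RC_2006-02.bz2', './example_reddit_comments.bz2')` would reuse the file fetched above instead of downloading it again.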

### Parsing a "bz2" format file

In [2]:
# We need Python's bz2 and json modules
import bz2, json

counter = 0
# Read the first 4 lines as JSON objects and print them, then stop
with bz2.BZ2File('./example_reddit_comments.bz2', "r") as fp:
    for line in fp:
        counter = counter + 1
        if counter < 5:
            job = json.loads(line)
            print(job)
        else:
            break

{'created_utc': 1138752114, 'author_flair_css_class': None, 'score': 0, 'ups': 0, 'subreddit': 'reddit.com', 'stickied': False, 'link_id': 't3_15xh', 'subreddit_id': 't5_6', 'body': 'THAN the title suggests.  Whoops.', 'controversiality': 1, 'retrieved_on': 1473820870, 'distinguished': None, 'gilded': 0, 'id': 'c166b', 'edited': False, 'parent_id': 't3_15xh', 'author': 'gmcg', 'author_flair_text': None}
{'author_flair_text': None, 'author': 'joshuaknox', 'id': 'c166d', 'parent_id': 't3_15tx', 'edited': False, 'gilded': 0, 'retrieved_on': 1473820870, 'distinguished': None, 'controversiality': 0, 'body': "Thank you, willis3000.  This seems to be bunk:  self-discipline doesn't standard-deviation out well.  How do you measure it?   It, unlike IQ, is highly subjective, and non-controversial.  \r\n\r\nPerhaps more importantly, a two year study of eighth graders is just crap.   If they check back in in twenty years or so, this would perhaps have a shred of validity, but not a heck of a lot ha
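
The same idea can be written a little more compactly with `bz2.open` in text mode and `itertools.islice`. The sketch below is one possible variant; `read_first_records` is a hypothetical helper, and it runs against a tiny sample file created on the spot so that it does not depend on the download.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def read_first_records(path, n):
    """Return the first n JSON records of a bz2-compressed, one-object-per-line file."""
    # "rt" opens the compressed file in text mode, so each line is a str.
    with bz2.open(path, "rt", encoding="utf-8") as fp:
        # islice stops after n lines, so the rest of the file is never read.
        return [json.loads(line) for line in islice(fp, n)]


# Build a small sample file (made-up records, for illustration only).
sample = [{"author": "a", "score": 1},
          {"author": "b", "score": 2},
          {"author": "c", "score": 3}]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as fp:
    for record in sample:
        fp.write(json.dumps(record) + "\n")

first_two = read_first_records(sample_path, 2)
print(first_two)
```

With the real file, `read_first_records('./example_reddit_comments.bz2', 4)` should return the same records as the loop above.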

### Handling JSON 
Each line in the file is a JSON object that can be parsed separately.
Reading through the json [docs](https://docs.python.org/3/library/json.html) will be useful here. Check `json.loads` (used below) as well as `json.dump` and `json.load`.

In [3]:
# You can see dictionary-like lines printed above.
# Each of the lines is a JSON object that can be parsed separately

# Read the first 4 lines as JSON and print subreddit, author name, ups and score separately
counter = 0
with bz2.BZ2File('./example_reddit_comments.bz2', "r") as fp:
    for line in fp:
        counter = counter + 1
        if counter < 5:
            job = json.loads(line)
            print("subreddit", job['subreddit'])
            print("author", job['author'])
            print("upvotes", job['ups'])
            print("score", job['score'])
        else:
            break

subreddit reddit.com
author gmcg
upvotes 0
score 0
subreddit reddit.com
author joshuaknox
upvotes 2
score 2
subreddit reddit.com
author rah
upvotes -6
score -6
subreddit reddit.com
author rah
upvotes -4
score -4


**Note**: Many of these files are huge!! But they will have very useful data for your projects.
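
Because of the file sizes, it is usually safer to process a file line by line and keep only a running summary in memory. The sketch below counts records per value of one field; `count_by_field` is a hypothetical helper, and the sample file is made up so the example runs on its own.

```python
import bz2
import json
import os
import tempfile
from collections import Counter


def count_by_field(path, field):
    """Count records per value of `field`, reading one line at a time."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as fp:
        for line in fp:
            record = json.loads(line)
            # .get returns None when a record lacks the field.
            counts[record.get(field)] += 1
    return counts


# Tiny made-up sample so the sketch is runnable without the download.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as fp:
    for author in ["rah", "gmcg", "rah"]:
        fp.write(json.dumps({"author": author}) + "\n")

author_counts = count_by_field(sample_path, "author")
print(author_counts.most_common())
```

Only the `Counter` grows with the data, so memory use depends on the number of distinct values rather than on the size of the file.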

### Using pushshift.io

The PSAW library (different from the `praw` library!) lets you access this data resource as well: [PushShift.io API Wrapper](https://github.com/dmarx/psaw)

`pip install psaw` from the Terminal/command line

Read the psaw [docs](https://psaw.readthedocs.io/en/latest/)

In [4]:
from psaw import PushshiftAPI

api = PushshiftAPI()

# Define some keys for the submission attributes you care about -- these are similar to the fields above
filter_keys = ['url','author','title','subreddit','id','num_comments',
               'score','upvote_ratio','domain','selftext']

Suppose you want to get all the submissions to a subreddit.
Let's try a fun one: [r/birdswitharms](http://reddit.com/r/birdswitharms).

In [19]:
# Handling dates and times
from datetime import datetime
# Also import pandas
import pandas as pd

In [14]:
start = int(datetime(2010, 10, 10).timestamp())

search = api.search_submissions(after=start,
                                  subreddit='birdswitharms',
                                  filter=filter_keys,
                                  sort='asc',
                                  limit=None)

The search object is a [generator](https://wiki.python.org/moin/Generators) and doesn't actually have any data in it; you'll need to write a loop on the object and it will start spitting out data.
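
The lazy behaviour can be seen with a plain Python generator. In the sketch below, `fake_search` is a hypothetical stand-in that never contacts Pushshift; it only illustrates that nothing is produced until something iterates over the object.

```python
from itertools import islice


def fake_search(n):
    """Stand-in generator (hypothetical); it does not call the Pushshift API."""
    for i in range(n):
        # This line only runs when a consumer asks for the next item.
        print(f"fetching item {i}")
        yield {"id": i}


results = fake_search(1000)              # nothing has been fetched yet
first_three = list(islice(results, 3))   # pulls exactly three items
print(first_three)
```

`islice` is also a convenient way to peek at a few results from a real search before committing to a long-running loop.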

In [8]:
type(search)

generator

This step could take minutes, hours, or days, depending on the size of the subreddit you give it and on other filters such as the start and stop dates. The loop below prints a status update every 10,000 submissions. I've found it takes 5-8 minutes per 10,000 submissions, so a subreddit with 100,000 submissions needs roughly 50 to 80 minutes just to get the data.
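
As a back-of-the-envelope check, the 5-8 minutes per 10,000 submissions figure can be turned into a rough estimate. `estimate_minutes` is a hypothetical helper, and the rate is only the observation quoted above, so actual times may differ.

```python
def estimate_minutes(n_submissions, minutes_per_10k=(5, 8)):
    """Return a (low, high) estimate in minutes, assuming a constant rate."""
    low_rate, high_rate = minutes_per_10k
    batches = n_submissions / 10_000
    return batches * low_rate, batches * high_rate


low, high = estimate_minutes(100_000)
print(f"{low:.0f} to {high:.0f} minutes")
```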

In [15]:
# Storage for the results
all_subs = []

# Loop through the search results to actually get data
for i,sub in enumerate(search):
    
    # Add each result's dictionary (the .d_ attribute) to all_subs
    all_subs.append(sub.d_)
    
    # Print out status updates every 10,000 submissions
    if i % 10000 == 0:
        
        # The current time so you know how long in between updates
        time_now = datetime.now().time().replace(microsecond=0)
        
        # The date of the submission to give you an idea of how far along you are
        record_date = datetime.utcfromtimestamp(sub.d_['created']).date()
        
        # Print it out
        print("{0:,} for {1} received at {2}".format(i,record_date,time_now))

0 for 2011-06-21 received at 19:10:49


KeyboardInterrupt: 

In [16]:
len(all_subs)

900

Turn the data into a DataFrame. Save to a CSV so you don't lose all that hard work.

In [17]:
all_subs

[{'author': 'zezebox',
  'created_utc': 1308653378,
  'domain': 'imgur.com',
  'id': 'i55mr',
  'num_comments': 0,
  'score': 4,
  'selftext': '',
  'subreddit': 'birdswitharms',
  'title': 'CAW CAW CAW',
  'url': 'http://imgur.com/Xf7Zr',
  'created': 1308682178.0},
 {'author': 'zezebox',
  'created_utc': 1308890780,
  'domain': 'imgur.com',
  'id': 'i7u18',
  'num_comments': 0,
  'score': 9,
  'selftext': '',
  'subreddit': 'birdswitharms',
  'title': 'CAW CAW CAW',
  'url': 'http://imgur.com/ivJ7k',
  'created': 1308919580.0},
 {'author': 'zezebox',
  'created_utc': 1308890888,
  'domain': 'imgur.com',
  'id': 'i7u2v',
  'num_comments': 0,
  'score': 8,
  'selftext': '',
  'subreddit': 'birdswitharms',
  'title': 'CAW CAW CAW',
  'url': 'http://imgur.com/Yv4uh',
  'created': 1308919688.0},
 {'author': 'radioactivespider',
  'created_utc': 1321612934,
  'domain': 'i.imgur.com',
  'id': 'mgt20',
  'num_comments': 1,
  'score': 59,
  'selftext': '',
  'subreddit': 'birdswitharms',
  't

In [20]:
subs_df = pd.DataFrame(all_subs)
print('{:,}'.format(len(all_subs)))

subs_df['timestamp'] = subs_df['created'].apply(datetime.utcfromtimestamp)
subs_df['date'] = subs_df['timestamp'].apply(lambda x:x.date())
subs_df.to_csv('all_submissions.csv',encoding='utf8',index=False)

subs_df.head()

900


| | author | created_utc | domain | id | num_comments | score | selftext | subreddit | title | url | created | timestamp | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | zezebox | 1308653378 | imgur.com | i55mr | 0 | 4 | | birdswitharms | CAW CAW CAW | http://imgur.com/Xf7Zr | 1308682000.0 | 2011-06-21 18:49:38 | 2011-06-21 |
| 1 | zezebox | 1308890780 | imgur.com | i7u18 | 0 | 9 | | birdswitharms | CAW CAW CAW | http://imgur.com/ivJ7k | 1308920000.0 | 2011-06-24 12:46:20 | 2011-06-24 |
| 2 | zezebox | 1308890888 | imgur.com | i7u2v | 0 | 8 | | birdswitharms | CAW CAW CAW | http://imgur.com/Yv4uh | 1308920000.0 | 2011-06-24 12:48:08 | 2011-06-24 |
| 3 | radioactivespider | 1321612934 | i.imgur.com | mgt20 | 1 | 59 | | birdswitharms | caw caw caw | http://i.imgur.com/907ce.jpg | 1321642000.0 | 2011-11-18 18:42:14 | 2011-11-18 |
| 4 | zezebox | 1321668122 | imgur.com | mhmt3 | 0 | 34 | | birdswitharms | CAW CAW CAW | http://imgur.com/c2Qms | 1321697000.0 | 2011-11-19 10:02:02 | 2011-11-19 |
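
In the data above, `created` and `created_utc` differ by 28,800 seconds (8 hours) for the first submission, so the `timestamp` column built from `created` does not appear to be in UTC. If UTC times are wanted, one option is to convert `created_utc` with a timezone-aware call, as in the sketch below; `utc_datetime` is a hypothetical helper. Note also that `datetime.utcfromtimestamp` is deprecated as of Python 3.12.

```python
from datetime import datetime, timezone


def utc_datetime(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# created_utc of the first submission shown above
ts = utc_datetime(1308653378)
print(ts.isoformat())
```

With pandas, the same conversion could be applied to the whole column, for example `subs_df['created_utc'].apply(utc_datetime)`.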


<span class="mark">**Optional TODO**</span>

Pick another subreddit of your choice (or stick with this same one) and collect Reddit submissions whose score exceeds a certain threshold (say > 2000).