# Reading data from the Reddit API

This module runs a script to import Reddit posts, stores them in a DataFrame, and writes that DataFrame to a CSV file for later use.

To do this, it uses a class and some supporting functions designed to interact with the Reddit HTTP JSON Application Programming Interface (API). The class is called ReadReddit and it is built to pull Reddit posts, or listings. It uses the Requests Python [library](http://docs.python-requests.org/en/master/) for HTTP communication.

ReadReddit has the following attributes:

* url_base - the base URL for data pulls, in this case 'http://www.reddit.com/'
* url_ - the actual URL used to retrieve data from a subreddit
* no_posts_ - the number of posts returned after calling collect_posts
* status_code_ - the HTTP status code returned after calling collect_posts
* json_ - the JSON content returned by the web call
* after_ - the after parameter returned by the Reddit API

and the following methods:

* collect_posts(sub_grp = None, params = {}, headers = {}) - collect posts data from a subreddit
* return_posts() - return the individual posts as a list
* return_post_keys() - return the keys of the post records
* posts(features = []) - return a list of dictionaries containing the requested fields of each post
    
The key functions are hit_reddit and write_data. hit_reddit takes a list of subreddits and a list of features, and repeatedly calls a ReadReddit object to retrieve data. The results are returned as a DataFrame, which write_data then saves as a CSV file.
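
The extraction step that ReadReddit.posts performs can be tried without touching the network. The sample payload below is made up for illustration; it only mirrors the nesting that the class code reads (a top-level `data` key holding `children`, each with its own `data`). `extract_features` is a hypothetical stand-in written for this sketch, not part of the module.

```python
# Hypothetical, hand-written payload shaped like the listing JSON
# that ReadReddit reads; this is not real API output.
sample_listing = {
    "data": {
        "after": "t3_example2",
        "children": [
            {"data": {"id": "example1", "title": "First post", "num_comments": 3}},
            {"data": {"id": "example2", "title": "Second post"}},
        ],
    }
}


def extract_features(listing, features):
    """Return one dict per post, keeping only the requested fields."""
    rows = []
    for child in listing["data"]["children"]:
        # Missing fields become empty strings, as in ReadReddit.posts
        rows.append({name: child["data"].get(name, "") for name in features})
    return rows


rows = extract_features(sample_listing, ["id", "title", "num_comments"])
for row in rows:
    print(row)
```

A list of dictionaries in this shape is what hit_reddit passes to pd.DataFrame.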




### References

- https://docs.python.org/3/library/time.html

## Import libraries

In [1]:
## Imports

import requests
import pandas as pd
import os
import time
from time import gmtime, strftime, sleep, localtime


## Establish functions

In [2]:
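# Print the min, max, dtype and null count of every column in a DataFrame
def print_summary(df):
    for column in df.columns:
        # Fall back to 'Unknown' when a statistic cannot be computed
        # (for example, min/max on a column of mixed types)
        try:
            col_type = df[column].dtype
        except Exception:
            col_type = 'Unknown'
        try:
            col_min = df[column].min()
        except Exception:
            col_min = 'Unknown'
        try:
            col_max = df[column].max()
        except Exception:
            col_max = 'Unknown'
        col_null = df[column].isnull().sum()

        # Truncate every value to 15 characters so the columns line up
        print("Column: %15s  min: %15s  max: %15s  type: %15s  null: %15s" % (
            str(column)[:15], str(col_min)[:15], str(col_max)[:15],
            str(col_type)[:15], str(col_null)[:15]))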


# Function to hit the reddit API for specified subgroups and features to return
def hit_reddit(sub_groups = [], features = [], interval = 30, calls = 15):
    
    # parameters for the API call
    headers = {'user-agent': 'SteveG'}
    params = {}
    aft_lst = {}
    # Calculate the sleep interval: spread the calls over `interval` seconds,
    # waiting at least 1 second between rounds
    slp_int = max(1, int(interval/calls))
    
    pst_lst = []
    # for each of the calls
    for i in range(calls):
        # for each subreddit
        for j, sub in enumerate(sub_groups):
            # After the first round, pass the after parameter to get the next page of posts
            if i != 0:
                params = {'after': aft_lst[j]}
            # Call the ReadReddit object to get the posts in a list of dictionaries
            posts = ReadReddit()
            posts.collect_posts(sub_grp=sub, params = params, headers = headers)
            sub_post = posts.posts(features = features)
            pst_lst.extend(sub_post)
            # Set the after value for the next call to the API
            aft_lst[j] = posts.after_
        # pause before hitting the API again
        time.sleep(slp_int)   
    
    # Convert the list to a DataFrame and drop dups
    df = pd.DataFrame(pst_lst)
    df.drop_duplicates(inplace = True)
    df.reset_index(drop=True, inplace = True)

    return df

def write_data(df, data_path):
    # assign a unique file name based on the current time
    t_stmp = strftime("%d%b%Y_%H_%M", localtime())
    o_file = "posts_" + t_stmp + ".csv"
    df.to_csv(os.path.join(data_path, o_file), index = False)
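
The paging logic in hit_reddit can be tried offline. In this sketch, `fake_fetch` and its three made-up pages stand in for the ReadReddit call, and `collect_pages` is a hypothetical helper; the only thing carried over from the code above is the idea of feeding each response's after value into the next request.

```python
# Made-up pages keyed by the cursor that requests them; None is the first page.
FAKE_PAGES = {
    None: {"posts": ["p1", "p2"], "after": "t3_p2"},
    "t3_p2": {"posts": ["p3", "p4"], "after": "t3_p4"},
    "t3_p4": {"posts": ["p5"], "after": None},
}


def fake_fetch(after=None):
    """Stand-in for one API call: return the posts and the next cursor."""
    page = FAKE_PAGES[after]
    return page["posts"], page["after"]


def collect_pages(fetch, calls):
    """Call fetch up to `calls` times, passing each cursor to the next call."""
    collected = []
    after = None
    for _ in range(calls):
        posts, after = fetch(after)
        collected.extend(posts)
        # A cursor of None means there is nothing further to page through
        if after is None:
            break
    return collected


all_posts = collect_pages(fake_fetch, calls=10)
print(all_posts)
```

Unlike this sketch, hit_reddit does not stop when the cursor is empty; it keeps calling for the full number of rounds and relies on drop_duplicates to discard any posts it sees twice.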
    

## Establish classes

In [3]:
class ReadReddit:
    # Attributes of the data retrieval
    url_base = 'http://www.reddit.com/'
    url_ = None
    no_posts_ = None
    status_code_ = None
    json_ = None
    after_ = None
    
    # Initialization method
    def __init__(self):
        pass
    
    # method to collect data from posts
    def collect_posts(self, sub_grp = None, params = {}, headers = {}):
        # Set the URL and save it to the class variable
        url = self.url_base + 'r/' + sub_grp + '.json'
        self.url_ = url
        # Hit the API to get posts from this URL
        res = requests.get(url, params = params, headers = headers)
        # Save the status code, then store the data if the call succeeded (200)
        self.status_code_ = res.status_code
        if res.status_code == 200:
            self.json_ = res.json()
            self.no_posts_ = len(self.json_['data']['children'])
            self.after_ = self.json_['data']['after']
            return self.json_
        else:
            return 'Data retrieval error: status code:' + str(res.status_code)

    # Method to return the individual posts as a list    
    def return_posts(self):
        # Refer to the json variable set during collect_posts()
        data = self.json_
        # Return the children posts
        return data['data']['children']
    
    # Method to return the dictionary keys for posts
    def return_post_keys(self):
        # Refer to the json variable set during collect_posts()
        data = self.json_
        # Return the field names (keys) of the first post
        return data['data']['children'][0]['data'].keys()

    # Method to return a list of dictionaries of posts with specified fields
    def posts(self, features = []):
        # Refer to the json variable set during collect_posts()
        data = self.json_
        posts = []
        # For every entry in the children posts add a dictionary to the list
        for entry in data['data']['children']:
            post = {}
            # For each item in features create a dictionary key: value pair
            for item in features:
                # Use an empty string when the post lacks this field
                post[item] = entry['data'].get(item, '')
            posts.append(post)
        return posts
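
For reference, the address that collect_posts requests can be reproduced with the standard library alone. `build_listing_url` is a hypothetical helper written for this sketch; inside the class, Requests assembles the query string itself from the params argument.

```python
from urllib.parse import urlencode

URL_BASE = "http://www.reddit.com/"  # same value as ReadReddit.url_base


def build_listing_url(sub_grp, params=None):
    """Build the listing URL for a subreddit, adding query parameters if given."""
    url = URL_BASE + "r/" + sub_grp + ".json"
    # Skip parameters whose value is None so they do not appear in the URL
    query = {k: v for k, v in (params or {}).items() if v is not None}
    if query:
        url += "?" + urlencode(query)
    return url


print(build_listing_url("woodworking"))
print(build_listing_url("woodworking", {"after": "t3_b6i246"}))
```

Printing the URL this way is a quick check that the subreddit name and the after value end up where expected before any request is sent.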
    

## Establish parameters

In [4]:
# These are the parameters for retrieving reddit posts data
sub_groups = ['relationships', 'diy','politics', 'woodworking']
inc_list = ['name','subreddit','selftext','created_utc','author_fullname',
           'title', 'num_comments','id']
# Set relative data path
data_path = "../data"




## Retrieve data from the reddit API and write to a file

In [6]:
# Return a dataframe of reddit posts
df =  hit_reddit(sub_groups = sub_groups, features = inc_list, interval = 300, calls = 150)
write_data(df, data_path)


## Examine the resulting DataFrame

In [22]:
# Look at the resulting DataFrame
print(df.shape)
df.head()


(2982, 8)


| | author_fullname | created_utc | id | name | num_comments | selftext | subreddit | title |
|---|---|---|---|---|---|---|---|---|
| 0 | t2_qu2w0o | 1553775000.0 | b6i246 | t3_b6i246 | 641 | My husband and I have only been married for 6 ... | relationships | My [28/F] husband [28/M] plays video games non... |
| 1 | t2_2ywjm7if | 1553794000.0 | b6lpu6 | t3_b6lpu6 | 193 | We have been married for a year, together for ... | relationships | My (25F) husband (30M) says there's nothing wr... |
| 2 | t2_3hwpjjxa | 1553785000.0 | b6jsvh | t3_b6jsvh | 244 | A little background - I'm 32, have 3 kids with... | relationships | My (32M) girlfriend (29F) keeps using things I... |
| 3 | t2_8kmws | 1553753000.0 | b6f75y | t3_b6f75y | 128 | I have a twin brother, which I can honestly sa... | relationships | My (29M) parents (~60MF) seem to favor my twin... |
| 4 | t2_3akkhafg | 1553815000.0 | b6pw4l | t3_b6pw4l | 51 | So I noticed my cousin was affectionate and fl... | relationships | I [21M] interfered when my cousin [14F] was ta... |


In [12]:
# Examine Value counts of subreddit
df['subreddit'].value_counts()


politics         1072
woodworking       974
relationships     610
DIY               326
Name: subreddit, dtype: int64

In [15]:
# Count number of duplicated rows
print("Duplicated rows: %d \n" % df.duplicated().sum())

#Print a summary of DataFrame columns
print_summary(df)


Duplicated rows: 0 

Column: author_fullname  min:                  max:      t2_zzoxkw2  type:          object  null:               0
Column:     created_utc  min:    1546696172.0  max:    1553823778.0  type:         float64  null:               0
Column:              id  min:          acuaez  max:          b6req9  type:          object  null:               0
Column:            name  min:       t3_acuaez  max:       t3_b6req9  type:          object  null:               0
Column:    num_comments  min:               0  max:            7104  type:           int64  null:               0
Column:        selftext  min:                  max: ~~https://m.img  type:          object  null:               0
Column:       subreddit  min:             DIY  max:     woodworking  type:          object  null:               0
Column:           title  min: "Still-Secret M  max: “She Was Not In  type:          object  null:               0
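
The created_utc values are easier to interpret as dates. The sketch below assumes they are Unix timestamps in seconds, expressed in UTC, and converts the minimum and maximum from the summary above; the variable names are chosen for this example only.

```python
from datetime import datetime, timezone

# Minimum and maximum created_utc values copied from the summary above
oldest = datetime.fromtimestamp(1546696172.0, tz=timezone.utc)
newest = datetime.fromtimestamp(1553823778.0, tz=timezone.utc)

print(oldest.isoformat())
print(newest.isoformat())
print((newest - oldest).days, "full days between the oldest and newest post")
```

The same conversion could be applied to the whole column when the data is loaded for analysis.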


In [21]:
# Look for duplicates in the selftext column
print("There might be duplicates in %d rows" % (len(df['selftext']) - len(set(df['selftext']))))


There might be duplicates in 1955 rows
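
One plausible reason for the large count is that the comparison treats every repeated value as a duplicate, including the empty string (the summary above shows an empty string as the minimum selftext). The sketch below uses a few made-up values and a hypothetical helper, `count_repeats`, to show how excluding empty values changes the figure; it has not been run against the collected data.

```python
from collections import Counter


def count_repeats(values, ignore_empty=False):
    """Count how many entries repeat an earlier value (length minus unique)."""
    if ignore_empty:
        values = [v for v in values if v != ""]
    counts = Counter(values)
    return sum(n - 1 for n in counts.values())


# Made-up selftext values: three posts with no body text, one repeated body
sample = ["", "", "", "Need advice", "Need advice", "Look at my bench"]

print(count_repeats(sample))
print(count_repeats(sample, ignore_empty=True))
```

If most of the 1955 rows turn out to have an empty selftext, they are posts without body text rather than true duplicates.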
