# Data Collection

In [None]:
Importing the libraries needed for web scraping.

In [1]:
import pandas as pd
import datetime as dt
import time 
import requests

To collect data from the subreddits of interest, I use a function provided by my wonderful instructor, Gwen Rathgeber.

This function queries the Pushshift API for Reddit by sending URL requests in a loop. Each request retrieves the specified subfields of posts (submissions) from a given subreddit. The `after` offset moves back by `day_window` days on every iteration, for `n` iterations in total. Each request returns at most 100 posts, and the function waits six seconds between requests to avoid overwhelming the Pushshift servers. I added comments to the original code to better explain the mechanics of the function.

In [2]:
def query_pushshift(subreddit, kind = 'submission', day_window = 25, n = 130):
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=100" # always pulling max of 100
    
    # instantiate empty list for temp storage
    posts = []
    
    # loop over the time windows, pausing with `time.sleep(6)` between requests
    for i in range(1, n + 1): # iterate from 1 through n, the number of requests set by the n parameter
        URL = "{}&after={}d".format(stem, day_window * i) # declare 'URL' by combining the stem with an `after` offset of day_window * i days
        print("Querying from: " + URL) # print the source URL as the data is retrieved
        response = requests.get(URL) # send the GET request and store the result in 'response'
        assert response.status_code == 200 # raise an AssertionError if the request did not succeed (status code other than 200)
        mine = response.json()['data'] # parse the JSON response body and keep the list of posts under 'data'
        df = pd.DataFrame.from_dict(mine) # convert the list of posts to a pandas DataFrame
        posts.append(df) # add this batch's DataFrame to the storage list
        time.sleep(6) # wait six seconds before the next request
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False) 
    
    # if submission
    if kind == "submission":
        # select desired columns 
        full = full[SUBFIELDS]
        # drop duplicates
        full.drop_duplicates(inplace = True)
        # select `is_self` == True
        #full = full.loc[full['is_self'] == True]

    # create `timestamp` column
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    
    print("Query Complete!")    
    return full 
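
To check which time windows the loop will request before running the full query, the URL construction can be reproduced on its own without sending anything. This is only a minimal sketch: `build_query_urls` is a hypothetical helper, not part of the instructor's function, and it mirrors the `stem` and `after` logic above with a small `n` for illustration.

```python
def build_query_urls(subreddit, kind="submission", day_window=25, n=3):
    """Return the URLs the loop would request, without sending any of them."""
    base_url = f"https://api.pushshift.io/reddit/search/{kind}"
    stem = f"{base_url}?subreddit={subreddit}&size=100"
    # request i uses an `after` offset of (day_window * i) days
    return [f"{stem}&after={day_window * i}d" for i in range(1, n + 1)]


urls = build_query_urls("TalesFromRetail")
for url in urls:
    print(url)
```

Printing the list first makes it easy to confirm how far back a given combination of `day_window` and `n` reaches before committing to a run that waits six seconds per request.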

## Running the pushshift query

In [3]:
results = query_pushshift('TalesFromRetail')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=25d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=50d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=75d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=100d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=125d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=150d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=175d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=200d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=225d
Quer

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=1875d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=1900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=1925d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=1950d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=1975d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=2000d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=2025d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=2050d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&afte

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3725d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3750d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3775d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3800d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3825d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3850d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3875d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&after=3900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromRetail&size=100&afte

After retrieving all the posts from the retail and server subreddits, I take a look at the shape of each dataframe and check how many rows have no text in the `selftext` field, either because the post was removed or because it never had text to begin with. I then export the dataframes as CSV files to be processed for analysis and modeling.
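
As a rough illustration of these checks, the counts can be computed directly rather than read off the filtered tables. The sketch below uses a small made-up dataframe (`toy` and its rows are hypothetical, not scraped data) with the same `selftext` and `is_self` columns.

```python
import pandas as pd

# Hypothetical toy rows standing in for the scraped posts
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["[removed]", "A real story", "", None],
    "is_self": [True, True, False, False],
})

# posts whose text was removed
removed = (toy["selftext"] == "[removed]").sum()
# posts with no text at all: treat missing values and empty strings alike
empty = (toy["selftext"].fillna("").str.strip() == "").sum()
# link posts, which carry no self text
link_posts = (~toy["is_self"]).sum()

print(removed, empty, link_posts)
```

The same three expressions should work on `results` and `results1`, assuming their columns match the subfields selected in the query function.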

In [10]:
results.shape

(12085, 9)

In [5]:
results[results['selftext'] == '[removed]']

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Closing time.,[removed],TalesFromRetail,1626061814,Avacynarchangel,0,1,True,2021-07-11
4,"Karen, justice served?",[removed],TalesFromRetail,1626089107,Free_Lunch_18,0,1,True,2021-07-12
5,"""I don't want a hamburger, I want a cheeseburg...",[removed],TalesFromRetail,1626115986,KerryLouise1996,0,1,True,2021-07-12
6,"I don't want a hamburger, I want a cheeseburge...",[removed],TalesFromRetail,1626116544,KerryLouise1996,0,1,True,2021-07-12
8,The penny was apparently too ugly,[removed],TalesFromRetail,1626126394,decemberhunting,51,1,True,2021-07-12
...,...,...,...,...,...,...,...,...,...
95,"""You're pretty, but..""",[removed],TalesFromRetail,1442800250,pidjin00,142,524,True,2015-09-20
96,What happens when you actually say what everyo...,[removed],TalesFromRetail,1442802821,inkedcorset,0,1,True,2015-09-20
97,[s] But the lights are still on...,[removed],TalesFromRetail,1442804154,[deleted],0,1,True,2015-09-20
98,"Five years after quitting, this is still a mom...",[removed],TalesFromRetail,1442804690,[deleted],3,39,True,2015-09-20


In [6]:
results[results['is_self'] == False]

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
38,Entitled customer dislocated my finger; keeps ...,,TalesFromRetail,1622073390,Warrior_White,0,1,False,2021-05-26
52,"Angry Dad Hates "" The Rainbow "" (LGBT) !!!",,TalesFromRetail,1609213359,LeStachyPoro,0,1,False,2020-12-28
82,Entitled Karen called me sexist,,TalesFromRetail,1602998391,KitarRose,0,1,False,2020-10-17
15,Racist starts fight because their package isn’...,,TalesFromRetail,1591628678,lucyxariel,0,1,False,2020-06-08
56,Karen waits 1 hour to get no existant hand san...,,TalesFromRetail,1585422108,hurricanegold4,0,1,False,2020-03-28
...,...,...,...,...,...,...,...,...,...
7,A picture tells a thousand words,,TalesFromRetail,1336220821,Bbmajor,2,4,False,2012-05-05
0,"Acts of Gord: Love the Gord, Fear the Gord",,TalesFromRetail,1330458523,komichi1168,8,24,False,2012-02-28
0,X-post from r/funny (Can you check in the back?),,TalesFromRetail,1324311003,beefstick86,1,5,False,2011-12-19
2,x-post from tailsfromtechsupport,,TalesFromRetail,1324581520,[deleted],0,1,False,2011-12-22


In [11]:
results.to_csv('../data/retail.csv')

In [8]:
results.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Closing time.,[removed],TalesFromRetail,1626061814,Avacynarchangel,0,1,True,2021-07-11
1,Customers don't Read #1: Soda Water Eggs Chips,I just want to thank all you customers who jus...,TalesFromRetail,1626069356,DominicB547,5,1,True,2021-07-11
2,"No, I'm not giving you another free giveaway.",So I work at your average grocery store in Can...,TalesFromRetail,1626072531,PM_ME_UR_CATS__,81,1,True,2021-07-11
3,meowing like a cat in the middle of the night,"I'm new to reddit, so there for new to this su...",TalesFromRetail,1626081180,bigmacmcjackson,17,1,True,2021-07-12
4,"Karen, justice served?",[removed],TalesFromRetail,1626089107,Free_Lunch_18,0,1,True,2021-07-12


In [12]:
results1 = query_pushshift('TalesFromYourServer')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=25d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=50d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=75d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=100d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=125d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=150d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=175d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=200d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFro

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1800d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1825d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1850d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1875d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1900d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1925d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1950d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=1975d
Querying from: https://api.pushshift.io/reddit/search/submission?subredd

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3575d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3600d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3625d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3650d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3675d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3700d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3725d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TalesFromYourServer&size=100&after=3750d
Querying from: https://api.pushshift.io/reddit/search/submission?subredd

In [16]:
results1.shape

(11241, 9)

In [17]:
results1[results1['selftext'] == '[removed]']

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
9,Is being a host or busser easier?,[removed],TalesFromYourServer,1626115310,auusernammex,0,1,True,2021-07-12
10,Petition for action on third party delivery or...,[removed],TalesFromYourServer,1626116043,swayz38,0,1,True,2021-07-12
12,Gen Z are the worst tippers??,[removed],TalesFromYourServer,1626121120,Adventux,0,1,True,2021-07-12
21,[Research Participants Needed] Employee Health...,[removed],TalesFromYourServer,1626135578,everydaybentobox,0,1,True,2021-07-12
56,Our food sucks! but don’t discount it,[removed],TalesFromYourServer,1626295405,toe-eater19,0,1,True,2021-07-14
...,...,...,...,...,...,...,...,...,...
10,I've worked damn near twenty different places....,[removed],TalesFromYourServer,1442550874,junglethedwarf,14,3,True,2015-09-17
38,The time I had a candlelit dinner with a customer,[removed],TalesFromYourServer,1442869530,Spoontasic,0,1,True,2015-09-21
68,Any advice for a newby server at Applebee's?,[removed],TalesFromYourServer,1443130531,amacias408,9,8,True,2015-09-24
77,Tipped 100% to a short cute Indian man,[removed],TalesFromYourServer,1443220024,SlightlyCyborg,4,0,True,2015-09-25


In [18]:
results1[results1['is_self'] == False]

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
38,Revenge on a crap manager,,TalesFromYourServer,1580930489,Kinglens311,0,1,False,2020-02-05
70,Don't Tip Extra -Just Calculate from Here,,TalesFromYourServer,1581094704,TheArensic,0,1,False,2020-02-07
81,RAACHI,,TalesFromYourServer,1581128084,9311199953,0,1,False,2020-02-07
89,You guys need to read this guys insane ad. Pay...,,TalesFromYourServer,1581143530,TheCanadianbloke,2,1,False,2020-02-07
53,Thought you guys would appreciate this.,,TalesFromYourServer,1578790857,EllieDeeZoe,6,1,False,2020-01-11
...,...,...,...,...,...,...,...,...,...
53,I got a gold star! And a great tip!,,TalesFromYourServer,1355467814,[deleted],1,2,False,2012-12-13
55,"Waitress received this as part of a tip, tonig...",,TalesFromYourServer,1355630916,pudgypenguin,19,61,False,2012-12-15
56,Lovely things customers do that make you die o...,,TalesFromYourServer,1355639264,[deleted],5,14,False,2012-12-15
57,Shit Servers Say - YouTube,,TalesFromYourServer,1355646446,CrushCake21,5,27,False,2012-12-16


In [19]:
results1.to_csv('../data/server.csv')
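
One thing to keep in mind when reading these files back: by default, `to_csv` writes the row index as an unnamed first column, which `pd.read_csv` then loads as `Unnamed: 0`. Since the concatenated frame keeps each batch's own index, that column is not a unique identifier. The sketch below shows the difference on a made-up frame (`toy` is hypothetical); whether to pass `index=False` depends on what the later processing expects.

```python
import io

import pandas as pd

# Hypothetical toy frame with a repeated index, like concatenated batches
toy = pd.concat(
    [pd.DataFrame({"title": ["a", "b"]}), pd.DataFrame({"title": ["c"]})],
    sort=False,
)

# default export: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# export without the index
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```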