# Sci-fi IRL

### A Data Storytelling Project by Tobias Reaper

### ----  Datalogue 002 ----


---

### Resources

- [PushShift API GitHub Repo](https://github.com/pushshift/api)
- [New to PushShift? Read This!](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)
- [Python API Tutorial](https://www.dataquest.io/blog/python-api-tutorial/)

---
---

### Imports

In [1]:
# Three Musketeers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
# For using the API
import requests
import json
# pandas.io.json.json_normalize is deprecated as of pandas 1.0; import it from the top level instead
from pandas import json_normalize

### Configuration

In [21]:
# Set pandas display option to allow for more columns
pd.set_option("display.max_columns", 100)

---

In [3]:
# Send the request and save into response object
response = requests.get("https://api.pushshift.io/reddit/search/comment/?q=utopia")

In [31]:
# Look at the status code
print(response.status_code)

# Use assert to stop the notebook's execution if not 200
assert response.status_code == 200

# Parse the JSON response body into a Python object (a dict)
json_response = response.json()

200
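
The status check above uses `assert`, which Python skips entirely when it runs with the `-O` flag. As a minimal sketch of an alternative, the request and the check could be wrapped in one function that calls `raise_for_status()` from `requests`. The helper name `fetch_json` and its `session` and `timeout` parameters are hypothetical and not part of the original notebook.

```python
import requests


def fetch_json(url, params=None, session=None, timeout=30):
    """Hypothetical helper: GET a URL and return the parsed JSON body.

    Raises requests.HTTPError on 4xx/5xx responses instead of relying on assert.
    """
    # Use the given session (or a test double) if provided, else the requests module
    http = session or requests
    response = http.get(url, params=params, timeout=timeout)
    # Raise an exception if the status code signals an error
    response.raise_for_status()
    # Parse the JSON body into a Python object
    return response.json()
```

With this in place, the two cells above could collapse into something like `json_response = fetch_json("https://api.pushshift.io/reddit/search/comment/", params={"q": "utopia"})`.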


In [24]:
# Convert the list of comments under "data" into a pandas DataFrame
# Passing the list straight to the constructor sidesteps the whole pandas "normalization" workflow
# If this turns out to be an inefficient way of going about it, I'll look into other methods
df_1 = pd.DataFrame(json_response["data"])
df_1.head(2)


| | all_awardings | author | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | author_flair_text_color | author_flair_type | author_fullname | author_patreon_flair | awarders | body | created_utc | gildings | id | is_submitter | link_id | locked | no_follow | parent_id | permalink | retrieved_on | score | send_replies | steward_reports | stickied | subreddit | subreddit_id | total_awards_received |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | 1911isokiguess | | | [] | | | | text | t2_13v8yy | False | [] | How doesn't anyone see this as the plan.\n\nRe... | 1569421589 | {} | f1egcet | False | t3_d90900 | False | True | t3_d90900 | /r/Firearms/comments/d90900/i_dont_think_we_sh... | 1569421591 | 1 | True | [] | False | Firearms | t5_2ryez | 0 |
| 1 | [] | Debates_are_Dumb | | | [] | | | | text | t2_3spmj6ay | False | [] | I suggest attempting to narrow down the conver... | 1569421294 | {} | f1efw2b | False | t3_d91fsh | False | True | t3_d91fsh | /r/MoreTankieChapo/comments/d91fsh/i_challenge... | 1569421295 | 1 | True | [] | False | MoreTankieChapo | t5_zk52m | 0 |


In [27]:
# Alright, I'm going to try out the pandas json_normalize function
# I found a Stack Overflow answer that made it seem simple
df_2 = json_normalize(json_response["data"])
df_2.head(2)

# This looks to have done basically the same thing
# (the one visible difference: the empty "gildings" dict column is gone)
# The only thing that would make this better is if
# I could skip the middle step of parsing the response into a Python object

| | all_awardings | author | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | author_flair_text_color | author_flair_type | author_fullname | author_patreon_flair | awarders | body | created_utc | id | is_submitter | link_id | locked | no_follow | parent_id | permalink | retrieved_on | score | send_replies | steward_reports | stickied | subreddit | subreddit_id | total_awards_received |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | 1911isokiguess | | | [] | | | | text | t2_13v8yy | False | [] | How doesn't anyone see this as the plan.\n\nRe... | 1569421589 | f1egcet | False | t3_d90900 | False | True | t3_d90900 | /r/Firearms/comments/d90900/i_dont_think_we_sh... | 1569421591 | 1 | True | [] | False | Firearms | t5_2ryez | 0 |
| 1 | [] | Debates_are_Dumb | | | [] | | | | text | t2_3spmj6ay | False | [] | I suggest attempting to narrow down the conver... | 1569421294 | f1efw2b | False | t3_d91fsh | False | True | t3_d91fsh | /r/MoreTankieChapo/comments/d91fsh/i_challenge... | 1569421295 | 1 | True | [] | False | MoreTankieChapo | t5_zk52m | 0 |
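
To make the difference between the two approaches easier to see, here is a small sketch on made-up records (not real API data). It assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`. The plain constructor keeps each nested dict as one object column, while `json_normalize` flattens nested dicts into dotted column names, which also seems to explain why the empty `gildings` column disappears in the second output above.

```python
import pandas as pd

# Made-up records, shaped loosely like the API results (illustration only)
records = [
    {"id": "a1", "score": 1, "gildings": {}, "meta": {"lang": "en"}},
    {"id": "b2", "score": 3, "gildings": {}, "meta": {"lang": "de"}},
]

# The plain constructor keeps each nested dict as a single object column
df_plain = pd.DataFrame(records)

# json_normalize flattens nested dicts into dotted column names;
# an empty dict has nothing to flatten, so it contributes no column
df_flat = pd.json_normalize(records)

print(list(df_plain.columns))
print(list(df_flat.columns))
```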


---

### Using aggs and other parameters

In [None]:
# Define what columns I actually need in the final dataframe
# I can use these in the "fields" parameter to only return those fields - that is nice!
keep_columns = [
    "author",
    "body",
    "created_utc",
    "parent_id",
    "permalink",
    "retrieved_on",
    "score",
    "subreddit",
    "subreddit_id",
]
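
As a rough sketch of how that list might be passed along, the columns can be joined into a comma-separated string and handed to `requests` as the `fields` query parameter. The comma-separated format is an assumption about how the API expects it; the request is only prepared here, not sent, so the final URL can be inspected first.

```python
import requests

# Same columns as in the cell above
keep_columns = [
    "author",
    "body",
    "created_utc",
    "parent_id",
    "permalink",
    "retrieved_on",
    "score",
    "subreddit",
    "subreddit_id",
]

# Assumption: "fields" takes a comma-separated list of field names
params = {"q": "utopia", "fields": ",".join(keep_columns)}

# Prepare the request without sending it, just to look at the resulting URL
prepared = requests.Request(
    "GET",
    "https://api.pushshift.io/reddit/search/comment/",
    params=params,
).prepare()
print(prepared.url)
```

Letting `requests` build the query string from a dict also takes care of URL-encoding, which the f-string approach leaves to me.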

In [40]:
# Create a time aggregation to show the number of comments mentioning "utopia" each month over the past year

keyword = "utopia"
after = "1y"
freq = "month"

agg_1_url = f"https://api.pushshift.io/reddit/search/comment/?q={keyword}&after={after}&aggs=created_utc&frequency={freq}&size=0"
print(agg_1_url)

https://api.pushshift.io/reddit/search/comment/?q=utopia&after=1y&aggs=created_utc&frequency=month&size=0


In [41]:
# Send the request and save into response object
resp_agg_1 = requests.get(agg_1_url)

In [42]:
# Look at the status code
print(resp_agg_1.status_code)

# Use assert to stop the notebook's execution if not 200
assert resp_agg_1.status_code == 200

# Parse the JSON response body into a Python object (a dict)
json_resp_agg_1 = resp_agg_1.json()

200


In [45]:
# Take a look
json_resp_agg_1

{'aggs': {'created_utc': [{'doc_count': 1578, 'key': 1535760000},
   {'doc_count': 8652, 'key': 1538352000},
   {'doc_count': 8827, 'key': 1541030400},
   {'doc_count': 9219, 'key': 1543622400},
   {'doc_count': 10597, 'key': 1546300800},
   {'doc_count': 9840, 'key': 1548979200},
   {'doc_count': 10158, 'key': 1551398400},
   {'doc_count': 10004, 'key': 1554076800},
   {'doc_count': 10444, 'key': 1556668800},
   {'doc_count': 10774, 'key': 1559347200},
   {'doc_count': 11596, 'key': 1561939200},
   {'doc_count': 11744, 'key': 1564617600},
   {'doc_count': 8024, 'key': 1567296000}]},
 'data': []}

In [52]:
# Convert the list of monthly buckets under "aggs" -> "created_utc" into a pandas DataFrame
df_agg_1 = pd.DataFrame(json_resp_agg_1["aggs"]["created_utc"])
df_agg_1.head()

| | doc_count | key |
|---|---|---|
| 0 | 1578 | 1535760000 |
| 1 | 8652 | 1538352000 |
| 2 | 8827 | 1541030400 |
| 3 | 9219 | 1543622400 |
| 4 | 10597 | 1546300800 |


In [53]:
# Convert "key" into a datetime column
df_agg_1["key"] = pd.to_datetime(df_agg_1["key"], unit="s", origin="unix")
df_agg_1.head()

| | doc_count | key |
|---|---|---|
| 0 | 1578 | 2018-09-01 |
| 1 | 8652 | 2018-10-01 |
| 2 | 8827 | 2018-11-01 |
| 3 | 9219 | 2018-12-01 |
| 4 | 10597 | 2019-01-01 |
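
One possible next step, sketched here under my own assumptions, is to turn the buckets into a Series indexed by month so it is ready for plotting. The sketch uses the first five buckets copied from the output above; the names `month`, `comments`, and `monthly` are hypothetical and not from the original notebook.

```python
import pandas as pd

# First five buckets copied from the aggregation output above
buckets = [
    {"doc_count": 1578, "key": 1535760000},
    {"doc_count": 8652, "key": 1538352000},
    {"doc_count": 8827, "key": 1541030400},
    {"doc_count": 9219, "key": 1543622400},
    {"doc_count": 10597, "key": 1546300800},
]

df = pd.DataFrame(buckets)

# Interpret "key" as seconds since the Unix epoch
df["key"] = pd.to_datetime(df["key"], unit="s")

# Hypothetical tidy-up: descriptive names and a datetime index
monthly = (
    df.rename(columns={"key": "month", "doc_count": "comments"})
    .set_index("month")["comments"]
)
print(monthly)
```

From here, `monthly.plot()` should give a quick line chart of mentions per month.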


---
---

# Examples and Inspiration

---