# Sci-fi IRL

### A Data Storytelling Project by Tobias Reaper

### ----  Datalogue 003 ----


---

### Resources

- [PushShift API GitHub Repo](https://github.com/pushshift/api)
- [New to PushShift? Read This!](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)
- [Python API Tutorial](https://www.dataquest.io/blog/python-api-tutorial/)

---

### Imports

In [30]:
# Three Musketeers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [31]:
# For using the API
import requests
# import json
# from pandas.io.json import json_normalize

In [32]:
# More advanced visualizations
from bokeh.plotting import figure, output_file, output_notebook, show

### Configuration

In [33]:
# Set pandas display option to allow for more columns
pd.set_option("display.max_columns", 100)

---

### Utopia

In [34]:
# Create a time aggregation to show the number of comments mentioning "utopia" each month over the past 14 years
keyword = "utopia"
after = "14y"
aggs = "created_utc"  # Time aggregation only; I'll build the "subreddit" aggregation myself to get per-subreddit counts over time
freq = "month"

url_1 = f"https://api.pushshift.io/reddit/search/comment/?q={keyword}&after={after}&aggs={aggs}&frequency={freq}&size=0"
print(url_1)

https://api.pushshift.io/reddit/search/comment/?q=utopia&after=14y&aggs=created_utc&frequency=month&size=0
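
The same query string can also be assembled from a dictionary, which avoids hand-formatting the f-string once per keyword. This is a minimal sketch using only the standard library; `build_comment_agg_url` is a hypothetical helper name of my own, not part of the PushShift API, and the parameter defaults simply mirror the values used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_comment_agg_url(keyword, after="14y", aggs="created_utc", frequency="month"):
    """Build a comment-search URL with the same parameters used in this notebook."""
    params = {
        "q": keyword,
        "after": after,
        "aggs": aggs,
        "frequency": frequency,
        "size": 0,  # same value as in the hand-written URL
    }
    # urlencode joins the pairs with "&" and escapes any special characters
    return f"{BASE_URL}?{urlencode(params)}"


print(build_comment_agg_url("utopia"))
```

For a plain keyword like "utopia" this should produce the same URL as the f-string; the dictionary form mainly helps once a keyword contains spaces or other characters that need escaping.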


In [35]:
# Send the request and save into response object
resp_1 = requests.get(url_1)

In [36]:
# Look at the status code
print(resp_1.status_code)

# Use assert to stop the notebook's execution if not 200
assert resp_1.status_code == 200

# Parse the json response into a python object
resp_1_json = resp_1.json()

200


In [None]:
# Take a look
# resp_1_json

In [38]:
# Convert the python object into a pandas dataframe
df_1 = pd.DataFrame(resp_1_json["aggs"]["created_utc"])
df_1.head()

   doc_count         key
0          2  1138752000
1          5  1141171200
2          1  1143849600
3          6  1146441600
4          8  1149120000


In [39]:
# Convert "key" into a datetime column
df_1["key"] = pd.to_datetime(df_1["key"], unit="s", origin="unix")
df_1.head()

   doc_count         key
0          2  2006-02-01
1          5  2006-03-01
2          1  2006-04-01
3          6  2006-05-01
4          8  2006-06-01
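
As a sanity check on the conversion, the first bucket key can be decoded with the standard library alone. The sketch assumes, as the `unit="s"` argument does, that the keys are seconds since the Unix epoch in UTC.

```python
from datetime import datetime, timezone

first_key = 1138752000  # first "key" value in the raw "utopia" aggregation

# Interpret the key as seconds since 1970-01-01 00:00:00 UTC
bucket_start = datetime.fromtimestamp(first_key, tz=timezone.utc)

print(bucket_start.date().isoformat())  # prints 2006-02-01
```

That matches the first row of the converted frame, so each key marks the start of its monthly bucket.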


In [40]:
# Rename "key" to reflect the fact that it is the beginning of the time bucket
# (in this case the month)
df_1 = df_1.rename(mapper={"key": "month", "doc_count": "utopia"}, axis="columns")

df_1.head()

   utopia       month
0       2  2006-02-01
1       5  2006-03-01
2       1  2006-04-01
3       6  2006-05-01
4       8  2006-06-01


---

### Dystopia

In [41]:
# Create a time aggregation to show the number of comments mentioning "dystopia" each month over the past 14 years
keyword = "dystopia"
after = "14y"
aggs = "created_utc"  # Time aggregation only; I'll build the "subreddit" aggregation myself to get per-subreddit counts over time
freq = "month"

url_2 = f"https://api.pushshift.io/reddit/search/comment/?q={keyword}&after={after}&aggs={aggs}&frequency={freq}&size=0"
print(url_2)

https://api.pushshift.io/reddit/search/comment/?q=dystopia&after=14y&aggs=created_utc&frequency=month&size=0


In [42]:
# Send the request and save into response object
resp_2 = requests.get(url_2)

In [43]:
# Look at the status code
print(resp_2.status_code)

# Use assert to stop the notebook's execution if not 200
assert resp_2.status_code == 200

# Parse the json response into a python object
resp_2_json = resp_2.json()

200


In [None]:
# Take a look
# resp_2_json

In [45]:
# Convert the python object into a pandas dataframe
df_2 = pd.DataFrame(resp_2_json["aggs"]["created_utc"])
df_2.head()

   doc_count         key
0          3  1143849600
1          1  1146441600
2          2  1149120000
3          1  1151712000
4          0  1154390400


In [46]:
# Convert "key" into a datetime column
df_2["key"] = pd.to_datetime(df_2["key"], unit="s", origin="unix")
df_2.head()

   doc_count         key
0          3  2006-04-01
1          1  2006-05-01
2          2  2006-06-01
3          1  2006-07-01
4          0  2006-08-01


In [47]:
# Rename "key" to reflect the fact that it is the beginning of the time bucket
# (in this case the month)
df_2 = df_2.rename(mapper={"key": "month", "doc_count": "dystopia"}, axis="columns")

df_2.head()

   dystopia       month
0         3  2006-04-01
1         1  2006-05-01
2         2  2006-06-01
3         1  2006-07-01
4         0  2006-08-01
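
The Utopia and Dystopia sections run the same three steps (build a frame, convert `key`, rename columns), so they could be folded into one function. A minimal sketch, assuming the aggregation is a list of `{"doc_count": ..., "key": ...}` buckets, which is what the frames above suggest; `aggs_to_frame` is a hypothetical name of my own.

```python
import pandas as pd


def aggs_to_frame(buckets, keyword):
    """Turn a list of {"doc_count", "key"} buckets into a monthly count frame."""
    frame = pd.DataFrame(buckets)
    # Keys are epoch seconds marking the start of each monthly bucket
    frame["key"] = pd.to_datetime(frame["key"], unit="s")
    return frame.rename(columns={"key": "month", "doc_count": keyword})


# Sample buckets copied from the first rows of the dystopia output
sample = [
    {"doc_count": 3, "key": 1143849600},
    {"doc_count": 1, "key": 1146441600},
]
df_sample = aggs_to_frame(sample, "dystopia")
print(df_sample)
```

With a helper like this, each keyword would need only one call on its parsed response, for example `aggs_to_frame(resp_2_json["aggs"]["created_utc"], "dystopia")`.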


---

### Joining Utopia and Dystopia

In [48]:
# Join df_1 and df_2 on "month" - using "inner" because I only want months that appear in both
df = pd.merge(df_1, df_2, how="inner", on="month")
df.head()

   utopia       month  dystopia
0       1  2006-04-01         3
1       6  2006-05-01         1
2       8  2006-06-01         2
3      14  2006-07-01         1
4       6  2006-08-01         0


In [49]:
# Reorder columns to get "month" first
cols = ["month", "utopia", "dystopia"]

df = df[cols]

df.head()

        month  utopia  dystopia
0  2006-04-01       1         3
1  2006-05-01       6         1
2  2006-06-01       8         2
3  2006-07-01      14         1
4  2006-08-01       6         0


---

## Visualizations

#### Resources

- [Bokeh Documentation](https://bokeh.pydata.org/en/latest/index.html)
- [Seaborn Example gallery](https://seaborn.pydata.org/examples/index.html)

---

In [50]:
# Look at datatypes
df.dtypes

month       datetime64[ns]
utopia               int64
dystopia             int64
dtype: object

In [53]:
# My first Bokeh Viz

# TODO: normalize to overall growth of reddit / # comments
# TODO: break it down by subreddit (maybe with violin plot? or mountain ranges?)

# Output to current notebook
output_notebook()

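# Create new plot with title and axis labels
# (x_axis_type="datetime" makes Bokeh format the x-axis ticks as dates)
p = figure(title="Utopia vs Dystopia on Reddit", x_axis_label="Date", y_axis_label="Frequency", x_axis_type="datetime")

# Add a line renderer with legend and line thickness
# (legend_label needs Bokeh 1.4+; older versions use the legend keyword instead)
p.line(df["month"], df["utopia"], legend_label="Utopia", line_width=2, line_color="blue")
p.line(df["month"], df["dystopia"], legend_label="Dystopia", line_width=2, line_color="red")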

In [52]:
# Show the results
show(p)
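
The first TODO (normalizing to Reddit's overall growth) could be approached by dividing each keyword's count by the total number of comments in the same month. This is a minimal sketch, assuming a hypothetical `total_comments` column fetched by a separate query; the totals below are made-up placeholders for illustration, not real Reddit figures.

```python
import pandas as pd

counts = pd.DataFrame({
    "month": pd.to_datetime(["2006-04-01", "2006-05-01", "2006-06-01"]),
    "utopia": [1, 6, 8],
    "dystopia": [3, 1, 2],
    # Placeholder totals for illustration only (not real data)
    "total_comments": [20_000, 25_000, 40_000],
})

# Mentions per 100,000 comments, so months with different activity are comparable
for keyword in ["utopia", "dystopia"]:
    counts[f"{keyword}_per_100k"] = counts[keyword] / counts["total_comments"] * 100_000

print(counts[["month", "utopia_per_100k", "dystopia_per_100k"]])
```

Plotting the `_per_100k` columns instead of the raw counts would separate a real shift in how often the words come up from the general increase in comment volume.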

---

### Saving DataFrame as csv file

Saving the merged data means I don't have to query the API again every time I rerun all cells in the notebook.

In [54]:
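# Use pandas to save the dataframe to csv
# (index=False keeps the row index out of the file, so it doesn't come back as an "Unnamed: 0" column)
df.to_csv(path_or_buf="utopia_dystopia_in_reddit_comments.csv", index=False)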
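
Reading the file back is the other half of skipping the query. This small round-trip sketch writes to an in-memory buffer rather than the real file, assuming the same three columns as `df`. Passing `index=False` keeps the row index out of the CSV, and `parse_dates` restores the `month` column as datetimes.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    "month": pd.to_datetime(["2006-04-01", "2006-05-01"]),
    "utopia": [1, 6],
    "dystopia": [3, 1],
})

# Write to an in-memory buffer instead of a file so the sketch has no side effects
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the "month" column would load as plain strings
df_loaded = pd.read_csv(buffer, parse_dates=["month"])
print(df_loaded.dtypes)
```

In the notebook itself, the buffer would be replaced by the CSV path used above.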