# Sci-Fi IRL

### A Data Storytelling Project by Tobias Reaper

### ---- Datalogue 004 ----

---
---

### Resources

- [PushShift API GitHub Repo](https://github.com/pushshift/api)

---

### Pre-Session Notes

After figuring out how to use the API yesterday, I ran into a wall as far as inspiration goes. 

> Now what?

I can do all of the things I need to do to finish this project, but I realized that a simple query such as the frequency of mentions of the terms "utopia" and "dystopia" may not show anything interesting or give me any good meat to chew on in my analysis.

I came up with some ideas to get me over that wall.

Looking at differences across and between subreddits seems to be where I feel like the most interesting stories can be found. However, I'm not going to overwhelm myself with trying to encompass ALL THE SUBREDDITS with this project. That would go against the whole "MVP" ethos.

Here's what I'm going to do today / for this project...I'm going to:

1. Choose a few subreddits from the list below that I believe will have interesting differences in certain areas
2. Come up with a list of words that can be considered a measure of optimism or hope for the future
3. Create a hypothesis regarding the words and the subreddits
4. Gather data on the chosen words in the chosen subreddits using the PushShift API
5. Explore the data with tables and visualizations
6. Find the most compelling storytelling devices and base my writing off of those
7. ???
8. Profit

---

### 1. List of Subreddits, organized by category

#### Entertainment

- entertainment, 1.1m
- scifi, 1.2m
- sciencefiction, 113k
- AskScienceFiction, 163k
- WritingPrompts, 14.0m
- writing, 888k
- movies, 21.5m
- gaming, 23.7m
- books, 17.1m
- suggestmeabook, 596k

#### Science / Technology

- Futurology, 14.2m
- futureporn, 161k
- space, 15.9m
- technology, 8.2m
- science, 22.4m
- askscience, 18.1m
- MachineLearning, 779k
- artificial, 87.7k
- TechNewsToday, 82.6k
- EverythingScience, 175k

#### Media / General

- worldnews, 22.2m
- news, 19m
- politics, 5.4m
- philosophy, 14.1m
- conspiracy, 984k
- skeptic, 134k
- changemyview, 830k
- AskReddit, 24.6m
- environment, 608k
- memes, 6.4m

Of course, ten subreddits per category might be a bit much. However, I think I'll start here and prune as needed along the way.

---

### 2. List of Keywords

I had the thought that using "utopia" and "dystopia" alone might bias the data and results, because those words are not in many people's everyday vocabulary. They would skew the data toward people who consume sci-fi or are into futurism. While gathering data on these folks is somewhat the point of the analysis, I expect other words would give a fairer read on popular sentiment.

- Basic
  - Utopia / Dystopia

---
---

### Imports

In [1]:
# Three Musketeers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# For using the API
import requests
# import json
# from pandas.io.json import json_normalize

In [3]:
# More advanced visualizations
from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.models import NumeralTickFormatter, DatetimeTickFormatter

---

### Configuration

In [4]:
# Set pandas display option to allow for more columns and rows
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)

---

# MVP

To get things rolling, I'm going to start with a much simpler analysis of a single subreddit...

> r/suggestmeabook

The reason I'm choosing this particular subreddit is that an effective measure of the content people *actually* enjoy is what they will suggest to friends, colleagues, or strangers on the InterWebz.

---- ø ----

## Notes

#### 1. Abbreviations and variable names

| Abbreviation  | Meaning |
| ----- | ------------------ |
| smab | suggestmeabook |

---- ø ----

In [34]:
# Comments that mention "utopian" in r/suggestmeabook

q = "utopian"
sub = "suggestmeabook"
fields = [
    "author",
    "body",
    "created_utc",
    "parent_id",
    "score",
    "subreddit",
]
after = "10y"

smab_utopia_url = f"https://api.pushshift.io/reddit/search/comment/?q={q}&subreddit={sub}&fields={','.join(fields)}"

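# Variant that also restricts results with the `after` parameter
# (this is the only place the `after` variable above is used)
# smab_utopia_url = f"https://api.pushshift.io/reddit/search/comment/?q={q}&subreddit={sub}&after={after}&fields={','.join(fields)}"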

print(smab_utopia_url)

https://api.pushshift.io/reddit/search/comment/?q=utopian&subreddit=suggestmeabook&after=10y
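
The URL above is assembled by hand with an f-string, which works as long as the keyword contains no spaces or special characters. As a minimal sketch (the helper name `build_comment_search_url` is made up for illustration and is not part of the notebook or the PushShift docs), the standard library's `urllib.parse.urlencode` can build the same query string and escape values where needed:

```python
from urllib.parse import urlencode

# Endpoint taken from the cells above
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_comment_search_url(q, subreddit, fields=None, **extra):
    """Assemble a comment-search URL (hypothetical helper, for illustration)."""
    params = {"q": q, "subreddit": subreddit}
    params.update(extra)  # e.g. after="10y" or aggs="created_utc"
    if fields:
        params["fields"] = ",".join(fields)
    # safe="," keeps the commas in the fields list unescaped
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_comment_search_url("utopian", "suggestmeabook", fields=["author", "body"]))
```

Passing the parameters as a dictionary also makes it easier to switch individual parameters on and off than editing the f-string.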


In [35]:
# Send the request and save into response object
resp_1 = requests.get(smab_utopia_url)

In [36]:
# Look at the status code
print(resp_1.status_code)

# Use assert to stop the notebook's execution if request fails (if not 200)
assert resp_1.status_code == 200

200


In [37]:
# Parse the json response into a python object
json_resp_1 = resp_1.json()

In [43]:
# Check out the json
json_resp_1["data"][0:2]

[{'author': 'autowikibot',
  'author_created_utc': 1388865420,
  'author_flair_css_class': None,
  'author_flair_text': None,
  'author_fullname': 't2_enhq4',
  'body': "#####&amp;#009;\n\n######&amp;#009;\n\n####&amp;#009;\n [**Mars trilogy**](https://en.wikipedia.org/wiki/Mars%20trilogy): [](#sfw) \n\n---\n\n&gt;\n\n&gt;The __Mars trilogy__ is a series of award-winning [science fiction novels](https://en.wikipedia.org/wiki/Science_fiction_novel) by [Kim Stanley Robinson](https://en.wikipedia.org/wiki/Kim_Stanley_Robinson) that chronicles the settlement and [terraforming of the planet Mars](https://en.wikipedia.org/wiki/Terraforming_of_Mars) through the intensely personal and detailed viewpoints of a wide variety of characters spanning almost two centuries. Ultimately more [utopian](https://en.wikipedia.org/wiki/Utopia) than [dystopian](https://en.wikipedia.org/wiki/Dystopia), the story focuses on [egalitarian](https://en.wikipedia.org/wiki/Egalitarianism), sociological, and scientifi

Although I might use other data for the visualizations, this data is valuable in its own right: I can browse it to get a feel for the overall quality of the comments. If, for example, I found that most of the comments mentioning the keyword `utopia` were not using it to describe (science-fiction) stories, I would not be able to test the hypothesis, or explore the question, that I originally set out with.
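
To make that kind of browsing a little quicker, here is a rough sketch of a keyword-in-context helper. The function name `keyword_context` and the sample sentence are invented for illustration; the idea is just to print a short window of text around each mention rather than reading whole comment bodies.

```python
def keyword_context(body, keyword, width=30):
    """Return short snippets of `body` around each case-insensitive match of `keyword`."""
    snippets = []
    lowered = body.lower()
    needle = keyword.lower()
    start = lowered.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(body), start + len(needle) + width)
        # Flatten newlines so each snippet prints on one line
        snippets.append(body[lo:hi].replace("\n", " "))
        start = lowered.find(needle, start + len(needle))
    return snippets


# Made-up sample text, not a real comment
sample = "A utopian tale. Not utopian."
for snippet in keyword_context(sample, "utopian", width=2):
    print(repr(snippet))
```

Applied to the `body` field of each comment in `json_resp_1["data"]`, this would give a quick read on whether the keyword is being used to describe stories.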

---

### Aggregated smab

According to the [PushShift API Documentation](https://github.com/pushshift/api):

> The size parameter was set to 0 because we are only interested in getting aggregation data and not comment data.  
The aggregation data is returned in the response under the key aggs -> created_utc.

With that in mind, I'm going to take out the `size=0` parameter in order to capture both the aggregation and the comments.

In [57]:
# Comments that mention "utopia" in r/suggestmeabook, aggregated by month

q = "utopia"
sub = "suggestmeabook"
# before = "27d"
aggs = "created_utc"
freq = "month"

smab_utopia_agg_url = f"https://api.pushshift.io/reddit/search/comment/?q={q}&subreddit={sub}&aggs={aggs}&frequency={freq}"

print(smab_utopia_agg_url)

https://api.pushshift.io/reddit/search/comment/?q=utopia&subreddit=suggestmeabook&aggs=created_utc&frequency=month


In [58]:
# Send the request and save into response object
resp_2 = requests.get(smab_utopia_agg_url)

In [59]:
# Look at the status code
print(resp_2.status_code)

# Use assert to stop the notebook's execution if request fails (if not 200)
assert resp_2.status_code == 200

200


In [60]:
# Parse the json response into a python object
json_resp_2 = resp_2.json()

In [61]:
# Eyeball the jerry-rigged head of the aggregated json
json_resp_2["aggs"]["created_utc"][0:10]

[{'doc_count': 4, 'key': 1398902400},
 {'doc_count': 4, 'key': 1401580800},
 {'doc_count': 2, 'key': 1404172800},
 {'doc_count': 5, 'key': 1406851200},
 {'doc_count': 9, 'key': 1409529600},
 {'doc_count': 1, 'key': 1412121600},
 {'doc_count': 4, 'key': 1414800000},
 {'doc_count': 5, 'key': 1417392000},
 {'doc_count': 6, 'key': 1420070400},
 {'doc_count': 11, 'key': 1422748800}]

In [63]:
# Eyeball the full comment json
json_resp_2["data"][0]

{'all_awardings': [],
 'author': 'Elsbeth55',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_u0n8csl',
 'author_patreon_flair': False,
 'body': 'Not exactly- but “Everfair” by Shawl reimagines the Belgian Congo as being reclaimed and established as a  utopia for natives from the Congo as well as enslaved people from America and other places.  There is an element of steampunk.',
 'created_utc': 1569112901,
 'gildings': {},
 'id': 'f10i4lk',
 'is_submitter': False,
 'link_id': 't3_d7g2k9',
 'locked': False,
 'no_follow': True,
 'parent_id': 't3_d7g2k9',
 'permalink': '/r/suggestmeabook/comments/d7g2k9/this_probably_doesnt_exist_but_is_there_a_book/f10i4lk/',
 'retrieved_on': 1569113149,
 'score': 1,
 'send_replies': True,
 'steward_reports': [],
 'stickied': False,
 'subreddit': 'sugg

Looks like that worked! So I can combine the aggregate and non-aggregate queries into a single one.

In [64]:
# Convert the python object into a pandas dataframe
df_smab_agg = pd.DataFrame(json_resp_2["aggs"]["created_utc"])
df_smab_agg.head()

|   | doc_count | key |
| ----- | ----- | ----- |
| 0 | 4 | 1398902400 |
| 1 | 4 | 1401580800 |
| 2 | 2 | 1404172800 |
| 3 | 5 | 1406851200 |
| 4 | 9 | 1409529600 |


In [65]:
# Convert "key" into a datetime column
df_smab_agg["key"] = pd.to_datetime(df_smab_agg["key"], unit="s", origin="unix")
df_smab_agg.head()

|   | doc_count | key |
| ----- | ----- | ----- |
| 0 | 4 | 2014-05-01 |
| 1 | 4 | 2014-06-01 |
| 2 | 2 | 2014-07-01 |
| 3 | 5 | 2014-08-01 |
| 4 | 9 | 2014-09-01 |
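
The `key` values are Unix timestamps in seconds that mark the start of each monthly bucket. As a quick sanity check outside pandas, here is a small sketch using only the standard library. The helper name `bucket_start` is hypothetical, and it assumes the buckets are aligned to UTC, which matches the midnight dates pandas produced above.

```python
from datetime import datetime, timezone


def bucket_start(epoch_seconds):
    """Convert a Unix timestamp (seconds) to a UTC date string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


# First three keys from the aggregation output above
for key in [1398902400, 1401580800, 1404172800]:
    print(key, "->", bucket_start(key))
```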


In [79]:
# Rename "key" to reflect the fact that it is the beginning of the time bucket
# (in this case the month)
df_smab_agg = df_smab_agg.rename(mapper={"key": "month", "doc_count": "utopia"}, axis="columns")

print(df_smab_agg.shape)
df_smab_agg.head()

(65, 2)


|   | utopia | month |
| ----- | ----- | ----- |
| 0 | 4 | 2014-05-01 |
| 1 | 4 | 2014-06-01 |
| 2 | 2 | 2014-07-01 |
| 3 | 5 | 2014-08-01 |
| 4 | 9 | 2014-09-01 |


In [78]:
# Convert the comments into a pandas dataframe
df_smab_data = pd.DataFrame(json_resp_2["data"])

print(df_smab_data.shape)
df_smab_data.head()

(25, 29)


Unnamed: 0,all_awardings,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,body,created_utc,gildings,id,is_submitter,link_id,locked,no_follow,parent_id,permalink,retrieved_on,score,send_replies,steward_reports,stickied,subreddit,subreddit_id,total_awards_received
0,[],Elsbeth55,,,[],,,,text,t2_u0n8csl,False,Not exactly- but “Everfair” by Shawl reimagine...,1569112901,{},f10i4lk,False,t3_d7g2k9,False,True,t3_d7g2k9,/r/suggestmeabook/comments/d7g2k9/this_probabl...,1569113149,1,True,[],False,suggestmeabook,t5_31t41,0
1,[],IAintBlackNoMore,,,[],,,,text,t2_173ps5,False,"Personally, I think starting with the classics...",1568935329,{},f0u78mu,False,t3_d6lhb2,False,True,t3_d6lhb2,/r/suggestmeabook/comments/d6lhb2/learning_abo...,1568935365,1,True,[],False,suggestmeabook,t5_31t41,0
2,[],littlemisshockey,,,[],,,,text,t2_hhca2,False,I'd start with some of the classics: *The Art ...,1568935308,{},f0u7773,False,t3_d6lhb2,False,True,t3_d6lhb2,/r/suggestmeabook/comments/d6lhb2/learning_abo...,1568935352,1,True,[],False,suggestmeabook,t5_31t41,0
3,[],PimpedUpMonk,,,[],,,,text,t2_go9g9,False,**The City and the Stars** from Arthur C. Clar...,1568927017,{},f0tqfix,False,t3_d6k2vb,False,True,t3_d6k2vb,/r/suggestmeabook/comments/d6k2vb/utopian_futu...,1568927028,1,True,[],False,suggestmeabook,t5_31t41,0
4,[],kelseycadillac,,,[],,,,text,t2_3oxiw6vy,False,The Scorpio Races by Maggie Stiefvater (realis...,1568690636,{},f0kx852,False,t3_d5ac37,False,True,t3_d5ac37,/r/suggestmeabook/comments/d5ac37/ya_books_for...,1568690637,1,True,[],False,suggestmeabook,t5_31t41,0


---

## MVP Func

Combine the steps above into functions to make things faster.

In [110]:
def subreddit_agg(query, subreddit, frequency="month", aggs="created_utc"):
    """
    Returns the JSON response of a PushShift API aggregate comment search as a Python dictionary.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    query: (str) keyword to search.
    subreddit: (str) subreddit name
    frequency: (str) set the size of the time buckets.
    aggs: (str) aggregate function name. Default is "created_utc".
    (For more information, read the PushShift API Documentation.)
    -------------------
    """
    
    # Build the query url based on endpoints and parameters 
    url = f"https://api.pushshift.io/reddit/search/comment/?q={query}&subreddit={subreddit}&aggs={aggs}&frequency={frequency}"
    
    # Send the request and save the response into the response object
    response = requests.get(url)
    
    # Check the response; stop execution if failed
    assert response.status_code == 200
    
    # Parse the JSON into a Python dictionary
    # and return it for further processing
    return response.json()
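
The `assert` in `subreddit_agg` stops the notebook on the first non-200 response. A gentler alternative, sketched below, retries a few times before giving up. This is only a sketch under assumptions: `fetch_with_retry` is a made-up name, `send` stands in for any callable like `requests.get` that returns an object with `.status_code` and `.json()`, and the retry count and pause length are arbitrary rather than taken from PushShift's documented limits.

```python
import time


def fetch_with_retry(send, url, retries=3, pause=1.0):
    """
    Call `send(url)` until it returns status 200 or the retries run out.

    send: callable returning an object with `.status_code` and `.json()`
          (e.g. `requests.get`); hypothetical interface for this sketch.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = send(url)
        if response.status_code == 200:
            return response.json()
        last_status = response.status_code
        if attempt < retries:
            # Wait a little longer after each failed attempt
            time.sleep(pause * attempt)
    raise RuntimeError(
        f"{url} failed after {retries} attempts (last status: {last_status})"
    )
```

Inside `subreddit_agg`, the request and the `assert` could then be swapped for `return fetch_with_retry(requests.get, url)`.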

In [129]:
def time_agg_df(data, keyword, frequency="month"):
    """
    Returns cleaned Pandas DataFrame of keyword frequency over time, given correctly-formatted Python dictionary.
    Renames the frequency column to keyword; converts month to datetime.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    data: (dict) Python dictionary converted from JSON API response.
    keyword: (str) the keyword that was queried.
    frequency: (str) size of time buckets, which is also the name of the resulting DataFrame column. Defaults to "month".
    -------------------
    """
    
    # Convert the python object into a pandas dataframe
    df = pd.DataFrame(data["aggs"]["created_utc"])

    # Convert "key" into a datetime column
    df["key"] = pd.to_datetime(df["key"], unit="s", origin="unix")

    # Rename "key" to reflect the fact that it is the beginning of the time bucket
    df = df.rename(mapper={"key": frequency, "doc_count": keyword}, axis="columns")
    
    # Return the DataFrame
    return df

In [74]:
# Function to convert the comment data into pandas dataframe
def data_df(data):
    """
    Returns Reddit comments in Pandas DataFrame, given the correctly-formatted Python dictionary.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    data: (dict) Python dictionary converted from JSON API response.
    -------------------
    """
    
    # Convert the comments into a pandas dataframe
    df = pd.DataFrame(data["data"])

    # Return the DataFrame
    return df

In [75]:
def df_to_csv(data, filename):
    """
    Basically just a wrapper around the Pandas `.to_csv()` method,
    created to standardize the inputs and outputs.
    
    ---- Arguments ----
    data: (pd.DataFrame) Pandas DataFrame to be saved as a csv.
    filename: (str) name or path of the file to be saved.
    -------------------
    """
    
    # Saves the DataFrame to csv
    data.to_csv(path_or_buf=filename)
    
    # And that's it, folks!

---

## Gather the Data

In [112]:
# Comments that mention "utopia" (and, I'm assuming, "utopian" as well) in r/suggestmeabook, aggregated by month

q = "utopia"
sub = "suggestmeabook"

# Make the request and create the Python dictionary
smab_utopia_dict = subreddit_agg(q, sub, "month")

# Parse the time aggregation to pd.DataFrame
df_smab_agg_uto = time_agg_df(smab_utopia_dict, "utopia")

In [113]:
print(df_smab_agg_uto.shape)
print()

print(df_smab_agg_uto.dtypes)
print()

df_smab_agg_uto.head()

(65, 2)

utopia             int64
month     datetime64[ns]
dtype: object



|   | utopia | month |
| ----- | ----- | ----- |
| 0 | 4 | 2014-05-01 |
| 1 | 4 | 2014-06-01 |
| 2 | 2 | 2014-07-01 |
| 3 | 5 | 2014-08-01 |
| 4 | 9 | 2014-09-01 |


In [115]:
# Comments that mention "dystopia" (and, I'm assuming, "dystopian" as well) in r/suggestmeabook, aggregated by month

q = "dystopia"
sub = "suggestmeabook"

# Make the request and create the Python dictionary
smab_dystopia_dict = subreddit_agg(q, sub, "month")

# Parse the time aggregation to pd.DataFrame
df_smab_agg_dys = time_agg_df(smab_dystopia_dict, "dystopia")

In [116]:
# Take a look at the results
print(df_smab_agg_dys.shape)
print()

print(df_smab_agg_dys.dtypes)
print()

df_smab_agg_dys.head()

(65, 2)

dystopia             int64
month       datetime64[ns]
dtype: object



|   | dystopia | month |
| ----- | ----- | ----- |
| 0 | 13 | 2014-05-01 |
| 1 | 9 | 2014-06-01 |
| 2 | 3 | 2014-07-01 |
| 3 | 9 | 2014-08-01 |
| 4 | 24 | 2014-09-01 |


---

### Joining Utopia and Dystopia

In [117]:
# Join the two dataframes together
# - using "inner" because I only want rows with both

df_smab_agg = pd.merge(df_smab_agg_uto, df_smab_agg_dys, how="inner", on="month")
df_smab_agg.head()

|   | utopia | month | dystopia |
| ----- | ----- | ----- | ----- |
| 0 | 4 | 2014-05-01 | 13 |
| 1 | 4 | 2014-06-01 | 9 |
| 2 | 2 | 2014-07-01 | 3 |
| 3 | 5 | 2014-08-01 | 9 |
| 4 | 9 | 2014-09-01 | 24 |
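
Here both keyword frames cover the same 65 months, so the choice of join makes no visible difference. With other keywords or subreddits the months may not line up, and an inner join would quietly drop any month that is missing from either side. A toy sketch of the difference (the counts below are made up, not query results):

```python
import pandas as pd

# Toy frames whose months only partially overlap (made-up counts)
uto = pd.DataFrame({
    "month": pd.to_datetime(["2014-05-01", "2014-06-01", "2014-07-01"]),
    "utopia": [1, 2, 3],
})
dys = pd.DataFrame({
    "month": pd.to_datetime(["2014-06-01", "2014-07-01", "2014-08-01"]),
    "dystopia": [10, 20, 30],
})

# Inner join keeps only months present in both frames
inner = pd.merge(uto, dys, how="inner", on="month")

# Outer join keeps every month and fills the gaps with NaN
outer = pd.merge(uto, dys, how="outer", on="month")

print(inner)
print(outer)
```

If a missing month should count as zero mentions rather than be dropped, an outer join followed by `.fillna(0)` is one option.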


In [118]:
# Reorder columns to get "month" first
cols = ["month", "utopia", "dystopia"]

df_smab_agg = df_smab_agg[cols]

df_smab_agg.head()

|   | month | utopia | dystopia |
| ----- | ----- | ----- | ----- |
| 0 | 2014-05-01 | 4 | 13 |
| 1 | 2014-06-01 | 4 | 9 |
| 2 | 2014-07-01 | 2 | 3 |
| 3 | 2014-08-01 | 5 | 9 |
| 4 | 2014-09-01 | 9 | 24 |


---

## Visualization

In [119]:
# Basic Bokeh visualization

# Output to current notebook
output_notebook()

# Create new plot with title and axis labels
p = figure(title="Utopia vs Dystopia", x_axis_type="datetime", x_axis_label='Date', y_axis_label='Frequency')

# Add a line renderer with legend and line thickness
# (`legend_label` replaces the older `legend` keyword in newer Bokeh releases)
p.line(df_smab_agg["month"], df_smab_agg["utopia"], legend_label="Utopia", line_width=2, line_color="blue")
p.line(df_smab_agg["month"], df_smab_agg["dystopia"], legend_label="Dystopia", line_width=2, line_color="red")

# Style the legend
p.legend.location = "top_left"
p.legend.title = 'Comments in r/suggestmeabook that mention:'
p.legend.title_text_font_style = "bold"
p.legend.title_text_font_size = "8pt"

# Show the results
show(p)

---

## Saving the Data

In [120]:
# Save the joined utopia / dystopia dataset
df_to_csv(df_smab_agg, "suggestmeabook_utopia_dystopia_by_month.csv")

In [121]:
# Create the "utopia" comments DataFrame
smab_comments_uto = data_df(smab_utopia_dict)

In [126]:
# Save the "utopia" comments dataset to csv
df_to_csv(smab_comments_uto, "suggestmeabook_utopia_comments.csv")

In [124]:
# Create the "dystopia" comments DataFrame
smab_comments_dys = data_df(smab_dystopia_dict)

In [127]:
# Save the "dystopia" comments dataset to csv
df_to_csv(smab_comments_dys, "suggestmeabook_dystopia_comments.csv")

---

### Extras

In [None]:
# TODO: write a function that combines all the above functions
# so I can call it once on a subreddit with a certain pair of keywords
# And get out the dataframes

def reddit_data_setter(keywords, subreddits, frequency="month", aggs="created_utc"):
    """
    Creates an empty DataFrame with one column for each combination of keyword / subreddit.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    keywords: (list) keyword(s) to search.
    subreddits: (list) name of subreddit(s) to include.
    frequency: (str) set the size of the time buckets.
    aggs: (str) aggregate function name. Default is "created_utc".
    (For more information, read the PushShift API Documentation.)
    -------------------
    """

    columns = []  # Empty list to hold column names

    # Use itertools to create the columns names
    from itertools import product

    for column in product(subreddits, keywords):
        column = list(column)
        col_name = "_".join(column)
        columns.append(col_name)

    # Create an empty DataFrame for the loop below using column names generated above
    df_all = pd.DataFrame(columns=columns)
    
    return df_all

# One method of creating the initial dataframe
# I ended up using the date column to initiate it

Turns out I will have to use the above itertools function to create the columns...maybe. We'll see. Nope. Didn't end up using it.

---

## `reddit_data_setter()` v1.0

In [130]:
def reddit_data_setter(keywords, subreddits, csv=False, frequency="month", aggs="created_utc"):
    """
    Creates two DataFrames that hold the combined data for each combination of keyword / subreddit.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    keywords: (list) keyword(s) to search.
    subreddits: (list) name of subreddit(s) to include.
    csv: (bool) if True, save the resulting dataframes as csv file.
    frequency: (str) set the size of the time buckets.
    aggs: (str) aggregate function name. Default is "created_utc".
    (For more information, read the PushShift API Documentation.)
    -------------------
    """
    from time import sleep
    from functools import reduce

    comment_df_list = []  # Empty list to hold comment dataframes
    word_df_list = []  # Empty list to hold monthly word count dataframes
    
    df_comm = pd.DataFrame()  # Empty dataframe for comment data
    df_main = pd.DataFrame()  # Empty dataframe for keyword counts
    
    # Create the "month" (datetime) column - to be used when joining
    df_main["month"] = pd.date_range(start="2005-01-01", end="2019-09-01", freq="MS")
    
    # Loop through keywords and subreddits to create dictionaries for each combination
    # subreddit is outer loop because I want to go through list of keywords
    # for one subreddit before moving onto the next
    for subreddit in subreddits:
        
        for word in keywords:
            # Create column name that matches above / main DataFrame
            col_name = f"{subreddit}_{word}"
            
            # Indicate current subreddit / keyword, then pause briefly
            print(f"Starting {col_name}")
            sleep(0.5)
            print("...")
            print()

            # Make request and convert response to dictionary
            dictionary = subreddit_agg(word, subreddit)
  
            # Append aggs word count df to word_df_list
            word_df_list.append(time_agg_df(dictionary, col_name))

            # Append comments df to comment_df_list
            comment_df_list.append(data_df(dictionary))

#             # Add subreddit aggregate for keyword to df dict
#             word_df_dict[f"{col_name}"] = time_agg_df(dictionary, col_name)

#             # Append comments df to comment_df_list
#             comment_df_dict[f"{col_name}"] = data_df(dictionary)
            
            # Sleep for 1 sec to stay within API limits
            print(f"Finished {col_name}")
            sleep(0.5)
            print("...")
            sleep(0.5)
            print()
    
    # Set index to month column then concat
    df_main = pd.concat([df.set_index("month") for df in word_df_list], axis=1, join="outer").reset_index()
    
    # Earlier attempts at merging word_df_list into df_main (kept for reference)
#     df_main = pd.concat(word_df_list, how="outer", axis=1)
#     df_main = reduce(lambda x, y: pd.merge(x, y, how="outer", on="month"), word_df_list)
    
    # Concatenate the comment_df_list dataframes
    df_comm = pd.concat(comment_df_list, sort=False, join="outer", axis=0)
        
    if csv:
        df_to_csv(df_main, f"monthly-{'_'.join(subreddits)}-{'_'.join(keywords)}.csv")
        df_to_csv(df_comm, "df_comm.csv")
    
    # Return df_main, df_comm, respectively
    return df_main, df_comm
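
The key step in `reddit_data_setter()` is the `pd.concat(..., axis=1, join="outer")` call: each per-keyword frame is indexed by `month`, so pandas lines the columns up by date and fills any month a frame is missing with `NaN`. A small sketch of that behavior with toy frames (the column names follow the `{subreddit}_{word}` pattern used above; the counts are made up):

```python
import pandas as pd

# Toy stand-ins for two frames returned by time_agg_df() (made-up counts)
scifi_utopia = pd.DataFrame({
    "month": pd.to_datetime(["2019-01-01", "2019-02-01"]),
    "scifi_utopia": [1, 2],
})
scifi_dystopia = pd.DataFrame({
    "month": pd.to_datetime(["2019-02-01", "2019-03-01"]),
    "scifi_dystopia": [5, 6],
})
word_df_list = [scifi_utopia, scifi_dystopia]

# Same pattern as in reddit_data_setter(): align on the month index, then
# turn the index back into a regular column
wide = pd.concat(
    [df.set_index("month") for df in word_df_list], axis=1, join="outer"
).reset_index()

print(wide)
```

The result has one row per month that appears in any of the frames, which is why the months do not have to match across keywords or subreddits.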
