# Sci-Fi IRL #1: Utopia Dystopia

### A Data Storytelling Project by Tobias Reaper

### ---- Datalogue 005 ----

---
---

### Resources

- [PushShift API GitHub Repo](https://github.com/pushshift/api)
- [PushShift.io](https://pushshift.io/)
- [PushShift API Documentation](https://pushshift.io/api-parameters/)
- [Bokeh Gallery](https://bokeh.pydata.org/en/latest/docs/gallery.html)

---

### 1. List of Subreddits, organized by category

#### Entertainment

- entertainment, 1.1m
- scifi, 1.2m
- sciencefiction, 113k
- AskScienceFiction, 163k
- WritingPrompts, 14.0m
- writing, 888k
- movies, 21.5m
- gaming, 23.7m
- books, 17.1m
- suggestmeabook, 596k

#### Science / Technology

- Futurology, 14.2m
- futureporn, 161k
- space, 15.9m
- technology, 8.2m
- science, 22.4m
- askscience, 18.1m
- MachineLearning, 779k
- artificial, 87.7k
- TechNewsToday, 82.6k
- EverythingScience, 175k

#### Media / General

- worldnews, 22.2m
- news, 19m
- politics, 5.4m
- philosophy, 14.1m
- conspiracy, 984k
- skeptic, 134k
- changemyview, 830k
- AskReddit, 24.6m
- environment, 608k
- memes, 6.4m

---

### 2. List of Keywords

I suspect that using "utopia" and "dystopia" alone would bias the results, because neither word is part of most people's everyday vocabulary. The data would skew toward people who read sci-fi or follow futurism. Gathering data on those readers is partly the point of the analysis, but other keywords would likely give a fairer picture of popular sentiment.

- Basic
  - Utopia / Dystopia

---
---

### Imports

In [1]:
# Three Musketeers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# For using the API
import requests
# import json
# from pandas.io.json import json_normalize

In [3]:
# More advanced visualizations
from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.models import NumeralTickFormatter, DatetimeTickFormatter

---

### Configuration

In [4]:
# Set pandas display option to allow for more columns and rows
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)

---

### Functions

In [36]:
def subreddit_agg(query, subreddit, frequency="month", aggs="created_utc"):
    """
    Returns the JSON response of a PushShift API aggregate comment search as a Python dictionary.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    query: (str) keyword to search.
    subreddit: (str) subreddit name
    frequency: (str) set the size of the time buckets.
    aggs: (str) aggregate function name. Default is "created_utc".
    (For more information, read the PushShift API Documentation.)
    -------------------
    """
    
    # Build the endpoint url and query parameters;
    # requests takes care of URL-encoding the parameter values
    url = "https://api.pushshift.io/reddit/search/comment/"
    params = {"q": query, "subreddit": subreddit, "aggs": aggs, "frequency": frequency}
    
    # Send the request and save the response into the response object
    response = requests.get(url, params=params)
    
    # Check the response; raise an HTTPError if the request failed
    response.raise_for_status()
    
    # Parse the JSON into a Python dictionary
    # and return it for further processing
    return response.json()
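
Keywords that contain spaces or punctuation need to be percent-encoded before they go into a query string. The sketch below shows one way to build the same URL with the standard library's `urlencode`. `BASE_URL` and `build_agg_url` are hypothetical names used only for illustration, the multi-word keyword is made up, and no request is sent.

```python
from urllib.parse import urlencode

# Endpoint taken from subreddit_agg above
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_agg_url(query, subreddit, frequency="month", aggs="created_utc"):
    """Hypothetical helper: return the aggregate-search URL with encoded parameters."""
    params = {"q": query, "subreddit": subreddit, "aggs": aggs, "frequency": frequency}
    return f"{BASE_URL}?{urlencode(params)}"


# A made-up multi-word keyword, to show the encoding of the space
url = build_agg_url("science fiction", "scifi")
print(url)
```

`urlencode` replaces the space with `+`, so the keyword arrives at the API as one search term instead of breaking the URL.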

In [37]:
def time_agg_df(data, keyword, frequency="month"):
    """
    Returns cleaned Pandas DataFrame of keyword frequency over time, given correctly-formatted Python dictionary.
    Renames the frequency column to keyword; converts month to datetime.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    data: (dict) Python dictionary converted from JSON API response.
    keyword: (str) the keyword that was queried.
    frequency: (str) size of time buckets, which is also the name of the resulting DataFrame column. Defaults to "month".
    -------------------
    """
    
    # Convert the python object into a pandas dataframe
    df = pd.DataFrame(data["aggs"]["created_utc"])

    # Convert "key" into a datetime column
    df["key"] = pd.to_datetime(df["key"], unit="s", origin="unix")

    # Rename "key" to reflect the fact that it is the beginning of the time bucket
    df = df.rename(mapper={"key": frequency, "doc_count": keyword}, axis="columns")
    
    # Return the DataFrame
    return df
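
To make the expected input concrete, here is a small stand-in for the part of the response that `time_agg_df` reads. The shape is inferred from the function itself (`data["aggs"]["created_utc"]` holding `key` / `doc_count` pairs), and the counts are made up. The sketch uses plain Python to mirror what `pd.to_datetime(..., unit="s")` does with the epoch keys.

```python
from datetime import datetime, timezone

# Hand-written stand-in for an aggregate response; counts are illustrative only
sample = {
    "aggs": {
        "created_utc": [
            {"key": 1204329600, "doc_count": 2},
            {"key": 1207008000, "doc_count": 0},
        ]
    }
}

# Each "key" is the start of a time bucket, in seconds since the Unix epoch (UTC)
rows = [
    (
        datetime.fromtimestamp(bucket["key"], tz=timezone.utc).strftime("%Y-%m-%d"),
        bucket["doc_count"],
    )
    for bucket in sample["aggs"]["created_utc"]
]
print(rows)
```

Each tuple corresponds to one row of the resulting DataFrame: the bucket start date and the number of matching comments.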

In [38]:
# Function to convert the comment data into pandas dataframe
def data_df(data):
    """
    Returns Reddit comments in Pandas DataFrame, given the correctly-formatted Python dictionary.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    data: (dict) Python dictionary converted from JSON API response.
    -------------------
    """
    
    # Convert the comments into a pandas dataframe
    df = pd.DataFrame(data["data"])

    # Return the DataFrame
    return df

In [39]:
def df_to_csv(data, filename):
    """
    Basically just a wrapper around the Pandas `.to_csv()` method,
    created to standardize the inputs and outputs.
    
    ---- Arguments ----
    data: (pd.DataFrame) Pandas DataFrame to be saved as a csv.
    filename: (str) name or path of the file to be saved.
    -------------------
    """
    
    # Saves the DataFrame to csv
    data.to_csv(path_or_buf=filename)
    
    # And that's it, folks!

---

In [77]:
# Combines all the above functions so it can be called once
# on a list of subreddits with a list of keywords
# to get back both DataFrames

def reddit_data_setter(keywords, subreddits, csv=False, frequency="month", aggs="created_utc"):
    """
    Creates two DataFrames that hold the combined data for each combination of keyword / subreddit.
    
    Note: if you're reading this note, that means that this function is still only written
    with the intention of automating a specific set of actions for a specific project.
    
    ---- Arguments ----
    keywords: (list) keyword(s) to search.
    subreddits: (list) name of subreddit(s) to include.
    csv: (bool) if True, save the resulting dataframes as csv file.
    frequency: (str) set the size of the time buckets.
    aggs: (str) aggregate function name. Default is "created_utc".
    (For more information, read the PushShift API Documentation.)
    -------------------
    """
    from time import sleep
    from functools import reduce

    comment_df_list = []  # Empty list to hold comment dataframes
    word_df_list = []  # Empty list to hold monthly word count dataframes
    
    # df_main and df_comm are assembled from these lists after the loop
    
    # Loop through keywords and subreddits to create dictionaries for each combination
    # subreddit is outer loop because I want to go through list of keywords
    # for one subreddit before moving onto the next
    for subreddit in subreddits:
        
        for word in keywords:
            # Create column name that matches above / main DataFrame
            col_name = f"{subreddit}_{word}"
            
            # Indicate current subreddit / keyword, then pause briefly
            print(f"Starting {col_name}")
            sleep(0.5)
            print("...")
            print()

            # Make request and convert response to dictionary
            dictionary = subreddit_agg(word, subreddit, frequency=frequency, aggs=aggs)
  
            # Append aggs word count df to word_df_list
            word_df_list.append(time_agg_df(dictionary, col_name, frequency=frequency))

            # Append comments df to comment_df_list
            comment_df_list.append(data_df(dictionary))
            
            # Sleep for a total of 1 sec after the request to stay within API limits
            print(f"Finished {col_name}")
            sleep(0.5)
            print("...")
            sleep(0.5)
            print()
    
    # Set index to the time-bucket column, then concat
    df_main = pd.concat([df.set_index(frequency) for df in word_df_list], axis=1, join="outer").reset_index()
    
    # Concatenate the dataframes in comment_df_list
    df_comm = pd.concat(comment_df_list, axis=0, sort=False, join="outer", ignore_index=True)
        
    if csv:
        df_to_csv(df_main, f"monthly-{'_'.join(subreddits)}-{'_'.join(keywords)}.csv")
        df_to_csv(df_comm, f"comments-{'_'.join(subreddits)}-{'_'.join(keywords)}.csv")
    
    # Return df_main, df_comm, respectively
    return df_main, df_comm
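
The fixed pauses above keep the request rate low, but a single failed request still stops the whole run. One possible alternative, sketched here and not what the notebook does, is to retry with an increasing delay. `fetch_with_retry` and `flaky_fetch` are hypothetical helpers, and the stand-in raises `RuntimeError` instead of making a real HTTP call.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RuntimeError:
            # Give up after the last allowed attempt
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice, then succeeds
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated failed request")
    return {"aggs": {"created_utc": []}, "data": []}


# Record the delays instead of actually sleeping
delays = []
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(calls["n"], delays)
```

In the real loop, `fetch` could be something like `lambda: subreddit_agg(word, subreddit)`, with the exception type adjusted to whatever that call raises on failure.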


---

In [78]:
subs = ["suggestmeabook", "books", "scifi"]
words = ["utopia", "dystopia"]

df_main, df_comm = reddit_data_setter(words, subs, True)

Starting suggestmeabook_utopia
...

Finished suggestmeabook_utopia
...

Starting suggestmeabook_dystopia
...

Finished suggestmeabook_dystopia
...

Starting books_utopia
...

Finished books_utopia
...

Starting books_dystopia
...

Finished books_dystopia
...

Starting scifi_utopia
...

Finished scifi_utopia
...

Starting scifi_dystopia
...

Finished scifi_dystopia
...
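
The progress log follows the nesting of the loops in `reddit_data_setter`: every keyword is fetched for one subreddit before the next subreddit starts. The same ordering, along with the resulting column and file names, can be reproduced without calling the API. A small sketch using `itertools.product`:

```python
from itertools import product

subs = ["suggestmeabook", "books", "scifi"]
words = ["utopia", "dystopia"]

# product() varies its last argument fastest, matching the nested loops
col_names = [f"{sub}_{word}" for sub, word in product(subs, words)]
print(col_names)

# File name pattern used for the monthly counts when csv=True
monthly_csv = f"monthly-{'_'.join(subs)}-{'_'.join(words)}.csv"
print(monthly_csv)
```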



---

## Visualization

#### TODO for Viz

- [ ] Set initial visible area to 2014-2019
- [ ] Combine Plots

In [79]:
# Bokeh Line Plots for each subreddit

# Output to current notebook
output_notebook()

for sub in subs:
    # Create new plot with title and axis labels
    p = figure(title="Utopia vs Dystopia", x_axis_type="datetime", x_axis_label='Date', y_axis_label='Frequency')

    # Add a line renderer with legend and line thickness
    p.line(df_main["month"], df_main[f"{sub}_utopia"], legend_label="Utopia", line_width=2, line_color="blue")
    p.line(df_main["month"], df_main[f"{sub}_dystopia"], legend_label="Dystopia", line_width=2, line_color="red")

    # Style the legend
    p.legend.location = "top_left"
    p.legend.title = f"Comments in r/{sub} that mention:"
    p.legend.title_text_font_style = "bold"
    p.legend.title_text_font_size = "8pt"

    # Show the results
    show(p)

In [80]:
print(df_main.shape)
df_main.head()

(139, 7)


| | month | suggestmeabook_utopia | suggestmeabook_dystopia | books_utopia | books_dystopia | scifi_utopia | scifi_dystopia |
|---|---|---|---|---|---|---|---|
| 0 | 2008-03-01 | | | | | 2 | |
| 1 | 2008-04-01 | | | | | 0 | |
| 2 | 2008-05-01 | | | | | 3 | 5.0 |
| 3 | 2008-06-01 | | | | | 0 | 0.0 |
| 4 | 2008-07-01 | | | | | 0 | 1.0 |
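
The empty cells in the preview are what an outer join produces: presumably each subreddit / keyword pair only has buckets for the months the API returned, so any month present in one column but absent from another is filled with a missing value. A minimal sketch with two hand-made frames (the counts are illustrative, not real results):

```python
import pandas as pd

# Two keyword-count frames that cover different months
utopia = pd.DataFrame({
    "month": pd.to_datetime(["2008-03-01", "2008-04-01"]),
    "scifi_utopia": [2, 0],
})
dystopia = pd.DataFrame({
    "month": pd.to_datetime(["2008-04-01", "2008-05-01"]),
    "scifi_dystopia": [5, 0],
})

# Same pattern as in reddit_data_setter: align on the month index, keep every month
merged = pd.concat(
    [df.set_index("month") for df in (utopia, dystopia)],
    axis=1,
    join="outer",
).reset_index()
print(merged)
```

The merged frame has one row per month seen in either input, with `NaN` wherever a column has no bucket for that month. Those gaps could be filled with `fillna(0)` if a missing bucket should count as zero mentions, though that is an analysis decision and not something the notebook does.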


In [82]:
print(df_comm.shape)
df_comm.head(2)

(150, 31)


| | all_awardings | author | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | author_flair_text_color | author_flair_type | author_fullname | author_patreon_flair | body | created_utc | gildings | id | is_submitter | link_id | locked | no_follow | parent_id | permalink | retrieved_on | score | send_replies | steward_reports | stickied | subreddit | subreddit_id | total_awards_received | awarders | updated_utc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Elsbeth55 | | | [] | | | | text | t2_u0n8csl | False | Not exactly- but “Everfair” by Shawl reimagine... | 1569112901 | {} | f10i4lk | False | t3_d7g2k9 | False | True | t3_d7g2k9 | /r/suggestmeabook/comments/d7g2k9/this_probabl... | 1569113149 | 1 | True | [] | False | suggestmeabook | t5_31t41 | 0 | | |
| 1 | [] | IAintBlackNoMore | | | [] | | | | text | t2_173ps5 | False | Personally, I think starting with the classics... | 1568935329 | {} | f0u78mu | False | t3_d6lhb2 | False | True | t3_d6lhb2 | /r/suggestmeabook/comments/d6lhb2/learning_abo... | 1568935365 | 1 | True | [] | False | suggestmeabook | t5_31t41 | 0 | | |
