# INFO 6350 Final Project

## Your info
* NetID(s): tz332
* Name(s): Tammy Zhang
---

In [1]:
# needed installations
# pip install praw
# pip install geonamescache

In [3]:
# imports
import pandas as pd
import numpy as np
import json
import geonamescache
import spacy
import praw

## 1. Introduction and hypothesis

_What problem are you working on? Why is it interesting and important? What have other people said about it? What do you expect to find?_

![A screenshot of the visualization upon launch](home.png)

### Research Problem

**How can NLP approaches and tools such as spaCy be used in developing informative and engaging interactive visualizations for text-based data?**

I investigate how computational text tools such as spaCy can supplement interactive web visualization pipelines. More specifically, I experiment with incorporating spaCy's entity recognition abilities into a visualization development workflow for exploring answers to five similar US-state-based questions posed on [r/AskReddit](https://www.reddit.com/r/AskReddit/), a popular community on the social media platform Reddit. 

My contribution to this space includes the following: 1.) a simple example web application that visualizes text data on a U.S. state map for five popular r/AskReddit questions, drawing on data generated through text processing that includes spaCy's entity recognition; 2.) a discussion of possible future directions for using NLP methods in text data visualization, based on what I learned while creating a custom workflow; and 3.) the specifications for a set of Python functions that together take a URL to a Reddit post with thousands of comments and transform it into a `.json` file compatible with open-source, lightweight JavaScript libraries such as [d3.js](https://d3js.org/), so that interactive map visualizations can be generated rapidly and shared on the web with minimal hosting needs.

### Background and Motivation

Visualizing text data is noted as a [particularly challenging task in data visualization courses](https://hci.stanford.edu/courses/cs448b/f12/lectures/CS448B-20121115-Text.pdf) compared to traditional quantitative data, primarily because of the curse of dimensionality and the risk of producing results with low interpretability. However, much of the data on today's Internet is text - particularly user-generated content - which may carry important information on current social, economic, and emotional landscapes and trends, but runs the risk of being buried and overlooked due to the sheer quantity of text produced.

One such example of a source of user-generated texts is Reddit, an American social media platform that is currently [one of the most visited websites in the world](https://www.sci-tech-today.com/stats/reddit-statistics/). There are currently more than 500 million Reddit users, expected to increase to 556 million in the next few years. The website includes over 100000 subreddits, which form communities of common interest where users can gather to post text, images, and other content regarding particular topics; users often add their own replies under others' posts and to other people's comments as well. Of these subreddits, the second most popular is r/AskReddit with over 50 million members, where users post questions for others to answer.

These questions are frequently of a humorous, rhetorical, and opinion-based nature - but many may also be serious and telling of more widescale social attitudes, movements, and trends. Humanities researchers can potentially discover emerging themes and patterns of interest in investigating the responses of thousands of people to questions such as "What keeps you going in difficult times?" "What are you worried about right now?" "What is a personal experience you've had with __ ?" However, individual replies to popular posts frequently number in the thousands, and comments are often buried under multiple levels of nested replies or under algorithmic sorting that prioritizes "hot" comments with more upvotes. Extracting meaningful themes from these threads may therefore be difficult. Visualization is often used to get broad-level insights about general trends in a body of data, but generating effective visualizations generally requires structured data - again, something challenging to extract from thousands of text comments, even if Reddit's structure around subreddits might somewhat help concentrate text around specific topics.

Thus, my project aims to complete a preliminary exploration of how someone might begin to bring information buried across thousands of comments up into a visible interface. I do this by exploring a particular category of question on r/AskReddit - those related to American states - to see if very simple NLP methods might add value into data visualization workflows and interface designs.

I selected Reddit questions related to American states in particular because they pose a relatively simple task that can be addressed to some extent using entity recognition, and because the expected visual representation of the information is fairly standard - a map. By focusing on questions with a simpler, more closed set of possible answers, I can experiment with my workflow instead of overly concerning myself with the most intuitive ways of representing patterns in user responses. I discuss in Future Work several directions in which I can see NLP playing more advanced roles and with different visualization forms.

This direction was also informed in part by my research background and current work in interactive data visualization. After building a number of web-based visualizations of different datasets, many multidimensional and/or geospatial, I've found that there is an overwhelming number of tools and strategies to choose from each time I plan data processing and interaction design. However, I have not yet used any computational text processing in my own work, nor have I attempted to make complex interactive visualizations of primarily text-based data, so I felt that this project would be a good opportunity to begin experimenting.

### Related Work

A number of users have attempted to create data visualizations for Reddit in the past, but most have focused on website-level visualization of connections between overall subreddits based on content similarity or shared user activity ([1](https://www.reddit.com/r/dataisbeautiful/comments/mfmlho/oc_ive_made_an_interactive_map_of_reddit_based_on/), [2](https://redditmap.social/)). After searching through several visualization-related subreddits, I found only a few user-created posts attempting to visualize answers to r/AskReddit questions: [What is one country that you will never visit again?](https://www.reddit.com/r/dataisbeautiful/comments/omr3x4/comment/h5mqqi0/) and [The most hated U.S. states, according to r/AskReddit](https://www.reddit.com/r/dataisbeautiful/comments/p3yy6o/oc_the_most_hated_us_states_according_to/). Both posts were relatively popular, but both graphics are static and neither user explicitly mentioned using any particular text processing methods in their workflows.

I also searched for scholarly works related to r/AskReddit; I found that most papers tended to be high-level (focused on differences/relationships between communities) or more atomic (focused on conversation-level signals, trends, and interpersonal interaction). I was not able to locate any papers focused on using computational text techniques to cluster comments on a _post level_ to identify emerging themes in one particular context (i.e. one instance of a question posed by a user), or for use in data visualization.

More generally, I became interested in Reddit as a data source partly after reading [Antoniak et al. (2019)](https://maria-antoniak.github.io/resources/2019_cscw_birth_stories.pdf), in which birth stories, which historically have not been a focus for researchers, gained visibility as people used Reddit to share their personal narratives. I was also interested in the intersection of geographic visualization through a map and text analysis in [Wilkens et al. (2024)](https://tuprints.ulb.tu-darmstadt.de/27523/1/3917_Small_Worlds_Conference_Version.pdf). The inclusion of a map in Figure 2 led me to consider my past experiences collecting and cleaning data for visualization in maps, and prompted me to ask how text mining might integrate with complex data visualization - moving beyond simple count-based graphics like word clouds or lists.

### Hypothesis and Predictions

I expect that computational methods such as spaCy's entity recognition tools will be helpful in 1.) extracting and aggregating more information from Reddit comments than would be feasible manually and 2.) helping to standardize a workflow so that once a kind of visualization can be generated for a particular post's comments, it will be relatively quick and cheap to generate visualizations for posts similar in structure/topic.

However, I also expect that without further processing or the inclusion of more complex text processing methods than simply extracting data points of interest, the upfront value of the increased volume of information may be limited - primarily because it might be difficult to interpret meaningfully. Since people on the Internet frequently use sarcasm, informal language, inside jokes, and other difficult-to-detect tonal indicators that change their intended meaning, a blunt method that checks primarily for the presence of an entity is likely to produce confusing or inconsistent results at least some of the time.

## 2. Data and methods

_What data have you used? Where did it come from? How did you collect it? What are its limitations or omissions? What major methods will you use to analyze it? Why are those methods the appropriate ones?_

### Dataset Source and Description

My data consists of comments under five r/AskReddit questions, which each ask users to name particular U.S. states in their answers. Initially, I worked on developing a pipeline for creating a map-based visualization for the first question, with the smallest number of comments; once I had a working graphic for this example, I applied the same workflow to the rest of the questions. Each question's replies were fetched, processed, and visualized separately. For the first question, the length of each comment ranged from just two characters long to 4453 characters long.

1. [What do you think is the best U.S. state?](https://www.reddit.com/r/AskReddit/comments/172l3yj/what_do_you_think_is_the_best_us_state/) (4105 replies)

2. [What’s the one US state you absolutely will never step foot in and why?](https://www.reddit.com/r/AskReddit/comments/1bfq75y/whats_the_one_us_state_you_absolutely_will_never/) (10864 replies)

3. [What is perhaps the least talked about US state?](https://www.reddit.com/r/AskReddit/comments/13mat7e/what_is_perhaps_the_least_talked_about_us_state/) (5580 replies)

4. [US redditors, what does your state do better than all the others?](https://www.reddit.com/r/AskReddit/comments/4gjd4h/us_redditors_what_does_your_state_do_better_than/) (7775 replies)

5. [All 50 states are getting together for Thanksgiving dinner. What does your state bring and why?](https://www.reddit.com/r/AskReddit/comments/3uahca/all_50_states_are_getting_together_for/) (7225 replies)

### Methodology

#### Data Collection

I selected these particular posts by applying the search term "state" to r/AskReddit and then sorting by Top of All Time. My motivation in doing this was to select groups of texts that would be somewhat similar in their manner of addressing the question but would vary in meaning, and to ensure that each post would have enough comments to yield substantive data.

I collected this data using the [Python Reddit API Wrapper (PRAW) library](https://praw.readthedocs.io/en/stable/). I consulted the documentation on how to get started with registering an application on Reddit and authenticating through OAuth so that my requests would follow Reddit policy.

In [32]:
# initializing an authorized Reddit Instance for scraping comments
reddit = praw.Reddit(
    client_id="redacted",
    client_secret="redacted",
    password="redacted",
    user_agent="mac:6350-project:v1.0 (by /u/redacted)",
    username="redacted",
)

After setting up my Reddit authentication, I wrote two functions to help properly aggregate comments under a given post. `get_top_level_comments` only collected top level comments (replies directly to the post itself), and `get_all_comments` collected all comments under a post, including replies to other comments.

In [35]:
def get_top_level_comments(post_url):
    """
    A function for aggregating top-level comment responses to a Reddit post into a pandas dataframe.
    
    Parameter post_url: the url link to the intended Reddit post. Type string.
    
    Returns: a dataframe object.
    """
    # get reddit submission (request the post)
    submission = reddit.submission(url = post_url)
    
    # discard the "load more comments" placeholders without fetching them, so only already-loaded comments remain
    submission.comments.replace_more(limit = 0)  
    
    # extract top-level comments
    comments = []
    for comment in submission.comments:
        if comment.score > 0: # filtering out downvoted content
            comments.append({
                'body': comment.body,
                'upvotes': comment.score,
                'top_level': comment.is_root
            })
    
    # convert to dataframe
    df = pd.DataFrame(comments)
    return df

In [37]:
def get_all_comments(post_url):
    """
    A function for aggregating ** all ** comment responses to a Reddit post into a pandas dataframe.
    
    Parameter post_url: the url link to the intended Reddit post. Type string.
    
    Returns: a dataframe object.
    """
    # get reddit submission (request the post)
    submission = reddit.submission(url = post_url)
    
    comments = []
    
    submission.comments.replace_more(limit = None) # ensures all comments, not just top-level ones, are pulled
    
    for comment in submission.comments.list():
        if comment.score > 0: # filtering out downvoted content
            comments.append({
                'body': comment.body,
                'upvotes': comment.score,
                'top_level': comment.is_root
            })

    # convert to dataframe
    df = pd.DataFrame(comments)
    return df

I then applied these functions to my first question (_What do you think is the best U.S. state?_). I started out only working with top-level comments since these were less time-consuming to request and process, given that there were fewer of them (only 79 out of the total 4105), then later changed to working with all comments to populate the visualization with more information.
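
The difference between the two collection modes can be illustrated without calling Reddit at all. The sketch below uses plain dictionaries as hypothetical stand-ins for PRAW comment objects (the `body`, `score`, and `replies` keys mirror attribute names used above, but the sample data and the `flatten_comments` helper are invented for illustration and are not PRAW's implementation). It walks the nested tree breadth-first and applies the same `score > 0` filter.

```python
from collections import deque

# Hypothetical stand-in for a comment tree; real PRAW objects expose
# similar fields as attributes rather than dictionary keys.
sample_tree = [
    {"body": "Hawaii feels like a cheat code.", "score": 12, "replies": [
        {"body": "Agreed, Maui especially.", "score": 3, "replies": []},
        {"body": "Too expensive.", "score": -2, "replies": []},
    ]},
    {"body": "Colorado!", "score": 5, "replies": []},
]


def flatten_comments(top_level, include_replies=True):
    """Return one record per positively scored comment, breadth-first."""
    queue = deque((comment, True) for comment in top_level)
    records = []
    while queue:
        comment, is_root = queue.popleft()
        if comment["score"] > 0:  # same filter as the functions above
            records.append({
                "body": comment["body"],
                "upvotes": comment["score"],
                "top_level": is_root,
            })
        if include_replies:
            # nested replies are queued and marked as non-top-level
            queue.extend((reply, False) for reply in comment["replies"])
    return records


top_only = flatten_comments(sample_tree, include_replies=False)
everything = flatten_comments(sample_tree)
print(len(top_only), len(everything))
```

In this toy tree the top-level pass keeps two records and the full pass keeps three, since the downvoted reply is filtered out in both cases.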

In [40]:
# retrieving top level comments of a post
# r/AskReddit: What do you think is the best U.S. state?
# https://www.reddit.com/r/AskReddit/comments/172l3yj/what_do_you_think_is_the_best_us_state/
df_best_state = get_top_level_comments("https://www.reddit.com/r/AskReddit/comments/172l3yj/what_do_you_think_is_the_best_us_state/")

df_best_state

Unnamed: 0,body,upvotes,top_level
0,Denial. \n\nSeems there are tons of folk in th...,2866,True
1,Hawaii feels like a cheat code.,1866,True
2,"i live in Colorado, it’s pretty great",1822,True
3,New England is one of the prettiest places on ...,330,True
4,You could spend your whole life traveling in C...,3488,True
...,...,...,...
74,Us Rhode Islanders just don’t want to tell you...,6,True
75,Maine. There is a charm to Maine that is unlik...,9,True
76,Michigan has it's moments.\n\nNear freshwater ...,7,True
77,New Hampshire,10,True


In [42]:
# retrieving all comments
df_best_state_all = get_all_comments("https://www.reddit.com/r/AskReddit/comments/172l3yj/what_do_you_think_is_the_best_us_state/")

df_best_state_all

Unnamed: 0,body,upvotes,top_level
0,Denial. \n\nSeems there are tons of folk in th...,2871,True
1,Hawaii feels like a cheat code.,1857,True
2,"i live in Colorado, it’s pretty great",1828,True
3,New England is one of the prettiest places on ...,338,True
4,You could spend your whole life traveling in C...,3490,True
...,...,...,...
3421,"You would need an extra o, for sure. But dang...",1,False
3422,Haha then why do you have to make up lies to t...,1,False
3423,If you click the link it’s from NY to LA by gr...,1,False
3424,Name a single lie I said and prove it,1,False


I then referenced a [Stack Overflow post](https://stackoverflow.com/questions/59444065/differentiate-between-countries-and-cities-in-spacy-ner) on using the Python library `geonamescache` in conjunction with spaCy for more fine-tuned categorization of geopolitical entities, creating lists of location names for later reference as follows.

In [53]:
# code in this block written with assistance of this stackoverflow post: https://stackoverflow.com/questions/59444065/differentiate-between-countries-and-cities-in-spacy-ner
# getting lists of country, city, and U.S. state names for filtering later on

gc = geonamescache.GeonamesCache()

countries = gc.get_countries()
states = gc.get_us_states()
cities = gc.get_cities()

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)

cities = [*gen_dict_extract(cities, 'name')]
states = [*gen_dict_extract(states, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
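
Because the state list is later used for membership checks on entity text, one possible refinement is a case-insensitive lookup that maps any spelling variant back to a single canonical name. This is a minimal sketch, assuming a short hand-written list in place of the `geonamescache` output; `build_state_lookup` and `canonical_state` are hypothetical helpers, not part of the project code.

```python
# Hand-written sample standing in for the names pulled from geonamescache.
sample_states = ["Michigan", "New York", "New Hampshire", "Hawaii"]


def build_state_lookup(names):
    """Map a lowercased, whitespace-normalized name to its canonical form."""
    return {" ".join(name.lower().split()): name for name in names}


def canonical_state(text, lookup):
    """Return the canonical state name for `text`, or None if it is not a state."""
    key = " ".join(text.lower().split())
    return lookup.get(key)


lookup = build_state_lookup(sample_states)
print(canonical_state("new  york", lookup))
print(canonical_state("MICHIGAN", lookup))
print(canonical_state("Detroit", lookup))
```

A dictionary lookup avoids scanning the whole list for every entity, and returning one canonical spelling keeps variants such as "michigan" and "Michigan" from becoming separate entries later on.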

#### Entity Extraction Using spaCy

I then loaded the `en_core_web_sm` model, later changed to `en_core_web_lg` during the final visualization steps; I also created a list of the entity types that I was interested in including in the interface later on. This list evolved a lot during the course of the project - I initially started out with a much longer list, including organizations, products, and events, but found that relatively few entities of these types were returned from the data. After exploring my visualization, I limited the entity types under investigation to geopolitical entities, general locations, facilities, works of art, and nationalities/religious groups, as I observed that these types tended to be somewhat more numerous and informative.

In [57]:
# loading an nlp object
nlp = spacy.load('en_core_web_lg')

# entity types of interest for retaining for visualization later
entity_type_list = ["GPE", "LOC", "FAC", "WORK_OF_ART", "NORP"]

I then wrote a function for extracting these entities of interest and their corresponding text from a single document. `get_entities` returns a dictionary with a list of texts belonging to each entity type.

In [60]:
def get_entities(text): 
    """
    Helper function for add_entities_to_df(df). Given a text, extracts entities into separate lists based on their labels.
    
    Parameter text: A text of type string.
    Returns: a dictionary of lists of entity texts.
    """
    # sets are used throughout so that repeated mentions within one comment are stored once
    state_entities = set()
    geopolitical_entities = set()
    entity_dict = {entity_type: set() for entity_type in entity_type_list if entity_type != "GPE"}
        
    doc = nlp(text)
    for ent in doc.ents:
        
        if ent.label_ == "GPE":
            # geopolitical entities are split into U.S. states and everything else
            state_name = ent.text.title()
            if state_name in states:
                # store the title-cased form so "michigan" and "Michigan" share one key
                state_entities.add(state_name)
            else:
                geopolitical_entities.add(ent.text)
        elif ent.label_ in entity_dict:
            entity_dict[ent.label_].add(ent.text)
    
    result = {
        "states": list(state_entities),
        "GPE": list(geopolitical_entities),
        **{key: list(value) for key, value in entity_dict.items()}
    }
    
    return result

As an example, this is what one sample comment from Question 1's top level replies looks like:

In [66]:
# one selected comment
df_best_state.iloc[15]['body']

"I love Michigan! 4 seasons the weather's not bad, lots and lots of fresh water! If you like the city, Detroit and the metro area rocks, if you like the outdoors we have huge beautiful forests and more beautiful shoreline than almost any other state in the country! Also I will give a shout out to the Upper Peninsula which is pure God's country!"

And this is what it looks like after calling `get_entities`.

In [70]:
# example of output of calling get_entities on the previous sample comment
get_entities(df_best_state.iloc[15]['body'])

{'states': ['Michigan'],
 'GPE': ['Detroit'],
 'LOC': ['the Upper Peninsula'],
 'FAC': [],
 'WORK_OF_ART': [],
 'NORP': []}

Using this function as a helper, I then eventually created another function called `add_entities_to_df` that takes a dataframe of comments and adds columns to the right, essentially one column for each entity type. Each column added consists of a list of strings. The final form of this function required some experimentation to limit redundancy and to ensure the added entities aligned properly by row.

In [78]:
def add_entities_to_df(df):
    """
    Function that adds entity columns to a Reddit comment dataframe where each row is one comment.
    
    Parameter df: a dataframe holding comment data for a Reddit post where each row is one comment. 
    Returns: the input dataframe with extra columns, one for each entity type of interest, which contain lists of corresponding entity texts.
    """
    
    # calls first helper function get_entities, which returns - for each comment - a dictionary of lists of entities separated by type
    entities_lists = df["body"].apply(get_entities)
    
    # turns the returned dictionaries into a dataframe where each entity type is its own column,
    # reusing the original index so rows stay aligned even when df is a slice of a larger dataframe
    entities_df = pd.DataFrame(entities_lists.tolist(), index = df.index)
    
    # sticking this dataframe to the right of the original dataframe so each comment now has lists of entities separated into different columns by type in their row
    df_merged = pd.concat([df, entities_df], axis = 1)
    
    return df_merged

As an example, we can take a look at the first row of the Reddit data before calling `add_entities_to_df`...

In [82]:
df_best_state.iloc[:1]

Unnamed: 0,body,upvotes,top_level
0,Denial. \n\nSeems there are tons of folk in th...,2866,True


... And after.

In [86]:
add_entities_to_df(df_best_state.iloc[:1])

Unnamed: 0,body,upvotes,top_level,states,GPE,LOC,FAC,WORK_OF_ART,NORP
0,Denial. \n\nSeems there are tons of folk in th...,2866,True,[],[],[],[],[],[]


The most scripting-intensive part of the project occurred at this point. To be easily visualized in a web map, the data needs to be reorganized from one record per comment to one record per state, ideally in a JSON-serializable dictionary format. I therefore wrote two more functions - the dataframe-wide `generate_dictionary` and its row-level helper `reformat_data` - which transform the collection of Reddit comments discussing states alongside other entities into a dictionary where each state is a key with the following properties:

- `weight` - the cumulative number of upvotes received by comments that mentioned this state,
- `comments` - a list of strings, each the full text of a top-level comment mentioning this state,
- `replies` - a list of strings, each the full text of a non-top-level comment mentioning this state, and 
- entity types, each representing a list of strings (entity texts of that type mentioned in the same comment as the state of interest).
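
The reshaping from comment-indexed rows to a state-indexed dictionary can be sketched on a few hand-made records. This minimal example uses plain dictionaries in place of dataframe rows and a single entity type; the sample records and the `aggregate_by_state` helper are illustrative assumptions rather than the functions defined below.

```python
def aggregate_by_state(records, entity_types=("LOC",)):
    """Group comment records by state, summing upvotes and pooling entities."""
    by_state = {}
    for record in records:
        for state in record["states"]:
            # create the state's entry on first sight, with fresh empty lists
            entry = by_state.setdefault(state, {
                "weight": 0,
                "comments": [],
                "replies": [],
                **{etype: [] for etype in entity_types},
            })
            entry["weight"] += record["upvotes"]
            bucket = "comments" if record["top_level"] else "replies"
            entry[bucket].append(record["body"])
            for etype in entity_types:
                # sorted(set(...)) builds a fresh list per state, so states
                # mentioned in the same comment never share one list object
                entry[etype] = sorted(set(entry[etype]) | set(record[etype]))
    return by_state


# Hand-made sample records (hypothetical data).
records = [
    {"body": "Michigan and Maine both have great shoreline.", "upvotes": 10,
     "top_level": True, "states": ["Michigan", "Maine"],
     "LOC": ["the Upper Peninsula"]},
    {"body": "Michigan winters are rough though.", "upvotes": 4,
     "top_level": False, "states": ["Michigan"], "LOC": ["Lake Superior"]},
]

result = aggregate_by_state(records)
print(result["Michigan"]["weight"])
print(result["Maine"]["LOC"])
```

Here Michigan accumulates the upvotes of both records, while Maine keeps only the entities from the one comment that mentioned it.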

In [95]:
def generate_dictionary(df):
    """
    Function that takes a Reddit comment dataframe with no added information as input and ultimately returns a 
    JSON-serializable dictionary indexed by U.S. state name.
    
    Parameter df: a Reddit comment dataframe with columns `body`, `upvotes`, and `top_level`, where `body` refers to the comment 
    text, `upvotes` refers to the number of upvotes on the comment, and `top_level` is a boolean indicating if the comment is a 
    direct reply to the post or not.
    
    Returns location_data: a dictionary of U.S. state names, each one representing a key to the following properties: 
    `weight` - the cumulative number of upvotes received by comments that mentioned this state,
    `comments` - a list of strings, each the full text of a top-level comment mentioning this state,
    `replies` - a list of strings, each the full text of a non-top-level comment mentioning this state,
    and a series of entity type strings, each representing a list of strings (entity texts of that type mentioned in the same 
    comment as the state of interest).
    
    """
    # initializes empty dictionary
    location_data = {}
    
    # calling add_entities_to_df to add entity columns to each comment
    df_merged = add_entities_to_df(df)
                                   
    # calling helper function reformat_data, which populates the previously initiated empty dictionary, with data by US state
    df_merged.apply(lambda row: reformat_data(row, location_data), axis = 1)
    
    return location_data

In [97]:
def reformat_data(df_row, location_data):
    """
    Helper function for generate_dictionary(df) that takes a row of a dataframe and updates, in place, a dictionary holding information for the overall dataframe. 
    The dictionary is keyed by U.S. state name and holds each state's aggregated information, instead of being indexed by comment as in the dataframe.
    
    Parameter df_row: a row of a dataframe of Reddit comments. Includes a comment's body text, upvote count, and entities (separated by type).
    Parameter location_data: a dictionary where each U.S. state is a key and has the following properties: 
    `weight`, `comments`, `replies`, and one list of entity texts per entity type (see generate_dictionary for details).
    
    Returns: none. Updates location_data in place.
    """
    # list of states mentioned in this row (aka in this one comment)
    states_list = list(df_row["states"])
    
    # iterating over each state mentioned in this comment
    for state in states_list:
        
        # if this state hasn't already been added to the dictionary, initialize an entry for it
        if state not in location_data:
            location_data[state] = {}
            location_data[state]["weight"] = df_row["upvotes"] # add a key for upvote info
            
            if df_row["top_level"]: # if this comment is top level, add it to a key for storing comments
                location_data[state]["comments"] = [df_row["body"]]
                location_data[state]["replies"] = [] # initiate key for storing replies, empty for now
        
            else: # otherwise, add it to a key for storing replies
                location_data[state]["comments"] = []
                location_data[state]["replies"] = [df_row["body"]]
                
            
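            # iterating over entity types of interest, creating a key for each state that holds a list of entities
            # (copying the list so that states mentioned in the same comment do not share one list object)
            for entity_type in entity_type_list:
                location_data[state][entity_type] = list(df_row[entity_type])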
                
        else:
            # the state has already been added to the dictionary - access the entry and update the fields
            location_data[state]["weight"] += df_row["upvotes"] # add this comment's upvotes to this state's `weight` key
            
            if df_row["top_level"]:
                location_data[state]["comments"].append(df_row["body"]) # add this comment's text to the list of comments for this state
            else:
                location_data[state]["replies"].append(df_row["body"])
            
            # iterating over entity types of interest, merging this comment's entities into the state's existing lists without duplicates
            for entity_type in entity_type_list:
                if df_row[entity_type]: 
                    location_data[state][entity_type].extend(df_row[entity_type])
                    location_data[state][entity_type] = list(set(location_data[state][entity_type]))

To demonstrate what effect `generate_dictionary` has, we can look at the first 10 comments before:

In [100]:
df_best_state.iloc[:10]

Unnamed: 0,body,upvotes,top_level
0,Denial. \n\nSeems there are tons of folk in th...,2866,True
1,Hawaii feels like a cheat code.,1866,True
2,"i live in Colorado, it’s pretty great",1822,True
3,New England is one of the prettiest places on ...,330,True
4,You could spend your whole life traveling in C...,3488,True
5,I’ve been to all 50 and my personal favorite t...,198,True
6,Washington and NY are my favourites. West coas...,180,True
7,Oregon 300 miles of public beaches,432,True
8,"If money isn't an issue, California.",537,True
9,Washington hands down. I lived there briefly a...,616,True


and after.

In [103]:
generate_dictionary(df_best_state.iloc[:10])

{'Hawaii': {'weight': 1866,
  'comments': ['Hawaii feels like a cheat code.'],
  'replies': [],
  'GPE': [],
  'LOC': [],
  'FAC': [],
  'WORK_OF_ART': [],
  'NORP': []},
 'Colorado': {'weight': 1822,
  'comments': ['i live in Colorado, it’s pretty great'],
  'replies': [],
  'GPE': [],
  'LOC': [],
  'FAC': [],
  'WORK_OF_ART': [],
  'NORP': []},
 'California': {'weight': 4025,
  'comments': ['You could spend your whole life traveling in California and never see the whole thing or get tired of the landscape.',
   "If money isn't an issue, California."],
  'replies': [],
  'GPE': [],
  'LOC': [],
  'FAC': [],
  'WORK_OF_ART': [],
  'NORP': []},
 'Maine': {'weight': 198,
  'comments': ['I’ve been to all 50 and my personal favorite to visit has been Maine!'],
  'replies': [],
  'GPE': [],
  'LOC': [],
  'FAC': [],
  'WORK_OF_ART': [],
  'NORP': []},
 'Washington': {'weight': 796,
  'comments': ['Washington and NY are my favourites. West coast and east coast, both beautiful places but ver

After iterating over my functions and the intended output format, I finally reached the point where the comment data for the first question could be converted into a `.json` file - a standard format for interactive web data visualization.

In [108]:
# generating a dictionary of US states where for each one, we have collected comments that mentioned them and associated entities of particular types
dictionary_best_state_all = generate_dictionary(df_best_state_all)

# saving this dictionary as a json file for visualization in javascript
file_path = "best-state-results-all.json"
with open(file_path, "w") as json_file:
    json.dump(dictionary_best_state_all, json_file, indent = 4)
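
Before handing the file to the front end, it can be useful to confirm that the data loads back cleanly and to see which states carry the most weight. This is a minimal sketch on a tiny hand-made dictionary with the same shape as the output above (values loosely borrowed from it); `rank_states_by_weight` is a hypothetical helper, and a JSON string stands in for the file on disk.

```python
import json

sample_output = {
    "Hawaii": {"weight": 1866, "comments": ["Hawaii feels like a cheat code."], "replies": []},
    "Colorado": {"weight": 1822, "comments": ["i live in Colorado, it's pretty great"], "replies": []},
    "California": {"weight": 4025, "comments": [], "replies": []},
}


def rank_states_by_weight(data):
    """Return (state, weight) pairs sorted from highest to lowest weight."""
    return sorted(
        ((state, info["weight"]) for state, info in data.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Round-trip through a JSON string to mimic writing and re-reading the file.
reloaded = json.loads(json.dumps(sample_output, indent=4))
ranking = rank_states_by_weight(reloaded)
print(ranking[0])
```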

## 3. Results

_What did you find? How did you find it? How should we read your figures? Be sure to include confidence intervals or other measures of statistical significance or uncertainty where appropriate._

After producing a `.json` file containing the Reddit comments and spaCy entities associated with each U.S. state, I followed a fairly standard JavaScript workflow (loading the file using d3.js and constructing a U.S. map using a separate TopoJSON file). Further interactions, such as populating a sidebar with state data upon hover and click, were also implemented in JavaScript. The final interface appears as below and is accessible live at [https://tammyzhang-1.github.io/reddit-text-viz/](https://tammyzhang-1.github.io/reddit-text-viz/) (best viewed on a MacBook).

#### Home Page

![A screenshot of the visualization upon launch](home.png)

States were colored based on their `weight`, calculated through the functions presented above as the total number of upvotes across all comments that mentioned that state; states with higher weight have darker fills. However, this should be read not as "what r/AskReddit users think is the best state" but as "which states are discussed most often when people talk about the best state". The distinction is subtle but crucial - exploring the individual comments tagged for each state reveals a good number of sarcastic or negative replies (such as "definitely not Illinois").
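
The mapping from `weight` to fill darkness can be pictured as a simple normalization step. The live application does this in JavaScript, and the exact scale it uses is not described here, so the following is only a Python sketch of one plausible approach (a linear scale onto a small number of shade steps); `shade_index` and the sample weights are assumptions.

```python
def shade_index(weight, min_weight, max_weight, steps=5):
    """Map a weight onto an integer shade from 0 (lightest) to steps - 1 (darkest)."""
    if max_weight == min_weight:
        return 0  # all states equal: avoid dividing by zero
    fraction = (weight - min_weight) / (max_weight - min_weight)
    # Clamp so the maximum weight lands in the top bucket rather than past it.
    return min(int(fraction * steps), steps - 1)


# Sample weights (hypothetical subset of states).
weights = {"California": 4025, "Hawaii": 1866, "Maine": 198}
low, high = min(weights.values()), max(weights.values())
shades = {state: shade_index(w, low, high) for state, w in weights.items()}
print(shades)
```

Upvote totals tend to be skewed toward a handful of states, so a logarithmic or quantile-based scale might separate the lower-weight states more clearly than a linear one.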

#### Viewing State Data

![Sidebar populated with state information](sidebar.png)

Upon clicking or hovering on a state, the sidebar populates with a scrollable box of all comments mentioning this state, as well as lists of entities mentioned in the same content. Some entities appear to be unrelated to or distant from the state at hand, likely due to unresolved co-references in the same comment to other locations. Still, there are some notable patterns of co-occurrence that make sense, such as the appearance of major national parks in the LOC section, which suggests that access to nature may be a theme in the discussion of what makes a state "the best". Similarly, the mention of particular groups (NORPs) such as racial groups, religious groups, and political parties (notably Christians and Republicans in southern states) suggests that part of the discussion centers on one's attitude towards these groups.

#### Switching Between Questions / Datasets

![Dropdown with question options expanded](dropdown.png)

After verifying that the `.json` file output by my script could viably be visualized using Javascript, I replicated the process for the four other questions of interest. This was a surprisingly quick process - after the script to request each post's comments had run, only a few lines of Python were needed in order to repeat processing the data. On the web application side, it took less than a minute to add each additional post's map once the file was obtained.

In [134]:
def get_state_data_from_reddit(url, output_file):
    """
    Making use of the previously defined functions to create a function that takes a url to a Reddit post, searches the top level
    comments for entities associated with U.S. states using spacy, and creates a json file that can be visualized in a web 
    browser using a simplistic Javascript map.
    
    Returns: none. Writes a json file to the current directory with name output_file.
    Parameter url: String. A URL to a Reddit post that might have comments pertaining to U.S. states.
    Parameter output_file: String. A name for the output json file generated from the comments, containing information aggregated by U.S. state.
    """
    df = get_all_comments(url)
    state_dict = generate_dictionary(df)
    
    # write the aggregated state data to the requested output file
    with open(output_file, "w") as json_file:
        json.dump(state_dict, json_file, indent = 4)

In [None]:
# not rerun in order to avoid exceeding Reddit rate limits
get_state_data_from_reddit("https://www.reddit.com/r/AskReddit/comments/13mat7e/what_is_perhaps_the_least_talked_about_us_state/", "least-talked-about.json")
get_state_data_from_reddit("https://www.reddit.com/r/AskReddit/comments/1bfq75y/whats_the_one_us_state_you_absolutely_will_never/", "avoided-states.json")
get_state_data_from_reddit("https://www.reddit.com/r/AskReddit/comments/4gjd4h/us_redditors_what_does_your_state_do_better_than/", "better-states.json")
get_state_data_from_reddit("https://www.reddit.com/r/AskReddit/comments/3uahca/all_50_states_are_getting_together_for/", "thanksgiving.json")

The maps aligned fairly well visually with the conclusions that one might get from quickly checking top comments at a glance, or from general intuition about the U.S. landscape. For example, when asked what is perhaps the least talked about US state, discussion visibly shifted towards more rural states such as the Dakotas, Montana, and Wyoming (shown through darker colors on the map).

Exploring the datasets through the map yielded some interesting insights - it was certainly a new way of viewing information that was once contained in nested layers of comments underneath a post. By organizing comments spatially, it was easier to quickly gather ideas about shared experiences and patterns across the country. For example, the question on what state you would never set foot in and why was marked by increased references to particular entities - especially location-based ones, including remote roads and towns that might be associated with crimes or folklore - and increased discussion of racial groups. A quick overview of comments regarding southern states suggested that much of the avoidance of particular states could be motivated by safety concerns related to one's race/ethnicity in historically dangerous areas.

## 4. Discussion and conclusions

_What does it all mean? Do your results support your hypothesis? Why or why not? What are the limitations of your study and how might those limitations be addressed in future work?_

Overall, results were consistent with my hypothesis. Using very basic text processing methods such as spaCy's entity recognition allowed the extraction of more data to consider when visualizing responses to r/AskReddit questions, and the approach could be rapidly extended to create similar data visualizations for a topic. However, the limitations are considerable, due largely to the simplicity of the method used, and many exciting directions for future work in this area exist.

### Limitations

This work was marked by major limitations through its usage of an extremely simple approach - aggregating comments based on their mention of U.S. states and associating each state with different spaCy-recognized entities for display alongside a geospatial visualization. As mentioned earlier, the lack of further processing means that we cannot distinguish whether a state is mentioned as an intended answer to a question or as part of an expression of disagreement/opposition. Furthermore, many comments mention more than one state - some comments were very lengthy and included a large range of states and entities. Under the current workflow, all entities are associated with all states mentioned in the same content. A more developed, refined, and customized NLP pipeline is needed to isolate particular targets from one another and pick out which entities are associated with which states.
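One possible first step towards narrowing these associations is to link an entity to a state only when the two appear in the same sentence, rather than anywhere in the same comment. The sketch below is an illustration of that idea under simplifying assumptions and is not part of this project's pipeline: the function `entities_by_state` is hypothetical, sentences are split with a naive regular expression instead of spaCy, and the entity list is assumed to have been extracted already.

```python
import re


def entities_by_state(text, states, entities):
    """Map each state to the entities that share a sentence with it."""
    # naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = {}
    for sentence in sentences:
        lowered = sentence.lower()
        # simple case-insensitive substring matching (assumption)
        found_states = [s for s in states if s.lower() in lowered]
        found_entities = [e for e in entities if e.lower() in lowered]
        for state in found_states:
            result.setdefault(state, set()).update(found_entities)
    return result


# made-up comment for illustration only
comment = "Utah has Zion and Arches. Florida has the Everglades."
linked = entities_by_state(
    comment,
    states=["Utah", "Florida"],
    entities=["Zion", "Arches", "Everglades"],
)
```

Substring matching is deliberately crude here (for example, "Kansas" would also match inside "Arkansas"), and sentence-level co-occurrence still cannot tell a recommendation from a complaint, so this would only address part of the limitation described above.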

While exploring the visualizations produced from the dataset, I also felt that while there certainly were emergent patterns made more visible through the inclusion of spaCy entities, the approach of just listing all entities out as text is limited. Including entities that may not make sense - such as locations very far away from a state, or text that appears to be miscategorized / not semantically meaningful - can serve to distract the user. While this project was helpful in confirming that it is technically feasible to collect types of text related to specific objects in a crowd question-and-answer instance, reindex this data to look at it from another angle, and then display this information in the browser, it is still important for data visualization designers to understand what relationships they want to make clearly visible in their graphics beforehand, and then tailor their workflows to produce clean, relevant datasets. Future work may involve customizing spaCy workflows to possibly identify the most relevant entities to a visualization, as well as experimentation with ways of visualizing these entities themselves. Alternatively, this kind of work with outputting aggregated spaCy entities could have applications for scientists who want to interpret the results of their models.

Finally, it should be noted that any work with Reddit involves working with user-generated content, which comes with its own set of ethical concerns. While I did not collect or keep any user information associated with each comment, I did retain individual comment texts in this project for use in visualization so that they could be referred back to later on. It is possible that exploring alternative ways of visualizing Reddit comments may result in people's responses - which they might have assumed to be fairly hidden or buried under other comments - becoming more visible than they would have expected, and particular care is needed when handling data that may include highly personal experiences or narratives.

### Future Work

After a very simple baseline experiment that added computationally processed text data via spaCy entity recognition, I have the sense that there are many possible future directions for including NLP methods in interactive data visualization development. Getting the data into the browser, as done in this project, is the first step - what can be done with the data afterwards, and what unique or insightful new forms it can take, is all but limitless.

An interesting idea is the further automation of generating visualizations for comments under Reddit posts - not just in the form of U.S. state maps or even world maps for location-related topics, but in an even more diverse array of forms. For example, clustering or network diagrams are commonly used for visualizations on a broader scale, but they might have interesting results even when applied to the comments under one post - particularly posts from communities with high engagement and concentration around particular topics at a time, such as questions posed on r/AskReddit.

This project showed that it is possible to write one Python script that can be used on multiple post-comment collections to quickly generate web visualizations for similar posts. Developing webpages with a Python server that can run text-processing functions, or the script shown here, would mean that visualizations could be generated on demand by users. For example, a user could paste in a link to a Reddit post of interest, wait for a script in the background to generate a visualization, and proceed to explore from there.

Another component I sketched in early iterations of the interface for this project was a [radar chart](https://en.wikipedia.org/wiki/Radar_chart) or something similar for visualizing word embedding cosine similarity with particular vectors representing concepts of interest (see Nelson, 2021). For example, we could identify areas of interest/concern for questions such as "What do you think is the best state in the U.S.?", possibly construct vectors related to these concerns (summarized as "nature", "affordability", "safety", etc.), and visualize how much each state's discussion is associated with each area through a radar chart where each point represents a concept. While this was not ultimately included in this project due to feasibility constraints, it represents only one of a multitude of ways I imagine text processing methods can be used to support data visualization in future projects.
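To make the radar chart idea more concrete, the values on each axis could be the cosine similarity between a vector summarizing a state's discussion and a vector for each concept. The sketch below only illustrates the arithmetic: the three-dimensional vectors are toy values rather than real word embeddings, and the names `concept_vectors`, `state_vector`, and `radar_values` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # similarity is undefined for a zero vector; return 0.0 by convention here
        return 0.0
    return dot / (norm_u * norm_v)


# toy concept vectors (assumption: real ones would come from an embedding model)
concept_vectors = {
    "nature": [1.0, 0.0, 0.0],
    "affordability": [0.0, 1.0, 0.0],
    "safety": [0.0, 0.0, 1.0],
}

# toy vector standing in for one state's aggregated comment embedding
state_vector = [3.0, 4.0, 0.0]

# one value per radar chart axis
radar_values = {
    concept: cosine_similarity(state_vector, vector)
    for concept, vector in concept_vectors.items()
}
```

With real embeddings, the concept vectors would not be orthogonal like these toy ones, so how to construct and validate them would be a design question of its own.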