# Sentiment analysis of Reddit opinions about walkability in D.C.

Sentiment analysis of user comments is a crucial tool for understanding socially relevant issues such as transportation and the walkability of cities. Its results can help guide decisions that improve the quality of life for locals, workers, and visitors alike.

By analyzing the sentiment of user opinions, we can gain a deeper understanding of how people feel about these factors and how they affect people's ability and willingness to get around a particular city. For instance, sentiment analysis can show whether people feel safe walking through particular neighborhoods or whether they find it easy to reach bike lanes or public transportation. It can also be used to find neighborhoods that are more walkable for particular groups of people, places where certain amenities are lacking, and places where there are disparities in walkability. Cities can work to build more equitable and inclusive communities by addressing these disparities.

In this work we gathered Reddit users' opinions on different transportation and walkability topics in D.C. The topics fall into three categories: cars, bicycles, and walking.

We first look at the most frequent words across categories.

!["wordcloud"](../img/wordcloud.png)

The word cloud shows that "metro" is mentioned as a transportation method even in comments about cars. To better understand the opinions, we predicted the sentiment of each comment as "positive", "negative", or "neutral".
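
As a rough illustration of that labeling step, the sketch below turns the three raw scores of a classifier into probabilities with a softmax and keeps the most likely class. The label names `NEG`, `NEU` and `POS`, their order, and the example scores are assumptions for illustration only; the actual model and its outputs are not shown in this post.

```python
import math

# Hypothetical label order; a real model may order its classes differently.
LABELS = ["NEG", "NEU", "POS"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label and the probability of every label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], dict(zip(LABELS, probs))


# Made-up scores for a single comment
label, probs = predict_label([0.2, 0.5, 2.1])
print(label, {name: round(p, 3) for name, p in probs.items()})
```

Keeping the per-class probabilities, and not only the winning label, is what makes a signed polarity score possible later on.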

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import iplot

# Load the previously computed sentiment scores for each category
b_sent = pd.read_csv("../data/bikes_sent.csv")
c_sent = pd.read_csv("../data/cars_sent.csv")
w_sent = pd.read_csv("../data/walk_sent.csv")

# Drop the leftover index column
b_sent = b_sent.drop(columns=["Unnamed: 0"])
c_sent = c_sent.drop(columns=["Unnamed: 0"])
w_sent = w_sent.drop(columns=["Unnamed: 0"])

# Tag each dataframe with its category
b_sent["category"] = "bike"
c_sent["category"] = "car"
w_sent["category"] = "walk"

# Use a common name for the comment text column
b_sent = b_sent.rename(columns={"bikes": "text"})
c_sent = c_sent.rename(columns={"cars": "text"})
w_sent = w_sent.rename(columns={"walk": "text"})

# Combine the three categories into a single dataframe
df_sent = pd.concat([b_sent, c_sent, w_sent], ignore_index=True)

# Signed polarity in [-1, 1]: POS when the positive score dominates,
# -NEG when the negative score dominates, and 0 when the two scores tie
df_sent["polarity"] = np.select(
    [df_sent["POS"] > df_sent["NEG"], df_sent["POS"] < df_sent["NEG"]],
    [df_sent["POS"], -df_sent["NEG"]],
    default=0.0,
)

# Polarity values per category, used by the box plot
y0 = df_sent.loc[df_sent["category"] == "bike", "polarity"]
y1 = df_sent.loc[df_sent["category"] == "walk", "polarity"]
y2 = df_sent.loc[df_sent["category"] == "car", "polarity"]
```

```python
# One notched box per category, showing the distribution of polarity
trace0 = go.Box(
    y=y0,
    name="bike",
    marker=dict(color="rgba(148, 46, 46, 0.784)"),
    notched=True,
)
trace1 = go.Box(
    y=y1,
    name="walk",
    marker=dict(color="rgb(153, 91, 40)"),
    notched=True,
)
trace2 = go.Box(
    y=y2,
    name="car",
    marker=dict(color="rgb(214, 163, 32)"),
    notched=True,
)

data = [trace0, trace1, trace2]
layout = go.Layout(title="Sentiment polarity boxplot by category")

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename="Sentiment polarity boxplot by category")
```

Comparing the sentiment of each category, the median polarity for bikes is higher than for walk and car. The walk median and most of its values are close to the neutral value of 0, although its third quartile (Q3) sits slightly in the positive range. The car category, on the other hand, falls below the neutral point: its median is -0.26 and its Q3 is close to the median, meaning most comments are below neutral and only some approach it. Interestingly, values range from -1 to 1 in every category, so all three contain both fully positive and fully negative comments.
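
The quartiles discussed above can also be computed directly instead of being read off the plot. The following is a minimal sketch using only the standard library; the polarity values are made up for illustration, and plotting libraries may use a slightly different interpolation rule for Q1 and Q3.

```python
from statistics import quantiles

# Made-up polarity values, only to illustrate the computation
polarity_by_category = {
    "bike": [-0.4, 0.1, 0.5, 0.8, 0.9],
    "walk": [-0.6, -0.1, 0.0, 0.2, 0.7],
    "car": [-0.9, -0.5, -0.3, -0.2, 0.6],
}


def box_summary(values):
    """Return the three quartiles summarised by a box plot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3}


summary = {cat: box_summary(vals) for cat, vals in polarity_by_category.items()}
for cat, stats in summary.items():
    print(cat, stats)
```

On the real data, a comparable summary could be obtained with `df_sent.groupby("category")["polarity"].describe()`.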

```python
# Individual comments as small markers
trace1 = go.Scatter(
    x=df_sent["polarity"], y=df_sent["category"], mode="markers", name="points",
    marker=dict(color="rgb(102,0,0)", size=2, opacity=0.4),
)
# Density contours of polarity by category
trace2 = go.Histogram2dContour(
    x=df_sent["polarity"], y=df_sent["category"], name="density", ncontours=30,
    colorscale="Hot", reversescale=True, showscale=False,
    hovertemplate="<br>Polarity: %{x}<br>Category: %{y}<br>Comments: %{z}",
)
# Marginal histogram of polarity (top)
trace3 = go.Histogram(
    x=df_sent["polarity"], name="Polarity",
    marker=dict(color="rgb(214, 163, 32)"),
    yaxis="y2",
    hovertemplate="<br>%{x}<br>",
)
# Marginal histogram of comments per category (right)
trace4 = go.Histogram(
    y=df_sent["category"], name="Category",
    marker=dict(color="rgba(148, 46, 46, 0.784)"),
    xaxis="x2",
)
data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    showlegend=False,
    autosize=False,
    width=800,
    height=750,
    xaxis=dict(domain=[0, 0.85], showgrid=False, zeroline=False),
    yaxis=dict(domain=[0, 0.85], showgrid=False, zeroline=False),
    margin=dict(t=50),
    hovermode="closest",
    bargap=0,
    xaxis2=dict(domain=[0.85, 1], showgrid=False, zeroline=False),
    yaxis2=dict(domain=[0.85, 1], showgrid=False, zeroline=False),
)

fig = go.Figure(data=data, layout=layout)
fig.update_layout(
    yaxis_title="Category",
    xaxis_title="Polarity",
    title="Sentiment polarity of posts by category",
)

iplot(fig, filename="2dhistogram-2d-density-plot-subplots")
```

An important factor to consider is that the walk category has many more comments than the other categories. For that reason, the density plot and histograms provide better insight.

The highest density of comments is at the most positive polarity in the walk category; hovering over the graph shows 382 comments there. This is interesting because, even though the walk category has a large number of comments in the neutral range, its largest concentration is at the most positive value. It is also interesting to see the density of neutral-negative and fully negative values in the same category. Keep in mind that we gathered threads related to each category, and a negative comment does not necessarily mean a negative opinion of the category itself. For example, a user saying "I hate that the walkability in D.C. is almost null" is not against walkability.

Furthermore, the density plot shows a low density of positive opinions for cars and no comments in the neutral-positive range. The histogram on the right shows the difference in the number of comments per category: even though the bike category has the fewest opinions, it has more positive comments than the car category.
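
Because the categories differ so much in size, raw counts can be misleading, and comparing shares within each category is often more informative. The sketch below illustrates the idea with made-up data; the `polarity_band` helper and its 0.33 cut-off are arbitrary choices for this example, not part of the analysis above.

```python
from collections import Counter

# Made-up (category, polarity) pairs, only to illustrate the computation
comments = [
    ("walk", 0.95), ("walk", 0.90), ("walk", 0.05), ("walk", -0.10),
    ("walk", -0.85), ("walk", 0.02),
    ("bike", 0.80), ("bike", 0.60), ("bike", -0.30),
    ("car", -0.70), ("car", -0.26), ("car", 0.10), ("car", 0.75),
]


def polarity_band(value, neutral_width=0.33):
    """Bucket a polarity in [-1, 1] into a coarse band (arbitrary cut-off)."""
    if value > neutral_width:
        return "positive"
    if value < -neutral_width:
        return "negative"
    return "neutral"


# Comments per (category, band) and per category
counts = Counter((cat, polarity_band(pol)) for cat, pol in comments)
totals = Counter(cat for cat, _ in comments)

# Share of each band within its own category, so that categories of
# different sizes can be compared on the same scale
shares = {(cat, band): n / totals[cat] for (cat, band), n in counts.items()}

for (cat, band), share in sorted(shares.items()):
    print(f"{cat:<5} {band:<9} {share:.2f}")
```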

Finally, to better represent the observed sentiments, we look at the most frequent pairs of words (bigrams) by category.

!["bigrams network"](../img/bigrams_network.png)


Even something as simple as pairs of words yields plenty of interesting findings.

One of the biggest trends among the bigrams is pairs of words that refer to popular places in D.C., such as Florida Avenue, Adams Morgan, and Columbia Heights. What is interesting about these pairs is how often they appear in each category.

For example, bike bigrams focus on specific avenues, blocks, and intersections, car bigrams refer to the suburbs in Maryland and Virginia, and walk bigrams mention parks, museums, and monuments alongside positive adjectives such as "beautiful". These findings agree with the other analyses in this work, which show that people in D.C. need to drive to reach the suburbs, while walking and biking are associated with more touristic places.

On the negative side, some bigrams deserve attention because they recur, for example "pedestrian" and "death" in the car category. Other negative bigrams in this category are "rush hour", "traffic camera", and "traffic safety". All of them reflect serious concerns that should be addressed. The walk category also has several "negative" bigrams, such as "sucks" and "unsafe", "better" and "infrastructure", "poor" and "person", and "homeless" and "people".
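
The bigram counts discussed above can be sketched as follows. This is a minimal illustration with made-up comments and a tiny, hypothetical stop-word list, not the pipeline used to build the network figure.

```python
import re
from collections import Counter

# Made-up comments, only to illustrate the computation
comments_by_category = {
    "car": [
        "Rush hour traffic is terrible",
        "Avoid rush hour if you drive from Virginia",
    ],
    "walk": [
        "Adams Morgan is a beautiful walk",
        "I walk through Adams Morgan every day",
    ],
}

# Tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"a", "is", "if", "you", "i", "the", "from"}


def bigrams(text):
    """Lowercase, tokenize, drop stop words, and pair up adjacent remaining tokens."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    return list(zip(tokens, tokens[1:]))


# Most frequent bigrams per category
top_bigrams = {}
for category, texts in comments_by_category.items():
    counter = Counter(pair for text in texts for pair in bigrams(text))
    top_bigrams[category] = counter.most_common(3)

for category, pairs in top_bigrams.items():
    print(category, pairs)
```

Counting per category, instead of over the whole corpus, is what lets place names and recurring complaints surface separately for cars, bikes, and walking.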
 

In conclusion, the sentiment analysis and visualizations show that people's attitudes toward different modes of transportation in D.C. vary. Although people tend to comment on bicycles more favorably than on walking or driving, every category also draws criticism.

The analysis of word pairs also sheds light on how different modes of transportation relate to different parts of the city. For instance, bigrams related to cars refer to the suburbs in Maryland and Virginia, while bigrams related to bikes concentrate on specific streets and blocks. Biking and walking are also associated with popular destinations like parks, museums, and monuments. Urban planners and policymakers may find this information helpful in understanding how various modes of transportation are used and in making decisions about urban planning and transportation infrastructure.

It's interesting to note that the positive comments in the walk category are concentrated around a few popular neighborhoods and landmarks in Washington, D.C., which suggests that the perception of the city's walkability may be influenced by the presence of popular destinations. Furthermore, the negative bigrams associated with cars, such as those about pedestrian deaths and traffic safety, point to significant issues that require attention in order to increase safety and accessibility.

Finally, one user's comment captures the walkability gap in D.C.:

*"I often times see "making DC a better place to live", but the question I ask often is "for who?". For those making enough to live in the inner-most part of the city? How do lower/working class individuals work, live, and enjoy this same city when the commute to do so.. is becoming unmanageable?"*