# **Intro to Data Visualization**
#### By: Tania Arya

*Originally presented in the QUESTech Datathon Workshop Series 2021*

---



# Setup

In [1]:
import pandas as pd
import plotly.express as px

[Learn how to read data into a Pandas DataFrame in 5 minutes](https://towardsdatascience.com/learn-how-to-read-data-into-a-pandas-dataframe-in-5-minutes-122af8e0b9db)

In [2]:
# Read the airline tweets CSV file into a pandas DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/taniaarya/kaggle-airline-sentiment-tweets/main/Tweets.csv')

# Verify that the data loaded correctly by previewing the first five rows
df.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |
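Before cleaning, it can help to check the shape, inferred column types, and missing-value counts of a freshly loaded DataFrame. The sketch below is a minimal illustration that reads a hypothetical three-row sample from an in-memory string (so it runs without network access); the sample keeps only a subset of the real columns and is an assumption, not the workshop's data.

```python
import io
import pandas as pd

# Hypothetical three-row sample with a subset of the Tweets.csv columns
sample_csv = io.StringIO(
    "tweet_id,airline_sentiment,airline,retweet_count\n"
    "570306133677760513,neutral,Virgin America,0\n"
    "570301130888122368,positive,Virgin America,0\n"
    "570301031407624196,negative,Virgin America,0\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # type pandas inferred for each column
print(sample.isna().sum())  # count of missing values per column
```

The same three calls work on the full `df` loaded above.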


# Data Cleaning

[Working with datetime in Pandas DataFrame](https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587)

In [3]:
# convert tweet created column to datetime objects
df['tweet_created'] = pd.to_datetime(df['tweet_created'])

# extract date from tweet created column (using built in datetime properties)
df["tweet_date"] = df['tweet_created'].dt.date
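To see what the conversion does, here is a small sketch on two timestamps copied from the `df.head()` output. The `sample` DataFrame and the extra `tweet_hour` column are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Two timestamps copied from the df.head() output
sample = pd.DataFrame({
    "tweet_created": [
        "2015-02-24 11:35:52 -0800",
        "2015-02-24 11:15:59 -0800",
    ]
})

# Parse the strings (including the -0800 UTC offset) into datetime values
sample["tweet_created"] = pd.to_datetime(sample["tweet_created"])

# The .dt accessor exposes date parts of every row at once
sample["tweet_date"] = sample["tweet_created"].dt.date  # calendar date only
sample["tweet_hour"] = sample["tweet_created"].dt.hour  # hour of day (hypothetical extra column)

print(sample)
```

Other `.dt` properties, such as `.dt.day_name()`, follow the same pattern.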

[How to drop one or multiple columns in Pandas Dataframe](https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/)

In [4]:
# remove unnecessary columns
df.drop(columns=['airline_sentiment_gold', 'negativereason_gold'], inplace=True)
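One possible way to decide which columns are worth dropping is to measure how much of each column is missing. The sketch below assumes a tiny hypothetical DataFrame in which the two `_gold` columns are entirely empty. It also shows the non-`inplace` form of `drop`, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample where the two "_gold" columns contain no data
sample = pd.DataFrame({
    "airline": ["United", "Delta"],
    "airline_sentiment_gold": [None, None],
    "negativereason_gold": [None, None],
})

# Fraction of missing values in each column (1.0 means completely empty)
print(sample.isna().mean())

# Without inplace=True, drop() returns a new DataFrame
trimmed = sample.drop(columns=["airline_sentiment_gold", "negativereason_gold"])
print(trimmed.columns.tolist())
```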

# Numerical Analysis

[Python | Pandas Dataframe.describe() method](https://www.geeksforgeeks.org/python-pandas-dataframe-describe-method/)

In [5]:
# descriptive statistics
df.describe()

| | tweet_id | airline_sentiment_confidence | negativereason_confidence | retweet_count |
|---|---|---|---|---|
| count | 14640.0 | 14640.0 | 10522.0 | 14640.0 |
| mean | 5.692184e+17 | 0.900169 | 0.638298 | 0.08265 |
| std | 779111200000000.0 | 0.16283 | 0.33044 | 0.745778 |
| min | 5.675883e+17 | 0.335 | 0.0 | 0.0 |
| 25% | 5.685592e+17 | 0.6923 | 0.3606 | 0.0 |
| 50% | 5.694779e+17 | 1.0 | 0.6706 | 0.0 |
| 75% | 5.698905e+17 | 1.0 | 1.0 | 0.0 |
| max | 5.703106e+17 | 1.0 | 1.0 | 44.0 |
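The output above covers only the numeric columns. Calling `describe()` on a text column gives a different summary: the count, the number of unique values, the most frequent value, and how often it occurs. A minimal sketch, assuming a hypothetical four-value sample:

```python
import pandas as pd

# Hypothetical sample of sentiment labels
sentiment = pd.Series(
    ["negative", "negative", "neutral", "positive"],
    name="airline_sentiment",
)

# For text data, describe() reports count, unique, top, and freq
summary = sentiment.describe()
print(summary)
```

On the real data, `df["airline_sentiment"].describe()` produces the same four fields.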


Overview: [Python | Pandas dataframe.groupby()](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)

In Depth Walkthrough: [Pandas GroupBy](https://www.geeksforgeeks.org/pandas-groupby/)

Official Documentation: [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)

In [6]:
# counts for number of tweets per airline
df.groupby("airline").size()

airline
American          2759
Delta             2222
Southwest         2420
US Airways        2913
United            3822
Virgin America     504
dtype: int64

In [7]:
# counts for tweets per sentiment
df.groupby("airline_sentiment").size()

airline_sentiment
negative    9178
neutral     3099
positive    2363
dtype: int64
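The two counts above can also be combined by grouping on both columns at once. The sketch below uses a hypothetical six-row sample (not the real dataset) and `unstack` to turn the result into an airline-by-sentiment table.

```python
import pandas as pd

# Hypothetical six-tweet sample
sample = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta", "Southwest"],
    "airline_sentiment": [
        "negative", "neutral", "positive", "negative", "negative", "positive",
    ],
})

# Count tweets for every (airline, sentiment) pair
by_both = sample.groupby(["airline", "airline_sentiment"]).size()

# Pivot the sentiment level into columns; missing pairs become 0
table = by_both.unstack(fill_value=0)
print(table)
```

Replacing `sample` with `df` would give the full table for all six airlines.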

# Visualization

## Plotly Express

Plotly Express is the high-level interface to Plotly, an interactive graphing library.

[Documentation](https://plotly.com/python/plotly-express/)

[Overview](https://medium.com/plotly/introducing-plotly-express-808df010143d)

[Comprehensive Guide](https://towardsdatascience.com/visualization-with-plotly-express-comprehensive-guide-eb5ee4b50b57)



## Histograms

In [8]:
fig = px.histogram(df, x="airline", title="Counts of Tweets per Airline")
fig.show()

In [9]:
fig = px.histogram(df, x="airline_sentiment", title="Counts of Tweets per Sentiment")
fig.show()

In [10]:
# extract negative tweets
df_neg = df[df.airline_sentiment == "negative"]

fig = px.histogram(df_neg, x="negativereason", title="Counts of Negative Tweets Divided by Reason")
fig.show()

In [11]:
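# color splits each airline's bar by sentiment; barmode="group" places the bars side by side
fig = px.histogram(df, x="airline", color="airline_sentiment", barmode="group",
                   title="Counts of Tweets per Airline by Sentiment")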
fig.show()

In [12]:
# extract tweets that are negative because of a "customer service issue"
df_cs = df[df.negativereason == 'Customer Service Issue']

# histnorm="percent" makes each bar show the airline's share (in %) of all
# customer service complaints, instead of a raw count
fig = px.histogram(df_cs, x="airline", title="Percentage of Tweets About Bad Customer Service by Airline", histnorm="percent")
fig.show()
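With `histnorm="percent"`, each bar is the bin's count divided by the total count across all bins, multiplied by 100, so the bars sum to 100. The same numbers can be computed directly in pandas. A minimal sketch, assuming five hypothetical complaint rows:

```python
import pandas as pd

# Hypothetical airline labels for five customer service complaints
sample_cs = pd.Series(
    ["United", "United", "American", "Delta", "United"],
    name="airline",
)

# normalize=True returns fractions; multiply by 100 for percentages
percent = sample_cs.value_counts(normalize=True) * 100
print(percent)
```

Note that these percentages reflect each airline's share of the complaints, which also depends on how many tweets each airline received overall.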

## Timeseries

In [13]:
# get counts for each date, divided by sentiment
df_counts = df.groupby(['tweet_date', 'airline_sentiment']).size().reset_index(name="count")
df_counts

| | tweet_date | airline_sentiment | count |
|---|---|---|---|
| 0 | 2015-02-16 | negative | 3 |
| 1 | 2015-02-16 | neutral | 1 |
| 2 | 2015-02-17 | negative | 838 |
| 3 | 2015-02-17 | neutral | 297 |
| 4 | 2015-02-17 | positive | 273 |
| 5 | 2015-02-18 | negative | 736 |
| 6 | 2015-02-18 | neutral | 335 |
| 7 | 2015-02-18 | positive | 273 |
| 8 | 2015-02-19 | negative | 751 |
| 9 | 2015-02-19 | neutral | 329 |


In [14]:
fig = px.line(df_counts, x='tweet_date', y="count", color='airline_sentiment', title="Sentiment of Tweets Over Time")
fig.show()

Potential cause for the spike on Feb 22:

A large winter storm affected much of the East Coast on Feb 20 - 22. This may have caused flight delays, resulting in many angry passengers.

Sources:

https://ral.ucar.edu/sites/default/files/public/file_attach/features/PDF%20datastream.pdf

https://theweek.com/10things/536518/10-things-need-know-today-february22-2015
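A spike like this can also be located programmatically rather than by reading the chart. The sketch below uses only the ten rows visible in the truncated `df_counts` output (the name `df_counts_sample` is an assumption), so it finds the peak among those rows, not the Feb 22 spike itself. Running the same two lines on the full `df_counts` would search every date.

```python
import pandas as pd

# Rows copied from the truncated df_counts output; later dates are not included
df_counts_sample = pd.DataFrame({
    "tweet_date": [
        "2015-02-16", "2015-02-16",
        "2015-02-17", "2015-02-17", "2015-02-17",
        "2015-02-18", "2015-02-18", "2015-02-18",
        "2015-02-19", "2015-02-19",
    ],
    "airline_sentiment": [
        "negative", "neutral",
        "negative", "neutral", "positive",
        "negative", "neutral", "positive",
        "negative", "neutral",
    ],
    "count": [3, 1, 838, 297, 273, 736, 335, 273, 751, 329],
})

# Keep the negative rows, then pick the row with the largest count
negative = df_counts_sample[df_counts_sample["airline_sentiment"] == "negative"]
peak = negative.loc[negative["count"].idxmax()]
print(peak["tweet_date"], peak["count"])
```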