# Tutorial: Interactive Data Viz with Plotly
#### March 30, 2018
#### 15-388 Practical Data Science, CMU

## 1.1 Introduction

Note: this tutorial can be viewed through nbviewer http://nbviewer.jupyter.org/github/sarahshy/PDS/blob/master/Interactive%20Data%20Viz%20and%20Time%20Series%20Tutorial.ipynb?flush_cache=false.

This tutorial will introduce the basics of using Plotly for interactive data visualization. Visualizing data through graphs and plots lets us explore patterns, trends, and even interactions between covariates. Adding an interactive element allows us to further break down these patterns over time or by another variable, as you will see later on. Many bloggers and news outlets are turning to interactive graphs to present their data-driven conclusions, as they are more fun and easier for a general reader to understand. See the Wall Street Journal for some fun examples: http://graphics.wsj.com/wsj-interactives-2015/.

### Tutorial Content

In this tutorial, we will cover how the Plotly for Python library can be used for 3 different kinds of visualizations: histograms, time series plots, and maps of geospatial data. We will use the Austin bikeshare data as a running example.

We will also use the cufflinks and pandas libraries for Python. Cufflinks binds plotly and pandas nicely so we can use plotly directly on pandas dataframes.

We will cover the following:
- [Installing the libraries](#installing-the-libraries)
- [Load the data](#load-the-data)
- [Plotting histograms](#plotting-histograms)
    - [Adding interaction to histograms](#interact-histograms)
- [Plotting time series](#plotting-time-series)
    - [Adding interaction to time series](#interact-time-series)
- [Plotting geospatial data](#plot-spatial)

<a id='installing-the-libraries'></a>

## 1.2 Installing Plotly

Before we start, we'll need to install plotly and cufflinks and load the libraries we'll be using throughout the tutorial.

You can install both using:

    $ pip install plotly cufflinks

In [4]:
#load libraries

import pandas as pd
import plotly.plotly as py  # plotly < 4; from plotly 4 on this module is provided by the chart_studio package
import plotly.tools as tls
import cufflinks as cf

<a id='load-the-data'></a>

## 2. Load the Data

For this tutorial, we will use the Austin bikeshare data. A description of the service can be found here: https://austinbcycle.com/how-it-works/faqs. The CSV files can be downloaded from Kaggle: https://www.kaggle.com/jboysen/austin-bike.

### Description of the data

We have two datasets: *austin_bikeshare_trips.csv* and *austin_bikeshare_stations.csv*.
The first contains information about all individual trips taken between December 2013 and May 2017. The second dataset contains information about the bike stations spread out across Austin, TX.

### Our variables
The _bike trips_ dataset contains the following variables:
-  bikeid: integer id of bike
-  checkout_time: HH:MM:SS, see start time for date stamp
-  duration_minutes: integer minutes of trip duration
-  end_station_id: integer id of end station
-  end_station_name: string of end station name
-  month: month, integer (1 = January)
-  start_station_id: integer id of start station
-  start_station_name: string of start station name
-  start_time: YYYY-MM-DD HH:MM:SS, string
-  subscriber_type: membership type
-  trip_id: unique trip id, int
-  year: year of trip, int


The _bike stations_ dataset contains the following variables:
-  latitude: geospatial latitude, precision to 5 places
-  location: (lat, lon)
-  longitude: geospatial longitude, precision to 5 places
-  name: station name, str
-  station_id: unique station id, int
-  status: station status (active, closed, moved, ACL-only), ACL is a music festival


_Note: variable definitions were taken from the dataset's Kaggle description: https://www.kaggle.com/jboysen/austin-bike._

### Loading data

We can use the pandas *read_csv* function to load our data. To simplify, we will call our datasets *bike_trips* and *bike_stations*, respectively.

In [2]:
#load data
bike_trips = pd.read_csv("austin_bikeshare_trips.csv")
bike_stations = pd.read_csv("austin_bikeshare_stations.csv")

Let's take a quick look at our data.

In [52]:
print(bike_trips.head())
print(bike_stations.head())

   bikeid checkout_time  duration_minutes  end_station_id  \
0     8.0      19:12:00                41          2565.0   
1   141.0       2:06:04                 6          2570.0   
2   578.0      16:28:27                13          2498.0   
3   555.0      15:12:00                80          2712.0   
4    86.0      15:39:13                25          3377.0   

                           end_station_name  month  start_station_id  \
0                      Trinity & 6th Street    3.0            2536.0   
1                  South Congress & Academy   10.0            2494.0   
2   Convention Center / 4th St. @ MetroRail    3.0            2538.0   
3                   Toomey Rd @ South Lamar   11.0            2497.0   
4  MoPac Pedestrian Bridge @ Veterans Drive    4.0            2707.0   

                  start_station_name           start_time  \
0                   Waller & 6th St.  2015-03-19 19:12:00   
1                     2nd & Congress  2016-10-30 02:06:04   
2    Bullock Muse

In [31]:
# Dataset size
print(bike_trips.shape) #649231 by 12
print(bike_stations.shape) #72 by 6

(649231, 12)
(72, 6)


In our *bike_trips* dataset, we have **649,231** individual trips and 12 variables.

In our *bike_stations* dataset, we have **72** bike stations and 6 variables.

### Checking for missing data

Before we proceed with any analysis or graphing, we check whether any rows contain missing data. We could further analyze why the data is missing, but that is outside the scope of this tutorial. We will therefore remove rows with missing data.

In [53]:
#number of rows with missing data in bike_trips
n_missing_trips = len(bike_trips[pd.isnull(bike_trips).any(axis=1)])

#number of rows with missing data in bike_stations
n_missing_stations = len(bike_stations[pd.isnull(bike_stations).any(axis=1)])

#remove rows with missing data (copy so later column assignments don't touch bike_trips)
bike_trips_clean = bike_trips.dropna().copy()

#new dataset size
print(bike_trips_clean.shape)

(581625, 12)


There were 67,606 rows in *bike_trips* with missing data. The cleaned dataset is called *bike_trips_clean* and has 581,625 trips.

There were 0 rows in *bike_stations* with missing data.
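
The missing-data check can also be broken down by column, which helps show where the gaps come from. The following is a minimal sketch on a small made-up frame (`toy_trips` and its values are illustrative assumptions, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for bike_trips; the values are made up for illustration
toy_trips = pd.DataFrame({
    "bikeid": [8.0, np.nan, 578.0, 555.0],
    "duration_minutes": [41, 6, 13, 80],
    "end_station_id": [2565.0, 2570.0, np.nan, 2712.0],
})

# Number of missing values in each column
missing_per_column = toy_trips.isnull().sum()

# Number of rows with at least one missing value
rows_with_missing = int(toy_trips.isnull().any(axis=1).sum())

# Drop incomplete rows; .copy() gives an independent frame to modify later
toy_clean = toy_trips.dropna().copy()

print(missing_per_column)
print(rows_with_missing, toy_clean.shape)
```

The same calls should work unchanged on *bike_trips* and *bike_stations*.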

<a id='plotting-histograms'></a>

## 3.1 Plotting Histograms: Trip Duration

We'll focus on the trip data first and look at stations later on. The cufflinks library allows us to plot our dataframe easily using 'df.iplot'.

Notice that we only plot trips of 3 hours or less. Otherwise, our histogram would have a long tail. Since the service is meant to be used for short trips through the city, we expect the bulk of our data to lie below 60 minutes.

Note that we receive a default warning that our dataset is rather large. This is only a warning, not an error, and the plot output is correct.

In [42]:
duration_data = bike_trips_clean.duration_minutes[bike_trips_clean.duration_minutes <= 180]

layout = dict(title = "Histogram of Bike Trip Duration",
              xaxis = dict(title = 'Trip Duration (Minutes)'),
              yaxis = dict(title = 'Frequency'))

duration_data.iplot(kind = 'histogram', layout = layout, filename = "trip-duration", bins = 72)


Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points




We see that trips most commonly last between 5 and 9 minutes. As expected, the bulk of our data lies below an hour. So, users seem to be using the bikes the way they were intended.
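
One way to act on the warning's suggestion of using fewer data points is to bin the durations before plotting, so that only the bin counts are sent to the chart. The sketch below uses `numpy.histogram` on synthetic durations (`toy_durations` is a made-up stand-in for *duration_data*); it is an alternative to the approach used in this tutorial, not what the cell above does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic trip durations in minutes (made up; stands in for duration_data)
toy_durations = rng.integers(1, 181, size=100_000)

# 72 equal-width bins over 0-180 minutes, i.e. 2.5 minutes per bin
counts, edges = np.histogram(toy_durations, bins=72, range=(0, 180))
bin_centers = (edges[:-1] + edges[1:]) / 2

# Only 72 (x, y) pairs would need to be plotted instead of 100,000 raw values
print(len(bin_centers), counts.sum())
```

The pairs `bin_centers` and `counts` could then be drawn as a bar chart, for example with a `plotly.graph_objs.Bar` trace.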

We might be interested in seeing how use changes over time. For example, we can hypothesize that trips in the winter are shorter than trips in the summer, when the weather is nice. We can use a dropdown menu that filters the seasons to test this.

<a id='interact-histograms'></a>

## 3.2 Dropdowns: Adding interactivity to histograms

The data does not directly provide the season of each trip. Instead, we can use the month column to determine the season and label each trip with the appropriate string: 'fall', 'winter', 'spring', or 'summer'. We then collect the durations for each season into a dataframe so we can plot it directly using *plotly*.

In [10]:
# label rows with season and convert to dataframe
# columns are listed as fall, spring, summer, winter: the trace order the 'visible' masks in the menu assume

durations = bike_trips_clean.duration_minutes
months = bike_trips_clean.month
short = durations <= 180 # same 3-hour cutoff as the first histogram

bike_seasons = pd.DataFrame({'fall': durations[months.between(9,11) & short],
                   'spring': durations[months.between(3,5) & short],
                   'summer': durations[months.between(6,8) & short],
                   'winter': durations[months.isin([12,1,2]) & short]
                  })
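
Another way to attach seasons, shown here only as a sketch, is to map each month number to a label and store it as a column. The dictionary `SEASON_BY_MONTH` and the frame `toy` are illustrative names, and the values are made up; the month groupings are the same ones used above.

```python
import pandas as pd

# Month number -> season label, using the same month groupings as the tutorial
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Toy rows; month is stored as a float, as in the bike trips data
toy = pd.DataFrame({"month": [3.0, 10.0, 12.0, 7.0],
                    "duration_minutes": [41, 6, 13, 80]})

# Cast to int before the lookup so the keys match the dictionary
toy["season"] = toy["month"].astype(int).map(SEASON_BY_MONTH)

print(toy)
```

A single label column like this is convenient for *groupby* summaries, while the wide one-column-per-season layout is what the overlaid histogram needs.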

In [18]:
# create menu buttons

updatemenus = list([
    dict(active = -1,
        buttons=list([
            dict(label = 'All',
                method = 'update',
                args = [{'visible': [True, True, True, True]},
                        {'title': "Duration of Trips by Season"}]),
            dict(label = 'Fall',
                method = 'update',
                args = [{'visible': [True, False, False, False]},
                        {'title': 'Duration of Trips in Fall'}]),
            dict(label = 'Winter',
                method = 'update',
                args = [{'visible': [False, False, False, True]},
                        {'title': 'Duration of Trips in Winter'}]),
            dict(label = 'Spring',
                method = 'update',
                args = [{'visible': [False, True, False, False]},
                        {'title': 'Duration of Trips in Spring'}]),
            dict(label = 'Summer',
                method = 'update',
                args = [{'visible': [False, False, True, False]},
                        {'title': 'Duration of Trips in Summer'}])
        ]),)
])

layout = dict(xaxis = dict(title = 'Trip Duration (Minutes)'),
              yaxis = dict(title = 'Frequency', range = [0, 50000]),
              updatemenus = updatemenus,
              barmode='overlay',
              showlegend = False)
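
Each 'visible' mask has to line up with the column order of the plotted dataframe, so writing the masks by hand is easy to get wrong. As a sketch, the buttons can instead be generated from the column list. The helper `season_buttons` is a hypothetical name introduced here, not part of plotly or cufflinks.

```python
def season_buttons(columns):
    """Build an 'All' button plus one button per column, in column order."""
    buttons = [dict(label="All",
                    method="update",
                    args=[{"visible": [True] * len(columns)},
                          {"title": "Duration of Trips by Season"}])]
    for i, name in enumerate(columns):
        # Show only the trace at position i, hide all others
        visible = [j == i for j in range(len(columns))]
        buttons.append(dict(label=name.capitalize(),
                            method="update",
                            args=[{"visible": visible},
                                  {"title": "Duration of Trips in " + name.capitalize()}]))
    return buttons

# Assumed column order; in practice this could be list(bike_seasons.columns)
columns = ["fall", "spring", "summer", "winter"]
updatemenus = [dict(active=-1, buttons=season_buttons(columns))]

print([b["label"] for b in updatemenus[0]["buttons"]])
```

Because the masks are derived from the column positions, they stay correct even if the columns are reordered.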

In [19]:
# plot
bike_seasons.iplot(kind = 'histogram',
                   barmode = 'overlay',
                   filename = 'duration_by_season',
                   layout = layout,
                   bins = 72,
                   shared_yaxes = True,
                   theme = 'solar')






We set the default to display all 4 seasons, overlapping one another. By hovering over the histogram with your mouse, you can compare the number of trips in each bin by season. Using the dropdown, we can filter through each of the seasons individually.

By adding this dropdown, we see two notable patterns arise:
1. Fall and Spring are the most popular times to use bikeshare.
2. Regardless of season, 5-9 minutes is the most common trip length.

<a id='plotting-time-series'></a>

## 4.1 Plotting Time Series: Rides Over Time

Suppose we're further interested in how the popularity of this service has changed over time. Since we have data ranging from 2013 to 2017, we can inspect whether use has increased over time, decreased, or even if certain marketing campaigns have been effective.

The start_time column has both time and date. To plot trips over time, we will only need the date. First, we will convert start_time to a datetime column so we can extract the date using '.dt.date'.

We'll then count the number of rides taken on each day using the *groupby* function and plot the count over time.

In [51]:
# pull the date of each row
bike_trips_clean['start_time'] = pd.to_datetime(bike_trips_clean['start_time']) #convert to datetime (direct assignment so the column dtype changes)
bike_trips_clean['date'] = bike_trips_clean.start_time.dt.date

In [21]:
# get trip counts and create df for plotting

trip_count = bike_trips_clean.groupby(['date']).count()

trips_time_df = pd.DataFrame({'trip_count': trip_count.start_time})
print(trips_time_df.head())

            trip_count
date                  
2013-12-21         103
2013-12-22         117
2013-12-23          96
2013-12-24          85
2013-12-25         145
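
The convert, extract, and count steps can be checked on a tiny example. The sketch below uses three made-up timestamps (the frame `toy` is illustrative) and counts rows per day with `size()`, which is an equivalent alternative to taking one column of `count()`.

```python
import pandas as pd

# Three made-up trips: two on 2015-03-19 and one on 2015-03-20
toy = pd.DataFrame({"start_time": ["2015-03-19 19:12:00",
                                   "2015-03-19 08:00:00",
                                   "2015-03-20 02:06:04"]})

# Convert the strings to datetimes, then keep only the calendar date
toy["start_time"] = pd.to_datetime(toy["start_time"])
toy["date"] = toy["start_time"].dt.date

# Number of trips per day as a one-column dataframe
daily = toy.groupby("date").size().rename("trip_count").to_frame()

print(daily)
```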


In [22]:
#plot
layout = dict(title = 'Number of Trips Over Time',
              xaxis = dict(title = 'Date'),
              yaxis = dict(title = 'Number of Rides'))

trips_time_df.iplot(layout = layout, filename = 'trips_over_time')

<a id='interact-time-series'></a>

## 4.2 Sliders: Adding interactivity to time series

We have more than 3 years of daily data, so it's hard to make out the individual lines. We'll add a range slider that allows us to zoom in and inspect the data more closely.

In [29]:
layout = dict(
    title='Number of Trips Over Time with Rangeslider',
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1m',
                     step='month',
                     stepmode='backward'),
                dict(count=6,
                     label='6m',
                     step='month',
                     stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(),
        type='date'
    )
)

trips_time_df.iplot(layout = layout, filename = 'trips_over_time_slider')

With the slider, we can shorten the interval we look at and drag it across the time axis. We observe huge recurring spikes at the end of March and the beginning of April. We can also see periods where we're missing data, for example during April 2016 and December 2016. Using this data, we could further explore why we observe these spikes in March every year and why we might be missing data in certain months.
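
Gaps like these can also be located programmatically rather than by eye. As a sketch, the observed dates can be compared against a complete daily calendar. The frame `toy_counts` and its values are made up; with the real data, the index of *trips_time_df* would first need converting with `pd.to_datetime`, since it holds plain date objects.

```python
import pandas as pd

# Toy daily counts with three days missing (values are made up)
toy_counts = pd.DataFrame(
    {"trip_count": [103, 117, 85, 145]},
    index=pd.to_datetime(["2013-12-21", "2013-12-22", "2013-12-24", "2013-12-27"]),
)

# Every calendar day between the first and last observation
all_days = pd.date_range(toy_counts.index.min(), toy_counts.index.max(), freq="D")

# Days in the calendar that have no row in the data
missing_days = all_days.difference(toy_counts.index)

print([d.strftime("%Y-%m-%d") for d in missing_days])
```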

<a id='plot-spatial'></a>

## 5. Plotting Geospatial Data: Bike Station Locations

Let's have a look at the bike station data. We will use Mapbox to aid us, as Plotly can currently only plot country-level maps (useful, for example, if we were mapping something across the entire USA).

We are interested in knowing which stations are used most. This information can, for example, be used to determine which stations need more bikes or if we can charge more for stations with high demand.

We can combine our trips dataset with the stations dataset to determine the popularity of each station.

### Calculating station popularity

We will quantify the popularity of a station by the number of trips taken from it. Again, we can use the *groupby* function to get the count of each station. We then join with the *bike_stations* dataset to get the GPS coordinates of the stations.

In [31]:
# get count of each station
station_count_df = pd.DataFrame({'station_trip_count': bike_trips.groupby(['start_station_name']).size()})

# join with bike_stations
bike_stations_count = bike_stations.join(station_count_df, on = 'name')

print(bike_stations_count.head(3))

   latitude              location  longitude                     name  \
0  30.27041  (30.27041 -97.75046)  -97.75046           West & 6th St.   
1  30.26452   (30.26452 -97.7712)  -97.77120      Barton Springs Pool   
2  30.27595  (30.27595 -97.74739)  -97.74739  ACC - Rio Grande & 12th   

   station_id  status  station_trip_count  
0        2537  active             11905.0  
1        2572  active             12232.0  
2        2545  closed              1778.0  


As a sanity check, we confirm that the join kept exactly one row per station in *bike_stations*. Note that this join leaves *station_trip_count* as NaN for any station that has no matching trips in *bike_trips*.

In [27]:
print(len(bike_stations) == len(bike_stations_count)) # 72 bike stations

True
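
A stricter check, sketched below, is to look for names that appear in only one of the two datasets. It uses `merge` with `indicator=True`, which adds a `_merge` column marking where each row came from. The toy frames and the station names "Unused Station" and "Pop-up Kiosk" are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for bike_stations and bike_trips
toy_stations = pd.DataFrame({"name": ["West & 6th St.",
                                      "Barton Springs Pool",
                                      "Unused Station"]})
toy_trips = pd.DataFrame({"start_station_name": ["West & 6th St.",
                                                 "West & 6th St.",
                                                 "Barton Springs Pool",
                                                 "Pop-up Kiosk"]})

# Trips per start station as a regular dataframe
counts = (toy_trips.groupby("start_station_name").size()
          .rename("station_trip_count").reset_index())

# Outer merge keeps unmatched rows from both sides and labels their origin
merged = toy_stations.merge(counts, how="outer", left_on="name",
                            right_on="start_station_name", indicator=True)

stations_without_trips = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
trip_stations_not_listed = merged.loc[merged["_merge"] == "right_only",
                                      "start_station_name"].tolist()

print(stations_without_trips, trip_stations_not_listed)
```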


Now we can plot the stations in Austin and use the dot size to correspond with the popularity of the station.

In [22]:
from plotly.graph_objs import Scattermapbox, Layout

mapbox_access_token = 'YOUR_MAPBOX_ACCESS_TOKEN' # placeholder: replace with your own Mapbox token

scale = 500 # scale count for appropriate dot size

# plain list and dict are used instead of the deprecated Data and Marker wrappers
data = [
    Scattermapbox(
        lat = bike_stations_count['latitude'],
        lon = bike_stations_count['longitude'],
        mode = 'markers',
        marker = dict(
            size = (bike_stations_count['station_trip_count'])/scale,
            color = 'orange'
        ),
        text = bike_stations_count['name'],
        hoverinfo = 'text' # show the station name on hover
    )
]
layout = Layout(
    title = 'Start Station Popularity',
    autosize = True,
    hovermode = 'closest',
    mapbox = dict(
        accesstoken = mapbox_access_token,
        bearing = 0,
        center = dict(
            lat = 30.267,
            lon = -97.743
        ),
        pitch = 0,
        zoom = 12
    ),
)

fig = dict(data = data, layout = layout)

py.iplot(fig, filename = 'stations-bubble-map-mapbox')

Interactivity: 
- Scroll to zoom in and out of the map
- Click-and-drag will move the map around
- Hover mouse over a dot to see the station name.

Mapbox and Plotly allow us to customize many properties of this map including: color, map style and coloring by another variable.
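
As one example of coloring by another variable, the marker can take a numeric array for its color together with a colorscale. The sketch below only builds the marker dictionary; `bubble_marker` is a hypothetical helper name, and the choice of the 'Viridis' colorscale and of treating missing counts as zero are assumptions made for illustration.

```python
import math

def bubble_marker(trip_counts, scale=500):
    """Return a marker dict with size and color both driven by trip count."""
    # Replace missing counts (None or NaN) with zero so sizes stay numeric
    cleaned = [0.0 if (c is None or math.isnan(c)) else float(c)
               for c in trip_counts]
    return dict(
        size=[c / scale for c in cleaned],
        color=cleaned,          # numeric values are mapped through the colorscale
        colorscale="Viridis",
        showscale=True,         # draw a color bar next to the map
    )

marker = bubble_marker([11905.0, 12232.0, 1778.0, float("nan")])
print(marker["size"])
```

The resulting dictionary could be passed as the `marker` argument of the `Scattermapbox` trace in place of the single-color marker.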

## Summary
This tutorial featured only a few of the types of plots and interactive elements that are possible with Plotly for Python. Additional info and applications can be found here: https://plot.ly/python/.

The Austin Bikeshare datasets can be downloaded directly from Kaggle: https://www.kaggle.com/jboysen/austin-bike

Also note that Plotly is available in other programming languages such as R, MATLAB, and Julia.