# Data Visualization with Plotly!

Data visualization can often be as important as the data analysis itself. In many organizations, accurate and insightful graphs are the most effective way to present your findings. Data visualization in Python can be done with a variety of packages, but this tutorial focuses on Plotly. Plotly offers many offline and online features that make it a powerful and effective data visualization tool for data scientists at any level.

In [21]:
import plotly.plotly as py
import plotly.graph_objs as go
import pandas as pd

The data used for this visualization tutorial is derived from the Pittsburgh Port Authority TrueTime API bus data we all explored earlier in the semester. However, I have added to the analysis as we will see.

Getting Started:

This notebook has been pre-run to show all plots. However, if you would like to run it yourself, you must navigate your browser to Plot.ly and sign up for a free account. Make sure you are signed in to your Plotly account on your computer (nothing specific is needed in the notebook). The three CSV files required can be obtained from the three links below. Once they are downloaded to the same folder as this notebook, the code will execute as written.

Google Drive links:
https://drive.google.com/file/d/0BzWjARgQDlRPRzR3bHpoVU1adVE/view?usp=sharing
https://drive.google.com/file/d/0BzWjARgQDlRPNDlPUllVWUw4d1k/view?usp=sharing
https://drive.google.com/file/d/0BzWjARgQDlRPMENfc3A2M3BnRkk/view?usp=sharing

In [22]:
vdfpdf = pd.read_csv("supervpNewHood.csv")

First, let's look at delays separated by route. A delay is the actual arrival time compared to the arrival time predicted at t=0. To visualize this, we will use a Plotly box plot. When building a plot in Plotly, each data subset being plotted is initialized as a go.Box object (called a trace below). In each constructor we set the y variable to a subset of the data. Each trace below is plotted as one box, one for each route we want to display.

In [23]:
rt61A = vdfpdf.loc[vdfpdf['rt_vdf'] == "61A"]["delay"]
rt61B = vdfpdf.loc[vdfpdf['rt_vdf'] == "61B"]["delay"]
rt61C = vdfpdf.loc[vdfpdf['rt_vdf'] == "61C"]["delay"]
rt61D = vdfpdf.loc[vdfpdf['rt_vdf'] == "61D"]["delay"]

trace0 = go.Box(
    y=rt61A,
)
trace1 = go.Box(
    y=rt61B,
)
trace2 = go.Box(
    y=rt61C,
)
trace3 = go.Box(
    y=rt61D,
)
data = [trace0, trace1, trace2, trace3]
py.iplot(data)


We have a plot, but the axis scale automatically includes all outliers, which makes the overall message of average delay by route hard to see. We might normally remove such outliers in analysis, but let's pretend that in this case we want to keep them in the dataset. We can manually adjust the axis scale. Creating a go.Layout object gives us the ability to control how the graph looks. A dictionary is initialized for each axis. Since this is a box plot, we are only concerned with the y axis. We set autorange to True for the x axis and limit the y axis to values between -10 and 10.
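The fixed range of -10 to 10 is a judgment call. As a rough sketch of one alternative, the range could be derived from the data using quartile-based fences. The function name `whisker_range`, the multiplier `k`, and the sample values below are all illustrative assumptions rather than part of the original notebook, and the quartile method used here may not match the one Plotly uses when it draws its whiskers.

```python
import statistics

def whisker_range(values, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical delays in minutes, including two extreme outliers.
sample_delays = [-3, -1, 0, 1, 2, 2, 3, 4, 5, 60, -45]
low, high = whisker_range(sample_delays)

# This dictionary could stand in for yaxis=dict(range=[-10, 10]) in go.Layout.
yaxis = dict(range=[low, high])
print(yaxis)
```

With a real column, the same function could be called on something like `vdfpdf['delay'].tolist()`, assuming the column holds numeric values with no missing entries.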

In [24]:
layout = go.Layout(
    xaxis=dict(
        autorange=True
    ), 
    yaxis=dict(
        range=[-10,10]
    ), 
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Better, but we should highlight the mean in each box. In this next iteration, we not only add a dashed line for the mean (boxmean=True), but we also add a label for each trace and titles for our overall x and y axes. With x-axis labels, we no longer need our legend, which we hide by setting the layout parameter showlegend to False.

In [25]:
trace0 = go.Box(
    name='61A',
    y=rt61A,
    boxmean=True
)
trace1 = go.Box(
    name='61B',
    y=rt61B,
    boxmean=True
)
trace2 = go.Box(
    name='61C',
    y=rt61C,
    boxmean=True
)
trace3 = go.Box(
    name='61D',
    y=rt61D,
    boxmean=True
)
layout = go.Layout(
    title="Delay by Bus Route",
    xaxis=dict(
        title="Bus Route",
        autorange=True
    ), 
    yaxis=dict(
        title="Accumulated Delay (mins)",
        range=[-10,10]
    ), 
    showlegend=False
)
data = [trace0, trace1, trace2, trace3]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


Note that Plotly displays plot details by default when you hover over the plot with your mouse. Give it a try!


To recap the code above: to modify the label for each route, we set the "name" parameter in each trace. To set the axis titles, set the "title" parameter within the xaxis and yaxis dictionaries in the Layout object. To set the overall graph title, also set the "title" parameter in the layout (outside of the axis dictionaries).
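The four traces differ only in their name and data, so they could also be generated in a loop. The sketch below builds plain dictionaries with a 'type' key, which Plotly accepts in place of go.Box objects. The helper `make_box_traces` and the sample data are hypothetical and stand in for the rt61A through rt61D subsets.

```python
def make_box_traces(delays_by_route):
    """Build one box-trace dictionary per route, in sorted route order."""
    return [
        dict(type='box', name=route, y=list(values), boxmean=True)
        for route, values in sorted(delays_by_route.items())
    ]

# Hypothetical delays (minutes) standing in for the per-route subsets.
sample = {'61B': [0, 2, 5], '61A': [-1, 1, 3]}
traces = make_box_traces(sample)
print([t['name'] for t in traces])
```

The resulting list can be used as the `data` argument of go.Figure in the same way as the hand-written traces.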

Now, let's adjust the visibility of the labels and gridlines. In the example below, we manually increase the font size for both labels and titles in our plot. To change the font size for the titles of our axes, we add a "titlefont" dictionary within our xaxis and yaxis dictionaries in our go.Layout object. To change the font size of the tick mark labels on our axes, we add a "tickfont" dictionary within our xaxis and yaxis dictionaries as well. In both instances, you can also change the font and the color by using the respective parameters "family" and "color". To make the gridlines more visible, we set "showgrid" to True and then set the "gridwidth" and "gridcolor" parameters to be more prominent. When making the grid more visible, separate settings are needed for the zero line, as it has its own parameters for customization. First ensure that it is visible by setting the "zeroline" parameter to True. Then set the "zerolinecolor" and "zerolinewidth" parameters as shown below.


In [26]:
layout = go.Layout(
    title="Delay by Bus Route",
    titlefont=dict(
            family='Arial, sans-serif',
            size=26,
            color='black'),
    xaxis=dict(
        title="Bus Route",
        titlefont=dict(
            family='Arial, sans-serif',
            size=22,
            color='navy'),
        tickfont=dict(
            family='Old Standard TT, serif',
            size=14,
            color='black'),
        autorange=True,
    ), 
    yaxis=dict(
        title="Accumulated Delay (mins)",
        titlefont=dict(
            family='Arial, sans-serif',
            size=22,
            color='navy'),
        tickfont=dict(
            family='Old Standard TT, serif',
            size=14,
            color='black'),
        range=[-10,10],
        showgrid=True,
        gridwidth=1,
        gridcolor="black",
        zeroline=True,
        zerolinewidth=1,
        zerolinecolor="red"
    ), 
    showlegend=False
)
data = [trace0, trace1, trace2, trace3]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


Let's change gears and look at visualizing histograms with Plotly. A "delay" is still defined as the difference between the original predicted arrival at Forbes and Morewood and the actual arrival time. Below, we build a histogram with the proportion of trips at various delay values. A negative delay denotes a trip that arrived ahead of its predicted time.

To make a histogram, create a go.Histogram object for each data group you want to represent on the plot. To normalize your histogram, add a parameter called "histnorm" to the go.Histogram object and set it to the type of normalization you would like to use (in my case I used 'probability' to show the proportion of records at each x-axis value). Next, just as with our box plot above, create a list (in this case called "data") with every trace you would like to display. Finally, plot it with iplot.

In [27]:
trips_condensed = pd.read_csv("supervpNewHood.csv")

trace = go.Histogram(
    x=trips_condensed['delay'],
    histnorm='probability'
)
data = [trace]
py.iplot(data)


Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points
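Suggestion (4) can be approached by binning the data locally and sending only the bin totals to Plotly. The sketch below computes the same kind of proportions that histnorm='probability' describes (each bin's count divided by the total number of records) and wraps them in a bar-trace dictionary. The function `probability_bins`, the bin width, and the sample values are assumptions for illustration only.

```python
import math
from collections import Counter

def probability_bins(values, bin_width=1.0):
    """Return (bin_starts, proportions) for fixed-width bins."""
    # Assign each value to the bin whose left edge is floor(v / width) * width.
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    starts = sorted(counts)
    x = [s * bin_width for s in starts]
    y = [counts[s] / total for s in starts]
    return x, y

# Hypothetical delays in minutes.
sample_delays = [-1.5, -0.2, 0.1, 0.4, 0.9, 1.2, 1.8, 3.3]
x, y = probability_bins(sample_delays)

# One bar per non-empty bin instead of one point per record.
bar_trace = dict(type='bar', x=x, y=y)
print(bar_trace)
```

Note that the x values here are the left edges of the bins, so the bars would need to be shifted by half a bin width if centered bins are wanted.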




The plot defaults to a scale large enough to include all values. In this next iteration, we use the go.Layout object to customize the appearance of the plot. We reduce the range on the x-axis so we can focus on the majority of data points with our histogram. Titles are also added for each axis and the overall plot.

In [28]:
layout = go.Layout(
    title = "Probability Histogram of Trip Delays",
    xaxis = dict(
        title = 'Accumulated Delay',
        range=[-20,40]
    ),
    yaxis = dict(
        title = "Proportion of Trips")
)
figure = go.Figure(data=data, layout=layout)
py.iplot(figure)

Let's examine localized delays around neighborhoods. I calculated which neighborhood in Pittsburgh the bus was located in at each stop. This was done using the latitude and longitude recorded at each timestamp, as well as neighborhood parameters obtained from Google Maps.

Let's use a heatmap to show which neighborhoods on which routes bring about the largest delays. Localized delays were calculated by looking at the changes in predicted arrival from one timestamp to the next. I then extracted the timestamp record with the highest delay from each trip. This gives insight into where the biggest impact on a delay was occurring. The key values for this heatmap are the localized delays at the intersection of the routes (y) and the neighborhoods (x). The localized delays are defined in the graph parameters as 'z', which is represented as a list of lists. In this z list, each inner list represents a row in the plot, which in this case is a bus route. On the x axis, we have all of the neighborhoods visited by all four routes. If a route did not experience any localized delays in a particular neighborhood (or does not pass through it), it will have a zero as the localized delay for that neighborhood.
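As a rough illustration of the localized-delay idea, the sketch below takes the predicted arrival at each timestamp of a single trip and finds the largest increase between consecutive records. This is only one possible reading of the calculation described above. The function name, the record layout, the sample values, and the choice to attribute each change to the later record's neighborhood are all assumptions, not the code used to produce the CSV files.

```python
def max_localized_delay(predictions):
    """Find the largest increase in predicted arrival between consecutive records.

    predictions: list of (neighborhood, predicted_arrival_minutes) in time order.
    Returns (neighborhood, change), or None if there are fewer than two records.
    """
    best = None
    for (_, prev_eta), (hood, eta) in zip(predictions, predictions[1:]):
        change = eta - prev_eta
        # Assumption: the change is attributed to the later record's neighborhood.
        if best is None or change > best[1]:
            best = (hood, change)
    return best

# Hypothetical trip: predicted arrival (minutes) at each timestamp.
trip = [
    ('Oakland', 10.0),
    ('Oakland', 10.5),
    ('Squirrel Hill', 12.5),
    ('Squirrel Hill', 12.0),
]
print(max_localized_delay(trip))
```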

After some code to build our list of lists with the mean localized delay for each neighborhood on each route, we initialize our plot. The go.Heatmap object is used to create the data for this plot. Within it, we set our x, y, and z. x is each neighborhood represented in the data set, and y is each of the four bus routes. If you do not include x and y parameters when instantiating your go.Heatmap object, you can still create the same plot as below, but there will be no labels on the axes.

In [29]:
neighborhood_delays = pd.read_csv("tripsComplete.csv")
z = []
neighs = neighborhood_delays['neighborhood'].unique()
x = neighs
y = neighborhood_delays['rt_vdf'].unique()
for i in neighborhood_delays['rt_vdf'].unique():
    this_z = []
    subset = neighborhood_delays[neighborhood_delays['rt_vdf']==i]
    for j in neighs:
        if j not in subset['neighborhood'].unique():
            this_z.append(0)
        else:
            value = round(subset[subset['neighborhood']==j]['offset'].mean(), 2)
            if value <=1:
                this_z.append(value)
            else:
                this_z.append(.75)
    z.append(this_z)
data = [
    go.Heatmap(
        z=z,
        x=x,
        y=y 
    )
]
py.iplot(data)

Great, now let's use another go.Layout object to set some axis labels.

In [30]:
layout = go.Layout(
    title = 'Mean Localized Delay by Neighborhood',
    xaxis = dict(
        title = 'Neighborhood'
    ),
    yaxis = dict(
        title = 'Bus Route'
    ))
figure = go.Figure(data=data, layout=layout)
py.iplot(figure)

To avoid having to dart your eyes back and forth between the legend and the plot to discern approximate values, set the plot's "annotations" parameter in the go.Layout object. Here I take the opportunity to show a way to update the go.Layout object instead of recreating it. To update the go.Layout object below, we access it through the go.Figure object that holds it and call update with the additional parameters. Take note that an update adds to the parameters set when the object was instantiated. Only the parameters passed to update are changed, and the rest of the object is not overwritten.

To annotate your graph, loop through your z values and build a list that you pass to the go.Layout object as annotations. What is interesting here is that each item in the list of annotations is a dictionary with parameters set for the location and the representation of the data in the plot. For instance, in the font parameter, you can set the color to white outright, or you can use a conditional expression to set it to black or white depending on the value (and therefore on the background color of the cell). The commented line below shows how that would be done.

In [31]:
mean_delays = []
for index, row in enumerate(z):
    for column, val in enumerate(row):
        box_mean = z[index][column]
        mean_delays.append(
            dict(
                text=str(box_mean),
                x=x[column], y=y[index],
                font=dict(color='white'),
                        #color='white' if val > 0.5 else 'black'), 
                showarrow=False)
            )
figure['layout'].update(
    title='Localized Delay Heatmap with Values',
    annotations = mean_delays
)
py.iplot(figure)
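The conditional font color from the commented line can be sketched as a small standalone function. The name `build_annotations`, the 0.5 threshold, and the sample values are illustrative assumptions. Which text color actually reads better depends on the color scale in use.

```python
def build_annotations(z, x_labels, y_labels, threshold=0.5):
    """Build one annotation dictionary per heatmap cell."""
    annotations = []
    for row_idx, row in enumerate(z):
        for col_idx, val in enumerate(row):
            annotations.append(dict(
                text=str(val),
                x=x_labels[col_idx],
                y=y_labels[row_idx],
                # White text above the threshold, black text otherwise.
                font=dict(color='white' if val > threshold else 'black'),
                showarrow=False,
            ))
    return annotations

# Hypothetical 2x2 grid of mean localized delays.
z_demo = [[0.1, 0.75], [0.5, 0]]
notes = build_annotations(z_demo, ['Oakland', 'Shadyside'], ['61A', '61B'])
print([n['font']['color'] for n in notes])
```

The returned list can be passed as the annotations parameter in the same way as mean_delays.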

There are a number of other ways to customize heatmaps. One example is customizing the color spectrum with the "colorscale" parameter in the go.Heatmap object.
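As a minimal sketch, a colorscale on a heatmap trace can be given either as a named scale such as 'Viridis' or as a list of [fraction, color] pairs running from 0 to 1. The trace below is written as a plain dictionary, which Plotly accepts in place of a go.Heatmap object. The colors and z values are made up for illustration.

```python
# Custom scale: white for the lowest values, orange in the middle, dark red at the top.
custom_scale = [
    [0.0, 'rgb(255, 255, 255)'],
    [0.5, 'rgb(255, 165, 0)'],
    [1.0, 'rgb(178, 34, 34)'],
]

# Hypothetical 2x2 heatmap trace using the custom scale.
heatmap_trace = dict(
    type='heatmap',
    z=[[0.1, 0.75], [0.5, 0.0]],
    x=['Oakland', 'Shadyside'],
    y=['61A', '61B'],
    colorscale=custom_scale,  # a named scale such as 'Viridis' also works here
)
print(heatmap_trace['colorscale'])
```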

Pittsburgh weather (from Weather Underground) at trip time was joined with the data. Let's build a scatter plot with rainfall on the x axis and delay on the y axis. I create a go.Scatter object, add it to the 'data' list for this plot, and then call iplot.


In [32]:
delays = pd.read_csv("tripDelaysWithRain.csv")
rt61A = delays.loc[delays['rt'] == "61A"]["delay"]
rt61B = delays.loc[delays['rt'] == "61B"]["delay"]
rt61C = delays.loc[delays['rt'] == "61C"]["delay"]
rt61D = delays.loc[delays['rt'] == "61D"]["delay"]


x = delays['rain']
y = delays['delay']

trace = go.Scatter(
    x=x,
    y=y,
    mode='markers'
)
data = [trace]
py.iplot(data)

Now I display a scatter plot with multiple traces (one per route), restricted to trips with rainfall greater than zero. When creating go.Scatter objects, set the 'mode' parameter to 'markers' to get a traditional scatter plot. Unlike axis titles, which are set in the layout object, trace labels are set with the 'name' parameter in each go.Scatter object before adding it to the data list. This automatically creates a legend. Plotly automatically selects a color for each trace, but this is also customizable.

In [33]:
delays = delays.loc[delays['rain'] > 0]
rt61A = delays.loc[delays['rt'] == "61A"]
rt61B = delays.loc[delays['rt'] == "61B"]
rt61C = delays.loc[delays['rt'] == "61C"]
rt61D = delays.loc[delays['rt'] == "61D"]

trace1 = go.Scatter(
        name='61A',
        x=rt61A['rain'],
        y=rt61A['delay'],
        mode='markers'
)
trace2 = go.Scatter(
        name='61B',
        x=rt61B['rain'],
        y=rt61B['delay'],
        mode='markers'
)
trace3 = go.Scatter(
        name='61C',
        x=rt61C['rain'],
        y=rt61C['delay'],
        mode='markers'
)
trace4 = go.Scatter(
        name = '61D',
        x=rt61D['rain'],
        y=rt61D['delay'],
        mode='markers'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    title = 'Delay by Rainfall Per Trip',
    xaxis = dict(
        title = 'Rainfall During Trip (inches)' 
    ),
    yaxis = dict(
        title = 'Delay (minutes)')
)
figure = go.Figure(data=data, layout=layout)
py.iplot(figure)

Take a look at the bottom right of each of the graphs we created above. There is an "Edit Chart" button. Clicking it takes you to the Plot.ly website, where an interactive dashboard lets you edit and tweak your graphs. This is a great place to learn some of the additional capabilities of the plotting library. Plotly also has a service in which you can open an account and publish your plots to the web. Visit the site and apply for a key today!