# PLOTLY TUTORIAL 


## 1. Introduction
The aim of this tutorial is to introduce you to the Plotly library, which provides a wide variety of tools for visualizing datasets, and to show how to use these tools effectively. Data visualization is a key, final part of the data science pipeline: it ensures that the insights one so tirelessly extracted from a dataset are communicated clearly, in the manner most likely to elicit the desired reaction.
    
![alt text](https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png)
(Charles Minard's graph depicting the Napoleonic invasion of Russia, regarded by many as one of the best-designed data visualizations of all time.)

<img src="https://i.kinja-img.com/gawker-media/image/upload/s--pbtXseN9--/c_fit,fl_progressive,q_80,w_636/18ymos4n2x2xyjpg.jpg" width="400" height="400" />
(A terrible infographic depicting hospital admissions, via Gawker.)

The world we currently live in revolves, in large part, around the collection and communication of data. Regardless of field, from marketing to experimental physics, great emphasis is placed on the statistical basis of the conclusions drawn. The ability to communicate collected data effectively is therefore as important as collecting it: without subject-specific knowledge, large datasets can seem incomprehensible when presented only as tables. The examples above show the night-and-day difference between a well-designed and a terrible infographic. Libraries like Plotly are thus an integral asset in data communication.

## Types of Data 

An important step before visualizing a dataset is understanding the nature of the data and how it can be ordered, which helps ward off some statistical misrepresentations. There are four types of data: Nominal, Ordinal, Interval, and Ratio.

Nominal data cannot be ordered in any sense; its values simply name distinct categories. An example might be a set of countries: {USA, UK, India, China}. These values cannot be ordered, as there is no sense in which one country is "greater" in value than another.

Ordinal data is similar to Nominal data in that it is non-numeric, but differs in that it has a distinct ordering, although the scale of that ordering is not defined. The most commonly cited example is the set {strongly disagree, slightly disagree, neutral, slightly agree, strongly agree}, which appears frequently in survey data. Here there is a clear ordering, but no sense of the size of the difference between levels; i.e., how slight is "slightly disagree"?

Interval data is inherently numeric, and so is orderable with a meaningful difference between points. Integer values, for example, have this property: 1 and 2 are separated by the same amount as 4 and 5. A common everyday example is a rating out of 5. What Interval data lacks is a meaningful ratio between two values. On a five-star scale, everyone understands that 4 is better than 2, but the idea that a 4 is twice as good as a 2 is tricky and not necessarily true.

Ratio data is also numeric, and is the most comparable data type. It has a true zero point, so ratios between data points are meaningful: in a Ratio dataset 20 is twice as much as 10, and 10 is twice as much as 5, whereas in an Interval dataset such ratios would be ambiguous. The most commonly cited example is temperature in Kelvin. Since Kelvin is an absolute scale, zero Kelvin has a real meaning. This extends to ratios: 2 Kelvin is twice the absolute temperature of 1 Kelvin and half that of 4 Kelvin, and so on.
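
The difference between Interval and Ratio data can be checked with a little arithmetic. The sketch below uses Celsius as the Interval scale (an example added here, not one discussed above) and Kelvin as the Ratio scale; the two temperature readings are made up for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Two made-up readings in degrees Celsius.
cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on both scales: the gap is 10 degrees either way.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are only meaningful on the ratio scale. Dividing Celsius values
# suggests "twice as hot", which the Kelvin ratio does not support.
naive_ratio = warm_c / cold_c
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(diff_c, round(diff_k, 6), naive_ratio, round(true_ratio, 3))
```

The Celsius ratio comes out as exactly 2, while the Kelvin ratio is only about 1.035, which is why ratios should not be read off an Interval scale.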


## 2. Getting Started With Plotly

Getting started with Plotly will require the installation of a few packages: 

Pandas: a package for wrangling data, into DataFrames and otherwise. Documentation: https://pandas.pydata.org/pandas-docs/stable/index.html

NumPy and SciPy: packages required for a plethora of statistical and numerical operations.
Documentation: https://docs.scipy.org/doc/

Plotly: of course, the eponymous package, which will allow you to create the data visualizations covered in this tutorial. Documentation: https://plot.ly/python. Plotly's online mode requires a Plotly account and API key, which can be created at plot.ly; to use it, uncomment the credentials line in the code below and replace "username" and "API-key" with your own username and key. However, so that anyone can view this tutorial, I will be using the offline version.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import plotly
import plotly.offline as py
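# Offline mode: render figures inside the notebook, no account needed.
py.init_notebook_mode(connected=True)
# Online mode only: uncomment and fill in your credentials. In Plotly 4 and
# later this helper lives in the separate chart_studio package instead.
#plotly.tools.set_credentials_file(username='username', api_key='API-key')
# The star import provides Table, Bar, Scatter, Scatter3d, Layout and Figure.
from plotly.graph_objs import *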
import sqlite3

## Importing Data and Tabular Representations

To import data, it is most convenient to use the Pandas package, which provides a number of ways to do so, from reading an SQL table to loading ordinary CSV files. We can then display the resulting table using the Plotly package.

In [2]:
import plotly.figure_factory as ff

tweets = pd.read_csv('tweets.csv')
                        
table1 = ff.create_table(tweets)

py.iplot(table1)
# plotly also supports a filename arg here which allows you to save your graphs on the cloud, however that requires
# a premium plot.ly account. While for students this is out of reach, it might be useful in a professional setting.

It is also possible to create tables and data directly in Plotly itself.

In [3]:
trace = Table(
    header=dict(values=['Student A', 'Student B'],
                line = dict(color='#7D7F80'),
                fill = dict(color='#FFD7C9'),
                align = ['left'] * 2),        # one alignment entry per column
    cells=dict(values=[[51, 87, 87, 95],      # Student A's scores
                       [97, 97, 100, 96]],    # Student B's scores
               line = dict(color='#000000'),  # was '#00', not a valid hex color; black is assumed
               fill = dict(color='#FFFBD8'),
               align = ['left'] * 2))

layout = dict(width=1000, height=300)
data = [trace]
fig = dict(data=data, layout=layout)
py.iplot(fig)

For more information about importing data using pandas, and tabular representations in Plotly, explore the documentation here:

https://pandas.pydata.org/pandas-docs/stable/api.html#input-output

https://plot.ly/python/table/

## Bar graphs and other Basic Graphs

Now we can move on to more complicated, interactive graphs, the first of which is the bar graph. A simple way to convey data, the bar graph is ubiquitous, and as a result is often used incorrectly. A bar graph is best used when the data is both numerical and easily differentiable.

An important part of any graph is its labeling. Labeling can turn a graph from informative to misleading, so it is important to get it right. Plotly makes labeling very easy: all that is required is to make a layout object and pass in the desired arguments!


In [4]:
data = [Bar(x=tweets.screen_name,y=tweets.retweet_count, name = "retweets"),
        Bar(x=tweets.screen_name,y=tweets.favorite_count, name = "favorites")] 

layout = Layout(title="Tweets by Retweets and Favorites",
                xaxis=dict(title='Screen names'),
                yaxis=dict(title='Count'))

figure = Figure(data=data, layout=layout)
py.iplot(figure)

In this example, tweets were used. This is a good place to use a bar graph, as the data is easily quantifiable (it is just the number of retweets and favorites) and easily differentiable (the values differ enough to be visibly distinct). Most Plotly graphs are also interactive! Hovering over a bar will show you the exact value it represents. In later graphs, we will build upon this interactivity in exciting ways. For more information about Bar graphs specifically visit: https://plot.ly/python/reference/#bar
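
Since `tweets.csv` is not bundled with this tutorial, here is a minimal, self-contained sketch of the same kind of grouped bar chart built from plain dictionaries, which Plotly accepts in place of graph objects (the table example earlier passes a dict figure to `py.iplot` in the same way). The account names and counts are made up for illustration, and `make_bar_figure` is a hypothetical helper, not part of Plotly.

```python
def make_bar_figure(names, retweets, favorites):
    """Build a grouped bar chart as a plain dict figure (hypothetical helper)."""
    traces = [
        {"type": "bar", "x": names, "y": retweets, "name": "retweets"},
        {"type": "bar", "x": names, "y": favorites, "name": "favorites"},
    ]
    layout = {
        "title": "Tweets by Retweets and Favorites",
        "xaxis": {"title": "Screen names"},
        "yaxis": {"title": "Count"},
        # Side-by-side bars; "stack" would pile the two series up instead.
        "barmode": "group",
    }
    return {"data": traces, "layout": layout}


# Made-up example values, not taken from tweets.csv.
fig = make_bar_figure(["alice", "bob", "carol"], [12, 3, 40], [30, 9, 75])
print([trace["name"] for trace in fig["data"]])

# In a notebook, with `py` imported as plotly.offline as above:
# py.iplot(fig)
```

Keeping the figure as a dictionary makes it easy to inspect or test the labels before rendering anything.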

Next, let's look at Line and Scatter plots! The two are closely related, and both are suited to representing 2-dimensional data. While Scatter plots lend themselves to comparing unrelated data points, Line graphs have a degree of directionality and are therefore best suited to depicting data collected over time. In Plotly's implementation, a line plot is simply a Scatter plot with a line argument. For more information about Scatter plots visit: https://plot.ly/python/reference/#scatter
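
To make that relationship concrete, the sketch below shows that a scatter plot and a line plot are the same `scatter` trace type, differing only in the `mode` attribute. The traces are written as plain dictionaries so the snippet runs without any data files; the sample points are invented, and `scatter_trace` is a hypothetical helper that only allows three common modes.

```python
def scatter_trace(x, y, mode, name):
    """Return a dict describing a Plotly scatter trace (hypothetical helper)."""
    if mode not in ("markers", "lines", "lines+markers"):
        raise ValueError("unsupported mode: " + mode)
    return {"type": "scatter", "x": x, "y": y, "mode": mode, "name": name}


# Invented sample data: minutes elapsed and feet travelled.
minutes = [0, 5, 10, 15]
feet = [0, 1200, 2600, 4100]

# Same data, same trace type; only the drawing mode changes.
points_only = scatter_trace(minutes, feet, "markers", "scatter plot")
joined = scatter_trace(minutes, feet, "lines", "line plot")
joined["line"] = {"width": 4}  # line styling only applies when lines are drawn

fig = {"data": [points_only, joined],
       "layout": {"title": "Scatter versus line mode"}}
print([trace["mode"] for trace in fig["data"]])

# In a notebook, with `py` imported as plotly.offline as above:
# py.iplot(fig)
```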

Using the data set from the past homeworks, here is an example of how to use the Scatter plot.

In [5]:
from time_series import split_trips   # helper from the earlier homework
import copy

# Load the bus records and keep only the first rows to keep the plot light.
vdf = pd.read_sql_query("Select * from vehicles",sqlite3.connect('bus_aug23.db'),parse_dates = {"tmstmp" : "ns"})
vdf = vdf.truncate(after=1000)

# Keep only the time of day of each timestamp.
vdf["tmstmp"] = vdf["tmstmp"].dt.time

all_trips = { rt : split_trips(vdf[vdf["rt"]==rt]) for rt in ["61A", "61B", "61C", "61D"] }

data = []
trips = copy.deepcopy(all_trips["61A"])
# Order trips by the time of their first record. list.sort works in place,
# whereas a bare sorted(...) call would discard its result.
trips.sort(key = lambda trip: trip.index[0].hour*60 + trip.index[0].minute)
# Drop three trips by position (indices chosen by hand for this data set).
trips.pop(8)
trips.pop(7)
trips.pop(2)
for trip in trips:
    trace1 = Scatter(
                # Minutes since midnight, shifted by 656 (10:56), presumably
                # so that the x-axis starts near zero for this data set.
                x = trip.index.map(lambda x: x.hour*60+x.minute-656),
                y = trip.pdist,
                line = dict(
                    # format() yields plain integers, e.g. 'rgb(12, 200, 7)'
                    color = 'rgb({}, {}, {})'.format(*np.random.randint(256, size=3)),
                    width = 4))
    data.append(trace1)

layout = dict(title = 'Distance Traveled by 61A Buses',
              xaxis = dict(title = 'Time in Minutes'),
              yaxis = dict(title = 'Feet Traveled'),
              )
figure = Figure(data = data, layout = layout)
py.iplot(figure)


Plotly integrates a lot of variants of scatter plots very easily as arguments to the Scatter plot object. This includes, as mentioned before, Line graphs, but also Error bars, bubble charts, dot plots and so on and so forth. For example, below is an Error bar implementation, and for further information about statistical graphs visit: https://plot.ly/python/statistical-charts/

In [6]:
data = []
for trip in trips:
    trace1 = Scatter(
                x = trip.index.map(lambda x: x.hour*60+x.minute-656),
                y = trip.pdist.map(lambda x: x//1000),   # thousands of feet
                line = dict(
                    color = 'rgb({}, {}, {})'.format(*np.random.randint(256, size=3)),
                    width = 4))
    data.append(trace1)
# Re-plot the last trip from the loop above, this time with error bars.
errortrace = Scatter(
                x = trip.index.map(lambda x: x.hour*60+x.minute-656),
                y = trip.pdist.map(lambda x: x//1000),
                error_y=dict(
                    type='data',
                    # Placeholder magnitudes; normally supply one value per data point.
                    array=[1, 2, 3],
                    visible=True
                )
             )
data.append(errortrace)
py.iplot(data)
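
The cell above uses a fixed three-element error array as a placeholder. As a rough sketch of how the error values might instead be derived from the data, the snippet below computes one mean and one sample standard deviation per time point from hypothetical repeated readings, and stores them in a dict-style scatter trace. The readings are invented; only the `error_y` keys (`type`, `array`, `visible`) are taken from the cell above.

```python
import statistics

# Hypothetical repeated distance readings (thousands of feet) at each time.
readings = {
    0: [10, 12, 14],
    5: [20, 21, 22],
    10: [30, 34, 38],
}

times = sorted(readings)
# One central value and one error magnitude per time point.
means = [statistics.mean(readings[t]) for t in times]
errors = [statistics.stdev(readings[t]) for t in times]

trace = {
    "type": "scatter",
    "x": times,
    "y": means,
    "mode": "lines+markers",
    # The error array has the same length as x and y.
    "error_y": {"type": "data", "array": errors, "visible": True},
}
print(means, errors)

# In a notebook, with `py` imported as plotly.offline as above:
# py.iplot({"data": [trace]})
```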

This concept extends seamlessly to 3-dimensional plots; the only difference is the addition of a third variable. However, unlike line graphs, 3D surface plots are not implemented as joined 3D scatter plots; they use a separate graph object, Surface. To learn more about 3-D plots in Plotly, including scatter, surface and many more, visit: https://plot.ly/python/3d-charts/
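
The next cell uses Scatter3d, so here is a minimal sketch of what a Surface trace might look like, again as a plain dictionary. It assumes the usual Surface layout in which `z` is a 2D grid with one row per `y` value and one column per `x` value; the paraboloid and the helper name `paraboloid_surface` are made up for illustration.

```python
def paraboloid_surface(xs, ys):
    """Return a dict surface trace for z = x**2 + y**2 (hypothetical helper)."""
    # z is a 2D grid: one row per y value, one column per x value.
    z = [[x ** 2 + y ** 2 for x in xs] for y in ys]
    return {"type": "surface", "x": xs, "y": ys, "z": z}


xs = [-2, -1, 0, 1, 2]
ys = [-1, 0, 1]
surface = paraboloid_surface(xs, ys)
fig = {"data": [surface], "layout": {"title": "z = x^2 + y^2"}}

# Number of rows and columns in the z grid.
print(len(surface["z"]), len(surface["z"][0]))

# In a notebook, with `py` imported as plotly.offline as above:
# py.iplot(fig)
```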

In [7]:
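# Latitude, longitude and distance travelled give the three axes.
x1, y1, z1 = vdf["lat"],vdf["lon"],vdf["pdist"]
scatter = Scatter3d(x=x1, y=y1, z=z1,
    mode='markers',
    marker=dict(
        size=5,
        line=dict(color='rgb({}, {}, {})'.format(*np.random.randint(256, size=3)), width=.25),
        opacity=1
    )
)
# A plain list of traces is enough; the Data wrapper is deprecated in newer Plotly releases.
data = [scatter]

py.iplot(data)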

However, 3-dimensional data is not always best expressed in a three-dimensional plot, which can often become confusing and cluttered. In the above example, many of the data points are obscured because, in three dimensions, other points block them. As such, there are a number of other graphs for three or more dimensions at your disposal. Most notable among these is the Bubble plot, which is similar to a 2-D scatter plot except that the size of each data point varies according to a third parameter. To learn more about bubble plots in Plotly visit: https://plot.ly/python/bubble-charts/.
    

In [8]:
data = pd.read_csv("country.csv")
data = data.set_index(data["Countries"])
# Make sure the plotted columns are numeric.
data['GDP per capita'] = data['GDP per capita'].astype(float)
data['Life Expectancy'] = data['Life Expectancy'].astype(float)
data['Population'] = data['Population'].astype(float)
trace0 = Scatter(
    x=data['GDP per capita'],
    y=data['Life Expectancy'],
    mode='markers',
    marker=dict(
        symbol='circle',
        # Scale marker area, rather than diameter, with population.
        sizemode='area',
        # Chosen so the largest bubble is roughly 100 pixels across.
        sizeref=2.*max(data['Population'])/(100**2),
        size=data['Population'],
        line=dict(
            width=2
        ),
    )
)

py.iplot([trace0])
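
The `sizeref` expression in the cell above can be unpacked with a small worked example. The sketch below assumes Plotly's area sizing behaves as its bubble chart documentation describes, so that the largest marker ends up about `max_diameter_px` across and each marker's area is proportional to its value. The population figures and both helper names are hypothetical.

```python
import math


def bubble_sizeref(values, max_diameter_px):
    """sizeref for sizemode='area', following the formula used in the cell above."""
    return 2.0 * max(values) / (max_diameter_px ** 2)


def implied_diameters(values, max_diameter_px):
    """Approximate marker diameters if area is proportional to value (assumption)."""
    largest = max(values)
    return [max_diameter_px * math.sqrt(v / largest) for v in values]


# Made-up populations: each is four times the previous one.
populations = [1_000_000, 4_000_000, 16_000_000]
sizeref = bubble_sizeref(populations, 100)
diameters = implied_diameters(populations, 100)
print(sizeref, diameters)
```

Because area, not diameter, tracks the value, a country with four times the population gets a bubble only twice as wide.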

## Conclusions

In this tutorial, I've discussed some basics of data visualization and the Plotly package. However, I have not even scratched the surface. There are many unique and data-specific ways to visualize data, and many of these are supported by Plotly. Though I only discussed basic graphs in this tutorial, Plotly is a very powerful package that allows you to create almost any graph you could think of, as well as animations and interactive graphs. To explore Plotly more, please visit: https://plot.ly/python/. Their documentation is extensive and will probably answer any questions you have.

### Additional References 

1. https://plot.ly/python/ : The Plotly Python API Tutorial has almost everything you would ever need to know about the package. Links to documentation are at the bottom of each tutorial.

2. https://datavizcatalogue.com/ : A great resource to find the specific graph that might suit your data, the data viz catalogue has an article on almost every type of graph there is. 