INTRODUCTION

This tutorial gives a broad overview of Plotly, a visualization library for Python. Unlike Seaborn, Plotly is not a wrapper around the matplotlib library; it is built on the plotly.js JavaScript library and renders interactive, browser-based charts. Compared with matplotlib, it has a gentler learning curve and lower complexity. This makes it suitable for data science 'newbies' and for people outside data science who want to make sense of their data quickly through visualization, since it produces professional-looking charts in a relatively short time. Plotly is one of many Python visualization libraries, alongside Seaborn, ggplot, Bokeh, geoplotlib, gleam, etc.

Plotly offers a number of interactive graphing/mapping options:
-Basic charts: line charts, bubble charts, scatter plots, bar charts, Gantt charts, pie charts, filled area plots, time series, etc.
-Statistical charts: error bars, histograms, 2D histograms, density plots, tree plots, tree maps, etc.
-Scientific charts: log plots, contour plots, heatmaps, polar charts, ternary plots, streamline plots, chord diagrams, etc.
-Financial charts
-Maps: scatter plots on maps, bubble maps, lines on maps, choropleth maps, etc.
-3D charts: scatter plots, bubble plots, line plots, ribbon plots, surface plots, mesh plots, parametric plots, 3D network graphs, 3D clustering, surface triangulation, etc.

NOTE: Plotly is also an online web-hosting service for graphs. Graphs are stored online under a private account. However, Plotly can also be used offline in IPython notebooks. The Plotly API is also available for other programming environments such as R, MATLAB and JavaScript.

This tutorial demonstrates the use of Plotly by constructing three different types of plots from real-world data sets.

INSTALLATION

Use the package manager pip inside the terminal.

$ pip install plotly

The Plotly username and API key can be generated at the following link:
https://plot.ly/api

The generated credentials need to be assigned to the username and api_key arguments as shown below.

In [159]:
import plotly
plotly.tools.set_credentials_file(username='your_username', api_key='your_api_key')   #replace with your own credentials


GETTING STARTED

Every Plotly graph is described by JSON objects, which are dictionary-like data structures. Simply by changing the values of the keys in these objects, we can generate different kinds of graphs.

The different objects that define a graph in plotly are:
-Data
-Layout 
-Figure

Data is a list of traces; each trace holds the values to be plotted together with the specifications for how to plot them.
Layout defines the look of the graph and controls features unrelated to the Data.
Figure is the final object, which contains both the Data and Layout objects.
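
As a minimal sketch of this structure, the snippet below builds a figure from plain Python dictionaries and serializes it with the standard json module. The trace values and the title are illustrative placeholders, not data from this tutorial. A dictionary of this shape is the kind of object that the examples in this tutorial pass to py.iplot.

```python
import json

# One trace: the values to plot plus their specification (placeholder values).
trace = dict(type='scatter', x=[1, 2, 3], y=[4, 1, 2])

# Layout: the look of the graph, independent of the data.
layout = dict(title='Illustrative figure', showlegend=False)

# Figure: the final object holding both Data (a list of traces) and Layout.
figure = dict(data=[trace], layout=layout)

# Changing the value of a single key describes a different kind of graph.
bar_figure = dict(data=[dict(trace, type='bar')], layout=layout)

# The whole figure is ordinary JSON-serializable data.
figure_json = json.dumps(figure, sort_keys=True)
print(figure_json)
```

Because the figure is just nested dictionaries and lists, it can be inspected, copied and modified with ordinary Python before it is plotted.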

Two modules must be imported to generate Plotly graphs:
1. plotly.plotly: contains functions that will help communicate with the Plotly server
2. plotly.graph_objs: contains functions that will generate the graph object


In [160]:
import plotly.plotly as py
from plotly.graph_objs import *
#import plotly.graph_objs as go
import pandas as pd
import numpy as np
import math

The help function can be used to see the description of any Plotly object.

In [161]:
help(py.iplot)

Help on function iplot in module plotly.plotly.plotly:

iplot(figure_or_data, **plot_options)
    Create a unique url for this plot in Plotly and open in IPython.
    
    plot_options keyword agruments:
    filename (string) -- the name that will be associated with this figure
    fileopt ('new' | 'overwrite' | 'extend' | 'append')
        - 'new': create a new, unique url for this plot
        - 'overwrite': overwrite the file associated with `filename` with this
        - 'extend': add additional numbers (data) to existing traces
        - 'append': add additional traces to existing data lists
    sharing ('public' | 'private' | 'secret') -- Toggle who can view this graph
        - 'public': Anyone can view this graph. It will appear in your profile
                    and can appear in search engines. You do not need to be
                    logged in to Plotly to view this chart.
        - 'private': Only you can view this plot. It will not appear in the
                     Plot

EXAMPLE 1:

The following code is an example of constructing a 'Lines on Maps' plot. 
In this example, we take data from http://openflights.org/data.html to construct a map to visualize the commercial air connectivity of the city of Pittsburgh.
The data was curated from its original form for the purpose of this application. The structure of the two data files is shown below.

In [162]:
df_airports = pd.read_csv('airports.csv')       #read USA airport data file
print(df_airports.head())

df_flight_paths = pd.read_csv('PITroute.csv')   #read Pittsburgh connections data file
print(df_flight_paths.head())

  iata                            airport               city state country  \
0  ORD       Chicago O'Hare International            Chicago    IL     USA   
1  ATL  William B Hartsfield-Atlanta Intl            Atlanta    GA     USA   
2  DFW    Dallas-Fort Worth International  Dallas-Fort Worth    TX     USA   
3  PHX   Phoenix Sky Harbor International            Phoenix    AZ     USA   
4  DEN                        Denver Intl             Denver    CO     USA   

         lat        long    cnt  Unnamed: 8  
0  41.979595  -87.904464  25129         NaN  
1  33.640444  -84.426944  21925         NaN  
2  32.895951  -97.037200  20662         NaN  
3  33.434167 -112.008056  17290         NaN  
4  39.858408 -104.667002  13781         NaN  
  airline  airline_code  start_lat  start_lon airport1     id1    end_lat  \
0      AA          24.0  40.491467 -80.232872      PIT  3570.0  41.938889   
1      AA          24.0  40.491467 -80.232872      PIT  3570.0  42.364347   
2      AA          24.0 

In [163]:
#code to generate 'Lines on Maps' plot to visualize the air connectivity of Pittsburgh
airports = [ dict(                                #data object containing airport information
        type = 'scattergeo',                      #type of graph object
        locationmode = 'USA-states',
        lon = df_airports['long'],
        lat = df_airports['lat'],
        hoverinfo = 'text',
        text = df_airports['airport'],
        mode = 'markers',
        marker = dict( 
            size=2, 
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]
        
flight_paths = []                                    #data object containing flight path information
for i in range( len( df_flight_paths ) ):
    flight_paths.append(
        dict(
            type = 'scattergeo',
            locationmode = 'USA-states',
            lon = [ df_flight_paths['start_lon'][i], df_flight_paths['end_lon'][i] ],
            lat = [ df_flight_paths['start_lat'][i], df_flight_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            )
        )
    )
    
layout = dict(                               #layout object 
        title = 'Pittsburgh flight connections, Jan 2012<br>(Hover for airport names)',
        showlegend = False, 
        geo = dict(
            scope='north america',
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )
    
fig = dict( data=flight_paths + airports, layout=layout )
py.iplot( fig, filename='Pitt-flight-paths' )

EXAMPLE 2:

The World Bank website is a rich source of real-world data sets. A repository of world development indicators can be found at http://wdi.worldbank.org/tables. For the next example, we will plot a simple horizontal bar graph. The data chosen is the number of scientific and technical journal articles published by each country in 2013. The count includes articles published in the following fields: physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences.
The raw data set extracted from the website is shown below.

In [166]:
df=pd.read_csv('rnd.csv')
print(df.head(10))

               country papers
0          Afghanistan     27
1              Albania    184
2              Algeria  3,653
3       American Samoa     ..
4              Andorra      6
5               Angola     23
6  Antigua and Barbuda      2
7            Argentina  8,053
8              Armenia    559
9                Aruba     ..


First, we will remove countries with missing values. Then we will sort the remaining rows in descending order and plot a horizontal bar graph of the 20 countries with the largest number of scientific publications.

In [183]:
df = pd.read_csv('rnd.csv')
df = df.replace('..', np.nan)             #mark the '..' placeholder as missing
df = df.dropna().reset_index(drop=True)   #remove rows with missing data
df['papers'] = df['papers'].str.replace(',', '').astype(int)   #convert strings such as '3,653' to integers
df = df.sort_values('papers', ascending=False)
df = df.head(20)                          #keep the 20 countries with the most publications
data = [Bar(
            x=df['papers'],
            y=df['country'],
            orientation = 'h'
)]

layout = Layout(
    title='Top 20 countries by number of scientific publications',
    xaxis=dict(title='number of scientific publications'),
    yaxis=dict(title='country'))

fig=dict( data=data, layout=layout )
py.iplot(fig, filename='horizontal-bar')
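
The cleaning steps used in this example can be checked by hand on the ten rows printed by df.head(10). The following is a plain-Python sketch of the same logic, without pandas. The helper name parse_count is hypothetical and exists only for this illustration.

```python
# Rows copied from the df.head(10) output shown earlier in this example.
raw_rows = [
    ('Afghanistan', '27'), ('Albania', '184'), ('Algeria', '3,653'),
    ('American Samoa', '..'), ('Andorra', '6'), ('Angola', '23'),
    ('Antigua and Barbuda', '2'), ('Argentina', '8,053'),
    ('Armenia', '559'), ('Aruba', '..'),
]

def parse_count(text):
    """Hypothetical helper: return the count as an int, or None for '..'."""
    if text == '..':
        return None
    return int(text.replace(',', ''))   # strip thousands separators

# Drop rows with missing values, then sort by count in descending order.
cleaned = [(country, parse_count(value)) for country, value in raw_rows
           if parse_count(value) is not None]
cleaned.sort(key=lambda pair: pair[1], reverse=True)

top_three = cleaned[:3]
print(top_three)
```

Of these ten rows, the two with the '..' marker are dropped, and Argentina, Algeria and Armenia come out on top.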

EXAMPLE 3:

For the next example, we will plot a scatter bubble chart. Scatter bubble charts are useful for visualizing and comparing three numeric variables at once: entities are compared in terms of their relative x-y positions and the area of their bubbles. The data points of an ordinary scatter chart are replaced by bubbles, and the additional dimension of the data is represented by the size of the bubbles.

Sizing the bubbles correctly is critical for this visualization. According to Wikipedia, the human visual system naturally experiences a disk's size in terms of its area, and the area of a disk, unlike its diameter or circumference, is proportional not to its radius but to the square of the radius. So if one scales the disks' radii directly to the values of the third variable, the apparent size differences among the disks will be non-linear and misleading. To get a properly weighted scale, one must scale each disk's radius to the square root of the corresponding data value. Getting this wrong can lead to extreme misinterpretations, especially where the data has a large spread.
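
The effect of square-root scaling can be checked numerically. The sketch below compares the two approaches for a pair of placeholder data values. The helper names bubble_diameter and disk_area are hypothetical and are not part of Plotly.

```python
import math

def bubble_diameter(value, scale=1.0):
    """Hypothetical helper: diameter whose disk area is proportional to value."""
    return math.sqrt(value * scale)

def disk_area(diameter):
    """Area of a disk with the given diameter."""
    return math.pi * (diameter / 2.0) ** 2

small, large = 100.0, 400.0   # placeholder data values; large is 4 times small

# Square-root scaling: the ratio of areas equals the ratio of the data values.
sqrt_ratio = disk_area(bubble_diameter(large)) / disk_area(bubble_diameter(small))

# Using the data values directly as diameters: the area ratio is squared.
linear_ratio = disk_area(large) / disk_area(small)

print(sqrt_ratio, linear_ratio)
```

With square-root scaling, a value four times larger gets a bubble with four times the area. Without it, the same value gets a bubble with sixteen times the area.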


For the bubble chart, we take the example from the Plotly website and explain how it is constructed. The data consists of demographic details such as population, life expectancy and GDP per capita for each country, collected over 55 years (1952 to 2007) at intervals of 5 years.

In [182]:
data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv")
print(data.head(13))

        country  year         pop continent  lifeExp    gdpPercap
0   Afghanistan  1952   8425333.0      Asia   28.801   779.445314
1   Afghanistan  1957   9240934.0      Asia   30.332   820.853030
2   Afghanistan  1962  10267083.0      Asia   31.997   853.100710
3   Afghanistan  1967  11537966.0      Asia   34.020   836.197138
4   Afghanistan  1972  13079460.0      Asia   36.088   739.981106
5   Afghanistan  1977  14880372.0      Asia   38.438   786.113360
6   Afghanistan  1982  12881816.0      Asia   39.854   978.011439
7   Afghanistan  1987  13867957.0      Asia   40.822   852.395945
8   Afghanistan  1992  16317921.0      Asia   41.674   649.341395
9   Afghanistan  1997  22227415.0      Asia   41.763   635.341351
10  Afghanistan  2002  25268405.0      Asia   42.129   726.734055
11  Afghanistan  2007  31889923.0      Asia   43.828   974.580338
12      Albania  1952   1282697.0    Europe   55.230  1601.056136


We will plot GDP per capita versus life expectancy for the year 2007 as a scatter plot, with the size of each bubble denoting the population. The graph also visualizes a fourth, categorical dimension by colouring the bubbles according to the continent of the country.

The data for 2007 is first extracted and sorted alphabetically by continent and then by country. A scaling factor is chosen to scale down the large population values, and the bubble diameter is set to the square root of the scaled population so that the bubble area is proportional to the population.

Since the data is sorted by continent, one trace is created for each of the five continents, containing only the data for that continent. The five traces are combined into a list, which is the final data to be plotted.

The layout needs to specify the correct ranges for the x and y axes so that the data is scaled properly for visualization. Since GDP per capita varies widely between countries, a log scale is used for the x axis.
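
One detail is easy to miss: when an axis has type 'log', Plotly expects its range in log10 units rather than in data units. The x-axis range of roughly [2.0, 5.19] used in the code below therefore spans about 100 to 155,000 dollars. The following sketch shows one way such a range could be derived. The helper name log_axis_range and the 5% padding are assumptions made for this illustration, not the method used to obtain the values in the original example.

```python
import math

def log_axis_range(values, pad=0.05):
    """Hypothetical helper: [low, high] in log10 units with a little padding."""
    logs = [math.log10(v) for v in values if v > 0]
    low, high = min(logs), max(logs)
    margin = (high - low) * pad
    return [low - margin, high + margin]

# GDP per capita values taken from the gapminder rows printed above.
gdp_values = [779.445314, 974.580338, 1601.056136]
axis_range = log_axis_range(gdp_values)

# Converting back shows which data values the axis limits correspond to.
limits = [10 ** bound for bound in axis_range]
print(axis_range, limits)
```

A list computed this way could be passed as the range entry of a log-type axis in the layout.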

In [165]:

data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv")  #import file from online link
df_2007 = data[data['year']==2007].copy()     #extract data for year 2007 (copy so that new columns can be added safely)
df_2007 = df_2007.sort_values(['continent', 'country'])
slope = 2.666051223553066e-05   #scaling factor for bubble size
hover_text = []
bubble_size = []

for index, row in df_2007.iterrows():
    hover_text.append(('Country: {country}<br>'+
                      'Life Expectancy: {lifeExp}<br>'+
                      'GDP per capita: {gdp}<br>'+
                      'Population: {pop}<br>'+
                      'Year: {year}').format(country=row['country'],
                                            lifeExp=row['lifeExp'],
                                            gdp=row['gdpPercap'],
                                            pop=row['pop'],
                                            year=row['year']))
    bubble_size.append(math.sqrt(row['pop']*slope))    #bubble size list

df_2007['text'] = hover_text
df_2007['size'] = bubble_size

trace0 = Scatter( 
    x=df_2007['gdpPercap'][df_2007['continent'] == 'Africa'],
    y=df_2007['lifeExp'][df_2007['continent'] == 'Africa'],
    mode='markers',
    name='Africa',
    text=df_2007['text'][df_2007['continent'] == 'Africa'],
    marker=dict(
        symbol='circle',
        sizemode='diameter',
        sizeref=0.85,
        size=df_2007['size'][df_2007['continent'] == 'Africa'],
        line=dict(
            width=2
        ),
    )
)
trace1 = Scatter(
    x=df_2007['gdpPercap'][df_2007['continent'] == 'Americas'],
    y=df_2007['lifeExp'][df_2007['continent'] == 'Americas'],
    mode='markers',
    name='Americas',
    text=df_2007['text'][df_2007['continent'] == 'Americas'],
    marker=dict(
        sizemode='diameter',
        sizeref=0.85,
        size=df_2007['size'][df_2007['continent'] == 'Americas'],
        line=dict(
            width=2
        ),
    )
)
trace2 = Scatter(
    x=df_2007['gdpPercap'][df_2007['continent'] == 'Asia'],
    y=df_2007['lifeExp'][df_2007['continent'] == 'Asia'],
    mode='markers',
    name='Asia',
    text=df_2007['text'][df_2007['continent'] == 'Asia'],
    marker=dict(
        sizemode='diameter',
        sizeref=0.85,
        size=df_2007['size'][df_2007['continent'] == 'Asia'],
        line=dict(
            width=2
        ),
    )
)
trace3 = Scatter(
    x=df_2007['gdpPercap'][df_2007['continent'] == 'Europe'],
    y=df_2007['lifeExp'][df_2007['continent'] == 'Europe'],
    mode='markers',
    name='Europe',
    text=df_2007['text'][df_2007['continent'] == 'Europe'],
    marker=dict(
        sizemode='diameter',
        sizeref=0.85,
        size=df_2007['size'][df_2007['continent'] == 'Europe'],
        line=dict(
            width=2
        ),
    )
)
trace4 = Scatter(
    x=df_2007['gdpPercap'][df_2007['continent'] == 'Oceania'],
    y=df_2007['lifeExp'][df_2007['continent'] == 'Oceania'],
    mode='markers',
    name='Oceania',
    text=df_2007['text'][df_2007['continent'] == 'Oceania'],
    marker=dict(
        sizemode='diameter',
        sizeref=0.85,
        size=df_2007['size'][df_2007['continent'] == 'Oceania'],
        line=dict(
            width=2
        ),
    )
)

data = [trace0, trace1, trace2, trace3, trace4]
layout = Layout(
    title='Life Expectancy v. Per Capita GDP, 2007',
    xaxis=dict(
        title='GDP per capita (2000 dollars)',
        gridcolor='rgb(255, 255, 255)',
        range=[2.003297660701705, 5.191505530708712],
        type='log',       # log scaling GDP per capita due to large range
        zerolinewidth=1,
        ticklen=5,
        gridwidth=2,
    ),
    yaxis=dict(
        title='Life Expectancy (years)',
        gridcolor='rgb(255, 255, 255)',
        range=[36.12621671352166, 91.72921793264332],
        zerolinewidth=1,
        ticklen=5,
        gridwidth=2,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)

fig = Figure(data=data, layout=layout)
py.iplot(fig, filename='life-expectancy-per-GDP-2007')
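
The five trace definitions above differ only in the continent name, so they could also be generated in a loop. The following is a sketch of that idea using plain dictionaries and a few rows copied from the data.head(13) output. The helper name make_traces is hypothetical. The rows are not all from 2007, and only the grouping logic is being illustrated.

```python
import math

slope = 2.666051223553066e-05   # scaling factor for bubble size, as above

# Rows copied from the data.head(13) output shown earlier.
rows = [
    dict(country='Afghanistan', year=2002, pop=25268405.0, continent='Asia',
         lifeExp=42.129, gdpPercap=726.734055),
    dict(country='Afghanistan', year=2007, pop=31889923.0, continent='Asia',
         lifeExp=43.828, gdpPercap=974.580338),
    dict(country='Albania', year=1952, pop=1282697.0, continent='Europe',
         lifeExp=55.230, gdpPercap=1601.056136),
]

def make_traces(rows, sizeref=0.85):
    """Hypothetical helper: one scatter-trace dict per continent, in name order."""
    traces = []
    for continent in sorted(set(row['continent'] for row in rows)):
        subset = [row for row in rows if row['continent'] == continent]
        traces.append(dict(
            type='scatter',
            mode='markers',
            name=continent,
            x=[row['gdpPercap'] for row in subset],
            y=[row['lifeExp'] for row in subset],
            text=[row['country'] for row in subset],
            marker=dict(
                sizemode='diameter',
                sizeref=sizeref,
                # diameter proportional to the square root of the population
                size=[math.sqrt(row['pop'] * slope) for row in subset],
                line=dict(width=2),
            ),
        ))
    return traces

traces = make_traces(rows)
print([trace['name'] for trace in traces])
```

The resulting list has the same role as the data list built from trace0 to trace4, and adding a continent requires no new code.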
