## Introduction

This tutorial introduces some basic functions of plotly, a Python graphing library for making interactive, publication-quality graphs. A primary goal of data visualization is to communicate information to users. An effective visualization reveals information hidden in the data and gives people a deeper understanding of it. One way to accomplish that goal is to make the plot interactive.

A typical example of data visualization is shown in the following plot:
<img src="https://matplotlib.org/_images/lines3d_demo.png">
This is a 3D line plot generated by matplotlib, the most popular Python visualization library. The line shows the value of the function z = f(x, y) for different values of x and y. Although this is better than listing the x, y and z values in a table, it is still not effective enough: for example, we may want to know the value at a certain point on the line, or we may be interested in the projection of the line onto the xz-plane. A plot is interactive if we can click a point to get its value, or drag the plot to see it from a different perspective.

Throughout the tutorial, we will use the Wikipedia Clickstream dataset (introduced later) to help you understand this library.

### Tutorial content
By reading this tutorial, you will get a basic picture of how to use plotly and how to make plots interactive. We will use [plotly](https://plot.ly) and [pandas](https://pandas.pydata.org) to achieve this goal.

- [Installation and Configuration](#Installation and Configuration)
- [Dataset introduction and loading data](#Dataset introduction and loading data)
- [Dataframe plotting](#Dataframe plotting)
- [Basic charts](#Basic charts)
- [Example application: A 3D knowledge network of Donald Trump](#Example application: A 3D knowledge network of Donald Trump)
- [Summary and reference](#Summary and reference)

## Installation and Configuration
Like most Python libraries, plotly can be installed with `pip`:
    
    $ pip install plotly
    
  or
  
    $ sudo pip install plotly
plotly's Python package is updated frequently. To upgrade, run:
    
    $ pip install plotly --upgrade

Plotly provides a web service for saving graphs, which means you can save your graphs on a public server and retrieve them whenever you want to use them. To use this service, you need to create a [free account](https://plot.ly/ssu/) and store your account credentials in the library.

In [9]:
# fill in your account name and api_key; both can be found on the settings page of your account
import plotly
plotly.tools.set_credentials_file(username='#your usr name', api_key='#your api key')

Plotly also allows you to create offline graphs and save them locally. There are two methods for plotting offline:

In [None]:
import plotly.offline as pf
import plotly.graph_objs as go

# a small placeholder figure so that both calls below have something to draw
fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 1, 2])])
# create a standalone HTML file that is saved locally and opened in your browser
pf.plot(fig, filename='temp-plot.html')
# when working in a Jupyter Notebook, display the plot inside the notebook instead
pf.init_notebook_mode()
pf.iplot(fig)

## Dataset introduction and loading data

As mentioned earlier, we will use the Wikipedia Clickstream dataset to demonstrate the library. A clickstream is a recording of a user's clicks while browsing the web or using another application that links to web pages. This dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested.

The referer of each request is mapped to a value according to this scheme:
- an article in the main namespace -> the article title
- a page from any other Wikipedia project -> other-internal
- an external search engine -> other-search
- any other external site -> other-external
- an empty referer -> other-empty
- anything else -> other-other

For example, suppose you are browsing the Wikipedia page for "Carnegie Mellon University" and click a link to "Computer Science". That click produces the pair (Carnegie Mellon University, Computer Science) in the clickstream.
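The mapping scheme above can be sketched as a small function. This is only a rough illustration in plain Python and not the code used to build the dataset: the function name `classify_referer`, the list of search engine hosts and the host-matching rules are all assumptions made for this example.

```python
from urllib.parse import urlparse, unquote

# Hypothetical list of search engine hosts, for illustration only.
SEARCH_HOSTS = {"www.google.com", "www.bing.com", "duckduckgo.com"}


def classify_referer(referer):
    """Map a raw referer URL to a value of the 'prev' column (simplified)."""
    if not referer:
        return "other-empty"
    parsed = urlparse(referer)
    host = parsed.netloc.lower()
    if not host:
        # not a parseable URL
        return "other-other"
    if host == "en.wikipedia.org" and parsed.path.startswith("/wiki/"):
        title = unquote(parsed.path[len("/wiki/"):])
        # Assumption: a colon marks a non-article namespace such as "Talk:".
        # Real article titles may contain colons, so this is a simplification.
        if title and ":" not in title:
            return title
        return "other-other"
    # Assumption: other Wikipedia projects are recognised by their host suffix.
    if host.endswith(".wikipedia.org") or host.endswith(".wikimedia.org"):
        return "other-internal"
    if host in SEARCH_HOSTS:
        return "other-search"
    return "other-external"


print(classify_referer("https://en.wikipedia.org/wiki/Carnegie_Mellon_University"))
```

With this sketch, a click from the "Carnegie Mellon University" article yields the article title as the `prev` value, while a request without a referer yields `other-empty`.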


To get the data, we download part of the dataset (the whole dataset is too large to use in a tutorial) from [https://ndownloader.figshare.com/files/7563832](https://ndownloader.figshare.com/files/7563832). Unzipping the package gives a tsv file with four columns:
- prev: the result of mapping the referer URL to the fixed set of values described above
- curr: the title of the article the client requested
- type: there are three types: 'link', 'external', 'other'
- n: the number of occurrences of the (referer, resource) pair
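To make the column layout concrete, here is a minimal sketch that parses a few rows in this format with Python's standard `csv` module. The rows are made up for illustration and are not taken from the real file; only the four column names come from the description above.

```python
import csv
import io

# Made-up rows, for illustration only; the real file uses the same four columns.
sample_tsv = (
    "prev\tcurr\ttype\tn\n"
    "other-search\tExample_Article\texternal\t120\n"
    "Example_Article\tAnother_Article\tlink\t35\n"
    "other-empty\tAnother_Article\texternal\t12\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
# convert the count column from text to int while reading
rows = [dict(row, n=int(row["n"])) for row in reader]

for row in rows:
    print(row["prev"], "->", row["curr"], row["type"], row["n"])
```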



In [6]:
# if you want to use plotly in offline mode, initialize the offline plotly library
import plotly
plotly.offline.init_notebook_mode()


In [7]:
# load the data into a pandas dataframe
import pandas as pd

df = pd.read_csv('./2017_01_en_clickstream.tsv', sep = '\t')

## Dataframe plotting
The dataframe is one of the most important classes in pandas; each dataframe object describes a table of data. In many cases we need to print it to stdout, but unfortunately the default format is hard to read because it prints the header and all data rows together as plain text. The `plotly` library has a function called `create_table` in its `figure_factory` module that takes a pandas dataframe as input and draws it as an easy-to-read table.

In [10]:
import plotly.figure_factory as ff
import plotly.offline as pf
# create_table takes a dataframe (or part of one) as input and converts it into a table figure
# using plotly's default table settings
table = ff.create_table(df.head())

# display the table and save it to my account
plotly.plotly.iplot(table, filename = 'fg0')


Because of the limitations of Jupyter Notebook, an interactive plot may not be displayed here unless you install the library and run the code, so we provide a screenshot of the real plot as well as a link to the corresponding interactive plot.

To view the interactive version of the previous plot, please click [https://plot.ly/~red_baron/64](https://plot.ly/~red_baron/64)

We can also use custom settings to create a table. The following example implements three points:
- The header and the cells have different font sizes and alignments
- Data rows cycle through different background colors
- In the "n" column, a bigger value gets a darker background color and vice versa
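The third point comes down to mapping each count to a position in a light-to-dark color list. The following is a minimal sketch in plain Python, assuming one color step per 10 counts and clamping at the darkest color. The names `color_index`, `scale` and `counts` are hypothetical, and the placeholder strings stand in for real color values.

```python
def color_index(value, step, n_colors):
    """Index of the color for value: one step darker per `step` units, capped at the darkest."""
    return min(max(int(value) // step, 0), n_colors - 1)


# Hypothetical light-to-dark scale with 9 entries (placeholders for real colors)
scale = ["shade-%d" % i for i in range(9)]
counts = [5, 27, 88, 450]

cell_colors = [scale[color_index(c, 10, len(scale))] for c in counts]
print(cell_colors)
```

Integer division keeps the result usable as a list index, and the cap means that any count of 80 or more maps to the darkest entry instead of raising an `IndexError`.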

In [11]:
import colorlover as cl
import numpy as np
import plotly.graph_objs as go

# a range of colors from light to dark
colors = cl.scales['9']['seq']['Reds']

# consecutive rows cycle through three background colors
r1 = 'darkgrey'
r2 = 'lightgrey'
r3 = 'white'
rColors = [r1, r2, r3] * 5 
# map each n value to an integer index into colors: one step per 10 counts, capped at the darkest color
color_dark = [min(int(x) // 10, len(colors) - 1) for x in df.n[:15]]
# create a new table object
"""arguments:
    values: the values for each cell
    line: the color of the grid lines
    fill: the background color of each cell
    align: the content alignment 
    font: the content font 
"""
trace0 = go.Table(type = 'table', 
    header = dict(values = df.columns.values.tolist(), line = dict(color = 'grey'), fill = dict(color = 'grey'), 
                    align = ['left'], font = dict(color = 'black', size = 16)), 
    cells = dict(values = [df.prev[:15], df.curr[:15], df.type[:15], df.n[:15]], line = dict(color = 'grey'), fill = dict(color = [rColors, rColors,rColors, np.array(colors)[color_dark]]), 
                    align = ['center'], font = dict(color = 'black', size = 11)))

#convert to a list
data = [trace0]
layout = go.Layout(title = 'Wikipedia Clickstream dataframe')
fig = go.Figure(data = data, layout = layout)

plotly.plotly.iplot(fig, filename = 'fg1')

To view the interactive version, please click [https://plot.ly/~red_baron/66](https://plot.ly/~red_baron/66)

## Basic charts

In this section, you will learn how to use the plotly library to draw some basic charts, such as bar charts and pie charts. While drawing these charts, we will discuss some concepts of the plotly library and the meaning and settings of common parameters. We will also continue to explore the Wikipedia Clickstream dataset and focus on extracting information from the data.

##### Bar chart
A bar chart is a set of rectangular bars that represent different categories of data. It shows comparisons and differences among discrete categories. Usually, the x-axis shows the categories and the y-axis shows the value of each category, which can often be measured numerically. To make this concrete, we will use a bar chart to show the 20 most popular Wikipedia articles.
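Ranking articles by popularity means summing the counts of all pairs that end at the same article and then sorting the totals. A minimal sketch of that aggregation in plain Python, using made-up pairs (the article names and counts are illustrative only):

```python
from collections import Counter

# Made-up (prev, curr, n) pairs, for illustration only
pairs = [
    ("other-search", "Article_A", 50),
    ("Article_B", "Article_A", 20),
    ("other-search", "Article_B", 40),
    ("Article_A", "Article_C", 10),
]

# total number of requests per requested article
totals = Counter()
for prev, curr, n in pairs:
    totals[curr] += n

# the two most requested articles, largest total first
top2 = totals.most_common(2)
print(top2)
```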


So far, we have seen several charts generated by plotly, but there is one thing we haven't discussed yet: interactivity. In this bar chart, you can experience it through a few actions:

1. Move your mouse to the top right of the chart to see some functional buttons

2. Hover over a bar to view its category and exact value

3. Draw a rectangle over an area with your mouse to zoom into that part



In [12]:
# get the 20 most popular wikipedia articles by summing n for each value of the curr column
# selecting only the n column keeps the sum numeric; the slice [1:21] skips the first row of the ranking
df_most_popular = df.groupby('curr')[['n']].sum().sort_values(by=['n'], ascending = False)[1:21]

# create the bar chart; the text argument sets the text associated with each bar
trace0 = go.Bar(x = df_most_popular.index, y = df_most_popular.n, text = df_most_popular.index,
               marker = dict(color = 'rgb(158,202,225)'))
# put it into a list
data = [trace0]
# set the layout, e.g. the title
layout = go.Layout(title = 'The 20 most popular Wikipedia articles in 2017-01')

fig = go.Figure(data = data, layout = layout)
plotly.plotly.iplot(fig, filename = 'fg2')


If it doesn't work, please view it at [https://plot.ly/~red_baron/68](https://plot.ly/~red_baron/68)

##### Pie Chart
A pie chart is a statistical graph made of slices that together form a circle. Each slice represents a category of data, and the chart is designed to illustrate numerical proportions. For this chart, we will use the top 20 referers in the Wikipedia Clickstream dataset.

From the interactive perspective, you can double-click a label in the legend and then single-click the categories you are interested in to see the proportions within this subset.
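A pie chart with many small slices is hard to read, so a common approach is to keep the largest categories and merge the rest into a single slice. A minimal sketch in plain Python; the helper name `top_k_with_others` and the totals are made up for illustration:

```python
def top_k_with_others(totals, k):
    """Keep the k largest categories and merge the remainder into 'others'."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    labels = [name for name, _ in ranked[:k]]
    values = [count for _, count in ranked[:k]]
    rest = sum(count for _, count in ranked[k:])
    if rest:
        labels.append("others")
        values.append(rest)
    return labels, values


# Made-up referer totals, for illustration only
referer_totals = {"other-search": 500, "other-empty": 300,
                  "Page_A": 40, "Page_B": 25, "Page_C": 5}

labels, values = top_k_with_others(referer_totals, 2)
# percentage share of each slice, as a pie chart would display it
shares = [round(100 * v / sum(values), 1) for v in values]
print(list(zip(labels, values, shares)))
```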

In [13]:
# get the top 20 referers; all the rest are counted as 'others'
df_most_refer = df.groupby('prev')[['n']].sum().sort_values(by = ['n'], ascending = False)
categories = df_most_refer.index.tolist()[:20] + ['others']
values = df_most_refer.n.tolist()[:20] + [sum(df_most_refer.n.tolist()[20:])]

# create a pie chart
trace = go.Pie(labels = categories, values = values)
layout = go.Layout(title = 'The top 20 referers to Wikipedia articles in 2017-01')

fig = go.Figure(data = [trace], layout = layout)
plotly.plotly.iplot(fig, filename = 'fg3')

If it doesn't work, please view it at [https://plot.ly/~red_baron/70](https://plot.ly/~red_baron/70)

##### Sankey Diagram

Generally speaking, a Sankey diagram is a specific kind of flow diagram in which the width of each incoming and outgoing flow is proportional to its quantity. Although Sankey diagrams are not very common among data science charts, they are definitely one of the best charts for describing the Wikipedia Clickstream dataset; in fact, a Sankey diagram is also the chart used to illustrate this dataset. We will now implement one in code.

Once the chart is drawn, try moving your mouse over different parts of it to interact with it.
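A Sankey trace is described by a list of node labels plus three parallel lists, `source`, `target` and `value`, where each link refers to its nodes by their position in the label list. The sketch below builds those lists for a single page with incoming and outgoing flows. The helper name `sankey_lists` and all page names and counts are made up for illustration.

```python
def sankey_lists(center, incoming, outgoing):
    """Build labels and link lists for one central node.

    incoming: list of (referer, n); outgoing: list of (destination, n).
    """
    labels = [name for name, _ in incoming] + [center] + [name for name, _ in outgoing]
    center_idx = len(incoming)  # the central node sits right after the referers
    source = list(range(len(incoming))) + [center_idx] * len(outgoing)
    target = [center_idx] * len(incoming) + [center_idx + 1 + i for i in range(len(outgoing))]
    value = [n for _, n in incoming] + [n for _, n in outgoing]
    return labels, source, target, value


labels, source, target, value = sankey_lists(
    "Center_Page",
    incoming=[("other-search", 90), ("Page_A", 30)],
    outgoing=[("Page_B", 25), ("Page_C", 15), ("Page_D", 5)],
)
for s, t, v in zip(source, target, value):
    print(labels[s], "->", labels[t], v)
```

Deriving the indices from the list lengths, rather than hard-coding them, keeps the links valid when a page has fewer incoming or outgoing entries than expected.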

In [14]:
# get the top 10 referers to 'Donald_Trump' and the top 10 pages visited from 'Donald_Trump'
df_incoming_trump = df.loc[df['curr'] == 'Donald_Trump'].sort_values(by=['n'], ascending = False)[:10]
df_outgoing_trump = df.loc[df['prev'] == 'Donald_Trump'].sort_values(by=['n'], ascending = False)[:10]

# build the labels and links
# links are described by three lists: source, target, value
labels = df_incoming_trump.prev.tolist() + ['Donald_Trump'] + df_outgoing_trump.curr.tolist()
source = list(range(10)) + [10] * 10
target = [10] * 10 + [x+11 for x in range(10)]
values = df_incoming_trump.n.tolist() + df_outgoing_trump.n.tolist() 

# create the Sankey diagram
data = dict(type = 'sankey',
           node = dict(pad = 15, thickness = 20, label = labels, line = dict(color = "black", width = 0.5),color = ['blue']*21),
           link = dict(source = source, target = target, value = values))

layout = dict(title = 'Basic Sankey Diagram for Donald Trump')
fig = dict(data = [data], layout = layout)
plotly.plotly.iplot(fig, validate = False, filename = 'fg4')

If it doesn't work, please view it at [https://plot.ly/~red_baron/72](https://plot.ly/~red_baron/72)

## Example application: A 3D knowledge network of Donald Trump

In this section, we will try to dig more information out of this dataset. Suppose we want to find keywords, such as famous politicians or events, that are related to Donald Trump, but we do not know in advance which terms are related and which are not. All we have are lots of (referer, resource) pairs. In the Wikipedia Clickstream dataset, we can view these pairs as relations between two articles, so we can build a network from them: each article is a node, and the count of a pair is the weight of the corresponding edge. To show the relations more clearly, we will build an interactive 3D network from the knowledge we extract from the data.
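The idea of turning pairs into a network can be sketched in plain Python before involving any plotting. The example below is a simplified illustration under stated assumptions: it only follows outgoing pairs, expands two layers from a root article, and uses made-up pairs. The function name `build_network` and its parameters are hypothetical.

```python
def build_network(pairs, root, top=2, stop=()):
    """Collect the root's strongest neighbours, then theirs, and the edges among them."""
    def neighbours(name):
        # pages reached from `name`, strongest first, ignoring entries in `stop`
        out = [(n, curr) for prev, curr, n in pairs if prev == name and curr not in stop]
        return [curr for n, curr in sorted(out, reverse=True)[:top]]

    index = {root: 0}  # article name -> node index
    for first in neighbours(root):
        index.setdefault(first, len(index))
    for first in list(index)[1:]:  # expand each first-layer node once more
        for second in neighbours(first):
            index.setdefault(second, len(index))
    # an edge for every pair whose two articles are both nodes
    edges = [(index[p], index[c], n) for p, c, n in pairs
             if p in index and c in index and p != c]
    return index, edges


# Made-up (prev, curr, n) pairs, for illustration only
pairs = [
    ("Root", "A", 90), ("Root", "B", 60), ("Root", "C", 10),
    ("A", "D", 40), ("A", "Root", 30), ("B", "A", 20),
    ("other-search", "Root", 500),
]
index, edges = build_network(pairs, "Root", top=2, stop=("other-search",))
print(index)
print(edges)
```

Each edge is a `(source index, target index, weight)` triple, which is the shape a graph library or a plotting trace needs as input.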

In [16]:
import networkx as nx
from plotly.graph_objs import *

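# maps a flow direction to the dataframe column used for filtering
direction = {1:'prev', 0:'curr'}
# article name -> node index
node_dict = {'Donald_Trump':0}
data = {'nodes':[{'name':'Donald_Trump', 'group':0}], 'links':[]}
# article names in node index order
word_list = ['Donald_Trump']
# values that should never become nodes
stop_list = ['other-search', 'other-empty', 'other-internal', 'other-external', 'Main_Page']
# add nodes into data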
def add_nodes(df, data, keyword, flow_direction, layer, node_dict, direction):
    top = 10
    if layer == 2:
        top = 5
    df_subset = df.loc[df[direction[flow_direction]] == keyword].sort_values(by=['n'], ascending = False)[:top]
    for index, x in df_subset.iterrows():
        #find a new word not in nodes
        if x[direction[1-flow_direction]] not in node_dict and x[direction[1-flow_direction]] not in stop_list:
            #print 'find' + x[direction[1-flow_direction]]
            node_dict[x[direction[1-flow_direction]]] = len(word_list)
            word_list.append(x[direction[1-flow_direction]])
            data['nodes'].append({'name':x[direction[1-flow_direction]], 'group':2*flow_direction+layer})
            #call it recursive for one more time
            if layer == 1:
                add_nodes(df,data, x[direction[1-flow_direction]], flow_direction, 2, node_dict, direction)
#add links into data
def add_links(df, data, node_dict):
    df_subset = df.loc[(df['prev'].isin(word_list)) | (df['curr'].isin(word_list))]
    for prev in word_list:
        for curr in word_list:
            if prev != curr:
                value = df_subset.loc[(df_subset['prev'] == prev) & (df_subset['curr'] == curr)].n.tolist()
                #when n < 100, ignore it
                if len(value) != 0 and value[-1] > 100:
                    #print 'find ' + prev['name'] + ' and ' + curr['name'] + ' value:' + str(value[-1])
                    link = {'source': node_dict[prev], 'target':node_dict[curr], 'value':value[-1] / 1000}
                    data['links'].append(link)
                    
#build data structure for graph
#refers
add_nodes(df, data, 'Donald_Trump',0,1,node_dict, direction)
#next
add_nodes(df, data, 'Donald_Trump',1,1,node_dict, direction)
add_links(df, data, node_dict)


L=len(data['links'])
Edges=[(data['links'][k]['source'], data['links'][k]['target']) for k in range(L)]


# create a layout to describe the network
G = nx.Graph()
G.add_edges_from(Edges)
layt = nx.kamada_kawai_layout(G, dim = 3)

# replace the z-coordinate of every node with a random value
for k in layt:
    v= layt[k]
    v[-1] = np.random.random_sample()
    layt[k] = v
    
    
#set labels & group
labels = [x['name'] for x in data['nodes']]
group = [x['group'] for x in data['nodes']]
#Set data for the Plotly plot of the graph
Xn=[layt[k][0] for k in range(len(word_list))]# x-coordinates of nodes
Yn=[layt[k][1] for k in range(len(word_list))]# y-coordinates
Zn=[layt[k][2] for k in range(len(word_list))]# z-coordinates
Xe=[]; Ye=[]; Ze=[]
for e in Edges:
    Xe+=[layt[e[0]][0],layt[e[1]][0], None]# x-coordinates of edge ends
    Ye+=[layt[e[0]][1],layt[e[1]][1], None]
    Ze+=[layt[e[0]][2],layt[e[1]][2], None]

# draw the links and nodes of the graph
trace1=Scatter3d(x=Xe, y=Ye, z=Ze, mode='lines',
               line=Line(color='rgb(125,125,125)', width=1), hoverinfo='none')
trace2=Scatter3d(x=Xn, y=Yn, z=Zn, mode='markers', name='articles',
               marker=Marker(symbol='circle',
                             size=6,
                             color=group,
                             colorscale='Viridis',
                             line=Line(color='rgb(50,50,50)', width=0.5)
                             ),
               text=labels, hoverinfo='text')
# hide all unnecessary axis info
axis=dict(showbackground=False, showline=False, zeroline=False,
          showgrid=False, showticklabels=False, title='')
#add layout
layout = Layout(
         title="Network of Donald Trump (3D visualization)",
         width=1000, height=1000, showlegend=False,
         scene=Scene(
         xaxis=XAxis(axis),
         yaxis=YAxis(axis),
         zaxis=ZAxis(axis),
        ),
     margin=Margin(t=100), hovermode='closest')
data=Data([trace1, trace2])
fig=Figure(data=data, layout=layout)

plotly.plotly.iplot(fig, filename = 'fg5')

If it doesn't work, please view it at [https://plot.ly/~red_baron/74](https://plot.ly/~red_baron/74)

## Summary and reference

This tutorial gives you a general overview of plotly and the Wikipedia Clickstream dataset. After reading it, you should know:
- How to use plotly to show a dataframe clearly and effectively.
- How to create basic statistical charts and plots from given data.
- How to make a 3D network graph with plotly.
- What the Wikipedia Clickstream dataset is and what we can do with it.

References:

1. Plotly: [https://plot.ly/python/](https://plot.ly/python/)
2. Clickstream: [https://en.wikipedia.org/wiki/Clickstream](https://en.wikipedia.org/wiki/Clickstream)
3. Wikipedia Clickstream Project: [https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream)
4. Wikipedia Clickstream Data: [https://figshare.com/articles/Wikipedia_Clickstream/1305770](https://figshare.com/articles/Wikipedia_Clickstream/1305770)
5. gis_tutorial: [https://nbviewer.jupyter.org/url/www.datasciencecourse.org/tutorial/gis_tutorial.ipynb](https://nbviewer.jupyter.org/url/www.datasciencecourse.org/tutorial/gis_tutorial.ipynb)
