# Data Visualization Using Bokeh





## 1.  Introduction
Data visualization helps decipher large amounts of data quickly and easily.  Finding meaning in a huge spreadsheet full of numbers is difficult, but modeling those numbers as visuals makes quick analysis possible and helps extract meaning that might not be obvious otherwise.

Bokeh is a library for data visualization in Python.  Its interactive plots and its ability to render in web browsers fill gaps left by libraries like Matplotlib.  It works well with Jupyter Notebooks for quickly creating interactive plots, charts and other visual representations of data.


## 2.  Getting Started with Installation
With Anaconda, installation is quick and easy.  Type this at the command prompt:  __conda install bokeh__ <br>
If you already have all the dependencies, such as NumPy, installed, then you can use pip as well.  Use: __pip install bokeh__

Once installation is complete, go ahead and import the relevant libraries into the Jupyter Notebook.  The first item to import is 'figure' from the bokeh.plotting module, which is the container for the graphs.  To display the visualization inline, use output_notebook().  To display the plot in a separate tab, use output_file('filename.html').

In [1]:
from IPython.display import Image           #To display images in the notebook
import numpy as np
from bokeh.plotting import figure           #basic building block
from bokeh.io import output_notebook, show  #for displaying plots inline
output_notebook()                           #renders plots inline, in the same notebook as the code.

## 3.  Basic Line Plot
First, a very basic line plot is shown below.  After generating our data (or importing it), we need to create a figure object.  This defines the size and the various labels of the plot.  Next, we create each line by passing in our input data and a color.  Finally, we use the show command to display the graph.  Notice how the output graph has built-in buttons for zoom, pan and save, which are absent from the usual static Matplotlib output.

In [2]:
#first we define our data and equations
n = 100                          #number of data points
x = np.linspace(-2.5,2.5, num=n) #creates n equally spaced numbers between -2.5 and 2.5
y = np.power(x,4) - 3*np.power(x,2) + x         #equation 1
z = 3*np.power(x,5) - 25* np.power(x,3) + 60*x  #equation 2

output_notebook() #use output_notebook to output inline.  Use output_file("filename.html") for a new tab
newLine = figure(title="Equations Graph", x_axis_label='x-axis', y_axis_label='y-axis', plot_width=700, plot_height=300)
newLine.line(x, y, legend="equation 1", line_width=3)  #each shape is referred to as a Glyph
newLine.line(x,z, legend = 'equation 2', color = 'Red')
show(newLine)   #displays the results

![Image](https://imgur.com/cj5o8qN.jpg)

## 4.  Adding Colors, Styles and Interaction:  Spiral with Movable Data Points
Bokeh includes a number of handy tools like custom color palettes, different marker shapes for plotting, histograms etc.  It also has tools that let you inspect the data carefully, such as zoom and pan, as well as mouseover functions for point data.<br>

__Colors__:  A color palette is a preselected group of colors that can be used as a theme in a visualization.  The bokeh.palettes module includes all palettes from ColorBrewer, and some palettes from D3 and Matplotlib.  To view all available palettes, import bokeh.palettes as bp and then type in __bp.all_palettes__.  Each palette is a list of hex color strings, which is useful when working with multiple groups of data.  For example, typing in bp.Spectral[3] will return a list of three colors from the Spectral palette in this format: ['#99d594', '#ffffbf', '#fc8d59'].  For a detailed list of palettes, see the [Palettes documentation](https://bokeh.pydata.org/en/latest/docs/reference/palettes.html#bokeh-palettes).<br>
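One common use of a palette is to give each group of data its own color.  The sketch below copies the three Spectral colors quoted above into a plain Python list, so it runs without Bokeh; the group names are made up for illustration.

```python
from itertools import cycle

# The three hex strings quoted above, copied as a plain list so that
# this sketch does not depend on Bokeh being installed.
palette = ['#99d594', '#ffffbf', '#fc8d59']

# Hypothetical group labels; in practice these would come from the data.
groups = ['north', 'south', 'east', 'west']

# cycle() restarts the palette when there are more groups than colors.
group_colors = dict(zip(groups, cycle(palette)))

for name, color in group_colors.items():
    print(name, color)
```

With more groups than colors, the colors repeat, so a larger palette is the better choice when every group must be distinguishable.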

__Styles__:  Bokeh can plot many types of graphs with many kinds of marker symbols.  The [Models Documentation](https://bokeh.pydata.org/en/latest/docs/user_guide/plotting.html) contains the full list and references.  While this example uses the circle to mark data points, many other shapes are also available.  Here are some basic ones:
Image adapted from [here](http://bokeh.pydata.org/en/latest/docs/gallery/markers.html)
![Image](https://i.imgur.com/IGsiphJ.jpg)

__Bokeh Version of DataFrame__:  We will now modify the code to interactively add, remove, or move points within the data set.  This sort of interaction means that data is being changed by the user and instantly updated.  We will use __ColumnDataSource__ for storing and updating our data.  You can think of ColumnDataSource as being the Bokeh version of the Pandas DataFrame, because it names its columns in a similar way.  However, all columns have to be the same length, or else it will throw a warning.
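Since unequal column lengths trigger a warning, it can help to check the data dictionary before handing it to ColumnDataSource.  The following is a minimal pure-Python sketch; the helper names `column_lengths` and `has_equal_lengths` are hypothetical and are not part of Bokeh.

```python
def column_lengths(data):
    """Return a dict mapping each column name to its number of entries."""
    return {name: len(values) for name, values in data.items()}

def has_equal_lengths(data):
    """True when every column has the same number of entries."""
    return len(set(column_lengths(data).values())) <= 1

# Made-up example data: 'good' is consistent, 'bad' is missing one y value.
good = dict(x=[1, 2, 3], y=[4, 5, 6])
bad = dict(x=[1, 2, 3], y=[4, 5])

print(has_equal_lengths(good), column_lengths(good))
print(has_equal_lengths(bad), column_lengths(bad))
```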

In the following code, we will assign a different size and color to each data point. We will use an equation called the involute of a circle to generate our data. We will define our data then store this data in the data structure called ColumnDataSource, so that Bokeh can allow users to modify it.  The __PointDrawTool__ enables users to interactively add, remove or edit points in a scatter plot.
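For reference, the parametric equations of the involute that the code below evaluates can be written as follows, where $a$ is the radius of the generating circle and $t$ is the parameter (both named as in the code):

$$x(t) = a\,(\cos t + t \sin t), \qquad y(t) = a\,(\sin t - t \cos t)$$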

In [3]:
import bokeh.palettes as bp                              #loads all built in color palettes
from bokeh.models import PointDrawTool,ColumnDataSource  #PointDrawTool lets users change data 

sz = 350    #plot size
n = 100     #number of data points
t = np.linspace(-15,15,num = n) #interval
a = 3
x = a * (np.cos(t) + t * np.sin(t))  #equation to generate the data
y = a * (np.sin(t) - t * np.cos(t))
spiral = figure(title = "Involute of a Circle", plot_width=sz, plot_height=sz, tools=[])
radius = np.linspace(sz/10, 8, num=n)#marker sizes run from sz/10 down to 8, so bigger plots get bigger circles
col = np.flip(bp.magma(n), 0)                                          #Magma is a color Palette from Matplotlib.
source = ColumnDataSource(data = dict(x=x,y=y,col=col,radius=radius))  #defining our data source.
a1 = spiral.circle('x','y', size='radius',color='col',fill_alpha=.7, source=source)#Using the source to define changeable inputs

draw_tool = PointDrawTool(renderers=[a1],empty_value='20') #The glyph renderer has to be passed to the tool.
spiral.add_tools(draw_tool)
spiral.toolbar.active_tap = draw_tool
show(spiral)

<img src="https://i.imgur.com/0dM0xW5.gif" alt="dots" style="width: 400px;"/>

## 5.  Linked Zoom / Pan and Custom Tooltips: Equation Output
Linking two plots to pan and zoom together can be useful when inspecting two closely related plots.  To do this, one graph references the other's x_range and y_range, which are just the limits of the x and y axes respectively.  The x_range argument can take another graph's x_range, or it can take Range1d(min, max) to define a min and max value.  To add custom tooltips, the __HoverTool__ has to be imported and used as seen in the code below.  Because the HoverTool also reads from the ColumnDataSource, we can define variables for the tooltips that are not visible in the graph itself.  In the HoverTool's tooltips, values picked from a column of the ColumnDataSource are identified with an @ before the column name.  A $ sign marks a special field that Bokeh supplies itself, such as $index, $x or $y.  See the code below for examples.
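A tooltip that names a column missing from the data source is an easy mistake to make.  The sketch below is a pure-Python check, assuming the tooltip strings used later in this section; `referenced_columns` is a hypothetical helper, not a Bokeh function, and it ignores the `@{name with spaces}` form.

```python
import re

def referenced_columns(tooltips):
    """Collect column names referenced as @name or $color[...]:name."""
    names = set()
    for _label, spec in tooltips:
        # @name picks a value from the column called 'name'
        names.update(re.findall(r'@(\w+)', spec))
        # $color[...]:name reads its color from the column called 'name'
        names.update(re.findall(r'\$color(?:\[[^\]]*\])?:(\w+)', spec))
    return names

tooltips = [
    ("index", "$index"),             # special field, no column needed
    ("(x,y)", "($x, $y)"),           # special fields, no column needed
    ("Total length", "@lengths"),    # column lookup
    ("fill color", "$color[hex, swatch]:cols"),
]
columns = {"x", "y", "lengths", "cols"}

# Any name left in 'missing' is referenced by a tooltip but absent from the data.
missing = referenced_columns(tooltips) - columns
print(sorted(referenced_columns(tooltips)), sorted(missing))
```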

In [4]:
#Some palettes are built with multiple lists of colors.  This function flattens the lists of colors into a single 1d list
def getColors(palettename):
    s = [ x for x in palettename.values()]
    return [item for sublist in s for item in sublist]
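The same flattening can be written with `itertools.chain`.  The sketch below uses a small stand-in dictionary shaped like a palette family (sizes as keys, color lists as values) so that it runs without Bokeh; `toy_palette` and `flatten_colors` are illustrative names only.

```python
from itertools import chain

# Stand-in for a palette family: each key is a palette size,
# each value is the list of hex colors for that size.
toy_palette = {
    3: ['#1f77b4', '#ff7f0e', '#2ca02c'],
    4: ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
}

def flatten_colors(palette_family):
    """Concatenate every color list in the family into one flat list."""
    return list(chain.from_iterable(palette_family.values()))

flat = flatten_colors(toy_palette)
print(len(flat), flat)
```

Note that the flat list repeats colors, because the larger palettes in a family usually start with the same colors as the smaller ones.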

In [5]:
from bokeh.layouts import gridplot      #for plotting to a grid
from bokeh.models import HoverTool

#Defining the data and colors
n = 1560
t = np.linspace(-np.pi*4,np.pi*4,num = n)
x = t - 1.6* np.cos(24*t)
y = t - 1.6* np.sin(25*t)
col = getColors(bp.Category10)
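#approximate arc length: cumulative sum of the speed sqrt(x'(t)^2 + y'(t)^2) multiplied by the step in t
lengths = np.round(np.cumsum(np.sqrt(np.square(1+1.6*24*np.sin(24*t)) + np.square(1-1.6*25*np.cos(25*t)))) * (t[1]-t[0]))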
cols = np.repeat(col,n//len(col))
source = ColumnDataSource(data=dict(x=x,y=y,lengths=lengths,cols=cols))
#creating a figure, and then the line glyph.
newLine = figure(title='Equation', plot_width=sz, plot_height=sz,background_fill_color='black', tools= ['pan,wheel_zoom,box_zoom,reset'])
hover = HoverTool(tooltips=[("(x,y)", "($x, $y)")])
newLine.add_tools(hover)
newLine.xgrid.grid_line_color = None
newLine.ygrid.grid_line_color = None
newLine.multi_line(np.array_split(x, len(col)),np.array_split(y, len(col)),color = col, line_width=2)
#create second figure and link it with the first by defining its x_range & y_range
newLine1 = figure(title='Equation', plot_width=sz, plot_height=sz,x_range=newLine.x_range,y_range=newLine.y_range, tools= ['pan,wheel_zoom,box_zoom,reset'])
newLine1.circle_cross('x','y',color = 'cols',fill_alpha=.3, size=10,line_width=1,source=source)
hover1 = HoverTool(tooltips=[("index", "$index"),("(x,y)", "($x, $y)"),("Total length",'@lengths'),("fill color", "$color[hex, swatch]:cols")])
newLine1.add_tools(hover1)

plot = gridplot([[newLine,newLine1]]) # for 2x2 grid use: gridplot([[p1, p2], [p3, p4]])
show(plot)

![Misc](https://i.imgur.com/Plp2wc1.gif)




## 6.  Data Exploration Using Multiple Tabs:  Costs and Earnings of 155 RKO movies from 1930-1941

Now we will analyze earnings and costs data of 155 movies made by RKO in the 1930s.  This data can be imported using Pandas, and the DataFrame can be used to define the ColumnDataSource.  We will now use multiple tabs to display different aspects of the data.  First we will plot the cost vs. profit as a scatter plot, and use a clickable tab to display some figures in the form of a bar graph.  To create clickable tabs, the __Panel__ command can be used.  It takes a 'child' argument, which is simply the figure to display in that tab.  In the end we will use the __Tabs__ command to put all the tabs together into the same box.

This data has been taken from: J. Sedgwick (1994). "Richard B. Jewell's RKO Film Grosses, 1929-1951:  The C.J. Trevlin Ledger: A Comment", Historical Journal of Film, Radio & Television, Vol.14, Issue 1, p. 51.


In [6]:
#import data and label all variables
import pandas
mov = pandas.read_csv('movies.csv',sep=",")
mov = mov.dropna()   #dropna returns a new DataFrame, so the result has to be reassigned
mov = mov.sort_values(by = ['Year'])
years = mov.Year.unique()
movGrp = mov.groupby('Year')

Apart from the cost and profit, we will also visualize the total revenues generated as a third dimension.  To do this, we will remap the column for revenues to a new range of numbers, which will express the diameter of a circle.  We will then represent each year using a different color in the scatter plot.

In [7]:
#takes an array as data, and two floats that specify the min and max of the new range to map to
def redistribute(data,newMin,newMax):
    oldMin = min(data)
    oldMax = max(data)
    return [(((value - oldMin) * (newMax - newMin)) / (oldMax - oldMin)) + newMin for value in data]
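To see what this remapping does, the function can be tried on a few made-up numbers.  The sketch below repeats the same min-max formula in a self-contained form and adds one guard that is not in the original: if every value is identical, the original would divide by zero, so the sketch returns the midpoint of the new range instead.

```python
def redistribute(data, new_min, new_max):
    """Linearly remap data onto the range [new_min, new_max]."""
    old_min, old_max = min(data), max(data)
    if old_max == old_min:
        # Guard added for this sketch: constant data maps to the midpoint.
        return [(new_min + new_max) / 2 for _ in data]
    return [((value - old_min) * (new_max - new_min)) / (old_max - old_min) + new_min
            for value in data]

# Made-up revenues: smallest maps to 6, largest to 50, the middle lands halfway.
sizes = redistribute([0, 50, 100], 6, 50)
print(sizes)  # [6.0, 28.0, 50.0]
```

Negative inputs work the same way: the most negative value receives the smallest size, so a loss-making movie is drawn as a small circle rather than an invalid negative one.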

In [8]:
from bokeh.models import Range1d, BoxAnnotation,FactorRange
from bokeh.models.widgets import Panel, Tabs

col = bp.plasma(len(years))
colorDict = dict(zip(years, col))  
mov['colors'] = mov['Year'].map(colorDict)  #create a column with a color representing the year.
mov['radiiP'] = redistribute(mov['Profit'], 6,50)
mov['radiiR'] = redistribute(mov['Revenue'],6,50)
source = ColumnDataSource(mov)    #easily define the ColumnDataSource with Pandas DF.
#create the scatter plot
dots = figure( plot_width=800, plot_height=500, background_fill_color='#FEFDF5',x_axis_label='Cost in $1000',y_axis_label='Profit in $1000')
dots.x_range = Range1d(0,2500)
dots.line(x=[0, max(mov['Cost'])] , y=[0, 0], line_dash=[18], line_width=4, line_alpha=.8, color='#8B0101')  #dashed break-even line at zero profit
dots.circle(x='Cost', y='Profit', color='colors',size='radiiR',line_color='colors',line_width=2,fill_alpha = .6,legend='Year', source=source)
dots.legend.background_fill_alpha = .8
dots.add_layout(BoxAnnotation(top=0, fill_alpha=0.05, fill_color='red'))
tab1 = Panel(child=dots, title="Cost-Profit")  #once scatter plot is created fit it into a tab using Panel

dots2 = figure( plot_width=800, plot_height=500, background_fill_color='#FEFDF5',x_axis_label='Cost in $1000',y_axis_label='Revenue in $1000')
dots2.x_range = Range1d(0,2500)
dots2.circle(x='Cost', y='Revenue', color='colors',size='radiiP',line_color='colors',line_width=2,fill_alpha = .6,legend='Year', source=source)
dots2.legend.background_fill_alpha = .8
tab2 = Panel(child=dots2, title="Cost-Revenue")  #once scatter plot is created fit it into a tab using Panel

#create the bar graph
colr= (bp.plasma(3)) *len(years)        #this will create [col1,col2,col3,col1,col2,col3,...]
strYears = [str(yr) for yr in years]
cost =  (movGrp['Cost'].sum()).values
profit =( movGrp['Profit'].sum()).values
rev = (movGrp['Revenue'].sum()).values
#prepare all the labels as a list: [(year,'cost'),(year,'revenue'),(year,'profit'),(next year,'cost')....]
props = ['cost','revenue','profit'] 
x = [ (yr, prop) for yr in strYears for prop in props ]
vals = sum(zip(cost, rev, profit),())
source = ColumnDataSource(data=dict(x=x, vals=vals,colr=colr))
#create the figure and output
bars = figure(x_range=FactorRange(factors=x),plot_width=800, background_fill_color='#FEFDF5')
bars.vbar(x='x', top='vals',color='colr',fill_alpha=.8, width=1,source=source)
bars.x_range.range_padding = 0.1
bars.xaxis.major_label_orientation = 1
bars.xgrid.grid_line_color = None
tab3 = Panel(child=bars, title="Bar")

tabs = Tabs(tabs=[ tab1,tab2, tab3 ])
show(tabs)

![dots](https://i.imgur.com/n9NypyN.gif)
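The bar graph depends on the factor labels in `x` and the flattened numbers in `vals` lining up one to one.  The sketch below checks that pairing in plain Python with two made-up years and made-up totals.

```python
# Made-up labels and yearly totals, used only to show the ordering.
years = ['1930', '1931']
props = ['cost', 'revenue', 'profit']
cost = [10, 20]
rev = [15, 18]
profit = [5, -2]

# Nested factors: every year is paired with every property, year first.
x = [(yr, prop) for yr in years for prop in props]

# zip groups the totals per year; sum(..., ()) concatenates those tuples.
vals = sum(zip(cost, rev, profit), ())

# Pair each label with its value to confirm that the order matches.
pairs = dict(zip(x, vals))
for label, value in pairs.items():
    print(label, value)
```

The order of the arguments to `zip` has to match the order of `props`; otherwise the bars would be drawn under the wrong labels.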


## 7. Adding Sliders for Interaction:  Visualizing Polynomial Features of Degree n

Bokeh's greatest strength is the ability to add interactions.  We can add different types of widgets in Bokeh, such as buttons, tabs, sliders etc.  These widgets can take input from users, and update the graph accordingly.  In this part of the tutorial, we will fit a regression line to a scatter plot, and then build a slider, which will control the degree of the polynomial.  The regression curve will update according to the degree chosen by the user.

These interactive applications have to be run on a separate Bokeh server.  This server synchronizes between Python and the web browser, and responds to tool events to update the plots.  This means that after writing the code, the Bokeh server has to be started from the terminal.  In the terminal, navigate to the directory that contains the notebook, and then type __bokeh serve filename.ipynb__ at the command prompt to launch the server.  Then we can navigate to the webpage __http://localhost:5006/filename__ to view our application.  In the code, __curdoc()__ returns the current document; adding our layout to it is what makes the plot part of the served application.

In [9]:
#importing the data using pandas.
io = pandas.read_csv('data.csv',sep=",",usecols=(0,1))
x = io['x'].values
y = io['y'].values

While the 'x' and 'y' variables store the data for the scatter plot, we will define two new variables, 'cx' and 'cy', to create the regression line.  This ensures a smooth-looking curve even if there are very few data points in x and y.  After importing our data and defining our variables, we will store them in a ColumnDataSource.  Next we define our slider using the __Slider__ command, giving it a start, an end, and a title.  After that we create our event handler by defining an update function.  The slider has an __on_change__ method, which registers the update function to be called whenever the slider value changes.  The update function recalculates all values that need changing, and then updates them in the ColumnDataSource.

In [10]:
from bokeh.io import curdoc
from bokeh.layouts import column, widgetbox
from bokeh.models.widgets import Slider

cx = np.linspace(min(x), max(x), 100) 
cpoly = np.poly1d(np.polyfit(x,y,1))
cy = cpoly(cx)
col = np.flip(bp.viridis(len(x)),0)

source = ColumnDataSource(data=dict(cx=cx, cy=cy, x=x, y=y))
plot = figure(plot_width=600, plot_height=400,background_fill_color='#F9FDFF' )
a2 = plot.circle(x,y, radius=5, fill_alpha =.6,color = col,line_color = col,line_width=2  ) #scatter plot
plot.line('cx', 'cy', source=source, line_width=2, line_alpha=0.7, line_color="#153858") #regression line
degreeFit = Slider(start=1, end=20, value=1, step=1, title="Degree of Polynomial")       #slider for input

def updateData(attrname, old, new):
    d = degreeFit.value
    pcpoly = np.poly1d(np.polyfit(x,y,d))
    cy = pcpoly(cx)
    source.data = dict(cx=cx, cy=cy, x=x, y=y)

degreeFit.on_change('value', updateData)
inputs = widgetbox(degreeFit)
curdoc().add_root(column(inputs, plot))
curdoc().title = "Regression"



<h3><center>Visualizing Polynomial Features of Degree n</center></h3>

![Image](https://i.imgur.com/5WYUw8h.gif)

## 8.  Adding Buttons:  Interactive Regression using polynomial of degree n
Being able to stream data, or change it interactively, is the true strength of Bokeh.  Now, to the previous code, we will add the ability for users to add, move and remove data points, and see how the regression line changes accordingly.  We will also add a table to the output, which will show how the 'x' and 'y' variables have changed.

We will now have two new imports.  We will import a button, which will trigger the update of the regression line.  We will also add TableColumn and DataTable which will display the table with all the required variables.

To change the data of the scatter plot, we will use the PointDrawTool which we talked about in the spiral graph.  Please note that when we add a new point, a new value is added to every column of the ColumnDataSource.  The x and y columns receive the coordinates of the new point, and every other column receives the value defined by __empty_value__.  So if we use 'yellow' as our empty_value, and the ColumnDataSource has the columns x, y, cx, cy and color, then adding a new point will append 'yellow' to each of the cx, cy and color columns.  For this reason, we have to define one ColumnDataSource for the scatter circles and a separate one for the regression line.
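The effect described above can be imitated in plain Python.  The sketch below is only an illustration of that behavior, not Bokeh's implementation; `add_point` is a hypothetical helper and the numbers are made up.

```python
def add_point(columns, x, y, empty_value):
    """Imitate adding a point: x and y get coordinates, other columns get empty_value."""
    new = {}
    for name, values in columns.items():
        if name == 'x':
            new[name] = values + [x]
        elif name == 'y':
            new[name] = values + [y]
        else:
            new[name] = values + [empty_value]
    return new

# One shared source: the regression columns cx and cy receive a string.
shared = dict(x=[1.0, 2.0], y=[3.0, 4.0],
              cx=[1.0, 2.0], cy=[3.1, 3.9], color=['red', 'blue'])
after_shared = add_point(shared, 2.5, 3.5, 'yellow')
print(after_shared['cx'])     # the last entry is no longer a number

# A source holding only the dots: 'yellow' lands only in the color column.
dots_only = dict(x=[1.0, 2.0], y=[3.0, 4.0], color=['red', 'blue'])
after_split = add_point(dots_only, 2.5, 3.5, 'yellow')
print(after_split['color'])
```

Keeping the line data in its own source means the numeric cx and cy columns are never touched by the drawing tool.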

In [11]:
from bokeh.models import TableColumn,DataTable  #for displaying the data in a table
from bokeh.models.widgets import Button

cx = np.linspace(min(x), max(x), 100)
cpoly = np.poly1d(np.polyfit(x,y,1))
cy = cpoly(cx)
col = np.flip(bp.viridis(len(x)),0)
sourceR = ColumnDataSource(data=dict(cx=cx, cy=cy))        #identify our data sources.
sourceDots = ColumnDataSource(data=dict(x=x,y=y, col=col))

plot = figure(plot_width=600, plot_height=400,background_fill_color='#F9FDFF' )
a2 = plot.circle('x','y', radius=5,color='col', fill_alpha =.6,line_width=2,source=sourceDots  )
plot.line('cx', 'cy', source=sourceR, line_width=2, line_alpha=0.7, line_color="#153858")
degreeFit = Slider(start=1, end=20, value=1, step=1, title="Degree of Polynomial")
button = Button(label="Click to Update", button_type="success")    #define our button.

draw_tool = PointDrawTool(renderers=[a2],empty_value='yellow')
plot.add_tools(draw_tool)
plot.toolbar.active_tap = draw_tool

def updateData(attrname, old, new):                        #triggered by the slider
    d = degreeFit.value
    pcpoly = np.poly1d(np.polyfit(sourceDots.data['x'],sourceDots.data['y'],d))
    cy = pcpoly(cx)
    sourceR.data = dict(cx=cx, cy=cy)

def updateAll():                                            #triggered by the button
    d = degreeFit.value
    pcpoly = np.poly1d(np.polyfit(sourceDots.data['x'],sourceDots.data['y'],d))
    cy = pcpoly(cx)
    sourceR.data = dict(cx=cx, cy=cy)    
#define the columns of the table
columns = [TableColumn(field="x", title="x"),
           TableColumn(field="y", title="y"),
           TableColumn(field='col', title='color')]
table = DataTable(source=sourceDots, columns=columns, editable=True,height=200)    

degreeFit.on_change('value', updateData)
button.on_click(updateAll)
inputs = widgetbox(degreeFit,button)
curdoc().add_root(column(inputs, plot,table))
curdoc().title = "Regression"

<h3><center>Improved Visualization of Polynomial Degree Fit for Regression with Bokeh</center></h3>
![Improved Regression](https://i.imgur.com/PDqV6Nw.gif)

<h3><center>Table with updated values of the data</center></h3>

![Table](https://i.imgur.com/Q2NSHMy.gif)