## Prepping the data

In [None]:
import pandas as pd

ks_df = pd.read_csv("ks-projects-201801.csv")

# Bokeh

An interactive visualization library that mixes Python and JavaScript. It is open source, released under the BSD license.

It can be used standalone, or embedded in Jupyter and Zeppelin notebooks.

Always refer to the official documentation: https://docs.bokeh.org/en/latest/docs/reference.html

In [None]:
from bokeh.plotting import figure, show, output_notebook, output_file, reset_output
from bokeh.models import ColumnDataSource, CategoricalColorMapper, PrintfTickFormatter, NumeralTickFormatter, Legend, CDSView, GroupFilter, CustomJS, BoxSelectTool, FactorRange
from bokeh.layouts import gridplot, column
from bokeh.models.widgets import Div
from bokeh.palettes import Spectral6, Pastel1, Category20c, Inferno256


output_notebook()

## Output and running mode

Two output modes are provided: to file and to notebook.

The output to file mode is activated with the command `output_file(<filename>)`: in this mode, the output generated by bokeh is saved to a file when `show` is called.

Conversely, in notebook mode (activated with the command `output_notebook()`) the output is directed to a notebook cell. Note that the two modes can be active at the same time.
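
As a minimal sketch of the file mode, the snippet below routes the output to an HTML file and then clears the output state with `reset_output`. The plot data and the file name `demo_plot.html` are made up for illustration, and `save` is used instead of `show` so that no browser window is opened.

```python
import os
import tempfile

from bokeh.io import output_file, reset_output, save
from bokeh.plotting import figure

# Tiny plot with made-up data, purely for illustration.
p = figure(height=200, width=300, title="Output mode demo")
p.line([1, 2, 3], [4, 6, 5])

# Route output to an HTML file (hypothetical file name in a temp directory).
out_path = os.path.join(tempfile.mkdtemp(), "demo_plot.html")
output_file(out_path, title="Output mode demo")

# save() writes the file configured above and returns its path.
saved_path = save(p)

# Clear the output settings so that later cells start from a clean state.
reset_output()
```

After `reset_output()`, `output_notebook()` has to be called again before plots are rendered inline.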

## Histogram - bar plot

Bar plots can be created in a similar way to matplotlib/seaborn. The main difference is that the output is an interactive (JavaScript-based) chart with a default toolbar to zoom, pan, select, save, etc. All options can be customized.

The bar plot creation is composed of 3 steps:
1. create the plot object. We do this with the `figure` function, which creates a figure, a subclass of Plot that comes with a default configuration (axes, grids, tools, etc.). The Plot object will then contain the glyphs.
2. create the glyphs within the plot. This can be done with convenience methods on the Plot object for a number of different charts.
3. render the plot within a cell. This is achieved by passing the plot to the `show` function, which renders it in a cell if the output mode is notebook.

In [None]:
main_cat = ks_df.main_category.value_counts()

p = figure(x_range=FactorRange(factors=main_cat.index), tools="pan, save, wheel_zoom")

p.vbar(x=main_cat.index, top=main_cat.values)

show(p)

The `x_range=FactorRange(factors=main_cat.index)` parameter in the figure call is used to tell the Plot object that the x axis is categorical. The FactorRange is a range of values for a categorical dimension.

Creating the FactorRange object explicitly is not required. We can directly pass a list to the figure object and the FactorRange will be created for us automatically.
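
A small sketch of this shortcut, using made-up category names and counts: a plain list of strings is passed as `x_range`, and the resulting range on the figure can be inspected to confirm that it is a FactorRange.

```python
from bokeh.models import FactorRange
from bokeh.plotting import figure

# Illustrative categories and counts, not taken from the dataset.
categories = ["Film & Video", "Music", "Publishing"]
counts = [30, 20, 10]

# Passing a list of strings makes the x axis categorical.
p = figure(x_range=categories, height=250, width=400)
p.vbar(x=categories, top=counts, width=0.9)

# The figure wraps the list in a FactorRange for us.
is_factor_range = isinstance(p.x_range, FactorRange)
factors = list(p.x_range.factors)
```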

Let's now customize the plot a little bit to make it better, for example by fixing the spacing, the y axis start, and by removing the vertical grid.

In [None]:
p = figure(x_range=main_cat.index.values, height=300, width=700,toolbar_location=None, tools="save")

p.vbar(x=main_cat.index, top=main_cat.values, width=0.9)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 0.45
p.y_range.start = 0

show(p)

Zooming and panning in a bar plot are actually counterproductive, so we can remove them. We also rotate the labels so that they don't overlap.

In [None]:
p = figure(x_range=main_cat.index.values, height=300, width=700, tools="save")

p.vbar(x=main_cat.index, top=main_cat.values, width=0.9)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 0.45
p.y_range.start = 0
p.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
p.yaxis.minor_tick_line_color = None


show(p)

Now we also format the numbers on the y axis with a NumeralTickFormatter (there are several formatter classes for different types of values) and remove the minor ticks.

We also add the toolbar back, but with only the possibility to save the plot.

## ColumnDataSource

So far, we have provided data to bokeh directly (via lists) without an intermediate model. To create more advanced graphs, possibly composed of several components, and to do this easily, bokeh provides the ColumnDataSource class, which makes it easy to share data across multiple plots and to share selections on such data.

The ColumnDataSource object can be created from lists or from pandas dataframes.

A ColumnDataSource is composed of a number of named columns, which can be created from a dictionary. The columns should always have the same length.

When creating glyphs, we can reference the columns by name in the ColumnDataSource.
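
As a quick illustration with made-up counts, the sketch below builds a ColumnDataSource from a small DataFrame and inspects its columns. Note that the DataFrame index is added as an extra column (named `index` when the index has no name), which is why `index` shows up next to the data columns.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Made-up counts, only to illustrate the structure.
df = pd.DataFrame({"main_category": ["Music", "Games"], "count": [20, 10]})

source = ColumnDataSource(df)

# Columns are addressed by name; the unnamed DataFrame index becomes "index".
column_names = sorted(source.column_names)
counts = list(source.data["count"])
```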

In [None]:
data = {
    'category_name' : main_cat.index.values,
    'project_count' : main_cat.values
}

cat_count_df = ks_df.groupby("main_category")[["name"]].count().reset_index().rename({"name":"count"},axis=1)

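# A ColumnDataSource can be built from a dict of equal-length columns...
source = ColumnDataSource(data=data)
# ...or directly from a DataFrame. This second assignment is the one used below.
source = ColumnDataSource(data=cat_count_df)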

p = figure(x_range=cat_count_df.main_category, height=300, width=700, tools="save", title="Project counts by category", x_axis_label="Project category", y_axis_label="Number of projects")

p.vbar(x='main_category', top='count', source=source, width=0.9)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 0.45
p.y_range.start = 0
p.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
p.yaxis.minor_tick_line_color=None

show(p)

## Tooltip

One important feature of interactive charts is the possibility to inspect their elements to get more information about them. One way to provide additional information is to use a tooltip. Tooltips integrate nicely with ColumnDataSource: the data to be displayed can be accessed by referring to a column name (`@column`) or to plot properties (`$index`, x/y positions, etc.).

In [None]:
source = ColumnDataSource(data=cat_count_df)

TOOLTIP = [
    ("Project count", "@count{0,0}"),
    ("Category name", "@main_category"),
    ("index", "$index")
]

p = figure(x_range=cat_count_df.main_category, height=300, width=700,
           toolbar_location=None, tools="", title="Project counts by categories",
           x_axis_label="Project category", y_axis_label="Project count", tooltips=TOOLTIP)

p.vbar(x='main_category', top='count', width=0.9, source=source)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 0.45
p.y_range.start = 0
p.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
p.yaxis.minor_tick_line_color=None

show(p)
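
The `tooltips` argument of `figure` is a shortcut that adds a HoverTool to the plot. As a sketch with made-up data, the same tooltip can be configured by creating the HoverTool explicitly, which is useful when the tool needs further customization.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up counts, only to illustrate the tooltip configuration.
source = ColumnDataSource(data={
    "main_category": ["Music", "Games"],
    "count": [20000, 12000],
})

p = figure(x_range=["Music", "Games"], height=250, width=400, tools="")
p.vbar(x="main_category", top="count", width=0.9, source=source)

# "@name" reads a column of the source, "$index" is the row index of the glyph.
hover = HoverTool(tooltips=[
    ("Category name", "@main_category"),
    ("Project count", "@count{0,0}"),
    ("index", "$index"),
])
p.add_tools(hover)

hover_tools = [tool for tool in p.toolbar.tools if isinstance(tool, HoverTool)]
```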

## Stacked bar plot

With ColumnDataSource it's easy to create more advanced types of bar plots, such as a stacked bar plot. You need multiple columns in the ColumnDataSource to represent the different bar sets, and then you can reference them by column name.

In the example below we also see how to customize the selection tool, to make it more useful for a bar plot, where selection on the y dimension does not make sense. Note also how we use an additional column (`All project count`) not for the bar plot itself, but only as an additional piece of information for the tooltip.

In [None]:
ks_df.state.value_counts()

cat_count_state_df = ks_df[ks_df.state.isin(["failed", "successful"])]

cat_count_state_df = cat_count_state_df.groupby(["main_category", "state"])[["name"]].count().unstack()
cat_count_state_df.columns = cat_count_state_df.columns.droplevel()
cat_count_state_df = cat_count_state_df.reset_index()
cat_count_state_df["total"] = cat_count_state_df["failed"] + cat_count_state_df["successful"]
cat_count_state_df = cat_count_state_df.sort_values("total", ascending=False)

source = ColumnDataSource(data=cat_count_state_df)

TOOLTIPS = [
    ("Category", "@{main_category}"),
    ("Failed project count", "@{failed}{0,0}"),
    ("Successful project count", "@{successful}{0,0}"),
    ("All project count", "@{total}{0,0}"),
]

p = figure(x_range=cat_count_state_df.main_category, height=400, width=700,
           title="Failed and successful project counts by category",
           x_axis_label="Project category", y_axis_label="Project counts",
           tooltips=TOOLTIPS, tools="save,reset")

box_select = BoxSelectTool(dimensions="width")
p.add_tools(box_select)

project_state = ["successful", "failed"]
p.vbar_stack(project_state, x="main_category", width=0.9, source=source, color=["green", "blue"], legend_label=project_state)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 0.45

show(p)

## Multiple plots

Depending on what you want to learn from the data, a stacked bar plot is not always the best solution. Splitting the data into 2 different bar plots can show both trends.

This can be achieved by using layouts: by column, by row, or the most generic grid layout. The steps we follow are:
1. create the common ColumnDataSource
2. create the 2 plots (and the corresponding glyphs) individually
3. layout the plots in a grid layout. The grid can be composed of plots as well as custom divs.

Note that, because both plots use the same ColumnDataSource, the selection is linked between them.

In [None]:
from bokeh.models import Range1d

source = ColumnDataSource(data=cat_count_state_df)

TOOLTIPS = [
    ("Category", "@{main_category}"),
    ("Failed project count", "@{failed}{0,0}"),
    ("Successful project count", "@{successful}{0,0}"),
    ("All project count", "@{total}{0,0}"),
]

p1 = figure(x_range=cat_count_state_df.main_category, height=350, width=700,
           y_axis_label="Successful project count", tooltips=TOOLTIPS,
           tools='reset,save')

box_select1 = BoxSelectTool(dimensions="width")
p1.add_tools(box_select1)

p1.vbar(x="main_category", top="successful", source=source, width=0.9)


p1.xgrid.grid_line_color = None
p1.xaxis.major_label_text_font_size = '0pt'
p1.xaxis.major_tick_line_color = None
p1.y_range = Range1d(0, 35000)

p2 = figure(x_range=cat_count_state_df.main_category, height=350, width=700,
           y_axis_label="Failed project count", tooltips=TOOLTIPS,
           tools='reset,save')

box_select2 = BoxSelectTool(dimensions="width")
p2.add_tools(box_select2)

p2.vbar(x="main_category", top="failed", source=source, width=0.9)

p2.xgrid.grid_line_color = None
p2.xaxis.major_label_orientation = "vertical"
p2.y_range = Range1d(0, 35000)


grid = gridplot([p1,p2], ncols=1)
title = Div(text="<h3> Successful vs failed project count </h3>")
l = column(title,grid)
show(l)


## Scatterplot

To create a scatterplot we use the `scatter` convenience method on the figure (plot) object. We map the x and y positions to specific columns of the ColumnDataSource object.

In addition, we map the project main category on the circle color. To do this we use a particular type of mapper, the CategoricalColorMapper. A number of other mappers exist: https://docs.bokeh.org/en/latest/docs/reference/models/mappers.html.

Note also how we can create a legend (for the project category) by simply referring to the ColumnDataSource column.

In [None]:
TOOLTIPS = [
    ("Name", "@name"),
    ("Category", "@main_category"),
    ("Pledges", "$ @{usd pledged}{0,0}"),
    ("Backers", "@backers{0,0}")
]

category_uniq = ks_df.main_category.unique()

source = ColumnDataSource(data=ks_df.head(10000))
color_mapper = CategoricalColorMapper(factors=category_uniq, palette=Category20c[len(category_uniq)])

p = figure(width=700, height=500, tooltips=TOOLTIPS, title="Pledges vs backers")

p.scatter(x='usd pledged', y='backers', size=5, source=source, alpha=0.9, color={'field' : 'main_category', 'transform': color_mapper},
        legend_field='main_category')

p.xaxis.axis_label = "USD Pledged"
p.yaxis.axis_label = "Backers"
p.x_range.start = 0 
p.y_range.start = 0 

p.add_tools(BoxSelectTool())
p.toolbar.autohide = True
p.x_range = Range1d(0, 5000000, bounds=(0, None))
p.y_range = Range1d(0, 40000, bounds=(0, None))

show(p)
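
The palette lookup `Category20c[len(category_uniq)]` above only works when the number of categories is one of the sizes the palette defines (3 to 20). As a hedged sketch, the hypothetical helper `build_color_mapper` below (not part of bokeh) picks a valid palette size and trims it to the number of categories.

```python
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Category20c


def build_color_mapper(factors):
    """Return a CategoricalColorMapper with one Category20c color per factor."""
    n = len(factors)
    if n > 20:
        raise ValueError("Category20c provides at most 20 colors")
    # Category20c is keyed by sizes 3..20, so request at least 3 and trim.
    palette = list(Category20c[max(n, 3)][:n])
    return CategoricalColorMapper(factors=list(factors), palette=palette)


# Illustrative categories: only two, fewer than the smallest palette size.
mapper = build_color_mapper(["Music", "Games"])
```

The returned mapper can be used exactly like `color_mapper` in the cell above, for example as `color={'field': 'main_category', 'transform': mapper}`.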

## Callbacks

Now we want to create 2 plots, and update one based on a selection in the other. For example, we can create the bar plot of the project categories and the scatterplot of pledges vs backers, and in the scatterplot only show the dots for the projects belonging to the categories selected in the bar plot.

To do this, we need to update the scatterplot by attaching a callback to the barplot.

We start by creating the bar plot and by attaching a test callback on the selection event. The DataSource class (superclass of ColumnDataSource) has a [selected](https://docs.bokeh.org/en/latest/docs/reference/models/sources.html?highlight=datasource#bokeh.models.DataSource.selected) property that provides the selected indices on the DataSource.
By using the [on_change](https://docs.bokeh.org/en/latest/docs/reference/model.html#bokeh.model.Model.on_change) method (part of the base `Model` bokeh class) on the `selected` property, we can attach the callback to a change in the selection of the datasource.

In [None]:
source = ColumnDataSource(data=cat_count_df)

TOOLTIPS = [
    ("Project count", "@count{0,0}"),
    ("Category", "@main_category"),
    ("index", "$index")
]

p = figure(x_range=cat_count_df.main_category, height=300, width=700,
           toolbar_location=None, tools="", title="Project counts by categories",
           x_axis_label="Project category", y_axis_label="Project count",
           tooltips=TOOLTIPS)
box_select = BoxSelectTool(dimensions='width')
p.add_tools(box_select)

p.vbar(x='main_category', top='count', width=0.9, source=source)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 0.45
p.y_range.start = 0
p.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
p.yaxis.minor_tick_line_color=None

def callback(attr, old, new):
    print(new)

source.selected.on_change("indices", callback)


show(p)

The warning here is telling us that what we are trying to do is not possible: Python callbacks cannot be attached to standalone HTML output, because there is no Python process for them to run in.

We thus have 2 possible options:
1. use JavaScript callbacks, which can run in the HTML output
2. use Python callbacks and a bokeh server to run them

We show the JavaScript callback example here, and we'll use a bokeh server later in the notebook.

In the code below, note that our Python objects have JavaScript counterparts. To update the scatterplot we have to manipulate these JavaScript objects.

In [None]:
category_uniq = ks_df.main_category.unique()

TOOLTIPS1 = [
    ("Name", "@name"),
    ("Category", "@main_category"),
    ("Pledges", "$ @{usd pledged}{0,0}"),
    ("Backers", "@backers{0,0}")
]


source1 = ColumnDataSource(data=ks_df.head(10000))
immutable_source = ColumnDataSource(data=ks_df)
color_mapper = CategoricalColorMapper(factors=category_uniq, palette=Category20c[15])

p1 = figure(width=700, height=500, tooltips=TOOLTIPS1, title="USD Pledged vs backers")
p1.scatter(x='usd pledged', y='backers', 
         color={'field': 'main_category', 'transform': color_mapper}, 
         source=source1, alpha=0.7, legend_field='main_category', size=5)

p1.xaxis.axis_label = "USD Pledged"
p1.yaxis.axis_label = "Backers"
p1.x_range.start = 0
p1.y_range.start = 0

source2 = ColumnDataSource(data=cat_count_df)

TOOLTIPS2 = [
    ("Category", "@main_category"),
    ("Project count", "@count{0,0}")
]

p2 = figure(x_range=cat_count_df.main_category, height=250, width=700,
           y_axis_label="Project count", 
           x_axis_label="Project category", tooltips=TOOLTIPS2,
           tools='save', title="Project count by category")
box_select = BoxSelectTool(dimensions='width')
# Attach the tool created on the previous line (not box_select1 from an earlier cell).
p2.add_tools(box_select)

p2.vbar(x='main_category', top='count', width=0.9, source=source2)

p2.xaxis.major_label_orientation = 0.45
p2.xgrid.grid_line_color = None
p2.y_range.start = 0
p2.yaxis.minor_tick_line_color=None

source2.selected.js_on_change('indices', CustomJS(args=dict(source1=source1, source2=source2, immutableSource=immutable_source),
                              code="""
        var indices = cb_obj.indices;
        var data1 = source1.data;
        var data2 = source2.data;
        var immutableData = immutableSource.data;
        var categories = indices.map(index => data2['main_category'][index])
        data1['usd pledged'] = [];
        data1['backers'] = [];
        data1['main_category'] = [];
        data1['name'] = [];
        for (var i = 0; i < immutableData['usd pledged'].length; i++) {
            if (categories.includes(immutableData['main_category'][i])) {
                data1['usd pledged'].push(immutableData['usd pledged'][i]);
                data1['backers'].push(immutableData['backers'][i]);
                data1['main_category'].push(immutableData['main_category'][i]);
                data1['name'].push(immutableData['name'][i]);
            }
        }
        source1.change.emit();
    """)
)


show(column(p1,p2))
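
To make the logic of the JavaScript callback above easier to follow, here is a plain Python sketch of the same filtering step, run on a few made-up rows. The function name `filter_columns` is hypothetical and the code only mirrors what the callback does in the browser: it keeps the rows whose category is among the selected ones and rebuilds each column.

```python
def filter_columns(all_data, selected_categories, columns):
    """Keep only the rows whose main_category is selected, column by column."""
    selected = set(selected_categories)
    keep = [i for i, cat in enumerate(all_data["main_category"]) if cat in selected]
    return {col: [all_data[col][i] for i in keep] for col in columns}


# Made-up rows in the same column-oriented layout as a ColumnDataSource.
all_data = {
    "name": ["A", "B", "C"],
    "main_category": ["Music", "Games", "Music"],
    "usd pledged": [100.0, 2500.0, 40.0],
    "backers": [3, 80, 1],
}

filtered = filter_columns(
    all_data, ["Games"], ["name", "main_category", "usd pledged", "backers"]
)
```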

## Datatables & Widgets

### Datatable

A DataTable is a tabular representation of data that supports both showing and editing the data. It is composed of TableColumn objects and is highly customizable.

In [None]:
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn

source = ColumnDataSource(data=ks_df)

columns = [
    TableColumn(field="name", title="Name"),
    TableColumn(field="backers", title="Backers"),
    TableColumn(field="main_category", title="Category"),
    TableColumn(field="usd pledged", title="USD Pledged")]

data_table = DataTable(source=source, columns=columns, editable=True)

show(data_table)

### Datacube

DataCube, introduced in bokeh 1.3.0 (https://docs.bokeh.org/en/latest/docs/releases.html?highlight=datacube#release-1-3-0), is a specialized DataTable that provides collapsible groups and aggregation metrics for these groups (e.g., totals and sub-totals).

The grouping is provided by a GroupingInfo object, which in turn uses a getter as a "group by" criterion and aggregator objects (e.g., SumAggregator or MaxAggregator) to compute the aggregation metrics.

In [None]:
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn, GroupingInfo, MaxAggregator, DataCube, SumAggregator

source = ColumnDataSource(data=ks_df.head(10000))

columns = [
    TableColumn(field="name", title="Category / Project"),
    TableColumn(field="backers", title="Backers")
]

grouping = [
    GroupingInfo(getter='main_category', aggregators=[MaxAggregator(field_="backers")])
]

target = ColumnDataSource(data=dict(row_indices=[], labels=[]))

data_cube = DataCube(source=source, columns=columns, grouping=grouping, target=target)

show(data_cube)
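
Conceptually, the grouping above computes one aggregated value per group. The plain Python sketch below, on a few made-up rows, shows what a "group by `main_category`" with a max and a sum over `backers` produces. It illustrates the idea only and is not how bokeh implements it.

```python
from collections import defaultdict

# Made-up rows, only to illustrate the aggregation.
rows = [
    {"name": "A", "main_category": "Music", "backers": 10},
    {"name": "B", "main_category": "Games", "backers": 80},
    {"name": "C", "main_category": "Music", "backers": 25},
]

# The getter ("main_category") decides which group a row belongs to.
groups = defaultdict(list)
for row in rows:
    groups[row["main_category"]].append(row["backers"])

# One aggregated value per group, as a MaxAggregator or SumAggregator would show.
max_backers = {category: max(values) for category, values in groups.items()}
sum_backers = {category: sum(values) for category, values in groups.items()}
```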



## Tabs

Selectable tabs that contain plots, widgets, or layouts.

Tabs are composed of a list of TabPanel objects, where a TabPanel has a child and a title.

In [None]:
from bokeh.models import TabPanel, Tabs


source1 = ColumnDataSource(data=cat_count_df)

TOOLTIPS1 = [
    ("Project count", "@count{0,0}"),
    ("Category", "@main_category")
]

p1 = figure(x_range=cat_count_df.main_category, height=300, width=700,
           toolbar_location=None, tools="", title="Project counts by categories",
           x_axis_label="Project category", y_axis_label="Project count",
           tooltips=TOOLTIPS1)

p1.vbar(x='main_category', top='count', width=0.9, source=source1)

p1.xgrid.grid_line_color = None
p1.xaxis.major_label_orientation = "vertical"
p1.y_range.start = 0
p1.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
p1.yaxis.minor_tick_line_color=None

source2 = ColumnDataSource(data=ks_df.head(10000))

columns=[TableColumn(field="name", title="Name"), 
         TableColumn(field="backers", title="Backers")]

grouping = [
    GroupingInfo(getter='main_category', aggregators=[SumAggregator(field_='backers')])
]

target = ColumnDataSource(data=dict(row_indices=[], labels=[]))

data_cube = DataCube(source=source2, columns=columns, grouping=grouping, target=target)


tab1 = TabPanel(child=p1, title="Category barchart")
tab2 = TabPanel(child=data_cube, title="Backers table")
tabs = Tabs(tabs=[tab1,tab2])

show(tabs)


## Creating an interactive dashboard

To create this interactive dashboard we will use, in addition to what we have already seen, MultiSelect and Slider widgets. We will also use a bokeh server and python callbacks to update the dashboard based on selections in the widgets.

### Bokeh server

To be able to use Python callbacks, we need a bokeh server to run our callback function code. A bokeh server can be run from the command line with:

`bokeh serve --show --port 5001 bokeh_dashboard`

However, if we want to have our dashboard within the notebook, we can use an embedded server, provided by the `Application` class, which is a factory to create bokeh documents. The `Application` takes a function handler, which is used to process bokeh documents.

The function handler essentially creates the dashboard and attaches it to the bokeh document that the handler receives as input. Within the function handler we also define the Python callbacks and register them on widget selection events.
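
A minimal sketch of this mechanism, with a hypothetical handler `modify_doc` that only adds a placeholder `Div`: calling `create_document` on the `Application` runs the handler on a fresh document, which is a convenient way to check the handler without starting a server.

```python
from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
from bokeh.models import Div


def modify_doc(doc):
    """Hypothetical handler: attach a placeholder Div as the document root."""
    doc.add_root(Div(text="Dashboard placeholder"))


app = Application(FunctionHandler(modify_doc))

# Build a document by running the handler, without starting a server.
doc = app.create_document()
root_count = len(doc.roots)
```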

### Multiselect & Slider

A MultiSelect shows multiple available options and supports multiple selections. The input is a list of possible options.

A Slider selects a single value within an interval (start - end). To select both a lower and an upper bound, a RangeSlider can be used. DateSlider and DateRangeSlider are the corresponding sliders for date values.
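
The difference between the two slider types shows up in their `value`: a single number for a Slider, a (lower, upper) pair for a RangeSlider. The sketch below uses made-up bounds and titles.

```python
from bokeh.models import RangeSlider, Slider

# A Slider holds a single value within [start, end].
slider = Slider(start=0, end=100, value=10, step=1, title="Minimum backers")

# A RangeSlider holds a (lower, upper) pair within [start, end].
range_slider = RangeSlider(start=0, end=100, value=(10, 60), step=1,
                           title="Backers range")

single_value = slider.value
low, high = range_slider.value
```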

In [None]:
ks_df = pd.read_csv("ks-projects-201801.csv")
ks_df.launched = pd.to_datetime(ks_df.launched)
ks_df = ks_df[ks_df.launched > '1980-1-1']

In [None]:
ks_df_small = ks_df.head(10000)

In [None]:
from bokeh.models.widgets import MultiSelect, Slider, DateRangeSlider
from bokeh.layouts import row
from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
from bokeh.models.ranges import Range1d

def modify_doc(doc):
        
    categories = ks_df_small.main_category.unique().tolist()

    TOOLTIPS1 = [
        ("Name", "@name"),
        ("Category", "@main_category"),
        ("Pledges", "$ @{usd pledged}{0,0}"),
        ("Backers", "@backers{0,0}"),
        ("Launched", "@launched"),
    ]


    source = ColumnDataSource(data=ks_df_small)

    color_mapper = CategoricalColorMapper(factors=categories, palette=Category20c[len(categories)])

    backer_range = Range1d(ks_df_small.backers.min(), ks_df_small.backers.max())
    pledges_range = Range1d(ks_df_small['usd pledged'].min(), ks_df_small['usd pledged'].max())
    
    p = figure(width=550, height=500, tooltips=TOOLTIPS1,
               title="Pledges vs backers", sizing_mode="fixed", 
               x_range=pledges_range, y_range=backer_range)
    p.scatter(x='usd pledged', y='backers', 
         color={'field': 'main_category', 'transform': color_mapper}, 
         source=source, alpha=1, legend_field='main_category', size=5)

    p.xaxis.axis_label = "Pledged"
    p.yaxis.axis_label = "Backers"
        
    select = MultiSelect(title="Project main category:", 
                           options=categories, width=200, height=300, value=categories)
        
    slider = Slider(start=0, end=ks_df_small.backers.max(), value=0, 
                    step=1, title="Filter by number of backers",
                    width=200)

    mindate = ks_df_small.launched.min()
    maxdate = ks_df_small.launched.max()
    
    date_slider = DateRangeSlider(start=mindate, end=maxdate, value=(mindate, maxdate),
                    title="Filter by project launched",
                    width=550)

    
    def filter_data():
        selected_categories = select.value
        min_backers = slider.value
        min_date = date_slider.value_as_date[0]
        max_date = date_slider.value_as_date[1]

        new_df = ks_df_small[ks_df_small.backers >= min_backers]
        new_df = new_df[new_df.main_category.isin(selected_categories)]
        new_df = new_df[(new_df.launched.dt.date >= min_date) & (new_df.launched.dt.date <= max_date)]
                
        return ColumnDataSource(data=new_df)

    
    def update_data(attr, old, new):
        updated_source = filter_data()
        source.data.update(updated_source.data)
        
    
    select.on_change("value", update_data)
    slider.on_change("value", update_data)
    date_slider.on_change("value", update_data)

    doc.add_root(row(column(select, slider), column(p, date_slider)))

handler = FunctionHandler(modify_doc)
app = Application(handler)

show(app)