# Bokeh: Data Sources & Transformation

Most parts of this tutorial are taken from the Bokeh [website](https://docs.bokeh.org/en/latest/index.html).

## Imports and Setup

First, let's make the standard imports:

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

In [None]:
output_notebook()

This notebook uses Bokeh sample data. If you haven't downloaded it already, you can do so by running the following:

In [None]:
import bokeh.sampledata
bokeh.sampledata.download()

## Overview

We've seen how Bokeh can work well with Python lists, NumPy arrays, Pandas series, etc. At lower levels, these inputs are converted to a Bokeh `ColumnDataSource`. This data type is the central data source object used throughout Bokeh. Although Bokeh often creates them for us implicitly, there are times when it is useful to create them explicitly.

In later sections we will see features like hover tooltips, computed transforms, and interactions that make use of the `ColumnDataSource`, so let's take a quick look now.

### Creating with Python Dicts

The `ColumnDataSource` can be imported from `bokeh.models`:

In [None]:
from bokeh.models import ColumnDataSource

The `ColumnDataSource` is a mapping of column names (strings) to sequences of values. Here is a simple example. The mapping is provided by passing a Python `dict` with string keys and simple Python lists as values. The values could also be NumPy arrays or Pandas Series.

***NOTE: ALL the columns in a `ColumnDataSource` must always be the SAME length.***


In [None]:
source = ColumnDataSource(data={
    'x' : [1, 2, 3, 4, 5],
    'y' : [3, 7, 8, 5, 1],
})
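To check what a source holds, its columns can be read back from Python. The minimal sketch below rebuilds the same source and inspects it through its `column_names` and `data` attributes; the `lengths` helper variable is introduced here only for illustration.

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data={
    'x': [1, 2, 3, 4, 5],
    'y': [3, 7, 8, 5, 1],
})

# column_names lists the keys of the mapping
print(source.column_names)

# data maps each column name to its sequence of values
print(list(source.data['x']))

# every column should report the same length
lengths = {name: len(values) for name, values in source.data.items()}
print(lengths)
```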

Up until now we have called functions like `p.circle` by passing in literal lists or arrays of data directly. When we do this, Bokeh automatically creates a `ColumnDataSource` for us. But it is possible to specify a `ColumnDataSource` explicitly by passing it as the `source` argument to a glyph method. Whenever we do this, if we want a property (like `"x"` or `"y"` or `"fill_color"`) to have a sequence of values, we pass the ***name of the column*** that we would like to use for a property:

In [None]:
p = figure(plot_width=400, plot_height=400)
p.circle('x', 'y', size=20, source=source)
show(p)
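The same mechanism works for any property that should vary per point. As a hedged sketch, the source below gains a `color` column (the color values are made up for illustration), and its name is passed as `fill_color`. The sketch uses `p.scatter` rather than `p.circle`, and omits `show(p)` so that it also runs outside a notebook; both are choices of this example.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

source = ColumnDataSource(data={
    'x': [1, 2, 3, 4, 5],
    'y': [3, 7, 8, 5, 1],
    # illustrative per-point colors, one entry per row
    'color': ['navy', 'olive', 'firebrick', 'orange', 'purple'],
})

p = figure()

# 'x', 'y' and 'color' are column names looked up in the source
renderer = p.scatter('x', 'y', size=20, fill_color='color', source=source)

# the renderer keeps a reference to the source it was given
print(renderer.data_source is source)
```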

<h3>
<font color='blue'>
Exercise
</font>
</h3>
Create a column data source with NumPy arrays as column values and plot it

In [None]:
import numpy as np


### Creating with Pandas DataFrames

It's also possible to create `ColumnDataSource` objects directly from Pandas data frames. To do this, just pass the data frame to `ColumnDataSource` when you create it:

In [None]:
from bokeh.sampledata.iris import flowers as df

# show first 5 rows of example df
display(df.head())

# create ColumnDataSource
source = ColumnDataSource(df)

Now we can use it as we did above by passing the column names to glyph methods:

In [None]:
p = figure(plot_width=400, plot_height=400)
p.circle('petal_length', 'petal_width', source=source)
show(p)
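One detail worth knowing is that the data frame's index is carried over as a column too. The sketch below uses a small hypothetical data frame (standing in for the iris data, so that no sample data download is needed) to show the resulting column names; for an unnamed index the extra column is called `index`.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# small hypothetical frame standing in for the iris data
df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
})

source = ColumnDataSource(df)

# the data frame columns plus an 'index' column taken from the unnamed index
print(source.column_names)
```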

<h3>
<font color='blue'>
Exercise
</font>
</h3>
Create a column data source with the autompg sample data frame and plot it

In [None]:
from bokeh.sampledata.autompg import autompg_clean as df


### Automatic Conversion

If you do not need to share data sources, it may be convenient to pass dicts, Pandas `DataFrame` or `GroupBy` objects directly to glyph methods, without explicitly creating a `ColumnDataSource`. In this case, a `ColumnDataSource` will be created automatically.

In [None]:
from bokeh.sampledata.iris import flowers as df

p = figure(plot_width=400, plot_height=400)
p.circle('petal_length', 'petal_width', source=df)
show(p)
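The automatically created source is still reachable through the renderer that the glyph method returns. The following sketch, with made-up values, passes a plain dict and then reuses the resulting source for a second glyph, which is one way to share data after the fact. The variable names are illustrative.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

data = {'x': [1, 2, 3], 'y': [4, 6, 5]}  # illustrative values

p = figure()
renderer = p.scatter('x', 'y', source=data)

# the dict was wrapped in a ColumnDataSource behind the scenes
auto_source = renderer.data_source
print(type(auto_source).__name__)

# passing that source to another glyph shares the data between both
line = p.line('x', 'y', source=auto_source)
```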

## Transformations

In addition to being configured with names of columns from data sources, glyph properties may also be configured with transform objects that represent transformations of columns. These live in the `bokeh.transform` module. It is important to note that when using these objects, the transformations occur *in the browser, not in Python*.

The first transform we look at is the `cumsum` transform, which can generate a new sequence of values from a data source column by cumulatively summing the values in the column. This can be useful for pie or donut type charts as seen below.

In the code below we first create a data frame containing several countries and their corresponding (meaningless) values. Next, we add a column that defines a color for every row, or country, in the data frame. Color palettes (like `Category20c`, used below) can be found at https://docs.bokeh.org/en/latest/docs/reference/palettes.html. We also add an `angle` column to the data frame, in which the value of every country is normalised to an angle (in radians).

Next, we create the figure. Here, we use the `wedge()` method to draw wedges based on the angles we calculated earlier. First, we define the *x* and *y* coordinates of the points of the wedges and choose the radius of the wedges. We then use `cumsum` to define `start_angle` and `end_angle` (the angles at which the wedges start and end, as measured from the horizontal). `cumsum` is a function from the `bokeh.transform` module that cumulatively sums the "angle" column of a data source, in this case our data frame. For `start_angle`, `include_zero` is set to `True`, so the sums begin at zero and every wedge starts where the previous one ends. Lastly, we add some styling elements and use the `source` argument to tell the method what data it should look at.

To summarize, we've turned the values of the data frame into ratios which we use to draw a wedge for every entry in our data frame. 

In [None]:
from math import pi
import pandas as pd
from bokeh.palettes import Category20c # palette for color mapping
from bokeh.transform import cumsum

# data
x = { 'United States': 157, 'United Kingdom': 93, 'Japan': 89, 'China': 63,
      'Germany': 44, 'India': 42, 'Italy': 40, 'Australia': 35, 'Brazil': 32,
      'France': 31, 'Taiwan': 31, 'Spain': 29 }

# create df
data = pd.Series(x).reset_index(name='value').rename(columns={'index':'country'})

# add column with a color of the Category20c palette for every row 
data['color'] = Category20c[len(x)]

# represent each value as an angle = value / total * 2pi
data['angle'] = data['value']/data['value'].sum() * 2*pi

# create figure
p = figure(plot_height=350, title="Pie Chart", toolbar_location=None,
           tools="hover", tooltips="@country: @value")

p.wedge(x=0, y=1, radius=0.4, 
        
        # use cumsum to cumulatively sum the values for start and end angles
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='country', source=data)

# don't show the axes or a grid
p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)
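Because the `cumsum` transform runs in the browser, its result never shows up in Python. To make the arithmetic visible, the sketch below reproduces it with NumPy for the first four values of the data above. This is only an illustration of the calculation, not Bokeh's own code.

```python
from math import pi
import numpy as np

values = np.array([157, 93, 89, 63])  # first four values from the data above
angle = values / values.sum() * 2 * pi

# plain cumulative sum, as in cumsum('angle'): [a0, a0+a1, ...]
end_angle = np.cumsum(angle)

# shifted sum, as in cumsum('angle', include_zero=True): [0, a0, a0+a1, ...]
start_angle = end_angle - angle

for start, end in zip(start_angle, end_angle):
    print(f"{start:.3f} -> {end:.3f}")
```

Each wedge starts where the previous one ends, and the last wedge ends at 2π, so the wedges together form a full circle.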

The next transform we look at is the `linear_cmap` transform, which can generate a new sequence of colors by applying a linear colormapping to a data source column. 

We again create a data set. In this case, we create a dictionary with three keys (`x`, `y` and `r`), each holding an array of `N` random numbers. The `x` and `y` arrays contain random numbers between 0 and 100, and `r` contains random numbers between 0 and 1.5.

Next, we create a figure in which we draw circles with the `circle` method. The location of each circle is defined by `x` and `y`, and its radius by `r`. `color` is set with the `linear_cmap` transform, which maps a range of numerical values linearly onto the colors of a palette, from low to high. In this case, we apply the linear color mapper to `x`, and the colors are taken from the `Viridis256` color palette (from [bokeh.palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)). The final two arguments give the minimum and maximum value of the range to map onto the palette. Check out the [documentation](https://docs.bokeh.org/en/latest/docs/reference/transform.html) to find out more about this and other color mapping transformations.

Looking at the final figure, we can indeed see circles at random locations in the specified domain, with varying radius and color. The color is based on the position of the circle on the x-axis.


In [None]:
import numpy as np
from bokeh.transform import linear_cmap

# data
N = 4000
data = dict(x=np.random.random(size=N) * 100,
            y=np.random.random(size=N) * 100,
            r=np.random.random(size=N) * 1.5)

# create figure
p = figure()
p.circle('x', 'y', radius='r', source=data, fill_alpha=0.6,
        
         # color map based on the x-coordinate
         color=linear_cmap('x', 'Viridis256', 0, 100))

show(p) 
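`linear_cmap` is a shorthand for pairing a column name with a `LinearColorMapper` model. As a hedged sketch, the mapper can also be created explicitly, which makes it possible to reuse it elsewhere, for example in a `ColorBar`. The color bar, the smaller `N`, and the use of `p.scatter` are additions of this example, not part of the code above.

```python
import numpy as np
from bokeh.models import ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.plotting import figure

N = 100  # smaller than above, to keep the sketch light
data = dict(x=np.random.random(size=N) * 100,
            y=np.random.random(size=N) * 100)

# explicit mapper: values in [0, 100] are spread over the palette
mapper = LinearColorMapper(palette=Viridis256, low=0, high=100)

p = figure()
p.scatter('x', 'y', source=data, fill_alpha=0.6,
          # pair the 'x' column with the mapper
          color={'field': 'x', 'transform': mapper})

# reuse the same mapper so the color bar matches the glyph colors
color_bar = ColorBar(color_mapper=mapper)
p.add_layout(color_bar, 'right')
```

Calling `show(p)` afterwards displays the plot with the color bar on the right.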

Change the code above to use `log_cmap` and observe the results. Try changing `low` and `high` and specifying `low_color` and `high_color`.

<h3>
<font color='blue'>
Exercise
</font>
</h3>
Use the corresponding `factor_cmap` transform to color a scatter plot of the iris data set

In [None]:
from bokeh.sampledata.iris import flowers
