<table style="float:left; border:none">
   <tr style="border:none">
       <td style="border:none">
           <a href="https://bokeh.org/" target="_blank">
           <img
               src="assets/bokeh-transparent.png"
               style="width:50px"
           >
           </a>
       </td>
       <td style="border:none">
           <h1>Bokeh Tutorial</h1>
       </td>
   </tr>
</table>

<div style="float:right;"><a href="TOC.ipynb" target="_blank">Table of contents</a><br><h2>06 Data sources</h2></div>

In [None]:
# load tutorial data
from tutorial_data import data

In [None]:
# activate notebook output
from bokeh.io import output_notebook

output_notebook()

This chapter focuses on how Bokeh handles data. The concepts introduced here are
fundamental to Bokeh, and you will use them throughout the rest of the tutorial.

In the previous examples, you have used standard Python lists or pandas DataFrames as
inputs for your data.

Behind the scenes, Bokeh converts all these inputs to a Bokeh **ColumnDataSource**.
This is **Bokeh's primary internal data structure**. It is used in almost all plots
(with the exception of [map plots](#Map-plots)).

In most cases, Bokeh creates and manages the ColumnDataSource for you. However,
there are many cases where it is useful to create and use a ColumnDataSource
directly. Several of Bokeh's more advanced features rely on using
a ColumnDataSource, including hover tooltips, automatically placed labels,
computed transforms, and custom interactions.

### Creating a ColumnDataSource from a dictionary

The first step to creating a `ColumnDataSource` is to import it from `bokeh.models`:

In [None]:
from bokeh.models import ColumnDataSource

A ColumnDataSource works similarly to a table or a pandas DataFrame. It is a mapping of
column names to sequences of values.

You can create a ColumnDataSource from Python dictionaries. The keys of the dictionary
are the column names, and the values of the dictionary are the sequences of values:

In [None]:
source = ColumnDataSource(
    data={
        "x": [1, 2, 3, 4, 5],  # first key creates a column named "x"
        "y": [3, 7, 8, 5, 1],  # second key creates a column named "y"
    }
)

To access the contents of any column, use the `data` property of a `ColumnDataSource`:

In [None]:
source.data["x"]
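Besides `data`, a ColumnDataSource has a few other members that help with inspection. The following sketch assumes a recent Bokeh release and uses the `column_names` property and the `to_df()` method. The names `example_source` and `example_df` are only used for illustration:

```python
from bokeh.models import ColumnDataSource

example_source = ColumnDataSource(data={"x": [1, 2, 3, 4, 5], "y": [3, 7, 8, 5, 1]})

# list the names of all columns in the ColumnDataSource
print(example_source.column_names)

# convert the ColumnDataSource to a pandas DataFrame for a tabular view
example_df = example_source.to_df()
print(example_df)
```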

The data you provide here is not limited to lists. You can also use NumPy arrays or 
pandas Series:

In [None]:
import numpy as np

# load a pandas Series from the demo data set
monthly_passengers_series = data.get_monthly_values()["passengers"]

# create a NumPy array of the same length as the pandas Series
range_array = np.arange(len(monthly_passengers_series))

# create a ColumnDataSource from pandas series and NumPy array
source = ColumnDataSource(
    data={
        "x": monthly_passengers_series,  # first column uses a pandas Series
        "y": range_array,  # second column uses a NumPy array
    }
)

print(f"pandas Series: \n {source.data['x']}")
print(f"NumPy Array: \n {source.data['y']}")

**All the columns in a ColumnDataSource must always be the SAME length**. This is why
the NumPy array in the example above uses `len(monthly_passengers_series)`. This way,
the NumPy array is the same length as the pandas Series used for the other column.

The following code cell will show a warning or an error (depending on your Bokeh
version) because the two columns have different lengths. Adjust one of the lists to
create a valid ColumnDataSource:

In [None]:
# 🔁 Adjust one of the lists so that both lists have the same number of elements
source = ColumnDataSource(
    data={
        "x": [1, 2, 3, 4],  # first list contains 4 elements
        "y": [3, 7, 8, 5, 1],  # ⚠ second list contains 5 elements - the lengths do not match
    }
)
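A quick way to find out which column breaks the same-length rule is to compare the lengths yourself before you create the ColumnDataSource. The following is a minimal sketch in plain Python. The helpers `column_lengths` and `have_equal_lengths` are hypothetical names used for illustration and are not part of Bokeh:

```python
def column_lengths(columns):
    """Return a dict that maps each column name to the length of its sequence."""
    return {name: len(values) for name, values in columns.items()}


def have_equal_lengths(columns):
    """Return True if all columns contain the same number of elements."""
    return len(set(column_lengths(columns).values())) <= 1


mismatched_columns = {
    "x": [1, 2, 3, 4],  # 4 elements
    "y": [3, 7, 8, 5, 1],  # 5 elements
}

print(column_lengths(mismatched_columns))  # {'x': 4, 'y': 5}
print(have_equal_lengths(mismatched_columns))  # False
```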

In the examples so far, you have used a Python list or pandas Series for the `x` and `y`
values of functions like `p.scatter`. This means that Bokeh has created the
ColumnDataSource for you automatically.

However, instead of passing individual sequences of values to a renderer, you can also
use a ColumnDataSource directly. To use a ColumnDataSource directly, you do two things
differently:

1. You pass the ColumnDataSource as the `source` argument to a glyph method.
2. To use values from the column of a ColumnDataSource, you pass the **name** of that
    column as the value for the property. For example, instead of passing `x=[1, 2, 3]`,
    you pass `x="x_values"`.

In the following code cell, you first create a ColumnDataSource from two lists in a
dictionary. Then you use those two columns as the `x` and `y` values for a scatter
glyph:

In [None]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# create dict as basis for ColumnDataSource
data_dict = {"x_values": [1, 2, 3, 4, 5], "y_values": [6, 7, 2, 3, 6]}

# create ColumnDataSource based on the dict
source = ColumnDataSource(data=data_dict)

# create a plot and renderer with ColumnDataSource data
p = figure(height=300)
p.scatter(
    x="x_values",  # use the sequence in the "x_values" column
    y="y_values",  # use the sequence in the "y_values" column
    source=source,  # use the ColumnDataSource as the data source
)

show(p)
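Using a ColumnDataSource directly also makes the other columns of the source available to features such as hover tooltips. The following sketch assumes an extra, made-up column called `labels`. In a tooltip, `@labels` refers to the value in that column for the point under the cursor:

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# ColumnDataSource with an additional column that is not used for positioning
labeled_source = ColumnDataSource(
    data={
        "x_values": [1, 2, 3, 4, 5],
        "y_values": [6, 7, 2, 3, 6],
        "labels": ["A", "B", "C", "D", "E"],  # made-up labels for illustration
    }
)

# "@labels" and "@y_values" look up values from the columns of the data source
p_hover = figure(height=300, tooltips=[("label", "@labels"), ("y", "@y_values")])
p_hover.scatter(x="x_values", y="y_values", size=12, source=labeled_source)
```

Call `show(p_hover)` to display the plot, then hover over a point to see its label.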

### Creating a ColumnDataSource from a DataFrame

There are many similarities between a ColumnDataSource and a pandas DataFrame.
This is why it is simple to create a `ColumnDataSource` object directly from a
DataFrame.

Let's use the monthly passenger, freight, and mail data from the demo dataset again:

In [None]:
monthly_values_df = data.get_monthly_values()
monthly_values_df.head(5)

To create a ColumnDataSource from a DataFrame, pass the DataFrame when creating the
`ColumnDataSource` object:

In [None]:
source = ColumnDataSource(monthly_values_df)

You now have a ColumnDataSource with the same columns as the DataFrame:
- a series of values in a column called `"passengers"`
- a series of values in a column called `"freight"`
- a series of values in a column called `"mail"`
- a series of strings in a column called `"month_name"`
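
When Bokeh converts a DataFrame, it also stores the DataFrame's index as a column. The following sketch uses a small hand-made DataFrame (the values are made up for illustration and are not the tutorial data) to show which columns end up in the ColumnDataSource:

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# small example DataFrame with made-up values
example_df = pd.DataFrame(
    {
        "month_name": ["January", "February", "March"],
        "freight": [10, 12, 9],
    }
)

example_source = ColumnDataSource(example_df)

# the unnamed default index is stored in an additional column called "index"
print(example_source.column_names)
```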

You can use the ColumnDataSource in the same way as before:

In [None]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# create ColumnDataSource based on DataFrame from the demo data set
source = ColumnDataSource(monthly_values_df)

# set up the figure
p = figure(
    height=300,
    x_range=source.data["month_name"],  # use the sequence of strings from the "month_name" column as categories
)

# create a bar renderer with data from the "freight" column
p.vbar(
    x="month_name",  # use the sequence of strings from the "month_name" column as categories
    top="freight",  # use the sequence of values from the "freight" column as values
    width=0.9,
    source=source,
)

# create a second bar renderer with data from a different column
p.vbar(
    x="month_name",  # use the sequence of strings from the "month_name" column as categories
    top="passengers",  # use the sequence of values from the "passengers" column as values
    # top="mail",       # 🔁 use this line instead of the one above to use data from the "mail" column
    width=0.9,
    color="tomato",
    source=source,
)

show(p)

For more information about the ColumnDataSource, see [Data sources](https://docs.bokeh.org/en/latest/docs/user_guide/basic/data.html)
in the user guide.

### ColumnDataSource transforms

Using ColumnDataSource objects also allows you to use Bokeh's built-in transforms.
Transforms are useful for performing computations on the data before the data is
displayed.
These transforms are performed by BokehJS, in the browser.
This means the underlying data is not modified and is always available for other plots
in the same document.

All transforms need to be imported from `bokeh.transform`.

#### The cumsum transform

The `cumsum` transform generates a new sequence of values from a ColumnDataSource
column. This new sequence **cumulatively sums the values in the original column**.

For example:

In [None]:
from bokeh.transform import cumsum

# create ColumnDataSource based on DataFrame from the demo data set
source = ColumnDataSource(monthly_values_df)

# set up the figure
p = figure(
    height=300,
    x_range=source.data["month_name"],
)

# create a bar chart with cumulative data from the "passengers" column
p.vbar(
    x="month_name",
    top=cumsum("passengers", include_zero=True),  # use the cumulative sums of the "passengers" column as values
    width=0.9,
    source=source,
)


show(p)
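
To see what values `cumsum` produces, it can help to reproduce the calculation in plain Python. The following sketch only illustrates the arithmetic and is not Bokeh's implementation (the actual transform runs in BokehJS). The helper `cumulative_sums` is a hypothetical name. With `include_zero=True`, the sequence starts at zero and each entry is the sum of all *previous* values, so the grand total is not part of the result:

```python
from itertools import accumulate


def cumulative_sums(values, include_zero=False):
    """Return running totals of values, optionally shifted to start at zero."""
    totals = list(accumulate(values))
    if include_zero:
        # prepend 0 and drop the final total so the length stays the same
        return ([0] + totals)[: len(totals)]
    return totals


print(cumulative_sums([1, 2, 3]))  # [1, 3, 6]
print(cumulative_sums([1, 2, 3], include_zero=True))  # [0, 1, 3]
```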

#### The linear_cmap transform

The `linear_cmap` transform generates a new sequence of colors by **applying a linear
color map** to a ColumnDataSource column.

See [05 Styling plots](05_styling.ipynb#Color-mappers-and-palettes) for more information
about color mappers.

In [None]:
import numpy as np
from bokeh.models import ColumnDataSource
from bokeh.transform import linear_cmap

# create a ColumnDataSource with random values for x, y and radius
N = 2000
source = ColumnDataSource(
    data=dict(
        x=np.random.random(size=N) * 100,
        y=np.random.random(size=N) * 100,
        r=np.random.random(size=N) * 1.5,
    )
)

p = figure(height=300)

p.scatter(
    x="x",
    y="y",
    radius="r",
    source=source,
    fill_alpha=0.6,
    color=linear_cmap("x", "Viridis256", 0, 100),  # use a color map based on the "x" column using the Viridis256 palette
)

show(p)
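
Conceptually, a linear color map divides the range between `low` and `high` into equally sized intervals, one per palette color, and assigns each value the color of the interval it falls into. The following plain-Python sketch illustrates that idea with a made-up four-color palette. The helper `map_to_color` is hypothetical, and the actual BokehJS implementation may differ in details such as the handling of values outside the range:

```python
def map_to_color(value, palette, low, high):
    """Return the palette color for value, clamping values outside [low, high]."""
    # position of the value within the range, as a fraction between 0 and 1
    fraction = (value - low) / (high - low)
    # convert the fraction to a palette index and keep it within bounds
    index = int(fraction * len(palette))
    index = max(0, min(index, len(palette) - 1))
    return palette[index]


# made-up, Viridis-like palette for illustration
example_palette = ["#440154", "#31688e", "#35b779", "#fde725"]

for value in [0, 30, 60, 100]:
    print(value, map_to_color(value, example_palette, low=0, high=100))
```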

To learn more about color mapping with transforms, see
[Client-side color mapping](https://docs.bokeh.org/en/latest/docs/user_guide/basic/data.html#client-side-color-mapping)
in the user guide.

Bokeh contains several other transforms. See
[Transforming data](https://docs.bokeh.org/en/latest/docs/user_guide/basic/data.html#transforming-data)
in the user guide for more examples. The
[entry for bokeh.transform](https://docs.bokeh.org/en/latest/docs/reference/transform.html)
in the reference guide also contains a list of all available transforms.

# Next section

<a href="07_annotations.ipynb" target="_blank">
    <img src="assets/arrow.svg" alt="Next section" width="100" align="right">
</a>

In the [next chapter](07_annotations.ipynb), you'll learn how to use annotations to
provide additional information in your plots.