In [1]:
import pandas as pd

Let's start by making a really simple dataset. Imagine we have some farms, which grow apples and bananas to sell to a few different customers. We can describe the *flow* of fruit from the farms (the *source* of the flow) to the customers (the *target* of the flow):

In [14]:
flows = pd.DataFrame([
    ('farm1', 'Mary', 'apples', 5),
    ('farm1', 'James', 'apples', 3),
    ('farm2', 'Fred', 'apples', 10),
    ('farm2', 'Mary', 'bananas', 10),
    ('farm2', 'Susan', 'bananas', 5),
    ('farm3', 'Susan', 'apples', 10),
    ('farm4', 'Susan', 'bananas', 1),
    ('farm5', 'Susan', 'bananas', 1),
    ('farm6', 'Susan', 'bananas', 1),
], columns=['source', 'target', 'type', 'value'])
flows

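| | source | target | type | value |
|---|---|---|---|---|
| 0 | farm1 | Mary | apples | 5 |
| 1 | farm1 | James | apples | 3 |
| 2 | farm2 | Fred | apples | 10 |
| 3 | farm2 | Mary | bananas | 10 |
| 4 | farm2 | Susan | bananas | 5 |
| 5 | farm3 | Susan | apples | 10 |
| 6 | farm4 | Susan | bananas | 1 |
| 7 | farm5 | Susan | bananas | 1 |
| 8 | farm6 | Susan | bananas | 1 |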


Drawn directly as a Sankey diagram, this data would look something like this:

In [16]:
from ipysankeywidget import SankeyWidget
SankeyWidget(nodes=[{'id': k} for k in list(flows.source.unique()) + list(flows.target.unique())],
             links=flows.to_dict('records'))

But you don't always want a direct correspondence between the flows in your data and the links that you see in the Sankey diagram. For example:
- Farms 4, 5 and 6 are all pretty small, and to make the diagram clearer we might want to group them in an "other" category.
- The flows of apples are mixed in with the flows of bananas -- we might want to group the kinds of fruit together to make them easier to compare.
- We might want to group farms or customers based on some other attributes -- to see differences between genders, locations, or organic/non-organic farms, say.
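
To make the first idea concrete, here is a minimal sketch of what "grouping into an other category" means for the numbers themselves. It is written in plain Python rather than with `sankeyview`, so it only illustrates the arithmetic; the names `rows`, `small_farms` and `grouped` are made up for this example.

```python
from collections import defaultdict

# Same rows as the `flows` table above: (source, target, type, value).
rows = [
    ('farm1', 'Mary', 'apples', 5),
    ('farm1', 'James', 'apples', 3),
    ('farm2', 'Fred', 'apples', 10),
    ('farm2', 'Mary', 'bananas', 10),
    ('farm2', 'Susan', 'bananas', 5),
    ('farm3', 'Susan', 'apples', 10),
    ('farm4', 'Susan', 'bananas', 1),
    ('farm5', 'Susan', 'bananas', 1),
    ('farm6', 'Susan', 'bananas', 1),
]

# Illustrative choice: the farms we want to merge into one node.
small_farms = {'farm4', 'farm5', 'farm6'}

# Relabel the small farms as 'other', then add up flows that now share
# the same (source, target, type) combination.
grouped = defaultdict(int)
for source, target, kind, value in rows:
    label = 'other' if source in small_farms else source
    grouped[(label, target, kind)] += value

for key, total in sorted(grouped.items()):
    print(key, total)
```

The three one-banana flows from farms 4, 5 and 6 to Susan collapse into a single flow of 3 from "other", while the flows from farms 1 to 3 are unchanged.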

This introduction shows how to use `sankeyview` to do some of these for this simple example, in the simplest possible way. Later tutorials will show how to use it on real data, and more efficient ways to do the same things.

Let's start with the first example: grouping farms 4, 5 and 6 into an "other" category. `sankeyview` works by setting up a "Sankey diagram definition" which describes the structure of the diagram we want to see. In this case, we need to set up some groups:

In [18]:
from sankeyview import *
nodes = {
    'farms': ProcessGroup(['farm1', 'farm2', 'farm3', 'farm4', 'farm5', 'farm6']),
    'customers': ProcessGroup(['James', 'Mary', 'Fred', 'Susan']),
}

We need to describe roughly how these groups should be placed in the final diagram by defining an "ordering" -- a list of vertical slices, each containing a list of node ids:

In [19]:
ordering = [
    ['farms'],
    ['customers'],
]
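
Since the ordering refers to groups by id, a quick sanity check is to confirm that every id in it is a key of `nodes`. The sketch below is plain Python and not part of `sankeyview`: `unknown_ids` is a hypothetical helper, and plain lists stand in for `ProcessGroup` so that it runs on its own.

```python
# Stand-ins for the definitions above; plain lists replace ProcessGroup
# so the check runs without sankeyview installed.
nodes = {
    'farms': ['farm1', 'farm2', 'farm3', 'farm4', 'farm5', 'farm6'],
    'customers': ['James', 'Mary', 'Fred', 'Susan'],
}
ordering = [
    ['farms'],
    ['customers'],
]


def unknown_ids(ordering, nodes):
    """Return ids used in `ordering` that are not keys of `nodes` (hypothetical helper)."""
    return [node_id
            for layer in ordering
            for node_id in layer
            if node_id not in nodes]


# An empty list means every id in the ordering has a matching group.
print(unknown_ids(ordering, nodes))
```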

And we also need to say which connections should appear in the diagram (sometimes you don't want to actually see all the connections). This is called a "bundle" because it bundles up multiple flows -- in this case all of them.

In [20]:
bundles = [
    Bundle('farms', 'customers'),
]
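
Conceptually, a bundle from `farms` to `customers` covers every flow whose source is in the first group and whose target is in the second. The following plain-Python sketch illustrates that idea only; it is not how `sankeyview` is implemented, and the names `rows`, `farms`, `customers` and `selected` are assumptions made for the example.

```python
# Same rows as the `flows` table above: (source, target, type, value).
rows = [
    ('farm1', 'Mary', 'apples', 5),
    ('farm1', 'James', 'apples', 3),
    ('farm2', 'Fred', 'apples', 10),
    ('farm2', 'Mary', 'bananas', 10),
    ('farm2', 'Susan', 'bananas', 5),
    ('farm3', 'Susan', 'apples', 10),
    ('farm4', 'Susan', 'bananas', 1),
    ('farm5', 'Susan', 'bananas', 1),
    ('farm6', 'Susan', 'bananas', 1),
]

# Members of the two groups defined in `nodes`.
farms = {'farm1', 'farm2', 'farm3', 'farm4', 'farm5', 'farm6'}
customers = {'James', 'Mary', 'Fred', 'Susan'}

# Keep the flows that start in `farms` and end in `customers`.
selected = [row for row in rows if row[0] in farms and row[1] in customers]

print(len(selected), 'of', len(rows), 'flows selected')
print('total value:', sum(row[3] for row in selected))
```

Here every flow qualifies, which is why this single bundle accounts for the whole dataset.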

Putting that together into a Sankey diagram definition (SDD) and applying it to the data gives this result:

In [23]:
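from sankeyview.jupyter import show_sankey

# Combine the groups, bundles and ordering into a Sankey diagram definition
sdd = SankeyDefinition(nodes, bundles, ordering)
# Wrap the flows table so it can be used with the definition
dataset = Dataset(flows)
show_sankey(sdd, dataset)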

That's not very useful. What's happened? Every farm and every customer has been lumped together into one group. To get the picture we want -- like the first one, but with an "other" group containing farms 4, 5 and 6 --