#Understanding Bostock’s Les Miserables Co-occurrence chart in Bokeh


-- *Notes by Philip Dürholt and Silvia Gutiérrez*

In this notebook we walk through the data you can feed in and the [code you need](http://bokeh.pydata.org/en/latest/docs/tutorials/solutions/gallery/les_mis.html) to get a very nice co-occurrence chart using Python's visualization library "[Bokeh](http://bokeh.pydata.org/en/latest/)". We commented everything step by step to better understand how it works, and we hope our findings will be as useful to you as they were to us.

In [2]:
#Import statements
from bokeh.sampledata.les_mis import data
import numpy as np
from bokeh.plotting import figure, output_file, show
from bokeh.models import HoverTool, ColumnDataSource

#Data structure
#print(data)
print(len(data['nodes']))
print(len(data['links']))

77
254


##The DATA: JSON

The `sampledata.les_mis` dataset used in this tutorial is written in JSON and has two keys, **nodes** and **links**; each one of them holds one array (square brackets hold arrays).

* *Nodes* contains multiple objects [curly braces hold objects] with two attribute-value pairs, name and group, and they look like this:
>`'nodes': [{'group': 1, 'name': 'Myriel'}, {'group': 1, 'name': 'Napoleon'}]`

* *Links* has three attribute-value pairs: target, source and value, and this is what they look like:

>`'links': [{'value': 1, 'target': 0, 'source': 1}, {'value': 8, 'target': 0, 'source': 2}]`
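
To make the link between the two arrays concrete, here is a minimal sketch that uses a tiny hand-written excerpt instead of the full sample data (the variable `sample` and its contents are illustrative only). It shows that `source` and `target` are simply positions in the `nodes` list:

```python
# A tiny hand-written excerpt with the same shape as the Les Mis data
# (illustrative only; the real dataset has 77 nodes and 254 links).
sample = {
    'nodes': [{'group': 1, 'name': 'Myriel'}, {'group': 1, 'name': 'Napoleon'}],
    'links': [{'value': 1, 'target': 0, 'source': 1}],
}

# 'source' and 'target' are positions (indices) in the 'nodes' list.
link = sample['links'][0]
source_name = sample['nodes'][link['source']]['name']
target_name = sample['nodes'][link['target']]['name']
print(source_name, 'co-occurs with', target_name, link['value'], 'time(s)')
```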

In [3]:

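#print(names)
print(data['links'][54])
# Keep the list of node dicts; later cells use it as `nodes`
nodes = data['nodes']
# `names` is needed later for the axis ranges. Sorting by group is our choice here,
# so that characters of the same group sit next to each other on the axes.
names = [node['name'] for node in sorted(nodes, key=lambda node: node['group'])]
groups = [node['group'] for node in nodes]
#max(groups)
print(len(nodes))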

{'source': 26, 'target': 11, 'value': 31}
77


#Setting data for the plot

##Co-occurrence counts

To store the counts, we create an empty NumPy array whose shape is the tuple `(N, N)`, where `N` is the number of nodes.

Sounds harsh, huh?

Don't worry, it is not so difficult. After taking a look at the code, you can check our notes in the next cell, which explain everything step by step.

In [4]:
N = len(nodes)
print('This is the length of nodes:\n', N,'\n')
counts = np.empty((N, N))
#cooclist = []
for link in data['links']:
    #cooclist.append(link['value'])
    #print(link) 
    #print(counts[link['source'], link['target']])
    #print('Target - Source', [link['target'], link['source']])
    counts[link['source'], link['target']] = link['value']
    counts[link['target'], link['source']] = link['value']

#print('This is the sorted values of the co-occurences would look like:\n',sorted(cooclist), '\n')
print('This is what is stored in counts:\n',counts[:1],'...' '\n', len(counts) * len(counts[0]))

This is the length of nodes:
 77 

This is what is stored in counts:
 [[  0.00000000e+000   1.00000000e+000   8.00000000e+000   1.00000000e+001
    1.00000000e+000   1.00000000e+000   1.00000000e+000   1.00000000e+000
    2.00000000e+000   1.00000000e+000  -8.85480012e-284   5.00000000e+000
    6.93252930e-310   3.01911373e-156   0.00000000e+000   0.00000000e+000
    3.91928793e+208   6.93256290e-310   6.93252930e-310   2.12199579e-314
    0.00000000e+000   0.00000000e+000   2.12199579e-314   0.00000000e+000
    0.00000000e+000   2.12199579e-314   0.00000000e+000   0.00000000e+000
    2.12199579e-314   2.46010107e-319   6.93256346e-310   6.93256346e-310
    6.93256397e-310   2.47032823e-323   0.00000000e+000   0.00000000e+000
    0.00000000e+000   4.39265575e-167   0.00000000e+000   0.00000000e+000
   -8.85480012e-284   0.00000000e+000   0.00000000e+000   3.01911373e-156
    0.00000000e+000   0.00000000e+000   3.91928793e+208   0.00000000e+000
    0.00000000e+000   6.36598737e-314   0.


###np.empty()

The first thing you gotta do is understand **np.empty**. You can learn everything you need by reading [the documentation](http://docs.scipy.org/doc/numpy/reference/generated/numpy.empty.html), but we'll approach the matter with some examples:

So `np.empty((5,2))`, for instance, generates a NumPy array with the "shape" `(5,2)`. You can picture this shape as a table in which the first number is the number of rows and the second the number of columns, so `(5,2)` is a table with 5 rows and 2 columns. That was not so hard, was it? In the code above, `np.empty((N, N))` creates a two-dimensional array (a table with `N` rows and `N` columns). `np.empty` does not initialize its entries, so right after creation **`counts`** holds arbitrary leftover values; when we run the loop, it overwrites those values at the 'coordinates' (source, target) and (target, source).
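
A quick way to check the rows-and-columns picture yourself (a small sketch; the name `table` is just for illustration):

```python
import numpy as np

table = np.empty((5, 2))   # 5 rows, 2 columns; the contents are arbitrary
print(table.shape)         # the (rows, columns) tuple we asked for
print(table.size)          # total number of cells: rows * columns
```

We only print the shape and size here, because the values themselves can differ from run to run.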

To do this we go through every link in the links dataset (these guys --> `{'target': 0, 'value': 1, 'source': 1}`) and store its co-occurrence value at the position given by its source-target combination.

For example, in `counts[1, 0]` the 1 and the 0 are the "indices", or positions, that our source and target characters would have if `nodes` were enumerated. We do this "twice" (`[link['source'], link['target']]` ANNND `[link['target'], link['source']]`) in order to get the "square map" we need for our viz (cf. the diagonal symmetry of the chart). If we stored only the source -> target links or only the target -> source links, but not both, we would get only half the chart.

If you want a square(-map) you need (1,1), (1,2), (1,3), (2,2), (2,3), (3,3) AND (2,1), (3,1), (3,2) [which are the 'reverse combinations' of the first group]
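
Here is a minimal sketch of that "write it twice" idea on a made-up dataset with 3 characters and 2 links (the values are hypothetical, not taken from the Les Mis data). We start from zeros in this sketch only so that the untouched cells are easy to read:

```python
import numpy as np

# Hypothetical mini dataset: 3 characters, 2 links (not from the real data).
links = [
    {'source': 1, 'target': 0, 'value': 5},
    {'source': 2, 'target': 1, 'value': 2},
]
n = 3
counts_demo = np.zeros((n, n))  # zeros here so the untouched cells are easy to read

for link in links:
    counts_demo[link['source'], link['target']] = link['value']
    counts_demo[link['target'], link['source']] = link['value']  # mirror entry

print(counts_demo)
# A matrix that equals its own transpose is symmetric about the diagonal.
print(np.array_equal(counts_demo, counts_demo.T))
```

Comment out the "mirror entry" line and the second print turns to `False`: that is the "half chart" case.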

BUT, not all 'coordinates' can be found in data['links']! 

Remember we had 77 nodes and 254 links? A full table with one cell for every pair of nodes has 5929 (77\*\*2) cells, and our 254 links, stored in both directions, fill at most 508 of them!
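
The arithmetic behind this, as a small sketch (the "cells written" figure is an upper bound that assumes no link appears twice):

```python
n_nodes = 77
n_links = 254

total_cells = n_nodes ** 2      # every (row, column) pair, diagonal included
cells_written = 2 * n_links     # each link fills (source, target) and (target, source)
never_written = total_cells - cells_written
print(total_cells, cells_written, never_written)
```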

If you print **`counts`** you might notice some "funny numbers" like -1.09611685e-080. These 'funny numbers' are the arbitrary values left over from the initial `counts` array that was generated with `np.empty`.

Bear this in mind when you look at the rectangles of character pairs with no co-occurrence: they nonetheless *have* a count value, so their alpha values can differ.
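
If you would rather not have leftover values at all, one possible alternative (our suggestion, not what the code above does) is `np.zeros`, which takes the same shape argument but starts every cell at 0.0. A minimal sketch, with `counts_zero` as an illustrative name:

```python
import numpy as np

N = 77  # number of nodes, as printed earlier
counts_zero = np.zeros((N, N))   # same shape as before, but every cell starts at 0.0

# Write one link in both directions, as in the loop above
# (source 1, target 0, value 1: the first link shown earlier).
counts_zero[1, 0] = 1
counts_zero[0, 1] = 1

# Only the two cells we wrote are non-zero; everything else is a clean 0.0.
print(np.count_nonzero(counts_zero))
print(counts_zero.sum())
```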

Alpha what? 

Check the next part and you'll understand what we're talking about.

##Alphas



In the following code the co-occurrence count for a pair of names is given by `counts[i,j]`. The strategy is to color each rect by the group, and set its alpha based on the count.



To understand how we set our alpha with `min(counts[i,j]/4.0, 0.9)`, let's take a look at the following code:


In [24]:
print('==[The "count can be 0"-example doesn\'t work because every time we call np.empty, the value at [76, 52] will be different!]== \n')
print('Let us take the following count as a starting point \n counts[0,2]= ', counts[0,2],'\n')
print('8.0 exceeds 0.9 after being divided by 4: \n counts[0,2]/4.0 = ', counts[0,2]/4.0,'\n')
print('With min we choose to store 0.9 instead: \n min(counts[0,2]/4.0, 0.9) = ', min(counts[0,2]/4.0, 0.9),'\n')
#print(counts[11,58]/4.0)
print('Sometimes our count can equal 0: \n counts[76, 52] = ', counts[76, 52],'\n', '(We\'ll just act as if it was a 0)\n')
print('In these cases we add 0.1 to keep all alphas in range 0.1 to 1.0: \n min(counts[76,52]/4.0, 0.9) + 0.1 = ',
      min(counts[76,52]/4.0, 0.9) + 0.1 ,'\n')

==[The "count can be 0"-example doesn't work because every time we call np.empty, the value at [76, 52] will be different!]== 

Let us take the following count as a starting point 
 counts[0,2]=  8.0 

8.0 exceeds 0.9 after being divided by 4: 
 counts[0,2]/4.0 =  2.0 

With min we choose to store 0.9 instead: 
 min(counts[0,2]/4.0, 0.9) =  0.9 

Sometimes our count can equal 0: 
 counts[76, 52] =  2.084148742e-57 
 (We'll just act as if it was a 0)

In these cases we add 0.1 to keep all alphas in range 0.1 to 1.0: 
 min(counts[76,52]/4.0, 0.9) + 0.1 =  0.1 



###Getting meaningful alphas

So you might have asked yourself: why are we dividing our counts by 4? Well, it is not a magic number, I can tell you that; it depends completely on your data. In this case 80% of the co-occurrence values are between 1 and 4, so the formula gives a nicely graded alpha value for most of our data. The other 20% get the maximum alpha of 1.0 (max = 0.9 + 0.1 = 1.0).
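
To see how the formula spreads counts over the alpha range, here is a small sketch that wraps it in a hypothetical helper `alpha_for` (the name and the `scale` parameter are ours, not part of the original code):

```python
def alpha_for(count, scale=4.0):
    """Map a co-occurrence count to an alpha between 0.1 and 1.0."""
    return min(count / scale, 0.9) + 0.1

# Small counts get graded alphas; larger counts all hit the 1.0 ceiling.
for c in [0, 1, 2, 3, 4, 8, 31]:
    print(c, alpha_for(c))
```

If your own data has mostly larger counts, a larger `scale` would be the knob to try.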


In [25]:
colormap = [
    "#444444", "#a6cee3", "#1f78b4", "#b2df8a", "#33a02c", "#fb9a99",
    "#e31a1c", "#fdbf6f", "#ff7f00", "#cab2d6", "#6a3d9a"
]
xname = []
yname = []
color = []
alpha = []
for i, n1 in enumerate(nodes):
    for j, n2 in enumerate(nodes):
        xname.append(n1['name'])
        yname.append(n2['name'])
       
        #print(i, j, counts[i,j])
        
        a = min(counts[i,j]/4.0, 0.9) + 0.1
        alpha.append(a)

        if n1['group'] == n2['group']:
            color.append(colormap[n1['group']])
        else:
            color.append('lightgrey')

###Concluding

As we learned before, `counts` provides a value for every rectangle in the graphic: some carry the information given by the link value (transformed by the 'formula' into a usable alpha value), while others are useless leftover numbers. Since we give every pair of characters from the same group its group color and paint every other character combination lightgrey, we are left with a nice visualization.
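
A minimal sketch of the coloring rule used in the loop above, wrapped in a hypothetical helper `cell_color` (the helper is ours; the colormap is copied from the cell above):

```python
colormap = [
    "#444444", "#a6cee3", "#1f78b4", "#b2df8a", "#33a02c", "#fb9a99",
    "#e31a1c", "#fdbf6f", "#ff7f00", "#cab2d6", "#6a3d9a"
]

def cell_color(group1, group2):
    """Group color when both characters share a group, light grey otherwise."""
    return colormap[group1] if group1 == group2 else 'lightgrey'

print(cell_color(1, 1))   # both characters in group 1
print(cell_color(1, 2))   # characters from different groups
```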

## ColumnDataSource

Why do the counts need to be flattened? When I print the result of the flattened counts I get one long list of numbers like: 6.90178054e-310   1.00000000e+000   8.00000000e+000

In [14]:
# EXERCISE: output static HTML file
output_file("les_mis.html")

# Create a ColumnDataSource to hold the xnames, ynames, colors, alphas,
# and counts. NOTE: the counts array is 2D and will need to be flattened
source = ColumnDataSource(
    data=dict(
        xname=xname,
        yname=yname,
        colors=color,
        alphas=alpha,
        count=counts.flatten()
    )
)

#print(counts.flatten())

##Looking at the Data

In [70]:
source_df = source.to_df()

source_df[:8]

| | xname | colors | alphas | yname | count |
|---|---|---|---|---|---|
| 0 | Myriel | #a6cee3 | 0.1 | Myriel | 5.240945e-316 |
| 1 | Myriel | #a6cee3 | 0.35 | Napoleon | 1.0 |
| 2 | Myriel | #a6cee3 | 1.0 | Mlle.Baptistine | 8.0 |
| 3 | Myriel | #a6cee3 | 1.0 | Mme.Magloire | 10.0 |
| 4 | Myriel | #a6cee3 | 0.35 | CountessdeLo | 1.0 |
| 5 | Myriel | #a6cee3 | 0.35 | Geborand | 1.0 |
| 6 | Myriel | #a6cee3 | 0.35 | Champtercier | 1.0 |
| 7 | Myriel | #a6cee3 | 0.35 | Cravatte | 1.0 |


**Answer:** If we want to have counts as a column, it can't be a 2D array (we would have 77 values in each row of 'count').
Just test it yourself: delete `.flatten()` after `counts` and read the exception (must be 1-dimensional).
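
Flattening also keeps everything lined up: `flatten()` walks the array row by row, which is the same order in which the nested loop filled `xname` and `yname`. A small sketch with made-up names and stand-in values (`names_demo` and `counts_demo` are illustrative only):

```python
import numpy as np

names_demo = ['A', 'B', 'C']                 # hypothetical character names
counts_demo = np.arange(9).reshape(3, 3)     # stand-in values 0..8

# Same loop order as in the notebook: outer loop = first name, inner loop = second name.
xname, yname = [], []
for n1 in names_demo:
    for n2 in names_demo:
        xname.append(n1)
        yname.append(n2)

flat = counts_demo.flatten()                 # row-major: row 0 first, then row 1, ...
print(flat.shape)                            # one value per (xname, yname) pair

# Entry k of the flat array is the count for the k-th (xname, yname) pair.
k = 5
i, j = names_demo.index(xname[k]), names_demo.index(yname[k])
print(xname[k], yname[k], flat[k], counts_demo[i, j])
```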

## Figure

In [60]:
# create a new figure
p = figure(title="Les Mis Occurrences (one at a time)",
           x_axis_location="above", tools="resize,hover",
           x_range=list(reversed(names)), y_range=names,
           plot_width=800, plot_height=800)

## Adding a rect glyph

The [rect glyph](http://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.Figure.rect) displays rectangles centered on the given coordinates with the given dimensions and angle.

Parameters:	
* x (str or list[float]) – values or field names of center x coordinates
* y (str or list[float]) – values or field names of center y coordinates
* width (str or list[float]) – values or field names of widths
* height (str or list[float]) – values or field names of heights
* angle (str or list[float], optional) – values or field names of rotation angles, defaults to 0
* dilate (bool, optional) – whether to dilate pixel distance computations when drawing, defaults to False


In [61]:
# EXERCISE: use the `p.rect` renderer to render a categorical heatmap of all the
# data. Experiment with the widths and heights (use categorical percentage
# units) as well as colors and alphas

p.rect('xname', 'yname', 0.9, 0.9, source=source,
       color='colors', alpha='alphas', line_color=None)

# EXERCISE: use p.grid, p.axis, etc. to style the plot. Some suggestions:
#   - remove the axis and grid lines
#   - remove the major ticks
#   - make the tick labels smaller
#   - set the x-axis orientation to vertical, or angled
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "5pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = np.pi/3

#print(np.pi/3) #Result : 1.0471975511965976

# EXERCISE: configure the hover tool to display both names as well as
# the count value as tooltips
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ('names', '@yname, @xname'),
    ('count', '@count'),
]

# EXERCISE: show the plot
show(p)