#Final Project - Building an Interactive Graph

#Download sample data from bokeh  

We will work with the Gapminder data set, specifically the data on fertility rate and life expectancy from 1964 to 2012. The data set is already available within the Bokeh sample data.

*Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.*


In [1]:
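from collections import OrderedDict  # used by get_gapminder_1964_scatter_data

import pandas as pd
import numpy as np
import bokeh.sampledata  # import the submodule explicitly so that download() is reachable

from jinja2 import Template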

from bokeh.models import (
    ColumnDataSource, Plot, Circle, Range1d, 
    LinearAxis, HoverTool, Text,
    SingleIntervalTicker, Slider, CustomJS
)
from bokeh.palettes import Spectral6
from bokeh.plotting import vplot
from bokeh.resources import JSResources
from bokeh.embed import file_html

bokeh.sampledata.download()

Using data directory: /Users/svenballentin/.bokeh/data
Downloading: CGM.csv (1589982 bytes)
   1589982 [100.00%]
Downloading: US_Counties.zip (3182088 bytes)
   3182088 [100.00%]
Unpacking: US_Counties.csv
Downloading: us_cities.json (713565 bytes)
    713565 [100.00%]
Downloading: unemployment09.csv (253301 bytes)
    253301 [100.00%]
Downloading: AAPL.csv (166698 bytes)
    166698 [100.00%]
Downloading: FB.csv (9706 bytes)
      9706 [100.00%]
Downloading: GOOG.csv (113894 bytes)
    113894 [100.00%]
Downloading: IBM.csv (165625 bytes)
    165625 [100.00%]
Downloading: MSFT.csv (161614 bytes)
    161614 [100.00%]
Downloading: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.zip (5148539 bytes)
   5148539 [100.00%]
Unpacking: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.csv
Downloading: gapminder_fertility.csv (64346 bytes)
     64346 [100.00%]
Downloading: gapminder_population.csv (94509 bytes)
     94509 [100.00%]
Downloading: gapminder_life_expectancy.csv (73243 bytes)
     73243 [100.00%]
Downl

#Define some basic functions to process the data  

We define three functions that will help us to process the Gapminder sample data into a format that can be processed more easily for our plot.

1. `def _process_gapminder_data():`
    + Convert the column names from strings to ints
    + Turn population into bubble sizes
    + Use pandas categories and map them to colors
    
2. `def get_gapminder_1964_data():`
    + Get a dataframe consisting of data from 1964
    
3. `def get_gapminder_1964_scatter_data():`
    + Get the (fertility, life expectancy) value pairs for 1964
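
The bubble-size and color steps of the first function can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `bubble_size`, `color_for`, the region names and the palette entries are hypothetical stand-ins, while the scale factor of 200 and the minimum size of 3 are the values used in the notebook code.

```python
import math

SCALE_FACTOR = 200  # value used in the notebook
MIN_SIZE = 3        # value used in the notebook


def bubble_size(population):
    """Treat the population as a circle area, take the radius, scale it down,
    and never go below MIN_SIZE. Missing values (None) also get MIN_SIZE."""
    if population is None:
        return MIN_SIZE
    radius = math.sqrt(population / math.pi) / SCALE_FACTOR
    return max(radius, MIN_SIZE)


# Hypothetical stand-ins for the region categories and the Spectral6 palette.
REGIONS = ["Region A", "Region B", "Region C"]
PALETTE = ["color-0", "color-1", "color-2"]


def color_for(region):
    """Pick the palette entry at the region's position in the category list."""
    return PALETTE[REGIONS.index(region)]


print(bubble_size(100000000))  # large population: scaled radius
print(bubble_size(1000000))    # small population: clamped to MIN_SIZE
print(color_for("Region B"))
```

Taking the square root makes the circle's area, rather than its diameter, grow with the population, so large countries do not visually swamp the plot.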

In [2]:
def _process_gapminder_data():
    from bokeh.sampledata.gapminder import fertility, life_expectancy, population, regions

    # Make the column names ints not strings for handling
    columns = list(fertility.columns)
    years = list(range(int(columns[0]), int(columns[-1])))
    rename_dict = dict(zip(columns, years)) # a dict containing year-strings as keys and year-ints as values

    fertility = fertility.rename(columns=rename_dict) # mapping from string to int
    life_expectancy = life_expectancy.rename(columns=rename_dict)
    population = population.rename(columns=rename_dict)
    regions = regions.rename(columns=rename_dict)

    # Turn population into bubble sizes. Use min_size and factor to tweak.
    scale_factor = 200
    population_size = np.sqrt(population / np.pi) / scale_factor
    min_size = 3
    population_size = population_size.where(population_size >= min_size).fillna(min_size)

    # Use pandas categories and categorize & color the regions
    regions.Group = regions.Group.astype('category') # Apply category type on regions
    regions_list = list(regions.Group.cat.categories) # Just store all unique regions in a list

    def get_color(r):
        return Spectral6[regions_list.index(r.Group)] #Map index in list to color on palette
    regions['region_color'] = regions.apply(get_color, axis=1) #Add regions-color column

    return fertility, life_expectancy, population_size, regions, years, regions_list


def get_gapminder_1964_data():
    fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = _process_gapminder_data()
    year = 1964
    region_color = regions_df['region_color']
    region_color.name = 'region_color'
    fertility = fertility_df[year] # get only data for 1964
    fertility.name = 'fertility'
    life = life_expectancy_df[year]
    life.name = 'life'
    population = population_df_size[year]
    population.name = 'population'
    new_df = pd.concat([fertility, life, population, region_color], axis=1) #concat pandas Series' to a DF
    return new_df


def get_gapminder_1964_scatter_data():
    fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = _process_gapminder_data()
    xyvalues = OrderedDict()
    xyvalues['1964'] = list(
        zip(
            fertility_df[1964].dropna().values,
            life_expectancy_df[1964].dropna().values
        )
    )
    return xyvalues

#Setting up the data  

The plot animates with the slider showing the data over time from 1964 to 2012. We can think of **each year as a separate static plot**, and when the slider moves, we **use the callback to change the data source** that is driving the plot.

We **could use bokeh-server to drive this change, but as the data is not too big** we can also pass all the datasets to the JavaScript at once and switch between them on the client side.

This means that we need to **build one data source for each year** that we have data for and then switch between them using the slider. We build them and add them to a **dictionary `sources`** that holds each one under a key that is the year prefixed with a `_`.
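
The idea of one data source per year can be sketched with plain Python containers. The country names and numbers below are made up, and ordinary dicts stand in for the ColumnDataSource objects; only the `_<year>` key scheme is taken from the notebook.

```python
# Toy stand-in data: one value per country and year (made-up numbers).
fertility = {
    "Country A": {1964: 6.0, 1965: 5.9},
    "Country B": {1964: 2.5, 1965: 2.4},
}
life = {
    "Country A": {1964: 45.0, 1965: 45.5},
    "Country B": {1964: 70.0, 1965: 70.2},
}
years = [1964, 1965]
countries = sorted(fertility)

sources = {}
for year in years:
    # One column-oriented table per year, stored under the key "_<year>".
    sources["_" + str(year)] = {
        "index": countries,
        "fertility": [fertility[c][year] for c in countries],
        "life": [life[c][year] for c in countries],
    }

print(sorted(sources))
print(sources["_1965"]["fertility"])
```

Every per-year table has the same columns, which is what lets the slider swap one table for another without touching the glyph definition.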

In [3]:
fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = _process_gapminder_data()

sources = {}

region_color = regions_df['region_color']
region_color.name = 'region_color'

for year in years:
    fertility = fertility_df[year]
    fertility.name = 'fertility'
    life = life_expectancy_df[year]
    life.name = 'life'
    population = population_df_size[year]
    population.name = 'population'
    new_df = pd.concat([fertility, life, population, region_color], axis=1)
    sources['_' + str(year)] = ColumnDataSource(new_df) # add datasource to dict "sources"

In [None]:
print(sources)

We will **pass this dictionary to the callback**. We will also create a string, **js_source_array, that contains keys and values, where each value refers to one of our ColumnDataSources**. Note that we need the `_` prefix because JS identifiers cannot begin with a digit.

The string js_source_array looks like this: {1964: _1964, 1965: _1965, ....}

Note the **keys of this object are integers and the values are the references to our ColumnDataSources** from above. So that now, in our JS code, we have an object that's storing all of our ColumnDataSources and we can look them up.
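
Both points, the unquoted values and the need for the prefix, can be checked on a shortened list of years. This is only a sketch: `str.isidentifier` tests Python's naming rules and is used here as a proxy, on the assumption that for names made of digits and underscores the two languages agree that a leading digit is not allowed.

```python
years = [1964, 1965, 1966]

# Integer keys, prefixed names as values.
mapping = {year: "_%s" % year for year in years}

# str() renders the values with quotes; stripping the quotes turns them
# into bare names that the JS code can resolve to variables.
js_source_array = str(mapping).replace("'", "")
print(js_source_array)

# A name must not start with a digit, hence the "_" prefix.
print("1964".isidentifier(), "_1964".isidentifier())
```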

In [4]:
dictionary_of_sources = dict(zip([x for x in years], ['_%s' % x for x in years]))
js_source_array = str(dictionary_of_sources).replace("'", "")
print(js_source_array)

{1964: _1964, 1965: _1965, 1966: _1966, 1967: _1967, 1968: _1968, 1969: _1969, 1970: _1970, 1971: _1971, 1972: _1972, 1973: _1973, 1974: _1974, 1975: _1975, 1976: _1976, 1977: _1977, 1978: _1978, 1979: _1979, 1980: _1980, 1981: _1981, 1982: _1982, 1983: _1983, 1984: _1984, 1985: _1985, 1986: _1986, 1987: _1987, 1988: _1988, 1989: _1989, 1990: _1990, 1991: _1991, 1992: _1992, 1993: _1993, 1994: _1994, 1995: _1995, 1996: _1996, 1997: _1997, 1998: _1998, 1999: _1999, 2000: _2000, 2001: _2001, 2002: _2002, 2003: _2003, 2004: _2004, 2005: _2005, 2006: _2006, 2007: _2007, 2008: _2008, 2009: _2009, 2010: _2010, 2011: _2011, 2012: _2012}


#Build the plot

In [5]:
# Set up the plot
xdr = Range1d(1, 9)
ydr = Range1d(20, 100)
plot = Plot(
    x_range=xdr,
    y_range=ydr,
    title="",
    plot_width=1600,
    plot_height=800,
    outline_line_color=None,
    toolbar_location=None,    
)
AXIS_FORMATS = dict(
    minor_tick_in=None,
    minor_tick_out=None,
    major_tick_in=None,
    major_label_text_font_size="10pt",
    major_label_text_font_style="normal",
    axis_label_text_font_size="10pt",

    axis_line_color='#AAAAAA',
    major_tick_line_color='#AAAAAA',
    major_label_text_color='#666666',

    major_tick_line_cap="round",
    axis_line_cap="round",
    axis_line_width=1,
    major_tick_line_width=1,
)

xaxis = LinearAxis(axis_label="Children per woman (total fertility)", **AXIS_FORMATS)
yaxis = LinearAxis(axis_label="Life expectancy at birth (years)", **AXIS_FORMATS)   
plot.add_layout(xaxis, 'below')
plot.add_layout(yaxis, 'left')

#Add the background year text

We add this first so it is below all the other glyphs

In [6]:
# Add the year in background (add before circle)
text_source = ColumnDataSource({'year': ['%s' % years[0]]}) # build the data source from a dict;
                                                            # we could also do {'year': ['1964']}
text = Text(x=2, y=35, text='year', text_font_size='150pt', text_color='#EEEEEE')
plot.add_glyph(text_source, text)

<bokeh.models.renderers.GlyphRenderer at 0x10609d350>

#Add the bubbles and hover

We add the **bubbles using the Circle glyph**. We start from the first year of data and that is our source that drives the circles (the other sources will be used later).

plot.add_glyph returns the renderer, and we pass this to the HoverTool so that hover only happens for the bubbles on the page and not other glyph elements.

In [7]:
# Add the circle
renderer_source = sources['_%s' % years[0]]
circle_glyph = Circle(
    x='fertility', y='life', size='population',
    fill_color='region_color', fill_alpha=0.8, 
    line_color='#7c7e71', line_width=0.5, line_alpha=0.5)
circle_renderer = plot.add_glyph(renderer_source, circle_glyph)

# Add the hover (only against the circle and not other plot elements)
plot.add_tools(HoverTool(tooltips="@index", renderers=[circle_renderer]))

#Add the legend 

Finally we manually build the legend by adding circles and texts to the plot.
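
The layout arithmetic behind such a hand-built legend is simple enough to sketch on its own. `legend_entries` is a hypothetical helper, not part of Bokeh; the defaults mirror the numbers used in this notebook (start at x=7, y=95, step down by 5, marker offset by -0.1 and +2).

```python
def legend_entries(labels, x=7, y_top=95, row_height=5):
    """Return (label, text position, marker position) per legend row,
    stacking the rows downwards from y_top in data coordinates."""
    entries = []
    for i, label in enumerate(labels):
        y = y_top - i * row_height
        entries.append((label, (x, y), (x - 0.1, y + 2)))
    return entries


for entry in legend_entries(["Region A", "Region B", "Region C"]):
    print(entry)
```

Because the positions are in data coordinates, the legend only stays in place as long as the plot ranges are fixed, which is the case for the `Range1d` ranges defined above.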

In [8]:
text_x = 7
text_y = 95
for i, region in enumerate(regions):
    plot.add_glyph(Text(x=text_x, y=text_y, text=[region], text_font_size='10pt', text_color='#666666'))
    plot.add_glyph(Circle(x=text_x - 0.1, y=text_y + 2, fill_color=Spectral6[i], size=10, line_color=None, fill_alpha=0.8))
    text_y = text_y - 5 


#Add the slider and callback

Last, but not least, we **add the slider widget and the JS callback code which changes the data of the renderer_source (powering the bubbles / circles) and the data of the text_source (powering background text)**. After we've set() the data we need to trigger() a change. slider, renderer_source, text_source are all available because we add them as args to CustomJS.

It is the combination of `sources = %s` with `% js_source_array` in the JS and `CustomJS(args=sources, ...)` that provides the ability to look up, by year, the JS version of our Python-made ColumnDataSource.

In [11]:
# Add the slider
code = """
    var year = slider.get('value'),
        sources = %s,
        new_source_data = sources[year].get('data');
    renderer_source.set('data', new_source_data);
    text_source.set('data', {'year': [String(year)]});
""" % js_source_array

callback = CustomJS(args=sources, code=code)
slider = Slider(start=years[0], end=years[-1], value=years[0], step=1, title="Year", callback=callback, name='testy')  # start at the first year, matching the initial plot
callback.args["renderer_source"] = renderer_source
callback.args["slider"] = slider
callback.args["text_source"] = text_source


In [12]:
# Stick the plot and the slider together
layout = vplot(plot, slider)

from bokeh.io import output_file, show

output_file("figures/bokeh_final_project.html")
show(layout)
