# Learning Module - Bokeh

### Here we will show you how to use some of the basic functionalities of Bokeh, a Python visualization library.
Keep in mind that there are several other libraries, each with its own pros and cons, and, as of now, none of them can be classified as the best overall.

Bokeh, Plotly and Matplotlib are among the best known for data visualization, but there are more.

Some of these can be used to create rich, interactive visualizations that can be served directly to the end-user in a web page (Bokeh is one of them).

[Bokeh](https://docs.bokeh.org/en/latest/index.html) is open-source, which means that you can see how it is built, and you can also help build it!

Bookmark the [Bokeh plotting user guide](https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html).

In this module we will cover:
* Basic data manipulation
* Data Visualization with Bokeh
    * Piecharts
    * Scatter plots
    * Line and Multi-Line plots
    * Areas
    * Annotations

## Data !

Let's use some COVID-19 data taken from Johns Hopkins University.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("covid19data_cleaned.csv")

In [3]:
total = df.groupby("Country")[["confirmed_cases", "deaths"]].sum() # sum only the numeric columns

In [4]:
from bokeh.io import output_file, show
from bokeh.palettes import Category20c,Turbo256
from bokeh.plotting import figure
from bokeh.transform import cumsum
from bokeh.io import output_notebook

from math import pi



Run this so that the plots are rendered inline, here in the notebook.

In [5]:
output_notebook()

## Y'all like pies?
#### Everybody likes pies. That was rhetorical.
*I don't remember the last time I had one, though*

In [6]:
n = 15

In [7]:
n_largest = total["confirmed_cases"].nlargest(n)
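To see what the two steps above do (sum per country, then keep the `n` largest), here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions for illustration only; just the column names mirror the dataset.

```python
import pandas as pd

# Toy data (made up for illustration): one row per country per day.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "confirmed_cases": [10, 20, 5, 5, 100, 50],
    "deaths": [1, 2, 0, 1, 10, 5],
})

# Sum the numeric columns within each country.
toy_total = toy.groupby("Country")[["confirmed_cases", "deaths"]].sum()

# Keep the 2 countries with the most confirmed cases, largest first.
top2 = toy_total["confirmed_cases"].nlargest(2)
print(top2)
```

The result is a Series indexed by country, which is why the next step can call `reset_index` to turn it back into a regular table.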

We need data for the pie. 

Let us set up the recipe first.

In [8]:
data = pd.Series(n_largest).reset_index(name='value').rename(columns={'index':'Country'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[n] #pick n colors from this palette

In [9]:
data

| | Country | value | angle | color |
|---|---|---|---|---|
| 0 | US | 11202979.0 | 1.723609 | #3182bd |
| 1 | India | 8873541.0 | 1.365219 | #6baed6 |
| 2 | Brazil | 5876464.0 | 0.90411 | #9ecae1 |
| 3 | France | 2041293.0 | 0.314059 | #c6dbef |
| 4 | Russia | 1932711.0 | 0.297353 | #e6550d |
| 5 | Spain | 1496864.0 | 0.230297 | #fd8d3c |
| 6 | United Kingdom | 1394299.0 | 0.214517 | #fdae6b |
| 7 | Argentina | 1318384.0 | 0.202837 | #fdd0a2 |
| 8 | Italy | 1205881.0 | 0.185528 | #31a354 |
| 9 | Colombia | 1205217.0 | 0.185426 | #74c476 |
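The `angle` column is each country's share of the total, scaled to a full circle of 2π radians. The sketch below, on made-up values, shows how a running sum of those angles gives where each wedge starts and ends; the `start_angle` and `end_angle` columns are names introduced here for illustration, playing the role that `cumsum('angle', include_zero=True)` and `cumsum('angle')` play when passed to `wedge`.

```python
from math import pi, isclose

import pandas as pd

# Toy values (made up): three slices worth 50%, 30% and 20% of the pie.
toy = pd.DataFrame({"Country": ["A", "B", "C"], "value": [50, 30, 20]})

# Share of the total, scaled to a full circle (2*pi radians).
toy["angle"] = toy["value"] / toy["value"].sum() * 2 * pi

# Each wedge ends where the running total of angles ends,
# and starts where the previous wedge ended (0 for the first one).
toy["end_angle"] = toy["angle"].cumsum()
toy["start_angle"] = toy["end_angle"] - toy["angle"]

print(toy[["Country", "start_angle", "end_angle"]])
```

Because the angles are shares of the whole, the last `end_angle` lands on 2π and the wedges close the circle.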


### Pie Chart!

#### Not an actual pie, I am sorry.

#### Let's see the confirmed cases

In [10]:
# `height` replaces the older `plot_height`, which recent Bokeh versions no longer accept
p = figure(height=350, title="Top {} Confirmed cases".format(n), toolbar_location=None,
           tools="hover", tooltips="@Country: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='Country', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None
show(p)

## What about deaths?

In [11]:
n_largest = total["deaths"].nlargest(n)
data = pd.Series(n_largest).reset_index(name='value').rename(columns={'index':'Country'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[n] #Picking n colors from this palette

In [12]:
# `height` replaces the older `plot_height`, which recent Bokeh versions no longer accept
p = figure(height=350, title="Top {} Deaths".format(n), toolbar_location=None,
           tools="hover", tooltips="@Country: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='Country', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

### Apparently the countries with most confirmed cases are not always the ones with the most deaths...
#### Why would that be?

___

## Now, since we got the basics of pie charts and data sources out of the way, let us just do a regular line.

### First let's pick a country to see the evolution of cases over time!

In [13]:
country = "Portugal"

In [14]:
mask = df["Country"] ==country

In [15]:
selected_country = df[mask]

In [16]:
fig = figure(width= 600,height=500,tools="hover,pan,wheel_zoom,box_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")

In [17]:
x = pd.to_datetime(selected_country["date"])
y = selected_country["confirmed_cases"].astype(int)

In [18]:
fig.line(x = x,y = y,legend_label="Confirmed by Day")

In [19]:
show(fig)

## Feels a bit empty, no?

#### Let's correlate the cases to the deaths.

In [20]:
y = selected_country["deaths"].astype(int)

In [21]:
fig.line(x = x,y = y,legend_label="Death by Day",color="red")

In [22]:
show(fig)

### Yet again, we still have space for more!
#### Now let us add the cumulative data for both cases and deaths.

These lines share the same x values, and Bokeh lets us simplify how we pass them.
This time we also want to be able to zoom, but only along the x axis.
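As a hedged sketch of both ideas, the example below puts made-up data into a single `ColumnDataSource` so every line refers to the same `date` column by name, and uses the `xwheel_zoom` tool so the mouse wheel only zooms horizontally. The `toy` frame, its values and the `cum_*` column names are assumptions for illustration, and `width`/`height` assume a reasonably recent Bokeh version.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=5, freq="D"),
    "confirmed_cases": [2, 3, 5, 8, 13],
    "deaths": [0, 0, 1, 1, 2],
})
toy["cum_cases"] = toy["confirmed_cases"].cumsum()
toy["cum_deaths"] = toy["deaths"].cumsum()

# One shared source: each glyph picks its columns by name.
source = ColumnDataSource(toy)

# "xwheel_zoom" restricts wheel zooming to the x axis.
fig = figure(width=600, height=300, x_axis_type="datetime",
             tools="pan,xwheel_zoom,reset", title="Toy cumulative data")
fig.line(x="date", y="cum_cases", source=source,
         legend_label="Cumulative Cases", color="orange")
fig.line(x="date", y="cum_deaths", source=source,
         legend_label="Cumulative Deaths", color="red")

# In a notebook, call show(fig) after output_notebook() to render it.
```

Sharing one source also means that hover tooltips can reference any column with `@column_name`.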

In [23]:
total = selected_country[["confirmed_cases","deaths"]].cumsum()

In [24]:
y = total["deaths"]

fig.line(x = x,y = y,legend_label="Cumulative Death",color="yellow")

y = total["confirmed_cases"]
fig.line(x = x,y = y,legend_label="Cumulative Cases",color="orange")

show(fig)

### Ok. Now it is too hard to make sense of this.

Let's keep things neat and tidy, so that we don't overwhelm our end-users with a lot of information.

### As an alternative, let's try to plot both the deaths and confirmed cases, but as the % change from the day before.

In [25]:
country = "Portugal"
mask = df["Country"] == country
# Day-over-day relative change of both columns
selected_country = df[mask][["confirmed_cases","deaths"]].pct_change()
fig = figure(width= 600,height=500,tools="hover,pan,wheel_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")
x = pd.to_datetime(df[mask]["date"]) # real dates for the datetime axis, not the row numbers
y = selected_country["confirmed_cases"]
fig.line(x = x,y = y,legend_label="Cases: % difference per day",color="orange")
y = selected_country["deaths"]

# A distinct label keeps the two lines apart in the legend
fig.line(x = x,y = y,legend_label="Deaths: % difference per day",color="red")

show(fig)


### Might be an interesting view of the data. What do you think?
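One thing to keep in mind with this view: `pct_change` divides by the previous day's value, so days that follow a zero produce `NaN` or infinity. A minimal sketch on made-up numbers (the `daily` Series is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy daily counts (made up), starting with zeros like the real data does.
daily = pd.Series([0, 0, 2, 3, 6], name="confirmed_cases")

# Relative change versus the previous row: (today - yesterday) / yesterday.
change = daily.pct_change()
print(change)  # NaN, NaN, inf, 0.5, 1.0

# Infinite values can be turned into NaN so they are easy to drop or skip.
clean = change.replace([np.inf, -np.inf], np.nan)
print(clean.dropna())
```

Note that the values are fractions (0.5 means +50%), so multiply by 100 if you want the axis to read as percentages.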

___

#### Let's use the absolute values, but with less granularity. Let us see how Covid was increasing by week (7D).

In [26]:
df.index = pd.to_datetime(df["date"]) #Why do we need to do this ?

In [27]:
country = "Portugal"
mask = df["Country"] ==country
selected_country = df[mask]

In [28]:
import numpy as np

In [29]:
selected_country

| date (index) | date | Country | deaths | confirmed_cases |
|---|---|---|---|---|
| 2020-01-22 | 2020-01-22 | Portugal | 0.0 | 0.0 |
| 2020-01-23 | 2020-01-23 | Portugal | 0.0 | 0.0 |
| 2020-01-24 | 2020-01-24 | Portugal | 0.0 | 0.0 |
| 2020-01-25 | 2020-01-25 | Portugal | 0.0 | 0.0 |
| 2020-01-26 | 2020-01-26 | Portugal | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 2020-11-12 | 2020-11-12 | Portugal | 78.0 | 5839.0 |
| 2020-11-13 | 2020-11-13 | Portugal | 69.0 | 6653.0 |
| 2020-11-14 | 2020-11-14 | Portugal | 55.0 | 6602.0 |
| 2020-11-15 | 2020-11-15 | Portugal | 76.0 | 6035.0 |


In [30]:
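# Aggregate only the numeric columns; the text columns (date, Country) cannot be averaged
selected_country = selected_country[["confirmed_cases","deaths"]].resample("7D").agg(["mean","sum","std"])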
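On the earlier question of why the index had to become datetime: `resample` bins rows by time, so it needs a time-based index to work with. The sketch below, on made-up data, shows the failure with a default index and the shape of the result once the index is set. The `toy` frame and the `needs_datetime_index` flag are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data (made up): 14 consecutive days with values 1..14.
dates = pd.date_range("2020-01-01", periods=14, freq="D")
toy = pd.DataFrame({"date": dates.astype(str), "confirmed_cases": range(1, 15)})

# With the default integer index there is no time axis to bin on.
try:
    toy[["confirmed_cases"]].resample("7D").sum()
    needs_datetime_index = False
except TypeError:
    needs_datetime_index = True

# After setting a DatetimeIndex, rows are grouped into 7-day bins.
toy.index = pd.to_datetime(toy["date"])
weekly = toy[["confirmed_cases"]].resample("7D").agg(["mean", "sum", "std"])
print(weekly)
```

The aggregated frame has two levels of columns (original column, then statistic), which is why the next cells index it as `selected_country["confirmed_cases"]["mean"]`.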

In [31]:
x = selected_country.index
y = selected_country["confirmed_cases"]["mean"]

In [32]:
fig = figure(width= 600,height=500,tools="hover,pan,wheel_zoom,box_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")
fig.line(x = x,y = y,legend_label="Mean confirmed by week",color="orange")
show(fig)

In [33]:
conf_cases = selected_country["confirmed_cases"]
upper = conf_cases["mean"] + conf_cases["std"]
lower = conf_cases["mean"] - conf_cases["std"]

In [34]:
source = { "x": x, "upper" :upper,"lower":lower}

In [35]:
from bokeh.models import Band, ColumnDataSource

In [36]:
band = Band(base="x", lower="lower", upper="upper", source=ColumnDataSource(data=source), level='underlay',
            fill_alpha=0.2, line_width=1, line_color='orange',fill_color="orange")

fig.add_layout(band) #we need to add this one explicitly..
show(fig)

In [37]:
#This also works, but then I wouldn't be able to make a pun with bands.
fig.varea(x="x",y1= "upper", y2= "lower",legend_label="std",color="orange",fill_alpha=0.2,source=source)

### Let's do the same, but now let's separate the data per season.

### Select seasons

In [38]:
#some of the possible params: start,end,periods,freq
winter = pd.date_range(start='21/12/2019', end='20/03/2020')
spring =pd.date_range(start='20/03/2020', end='20/06/2020')
summer =pd.date_range(start='20/06/2020', end='22/10/2020')
autumn = pd.date_range(start='22/10/2020', end='21/12/2020')
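A quick sketch of how these ranges behave, using ISO-formatted dates to avoid any day/month ambiguity. With only `start` and `end`, `date_range` yields one entry per day; `periods` and `freq` control the count and spacing instead. The `weekly` index below is a made-up stand-in for the resampled data.

```python
import pandas as pd

# One timestamp per day, both endpoints included.
spring = pd.date_range(start="2020-03-20", end="2020-06-20")
print(len(spring))  # 93 days

# Four timestamps, seven days apart (stand-in for a weekly resampled index).
weekly = pd.date_range(start="2020-03-18", periods=4, freq="7D")

# isin() flags which weekly timestamps fall on a day inside the season.
mask = weekly.isin(spring)
print(list(zip(weekly.strftime("%Y-%m-%d"), mask)))
```

The boolean mask is what later selects the rows of each season from the weekly data.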


### Select the country

In [39]:
country = "Portugal"
mask = df["Country"] ==country
selected_country = df[mask]

In [40]:
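# Aggregate only the numeric columns; the text columns (date, Country) cannot be averaged
selected_country = selected_country[["confirmed_cases","deaths"]].resample("7D").agg(["mean","sum","std"])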

In [41]:
fig = figure(width= 600,height=500,tools="hover,pan,wheel_zoom,box_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")

In [42]:
mask = selected_country.index.isin(winter)

winter = selected_country[mask]
x = winter.index
y = winter["confirmed_cases"]["mean"]
source = {
    "x":x,
    "upper":y + winter["confirmed_cases"]["std"],
    "lower":y - winter["confirmed_cases"]["std"],
}
fig.varea(x="x",y1= "upper", y2= "lower",color="blue",fill_alpha=0.2,source=source)
fig.line(x = x,y = y,legend_label="Winter cases",color="blue")


In [43]:
mask = selected_country.index.isin(spring)

spring = selected_country[mask]
x = spring.index
y = spring["confirmed_cases"]["mean"]
source = {
    "x":x,
    "upper":y + spring["confirmed_cases"]["std"],
    "lower":y - spring["confirmed_cases"]["std"],
}
fig.varea(x="x",y1= "upper", y2= "lower",color="orange",fill_alpha=0.2,source=source)
fig.line(x = x,y = y,legend_label="Spring cases",color="orange")


In [44]:
mask = selected_country.index.isin(autumn)

autumn = selected_country[mask]
x = autumn.index
y = autumn["confirmed_cases"]["mean"]
source = {
    "x":x,
    "upper":y + autumn["confirmed_cases"]["std"],
    "lower":y - autumn["confirmed_cases"]["std"],
}
fig.varea(x="x",y1= "upper", y2= "lower",color="yellow",fill_alpha=0.2,source=source) 
fig.line(x = x,y = y,legend_label="Autumn cases",color="yellow")


In [45]:
mask = selected_country.index.isin(summer)

summer = selected_country[mask]
x = summer.index
y = summer["confirmed_cases"]["mean"]
source = {
    "x":x,
    "upper":y + summer["confirmed_cases"]["std"],
    "lower":y - summer["confirmed_cases"]["std"],
}
fig.varea(x="x",y1= "upper", y2= "lower",color="red",fill_alpha=0.2,source=source)
fig.line(x = x,y = y,legend_label="Summer cases",color="red")


In [46]:
show(fig)

### Let's just organize that code

In [47]:
winter = pd.date_range(start='21/12/2019', end='20/03/2020')
spring =pd.date_range(start='20/03/2020', end='20/06/2020')
summer =pd.date_range(start='20/06/2020', end='22/10/2020')
autumn = pd.date_range(start='22/10/2020', end='21/12/2020')

colors = ["blue","orange","red","yellow"]
seasons = [winter,spring,summer,autumn]
labels = ["Winter","Spring","Summer","Autumn"]



In [48]:
fig = figure(width= 600,height=500,tools="hover,pan,xwheel_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")

for i in range(4):
    mask = selected_country.index.isin(seasons[i])

    season = selected_country[mask]
    x = season.index
    y = season["confirmed_cases"]["mean"]
    source = {
        "x":x,
        "upper":y + season["confirmed_cases"]["std"],
        "lower":y - season["confirmed_cases"]["std"],
    }
    fig.varea(x="x",y1= "upper", y2= "lower",color=colors[i],fill_alpha=0.2,source=source)
    fig.line(x = x,y = y,legend_label="{} cases".format(labels[i]),color=colors[i])


show(fig)

## The line should be continuous, right? How could we do this?
* Change data
* Add data


In [49]:
country = "Portugal"
mask = df["Country"] ==country
selected_country = df[mask]

In [50]:
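# Aggregate only the numeric columns; the text columns (date, Country) cannot be averaged
selected_country = selected_country[["confirmed_cases","deaths"]].resample("7D").agg(["mean","sum","std"])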

In [51]:
winter = pd.date_range(start='21/12/2019', end='20/03/2020')
spring =pd.date_range(start='20/03/2020', end='20/06/2020')
summer =pd.date_range(start='20/06/2020', end='22/10/2020')
autumn = pd.date_range(start='22/10/2020', end='21/12/2020')


colors = ["blue","orange","red","yellow"]
seasons = [winter,spring,summer,autumn]
labels = ["Winter","Spring","Summer","Autumn"]



In [52]:
fig = figure(width= 600,height=500,tools="hover,pan,xwheel_zoom,reset",title="{} cases evolution".format(country),x_axis_type="datetime",tooltips="value: @y")

for i in range(4):
    mask = selected_country.index.isin(seasons[i])

    season = selected_country[mask]
    x = season.index
    y = season["confirmed_cases"]["mean"]
    source = {
        "x":x,
        "upper":y + season["confirmed_cases"]["std"],
        "lower":y - season["confirmed_cases"]["std"],
    }
    fig.varea(x="x",y1= "upper", y2= "lower",color=colors[i],fill_alpha=0.2,source=source)
    fig.line(x = x,y = y,legend_label="{} cases".format(labels[i]),color=colors[i])


show(fig)
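One possible way to close the gaps, following the "add data" idea: let each season's segment also include the first point of the next season, so consecutive lines touch. This is only a sketch on made-up data; `connected_segments` is a hypothetical helper, not something from Bokeh or pandas.

```python
import pandas as pd

def connected_segments(series, seasons):
    """Split `series` by season and append to each segment the first
    point of the next non-empty segment, so the pieces share endpoints."""
    parts = [series[series.index.isin(dates)] for dates in seasons]
    connected = []
    for i, part in enumerate(parts):
        following = [p for p in parts[i + 1:] if not p.empty]
        if not part.empty and following:
            part = pd.concat([part, following[0].iloc[:1]])
        connected.append(part)
    return connected

# Toy weekly series (made up) and two toy "seasons".
index = pd.date_range("2020-01-01", periods=6, freq="7D")
weekly_mean = pd.Series([0, 1, 2, 3, 4, 5], index=index)
season_a = pd.date_range("2020-01-01", "2020-01-20")
season_b = pd.date_range("2020-01-21", "2020-02-10")

segments = connected_segments(weekly_mean, [season_a, season_b])
for segment in segments:
    print(segment.tolist())
```

Each segment can then be passed to `fig.line` exactly as in the loop above; the shared point is drawn twice, once in each season's color, which hides the gap.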

#### Hmm, now that we have 4 groups (4 categories), what type of plots can we explore?

In [53]:
country = "Portugal"
mask = df["Country"] ==country
selected_country = df[mask]
#No need to resample at this point.

In [54]:
cases = []
deaths = []
for i in seasons:
    mask = selected_country.index.isin(i)

    season = selected_country[mask]
    cases.append(season["confirmed_cases"].sum())
    deaths.append(season["deaths"].sum())

In [55]:
source = ColumnDataSource(data=dict(seasons=labels, cases=cases, color=colors,deaths=deaths))

# `height` replaces the older `plot_height`, which recent Bokeh versions no longer accept
p = figure(x_range=labels, height=450, title="Cases by season",
           toolbar_location=None, tools="hover",tooltips="@deaths | @cases")

p.vbar(x="seasons", top="cases", width=0.9,color="color",legend_field="seasons",source=source)

p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

### These colors, yikes. 

### Cases are simply increasing. Can we correlate them to seasons?


### No bueno, we can do better.

In [56]:
source = ColumnDataSource(data= {
    
 "seasons":labels,
 "cases":cases,
 "deaths":deaths
    
})


categories = ["cases","deaths"]

# `height` replaces the older `plot_height`, which recent Bokeh versions no longer accept
p = figure(x_range=labels, height=450, title="Cases by season",
           toolbar_location=None, tools="hover",tooltips="$y{0.}")

p.yaxis.formatter.use_scientific = False #Alert

p.vbar_stack(categories,x="seasons", width=0.9,legend_label=categories,source=source,color=["blue","red"])

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)

### These colors are awful. I'll let you fix that later if you want.
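If you do want to fix them, one option is to take the colors from one of Bokeh's built-in palettes instead of naming them by hand. A small sketch with made-up numbers; the choice of `Category10` is just an assumption, any palette from `bokeh.palettes` with enough colors would do.

```python
from bokeh.models import ColumnDataSource
from bokeh.palettes import Category10
from bokeh.plotting import figure

labels = ["Winter", "Spring", "Summer", "Autumn"]

# Toy totals per season (made up for illustration).
source = ColumnDataSource(data={
    "seasons": labels,
    "cases": [120, 900, 400, 2500],
    "deaths": [3, 40, 10, 60],
})
categories = ["cases", "deaths"]

# Category10 is keyed by palette size (3 to 10); take the first two colors.
palette = list(Category10[3][:2])

p = figure(x_range=labels, height=300, title="Toy cases by season",
           toolbar_location=None, tools="")
renderers = p.vbar_stack(categories, x="seasons", width=0.9, color=palette,
                         legend_label=categories, source=source)

# In a notebook, call show(p) after output_notebook() to render it.
```

One color per stacked category is needed, so the palette slice must be as long as `categories`.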

_______
### Let's try to use all of the data.

#### Any idea what type of plot we could use to visualize the cases per day per country?

First we load the data from the original file. 

We do this because the original dataset is already formatted in the way we need.

In [279]:
data = pd.read_csv("covid_confirmed.csv") 

Now, some necessary data manipulation.

In [280]:
data = data.drop(["Province/State","Lat","Long"],axis=1)

In [281]:
data = data.groupby("Country/Region").sum().reset_index().rename(columns={"Country/Region":"Country"}) 

In [282]:
data.Country = data.Country.astype(str)
data = data.set_index('Country')
data.columns.name = 'Days'


In [283]:
data = data.diff(axis=1) #Get confirmed cases by day

In [284]:
data = data.fillna(0)

In [301]:
subset = data.head(10) #Select rows

In [313]:
subset = subset.iloc[:, -50:] #Select columns
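The manipulation above can be hard to picture on the full dataset, so here is a minimal sketch of the same steps on a made-up wide table: cumulative counts become daily counts with `diff(axis=1)`, and `stack` then turns the wide table into the long (one row per country per day) shape that a heatmap source expects. The `wide` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy cumulative counts (made up): one row per country, one column per day.
wide = pd.DataFrame(
    {"1/22/20": [1, 0], "1/23/20": [3, 2], "1/24/20": [6, 2]},
    index=pd.Index(["A", "B"], name="Country"),
)
wide.columns.name = "Days"

# Difference between consecutive columns gives new cases per day;
# the first day has no previous column, so its NaN is filled with 0.
daily = wide.diff(axis=1).fillna(0)

# Long format: one row per (Country, Days) pair with the value in "cases".
long = daily.stack().rename("cases").reset_index()
print(long)
```

The `Country`, `Days` and `cases` column names of the long table are the ones a glyph can then reference by name.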


### Heatmap
In this case *confirmed_cases*map
*For more info see [here](https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html)*

In [None]:
import pandas as pd

from bokeh.io import output_file, show
from bokeh.models import (BasicTicker, ColorBar, ColumnDataSource,
                          LinearColorMapper, PrintfTickFormatter,)
from bokeh.plotting import figure
from bokeh.transform import transform
from bokeh.palettes import Inferno256


In [314]:

df = pd.DataFrame(subset.stack(), columns=['cases']).reset_index()



source = ColumnDataSource(df)

colors = Inferno256
mapper = LinearColorMapper(palette=colors, low=df.cases.min(), high=df.cases.max())

p = figure(width=1100, height=700, title="Covid Confirmed Cases", # width/height replace plot_width/plot_height
           y_range=list(subset.index), x_range=list(reversed(subset.columns)),
            tools="hover,box_zoom,pan", x_axis_location="above",tooltips="@Country @cases at @Days")

p.rect(x="Days", y="Country", width=1, height=1, source=source,
       line_color=None, fill_color=transform('cases', mapper))

color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
                     ticker=BasicTicker(desired_num_ticks=10),
                     formatter=PrintfTickFormatter(format="%d"))

p.add_layout(color_bar, 'right')

p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "7px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0

show(p)