### Bokeh with Numpy and Pandas

#### Import Bokeh for plotting

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

In [None]:
#To display bokeh plots inline in Jupyter Notebook
output_notebook()

#### Working with Numpy arrays

NumPy (Numerical Python) provides arrays that support easier and faster calculations than Python lists. In real-world data analysis we work with NumPy arrays a lot. Bokeh accepts NumPy arrays wherever it accepts lists.
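As a minimal sketch of the difference (the values are made up for illustration), the same scaling is written as a comprehension for a list and as a single expression for an array:

```python
import numpy as np

heights = [10, 20, 30]  # toy values, not taken from any dataset

# Plain list: scaling needs an explicit loop or comprehension.
scaled_list = [h * 1.5 for h in heights]

# NumPy array: the multiplication is applied to every element at once.
scaled_array = np.array(heights) * 1.5

print(scaled_list)
print(scaled_array.tolist())
```

Either `scaled_list` or `scaled_array` could then be handed to a Bokeh glyph method.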

In [None]:
# Import numpy library
import numpy as np

##### Bar Chart

Generate random data with NumPy's `random` module and build a bar plot.

In [None]:
#Generate data
x = np.array([1,2,3,4,5])
y = np.random.rand(5)*100

#Initiate plot objects
bar = figure(width = 500, height = 300)

#Create a vertical bar plot
bar.vbar(x = x, top = y, color = 'lightblue', width = 0.5)

#Print the plot
show(bar)

<img src="./plots/03 - Bar Numpy.png">
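The bar heights above change on every run because no seed is set. One way to make them repeatable, sketched here with NumPy's `default_rng` generator and an arbitrary seed of 12:

```python
import numpy as np

# Two generators created with the same (arbitrary) seed produce the same numbers.
rng_a = np.random.default_rng(12)
rng_b = np.random.default_rng(12)

y_a = rng_a.random(5) * 100  # five values in the range [0, 100)
y_b = rng_b.random(5) * 100

same = np.array_equal(y_a, y_b)
print(same)
```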

##### Scatter plot

Similarly, create random data arrays with NumPy and draw a scatter plot.

In [None]:
np.random.seed(12)
x = np.random.rand(10)
y = np.random.rand(10)
p = figure(width = 500, height = 300)
p.scatter(x,y)
show(p)

<img src="./plots/03 - Scatter Numpy.png">

#### Working with Pandas

Pandas is a Python library for data analysis, known for its high-performance data containers. __Series and DataFrame__ are the two main data structures of the Pandas library. Their advantage over NumPy arrays is that they carry meaningful labels (an index and column names) that explain the data. This also adds flexibility when plotting time series data.
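A small sketch of the two structures, using made-up values and hypothetical labels `a`, `b` and `c`:

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
limits = pd.Series([20000, 120000, 90000], index=['a', 'b', 'c'], name='LIMIT_BAL')

# A DataFrame is a table of labelled columns that share one index.
clients = pd.DataFrame({'AGE': [24, 26, 34], 'LIMIT_BAL': limits.values},
                       index=limits.index)

limit_b = limits['b']   # look up a value by its label
ages = clients['AGE']   # select a column by its name
print(limit_b)
print(ages)
```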

In [None]:
import pandas as pd

Let's pull the "default of credit card clients" data from the UCI repository for our experiment.

In [None]:
#default of credit card clients Data Set 
data = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls',
                     header = 1
                    )

The dataset is encoded by numeric variables. Let's quickly convert some of them into categorical variables.

In [None]:
data['SEX'] = data['SEX'].map({1:'Male', 2:'Female'})
data['EDUCATION'] = data['EDUCATION'].map({1: 'graduate school', 2: 'university', 3: 'high school', 4: 'others'})
data['MARRIAGE'] = data['MARRIAGE'].map({1: 'married', 2: 'single', 3: 'others'})
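One thing to keep in mind with `map` and a dictionary: any code that is not a key in the dictionary becomes `NaN`. A sketch on toy codes (the `0` and the `'unknown'` label are assumptions for illustration):

```python
import pandas as pd

codes = pd.Series([1, 2, 3, 0])  # toy data; 0 has no entry in the mapping below
labels = codes.map({1: 'married', 2: 'single', 3: 'others'})

n_missing = labels.isna().sum()    # number of codes that were not mapped
filled = labels.fillna('unknown')  # optional: give unmapped codes a label
print(n_missing)
print(filled.tolist())
```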

Inspect the pandas DataFrame.

In [None]:
data.head()

##### Bar Chart

In [None]:
#Count the clients in each category once and reuse the result
sex_counts = data['SEX'].value_counts()
categories = sex_counts.index.tolist()

#Initiate plot objects
bar = figure(x_range = categories, width = 500, height = 300)

#Create a vertical bar plot
bar.vbar(x = categories, top = sex_counts.values, color = 'lightblue', width = 0.5)

#Print the plot
show(bar)

<img src="./plots/03 - Bar pandas.png">
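The bar chart relies on `value_counts()`, which returns a Series whose index holds the categories and whose values hold the counts. A sketch on a toy column:

```python
import pandas as pd

sex = pd.Series(['Male', 'Female', 'Female'])  # toy stand-in for data['SEX']
counts = sex.value_counts()                    # sorted by count, largest first

categories = counts.index.tolist()  # suitable for x_range and the x positions
heights = counts.tolist()           # suitable for the bar tops
print(categories, heights)
```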

##### Scatter plot

Scatter plot with 2 dimensions (x-axis and y-axis).

In [None]:
p = figure(width = 500, height = 300)
p.scatter(data['AGE'],data['LIMIT_BAL'])
show(p)

<img src="./plots/03 - Scatter 2d.png">

Scatter plot with 3 dimensions (x-axis, y-axis and color/shape).

In [None]:
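p = figure(width = 500, height = 300)
#scatter() with a marker name replaces the per-shape methods (diamond(), square()), which newer Bokeh releases deprecate
male = data['SEX'] == 'Male'
female = data['SEX'] == 'Female'
p.scatter(data.loc[male, 'AGE'], data.loc[male, 'LIMIT_BAL'], marker = 'diamond', color = 'yellow')
p.scatter(data.loc[female, 'AGE'], data.loc[female, 'LIMIT_BAL'], marker = 'square', color = 'pink')
show(p)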

<img src="./plots/03 - Scatter 3d.png">
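With more than two categories, repeating one call per group gets tedious. A possible sketch that loops over the groups instead; the toy DataFrame and the `styles` table are made up for illustration:

```python
import pandas as pd
from bokeh.plotting import figure

# Toy stand-in for the credit card data (values are made up).
df = pd.DataFrame({
    'SEX': ['Male', 'Female', 'Male', 'Female'],
    'AGE': [24, 26, 34, 37],
    'LIMIT_BAL': [20000, 120000, 90000, 50000],
})

# Hypothetical style table: one marker and colour per category.
styles = {'Male': ('diamond', 'yellow'), 'Female': ('square', 'pink')}

p = figure(width=500, height=300)
for sex, group in df.groupby('SEX'):
    marker, color = styles[sex]
    # One scatter glyph per category, labelled in the legend.
    p.scatter(group['AGE'], group['LIMIT_BAL'],
              marker=marker, color=color, legend_label=sex)
```

Calling `show(p)` afterwards renders the plot as in the cells above.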

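These are some of the marker shapes available in the Bokeh library. Each name without the parentheses can also be passed as the `marker` argument of `scatter()`.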

In [None]:
# Available Shapes
# cross(), x(), diamond(), diamond_cross(), circle_x()
# circle_cross(), triangle(), inverted_triangle(), square()
# square_x(), square_cross(), asterisk()

#### Time Series plotting

Date and time are important dimensions of any business. We can hardly build a dashboard without date/time functionality, so it is important to understand how Bokeh handles dates and times when plotting.

Let's use the Beijing PM2.5 weather data for our analysis.

In [None]:
#Beijing PM2.5 Data Data Set 
ts = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')

Create a date variable and then parse it using pandas date-time function.

In [None]:
#Build the date from the year, month and day columns (infer_datetime_format is deprecated in recent pandas)
ts['Date'] = pd.to_datetime(ts[['year', 'month', 'day']])
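To see what the date parsing produces, here is a sketch on a two-row toy frame. It assumes the columns are named `year`, `month` and `day`, which `pd.to_datetime` can assemble directly when given a DataFrame:

```python
import pandas as pd

# Toy frame with the same column names as the weather data.
parts = pd.DataFrame({'year': [2010, 2010], 'month': [1, 12], 'day': [1, 31]})

# to_datetime assembles one timestamp per row from the three columns.
dates = pd.to_datetime(parts[['year', 'month', 'day']])
print(dates)
```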

In [None]:
ts.head()

Aggregate the data at date level for plotting the attributes.

In [None]:
dew = ts.groupby('Date')['DEWP'].mean()
tem = ts.groupby('Date')['TEMP'].mean()
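A sketch of what the aggregation does, on made-up readings: rows that share a date collapse into one row holding their mean, and the date becomes the index.

```python
import pandas as pd

# Toy readings: two for the first day, one for the second (values are made up).
readings = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02']),
    'TEMP': [-11.0, -13.0, -6.0],
})

daily = readings.groupby('Date')['TEMP'].mean()  # one mean value per date
print(daily)
```

The resulting index (`daily.index`) and values (`daily.values`) are what the line glyphs use as x and y.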

Create a time series plot using bokeh package.

In [None]:
p = figure(x_axis_type = 'datetime', width = 1000, height = 300)
p.line(x = dew.index, y = dew.values, color = 'lightblue')
p.line(x = tem.index, y = tem.values, color = 'lightgreen')
show(p)

<img src="./plots/03 - Time Series.png">

#### ColumnDataSource

The ColumnDataSource is a fundamental data structure of Bokeh: it is the object in which the data of a Bokeh plot is stored. Certain features of Bokeh plotting, such as popups (hover tooltips), rely on the ColumnDataSource data structure, so it is good practice to use it whenever possible. Bokeh implicitly assumes that all columns in a given ColumnDataSource have the same length at all times.
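A minimal sketch of building a ColumnDataSource from a dictionary and from a DataFrame. The column names and values here are made up:

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# From a dict: each key becomes a column, and all columns have equal length.
source = ColumnDataSource(data={'x': [1, 2, 3], 'y': [4, 5, 6]})

# From a DataFrame: each column is copied, and the index is added as a column too.
df = pd.DataFrame({'temp': [-12.0, -6.0], 'dewp': [-20.0, -15.0]})
source_df = ColumnDataSource(df)

names = sorted(source.column_names)
df_names = sorted(source_df.column_names)
print(names)
print(df_names)
```

Glyph methods can then refer to columns by name, for example `x = 'x', y = 'y', source = source`.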

In [None]:
from bokeh.models import ColumnDataSource

##### Time series plotting

In [None]:
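#Select several columns with a list (double brackets); tuple-style selection fails in recent pandas
col_data = ColumnDataSource(ts.groupby('Date')[['DEWP', 'TEMP']].mean().reset_index())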

In [None]:
p = figure(x_axis_type = 'datetime', width = 900, height = 300)
p.line(x = 'Date', y = 'DEWP', source = col_data, color = 'lightblue')
p.line(x = 'Date', y = 'TEMP', source = col_data, color = 'lightgreen')
show(p)

<img src="./plots/03 - Time Series CDS.png">

##### Scatter Plot

In [None]:
col_data = ColumnDataSource( data = {
    'x1': data.loc[data['SEX'] == 'Male', 'AGE'],
    'y1': data.loc[data['SEX'] == 'Male', 'LIMIT_BAL'],
    'x2': data.loc[data['SEX'] == 'Female', 'AGE'],
    'y2': data.loc[data['SEX'] == 'Female', 'LIMIT_BAL']
}
)

As explained above, Bokeh issues a warning here because the male and female columns have different numbers of entries.

In [None]:
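p = figure(width = 900, height = 300)
#scatter() with a marker name replaces circle_cross() and diamond_cross(), which newer Bokeh releases deprecate
p.scatter(x = 'x1', y = 'y1', source = col_data, marker = 'circle_cross', color = 'lightblue')
p.scatter(x = 'x2', y = 'y2', source = col_data, marker = 'diamond_cross', color = 'lightgreen')
show(p)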

<img src="./plots/03 - Scatter CDS.png">
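One way to avoid the length warning, sketched here on toy data, is to give each group its own ColumnDataSource so that every source only holds columns of equal length:

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Toy stand-in for the credit card data: one male and two female rows.
df = pd.DataFrame({
    'SEX': ['Male', 'Female', 'Female'],
    'AGE': [24, 26, 37],
    'LIMIT_BAL': [20000, 120000, 50000],
})

# One source per group; within each source all columns have the same length.
male = ColumnDataSource(df[df['SEX'] == 'Male'])
female = ColumnDataSource(df[df['SEX'] == 'Female'])

p = figure(width=900, height=300)
p.scatter(x='AGE', y='LIMIT_BAL', source=male, marker='circle_cross', color='lightblue')
p.scatter(x='AGE', y='LIMIT_BAL', source=female, marker='diamond_cross', color='lightgreen')
```

Both glyphs can now use the original column names, since each source keeps them.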

#### Summary

In this notebook we saw how to plot Bokeh graphs with the following:
* Numpy
* Pandas
* Time Series
* ColumnDataSource