# **Fundamentals of Data Visualization in Bokeh**

Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics and affords high-performance interactivity across large or streaming datasets.
Bokeh can help anyone who wants to create interactive plots, dashboards, and data applications quickly and easily.

In this blog post we use Bokeh to explore a subset of the NYC Taxi & Limousine Commission yellow taxi trip records for November 2022 and build visualizations that communicate insights from the data clearly.

In [1]:
# Import necessary libraries
import pyforest  # lazily imports common libraries such as pandas (as pd) on first use
from datetime import datetime, timedelta

In [2]:
# Loading the dataset using pandas
df = pd.read_parquet('yellow_tripdata_2022-11.parquet')
df.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-11-01 00:51:22 | 2022-11-01 00:56:24 | 1.0 | 0.6 | 1.0 | N | 151 | 151 | 2 | 4.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 5.8 | 0.0 | 0.0 |
| 1 | 1 | 2022-11-01 00:39:43 | 2022-11-01 00:48:44 | 0.0 | 1.8 | 1.0 | N | 90 | 79 | 1 | 8.5 | 3.0 | 0.5 | 3.05 | 0.0 | 0.3 | 15.35 | 2.5 | 0.0 |
| 2 | 1 | 2022-11-01 00:55:01 | 2022-11-01 01:01:35 | 0.0 | 2.0 | 1.0 | N | 137 | 141 | 1 | 8.0 | 3.0 | 0.5 | 2.36 | 0.0 | 0.3 | 14.16 | 2.5 | 0.0 |
| 3 | 1 | 2022-11-01 00:24:49 | 2022-11-01 00:31:04 | 2.0 | 1.0 | 1.0 | N | 158 | 113 | 1 | 6.0 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.8 | 2.5 | 0.0 |
| 4 | 1 | 2022-11-01 00:37:32 | 2022-11-01 00:42:23 | 2.0 | 0.8 | 1.0 | N | 249 | 158 | 2 | 5.5 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 2.5 | 0.0 |


## Data processing with pandas

In [3]:
# Inspecting the column names and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3252717 entries, 0 to 3252716
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

In [4]:
# Checking for null values
df.isnull().sum()

VendorID                      0
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          121958
trip_distance                 0
RatecodeID               121958
store_and_fwd_flag       121958
PULocationID                  0
DOLocationID                  0
payment_type                  0
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge     121958
airport_fee              121958
dtype: int64
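
The counts above are easier to weigh as a share of all rows. The following is a minimal sketch rather than part of the original notebook: `toy` and its values are made up for illustration, and the last lines apply the same arithmetic to the counts reported above (121958 nulls out of 3252717 rows).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the taxi data (values are made up for illustration)
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, np.nan],
    "trip_distance": [0.6, 1.8, 2.0, 1.0],
})

# The mean of the boolean null mask is the fraction of missing values per column
null_share = toy.isnull().mean() * 100
print(null_share.round(2))

# Same arithmetic for the counts shown in the output above
share_in_source = 121958 / 3252717 * 100
print(round(share_in_source, 2))
```

On the real data the affected columns are each missing in roughly 3.75% of rows, which is useful context when deciding whether to drop columns or rows.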

In [5]:
# Checking for duplicate rows
df.duplicated().sum()

0

In [6]:
# Dropping columns that contain null values
df.drop(columns=['passenger_count', 'RatecodeID', 'store_and_fwd_flag', 'congestion_surcharge', 'airport_fee'], inplace=True)
# Dropping other columns that are not used in this analysis
df.drop(columns=['extra', 'mta_tax', 'tip_amount', 'improvement_surcharge', 'tolls_amount'], inplace=True)
df.head()


| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | PULocationID | DOLocationID | payment_type | fare_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-11-01 00:51:22 | 2022-11-01 00:56:24 | 0.6 | 151 | 151 | 2 | 4.5 | 5.8 |
| 1 | 1 | 2022-11-01 00:39:43 | 2022-11-01 00:48:44 | 1.8 | 90 | 79 | 1 | 8.5 | 15.35 |
| 2 | 1 | 2022-11-01 00:55:01 | 2022-11-01 01:01:35 | 2.0 | 137 | 141 | 1 | 8.0 | 14.16 |
| 3 | 1 | 2022-11-01 00:24:49 | 2022-11-01 00:31:04 | 1.0 | 158 | 113 | 1 | 6.0 | 9.8 |
| 4 | 1 | 2022-11-01 00:37:32 | 2022-11-01 00:42:23 | 0.8 | 249 | 158 | 2 | 5.5 | 9.3 |
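
The list of null-containing columns was typed by hand in the cell above. As a hedged alternative, it can be derived from the data itself. This sketch is not the notebook's own code: it uses a small made-up frame (`toy`) and hypothetical names (`cols_with_nulls`, `cleaned`).

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the taxi columns (values are made up for illustration)
toy = pd.DataFrame({
    "VendorID": [1, 2, 1],
    "passenger_count": [1.0, np.nan, 2.0],
    "fare_amount": [4.5, 8.5, 8.0],
    "airport_fee": [0.0, np.nan, np.nan],
})

# Names of the columns that have at least one missing value
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

# drop(columns=...) returns a new frame and leaves `toy` unchanged
cleaned = toy.drop(columns=cols_with_nulls)

print(cols_with_nulls)
print(cleaned.columns.tolist())
```

Returning a new frame instead of using `inplace=True` keeps the original data available, at the cost of extra memory on a dataset of this size.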


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3252717 entries, 0 to 3252716
Data columns (total 9 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   trip_distance          float64       
 4   PULocationID           int64         
 5   DOLocationID           int64         
 6   payment_type           int64         
 7   fare_amount            float64       
 8   total_amount           float64       
dtypes: datetime64[ns](2), float64(3), int64(4)
memory usage: 223.3 MB


In [8]:
# Rename the pickup and drop-off timestamp columns to shorter names
df = df.rename(columns={'tpep_pickup_datetime': 'pickup_time', 'tpep_dropoff_datetime': 'dropoff_time'})
df.head()

| | VendorID | pickup_time | dropoff_time | trip_distance | PULocationID | DOLocationID | payment_type | fare_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-11-01 00:51:22 | 2022-11-01 00:56:24 | 0.6 | 151 | 151 | 2 | 4.5 | 5.8 |
| 1 | 1 | 2022-11-01 00:39:43 | 2022-11-01 00:48:44 | 1.8 | 90 | 79 | 1 | 8.5 | 15.35 |
| 2 | 1 | 2022-11-01 00:55:01 | 2022-11-01 01:01:35 | 2.0 | 137 | 141 | 1 | 8.0 | 14.16 |
| 3 | 1 | 2022-11-01 00:24:49 | 2022-11-01 00:31:04 | 1.0 | 158 | 113 | 1 | 6.0 | 9.8 |
| 4 | 1 | 2022-11-01 00:37:32 | 2022-11-01 00:42:23 | 0.8 | 249 | 158 | 2 | 5.5 | 9.3 |
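
Because both timestamp columns are `datetime64[ns]`, they can be subtracted directly. As an illustration only (the notebook does not compute this here), the sketch below derives a trip duration for the first two rows shown above; the column name `trip_minutes` is an assumption introduced for the example.

```python
import pandas as pd

# First two rows of the table above, re-entered by hand
sample = pd.DataFrame({
    "pickup_time": pd.to_datetime(["2022-11-01 00:51:22", "2022-11-01 00:39:43"]),
    "dropoff_time": pd.to_datetime(["2022-11-01 00:56:24", "2022-11-01 00:48:44"]),
})

# Subtracting two datetime columns yields a timedelta column,
# which is then converted to minutes
duration = sample["dropoff_time"] - sample["pickup_time"]
sample["trip_minutes"] = duration.dt.total_seconds() / 60

print(sample["trip_minutes"].round(2).tolist())
```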


## Data visualization with Bokeh

In [10]:
# Importing necessary functions from bokeh
from bokeh.models import DatetimeTickFormatter, NumeralTickFormatter, ColumnDataSource
from bokeh.plotting import figure, show, output_notebook, output_file

In [11]:
# Display Bokeh plots inline in the notebook
output_notebook()

In [12]:
df.columns

Index(['VendorID', 'pickup_time', 'dropoff_time', 'trip_distance',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount',
       'total_amount'],
      dtype='object')
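
To show how the imported pieces fit together, here is a minimal sketch of a Bokeh line plot built from a `ColumnDataSource`. It is an illustration under stated assumptions, not the notebook's own plot: the pickup times in `toy` are made up, and `hourly`, `hour`, and `trips` are names introduced for the example.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.plotting import figure

# Made-up pickup times standing in for df["pickup_time"]
toy = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2022-11-01 00:51:22", "2022-11-01 00:39:43",
        "2022-11-01 01:05:10", "2022-11-01 02:15:00",
        "2022-11-01 02:45:30", "2022-11-01 02:59:59",
    ])
})

# Count trips per pickup hour
hourly = (
    toy.groupby(toy["pickup_time"].dt.hour)
    .size()
    .rename_axis("hour")
    .reset_index(name="trips")
)

# A ColumnDataSource maps column names to data; glyphs refer to columns by name
source = ColumnDataSource(hourly)

p = figure(
    title="Trips per pickup hour (toy data)",
    x_axis_label="Hour of day",
    y_axis_label="Trips",
)
p.line(x="hour", y="trips", source=source, line_width=2)

# Format y-axis ticks with thousands separators
p.yaxis[0].formatter = NumeralTickFormatter(format="0,0")
```

Calling `show(p)` after `output_notebook()` would render the figure inline. Aggregating before plotting matters here: the full DataFrame has over three million rows, far more than is practical to send to the browser point by point.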