# **Fundamentals of Data Visualization in Bokeh**

Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics and affords high-performance interactivity across large or streaming datasets.
Bokeh can help anyone who wants to create interactive plots, dashboards, and data applications quickly and easily.

In this blog post, we use Bokeh to explore a subset (the first 10,000 rows) of the NYC Taxi & Limousine Commission yellow taxi trip records for November 2022, and build visualizations that communicate what the data shows.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [2]:
# Loading the dataset using pandas
df = pd.read_parquet('yellow_tripdata_2022-11.parquet')
# Keep the first 10,000 rows; copy so later column assignments do not act on a view
df = df.head(10000).copy()
df.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-11-01 00:51:22 | 2022-11-01 00:56:24 | 1.0 | 0.6 | 1.0 | N | 151 | 151 | 2 | 4.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 5.8 | 0.0 | 0.0 |
| 1 | 1 | 2022-11-01 00:39:43 | 2022-11-01 00:48:44 | 0.0 | 1.8 | 1.0 | N | 90 | 79 | 1 | 8.5 | 3.0 | 0.5 | 3.05 | 0.0 | 0.3 | 15.35 | 2.5 | 0.0 |
| 2 | 1 | 2022-11-01 00:55:01 | 2022-11-01 01:01:35 | 0.0 | 2.0 | 1.0 | N | 137 | 141 | 1 | 8.0 | 3.0 | 0.5 | 2.36 | 0.0 | 0.3 | 14.16 | 2.5 | 0.0 |
| 3 | 1 | 2022-11-01 00:24:49 | 2022-11-01 00:31:04 | 2.0 | 1.0 | 1.0 | N | 158 | 113 | 1 | 6.0 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.8 | 2.5 | 0.0 |
| 4 | 1 | 2022-11-01 00:37:32 | 2022-11-01 00:42:23 | 2.0 | 0.8 | 1.0 | N | 249 | 158 | 2 | 5.5 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 2.5 | 0.0 |


## Data Processing using pandas library

In [3]:
# Inspect column names, dtypes, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               10000 non-null  int64         
 1   tpep_pickup_datetime   10000 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  10000 non-null  datetime64[ns]
 3   passenger_count        10000 non-null  float64       
 4   trip_distance          10000 non-null  float64       
 5   RatecodeID             10000 non-null  float64       
 6   store_and_fwd_flag     10000 non-null  object        
 7   PULocationID           10000 non-null  int64         
 8   DOLocationID           10000 non-null  int64         
 9   payment_type           10000 non-null  int64         
 10  fare_amount            10000 non-null  float64       
 11  extra                  10000 non-null  float64       
 12  mta_tax                10000 non-null  float64       
 13  ti

In [4]:
# Checking for null values
df.isnull().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
airport_fee              0
dtype: int64

In [5]:
# Count fully duplicated rows
df.duplicated().sum()

0

## Data Visualization with Bokeh



In [6]:
# Importing necessary functions from bokeh
from bokeh.models import LinearColorMapper, ColorBar, NumeralTickFormatter, ColumnDataSource
from bokeh.plotting import figure, show, output_notebook
from bokeh.palettes import Viridis256
from bokeh.transform import transform

In [7]:
# Output in the notebook
output_notebook()

In [8]:
# List the column names
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

In [9]:
# Convert pickup and dropoff times to Unix timestamps (whole seconds since the epoch)
epoch = pd.Timestamp('1970-01-01')
df['tpep_pickup_unix'] = (df['tpep_pickup_datetime'] - epoch) // pd.Timedelta(seconds=1)
df['tpep_dropoff_unix'] = (df['tpep_dropoff_datetime'] - epoch) // pd.Timedelta(seconds=1)

# Calculate the duration of each trip in seconds
df['duration'] = df['tpep_dropoff_unix'] - df['tpep_pickup_unix']

# Calculate the average trip speed in miles per hour;
# zero or negative durations become NaN instead of producing inf
valid_duration = df['duration'].where(df['duration'] > 0)
df['speed'] = df['trip_distance'] / (valid_duration / 3600)
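
To sanity-check the duration and speed arithmetic, here is a minimal sketch on a hand-made three-row frame. The rows and the column names `pickup`, `dropoff`, and `miles` are invented for illustration and are not taken from the dataset. The sketch assumes a zero-length trip should give a missing speed rather than infinity.

```python
import pandas as pd

# Hypothetical toy trips: 1 mile in 6 minutes, 5 miles in 30 minutes,
# and a record whose pickup and dropoff times are identical
toy = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2022-11-01 00:00:00", "2022-11-01 01:00:00", "2022-11-01 02:00:00"]),
    "dropoff": pd.to_datetime([
        "2022-11-01 00:06:00", "2022-11-01 01:30:00", "2022-11-01 02:00:00"]),
    "miles": [1.0, 5.0, 0.0],
})

# Duration in seconds, taken directly from the timestamp difference
toy["duration"] = (toy["dropoff"] - toy["pickup"]).dt.total_seconds()

# Mask non-positive durations so the division yields NaN instead of inf
hours = toy["duration"].where(toy["duration"] > 0) / 3600
toy["speed"] = toy["miles"] / hours

print(toy[["duration", "speed"]])
```

Both real trips come out at 10 mph, and the zero-length record gets a missing speed, so it cannot distort a later `mean()`.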

In [10]:
# Create a line plot of average trip speed by time of day
df['pickup_hour'] = pd.to_datetime(df['tpep_pickup_datetime']).dt.hour
speed_by_hour = df.groupby('pickup_hour')['speed'].mean()
p = figure(title="Average Trip Speed by Time of Day", x_axis_label="Hour of Day", y_axis_label="Speed (mph)")
p.line(speed_by_hour.index, speed_by_hour.values)
show(p)
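
The line plot above is static apart from the default toolbar. As a sketch of how Bokeh's interactivity could be layered on, the example below puts hourly averages in a `ColumnDataSource` and attaches a `HoverTool`. The four values are hypothetical stand-ins for `speed_by_hour`, and the column names `hour` and `speed` are chosen here for illustration.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical hourly averages standing in for speed_by_hour
speed_by_hour = pd.Series([14.2, 15.8, 11.3, 9.7], index=[0, 6, 12, 18])

# A ColumnDataSource lets glyphs and tooltips refer to columns by name
source = ColumnDataSource(data={
    "hour": speed_by_hour.index.tolist(),
    "speed": speed_by_hour.tolist(),
})

p = figure(title="Average Trip Speed by Time of Day",
           x_axis_label="Hour of Day", y_axis_label="Speed (mph)",
           tools="pan,wheel_zoom,reset")
p.line("hour", "speed", source=source, line_width=2)

# Draw markers on top of the line so there is a clear hover target
points = p.scatter("hour", "speed", source=source, size=8)

# Show the hour and the speed (one decimal place) when hovering over a marker
p.add_tools(HoverTool(renderers=[points],
                      tooltips=[("Hour", "@hour"), ("Speed", "@speed{0.0} mph")]))
```

Calling `show(p)` afterwards renders the plot. With the real data, the same pattern should work by building the source from `speed_by_hour` as computed in the cell above.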


In [11]:
# Create the figure
fig = figure(title="Pickup and Dropoff Times", x_axis_label='Pickup Time', y_axis_label='Dropoff Time',
             x_axis_type='datetime', y_axis_type='datetime')

# Add the scatter plot
fig.scatter(x=df['tpep_pickup_datetime'], y=df['tpep_dropoff_datetime'], size=3, alpha=0.5)
show(fig)

In [12]:
# Convert pickup times to milliseconds since the epoch, the unit Bokeh's datetime axis expects
pickup_ms = (df['tpep_pickup_datetime'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(milliseconds=1)

# Create the figure
fig = figure(title="Histogram of Pickup Times", x_axis_label='Pickup Time', y_axis_label='Frequency',
             x_axis_type='datetime')

# Add the histogram
hist, edges = np.histogram(pickup_ms, bins=100)
fig.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
         fill_color='blue', line_color='white', alpha=0.5)

# Format the y-axis ticks as integers
fig.yaxis.formatter = NumeralTickFormatter(format='0')
show(fig)
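
The `quad` call relies on how `np.histogram` lays out its results: for `n` bins it returns `n` counts and `n + 1` edges, so `edges[:-1]` and `edges[1:]` give the left and right side of each bar. A small sketch with made-up values shows the pairing:

```python
import numpy as np

# Hypothetical sample values, chosen so the bin edges are easy to read
values = np.array([1, 2, 2, 3, 3, 3, 9])

# Four equal-width bins spanning min(values)..max(values)
hist, edges = np.histogram(values, bins=4)

# Pair each count with the left and right edge of its bin,
# exactly what quad(left=..., right=..., top=...) consumes
left, right = edges[:-1], edges[1:]
for count, lo, hi in zip(hist, left, right):
    print(f"[{lo:.0f}, {hi:.0f}): {count}")
```

Every bin is half-open except the last, which also includes its right edge. That is why the value 9 is counted in the final bin.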


In [13]:
# Group trips by pickup location and count the number of trips
grouped = df.groupby('PULocationID').size()
pickup_locations = pd.DataFrame({'location': grouped.index, 'count': grouped.values})

# Create the figure; PULocationID is a taxi zone ID, not a coordinate
fig = figure(title="Trip Counts by Pickup Location", x_axis_label='Pickup Location ID',
             y_axis_label='Number of Trips')

# Define the color mapper and color bar
color_mapper = LinearColorMapper(palette=Viridis256, low=0, high=int(pickup_locations['count'].max()))
color_bar = ColorBar(color_mapper=color_mapper, location=(0, 0))

# Add one bar per location, rising from zero and colored by trip count
fig.vbar(x='location', top='count', bottom=0, width=1, source=pickup_locations,
         fill_color=transform('count', color_mapper))

# Add the color bar
fig.add_layout(color_bar, 'right')
show(fig)

In [14]:
# Create a histogram of trip durations
p = figure(title="Trip Durations", x_axis_label="Duration (seconds)", y_axis_label="Count", x_range=(0, 6000))
hist, edges = np.histogram(df['duration'], bins=500)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white")
show(p)


In [15]:
# Create a scatter plot of trip speed vs. trip distance
p = figure(title="Trip Speed vs. Trip Distance", x_axis_label="Distance (miles)", y_axis_label="Speed (mph)", y_range=(0, 80))
p.scatter(df['trip_distance'], df['speed'], size=3, alpha=0.5)
show(p)
