<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png" style="width:50%">


# RAPIDS Visualization Guide Notebook
### A Streamlined Guide to RAPIDS Accelerated Visualization and Visual Analytics
The guide will walk through using RAPIDS [cuDF](https://github.com/rapidsai/cudf), [cuML](https://github.com/rapidsai/cuml), [cuGraph](https://github.com/rapidsai/cugraph) and [cuxfilter](https://github.com/rapidsai/cuxfilter), with [hvPlot](https://hvplot.holoviz.org/), [Datashader](https://datashader.org/), and [Plotly Dash](https://github.com/plotly/dash) with the publicly available [Divvy Bike share dataset](https://divvybikes.com/system-data).

<img src="images/cuxfilter-graph.png" style="width:50%">

## Requirements
A system and GPU that meet the [RAPIDS system and GPU requirements](https://docs.rapids.ai/install#system-req)

## Dependencies
Use the cell below to import the required dependencies; install them in your preferred environment first:

In [None]:
# sys imports
import os
from zipfile import ZipFile
from pathlib import Path
import math

# rapids etc
import cudf
import cugraph
import cuml
import cuxfilter
import cupy

# holoviz and geo
from bokeh.models import NumeralTickFormatter
import hvplot.cudf

hvplot.extension("bokeh")
import colorcet
import panel as pn
import cartopy
from pyproj import Proj, Transformer

# plotly
import plotly.graph_objects as go
import plotly.express as px
from dash import Dash, html, dcc, callback, Output, Input, ctx
from dash.exceptions import PreventUpdate

## Download Dataset
The dataset can be downloaded from the [Divvy Bike Share public dataset](https://divvybikes.com/system-data). Use the following cell to download data for the desired date range and load it into a dataframe. You can also download the data manually. 

**NOTE:** 2021 + 2022 full year datasets have over 11,000,000 trips and may use up to 24GB of GPU memory.

In [None]:
# Define the URL of the Divvy trip data and save dir
S3 = "https://divvy-tripdata.s3.amazonaws.com/"
DATA_DIR = "./data"

In [None]:
# Check for data directory
Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

# Download the zip files from the URL within date range and unzip
# NOTE: range() excludes its end value, so range(2021, 2022) downloads 2021 only.
# Use range(2021, 2023) for 2021 + 2022 (over 11M trips, which requires at least a 24GB GPU)
for year in range(2021, 2022):
    for month in range(1, 13):
        file = f"{year}{month:02d}-divvy-tripdata.zip"
        URL = f"{S3}{file}"
        file_path = os.path.join(DATA_DIR, file)

        if os.path.exists(file_path):
            print(f"Skipping download for {file}. File already exists.")
        else:
            print(f"Downloading {file}...")
            ! wget -P {DATA_DIR} {URL}

        # Named zf to avoid shadowing the built-in zip()
        with ZipFile(file_path) as zf:
            zf.extractall(DATA_DIR)
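
The year loop above relies on Python's `range`, whose end value is exclusive. As a minimal sketch, the archive names for an inclusive span of years could be listed as follows; `divvy_archive_names` is a hypothetical helper, and the file-name pattern is copied from the download cell.

```python
def divvy_archive_names(first_year, last_year):
    """Return monthly archive names for first_year..last_year inclusive."""
    return [
        f"{year}{month:02d}-divvy-tripdata.zip"
        # range() excludes its end value, hence last_year + 1
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


names_2021 = divvy_archive_names(2021, 2021)
names_2021_2022 = divvy_archive_names(2021, 2022)
print(len(names_2021), names_2021[0], names_2021[-1])
print(len(names_2021_2022))
```

Listing the names first makes it easy to check how many files a date range will pull before starting the download.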

### Load into cuDF

In [None]:
# Load all individual csv files as dataframes and combine into one cudf
df_array = []

for file in Path(DATA_DIR).rglob("20*.csv"):
    gdf = cudf.read_csv(file)
    df_array.append(gdf)

df = cudf.concat(df_array)

# Check the data
df.reset_index()

## Reformat and Clean Data

The data seems unreasonably clean, but there are still a few things we should fix. Let's check for blanks and nulls.

In [None]:
df.isnull().sum()

In [None]:
# Drop end_lat nulls, we will look into station names later
df = df.dropna(subset=["end_lat"])
df.isnull().sum()

In [None]:
# Replace null values in 'start_station_name' with 'none'
df["start_station_name"] = df["start_station_name"].fillna("none")

# Replace null values in 'end_station_name' with 'none'
df["end_station_name"] = df["end_station_name"].fillna("none")

In [None]:
# Since this is only for the Chicago area, let's remove any odd out-of-area lat/lng values
min_lat = 41.5
max_lat = 42.5
min_lng = -88.0
max_lng = -87.0

df = df[
    (df["start_lat"] >= min_lat)
    & (df["start_lat"] <= max_lat)
    & (df["start_lng"] >= min_lng)
    & (df["start_lng"] <= max_lng)
    & (df["end_lat"] >= min_lat)
    & (df["end_lat"] <= max_lat)
    & (df["end_lng"] >= min_lng)
    & (df["end_lng"] <= max_lng)
]

In [None]:
# Let's check for correct data types
df.dtypes

In [None]:
# Convert started_at and ended_at into datetimes
df["started_at"] = cudf.to_datetime(df["started_at"])
df["ended_at"] = cudf.to_datetime(df["ended_at"])

In [None]:
# Extract values out of date time for easier filtering, assuming we just care about start times
df["year"] = df["started_at"].dt.year
df["month"] = df["started_at"].dt.month
df["day"] = df["started_at"].dt.day
df["hour"] = df["started_at"].dt.hour
df["day_of_week"] = df["started_at"].dt.dayofweek

In [None]:
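# Calculate trip duration in minutes
# NOTE: .dt.seconds is the within-day seconds component, so whole days are not counted
# and dur_min cannot exceed 1440 (24 hours)
df["dur_min"] = df["ended_at"] - df["started_at"]
df["dur_min"] = (
    (df["dur_min"].dt.seconds / 60).round().astype("float32")
)  # NOTE: Float32 needed for cuML KDE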
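
The duration cell uses `.dt.seconds`, which in the pandas-style API is the within-day seconds component rather than the full elapsed time. The standard-library `timedelta` makes the same distinction, so a CPU-only illustration with made-up timestamps looks like this:

```python
from datetime import datetime, timedelta

start = datetime(2021, 6, 1, 8, 0)
end = start + timedelta(days=1, minutes=30)  # a made-up 24.5 hour "trip"
elapsed = end - start

# .seconds is only the within-day component (0..86399)
component_minutes = elapsed.seconds / 60
# total_seconds() also counts whole days
total_minutes = elapsed.total_seconds() / 60

print(component_minutes, total_minutes)
```

If multi-day rentals matter for an analysis, the total elapsed time is the safer quantity; for this guide the within-day value simply caps durations at 24 hours.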

In [None]:
# Do some minor cleanup for cols we will not need
df = df.drop(
    [
        "ride_id",
        "started_at",
        "ended_at",
        "start_station_id",
        "end_station_id",
    ],
    axis=1,
).reset_index(drop=True)
df

## Start Visualizing 

In [None]:
# Simple groupby to see members vs casual users trips
rider_type = df.groupby("member_casual").size().rename("count").reset_index()
rider_type

### Using hvPlot bar, line, and heatmaps

In [None]:
# Using hvplot is as simple as replacing the typical pandas .plot() with .hvplot().
# Often hvplot will be able to deduce chart types from the data.
# You can also directly specify charts with '.hvplot.bar()' inline or as '.hvplot(kind="bar")'
rider_type.hvplot.bar(
    x="member_casual", y="count", title="Total Rider Types", yformatter="%0.0f"
)

In [None]:
# Now let's check bike type counts
bike_type = df.groupby("rideable_type").size().rename("count").reset_index()
bike_type.hvplot.bar(
    x="rideable_type", y="count", title="Total Bike Types", yformatter="%0.0f"
)

### Preattentive Visual Processing
Even for simple multi-column values, bar charts make differences in magnitude much more apparent. This is because of a concept called [preattentive visual processing](https://www.interaction-design.org/literature/article/preattentive-visual-properties-and-how-to-use-them-in-information-visualization), which acts as a shortcut that lets your brain understand large amounts of data quickly.

Using [appropriate data visualization design principles](https://flowingdata.com/data-points/) can help you effectively leverage this latent ability and is one of the reasons visualization is a powerful tool in data analysis. 

In [None]:
# Let's dig more into the data
# NOTE: Day of week mapping is 0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'
day_counts = (
    df.groupby("day_of_week")
    .size()
    .rename("count")
    .reset_index()
    .sort_values("day_of_week")
)
day_counts.hvplot.bar(
    "day_of_week",
    "count",
    title="Trip starts per Week Day",
    yformatter="%0.0f",
)
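
The numeric `day_of_week` values are easier to read with labels. A minimal sketch of the mapping noted in the cell above, using only the standard library; `DAY_LABELS` and `day_label` are hypothetical names introduced here for illustration:

```python
from datetime import date

# Monday=0 ... Sunday=6, matching the mapping noted in the cell above
DAY_LABELS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_label(day_of_week):
    """Translate a 0-6 day index into a short label."""
    return DAY_LABELS[day_of_week]


# date.weekday() uses the same Monday=0 convention
sample = date(2021, 1, 4)  # a Monday
print(day_label(sample.weekday()))
```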

In [None]:
# Using hvplot for histograms, it's easy to set bin sizes
df.hvplot.hist(
    y="dur_min", bins=20, title="Trips Duration Histogram", yformatter="%0.0f"
)

### Using cuML + KDE

In [None]:
# While we can increase the plot bin size, let's verify the distribution with KDE
# Options: start, end, step size
dur_range = cupy.arange(1.0, 400.0, 5.0)
kde = cuml.KernelDensity(kernel="gaussian", bandwidth=3).fit(df["dur_min"])
log_density_values = kde.score_samples(dur_range)
density_values = cupy.exp(log_density_values)

# create a dataframe
density_df = cudf.DataFrame({"duration": dur_range, "density": density_values})

# plot using hvplot
density_df.hvplot.line(
    x="duration",
    y="density",
    xlabel="Data",
    ylabel="Density",
    title="Duration in min KDE",
)
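
For intuition about what the log-density values represent, here is a plain-Python sketch of a Gaussian kernel density estimate. It is a CPU illustration with made-up durations, not the cuML implementation, and `gaussian_kde_log_density` is a hypothetical helper name.

```python
import math


def gaussian_kde_log_density(samples, x, bandwidth):
    """Log of the Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )
    return math.log(total / norm)


durations = [5.0, 7.0, 8.0, 12.0, 30.0]  # made-up trip minutes
log_density = gaussian_kde_log_density(durations, 8.0, bandwidth=3.0)
density = math.exp(log_density)  # same exp() step as in the cell above
print(round(density, 4))
```

A larger bandwidth smooths the curve and a smaller one follows individual samples more closely, which is the trade-off behind the `bandwidth=3` choice above.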

In [None]:
# That is a very long tail of long rides. The 1440 min (24 hour) cap follows from .dt.seconds, which drops whole days.
df.loc[df["dur_min"].argsort().tail(200)]

In [None]:
# Group trips by hour and average duration
trips_by_hour = (
    df.groupby("hour").size().rename("count").reset_index().sort_values("hour")
)
avg_duration_by_hour = (
    df.groupby("hour")["dur_min"]
    .mean()
    .rename("duration_mean")
    .reset_index()
    .sort_values("hour")
)

# Add the charts side by side with the + operator
trips_by_hour.hvplot.bar(
    "hour", "count", title="Trip Starts per Hour", yformatter="%0.0f"
) + avg_duration_by_hour.hvplot.bar(
    "hour", "duration_mean", title="Trip Duration per Hour", yformatter="%0.0f"
)

In [None]:
# There appears to be an abnormal number of long trip durations late into the night.
# Let's filter out excessively long trips (or forgotten bikes) at the start of the long tail
trips_by_hour_300 = (
    df[df["dur_min"] <= 300]
    .groupby("hour")
    .size()
    .rename("count")
    .reset_index()
    .sort_values("hour")
)
avg_duration_by_hour_300 = (
    df[df["dur_min"] <= 300]
    .groupby("hour")["dur_min"]
    .mean()
    .rename("duration_mean")
    .reset_index()
    .sort_values("hour")
)

# Plot again
trips_by_hour_300.hvplot.bar(
    "hour",
    "count",
    title="Trip Starts per Hour (under 300min)",
    yformatter="%0.0f",
) + avg_duration_by_hour_300.hvplot.bar(
    "hour",
    "duration_mean",
    title="Trip Duration per Hour (under 300min)",
    yformatter="%0.0f",
)

In [None]:
# Let's group data by day_of_week and hour, then count the number of rows in each group
heatmap_data_dw = (
    df.groupby(["day_of_week", "hour"]).size().rename("count").reset_index()
)

# HvPlot heatmap
heatmap_data_dw.hvplot.heatmap(
    x="day_of_week", y="hour", C="count", title="Trips by Hour and Day of Week"
)

In [None]:
# Let's break the grouped data down by month, day_of_week and hour, then count the number of rows in each group
heatmap_data_dwm = (
    df.groupby(["month", "day_of_week", "hour"])
    .size()
    .rename("count")
    .reset_index()
)

# By adding the 'groupby' option to our month value, hvPlot automatically adds a widget so we can slide through each month
# We can see distinct patterns of increased weekend use during the warmer months, but fairly consistent use during weekday commute hours
heatmap_data_dwm.hvplot.heatmap(
    x="day_of_week",
    y="hour",
    C="count",
    groupby="month",
    widget_location="left_top",
    title="Trips by Hour and Day of Week per Month",
)

### Using hvPlot geospatial maps

In [None]:
# Let's verify the lat/lng data is good by using the hvplot 'geo=True' option with a hexbin map
# 'gridsize' adjusts the hex bin sizing, and 'cmap' is the color scale. We are using perceptually accurate colorcet presets: https://colorcet.holoviz.org/
# The list of available tiles are here: https://holoviews.org/reference/elements/bokeh/Tiles.html
# The map seems to match that of the system data: https://account.divvybikes.com/map

trip_starts = df.hvplot.hexbin(
    x="start_lng",
    y="start_lat",
    cmap=colorcet.bgy,
    geo=True,
    tiles="OSM",
    logz=False,
    gridsize=150,
    width=700,
    height=600,
    title="Trip Start Counts",
)
trip_ends = df.hvplot.hexbin(
    x="end_lng",
    y="end_lat",
    geo=True,
    cmap=colorcet.bgy,
    tiles="OSM",
    logz=False,
    gridsize=150,
    width=700,
    height=600,
    title="Trip End Counts",
)
trip_starts + trip_ends

### Using cupy to calculate Haversine distance

In [None]:
def haversine_distance_cupy(lon1, lat1, lon2, lat2, earth_radius_km=6371.0):
    """
    Calculate the Haversine distance between two sets of points.
    """
    # Convert degrees to radians
    lon1_rad = cupy.radians(lon1.values)
    lat1_rad = cupy.radians(lat1.values)
    lon2_rad = cupy.radians(lon2.values)
    lat2_rad = cupy.radians(lat2.values)

    # Differences
    dlon = lon2_rad - lon1_rad
    dlat = lat2_rad - lat1_rad

    # Haversine formula
    a = (
        cupy.sin(dlat / 2) ** 2
        + cupy.cos(lat1_rad) * cupy.cos(lat2_rad) * cupy.sin(dlon / 2) ** 2
    )
    c = 2 * cupy.arcsin(
        cupy.sqrt(a)
    )  # Alternatively: 2 * cupy.arctan2(cupy.sqrt(a), cupy.sqrt(1 - a))

    distances_km = earth_radius_km * c
    return distances_km


# Calculate distance
distances_in_km = haversine_distance_cupy(
    df.start_lng, df.start_lat, df.end_lng, df.end_lat
)
# Add the distances back into the dataframe, convert from km to m, and round the values to make it more obvious if a trip stopped at the same place it started
dist_m = cudf.Series(distances_in_km).values * 1000
df["dist_m"] = dist_m.round().astype("int32")
df
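
As a quick sanity check on the formula, the same Haversine calculation can be run for a single pair of points with the standard library. This is a scalar CPU sketch for verification only; `haversine_km` is a hypothetical helper and the coordinates are made up.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2, earth_radius_km=6371.0):
    """Great-circle distance in km between two lon/lat points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return earth_radius_km * 2 * math.asin(math.sqrt(a))


# A trip that ends where it started has zero distance
same_spot = haversine_km(-87.63, 41.88, -87.63, 41.88)
# One degree of latitude is roughly 111 km on a 6371 km sphere
one_degree_north = haversine_km(-87.63, 41.0, -87.63, 42.0)
print(same_spot, round(one_degree_north, 2))
```

Checking a zero-distance case and a one-degree case is a cheap way to catch swapped lon/lat arguments before running the GPU version on millions of rows.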

In [None]:
# By comparing distance, let's quickly check how many trips start and end at the same spot. Interestingly, electric bikes don't dramatically increase distance traveled.
# Including the 'by' term creates a stacked bar chart. Clicking on the legend will show/hide values.
returns = df.hvplot.hist(
    y="dist_m",
    by="rideable_type",
    bins=80,
    title="Trips Distance By Type",
    yformatter="%0.0f",
)
no_returns = df[df["dist_m"] > 0].hvplot.hist(
    y="dist_m",
    by="rideable_type",
    bins=80,
    title="Trips Distance By Type (W/O Returns)",
    yformatter="%0.0f",
)

returns + no_returns

### Using cuxfilter dashboards

In [None]:
# FIX-NOTE: adding extension here explicitly RELOADS bokeh js and css so cuxfilter plots work
hvplot.extension("bokeh")

In [None]:
# Having multiple cross-filtered charts allows for quick discovery of patterns without manually configuring individual queries.
# cuxfilter is specifically designed for creating cross-filtering dashboards quickly.
# By clicking through various ranges, distinct weekday-weekend and day-evening patterns emerge.

# Load the data
cux_df = cuxfilter.DataFrame.from_dataframe(df)

# Chart options
charts = [
    cuxfilter.charts.bar("dist_m", data_points=20, title="Distance in M"),
    cuxfilter.charts.bar("dur_min", data_points=20, title="Duration in Min"),
    cuxfilter.charts.bar("day_of_week", title="Day of Week"),
    cuxfilter.charts.bar("hour", title="Trips per Hour"),
    cuxfilter.charts.bar("day", title="Trips per Day"),
    cuxfilter.charts.bar("month", title="Trips per Month"),
]

# Elements for the side panel
widgets = [cuxfilter.charts.multi_select("year")]

# Generate the dashboard with selected layout and theme
d = cux_df.dashboard(
    charts,
    sidebar=widgets,
    layout=cuxfilter.layouts.two_by_three,
    theme=cuxfilter.themes.rapids,
    title="Bike Trips Dashboard",
)

# Update the yaxis ticker to a more readable format
for i in charts:
    if hasattr(i.chart, "yaxis"):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")


# d.show() creates a button to open a dashboard in a full separate tab
# d.app() opens the app inline of the notebook
# d.stop() stops the dashboard
d.show()

### cuxfilter dashboard screenshot
<img src="images/cuxfilter-bars.png" style="width:50%">

### Using hvPlot + Datashader + Panel

In [None]:
# Checking below, there are more trip start locations than stations, so bikes must be able to start and stop outside of stations.
print(
    "Unique start station names: %d" % df["start_station_name"].unique().size
)
print(
    "Unique rounded start latitudes: %d"
    % df["start_lat"].round(4).unique().size
)

In [None]:
# Rendering every individual point for large datasets is usually prohibitively slow, but by using Datashader with hvPlot via 'datashade=True', the points are interactively aggregated.
# Setting 'dynspread=True' enlarges individual points so they are more visible.
# Though it's useful for creating full data apps, Panel can also be used for some layout help: https://panel.holoviz.org/
# Trips, while clustered, seem to disperse.

# Create two datashader charts side by side
# NOTE: Might take a few seconds
start_elec = df.hvplot.points(
    x="start_lng",
    y="start_lat",
    geo=True,
    tiles="CartoDark",
    width=700,
    height=500,
    datashade=True,
    dynspread=True,
    title="Trip Starts",
)
end_elec = df.hvplot.points(
    x="end_lng",
    y="end_lat",
    geo=True,
    tiles="CartoDark",
    width=700,
    height=500,
    datashade=True,
    dynspread=True,
    title="Trip Stops",
)
elec_row = pn.Row(start_elec, end_elec)
elec_row

In [None]:
# Let's investigate further by getting a df of the station names
# Remove none
start_stations = df[df["end_station_name"] != "none"]

# Drop duplicates
unique_stations = start_stations.drop_duplicates(subset="end_station_name")

In [None]:
# Let's overlay station points with bike trips, using trip ends since they are more dispersed. Clearly a bike trip has no guarantee that it will start or end near a station.
# Since there are only a few hundred stations we can use the standard hvPlot point rendering. For the trips, we will use 'rasterize=True', which aggregates the points into a grid that the plotting backend colormaps, so hover can report values.
# NOTE: Since using the * operator combines charts, only one chart needs tiles enabled, otherwise the tiles would cover the data.
raster = df.hvplot.points(
    x="end_lng",
    y="end_lat",
    geo=True,
    tiles="CartoDark",
    projection=cartopy.crs.GOOGLE_MERCATOR,
    hover=True,
    width=700,
    height=500,
    rasterize=True,
)
station_points = unique_stations.hvplot.points(
    x="end_lng",
    y="end_lat",
    geo=True,
    tiles=False,
    projection=cartopy.crs.GOOGLE_MERCATOR,
    hover=False,
    width=700,
    height=500,
    color="red",
    alpha=0.5,
)

raster * station_points

### Using cuML + Kmeans

In [None]:
# In order to generate a graph visualization of trips without each individual trip becoming a node, we need to cluster trips into fewer nodes. Let's use cuML's K-Means.

# Combine all lat values
lat_df = cudf.DataFrame()
lat_df["lat"] = cudf.concat(
    [df["start_lat"], df["end_lat"]], ignore_index=True
)

# Combine all lng values
lng_df = cudf.DataFrame()
lng_df["lng"] = cudf.concat(
    [df["start_lng"], df["end_lng"]], ignore_index=True
)

# Combine lat lng
combined_lat_lng_df = cudf.concat([lat_df, lng_df], axis=1)

In [None]:
# Perform k-means clustering, from the approximate station count with a bit of headroom
kmeans = cuml.cluster.KMeans(
    n_clusters=unique_stations.shape[0] + 20,
    oversampling_factor=1.5,
    max_iter=300,
)

# NOTE: This will take a few min on larger datasets
kmeans.fit(combined_lat_lng_df)

# Get the cluster labels
cluster_labels = kmeans.labels_

In [None]:
# Get the edge list from the computed clusters by splitting the combined labels back into start and end halves
half_length = len(cluster_labels) // 2

# Create edge list df
edge_list_df = cudf.DataFrame(
    {
        "src": cluster_labels[:half_length]
        .reset_index(drop=True)
        .astype("int16"),
        "dst": cluster_labels[half_length:]
        .reset_index(drop=True)
        .astype("int16"),
    }
)

# Check
edge_list_df
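
The split works because the start points were concatenated ahead of the end points before clustering, so the first half of the labels belongs to trip starts and the second half to trip ends. A toy sketch with plain lists and made-up labels (the `label_of` lookup stands in for the fitted K-Means model):

```python
starts = ["A", "B", "C"]  # made-up trip start locations
ends = ["B", "C", "A"]  # made-up trip end locations, same row order
combined = starts + ends  # starts first, then ends

# Stand-in for kmeans.labels_: one cluster label per combined row
label_of = {"A": 0, "B": 1, "C": 2}
labels = [label_of[point] for point in combined]

# The first half are source clusters, the second half destination clusters
half = len(labels) // 2
src = labels[:half]
dst = labels[half:]
edges = list(zip(src, dst))
print(edges)
```

Row order is what ties each source to its destination, which is why the edge list cell resets the index on both halves.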

In [None]:
# Get the cluster centers or Nodes
node_centers_df = kmeans.cluster_centers_

# Clean up
node_centers_df = node_centers_df.rename(
    columns={0: "node_lat", 1: "node_lng"}
).astype("float32")

# Save node centers
node_centers_df.to_parquet("./data/kmean_node_center.parquet")

In [None]:
# Add the edge list back into the original df
df = cudf.concat([df, edge_list_df], axis=1)

In [None]:
# Let's verify the clustering worked by overlaying each node with the previous station points and raster end trips map
# Looks pretty good, as each node is within about a two-block tolerance, but more importantly each point is now associated with a nearby node. Visually, purple means good overlap and more blue means better coverage.
cluster_map = node_centers_df.hvplot.points(
    x="node_lng",
    y="node_lat",
    geo=True,
    tiles=False,
    projection=cartopy.crs.GOOGLE_MERCATOR,
    hover=False,
    width=800,
    height=600,
    color="blue",
    alpha=0.8,
)

# Overlay multiple hvPlot geospatial plots using the * operator
# NOTE: Only enable 'tiles' on one plot, otherwise the map tiles overlay the data
raster * cluster_map * station_points

## Reduce and Save

In [None]:
# The dataframe is becoming large, especially with float64 values. Now that K-Means is complete, we don't need that level of precision.
# Save our original work to file
df.to_parquet("./data/bike_df_full.parquet")

# Reduce value type
df_dur = df["dur_min"].astype("int16")

# Reduce values to float32
df_geo = df[["start_lat", "start_lng", "end_lat", "end_lng"]].astype("float32")

# Drop redundant values
df = df.drop(
    ["start_lat", "start_lng", "end_lat", "end_lng", "dur_min"], axis=1
)

# Recombine
df = cudf.concat([df, df_dur, df_geo], axis=1)

# Save the minimized data to file
df.to_parquet("./data/bike_df_clean.parquet")

# Check dtypes
df.dtypes

### Transforming map projection

In [None]:
# Reload the data if needed
df = cudf.read_parquet("./data/bike_df_clean.parquet")

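# cuxfilter needs the lat/lng coordinates explicitly transformed from EPSG:4326 to EPSG:3857
# NOTE: with this axis order pyproj takes (lat, lng) and returns (x, y), so 'end_lat' ends up holding x and 'end_lng' holding y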
transform_4326_to_3857 = Transformer.from_crs("epsg:4326", "epsg:3857")

# Update the df
df["end_lat"], df["end_lng"] = transform_4326_to_3857.transform(
    df["end_lat"].values_host, df["end_lng"].values_host
)
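
For reference, the spherical Web Mercator (EPSG:3857) forward projection can be written out by hand. This is a standard-library sketch for illustration and spot checks, not a replacement for pyproj; `to_web_mercator` is a hypothetical helper and the sample coordinate is made up.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis used by EPSG:3857


def to_web_mercator(lat_deg, lng_deg):
    """Return (x, y) in metres for a latitude/longitude in degrees."""
    x = EARTH_RADIUS_M * math.radians(lng_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Same (lat, lng) in, (x, y) out ordering as the transformer call above
x, y = to_web_mercator(41.88, -87.63)
print(round(x), round(y))
```

Longitude maps linearly to x while latitude is stretched toward the poles, so a Chicago-area point ends up with a negative x and a positive y.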

### Using cuGraph + ForceAtlas2

In [None]:
# Initialize cuGraph
G = cugraph.Graph()

# Create an edgelist from nodes in df
G.from_cudf_edgelist(df, source="src", destination="dst")

# Save out edgelist
edges = G.edges()

In [None]:
# NOTE: It may take a few iterations to dial in the values
ITERATIONS = 600
THETA = 5.0
OPTIMIZE = True

# Using the previously created edge list, we calculate the FA2 layout positions
trips_FA_df = cugraph.layout.force_atlas2(
    G,
    max_iter=ITERATIONS,
    strong_gravity_mode=True,
    outbound_attraction_distribution=False,
    lin_log_mode=False,
    barnes_hut_optimize=OPTIMIZE,
    barnes_hut_theta=THETA,
    verbose=False,
)

In [None]:
# Combine previous df with the graph node FA2 positions
graph_df = trips_FA_df.merge(
    df, left_on="vertex", right_on="dst", suffixes=("", "_original")
)

# Check df
graph_df

### Using advanced cuxfilter spatial dashboards

In [None]:
# FIX-NOTE: adding extension here explicitly RELOADS bokeh and all plots will work
hvplot.extension("bokeh")

In [None]:
# cuxfilter can quickly create complicated dashboards integrated with RAPIDS
# In this instance, the clustering process has worked and each node corresponds well to other nearby nodes. We can also see several day/night, seasonal, and outward-to-inward trip patterns emerge.
# Specifying a cuxfilter graph chart type will use Datashader and its required parameters
cx_df = cuxfilter.DataFrame.load_graph((graph_df, edges))

# Graph chart with src and dst
graph = cuxfilter.charts.graph(
    edge_source="src",
    edge_target="dst",
    node_x="x",
    node_y="y",
    unselected_alpha=0.2,
    edge_color_palette=["gray", "black"],
    node_pixel_shade_type="linear",
    edge_transparency=0.2,
    title="ForceAtlas2 Trip Graph",
)

# Geospatial scatter chart
scatter = cuxfilter.charts.scatter(
    x="end_lat",
    y="end_lng",
    unselected_alpha=0.1,
    pixel_shade_type="eq_hist",
    tile_provider="CartoDark",
    title="Trip Endpoints",
)

# Bar and table charts
bar1 = cuxfilter.charts.bar("dur_min", data_points=20, title="Duration in Min")
bar2 = cuxfilter.charts.bar("hour", title="Trips per Hour")
bar3 = cuxfilter.charts.bar("day_of_week", title="Trips per Day of Week")
bar4 = cuxfilter.charts.bar("month", title="Trips per Month")
table1 = cuxfilter.charts.view_dataframe(
    ["start_station_name", "end_station_name"], drop_duplicates=True
)

# Custom layout as explained here: https://docs.rapids.ai/api/cuxfilter/stable/layouts/layouts/
# NOTE: by clicking on each card's upper left arrow, you can dynamically move the card to a new location. Clicking on the lower right side of the card enables you to resize it.
layout_array = [[1, 1, 1, 2, 2], [3, 4, 5, 6, 7]]

# Generate the dashboard, order the charts, select a layout, and set theme
d = cx_df.dashboard(
    [graph, scatter, bar1, bar2, bar3, bar4, table1],
    layout_array=layout_array,
    theme=cuxfilter.themes.rapids,
    title="Divvy Bike Trip Clustering",
)


# d.show() creates a button to open a dashboard in a full separate tab
# d.app() opens the app inline of the notebook
# d.stop() stops the dashboard
d.show()

### Advanced cuxfilter spatial dashboard screenshot
<img src="images/cuxfilter-graph.png" style="width:50%">

### Using Plotly Dash + cuGraph

In [None]:
# While the above cuxfilter dashboard is powerful, it needs to be simplified to make it more widely accessible and digestible.
# Reload the data if needed
df = cudf.read_parquet("./data/bike_df_clean.parquet")
node_centers_df = cudf.read_parquet("./data/kmean_node_center.parquet")

# Save only the required sections
plotly_df = df[
    [
        "rideable_type",
        "member_casual",
        "year",
        "month",
        "day",
        "hour",
        "day_of_week",
        "src",
        "dst",
    ]
]

# Check df
plotly_df

In [None]:
# By combining our analysis from above, we can encapsulate the findings in an easy-to-use, interactive, and fast dashboard.
# Using the nodes generated from K-Means, we can calculate a real-time PageRank using cuGraph for each destination node. This eliminates the busyness of the graph lines.
# We can then map the nodes and show trip counts by node size. Drill down and filtering can be achieved with cross-filtered bar charts showing the day/night and weekday/weekend patterns. Further options can be side widgets.
# All the complex and interactive-speed compute for a large dataset can be cast into an easy-to-use Plotly Dash app UI.

# NOTE: More details on dash app environments: https://dash.plotly.com/dash-in-jupyter
# Python based Plotly Dash apps should use the `dash.Dash(__name__)` convention.
app = Dash(__name__)

# The layout, UI, and styling is done here. Additional css and js can be automatically included by adding an '/assets' folder.
app.layout = html.Div(
    [
        html.Div(
            [
                html.H2(
                    "Divvy Bikeshare | PageRanking of Destinations",
                    style={"margin-bottom": "0.25rem"},
                ),
                html.H4(
                    "Total Selected Trips:", style={"margin-bottom": "0.25rem"}
                ),
                html.H2(id="tripcount", style={"margin-bottom": "0.25rem"}),
                html.H4("Year:", style={"margin-bottom": "0.25rem"}),
                dcc.Dropdown(
                    id="year",
                    options=sorted(plotly_df["year"].unique().to_pandas()),
                    value="",
                    clearable=True,
                    style={"color": "#3a97d3"},
                ),
                html.H4("Month:", style={"margin-bottom": "0.25rem"}),
                dcc.Dropdown(
                    id="month",
                    options=sorted(plotly_df["month"].unique().to_pandas()),
                    value="",
                    clearable=True,
                    style={"color": "#3a97d3"},
                ),
                html.H4("Bike Type:", style={"margin-bottom": "0.25rem"}),
                dcc.Dropdown(
                    id="bikes",
                    options=sorted(
                        plotly_df["rideable_type"].unique().to_pandas()
                    ),
                    value="",
                    clearable=True,
                    style={"color": "#3a97d3"},
                ),
                html.H4("User Type:", style={"margin-bottom": "0.25rem"}),
                dcc.Dropdown(
                    id="user",
                    options=sorted(
                        plotly_df["member_casual"].unique().to_pandas()
                    ),
                    value="",
                    clearable=True,
                    style={"color": "#3a97d3"},
                ),
            ],
            style={
                "z-index": "99",
                "font-family": "sans-serif",
                "position": "absolute",
                "width": "15vw",
                "height": "calc(100vh - 3rem)",
                "padding": "1em",
                "background-color": "#3a97d3",
                "color": "#f1f1f1",
                "border-radius": "0.5rem",
                "box-shadow": "5px 0px 3px 0px rgba(0,0,0,0.3)",
            },
        ),
        html.Div(
            [
                html.Div(
                    [
                        html.H3(
                            "Area Importance PageRank (Color) by Trips (Size)",
                            style={
                                "font-family": "sans-serif",
                                "color": "#3a97d3",
                            },
                        ),
                        dcc.Graph(
                            id="pagerank_plot",
                            config={
                                "responsive": True,
                                "displaylogo": False,
                                "modeBarButtonsToRemove": [
                                    "select2d",
                                    "lasso2d",
                                    "toImage",
                                ],
                            },
                        ),
                    ],
                    style={
                        "display": "inline-block",
                        "width": "70vw",
                        "vertical-align": "top",
                    },
                ),
                html.Div(
                    [
                        html.H3(
                            "Trips Per Day of Week",
                            style={
                                "font-family": "sans-serif",
                                "color": "#3a97d3",
                            },
                        ),
                        dcc.Graph(
                            id="dow_plot",
                            config={
                                "responsive": True,
                                "displaylogo": False,
                                "modeBarButtonsToRemove": [
                                    "zoom2d",
                                    "zoomIn2d",
                                    "zoomOut2d",
                                    "toImage",
                                ],
                            },
                        ),
                    ],
                    style={
                        "display": "inline-block",
                        "width": "35vw",
                        "vertical-align": "bottom",
                    },
                ),
                html.Div(
                    [
                        html.H3(
                            "Trips Per Hour",
                            style={
                                "font-family": "sans-serif",
                                "color": "#3a97d3",
                            },
                        ),
                        dcc.Graph(
                            id="hour_plot",
                            config={
                                "responsive": True,
                                "displaylogo": False,
                                "modeBarButtonsToRemove": [
                                    "zoom2d",
                                    "zoomIn2d",
                                    "zoomOut2d",
                                    "toImage",
                                ],
                            },
                        ),
                    ],
                    style={
                        "display": "inline-block",
                        "width": "35vw",
                        "vertical-align": "bottom",
                    },
                ),
            ],
            style={
                "width": "70vw",
                "margin-left": "18vw",
                "padding-top": "1em",
                "display": "inline-block",
                "vertical-align": "top",
            },
        ),
    ]
)
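
The commentary in the layout cell describes ranking destination nodes with PageRank. For intuition, here is a plain power-iteration sketch on a tiny made-up edge list. It is not cuGraph's implementation (which may treat duplicate edges and convergence differently); the `pagerank` helper, its defaults, and the sample trips are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Nodes with no outgoing trips spread their rank evenly
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


trips = [(0, 2), (1, 2), (2, 0), (3, 2)]  # made-up (src, dst) cluster ids
ranks = pagerank(trips)
top_node = max(ranks, key=ranks.get)
print(top_node)
```

A node that many trips end at accumulates rank from every source cluster, which is why PageRank works as a compact "importance" colour for destinations on the map.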

In [None]:
# Set initial selection state to None
hour_data_backup, dow_data_backup = None, None
qry = [None, None, None, None]


# Callback that updates the trip count and charts; the Output/Input IDs tie it to the layout components and enable cross-filtering
@app.callback(
    [
        Output("tripcount", "children"),
        Output("pagerank_plot", "figure"),
        Output("dow_plot", "figure"),
        Output("hour_plot", "figure"),
    ],
    [
        Input("year", "value"),
        Input("month", "value"),
        Input("bikes", "value"),
        Input("user", "value"),
        Input("dow_plot", "selectedData"),
        Input("hour_plot", "selectedData"),
    ],
)
def update_figure(year, month, bikes, user, dow_data, hour_data):
    global hour_data_backup, dow_data_backup
    global qry

    data = plotly_df

    # Work around a Plotly bug where selectedData is reset following a box-select
    # Hour data query conditions
    if hour_data and len(hour_data["points"]) > 0:
        hour_data_backup = hour_data
        range0 = hour_data["range"]["x"][0]
        range1 = math.floor(hour_data["range"]["x"][1])
        qry[0] = f"(hour >= {range0} and hour <= {range1})"

    elif hour_data is None and hour_data_backup is not None:
        hour_data_backup = hour_data
        qry[0] = None
    elif ctx.triggered_id == "hour_plot":
        raise PreventUpdate

    # Day of week data query conditions
    if dow_data and len(dow_data["points"]) > 0:
        dow_data_backup = dow_data
        range0 = dow_data["range"]["x"][0]
        range1 = math.floor(dow_data["range"]["x"][1])
        qry[1] = f"(day_of_week >= {range0} and day_of_week <= {range1})"

    elif dow_data is None and dow_data_backup is not None:
        dow_data_backup = dow_data
        qry[1] = None
    elif ctx.triggered_id == "dow_plot":
        raise PreventUpdate

    # dropdowns
    if year is not None and year != "":
        qry[2] = f"(year == {year})"
    else:
        qry[2] = None

    if month is not None and month != "":
        qry[3] = f"(month == {month})"
    else:
        qry[3] = None

    # NOTE: cudf.query() doesn't support strings, so filter these columns with boolean masks
    if bikes is not None and bikes != "":
        data = data[data["rideable_type"] == bikes]

    if user is not None and user != "":
        data = data[data["member_casual"] == user]

    # Build the combined query from the active (non-None) slots
    full_query = " and ".join(item for item in qry if item is not None)

    # Apply the query only when at least one filter is active
    if full_query != "":
        data = data.query(full_query)

    # update trip count
    tripcount = "{:,}".format(data.shape[0])

    # update charts data
    pagerank_plot = get_pagerank_plot(data)
    hour_plot = get_hour_chart(data)
    dow_plot = get_dow_chart(data)

    return tripcount, pagerank_plot, dow_plot, hour_plot
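
The callback above turns each active filter into a query fragment and joins the non-empty fragments with `and`. The following is a minimal standalone sketch of that step in plain Python, with no GPU required. The helper names `range_clause` and `build_query` and the sample selection payload are hypothetical and only illustrate the pattern; they are not part of the notebook.

```python
import math


def range_clause(column, selected):
    """Turn a box-selection payload into a query fragment, or None if nothing is selected."""
    if not selected or not selected.get("points"):
        return None
    low, high = selected["range"]["x"]
    # Floor the upper bound, as the callback does, so a partially covered bar is excluded
    return f"({column} >= {low} and {column} <= {math.floor(high)})"


def build_query(clauses):
    """Join the active (non-None) fragments into one query string."""
    return " and ".join(clause for clause in clauses if clause is not None)


# Hypothetical payload shaped like the selectedData the callback reads
hour_selection = {"points": [{"x": 7}, {"x": 8}], "range": {"x": [6.5, 8.5]}}

clauses = [
    range_clause("hour", hour_selection),  # active hour filter
    range_clause("day_of_week", None),     # no day-of-week selection
    "(year == 2021)",                      # dropdown filter
    None,                                  # month dropdown left empty
]
full_query = build_query(clauses)
print(full_query)  # (hour >= 6.5 and hour <= 8) and (year == 2021)
```

An empty result from `build_query` means no filter is active, which is why the callback only calls `data.query()` when the string is non-empty.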

In [None]:
# Real-time cuGraph PageRank calculation on the currently filtered trips
def calculate_page_rank(data):
    G = cugraph.Graph()
    G.from_cudf_edgelist(
        data, source="src", destination="dst", store_transposed=True
    )
    data_rank = cugraph.pagerank(G)
    return data_rank


# Geospatial bubble chart using PageRank and Trip counts
def get_pagerank_plot(data):
    # Get PageRanks
    data_rank = calculate_page_rank(data)

    # Get trip counts
    trips = (
        data.groupby("dst")
        .agg({"dst": "size"})
        .rename(columns={"dst": "arrivals"})
        .reset_index()
    )

    # Combine
    trips = trips.merge(data_rank, left_on="dst", right_on="vertex").drop(
        columns=["vertex"]
    )

    # Plot bubble locations from node_centers_df calculated earlier
    rank_chart = trips.merge(
        node_centers_df, left_on="dst", right_index=True
    ).reset_index(drop=True)

    # Build chart
    g = px.scatter_mapbox(
        rank_chart.to_pandas(),
        lat="node_lat",
        lon="node_lng",
        color="pagerank",
        size="arrivals",
        hover_data=["pagerank", "dst"],
        mapbox_style="carto-positron",
        color_continuous_scale=px.colors.sequential.haline,
        size_max=20,
        zoom=10,
        height=800,
    )
    g.update_layout(margin=dict(l=0, r=0, b=0, t=0, pad=4))
    g.layout["uirevision"] = True
    return g


# Bar chart based on day of week
def get_dow_chart(data):
    # Group days
    dow = data.groupby("day_of_week").size().rename("count").reset_index()

    # Build chart
    g = px.bar(
        dow.to_pandas(),
        x="day_of_week",
        y="count",
        template=dict(
            layout={
                "selectdirection": "h",
            }
        ),
        height=300,
    )
    g.update_layout(margin=dict(l=50, r=20, b=50, t=20, pad=5))
    g.update_xaxes(range=[-0.5, 6.5])
    g.update_traces(
        marker_color="#3a97d3", selected=dict(marker=dict(color="#3a97d3"))
    )  # fixes bug where selected marker appears unselected
    g.layout["dragmode"] = "select"
    g.layout["uirevision"] = True
    return g


# Bar chart based on hour of day
def get_hour_chart(data):
    # Group hours
    hour = data.groupby("hour").size().rename("count").reset_index()

    # Build chart
    g = px.bar(
        hour.to_pandas(),
        x="hour",
        y="count",
        template=dict(
            layout={
                "selectdirection": "h",
            }
        ),
        height=300,
    )
    g.update_layout(margin=dict(l=50, r=20, b=50, t=20, pad=5))
    g.update_xaxes(range=[-0.5, 23.5])
    g.update_traces(
        marker_color="#3a97d3", selected=dict(marker=dict(color="#3a97d3"))
    )  # fixes bug where selected marker appears unselected
    g.layout["dragmode"] = "select"
    g.layout["uirevision"] = True
    return g
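
To give some intuition for the scores that `cugraph.pagerank` returns in the cell above, here is a minimal pure-Python power-iteration sketch on a toy trip list. It is only an illustration under stated assumptions and not cuGraph's implementation: the `pagerank` function, the damping factor of 0.85, the tolerance, and the station names are all chosen for the example, and repeated trips between the same pair of stations would count multiple times here.

```python
def pagerank(edges, alpha=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a list of (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    # Count outgoing edges per node
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes its rank along its outgoing edges
        for src, dst in edges:
            new_rank[dst] += alpha * rank[src] / out_degree[src]

        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy trips: three stations send riders to "C", and "C" sends riders to "A"
trips = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(trips)
top_station = max(ranks, key=ranks.get)
print(top_station, ranks)
```

In this toy graph the station that receives the most trips ends up with the highest score, which is the same idea the bubble chart encodes with color.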

In [None]:
# NOTE: Any change requires re-running all the Dash cells. To run the app inline, set jupyter_mode="inline".
# Click the link below to open the dashboard in a full tab.
# Filter with the dropdowns and by box-selecting regions of the bar charts. Reset a bar chart filter by double-clicking it.
# The brightest and largest circles mark the most interesting points (highest PageRank and most arrivals).
if __name__ == "__main__":
    app.run(debug=False, jupyter_mode="external")

### Plotly Dash screenshot
<img src="images/plotly-dash-divvy.png" style="width:50%">

## Conclusion
Searching for these sorts of insights becomes satisfying when visualization tools interact with large data at the “speed of thought.” With GPU-accelerated RAPIDS frameworks, and with how simple it is to integrate them with accelerated visualization frameworks, data analytics workflows can become faster, more insightful, more productive, and (just maybe) more enjoyable.

## Troubleshooting

#### I keep getting CUDA memory errors
- The dataset is too large for your GPU. Try loading a smaller subset of the data or clearing unused GPU memory. For reference, two full years of data needs a GPU with approximately 24GB of memory.
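
As a rough way to pick a subset size, the sketch below scales the notebook's reference point (roughly 11,000,000 trips for two full years using up to 24GB) linearly to a smaller GPU. The linear scaling, the `headroom` safety factor, and the function name `max_rows_for_gpu` are assumptions made for illustration; actual memory use depends on the columns loaded and the charts rendered.

```python
def max_rows_for_gpu(
    gpu_memory_gb,
    reference_rows=11_000_000,  # trips in the two-year dataset (from the notebook's note)
    reference_gb=24.0,          # approximate GPU memory that dataset may use
    headroom=0.8,               # assumed safety factor, leaves 20% of memory free
):
    """Estimate how many trips fit on a GPU, assuming memory use scales linearly with rows."""
    if gpu_memory_gb <= 0:
        raise ValueError("gpu_memory_gb must be positive")
    return int(reference_rows * (gpu_memory_gb * headroom) / reference_gb)


# Example: a rough row budget for a 16GB GPU
row_budget = max_rows_for_gpu(16)
print(f"Try loading at most about {row_budget:,} trips")
```

The resulting number can then be used to limit how many months are downloaded or how many rows are kept before building the charts.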

#### hvPlot widgets aren't working
- Currently, hvPlot widgets only work in JupyterLab 3.6.4.

#### cuxfilter's green dashboard button doesn't show up / the charts don't show up
- The hvPlot, Jupyter, and cuxfilter CSS/JS assets can conflict and override each other. Be sure to run `hvplot.extension('bokeh')` before using cuxfilter to reload the assets.

#### cuxfilter graph chart nodes aren't selecting as I expect
- There are two unique graph chart options available in the buttons toolbar: the bottom arrow button turns off edge rendering, and the button above it turns off "inspecting neighboring edges." With the latter on, EVERY connection to and from the selected nodes is highlighted. With it off, only the edges from the selected nodes are highlighted.

#### cuxfilter dashboard hangs, especially on large graphs
- Rendering a large graph, especially with the edges showing, can spike GPU memory and cause an out-of-memory (OOM) error. Re-run the cuxfilter cell and restart the dashboard. Sometimes it might be necessary to restart the Jupyter kernel.

#### Where can I find more information?
- The [HoloViz user guides](https://holoviz.org/)
- The [cuxfilter documentation](https://docs.rapids.ai/api/cuxfilter/stable/)
- The [Plotly Dash documentation](https://plotly.com/examples/)
- The [rapids.ai site](https://rapids.ai)
