[Optimization 3/4] New h3_clustering.ipynb — H3 map clustering demo #4

@rdhyee

Description

Priority: 6

Context

No clustering visualization exists in the notebooks. Lonboard currently renders every individual point, which overwhelms the map at ~6M points. The H3 res6 column enables zoom-adaptive clustering.

Data Files on R2

| File | URL | Size |
| --- | --- | --- |
| Wide + H3 | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet | 292 MB |
| Facet summaries | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_summaries.parquet | 2 KB |

H3 columns: h3_res4 (BIGINT), h3_res6 (BIGINT), h3_res8 (BIGINT). 11.96M rows have H3 values.
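Since the columns are BIGINTs, the notebook may need to convert them for tools that exchange H3 indices as hex strings (h3-py, the h3geo web viewers). A minimal round-trip sketch; the BIGINT value is a hypothetical stand-in, not taken from the actual file:

```python
# H3 indices are 64-bit integers; many H3 tools accept/emit them as
# lowercase hex strings. Round-trip between the two representations:
h3_int = 604189641121202175   # hypothetical BIGINT value, stand-in for a real cell
h3_str = f"{h3_int:x}"        # hex-string form used by h3-py and web tools
assert int(h3_str, 16) == h3_int
print(h3_str)
```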

File to Create

examples/basic/h3_clustering.ipynb

Notebook Structure

Cell 1: Introduction (markdown)

Explain H3 hierarchical hexagonal indexing, why it's useful for geospatial clustering, link to h3geo.org.
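Worth noting in the intro: the parent/child hierarchy is encoded directly in the 64-bit index, which is why coarsening is cheap. A sketch of parent derivation based on the bit layout documented at h3geo.org (resolution in bits 52-55, one 3-bit digit per resolution level, unused digits set to 7); a real notebook should use h3-py's `cell_to_parent` instead:

```python
def h3_cell_to_parent(h: int, parent_res: int) -> int:
    """Coarsen an H3 cell index to a lower resolution (sketch).

    Based on the H3 bit layout documented at h3geo.org: the resolution
    lives in bits 52-55, and digit r occupies bits 3*(15-r)..3*(15-r)+2,
    with digits beyond the cell's resolution set to 7 ("unused").
    Prefer h3-py's cell_to_parent in production code.
    """
    res = (h >> 52) & 0xF
    if parent_res > res:
        raise ValueError("parent resolution must be coarser than the cell's")
    h = (h & ~(0xF << 52)) | (parent_res << 52)  # rewrite the resolution field
    for r in range(parent_res + 1, res + 1):     # mark finer digits as unused
        h |= 0b111 << (3 * (15 - r))
    return h
```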

Cell 2: Setup and H3 stats

import duckdb

con = duckdb.connect()
wide_h3_url = "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet"

# Show H3 column distribution
stats = con.sql(f"""
    SELECT
        COUNT(*) as total,
        COUNT(h3_res4) as with_h3,
        COUNT(DISTINCT h3_res4) as cells_res4,
        COUNT(DISTINCT h3_res6) as cells_res6,
        COUNT(DISTINCT h3_res8) as cells_res8
    FROM read_parquet('{wide_h3_url}')
    WHERE otype = 'MaterialSampleRecord'
""").df()
stats

Cell 3: Cluster at res6 (~3.2km hexagons)

clusters = con.sql(f"""
    SELECT
        h3_res6,
        COUNT(*) as n,
        AVG(latitude) as lat,
        AVG(longitude) as lon,
        MODE(source) as dominant_source  -- column name 'source' assumed; adjust to the actual provider column
    FROM read_parquet('{wide_h3_url}')
    WHERE otype = 'MaterialSampleRecord' AND h3_res6 IS NOT NULL
    GROUP BY h3_res6
""").df()

print(f"{len(clusters):,} clusters from {clusters.n.sum():,} samples")
print(f"Cluster sizes: min={clusters.n.min()}, median={clusters.n.median():.0f}, max={clusters.n.max():,}")

Cell 4: Lonboard clustered visualization

from lonboard import Map, ScatterplotLayer
import geopandas as gpd
import numpy as np

# Scale radius (meters) by log of count, clamped so small and huge clusters stay visible
clusters['radius'] = np.clip(np.log2(clusters['n']) * 500, 500, 50000)

# Color by dominant source (RGB)
source_colors = {
    'SESAR': [0, 100, 255],
    'OpenContext': [0, 200, 100],
    'GEOME': [255, 165, 0],
    'Smithsonian': [148, 0, 211]
}
colors = np.array(
    [source_colors.get(s, [128, 128, 128]) for s in clusters['dominant_source']],
    dtype=np.uint8,
)

# Lonboard renders from GeoDataFrames; build point geometries from lon/lat
gdf = gpd.GeoDataFrame(
    clusters,
    geometry=gpd.points_from_xy(clusters['lon'], clusters['lat']),
    crs="EPSG:4326",
)

layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=clusters['radius'].to_numpy(),
    get_fill_color=colors,
    opacity=0.6,
    pickable=True,
)
Map(layer)

Cell 5: Compare resolutions side-by-side

Show how res4 vs res6 vs res8 produce different clustering granularity. Include a table comparing cluster counts and a note about when to use each.

Cell 6: Benchmark — clustering vs full points

Time comparison: loading 112K clusters vs 6M individual points into Lonboard.
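The benchmark can follow a simple wall-clock pattern; the two commented-out loader calls are hypothetical placeholders for building the Lonboard layers from clustered vs. full point data:

```python
import time

def timed(label, fn):
    """Run fn once, print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# Hypothetical loaders -- in the notebook these would build the Lonboard maps:
# clusters_map = timed("112K clusters", lambda: Map(make_cluster_layer()))
# full_map     = timed("6M points",     lambda: Map(make_full_point_layer()))
demo = timed("demo workload", lambda: sum(range(1_000_000)))
```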

Cell 7: Regional drill-down demo

Select a res4 cell, then show its res6 children, then res8. Demonstrate hierarchical zoom.
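Because each row in the wide parquet carries all three H3 columns, drill-down is a plain filter on the coarser column. A toy sketch with arbitrary integer cell IDs (the real notebook would filter the clustered dataframe by the cell the user selected):

```python
import pandas as pd

# Toy stand-in: each row keeps its parent cells, so drilling down from a
# res4 cell to its res6 children is just a filter plus a group-by.
df = pd.DataFrame({
    "h3_res4": [100, 100, 100, 101],
    "h3_res6": [200, 200, 201, 202],
    "n":       [5,   3,   2,   7],
})

parent = 100                                   # the res4 cell the user selected
children = (df[df.h3_res4 == parent]
            .groupby("h3_res6", as_index=False)["n"].sum())
print(children)  # res6 cells 200 (n=8) and 201 (n=2)
```

The same filter-then-group step repeats from res6 down to res8 for the final zoom level.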

Acceptance Criteria

  • New notebook created at examples/basic/h3_clustering.ipynb
  • Clear H3 explanation for newcomers
  • Lonboard map showing clusters colored by source
  • Multi-resolution comparison (res4/6/8)
  • Performance benchmark included
  • Notebook runs end-to-end without errors
