[Optimization 1/4] H3-accelerated bbox filtering in geoparquet.ipynb #2

@rdhyee

Description

Priority: 2

Context

H3 geospatial indexing gives ~5x speedup for bounding box queries (170ms → 35ms). We published an H3-indexed wide parquet to R2. This notebook should demonstrate the speedup.
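As a quick sanity check, the quoted timings are consistent with the "~5x" claim:

```python
# Arithmetic check of the quoted timings (170 ms -> 35 ms).
baseline_ms = 170
h3_ms = 35
speedup = baseline_ms / h3_ms
print(f"{speedup:.1f}x")  # ~4.9x, i.e. roughly the claimed ~5x
```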

Data Files on R2

| File | URL | Size |
| --- | --- | --- |
| Wide + H3 | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet | 292 MB |
| Original wide | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet | 278 MB |

The H3 file has 3 extra columns: h3_res4 (BIGINT), h3_res6 (BIGINT), h3_res8 (BIGINT). 11.96M of 20.7M rows have H3 values (rows with coordinates).
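For background, an H3 cell ID is a 64-bit integer, so BIGINT storage is lossless; the familiar H3 string form is just that integer rendered as lowercase hex. A minimal sketch (the cell value here is hypothetical, not taken from the dataset):

```python
# H3 cell IDs are 64-bit integers; the BIGINT columns store them directly.
# The usual H3 string form is the same value in lowercase hexadecimal.
cell_int = 0x8428309FFFFFFFF   # hypothetical res-4 cell, for illustration only
cell_str = format(cell_int, "x")
print(cell_str)                # "8428309ffffffff"
assert int(cell_str, 16) == cell_int
```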

File to Modify

examples/basic/geoparquet.ipynb

Current Behavior

All spatial queries use raw lat/lon range scans:

```python
# ~170 ms per query
filtered = table.filter(
    (ibis._.latitude.between(32, 49)) & (ibis._.longitude.between(-125, -104))
)
```

Desired Changes

1. Add H3 bbox filtering section

After the existing spatial query section, add a new section demonstrating H3 acceleration:

```python
import duckdb
import time

con = duckdb.connect()
con.execute("INSTALL h3 FROM community; LOAD h3;")

wide_h3_url = "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet"

# Compute H3 cells covering the Western US bbox.
# h3.geo_to_cells requires the h3 Python package;
# alternatively, use DuckDB itself to find cells within the bbox (as below).

# Baseline: raw lat/lon range scan
start = time.time()
baseline = con.sql(f"""
    SELECT COUNT(*) FROM read_parquet('{wide_h3_url}')
    WHERE otype = 'MaterialSampleRecord'
    AND latitude BETWEEN 32 AND 49 AND longitude BETWEEN -125 AND -104
""").fetchone()
baseline_ms = (time.time() - start) * 1000

# H3: pre-filter by res4 cells, then refine
start = time.time()
h3_result = con.sql(f"""
    WITH bbox_cells AS (
        SELECT DISTINCT h3_res4
        FROM read_parquet('{wide_h3_url}')
        WHERE otype = 'MaterialSampleRecord'
        AND latitude BETWEEN 32 AND 49 AND longitude BETWEEN -125 AND -104
    )
    SELECT COUNT(*) FROM read_parquet('{wide_h3_url}')
    WHERE h3_res4 IN (SELECT h3_res4 FROM bbox_cells)
    AND otype = 'MaterialSampleRecord'
""").fetchone()
h3_ms = (time.time() - start) * 1000

print(f"Baseline: {baseline_ms:.0f}ms ({baseline[0]:,} samples)")
print(f"H3 res4: {h3_ms:.0f}ms ({h3_result[0]:,} samples)")
print(f"Speedup: {baseline_ms/h3_ms:.1f}x")
```

2. Add H3 stats section

Show the H3 column distribution:

```python
stats = con.sql(f"""
    SELECT
        COUNT(*) as total_rows,
        COUNT(h3_res4) as with_h3,
        COUNT(DISTINCT h3_res4) as unique_res4,
        COUNT(DISTINCT h3_res6) as unique_res6,
        COUNT(DISTINCT h3_res8) as unique_res8
    FROM read_parquet('{wide_h3_url}')
    WHERE otype = 'MaterialSampleRecord'
""").df()
```

3. Update Lonboard visualization

Use the H3 file URL instead of the original wide file, and show how H3 filtering speeds up loading the data for the map.

Acceptance Criteria

  • Side-by-side benchmark: baseline vs H3 query timing
  • H3 stats showing cell distribution
  • Lonboard map still works with H3 file
  • Clear markdown explanation of what H3 is and why it helps
  • Notebook runs end-to-end without errors
