Priority: 2
Context
H3 geospatial indexing gives ~5x speedup for bounding box queries (170ms → 35ms). We published an H3-indexed wide parquet to R2. This notebook should demonstrate the speedup.
Data Files on R2
| File |
URL |
Size |
| Wide + H3 |
https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet |
292 MB |
| Original wide |
https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet |
278 MB |
The H3 file has 3 extra columns: h3_res4 (BIGINT), h3_res6 (BIGINT), h3_res8 (BIGINT). 11.96M of 20.7M rows have H3 values (rows with coordinates).
File to Modify
examples/basic/geoparquet.ipynb
Current Behavior
All spatial queries use raw lat/lon range scans:
# ~170ms per query
filtered = table.filter(
(ibis._.latitude.between(32, 49)) & (ibis._.longitude.between(-125, -104))
)
Desired Changes
1. Add H3 bbox filtering section
After the existing spatial query section, add a new section demonstrating H3 acceleration:
import duckdb
con = duckdb.connect()
con.execute("INSTALL h3 FROM community; LOAD h3;")
wide_h3_url = "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet"
# Compute H3 cells covering the Western US bbox
# h3.geo_to_cells requires the h3 Python package
# Alternative: use DuckDB to find cells within the bbox
import time
# Baseline: raw lat/lon
start = time.time()
baseline = con.sql(f"""
SELECT COUNT(*) FROM read_parquet('{wide_h3_url}')
WHERE otype = 'MaterialSampleRecord'
AND latitude BETWEEN 32 AND 49 AND longitude BETWEEN -125 AND -104
""").fetchone()
baseline_ms = (time.time() - start) * 1000
# H3: pre-filter by res4 cells, then refine
start = time.time()
h3_result = con.sql(f"""
WITH bbox_cells AS (
SELECT DISTINCT h3_res4
FROM read_parquet('{wide_h3_url}')
WHERE otype = 'MaterialSampleRecord'
AND latitude BETWEEN 32 AND 49 AND longitude BETWEEN -125 AND -104
)
SELECT COUNT(*) FROM read_parquet('{wide_h3_url}')
WHERE h3_res4 IN (SELECT h3_res4 FROM bbox_cells)
AND otype = 'MaterialSampleRecord'
""").fetchone()
h3_ms = (time.time() - start) * 1000
print(f"Baseline: {baseline_ms:.0f}ms ({baseline[0]:,} samples)")
print(f"H3 res4: {h3_ms:.0f}ms ({h3_result[0]:,} samples)")
print(f"Speedup: {baseline_ms/h3_ms:.1f}x")
2. Add H3 stats section
Show the H3 column distribution:
stats = con.sql(f"""
SELECT
COUNT(*) as total_rows,
COUNT(h3_res4) as with_h3,
COUNT(DISTINCT h3_res4) as unique_res4,
COUNT(DISTINCT h3_res6) as unique_res6,
COUNT(DISTINCT h3_res8) as unique_res8
FROM read_parquet('{wide_h3_url}')
WHERE otype = 'MaterialSampleRecord'
""").df()
3. Update Lonboard visualization
Use the H3 file URL instead of the original wide file, and demonstrate how H3 filtering can speed up the data loading for the map.
Acceptance Criteria
Priority: 2
Context
H3 geospatial indexing gives ~5x speedup for bounding box queries (170ms → 35ms). We published an H3-indexed wide parquet to R2. This notebook should demonstrate the speedup.
Data Files on R2
https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquethttps://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquetThe H3 file has 3 extra columns:
h3_res4(BIGINT),h3_res6(BIGINT),h3_res8(BIGINT). 11.96M of 20.7M rows have H3 values (rows with coordinates).File to Modify
examples/basic/geoparquet.ipynbCurrent Behavior
All spatial queries use raw lat/lon range scans:
Desired Changes
1. Add H3 bbox filtering section
After the existing spatial query section, add a new section demonstrating H3 acceleration:
2. Add H3 stats section
Show the H3 column distribution:
3. Update Lonboard visualization
Use the H3 file URL instead of the original wide file, and demonstrate how H3 filtering can speed up the data loading for the map.
Acceptance Criteria