226 changes: 222 additions & 4 deletions examples/basic/geoparquet.ipynb
@@ -534,7 +534,7 @@
"\n",
"* [iSamples Complete Export Dataset - April 2025](https://zenodo.org/records/15278211)\n",
"\n",
-"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000) \n",
+"* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000)\u00a0\n",
"\n"
]
},
@@ -1717,7 +1717,7 @@
" </tr>\n",
" </tbody>\n",
"</table>\n",
-"<p>6680932 rows × 3 columns</p>\n",
+"<p>6680932 rows \u00d7 3 columns</p>\n",
"</div>"
],
"text/plain": [
@@ -1957,7 +1957,7 @@
"version_minor": 1
},
"text/plain": [
-"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_"
+"Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_\u2026"
]
},
"metadata": {},
@@ -2805,7 +2805,7 @@
"\n",
"The central idea is to create a \"control panel\" of widgets that are dynamically generated based on the schema of your Ibis table. This panel allows a user to build up a complex filter expression interactively, and then Ibis executes the final, filtered query.\n",
"\n",
-"Here’s a step-by-step approach to implementing your vision:\n",
+"Here\u2019s a step-by-step approach to implementing your vision:\n",
"\n",
"---\n",
"\n",
@@ -2946,6 +2946,224 @@
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "h3-acceleration-header",
"metadata": {},
"source": [
"## H3-Accelerated Spatial Filtering\n",
"\n",
"The [H3 geospatial indexing system](https://h3geo.org/) partitions the Earth into hexagonal cells at\n",
"multiple resolutions. By pre-computing H3 cell indices for each sample's coordinates, we can\n",
"replace expensive lat/lon range scans with fast integer lookups.\n",
"\n",
"The iSamples wide parquet file with H3 indices adds three BIGINT columns \u2014 `h3_res4`, `h3_res6`,\n",
"`h3_res8` \u2014 covering ~11.96M of 20.7M rows (those with valid coordinates).\n",
"\n",
"Below we benchmark **baseline lat/lon filtering** vs **H3 res4 pre-filtering** for a bounding-box\n",
"query, show H3 cell statistics, and render the H3-indexed data on a Lonboard map."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-setup-cell",
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"import time\n",
"\n",
"# Data URLs\n",
"WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n",
"WIDE_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet\"\n",
"\n",
"con_h3 = duckdb.connect()\n",
"con_h3.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n",
"print(\"DuckDB H3 extension loaded.\")"
]
},
{
"cell_type": "markdown",
"id": "h3-stats-header",
"metadata": {},
"source": [
"### H3 Cell Distribution Statistics\n",
"\n",
"How many distinct hexagonal cells exist at each resolution, and what fraction of rows carry H3 values?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-stats-cell",
"metadata": {},
"outputs": [],
"source": [
"h3_stats = con_h3.sql(f\"\"\"\n",
" SELECT\n",
" COUNT(*) AS total_rows,\n",
" COUNT(h3_res4) AS rows_with_h3,\n",
" ROUND(100.0 * COUNT(h3_res4) / COUNT(*), 1) AS pct_with_h3,\n",
" COUNT(DISTINCT h3_res4) AS distinct_res4,\n",
" COUNT(DISTINCT h3_res6) AS distinct_res6,\n",
" COUNT(DISTINCT h3_res8) AS distinct_res8\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
"\"\"\").df()\n",
"\n",
"print(\"H3 Cell Distribution\")\n",
"print(\"=\" * 50)\n",
"print(f\"Total rows: {h3_stats['total_rows'].iloc[0]:>12,}\")\n",
"print(f\"Rows with H3: {h3_stats['rows_with_h3'].iloc[0]:>12,} ({h3_stats['pct_with_h3'].iloc[0]}%)\")\n",
"print(f\"Distinct res4 cells: {h3_stats['distinct_res4'].iloc[0]:>12,} (~22.6 km avg edge)\")\n",
"print(f\"Distinct res6 cells: {h3_stats['distinct_res6'].iloc[0]:>12,} (~3.2 km avg edge)\")\n",
"print(f\"Distinct res8 cells: {h3_stats['distinct_res8'].iloc[0]:>12,} (~0.46 km avg edge)\")"
]
},
{
"cell_type": "markdown",
"id": "h3-benchmark-header",
"metadata": {},
"source": [
"### Bbox Benchmark: Lat/Lon Range Scan vs H3 Pre-Filter\n",
"\n",
"We query samples inside a bounding box (San Francisco Bay Area) two ways:\n",
"\n",
"1. **Baseline** \u2014 filter on plain `latitude`/`longitude` `BETWEEN` range predicates\n",
"2. **H3 pre-filter** \u2014 find the set of res4 cells that overlap the bbox, then filter by those cells\n",
" before applying the exact lat/lon check.\n",
"\n",
"The H3 approach narrows the scan to a small number of hexagonal cells, reducing I/O on remote\n",
"parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-benchmark-cell",
"metadata": {},
"outputs": [],
"source": [
"import h3\n",
"\n",
"# Bay Area bounding box\n",
"BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)\n",
"\n",
"# --- Baseline: raw lat/lon range scan ---\n",
"t0 = time.time()\n",
"baseline = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
"\"\"\").df()\n",
"baseline_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- H3 pre-filter: compute covering cells mathematically (no data scan) ---\n",
"t0 = time.time()\n",
"\n",
"# Use h3 Python library to compute all res4 cells covering the bbox.\n",
"# This is O(1) relative to dataset size \u2014 pure geometry, no I/O.\n",
"bbox_polygon = h3.LatLngPoly([\n",
" (BBOX['min_lat'], BBOX['min_lon']),\n",
" (BBOX['min_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['max_lon']),\n",
" (BBOX['max_lat'], BBOX['min_lon']),\n",
"])\n",
"covering_cells = set(h3.geo_to_cells(bbox_polygon, res=4))\n",
"# geo_to_cells returns cells whose centroid falls inside the polygon, so a\n",
"# cell that only overlaps the bbox edge can be missed. Expanding each cell\n",
"# by one ring of neighbors makes the pre-filter safely over-inclusive; the\n",
"# exact lat/lon check in the query below discards any extra rows.\n",
"for cell in list(covering_cells):\n",
"    covering_cells.update(h3.grid_disk(cell, 1))\n",
"\n",
"# Convert to signed int64 to match DuckDB BIGINT storage\n",
"def h3_to_signed(cell_hex):\n",
"    val = h3.str_to_int(cell_hex)\n",
"    return val if val < 2**63 else val - 2**64\n",
"\n",
"cell_list = [str(h3_to_signed(c)) for c in covering_cells]\n",
"print(f'Bbox covered by {len(cell_list)} res4 cells (computed mathematically)')\n",
"\n",
"if not cell_list:\n",
" print('No H3 cells cover this bbox.')\n",
" h3_ms = 0\n",
" h3_result = None\n",
"else:\n",
" cell_sql = ', '.join(cell_list)\n",
" h3_result = con_h3.sql(f\"\"\"\n",
" SELECT COUNT(*) AS n\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE otype = 'MaterialSampleRecord'\n",
" AND h3_res4 IN ({cell_sql})\n",
" AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n",
" AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n",
" \"\"\").df()\n",
" h3_ms = (time.time() - t0) * 1000\n",
"\n",
"# --- Results ---\n",
"print('Bounding-Box Query Benchmark (Bay Area)')\n",
"print('=' * 50)\n",
"print(f'Baseline (lat/lon scan): {baseline_ms:>8.0f} ms | {baseline[\"n\"].iloc[0]:,} rows')\n",
"if h3_result is not None:\n",
" print(f'H3 res4 pre-filter: {h3_ms:>8.0f} ms | {h3_result[\"n\"].iloc[0]:,} rows')\n",
" speedup = baseline_ms / h3_ms if h3_ms > 0 else float('inf')\n",
" print(f'Speedup: {speedup:>7.1f}x')\n",
" print(f'\\nRow counts match: {baseline[\"n\"].iloc[0] == h3_result[\"n\"].iloc[0]}')\n",
"else:\n",
" print('H3 result: no covering cells')\n"
]
},
{
"cell_type": "markdown",
"id": "h3-lonboard-header",
"metadata": {},
"source": [
"### Lonboard Visualization with H3-Indexed Data\n",
"\n",
"Render a sample of the H3-indexed dataset to confirm the visualization pipeline works with\n",
"the enriched file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "h3-lonboard-cell",
"metadata": {},
"outputs": [],
"source": [
"from lonboard import Map, ScatterplotLayer, BitmapTileLayer\n",
"import geopandas as gpd\n",
"import numpy as np\n",
"\n",
"# Sample 100K rows with coordinates from the H3 file\n",
"sample_df = con_h3.sql(f\"\"\"\n",
" SELECT latitude, longitude, n AS source, h3_res4, h3_res6\n",
" FROM read_parquet('{WIDE_H3_URL}')\n",
" WHERE latitude IS NOT NULL AND longitude IS NOT NULL\n",
" USING SAMPLE 100000\n",
"\"\"\").df()\n",
"\n",
"gdf_h3 = gpd.GeoDataFrame(\n",
" sample_df,\n",
" geometry=gpd.points_from_xy(sample_df.longitude, sample_df.latitude),\n",
" crs=\"EPSG:4326\"\n",
")\n",
"\n",
"# Color by source\n",
"h3_color_map = {\n",
" 'SESAR': [51, 102, 204, 200],\n",
" 'OPENCONTEXT': [220, 57, 18, 200],\n",
" 'GEOME': [16, 150, 24, 200],\n",
" 'SMITHSONIAN': [255, 153, 0, 200],\n",
"}\n",
"default_c = [128, 128, 128, 200]\n",
"colors_arr = np.array([h3_color_map.get(s, default_c) for s in gdf_h3['source']], dtype=np.uint8)\n",
"\n",
"layer = ScatterplotLayer.from_geopandas(gdf_h3, get_fill_color=colors_arr, get_radius=500)\n",
"tile_layer = BitmapTileLayer(\n",
" data=\"https://tile.openstreetmap.org/{z}/{x}/{y}.png\",\n",
" min_zoom=0, max_zoom=19,\n",
")\n",
"\n",
"m = Map([tile_layer, layer], view_state={\"zoom\": 2, \"latitude\": 20, \"longitude\": 0})\n",
"print(f\"Rendering {len(gdf_h3):,} points from H3-indexed file\")\n",
"m"
]
}
],
"metadata": {