diff --git a/examples/basic/geoparquet.ipynb b/examples/basic/geoparquet.ipynb index eb978cc..769cd66 100644 --- a/examples/basic/geoparquet.ipynb +++ b/examples/basic/geoparquet.ipynb @@ -534,7 +534,7 @@ "\n", "* [iSamples Complete Export Dataset - April 2025](https://zenodo.org/records/15278211)\n", "\n", - "* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000) \n", + "* [Open Context Database SQL Dump and Parquet Exports](https://zenodo.org/records/15732000) -- [https://zenodo.org/records/15732000](https://zenodo.org/records/15732000)\u00a0\n", "\n" ] }, @@ -1717,7 +1717,7 @@ " \n", " \n", "\n", - "

6680932 rows × 3 columns

\n", + "

6680932 rows \u00d7 3 columns

\n", "" ], "text/plain": [ @@ -1957,7 +1957,7 @@ "version_minor": 1 }, "text/plain": [ - "Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_…" + "Map(custom_attribution='', layers=(BitmapTileLayer(data='https://tile.openstreetmap.org/{z}/{x}/{y}.png', max_\u2026" ] }, "metadata": {}, @@ -2805,7 +2805,7 @@ "\n", "The central idea is to create a \"control panel\" of widgets that are dynamically generated based on the schema of your Ibis table. This panel allows a user to build up a complex filter expression interactively, and then Ibis executes the final, filtered query.\n", "\n", - "Here’s a step-by-step approach to implementing your vision:\n", + "Here\u2019s a step-by-step approach to implementing your vision:\n", "\n", "---\n", "\n", @@ -2946,6 +2946,224 @@ "metadata": {}, "outputs": [], "source": [] + }, + { + "cell_type": "markdown", + "id": "h3-acceleration-header", + "metadata": {}, + "source": [ + "## H3-Accelerated Spatial Filtering\n", + "\n", + "The [H3 geospatial indexing system](https://h3geo.org/) partitions the Earth into hexagonal cells at\n", + "multiple resolutions. By pre-computing H3 cell indices for each sample's coordinates, we can\n", + "replace expensive lat/lon range scans with fast integer lookups.\n", + "\n", + "The iSamples wide parquet file with H3 indices adds three BIGINT columns \u2014 `h3_res4`, `h3_res6`,\n", + "`h3_res8` \u2014 covering ~11.96M of 20.7M rows (those with valid coordinates).\n", + "\n", + "Below we benchmark **baseline lat/lon filtering** vs **H3 res4 pre-filtering** for a bounding-box\n", + "query, show H3 cell statistics, and render the H3-indexed data on a Lonboard map." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-setup-cell", + "metadata": {}, + "outputs": [], + "source": [ + "import duckdb\n", + "import time\n", + "\n", + "# Data URLs\n", + "WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n", + "WIDE_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet\"\n", + "\n", + "con_h3 = duckdb.connect()\n", + "con_h3.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n", + "print(\"DuckDB H3 extension loaded.\")" + ] + }, + { + "cell_type": "markdown", + "id": "h3-stats-header", + "metadata": {}, + "source": [ + "### H3 Cell Distribution Statistics\n", + "\n", + "How many distinct hexagonal cells exist at each resolution, and what fraction of rows carry H3 values?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-stats-cell", + "metadata": {}, + "outputs": [], + "source": [ + "h3_stats = con_h3.sql(f\"\"\"\n", + " SELECT\n", + " COUNT(*) AS total_rows,\n", + " COUNT(h3_res4) AS rows_with_h3,\n", + " ROUND(100.0 * COUNT(h3_res4) / COUNT(*), 1) AS pct_with_h3,\n", + " COUNT(DISTINCT h3_res4) AS distinct_res4,\n", + " COUNT(DISTINCT h3_res6) AS distinct_res6,\n", + " COUNT(DISTINCT h3_res8) AS distinct_res8\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + "\"\"\").df()\n", + "\n", + "print(\"H3 Cell Distribution\")\n", + "print(\"=\" * 50)\n", + "print(f\"Total rows: {h3_stats['total_rows'].iloc[0]:>12,}\")\n", + "print(f\"Rows with H3: {h3_stats['rows_with_h3'].iloc[0]:>12,} ({h3_stats['pct_with_h3'].iloc[0]}%)\")\n", + "print(f\"Distinct res4 cells: {h3_stats['distinct_res4'].iloc[0]:>12,} (~22 km edge)\")\n", + "print(f\"Distinct res6 cells: {h3_stats['distinct_res6'].iloc[0]:>12,} (~3.2 km edge)\")\n", + "print(f\"Distinct res8 cells: {h3_stats['distinct_res8'].iloc[0]:>12,} (~460 m edge)\")" + ] + }, + { + "cell_type": "markdown", + "id": "h3-benchmark-header", + "metadata": {}, + "source": [
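The `h3_resN` columns store H3 indices as signed `BIGINT`s, while H3 indices are natively unsigned 64-bit values usually written as hex strings. A stdlib-only sketch of the two's-complement mapping this storage choice implies (the index value below is illustrative, not taken from the dataset):

```python
def h3_hex_to_signed(cell_hex: str) -> int:
    # Map an unsigned 64-bit H3 index into the signed int64 (BIGINT) range.
    val = int(cell_hex, 16)
    return val if val < 2**63 else val - 2**64

def signed_to_h3_hex(signed: int) -> str:
    # Inverse mapping: recover the canonical lowercase hex form.
    return format(signed % 2**64, 'x')

cell = "8428309ffffffff"  # illustrative res-4-style index
signed = h3_hex_to_signed(cell)
assert signed_to_h3_hex(signed) == cell  # lossless round-trip
```

The round-trip is lossless in both directions, so cells computed in Python can be matched against the BIGINT columns and vice versa.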
+ "### Bbox Benchmark: Lat/Lon Range Scan vs H3 Pre-Filter\n", + "\n", + "We query samples inside a bounding box (San Francisco Bay Area) two ways:\n", + "\n", + "1. **Baseline** \u2014 filter on `latitude BETWEEN ... AND longitude BETWEEN ...`\n", + "2. **H3 pre-filter** \u2014 find the set of res4 cells that overlap the bbox, then filter by those cells\n", + " before applying the exact lat/lon check.\n", + "\n", + "The H3 approach narrows the scan to a small number of hexagonal cells, reducing I/O on remote\n", + "parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-benchmark-cell", + "metadata": {}, + "outputs": [], + "source": [ + "import h3\n", + "\n", + "# Bay Area bounding box\n", + "BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)\n", + "\n", + "# --- Baseline: raw lat/lon range scan ---\n", + "t0 = time.time()\n", + "baseline = con_h3.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n", + " AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n", + "\"\"\").df()\n", + "baseline_ms = (time.time() - t0) * 1000\n", + "\n", + "# --- H3 pre-filter: compute covering cells mathematically (no data scan) ---\n", + "t0 = time.time()\n", + "\n", + "# Use h3 Python library to compute all res4 cells covering the bbox.\n", + "# This is O(1) relative to dataset size \u2014 pure geometry, no I/O.\n", + "bbox_polygon = h3.LatLngPoly([\n", + " (BBOX['min_lat'], BBOX['min_lon']),\n", + " (BBOX['min_lat'], BBOX['max_lon']),\n", + " (BBOX['max_lat'], BBOX['max_lon']),\n", + " (BBOX['max_lat'], BBOX['min_lon']),\n", + "])\n", + "covering_cells = h3.geo_to_cells(bbox_polygon, res=4)\n", + "# Convert to signed int64 to match DuckDB BIGINT storage\n", + "def h3_to_signed(cell_hex):\n", + " val = h3.str_to_int(cell_hex)\n", + " return val if val < 2**63 else val - 
2**64\n", + "\n", + "cell_list = [str(h3_to_signed(c)) for c in covering_cells]\n", + "print(f'Bbox covered by {len(cell_list)} res4 cells (computed mathematically)')\n", + "\n", + "if not cell_list:\n", + " print('No H3 cells cover this bbox.')\n", + " h3_ms = 0\n", + " h3_result = None\n", + "else:\n", + " cell_sql = ', '.join(cell_list)\n", + " h3_result = con_h3.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND h3_res4 IN ({cell_sql})\n", + " AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n", + " AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n", + " \"\"\").df()\n", + " h3_ms = (time.time() - t0) * 1000\n", + "\n", + "# --- Results ---\n", + "print('Bounding-Box Query Benchmark (Bay Area)')\n", + "print('=' * 50)\n", + "print(f'Baseline (lat/lon scan): {baseline_ms:>8.0f} ms | {baseline[\"n\"].iloc[0]:,} rows')\n", + "if h3_result is not None:\n", + " print(f'H3 res4 pre-filter: {h3_ms:>8.0f} ms | {h3_result[\"n\"].iloc[0]:,} rows')\n", + " speedup = baseline_ms / h3_ms if h3_ms > 0 else float('inf')\n", + " print(f'Speedup: {speedup:>7.1f}x')\n", + " print(f'\\nRow counts match: {baseline[\"n\"].iloc[0] == h3_result[\"n\"].iloc[0]}')\n", + "else:\n", + " print('H3 result: no covering cells')\n" + ] + }, + { + "cell_type": "markdown", + "id": "h3-lonboard-header", + "metadata": {}, + "source": [ + "### Lonboard Visualization with H3-Indexed Data\n", + "\n", + "Render a sample of the H3-indexed dataset to confirm the visualization pipeline works with\n", + "the enriched file." 
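The speedup mechanism in the benchmark above is general: a coarse pre-filter that is an exact superset of the bbox, followed by the precise lat/lon predicate, returns identical rows to a full scan while touching far fewer candidates. A self-contained toy version, with a 1-degree integer grid standing in for H3 cells (no DuckDB or h3 needed):

```python
import random

random.seed(0)
pts = [(random.uniform(-90, 90), random.uniform(-180, 180)) for _ in range(100_000)]

# Coarse index: bucket each point by its 1-degree grid cell (stand-in for h3_res4).
index = {}
for lat, lon in pts:
    index.setdefault((int(lat // 1), int(lon // 1)), []).append((lat, lon))

BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)

def inside(lat, lon):
    return (BBOX['min_lat'] <= lat <= BBOX['max_lat']
            and BBOX['min_lon'] <= lon <= BBOX['max_lon'])

# Baseline: scan every point.
full = sorted(p for p in pts if inside(*p))

# Pre-filter: only buckets that can overlap the bbox, then the exact check.
cells = [(la, lo)
         for la in range(int(BBOX['min_lat'] // 1), int(BBOX['max_lat'] // 1) + 1)
         for lo in range(int(BBOX['min_lon'] // 1), int(BBOX['max_lon'] // 1) + 1)]
prefiltered = sorted(p for c in cells for p in index.get(c, []) if inside(*p))

assert full == prefiltered  # identical results, far fewer candidates inspected
```

Because the covering-cell set is computed from geometry alone, the pre-filter never excludes a matching point, which is why the notebook can check that the two row counts agree.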
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-lonboard-cell", + "metadata": {}, + "outputs": [], + "source": [ + "from lonboard import Map, ScatterplotLayer, BitmapTileLayer\n", + "\n", + "# Sample 100K rows with coordinates from the H3 file\n", + "sample_df = con_h3.sql(f\"\"\"\n", + " SELECT latitude, longitude, n AS source, h3_res4, h3_res6\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE latitude IS NOT NULL AND longitude IS NOT NULL\n", + " USING SAMPLE 100000\n", + "\"\"\").df()\n", + "\n", + "gdf_h3 = gpd.GeoDataFrame(\n", + " sample_df,\n", + " geometry=gpd.points_from_xy(sample_df.longitude, sample_df.latitude),\n", + " crs=\"EPSG:4326\"\n", + ")\n", + "\n", + "# Color by source\n", + "h3_color_map = {\n", + " 'SESAR': [51, 102, 204, 200],\n", + " 'OPENCONTEXT': [220, 57, 18, 200],\n", + " 'GEOME': [16, 150, 24, 200],\n", + " 'SMITHSONIAN': [255, 153, 0, 200],\n", + "}\n", + "default_c = [128, 128, 128, 200]\n", + "colors_arr = np.array([h3_color_map.get(s, default_c) for s in gdf_h3['source']], dtype=np.uint8)\n", + "\n", + "layer = ScatterplotLayer.from_geopandas(gdf_h3, get_fill_color=colors_arr, get_radius=500)\n", + "tile_layer = BitmapTileLayer(\n", + " data=\"https://tile.openstreetmap.org/{z}/{x}/{y}.png\",\n", + " min_zoom=0, max_zoom=19,\n", + ")\n", + "\n", + "m = Map([tile_layer, layer], view_state={\"zoom\": 2, \"latitude\": 20, \"longitude\": 0})\n", + "print(f\"Rendering {len(gdf_h3):,} points from H3-indexed file\")\n", + "m" + ] } ], "metadata": { diff --git a/examples/basic/h3_clustering.ipynb b/examples/basic/h3_clustering.ipynb new file mode 100644 index 0000000..e901d89 --- /dev/null +++ b/examples/basic/h3_clustering.ipynb @@ -0,0 +1,374 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "h3-intro", + "metadata": {}, + "source": [ + "# H3 Clustering for iSamples Geospatial Data\n", + "\n", + "## What is H3?\n", + "\n", + "[H3](https://h3geo.org/) is Uber's **hierarchical hexagonal indexing 
system** that partitions the\n", + "entire Earth's surface into hexagonal cells at 16 resolutions (0-15). Key properties:\n", + "\n", + "- **Hierarchical**: Each parent cell contains ~7 children at the next resolution\n", + "- **Hexagonal**: Cells at a given resolution have roughly equal areas and uniform neighbor distances (12 pentagons per resolution close the tiling on the sphere)\n", + "- **Multi-resolution**: Zoom from continent-scale (res 0, ~4.4M km\u00b2) to sub-meter (res 15)\n", + "\n", + "| Resolution | Hex Edge Length | Approx. Area | Use Case |\n", + "|-----------|----------------|-------------|----------|\n", + "| 4 | ~22 km | ~1,770 km\u00b2 | Country/region overview |\n", + "| 6 | ~3.2 km | ~36 km\u00b2 | City-level clustering |\n", + "| 8 | ~460 m | ~0.74 km\u00b2 | Neighborhood detail |\n", + "\n", + "The iSamples wide parquet with H3 columns (`h3_res4`, `h3_res6`, `h3_res8`) covers\n", + "~11.96M of 20.7M rows \u2014 those with valid latitude/longitude coordinates.\n", + "\n", + "This notebook demonstrates:\n", + "1. H3 cell statistics at multiple resolutions\n", + "2. Cluster visualization colored by dominant source\n", + "3. Multi-resolution comparison\n", + "4. Performance gains of clustering vs raw points\n", + "5. 
Hierarchical drill-down from coarse to fine cells" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-setup", + "metadata": {}, + "outputs": [], + "source": [ + "import duckdb\n", + "import time\n", + "import numpy as np\n", + "import pandas as pd\n", + "import geopandas as gpd\n", + "from shapely.geometry import Point\n", + "\n", + "# Data URL\n", + "WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n", + "\n", + "# Initialize DuckDB with H3 extension\n", + "con = duckdb.connect()\n", + "con.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n", + "\n", + "# Source colors (RGBA)\n", + "SOURCE_COLORS = {\n", + " 'SESAR': [51, 102, 204, 200],\n", + " 'OPENCONTEXT': [220, 57, 18, 200],\n", + " 'GEOME': [16, 150, 24, 200],\n", + " 'SMITHSONIAN': [255, 153, 0, 200],\n", + "}\n", + "DEFAULT_COLOR = [128, 128, 128, 200]\n", + "\n", + "print(\"Setup complete. DuckDB H3 extension loaded.\")" + ] + }, + { + "cell_type": "markdown", + "id": "h3-stats-md", + "metadata": {}, + "source": [ + "## H3 Cell Statistics\n", + "\n", + "How many distinct hexagonal cells exist at each resolution, and how does\n", + "the point distribution look?" 
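The resolution table in the introduction follows from H3's ~7x subdivision per level: average cell area at resolution r is roughly area(0) / 7**r. A quick sanity check against those figures (the 4.36M km2 res-0 constant is an approximate published average, used here only for illustration):

```python
AREA_RES0_KM2 = 4.36e6  # approximate average H3 cell area at resolution 0

def avg_cell_area_km2(res: int) -> float:
    # Each resolution step subdivides a cell into ~7 children,
    # so average area shrinks by a factor of ~7 per level.
    return AREA_RES0_KM2 / 7 ** res

for res in (4, 6, 8):
    print(f"res {res}: ~{avg_cell_area_km2(res):,.2f} km^2")
```

The results land close to the ~1,770 / ~36 / ~0.74 km2 values quoted above; exact per-cell areas vary slightly across the globe.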
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-stats-code", + "metadata": {}, + "outputs": [], + "source": [ + "# Overall H3 coverage stats\n", + "stats = con.sql(f\"\"\"\n", + " SELECT\n", + " COUNT(*) AS total_rows,\n", + " COUNT(h3_res4) AS rows_with_h3,\n", + " ROUND(100.0 * COUNT(h3_res4) / COUNT(*), 1) AS pct_with_h3,\n", + " COUNT(DISTINCT h3_res4) AS distinct_res4,\n", + " COUNT(DISTINCT h3_res6) AS distinct_res6,\n", + " COUNT(DISTINCT h3_res8) AS distinct_res8\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + "\"\"\").df()\n", + "\n", + "print(\"iSamples H3 Coverage\")\n", + "print(\"=\" * 50)\n", + "print(f\"Total rows: {stats['total_rows'].iloc[0]:>12,}\")\n", + "print(f\"Rows with H3: {stats['rows_with_h3'].iloc[0]:>12,} ({stats['pct_with_h3'].iloc[0]}%)\")\n", + "print(f\"Distinct res4 cells: {stats['distinct_res4'].iloc[0]:>12,}\")\n", + "print(f\"Distinct res6 cells: {stats['distinct_res6'].iloc[0]:>12,}\")\n", + "print(f\"Distinct res8 cells: {stats['distinct_res8'].iloc[0]:>12,}\")\n", + "\n", + "# Res6 cluster aggregation (~3.2 km hexagons)\n", + "print(\"\\nComputing res6 clusters...\")\n", + "t0 = time.time()\n", + "clusters = con.sql(f\"\"\"\n", + " SELECT\n", + " h3_res6,\n", + " COUNT(*) AS sample_count,\n", + " AVG(latitude) AS lat,\n", + " AVG(longitude) AS lng,\n", + " MODE(n) AS dominant_source,\n", + " COUNT(DISTINCT n) AS source_count\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE h3_res6 IS NOT NULL\n", + " GROUP BY h3_res6\n", + " ORDER BY sample_count DESC\n", + "\"\"\").df()\n", + "cluster_ms = (time.time() - t0) * 1000\n", + "\n", + "print(f\"Computed {len(clusters):,} res6 clusters in {cluster_ms:.0f} ms\")\n", + "print(f\"\\nTop 10 clusters by sample count:\")\n", + "print(clusters[['h3_res6', 'sample_count', 'dominant_source', 'lat', 'lng']].head(10).to_string(index=False))" + ] + }, + { + "cell_type": "markdown", + "id": "h3-viz-md", + "metadata": {}, + "source": [ + "## Cluster 
Visualization with Lonboard\n", + "\n", + "Each cluster is rendered as a circle at its hexagon centroid. The radius is proportional\n", + "to `log(count)` so that both sparse and dense regions are visible. Colors indicate the\n", + "dominant source within each hex cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-viz-code", + "metadata": {}, + "outputs": [], + "source": [ + "from lonboard import Map, ScatterplotLayer, BitmapTileLayer\n", + "\n", + "# Build GeoDataFrame from clusters\n", + "gdf_clusters = gpd.GeoDataFrame(\n", + " clusters,\n", + " geometry=gpd.points_from_xy(clusters.lng, clusters.lat),\n", + " crs=\"EPSG:4326\"\n", + ")\n", + "\n", + "# Colors based on dominant source\n", + "colors = np.array([\n", + " SOURCE_COLORS.get(s, DEFAULT_COLOR) for s in gdf_clusters['dominant_source']\n", + "], dtype=np.uint8)\n", + "\n", + "# Radius proportional to log(count), scaled for visibility\n", + "radii = (np.log1p(gdf_clusters['sample_count'].values) * 800).astype(np.float32)\n", + "\n", + "layer = ScatterplotLayer.from_geopandas(\n", + " gdf_clusters,\n", + " get_fill_color=colors,\n", + " get_radius=radii,\n", + " opacity=0.7,\n", + " radius_min_pixels=2,\n", + ")\n", + "\n", + "tile_layer = BitmapTileLayer(\n", + " data=\"https://tile.openstreetmap.org/{z}/{x}/{y}.png\",\n", + " min_zoom=0, max_zoom=19,\n", + ")\n", + "\n", + "m = Map([tile_layer, layer], view_state={\"zoom\": 2, \"latitude\": 20, \"longitude\": 0})\n", + "print(f\"Rendering {len(gdf_clusters):,} hex clusters (from {clusters['sample_count'].sum():,} samples)\")\n", + "print(f\"Color legend: Blue=SESAR, Red=OpenContext, Green=GEOME, Orange=Smithsonian\")\n", + "m" + ] + }, + { + "cell_type": "markdown", + "id": "h3-multires-md", + "metadata": {}, + "source": [ + "## Multi-Resolution Comparison\n", + "\n", + "Compare clustering at res4 (regional), res6 (city), and res8 (neighborhood) to understand\n", + "how granularity affects cluster count and aggregation 
behavior." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-multires-code", + "metadata": {}, + "outputs": [], + "source": [ + "resolutions = []\n", + "\n", + "for res, col in [(4, 'h3_res4'), (6, 'h3_res6'), (8, 'h3_res8')]:\n", + " t0 = time.time()\n", + " res_stats = con.sql(f\"\"\"\n", + " SELECT\n", + " {res} AS resolution,\n", + " COUNT(DISTINCT {col}) AS num_clusters,\n", + " COUNT(*) AS total_points,\n", + " ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT {col}), 1) AS avg_points_per_cluster,\n", + " MAX(ct) AS max_cluster_size,\n", + " MEDIAN(ct) AS median_cluster_size\n", + " FROM (\n", + " SELECT {col}, COUNT(*) AS ct\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE {col} IS NOT NULL\n", + " GROUP BY {col}\n", + " )\n", + " \"\"\").df()\n", + " elapsed = (time.time() - t0) * 1000\n", + " res_stats['query_ms'] = round(elapsed)\n", + " resolutions.append(res_stats)\n", + "\n", + "comparison = pd.concat(resolutions, ignore_index=True)\n", + "print(\"Multi-Resolution Comparison\")\n", + "print(\"=\" * 70)\n", + "print(comparison.to_string(index=False))\n", + "\n", + "print(\"\\nGuidance:\")\n", + "print(\" res4 \u2014 Best for global/continental overviews (few clusters, fast)\")\n", + "print(\" res6 \u2014 Good balance for city-level exploration (~3.2 km hexagons)\")\n", + "print(\" res8 \u2014 Detailed neighborhood view (~460 m hexagons, more clusters)\")" + ] + }, + { + "cell_type": "markdown", + "id": "h3-perf-md", + "metadata": {}, + "source": [ + "## Performance: Clusters vs Full Points\n", + "\n", + "Rendering all ~12M points individually overwhelms both the query engine and the browser.\n", + "Clustering via H3 reduces data volume by orders of magnitude while preserving spatial patterns." 
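A stripped-down analog of that `GROUP BY h3_res6` aggregation shows why the reduction is lossless for counts: cluster points by a coarse key, keep one centroid and count per cluster, and the render payload shrinks while the totals are preserved. Sketch with rounded coordinates standing in for an H3 cell id:

```python
from collections import defaultdict
import random

random.seed(1)
# Dense synthetic point cloud around one region.
pts = [(random.gauss(37.5, 0.5), random.gauss(-122.2, 0.5)) for _ in range(50_000)]

clusters = defaultdict(list)
for lat, lon in pts:
    # Rounded coordinates stand in for an H3 cell id at a fixed resolution.
    clusters[(round(lat, 1), round(lon, 1))].append((lat, lon))

# One (count, centroid) row per cluster -- the rendering payload.
summary = {
    key: (len(members),
          sum(p[0] for p in members) / len(members),
          sum(p[1] for p in members) / len(members))
    for key, members in clusters.items()
}

assert sum(n for n, _, _ in summary.values()) == len(pts)  # nothing lost
print(f"{len(pts):,} points -> {len(summary):,} clusters")
```

The choice of key resolution is the same trade-off the multi-resolution comparison quantifies: coarser keys mean fewer rows but blurrier spatial patterns.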
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-perf-code", + "metadata": {}, + "outputs": [], + "source": [ + "# Benchmark: clustered query vs full point query\n", + "\n", + "# Full points (limited to count to avoid browser crash)\n", + "t0 = time.time()\n", + "full_count = con.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE latitude IS NOT NULL\n", + "\"\"\").fetchone()[0]\n", + "full_ms = (time.time() - t0) * 1000\n", + "\n", + "# Clustered at res6\n", + "t0 = time.time()\n", + "cluster_count = con.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n FROM (\n", + " SELECT h3_res6, COUNT(*) AS ct,\n", + " AVG(latitude) AS lat, AVG(longitude) AS lng\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE h3_res6 IS NOT NULL\n", + " GROUP BY h3_res6\n", + " )\n", + "\"\"\").fetchone()[0]\n", + "cluster_ms = (time.time() - t0) * 1000\n", + "\n", + "reduction = full_count / cluster_count if cluster_count > 0 else 0\n", + "\n", + "print(\"Performance Comparison\")\n", + "print(\"=\" * 50)\n", + "print(f\"Full points: {full_count:>12,} rows ({full_ms:>6.0f} ms)\")\n", + "print(f\"Res6 clusters: {cluster_count:>12,} rows ({cluster_ms:>6.0f} ms)\")\n", + "print(f\"Data reduction: {reduction:>11.0f}x fewer rows to render\")\n", + "print(f\"\\nClustering at res6 reduces rendering payload from ~{full_count/1e6:.1f}M to\")\n", + "print(f\"~{cluster_count/1e3:.0f}K points \u2014 enabling smooth interactive maps.\")" + ] + }, + { + "cell_type": "markdown", + "id": "h3-drilldown-md", + "metadata": {}, + "source": [ + "## Hierarchical Drill-Down: res4 -> res6 -> res8\n", + "\n", + "H3's hierarchy means every res4 cell contains ~49 res6 children, and each res6 cell\n", + "contains ~49 res8 children. This enables progressive drill-down \u2014 start with a coarse\n", + "view, then zoom into regions of interest at finer resolution." 
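The ~49 figure comes straight from the hierarchy's branching factor: each step down subdivides a cell into ~7 children, so k steps yield roughly 7**k descendants. A one-liner makes the arithmetic explicit (approximate, since H3's aperture-7 tiling is not an exact containment hierarchy):

```python
def approx_descendants(res_from: int, res_to: int) -> int:
    # H3 subdivides each cell into ~7 children per resolution step,
    # so k steps down give roughly 7**k descendants.
    assert res_to >= res_from
    return 7 ** (res_to - res_from)

print(approx_descendants(4, 6))  # res4 -> res6: ~49
print(approx_descendants(6, 8))  # res6 -> res8: ~49
```

Actual child counts in the drill-down below will differ because samples cluster unevenly and many child cells are empty.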
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "h3-drilldown-code", + "metadata": {}, + "outputs": [], + "source": [ + "# Pick the largest res4 cell and drill down\n", + "top_res4 = con.sql(f\"\"\"\n", + " SELECT h3_res4, COUNT(*) AS cnt\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE h3_res4 IS NOT NULL\n", + " GROUP BY h3_res4\n", + " ORDER BY cnt DESC\n", + " LIMIT 1\n", + "\"\"\").fetchone()\n", + "\n", + "parent_cell = top_res4[0]\n", + "parent_count = top_res4[1]\n", + "print(f\"Largest res4 cell: {parent_cell} ({parent_count:,} samples)\")\n", + "\n", + "# Drill into res6 children\n", + "res6_children = con.sql(f\"\"\"\n", + " SELECT h3_res6, COUNT(*) AS cnt, MODE(n) AS dominant_source,\n", + " AVG(latitude) AS lat, AVG(longitude) AS lng\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE h3_res4 = {parent_cell}\n", + " AND h3_res6 IS NOT NULL\n", + " GROUP BY h3_res6\n", + " ORDER BY cnt DESC\n", + "\"\"\").df()\n", + "\n", + "print(f\"\\nRes4 -> Res6: {len(res6_children)} child cells\")\n", + "print(res6_children.head(10).to_string(index=False))\n", + "\n", + "# Pick the top res6 child and drill into res8\n", + "if len(res6_children) > 0:\n", + " top_res6 = res6_children.iloc[0]\n", + " res8_children = con.sql(f\"\"\"\n", + " SELECT h3_res8, COUNT(*) AS cnt, MODE(n) AS dominant_source,\n", + " AVG(latitude) AS lat, AVG(longitude) AS lng\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE h3_res6 = {int(top_res6['h3_res6'])}\n", + " AND h3_res8 IS NOT NULL\n", + " GROUP BY h3_res8\n", + " ORDER BY cnt DESC\n", + " \"\"\").df()\n", + "\n", + " print(f\"\\nRes6 -> Res8: {len(res8_children)} child cells\")\n", + " print(f\"(from res6 cell {int(top_res6['h3_res6'])} with {int(top_res6['cnt']):,} samples)\")\n", + " print(res8_children.head(10).to_string(index=False))\n", + "\n", + " print(f\"\\nDrill-down summary:\")\n", + " print(f\" res4: 1 cell -> {parent_count:,} samples\")\n", + " print(f\" res6: 
{len(res6_children)} cells -> {res6_children['cnt'].sum():,} samples\")\n", + " print(f\" res8: {len(res8_children)} cells -> {res8_children['cnt'].sum():,} samples (from top res6)\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/examples/basic/isamples_explorer.ipynb b/examples/basic/isamples_explorer.ipynb index 6e4ac8a..f5d7198 100644 --- a/examples/basic/isamples_explorer.ipynb +++ b/examples/basic/isamples_explorer.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# iSamples Interactive Explorer\n\nAn interactive interface for exploring iSamples data across all sources.\n\n**Features:**\n- Map view with 6M+ samples (lonboard WebGL)\n- Interactive table with filtering (ipydatagrid)\n- Sample cards on selection\n- **Faceted filtering**: Filter by source, material type, and time period\n - Hierarchical material types with rollup\n - **Decade quick-select** for time filtering\n- **Fulltext search**: Search label, description, place name with ranked results\n- **Bidirectional selection sync**: Click map → highlights table row; click table → recenters map\n- **Viewport Mode**: Dynamic loading based on pan/zoom (with loading indicator)\n- **Adaptive sampling**: More points when zoomed in, fewer when zoomed out\n\n**Data:** Zenodo wide parquet (~282 MB, 20M rows)" + "source": "# iSamples Interactive Explorer\n\nAn interactive interface for exploring iSamples data across all sources.\n\n**Features:**\n- Map view with 6M+ samples (lonboard WebGL)\n- Interactive table with filtering (ipydatagrid)\n- Sample cards on selection\n- **Faceted filtering**: Filter by source, material type, and time period\n - Hierarchical material types with rollup\n - **Decade quick-select** for time filtering\n- **Fulltext search**: 
Search label, description, place name with ranked results\n- **Bidirectional selection sync**: Click map \u2192 highlights table row; click table \u2192 recenters map\n- **Viewport Mode**: Dynamic loading based on pan/zoom (with loading indicator)\n- **Adaptive sampling**: More points when zoomed in, fewer when zoomed out\n\n**Data:** Zenodo wide parquet (~282 MB, 20M rows)" }, { "cell_type": "code", @@ -40,12 +40,29 @@ "LOCAL_WIDE = os.path.expanduser(\"~/Data/iSample/pqg_refining/zenodo_wide_2026-01-09.parquet\")\n", "REMOTE_WIDE = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide.parquet\"\n", "\n", + "# Pre-computed facet summaries (2KB - loads instantly)\n", + "FACET_SUMMARIES_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_summaries.parquet\"\n", + "FACET_CROSS_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_cross.parquet\"\n", + "\n", "# Use local if available\n", "PARQUET_PATH = LOCAL_WIDE if os.path.exists(LOCAL_WIDE) else REMOTE_WIDE\n", "print(f\"Using: {PARQUET_PATH}\")\n", "\n", "# Connect to DuckDB\n", - "con = duckdb.connect()" + "con = duckdb.connect()\n", + "\n", + "# Load facet summaries at startup (instant - only 2KB)\n", + "import time as _time\n", + "_t0 = _time.time()\n", + "FACET_SUMMARIES_DF = con.sql(f\"\"\"\n", + " SELECT facet_type, facet_value, scheme, count\n", + " FROM read_parquet('{FACET_SUMMARIES_URL}')\n", + " ORDER BY facet_type, count DESC\n", + "\"\"\").df()\n", + "_summary_ms = (_time.time() - _t0) * 1000\n", + "print(f\"Facet summaries loaded in {_summary_ms:.0f} ms ({len(FACET_SUMMARIES_DF)} rows)\")\n", + "print(f\"Available facets: {sorted(FACET_SUMMARIES_DF['facet_type'].unique())}\")\n", + "" ] }, { @@ -282,7 +299,1479 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": "# State management\nclass ExplorerState:\n def __init__(self):\n self.selected_index = None\n self.selected_row = None\n self.current_gdf = 
None\n self.viewport_mode = False\n self.debounce_timer = None\n self.loading = False\n self.syncing_selection = False # Prevent infinite loops\n self.current_search = \"\" # Current search term\n # Facet filter state\n self.source_filters = set() # Selected sources (empty = all)\n self.material_filters = set() # Selected material URIs (full URIs for filtering)\n self.year_range = (None, None) # (min_year, max_year) or None for no filter\n self.selected_decades = set() # Selected decades for quick filter\n self.material_rollup = True # Whether to include children when parent selected\n # Facet counts cache\n self.facet_counts_cache = {}\n self.facet_cache_time = 0\n\nstate = ExplorerState()\nstate.current_gdf = samples_gdf\n\n\n# =============================================================================\n# Material Hierarchy Definition\n# =============================================================================\n\n# iSamples material vocabulary hierarchy (3 levels)\n# Structure: (display_name, uri_suffix, children_list)\n# uri_suffix maps to actual URIs in the data\n\nMATERIAL_HIERARCHY = [\n (\"Material\", \"material\", [\n (\"Natural Solid Material\", \"naturalsolidmaterial\", [\n (\"Earth Material\", \"earthmaterial\", [\n (\"Rock\", \"rock\", []),\n (\"Sediment\", \"sediment\", []),\n (\"Soil\", \"soil\", []),\n (\"Mineral\", \"mineral\", []),\n (\"Mixed Soil/Sediment/Rock\", \"mixedsoilsedimentrock\", []),\n ]),\n (\"Biogenic Non-organic\", \"biogenicnonorganicmaterial\", []),\n ]),\n (\"Organic Material\", \"organicmaterial\", [\n (\"Plant Material\", \"plantmaterial\", []),\n (\"Animal Product\", \"organicanimalproduct\", []),\n ]),\n (\"Anthropogenic Material\", \"anyanthropogenicmaterial\", [\n (\"Anthropogenic Metal\", \"anthropogenicmetal\", []),\n (\"Ceramic Clay\", \"ceramicclay\", []),\n ]),\n (\"Fluid Material\", \"fluidmaterial\", [\n (\"Liquid Water\", \"liquidwater\", []),\n (\"Gas\", \"gas\", []),\n (\"Non-aqueous Liquid\", 
\"nonaqueousliquid\", []),\n ]),\n (\"Dispersed Media\", \"dispersedmedia\", [\n (\"Particulate\", \"particulate\", []),\n ]),\n (\"Any Ice\", \"anyice\", []),\n ])\n]\n\n\ndef build_hierarchy_mappings():\n \"\"\"\n Build mappings for the material hierarchy.\n\n Returns:\n - suffix_to_children: dict mapping uri_suffix -> list of all descendant suffixes\n - display_order: list of (display_label, uri_suffix, indent_level) in tree order\n \"\"\"\n suffix_to_children = {}\n display_order = []\n\n def collect_descendants(nodes):\n \"\"\"Get all descendant suffixes from a list of hierarchy nodes.\"\"\"\n descendants = []\n for name, suffix, children in nodes:\n descendants.append(suffix)\n descendants.extend(collect_descendants(children))\n return descendants\n\n def walk(nodes, level=0):\n \"\"\"Walk hierarchy building mappings.\"\"\"\n for name, suffix, children in nodes:\n # All descendants (not including self)\n suffix_to_children[suffix] = collect_descendants(children)\n\n # Add to display order with indentation\n display_order.append((name, suffix, level))\n\n # Recurse into children\n walk(children, level + 1)\n\n walk(MATERIAL_HIERARCHY)\n return suffix_to_children, display_order\n\n\n# Build hierarchy mappings at module load\nSUFFIX_TO_CHILDREN, HIERARCHY_DISPLAY_ORDER = build_hierarchy_mappings()\n\n\n# =============================================================================\n# Facet Query Functions\n# =============================================================================\n\nimport re\nimport time as time_module\n\ndef uri_to_display_name(uri):\n \"\"\"\n Convert a vocabulary URI to a human-readable display name.\n\n Examples:\n https://w3id.org/isample/vocabulary/material/1.0/rock -> Rock\n https://w3id.org/isample/opencontext/material/0.1/ceramicclay -> Ceramic Clay\n \"\"\"\n if not uri or not isinstance(uri, str):\n return str(uri)\n\n # Extract last path segment\n name = uri.rstrip('/').split('/')[-1]\n\n # Insert spaces before uppercase 
letters (camelCase -> Camel Case)\n name = re.sub(r'([a-z])([A-Z])', r'\\1 \\2', name)\n\n # Capitalize first letter of each word\n name = name.title()\n\n return name\n\n\ndef uri_to_suffix(uri):\n \"\"\"Extract the suffix (last path segment) from a URI, lowercased.\"\"\"\n if not uri or not isinstance(uri, str):\n return \"\"\n return uri.rstrip('/').split('/')[-1].lower()\n\n\ndef get_source_counts(additional_filters=None):\n \"\"\"\n Get counts of samples by source.\n\n Args:\n additional_filters: Dict with material_filters, year_range keys\n\n Returns:\n Dict of {source_name: count}\n \"\"\"\n where_clause = \"WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\"\n\n # Apply year filter if present (cast result_time to TIMESTAMP)\n if additional_filters and additional_filters.get('year_range'):\n yr = additional_filters['year_range']\n if yr[0] is not None and yr[1] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) BETWEEN {yr[0]} AND {yr[1]}\"\n elif yr[0] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) >= {yr[0]}\"\n elif yr[1] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) <= {yr[1]}\"\n\n # Note: We don't filter by material here to show all source options\n query = f\"\"\"\n SELECT n as source, COUNT(*) as count\n FROM read_parquet('{PARQUET_PATH}')\n {where_clause}\n GROUP BY n ORDER BY count DESC\n \"\"\"\n\n result = con.sql(query).df()\n return dict(zip(result['source'], result['count']))\n\n\ndef get_all_material_counts():\n \"\"\"\n Get counts for ALL materials in the hierarchy (not just top N).\n Used for building the hierarchical display.\n\n Returns:\n Dict of {uri_suffix: {'uri': full_uri, 'count': count}}\n \"\"\"\n query = f\"\"\"\n WITH samples AS (\n SELECT msr.row_id, UNNEST(msr.p__has_material_category) as material_id\n FROM read_parquet('{PARQUET_PATH}') msr\n WHERE msr.otype = 
'MaterialSampleRecord' AND msr.latitude IS NOT NULL\n )\n SELECT ic.label as material_uri, COUNT(*) as count\n FROM samples s\n JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = s.material_id\n WHERE ic.label IS NOT NULL\n GROUP BY ic.label\n ORDER BY count DESC\n \"\"\"\n\n try:\n result = con.sql(query).df()\n # Build dict keyed by suffix\n materials = {}\n for _, row in result.iterrows():\n uri = row['material_uri']\n suffix = uri_to_suffix(uri)\n if suffix:\n materials[suffix] = {\n 'uri': uri,\n 'count': row['count']\n }\n return materials\n except Exception as e:\n print(f\"Material count error: {e}\")\n return {}\n\n\n# =============================================================================\n# Accurate Rollup Counts (with caching)\n# =============================================================================\n\n# Cache for accurate rollup counts\n_rollup_cache = {\n 'direct_counts': {}, # suffix -> direct count\n 'rollup_counts': {}, # suffix -> accurate distinct rollup count\n 'computed_at': None, # timestamp\n 'existing_suffixes': set() # suffixes that exist in data\n}\n\n\ndef compute_accurate_rollup_counts(force_refresh=False):\n \"\"\"\n Compute accurate rollup counts using DISTINCT sample counting.\n \n This avoids double-counting samples tagged at multiple hierarchy levels.\n Results are cached for performance.\n \n Args:\n force_refresh: If True, recompute even if cache exists\n \n Returns:\n Tuple of (direct_counts, rollup_counts, elapsed_time)\n - direct_counts: Dict of {suffix: count} for samples tagged exactly at that level\n - rollup_counts: Dict of {suffix: count} for samples tagged at that level OR any descendant\n \"\"\"\n global _rollup_cache\n \n # Return cached results if available and not forcing refresh\n if not force_refresh and _rollup_cache['computed_at'] is not None:\n return (_rollup_cache['direct_counts'], \n _rollup_cache['rollup_counts'], \n 0.0)\n \n print(\"Computing accurate rollup counts (this takes ~3 
seconds)...\")\n t0 = time_module.time()\n \n # Build temp table with sample-material pairs\n build_query = f\"\"\"\n CREATE OR REPLACE TEMP TABLE sample_materials AS\n SELECT DISTINCT \n msr.row_id as sample_id,\n LOWER(SPLIT_PART(ic.label, '/', -1)) as suffix,\n ic.label as uri\n FROM read_parquet('{PARQUET_PATH}') msr\n CROSS JOIN UNNEST(msr.p__has_material_category) AS t(mat_id)\n JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = mat_id\n WHERE msr.otype = 'MaterialSampleRecord' \n AND msr.latitude IS NOT NULL\n AND ic.label IS NOT NULL\n \"\"\"\n con.sql(build_query)\n \n # Get existing suffixes\n suffix_df = con.sql(\"SELECT DISTINCT suffix FROM sample_materials\").df()\n existing_suffixes = set(suffix_df['suffix'].tolist())\n \n # Get direct counts\n direct_df = con.sql(\"\"\"\n SELECT suffix, COUNT(*) as direct_count\n FROM sample_materials\n GROUP BY suffix\n \"\"\").df()\n direct_counts = dict(zip(direct_df['suffix'], direct_df['direct_count']))\n \n # Compute accurate rollup for each suffix\n rollup_counts = {}\n for suffix in existing_suffixes:\n if suffix not in SUFFIX_TO_CHILDREN:\n # Unknown suffix (not in our hierarchy), just use direct count\n rollup_counts[suffix] = direct_counts.get(suffix, 0)\n else:\n # Include self + all descendants that exist in data\n all_suffixes = [suffix] + [s for s in SUFFIX_TO_CHILDREN[suffix] if s in existing_suffixes]\n suffixes_sql = \", \".join(f\"'{s}'\" for s in all_suffixes)\n count_query = f\"SELECT COUNT(DISTINCT sample_id) FROM sample_materials WHERE suffix IN ({suffixes_sql})\"\n count = con.sql(count_query).fetchone()[0]\n rollup_counts[suffix] = count\n \n elapsed = time_module.time() - t0\n \n # Update cache\n _rollup_cache['direct_counts'] = direct_counts\n _rollup_cache['rollup_counts'] = rollup_counts\n _rollup_cache['computed_at'] = time_module.time()\n _rollup_cache['existing_suffixes'] = existing_suffixes\n \n print(f\" Computed {len(rollup_counts)} rollup counts in {elapsed:.2f}s\")\n \n return 
direct_counts, rollup_counts, elapsed\n\n\ndef get_cached_rollup_counts():\n \"\"\"Get cached rollup counts, computing if necessary.\"\"\"\n if _rollup_cache['computed_at'] is None:\n compute_accurate_rollup_counts()\n return _rollup_cache['direct_counts'], _rollup_cache['rollup_counts']\n\n\ndef get_material_counts(additional_filters=None, limit=50):\n \"\"\"\n Get counts of samples by material category (requires join for labels).\n\n Args:\n additional_filters: Dict with source_filters, year_range keys\n limit: Max number of materials to return\n\n Returns:\n Dict of {display_name: {'uri': full_uri, 'count': count}}\n \"\"\"\n where_clause = \"WHERE msr.otype = 'MaterialSampleRecord' AND msr.latitude IS NOT NULL\"\n\n # Apply source filter if present\n if additional_filters and additional_filters.get('source_filters'):\n sources_sql = \", \".join(f\"'{s}'\" for s in additional_filters['source_filters'])\n where_clause += f\" AND msr.n IN ({sources_sql})\"\n\n # Apply year filter if present (cast result_time to TIMESTAMP)\n if additional_filters and additional_filters.get('year_range'):\n yr = additional_filters['year_range']\n if yr[0] is not None and yr[1] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) BETWEEN {yr[0]} AND {yr[1]}\"\n elif yr[0] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) >= {yr[0]}\"\n elif yr[1] is not None:\n where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) <= {yr[1]}\"\n\n query = f\"\"\"\n WITH samples AS (\n SELECT msr.row_id, UNNEST(msr.p__has_material_category) as material_id\n FROM read_parquet('{PARQUET_PATH}') msr\n {where_clause}\n )\n SELECT ic.label as material_uri, COUNT(*) as count\n FROM samples s\n JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = s.material_id\n WHERE ic.label IS NOT NULL\n GROUP BY ic.label\n ORDER BY count DESC\n LIMIT {limit}\n \"\"\"\n\n try:\n result = 
con.sql(query).df()\n # Return dict with display name as key, containing uri and count\n materials = {}\n for _, row in result.iterrows():\n uri = row['material_uri']\n display_name = uri_to_display_name(uri)\n materials[display_name] = {\n 'uri': uri,\n 'count': row['count']\n }\n return materials\n except Exception as e:\n print(f\"Material count error: {e}\")\n return {}\n\n\ndef get_year_range_stats():\n \"\"\"\n Get min/max years and decade counts for time facet.\n\n Returns:\n Dict with 'min_year', 'max_year', 'decades' (dict of decade: count)\n \"\"\"\n # Cast result_time to TIMESTAMP before extracting year\n query = f\"\"\"\n SELECT\n MIN(EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)))::INT as min_year,\n MAX(EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)))::INT as max_year\n FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord'\n AND latitude IS NOT NULL\n AND result_time IS NOT NULL\n AND TRY_CAST(result_time AS TIMESTAMP) IS NOT NULL\n \"\"\"\n\n stats = con.sql(query).df().iloc[0]\n\n # Get decade counts\n decade_query = f\"\"\"\n SELECT\n (EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP))::INT / 10) * 10 as decade,\n COUNT(*) as count\n FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord'\n AND latitude IS NOT NULL\n AND result_time IS NOT NULL\n AND TRY_CAST(result_time AS TIMESTAMP) IS NOT NULL\n GROUP BY decade\n ORDER BY decade\n \"\"\"\n\n decades_df = con.sql(decade_query).df()\n decades = dict(zip(decades_df['decade'].astype(int), decades_df['count']))\n\n return {\n 'min_year': int(stats['min_year']) if pd.notna(stats['min_year']) else 1900,\n 'max_year': int(stats['max_year']) if pd.notna(stats['max_year']) else 2025,\n 'decades': decades\n }\n\n\ndef expand_material_filters_with_rollup(selected_uris, all_material_counts):\n \"\"\"\n Expand selected URIs to include all descendant URIs (rollup).\n\n Args:\n selected_uris: Set of user-selected material URIs\n all_material_counts: Dict 
from get_all_material_counts()\n\n Returns:\n Set of URIs including selected + all descendants\n \"\"\"\n expanded = set(selected_uris)\n\n for uri in selected_uris:\n suffix = uri_to_suffix(uri)\n if suffix in SUFFIX_TO_CHILDREN:\n # Add all descendant URIs\n for child_suffix in SUFFIX_TO_CHILDREN[suffix]:\n # Find the full URI for this suffix\n if child_suffix in all_material_counts:\n expanded.add(all_material_counts[child_suffix]['uri'])\n\n return expanded\n\n\n# =============================================================================\n# Facet Widgets\n# =============================================================================\n\n# Get initial counts for facets\nprint(\"Loading facet counts...\")\ninitial_source_counts = get_source_counts()\nall_material_counts = get_all_material_counts() # Get all for hierarchy\ndirect_counts, rollup_counts = get_cached_rollup_counts() # Compute accurate rollup\nyear_stats = get_year_range_stats()\n\nprint(f\"Sources: {list(initial_source_counts.keys())}\")\nprint(f\"Materials: {len(all_material_counts)} types found in data\")\nprint(f\"Years: {year_stats['min_year']} - {year_stats['max_year']}\")\nprint(f\"Decades with data: {sorted(year_stats['decades'].keys())}\")\n\n\ndef build_hierarchical_material_options(material_counts, direct_counts_dict, rollup_counts_dict, use_rollup=True):\n \"\"\"\n Build SelectMultiple options showing material hierarchy with indentation.\n\n Args:\n material_counts: Dict from get_all_material_counts() keyed by suffix (for URIs)\n direct_counts_dict: Dict of {suffix: direct_count}\n rollup_counts_dict: Dict of {suffix: accurate_rollup_count}\n use_rollup: If True, show rollup totals; if False, show direct counts\n\n Returns:\n - options: List of option labels (with indentation and counts)\n - uri_map: Dict mapping option label -> full URI\n \"\"\"\n options = []\n uri_map = {}\n\n # Indentation characters\n INDENT = \" \" # 4 spaces per level\n\n for display_name, suffix, level in 
HIERARCHY_DISPLAY_ORDER:\n if suffix in material_counts:\n data = material_counts[suffix]\n uri = data['uri']\n\n # Choose count based on rollup setting\n if use_rollup:\n count = rollup_counts_dict.get(suffix, direct_counts_dict.get(suffix, 0))\n else:\n count = direct_counts_dict.get(suffix, 0)\n\n # Create indented label\n indent = INDENT * level\n label = f\"{indent}{display_name} ({count:,})\"\n\n options.append(label)\n uri_map[label] = uri\n\n return options, uri_map\n\n\n# Build initial hierarchical material options (with rollup ON by default)\nmaterial_options, material_uri_map = build_hierarchical_material_options(\n all_material_counts, direct_counts, rollup_counts, use_rollup=True\n)\nprint(f\"Hierarchy has {len(material_options)} materials with data\")\n\n\ndef create_source_checkboxes(counts):\n \"\"\"Create checkboxes for source facet.\"\"\"\n checkboxes = []\n for source, count in counts.items():\n cb = widgets.Checkbox(\n value=False,\n description=f\"{source} ({count:,})\",\n indent=False,\n layout=widgets.Layout(width='100%', margin='2px 0')\n )\n cb.source_name = source # Store source name for easy access\n checkboxes.append(cb)\n return checkboxes\n\n\n# Create source checkboxes\nsource_checkboxes = create_source_checkboxes(initial_source_counts)\nsource_facet_box = widgets.VBox(\n source_checkboxes,\n layout=widgets.Layout(max_height='180px', overflow_y='auto', padding='5px')\n)\n\n# Use SelectMultiple for materials - hierarchical display with indentation\nmaterial_select = widgets.SelectMultiple(\n options=material_options,\n value=[],\n rows=12, # More rows to see hierarchy\n description='',\n layout=widgets.Layout(width='100%', height='280px'),\n style={'description_width': '0px'}\n)\n\n# Rollup toggle\nrollup_toggle = widgets.Checkbox(\n value=True,\n description='Include subcategories',\n indent=False,\n layout=widgets.Layout(margin='5px 0'),\n style={'description_width': 'initial'}\n)\n\n# Refresh counts button\nrefresh_counts_btn = 
widgets.Button(\n description='',\n button_style='',\n icon='refresh',\n tooltip='Recalculate material counts',\n layout=widgets.Layout(width='32px', height='26px')\n)\n\nmaterial_header = widgets.HBox([\n widgets.HTML(\"Ctrl/Cmd+click to multi-select\"),\n refresh_counts_btn\n], layout=widgets.Layout(justify_content='space-between', align_items='center'))\n\nmaterial_facet_box = widgets.VBox([\n material_header,\n rollup_toggle,\n material_select\n], layout=widgets.Layout(padding='5px'))\n\n\n# =============================================================================\n# Time Facet Widgets (with decade checkboxes)\n# =============================================================================\n\ndef create_decade_checkboxes(decade_counts):\n \"\"\"Create checkboxes for decade quick-select.\"\"\"\n checkboxes = []\n # Sort decades and filter to reasonable range (1800-2030)\n sorted_decades = sorted([d for d in decade_counts.keys() if 1800 <= d <= 2030])\n\n for decade in sorted_decades:\n count = decade_counts.get(decade, 0)\n if count > 0: # Only show decades with data\n cb = widgets.Checkbox(\n value=False,\n description=f\"{decade}s ({count:,})\",\n indent=False,\n layout=widgets.Layout(width='100%', margin='1px 0')\n )\n cb.decade_value = decade # Store decade value for easy access\n checkboxes.append(cb)\n return checkboxes\n\n\ndecade_checkboxes = create_decade_checkboxes(year_stats['decades'])\n\n# Container for decade checkboxes (scrollable if many decades)\ndecade_box = widgets.VBox(\n decade_checkboxes,\n layout=widgets.Layout(max_height='140px', overflow_y='auto', padding='2px')\n)\n\ndecade_label = widgets.HTML(\"Quick select decades:\")\n\nyear_slider = widgets.IntRangeSlider(\n value=[year_stats['min_year'], year_stats['max_year']],\n min=year_stats['min_year'],\n max=year_stats['max_year'],\n step=1,\n description='Range:',\n continuous_update=False,\n layout=widgets.Layout(width='100%'),\n style={'description_width': '50px'}\n)\n\n# 
Enable/disable time filter\ntime_filter_enabled = widgets.Checkbox(\n value=False,\n description='Filter by time',\n indent=False,\n layout=widgets.Layout(margin='5px 0')\n)\n\n# Sync mode: when True, decade checkboxes auto-update the slider\nsync_decades_to_slider = widgets.Checkbox(\n value=True,\n description='Sync decades to slider',\n indent=False,\n layout=widgets.Layout(margin='2px 0'),\n style={'description_width': 'initial'}\n)\n\ntime_facet_box = widgets.VBox([\n time_filter_enabled,\n decade_label,\n decade_box,\n widgets.HTML(\"Fine-tune range:\"),\n year_slider,\n sync_decades_to_slider\n], layout=widgets.Layout(padding='5px'))\n\n\n# Create accordion\nfacet_accordion = widgets.Accordion(\n children=[source_facet_box, material_facet_box, time_facet_box],\n titles=['Sources', 'Material Type', 'Time Period'],\n layout=widgets.Layout(width='100%')\n)\nfacet_accordion.selected_index = None # Start collapsed\n\n# Clear all filters button\nclear_filters_btn = widgets.Button(\n description='Clear All Filters',\n button_style='warning',\n icon='times-circle',\n layout=widgets.Layout(width='100%', margin='10px 0')\n)\n\n\n# =============================================================================\n# Original Widgets (preserved)\n# =============================================================================\n\nsource_filter = widgets.Dropdown(\n options=['All Sources', 'SESAR', 'OPENCONTEXT', 'GEOME', 'SMITHSONIAN'],\n value='All Sources',\n description='Source:',\n style={'description_width': '60px'},\n layout=widgets.Layout(display='none') # Hidden - replaced by facet\n)\n\nsearch_input = widgets.Text(\n value='',\n placeholder='Search label, description, place...',\n description='Search:',\n style={'description_width': '60px'},\n layout=widgets.Layout(width='280px')\n)\n\nsearch_btn = widgets.Button(\n description='',\n button_style='',\n icon='search',\n tooltip='Search (or press Enter)',\n layout=widgets.Layout(width='40px')\n)\n\nclear_search_btn = 
widgets.Button(\n description='',\n button_style='',\n icon='times',\n tooltip='Clear search',\n layout=widgets.Layout(width='40px')\n)\n\nsample_count = widgets.IntSlider(\n value=12500,\n min=1000,\n max=500000, # 500K per source - plenty for 128GB RAM\n step=5000,\n description='Per source:',\n style={'description_width': '80px'}\n)\n\nviewport_toggle = widgets.ToggleButton(\n value=False,\n description='Viewport Mode',\n tooltip='When enabled, automatically loads data for current map view',\n icon='map',\n button_style='' # 'success' when active\n)\n\nrefresh_btn = widgets.Button(\n description='Refresh Data',\n button_style='primary',\n icon='refresh'\n)\n\n# Loading indicator with spinner\nloading_indicator = widgets.HTML(value=\"\")\n\nstatus_label = widgets.HTML(value=f\"Loaded: {len(samples_gdf):,} samples\")\n\ncard_output = widgets.HTML(value=render_sample_card(None))\n\n# Active filters display\nactive_filters_html = widgets.HTML(value=\"\")\n\n\ndef update_active_filters_display():\n \"\"\"Update the display of currently active filters.\"\"\"\n filters = []\n\n if state.source_filters:\n filters.append(f\"Sources: {', '.join(state.source_filters)}\")\n\n if state.material_filters:\n # Get display names from selected options\n mat_names = [uri_to_display_name(uri) for uri in state.material_filters]\n mat_display = mat_names[:2]\n if len(mat_names) > 2:\n mat_display.append(f\"+{len(mat_names)-2} more\")\n rollup_indicator = \" (+sub)\" if state.material_rollup else \"\"\n filters.append(f\"Materials: {', '.join(mat_display)}{rollup_indicator}\")\n\n if state.year_range[0] is not None or state.year_range[1] is not None:\n yr_str = f\"{state.year_range[0] or 'any'} - {state.year_range[1] or 'any'}\"\n # Show selected decades if any\n if state.selected_decades:\n decades_str = ', '.join(f\"{d}s\" for d in sorted(state.selected_decades))\n filters.append(f\"Time: {decades_str} ({yr_str})\")\n else:\n filters.append(f\"Years: {yr_str}\")\n\n if filters:\n 
active_filters_html.value = f\"<div>Active: {'<br>'.join(filters)}</div>
\"\n else:\n active_filters_html.value = \"\"\n\n\ndef show_loading(message=\"Loading...\"):\n \"\"\"Show loading indicator.\"\"\"\n state.loading = True\n loading_indicator.value = f\"\"\"\n
<div style=\"display: flex; align-items: center;\">\n        <i class=\"fa fa-spinner fa-spin\"></i>\n        <span style=\"margin-left: 6px;\">{message}</span>\n    </div>
\n \"\"\"\n\n\ndef hide_loading():\n \"\"\"Hide loading indicator.\"\"\"\n state.loading = False\n loading_indicator.value = \"\"\n\n\ndef select_sample(idx, source='map'):\n \"\"\"\n Select a sample by index and sync map/table/card.\n\n Args:\n idx: Row index in current_gdf\n source: 'map' or 'table' - which triggered the selection\n \"\"\"\n if idx is None or idx >= len(state.current_gdf):\n return\n\n state.selected_index = idx\n state.selected_row = state.current_gdf.iloc[idx]\n\n # Update sample card\n card_output.value = render_sample_card(state.selected_row)\n\n if source == 'map':\n # Map click -> highlight table row\n # Column count depends on whether we're showing search_score\n col_count = 4 if state.current_search else 3\n sample_table.selections = [{'r1': idx, 'c1': 0, 'r2': idx, 'c2': col_count}]\n\n elif source == 'table':\n # Table click -> recenter map (keep current zoom)\n lat = state.selected_row['latitude']\n lon = state.selected_row['longitude']\n if not pd.isna(lat) and not pd.isna(lon):\n sample_map.set_view_state(latitude=float(lat), longitude=float(lon))\n\n\ndef on_map_point_click(change):\n \"\"\"Handle click on a map point - highlight corresponding table row.\"\"\"\n if state.syncing_selection:\n return\n\n idx = change.get('new')\n if idx is None:\n return\n\n state.syncing_selection = True\n try:\n select_sample(idx, source='map')\n finally:\n state.syncing_selection = False\n\n\ndef setup_layer_observer(layer):\n \"\"\"Setup the selected_index observer on a layer.\"\"\"\n layer.observe(on_map_point_click, names=['selected_index'])\n\n\ndef update_map_and_table(new_gdf, search_active=False):\n \"\"\"Update map and table with new data.\"\"\"\n state.current_gdf = new_gdf\n state.current_search = search_input.value.strip() if search_active else \"\"\n\n # Update map with new layer\n new_layer = create_map_layer(new_gdf)\n\n # Setup observer on new layer BEFORE adding to map\n setup_layer_observer(new_layer)\n\n sample_map.layers = 
[new_layer]\n\n # Update table - include score column if searching\n if search_active and 'search_score' in new_gdf.columns:\n display_cols = ['search_score', 'source', 'label', 'latitude', 'longitude']\n df_display = new_gdf[display_cols].copy()\n df_display = df_display.rename(columns={'search_score': 'score'})\n else:\n display_cols = ['source', 'label', 'latitude', 'longitude']\n df_display = new_gdf[display_cols].copy()\n\n df_display['latitude'] = df_display['latitude'].round(4)\n df_display['longitude'] = df_display['longitude'].round(4)\n sample_table.data = df_display\n\n # Update status\n if search_active:\n status_label.value = f\"Found: {len(new_gdf):,} matches for '{state.current_search}'\"\n else:\n status_label.value = f\"Loaded: {len(new_gdf):,} samples\"\n\n # Update active filters display\n update_active_filters_display()\n\n\ndef get_effective_material_filters():\n \"\"\"\n Get the effective set of material URIs to filter by,\n applying rollup expansion if enabled.\n \"\"\"\n if not state.material_filters:\n return None\n\n if state.material_rollup:\n # Expand to include descendants\n return expand_material_filters_with_rollup(state.material_filters, all_material_counts)\n else:\n # Exact match only\n return state.material_filters\n\n\ndef update_material_widget_options():\n \"\"\"\n Update material SelectMultiple options based on rollup toggle.\n Preserves current selection when possible.\n \"\"\"\n global material_uri_map\n\n # Remember current selection (by URI)\n current_uris = set()\n for option_label in material_select.value:\n if option_label in material_uri_map:\n current_uris.add(material_uri_map[option_label])\n\n # Get cached counts\n direct_cts, rollup_cts = get_cached_rollup_counts()\n\n # Rebuild options with new rollup setting\n new_options, new_uri_map = build_hierarchical_material_options(\n all_material_counts, direct_cts, rollup_cts, use_rollup=rollup_toggle.value\n )\n\n # Update the global uri_map\n material_uri_map = 
new_uri_map\n\n # Update widget options\n material_select.options = new_options\n\n # Restore selection (find labels for remembered URIs)\n new_selection = []\n for label, uri in new_uri_map.items():\n if uri in current_uris:\n new_selection.append(label)\n\n material_select.value = tuple(new_selection)\n\n\ndef do_search():\n \"\"\"Execute search with current parameters.\"\"\"\n show_loading(\"Searching...\")\n\n try:\n # Use facet filters instead of single source dropdown\n source_filters_set = state.source_filters if state.source_filters else None\n material_filters_set = get_effective_material_filters() # Apply rollup if enabled\n year_range = state.year_range if (state.year_range[0] is not None or state.year_range[1] is not None) else None\n\n search_term = search_input.value.strip()\n\n if state.viewport_mode:\n # Search within current viewport\n view_state = sample_map.view_state\n zoom = view_state.zoom if hasattr(view_state, 'zoom') else 1\n bbox = view_state_to_bbox(view_state)\n\n # When searching, use slider value directly (no adaptive reduction)\n # When browsing, use adaptive sampling based on zoom\n if search_term:\n max_samples = sample_count.value\n else:\n max_samples = adaptive_sample_size(zoom, base_size=sample_count.value)\n\n new_gdf = load_samples(\n max_per_source=max_samples,\n bbox=bbox,\n search_term=search_term if search_term else None,\n source_filters=source_filters_set,\n material_filters=material_filters_set,\n year_range=year_range\n )\n\n zoom_info = f\" (zoom {zoom:.1f})\"\n else:\n # Search globally\n new_gdf = load_samples(\n max_per_source=sample_count.value,\n search_term=search_term if search_term else None,\n source_filters=source_filters_set,\n material_filters=material_filters_set,\n year_range=year_range\n )\n zoom_info = \"\"\n\n update_map_and_table(new_gdf, search_active=bool(search_term))\n\n if search_term:\n status_label.value = f\"Found: {len(new_gdf):,} matches for '{search_term}'{zoom_info}\"\n else:\n 
status_label.value = f\"Loaded: {len(new_gdf):,} samples{zoom_info}\"\n\n except Exception as e:\n status_label.value = f\"Error: {str(e)[:50]}\"\n import traceback\n traceback.print_exc()\n finally:\n hide_loading()\n\n\ndef on_search_click(b):\n \"\"\"Handle search button click.\"\"\"\n do_search()\n\n\ndef on_search_submit(change):\n \"\"\"Handle Enter key in search box.\"\"\"\n do_search()\n\n\ndef on_clear_search(b):\n \"\"\"Clear search and reload data.\"\"\"\n search_input.value = ''\n do_search()\n\n\nsearch_btn.on_click(on_search_click)\nsearch_input.on_submit(on_search_submit)\nclear_search_btn.on_click(on_clear_search)\n\n\ndef load_viewport_data():\n \"\"\"Load data for current viewport with adaptive sampling.\"\"\"\n if state.loading:\n return\n\n show_loading(\"Loading viewport data...\")\n\n try:\n # Get current view state\n view_state = sample_map.view_state\n zoom = view_state.zoom if hasattr(view_state, 'zoom') else 1\n\n # Calculate bounding box\n bbox = view_state_to_bbox(view_state)\n\n # Get facet filters and search term\n source_filters_set = state.source_filters if state.source_filters else None\n material_filters_set = get_effective_material_filters() # Apply rollup if enabled\n year_range = state.year_range if (state.year_range[0] is not None or state.year_range[1] is not None) else None\n search_term = search_input.value.strip() if search_input.value.strip() else None\n\n # When searching, use slider value directly (no adaptive reduction)\n # When browsing, use adaptive sampling based on zoom\n if search_term:\n max_samples = sample_count.value\n else:\n max_samples = adaptive_sample_size(zoom, base_size=sample_count.value)\n\n # Load data\n new_gdf = load_samples(\n max_per_source=max_samples,\n bbox=bbox,\n search_term=search_term,\n source_filters=source_filters_set,\n material_filters=material_filters_set,\n year_range=year_range\n )\n\n update_map_and_table(new_gdf, search_active=bool(search_term))\n\n # Show zoom info in status\n if 
search_term:\n status_label.value = f\"Found: {len(new_gdf):,} matches for '{search_term}' (zoom {zoom:.1f})\"\n else:\n status_label.value = f\"Loaded: {len(new_gdf):,} samples (zoom {zoom:.1f}, {max_samples:,}/source max)\"\n\n except Exception as e:\n status_label.value = f\"Error: {str(e)[:50]}\"\n finally:\n hide_loading()\n\n\ndef debounced_viewport_load():\n \"\"\"Debounced viewport loading - waits for user to stop panning/zooming.\"\"\"\n # Cancel any existing timer\n if state.debounce_timer is not None:\n state.debounce_timer.cancel()\n\n # Set new timer (500ms delay)\n state.debounce_timer = threading.Timer(0.5, load_viewport_data)\n state.debounce_timer.start()\n\n\ndef on_view_state_change(change):\n \"\"\"Handle map pan/zoom changes.\"\"\"\n if state.viewport_mode and not state.loading:\n debounced_viewport_load()\n\n\ndef on_viewport_toggle(change):\n \"\"\"Handle viewport mode toggle.\"\"\"\n state.viewport_mode = change['new']\n if change['new']:\n viewport_toggle.button_style = 'success'\n viewport_toggle.description = 'Viewport Mode ON'\n # Immediately load viewport data\n load_viewport_data()\n else:\n viewport_toggle.button_style = ''\n viewport_toggle.description = 'Viewport Mode'\n\n\nviewport_toggle.observe(on_viewport_toggle, names=['value'])\n\n\n# Event handlers\ndef on_refresh_click(b):\n do_search() # Refresh now uses same logic as search\n\nrefresh_btn.on_click(on_refresh_click)\n\n\ndef on_table_selection(change):\n \"\"\"Handle table row selection - recenter map on selected point.\"\"\"\n if state.syncing_selection:\n return\n\n # selections is a LIST of selection dicts\n selections = change.get('new', [])\n if selections and len(selections) > 0:\n # Get the first selection\n sel = selections[0]\n row_idx = sel.get('r1')\n if row_idx is not None and row_idx < len(state.current_gdf):\n state.syncing_selection = True\n try:\n select_sample(row_idx, source='table')\n finally:\n state.syncing_selection = 
False\n\nsample_table.observe(on_table_selection, names=['selections'])\n\n# Register view_state observer on the map\nsample_map.observe(on_view_state_change, names=['view_state'])\n\n# Setup observer on initial layer\nsetup_layer_observer(sample_map.layers[0])\n\n\n# =============================================================================\n# Facet Event Handlers\n# =============================================================================\n\ndef on_source_checkbox_change(change):\n \"\"\"Handle source checkbox changes.\"\"\"\n # Rebuild source_filters from all checkboxes\n state.source_filters = set()\n for cb in source_checkboxes:\n if cb.value:\n state.source_filters.add(cb.source_name)\n\n # Trigger data reload\n do_search()\n\n\ndef on_material_select_change(change):\n \"\"\"Handle material selection changes.\"\"\"\n # Convert selected option labels to URIs\n state.material_filters = set()\n for option_label in material_select.value:\n if option_label in material_uri_map:\n state.material_filters.add(material_uri_map[option_label])\n\n # Trigger data reload\n do_search()\n\n\ndef on_rollup_toggle_change(change):\n \"\"\"Handle rollup toggle changes.\"\"\"\n state.material_rollup = rollup_toggle.value\n\n # Update the widget options to show correct counts\n update_material_widget_options()\n\n # If materials are selected, reload with new rollup setting\n if state.material_filters:\n do_search()\n\n\ndef on_refresh_counts_click(b):\n \"\"\"Handle refresh counts button click - recalculate material counts.\"\"\"\n global direct_counts, rollup_counts\n \n show_loading(\"Recalculating counts...\")\n try:\n # Force refresh of counts\n direct_counts, rollup_counts, elapsed = compute_accurate_rollup_counts(force_refresh=True)\n \n # Update widget display\n update_material_widget_options()\n \n status_label.value = f\"Counts refreshed in {elapsed:.1f}s\"\n except Exception as e:\n status_label.value = f\"Error: {str(e)[:50]}\"\n finally:\n hide_loading()\n\n\n# 
Flag to prevent infinite loops when syncing decades/slider\n_syncing_time_widgets = False\n\n\ndef sync_slider_from_decades():\n \"\"\"Update year slider based on selected decades.\"\"\"\n global _syncing_time_widgets\n if _syncing_time_widgets:\n return\n \n _syncing_time_widgets = True\n try:\n if state.selected_decades:\n # Calculate min/max from selected decades\n min_year = min(state.selected_decades)\n max_year = max(state.selected_decades) + 9 # End of decade (e.g., 2010 -> 2019)\n year_slider.value = [min_year, max_year]\n finally:\n _syncing_time_widgets = False\n\n\ndef on_decade_checkbox_change(change):\n \"\"\"Handle decade checkbox changes.\"\"\"\n global _syncing_time_widgets\n if _syncing_time_widgets:\n return\n \n # Rebuild selected_decades from all checkboxes\n state.selected_decades = set()\n for cb in decade_checkboxes:\n if cb.value:\n state.selected_decades.add(cb.decade_value)\n\n # Auto-enable time filter if any decade selected\n if state.selected_decades and not time_filter_enabled.value:\n _syncing_time_widgets = True\n time_filter_enabled.value = True\n _syncing_time_widgets = False\n\n # Sync slider if enabled\n if sync_decades_to_slider.value:\n sync_slider_from_decades()\n\n # Update state year_range from slider (which may have been synced)\n if time_filter_enabled.value:\n state.year_range = (year_slider.value[0], year_slider.value[1])\n else:\n state.year_range = (None, None)\n\n # Trigger data reload\n do_search()\n\n\ndef on_time_filter_change(change):\n \"\"\"Handle time filter enable/disable.\"\"\"\n global _syncing_time_widgets\n if _syncing_time_widgets:\n return\n \n if time_filter_enabled.value:\n state.year_range = (year_slider.value[0], year_slider.value[1])\n else:\n state.year_range = (None, None)\n # Also clear decade selections when disabling time filter\n _syncing_time_widgets = True\n state.selected_decades = set()\n for cb in decade_checkboxes:\n cb.value = False\n _syncing_time_widgets = False\n\n 
do_search()\n\n\ndef on_year_slider_change(change):\n \"\"\"Handle year slider changes.\"\"\"\n global _syncing_time_widgets\n if _syncing_time_widgets:\n return\n \n if time_filter_enabled.value:\n state.year_range = (year_slider.value[0], year_slider.value[1])\n \n # Update decade checkboxes to reflect slider range (if sync enabled)\n if sync_decades_to_slider.value:\n _syncing_time_widgets = True\n min_yr, max_yr = year_slider.value\n state.selected_decades = set()\n for cb in decade_checkboxes:\n decade = cb.decade_value\n # Check if decade overlaps with slider range\n decade_end = decade + 9\n overlaps = (decade <= max_yr) and (decade_end >= min_yr)\n cb.value = overlaps\n if overlaps:\n state.selected_decades.add(decade)\n _syncing_time_widgets = False\n \n do_search()\n\n\ndef on_clear_all_filters(b):\n \"\"\"Clear all facet filters.\"\"\"\n global _syncing_time_widgets\n _syncing_time_widgets = True\n \n # Clear source checkboxes\n for cb in source_checkboxes:\n cb.value = False\n state.source_filters = set()\n\n # Clear material selection\n material_select.value = []\n state.material_filters = set()\n\n # Reset rollup to default (on)\n rollup_toggle.value = True\n state.material_rollup = True\n\n # Clear time filter\n time_filter_enabled.value = False\n for cb in decade_checkboxes:\n cb.value = False\n state.selected_decades = set()\n year_slider.value = [year_stats['min_year'], year_stats['max_year']]\n state.year_range = (None, None)\n\n # Clear search\n search_input.value = ''\n \n _syncing_time_widgets = False\n\n # Reload data\n do_search()\n\n\n# Wire up facet event handlers\nfor cb in source_checkboxes:\n cb.observe(on_source_checkbox_change, names=['value'])\n\nmaterial_select.observe(on_material_select_change, names=['value'])\nrollup_toggle.observe(on_rollup_toggle_change, names=['value'])\nrefresh_counts_btn.on_click(on_refresh_counts_click)\n\n# Wire up decade checkboxes\nfor cb in decade_checkboxes:\n cb.observe(on_decade_checkbox_change, 
names=['value'])\n\ntime_filter_enabled.observe(on_time_filter_change, names=['value'])\nyear_slider.observe(on_year_slider_change, names=['value'])\nclear_filters_btn.on_click(on_clear_all_filters)\n\nprint(\"Facet widgets ready!\")" + "source": [ + "# State management\n", + "class ExplorerState:\n", + " def __init__(self):\n", + " self.selected_index = None\n", + " self.selected_row = None\n", + " self.current_gdf = None\n", + " self.viewport_mode = False\n", + " self.debounce_timer = None\n", + " self.loading = False\n", + " self.syncing_selection = False # Prevent infinite loops\n", + " self.current_search = \"\" # Current search term\n", + " # Facet filter state\n", + " self.source_filters = set() # Selected sources (empty = all)\n", + " self.material_filters = set() # Selected material URIs (full URIs for filtering)\n", + " self.year_range = (None, None) # (min_year, max_year) or None for no filter\n", + " self.selected_decades = set() # Selected decades for quick filter\n", + " self.material_rollup = True # Whether to include children when parent selected\n", + " # Facet counts cache\n", + " self.facet_counts_cache = {}\n", + " self.facet_cache_time = 0\n", + "\n", + "state = ExplorerState()\n", + "state.current_gdf = samples_gdf\n", + "\n", + "\n", + "# =============================================================================\n", + "# Material Hierarchy Definition\n", + "# =============================================================================\n", + "\n", + "# iSamples material vocabulary hierarchy (3 levels)\n", + "# Structure: (display_name, uri_suffix, children_list)\n", + "# uri_suffix maps to actual URIs in the data\n", + "\n", + "MATERIAL_HIERARCHY = [\n", + " (\"Material\", \"material\", [\n", + " (\"Natural Solid Material\", \"naturalsolidmaterial\", [\n", + " (\"Earth Material\", \"earthmaterial\", [\n", + " (\"Rock\", \"rock\", []),\n", + " (\"Sediment\", \"sediment\", []),\n", + " (\"Soil\", \"soil\", []),\n", + " (\"Mineral\", 
\"mineral\", []),\n", + " (\"Mixed Soil/Sediment/Rock\", \"mixedsoilsedimentrock\", []),\n", + " ]),\n", + " (\"Biogenic Non-organic\", \"biogenicnonorganicmaterial\", []),\n", + " ]),\n", + " (\"Organic Material\", \"organicmaterial\", [\n", + " (\"Plant Material\", \"plantmaterial\", []),\n", + " (\"Animal Product\", \"organicanimalproduct\", []),\n", + " ]),\n", + " (\"Anthropogenic Material\", \"anyanthropogenicmaterial\", [\n", + " (\"Anthropogenic Metal\", \"anthropogenicmetal\", []),\n", + " (\"Ceramic Clay\", \"ceramicclay\", []),\n", + " ]),\n", + " (\"Fluid Material\", \"fluidmaterial\", [\n", + " (\"Liquid Water\", \"liquidwater\", []),\n", + " (\"Gas\", \"gas\", []),\n", + " (\"Non-aqueous Liquid\", \"nonaqueousliquid\", []),\n", + " ]),\n", + " (\"Dispersed Media\", \"dispersedmedia\", [\n", + " (\"Particulate\", \"particulate\", []),\n", + " ]),\n", + " (\"Any Ice\", \"anyice\", []),\n", + " ])\n", + "]\n", + "\n", + "\n", + "def build_hierarchy_mappings():\n", + " \"\"\"\n", + " Build mappings for the material hierarchy.\n", + "\n", + " Returns:\n", + " - suffix_to_children: dict mapping uri_suffix -> list of all descendant suffixes\n", + " - display_order: list of (display_label, uri_suffix, indent_level) in tree order\n", + " \"\"\"\n", + " suffix_to_children = {}\n", + " display_order = []\n", + "\n", + " def collect_descendants(nodes):\n", + " \"\"\"Get all descendant suffixes from a list of hierarchy nodes.\"\"\"\n", + " descendants = []\n", + " for name, suffix, children in nodes:\n", + " descendants.append(suffix)\n", + " descendants.extend(collect_descendants(children))\n", + " return descendants\n", + "\n", + " def walk(nodes, level=0):\n", + " \"\"\"Walk hierarchy building mappings.\"\"\"\n", + " for name, suffix, children in nodes:\n", + " # All descendants (not including self)\n", + " suffix_to_children[suffix] = collect_descendants(children)\n", + "\n", + " # Add to display order with indentation\n", + " display_order.append((name, 
suffix, level))\n", + "\n", + "            # Recurse into children\n", + "            walk(children, level + 1)\n", + "\n", + "    walk(MATERIAL_HIERARCHY)\n", + "    return suffix_to_children, display_order\n", + "\n", + "\n", + "# Build hierarchy mappings at module load\n", + "SUFFIX_TO_CHILDREN, HIERARCHY_DISPLAY_ORDER = build_hierarchy_mappings()\n", + "\n", + "\n", + "# =============================================================================\n", + "# Facet Query Functions\n", + "# =============================================================================\n", + "\n", + "import re\n", + "import time as time_module\n", + "\n", + "def uri_to_display_name(uri):\n", + "    \"\"\"\n", + "    Convert a vocabulary URI to a human-readable display name.\n", + "\n", + "    Examples:\n", + "        https://w3id.org/isample/vocabulary/material/1.0/rock -> Rock\n", + "        https://w3id.org/isample/opencontext/material/0.1/ceramicclay -> Ceramicclay\n", + "        (camelCase segments are split; all-lowercase compounds are not)\n", + "    \"\"\"\n", + "    if not uri or not isinstance(uri, str):\n", + "        return str(uri)\n", + "\n", + "    # Extract last path segment\n", + "    name = uri.rstrip('/').split('/')[-1]\n", + "\n", + "    # Insert spaces before uppercase letters (camelCase -> Camel Case)\n", + "    name = re.sub(r'([a-z])([A-Z])', r'\\1 \\2', name)\n", + "\n", + "    # Capitalize first letter of each word\n", + "    name = name.title()\n", + "\n", + "    return name\n", + "\n", + "\n", + "def uri_to_suffix(uri):\n", + "    \"\"\"Extract the suffix (last path segment) from a URI, lowercased.\"\"\"\n", + "    if not uri or not isinstance(uri, str):\n", + "        return \"\"\n", + "    return uri.rstrip('/').split('/')[-1].lower()\n", + "\n", + "\n", + "def get_source_counts(additional_filters=None):\n", + "    \"\"\"\n", + "    Get counts of samples by source.\n", + "\n", + "    Args:\n", + "        additional_filters: Dict with material_filters, year_range keys\n", + "\n", + "    Returns:\n", + "        Dict of {source_name: count}\n", + "    \"\"\"\n", + "    where_clause = \"WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\"\n", + "\n", + "    
# Apply year filter if present (cast result_time to TIMESTAMP)\n", + " if additional_filters and additional_filters.get('year_range'):\n", + " yr = additional_filters['year_range']\n", + " if yr[0] is not None and yr[1] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) BETWEEN {yr[0]} AND {yr[1]}\"\n", + " elif yr[0] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) >= {yr[0]}\"\n", + " elif yr[1] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)) <= {yr[1]}\"\n", + "\n", + " # Note: We don't filter by material here to show all source options\n", + " query = f\"\"\"\n", + " SELECT n as source, COUNT(*) as count\n", + " FROM read_parquet('{PARQUET_PATH}')\n", + " {where_clause}\n", + " GROUP BY n ORDER BY count DESC\n", + " \"\"\"\n", + "\n", + " result = con.sql(query).df()\n", + " return dict(zip(result['source'], result['count']))\n", + "\n", + "\n", + "def get_all_material_counts():\n", + " \"\"\"\n", + " Get counts for ALL materials in the hierarchy (not just top N).\n", + " Used for building the hierarchical display.\n", + "\n", + " Returns:\n", + " Dict of {uri_suffix: {'uri': full_uri, 'count': count}}\n", + " \"\"\"\n", + " query = f\"\"\"\n", + " WITH samples AS (\n", + " SELECT msr.row_id, UNNEST(msr.p__has_material_category) as material_id\n", + " FROM read_parquet('{PARQUET_PATH}') msr\n", + " WHERE msr.otype = 'MaterialSampleRecord' AND msr.latitude IS NOT NULL\n", + " )\n", + " SELECT ic.label as material_uri, COUNT(*) as count\n", + " FROM samples s\n", + " JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = s.material_id\n", + " WHERE ic.label IS NOT NULL\n", + " GROUP BY ic.label\n", + " ORDER BY count DESC\n", + " \"\"\"\n", + "\n", + " try:\n", + " result = con.sql(query).df()\n", + " # Build dict keyed by suffix, but also maintain a scheme-aware\n", + " # lookup for rollup expansion. 
Each suffix entry stores:\n", + " # - 'uri': primary URI (highest count)\n", + " # - 'count': count for that URI\n", + " # - 'all_uris': dict of {scheme_prefix: uri} for all vocabularies\n", + " materials = {}\n", + " for _, row in result.iterrows():\n", + " uri = row['material_uri']\n", + " suffix = uri_to_suffix(uri)\n", + " if not suffix:\n", + " continue\n", + " # Extract scheme prefix for disambiguation\n", + " parts = uri.rstrip('/').split('/')\n", + " scheme = '/'.join(parts[:-2]) if len(parts) > 2 else ''\n", + "\n", + " if suffix not in materials:\n", + " materials[suffix] = {\n", + " 'uri': uri,\n", + " 'count': row['count'],\n", + " 'all_uris': {scheme: uri}\n", + " }\n", + " else:\n", + " materials[suffix]['all_uris'][scheme] = uri\n", + " # Keep highest-count URI as primary\n", + " if row['count'] > materials[suffix]['count']:\n", + " materials[suffix]['uri'] = uri\n", + " materials[suffix]['count'] = row['count']\n", + " return materials\n", + " except Exception as e:\n", + " print(f\"Material count error: {e}\")\n", + " return {}\n", + "\n", + "\n", + "# =============================================================================\n", + "# Accurate Rollup Counts (with caching)\n", + "# =============================================================================\n", + "\n", + "# Cache for accurate rollup counts\n", + "_rollup_cache = {\n", + " 'direct_counts': {}, # suffix -> direct count\n", + " 'rollup_counts': {}, # suffix -> accurate distinct rollup count\n", + " 'computed_at': None, # timestamp\n", + " 'existing_suffixes': set() # suffixes that exist in data\n", + "}\n", + "\n", + "\n", + "def compute_accurate_rollup_counts(force_refresh=False):\n", + " \"\"\"\n", + " Compute accurate rollup counts using DISTINCT sample counting.\n", + " \n", + " This avoids double-counting samples tagged at multiple hierarchy levels.\n", + " Results are cached for performance.\n", + " \n", + " Args:\n", + " force_refresh: If True, recompute even if cache 
exists\n", + " \n", + " Returns:\n", + " Tuple of (direct_counts, rollup_counts, elapsed_time)\n", + " - direct_counts: Dict of {suffix: count} for samples tagged exactly at that level\n", + " - rollup_counts: Dict of {suffix: count} for samples tagged at that level OR any descendant\n", + " \"\"\"\n", + " global _rollup_cache\n", + " \n", + " # Return cached results if available and not forcing refresh\n", + " if not force_refresh and _rollup_cache['computed_at'] is not None:\n", + " return (_rollup_cache['direct_counts'], \n", + " _rollup_cache['rollup_counts'], \n", + " 0.0)\n", + " \n", + " print(\"Computing accurate rollup counts (this takes ~3 seconds)...\")\n", + " t0 = time_module.time()\n", + " \n", + " # Build temp table with sample-material pairs\n", + " build_query = f\"\"\"\n", + " CREATE OR REPLACE TEMP TABLE sample_materials AS\n", + " SELECT DISTINCT \n", + " msr.row_id as sample_id,\n", + " LOWER(SPLIT_PART(ic.label, '/', -1)) as suffix,\n", + " ic.label as uri\n", + " FROM read_parquet('{PARQUET_PATH}') msr\n", + " CROSS JOIN UNNEST(msr.p__has_material_category) AS t(mat_id)\n", + " JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = mat_id\n", + " WHERE msr.otype = 'MaterialSampleRecord' \n", + " AND msr.latitude IS NOT NULL\n", + " AND ic.label IS NOT NULL\n", + " \"\"\"\n", + " con.sql(build_query)\n", + " \n", + " # Get existing suffixes\n", + " suffix_df = con.sql(\"SELECT DISTINCT suffix FROM sample_materials\").df()\n", + " existing_suffixes = set(suffix_df['suffix'].tolist())\n", + " \n", + " # Get direct counts\n", + " direct_df = con.sql(\"\"\"\n", + " SELECT suffix, COUNT(*) as direct_count\n", + " FROM sample_materials\n", + " GROUP BY suffix\n", + " \"\"\").df()\n", + " direct_counts = dict(zip(direct_df['suffix'], direct_df['direct_count']))\n", + " \n", + " # Compute accurate rollup counts in a single query instead of N+1.\n", + " # Build a mapping of parent -> all descendant suffixes, then query once.\n", + " rollup_counts = 
{}\n", + " \n", + " # First, handle unknown suffixes (not in hierarchy) \u2014 use direct counts\n", + " for suffix in existing_suffixes:\n", + " if suffix not in SUFFIX_TO_CHILDREN:\n", + " rollup_counts[suffix] = direct_counts.get(suffix, 0)\n", + " \n", + " # For known hierarchy nodes, build all suffix groups and query in batch\n", + " hierarchy_suffixes = [s for s in existing_suffixes if s in SUFFIX_TO_CHILDREN]\n", + " if hierarchy_suffixes:\n", + " # Build a union of all rollup groups\n", + " union_parts = []\n", + " for suffix in hierarchy_suffixes:\n", + " all_suffixes = [suffix] + [s for s in SUFFIX_TO_CHILDREN[suffix] if s in existing_suffixes]\n", + " suffixes_sql = \", \".join(f\"'{s}'\" for s in all_suffixes)\n", + " union_parts.append(\n", + " f\"SELECT \\'{suffix}\\' as parent_suffix, COUNT(DISTINCT sample_id) as cnt \"\n", + " f\"FROM sample_materials WHERE suffix IN ({suffixes_sql})\"\n", + " )\n", + " \n", + " batch_query = \" UNION ALL \".join(union_parts)\n", + " batch_result = con.sql(batch_query).fetchall()\n", + " for parent_suffix, cnt in batch_result:\n", + " rollup_counts[parent_suffix] = cnt\n", + " \n", + " elapsed = time_module.time() - t0\n", + " \n", + " # Update cache\n", + " _rollup_cache['direct_counts'] = direct_counts\n", + " _rollup_cache['rollup_counts'] = rollup_counts\n", + " _rollup_cache['computed_at'] = time_module.time()\n", + " _rollup_cache['existing_suffixes'] = existing_suffixes\n", + " \n", + " print(f\" Computed {len(rollup_counts)} rollup counts in {elapsed:.2f}s\")\n", + " \n", + " return direct_counts, rollup_counts, elapsed\n", + "\n", + "\n", + "def get_cached_rollup_counts():\n", + " \"\"\"Get cached rollup counts, computing if necessary.\"\"\"\n", + " if _rollup_cache['computed_at'] is None:\n", + " compute_accurate_rollup_counts()\n", + " return _rollup_cache['direct_counts'], _rollup_cache['rollup_counts']\n", + "\n", + "\n", + "def get_material_counts(additional_filters=None, limit=50):\n", + " \"\"\"\n", 
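The batch rollup query above relies on `COUNT(DISTINCT sample_id)` because a sample is often tagged at more than one hierarchy level (e.g. both `earthmaterial` and `rock`), so summing per-suffix counts would inflate parent totals. A minimal pure-Python sketch of the same logic (toy sample IDs and a cut-down hierarchy, not the real data):

```python
# Toy sample -> material-tag pairs; "s1" is tagged at two hierarchy levels.
sample_tags = {
    "s1": {"earthmaterial", "rock"},  # parent AND child tag on the same sample
    "s2": {"rock"},
    "s3": {"sediment"},
}

# Parent suffix -> descendant suffixes (mirrors SUFFIX_TO_CHILDREN).
children = {"earthmaterial": ["rock", "sediment"], "rock": [], "sediment": []}

def naive_rollup(parent):
    """Summing per-suffix counts double-counts multi-tagged samples."""
    group = [parent] + children[parent]
    return sum(
        sum(1 for tags in sample_tags.values() if s in tags) for s in group
    )

def distinct_rollup(parent):
    """Mirrors COUNT(DISTINCT sample_id): each sample counted once per group."""
    group = set([parent] + children[parent])
    return sum(1 for tags in sample_tags.values() if tags & group)

print(naive_rollup("earthmaterial"), distinct_rollup("earthmaterial"))  # 4 3
```

Only three distinct samples carry an Earth Material tag, but the naive sum reports four because `s1` is counted under both its parent and child tags.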
+ " Get counts of samples by material category (requires join for labels).\n", + "\n", + " Args:\n", + " additional_filters: Dict with source_filters, year_range keys\n", + " limit: Max number of materials to return\n", + "\n", + " Returns:\n", + " Dict of {display_name: {'uri': full_uri, 'count': count}}\n", + " \"\"\"\n", + " where_clause = \"WHERE msr.otype = 'MaterialSampleRecord' AND msr.latitude IS NOT NULL\"\n", + "\n", + " # Apply source filter if present\n", + " if additional_filters and additional_filters.get('source_filters'):\n", + " sources_sql = \", \".join(f\"'{s}'\" for s in additional_filters['source_filters'])\n", + " where_clause += f\" AND msr.n IN ({sources_sql})\"\n", + "\n", + " # Apply year filter if present (cast result_time to TIMESTAMP)\n", + " if additional_filters and additional_filters.get('year_range'):\n", + " yr = additional_filters['year_range']\n", + " if yr[0] is not None and yr[1] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) BETWEEN {yr[0]} AND {yr[1]}\"\n", + " elif yr[0] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) >= {yr[0]}\"\n", + " elif yr[1] is not None:\n", + " where_clause += f\" AND EXTRACT(YEAR FROM TRY_CAST(msr.result_time AS TIMESTAMP)) <= {yr[1]}\"\n", + "\n", + " query = f\"\"\"\n", + " WITH samples AS (\n", + " SELECT msr.row_id, UNNEST(msr.p__has_material_category) as material_id\n", + " FROM read_parquet('{PARQUET_PATH}') msr\n", + " {where_clause}\n", + " )\n", + " SELECT ic.label as material_uri, COUNT(*) as count\n", + " FROM samples s\n", + " JOIN read_parquet('{PARQUET_PATH}') ic ON ic.row_id = s.material_id\n", + " WHERE ic.label IS NOT NULL\n", + " GROUP BY ic.label\n", + " ORDER BY count DESC\n", + " LIMIT {limit}\n", + " \"\"\"\n", + "\n", + " try:\n", + " result = con.sql(query).df()\n", + " # Return dict with display name as key, containing uri and count\n", + " materials = {}\n", + " 
for _, row in result.iterrows():\n", + " uri = row['material_uri']\n", + " display_name = uri_to_display_name(uri)\n", + " materials[display_name] = {\n", + " 'uri': uri,\n", + " 'count': row['count']\n", + " }\n", + " return materials\n", + " except Exception as e:\n", + " print(f\"Material count error: {e}\")\n", + " return {}\n", + "\n", + "\n", + "def get_year_range_stats():\n", + " \"\"\"\n", + " Get min/max years and decade counts for time facet.\n", + "\n", + " Returns:\n", + " Dict with 'min_year', 'max_year', 'decades' (dict of decade: count)\n", + " \"\"\"\n", + " # Cast result_time to TIMESTAMP before extracting year\n", + " query = f\"\"\"\n", + " SELECT\n", + " MIN(EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)))::INT as min_year,\n", + " MAX(EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP)))::INT as max_year\n", + " FROM read_parquet('{PARQUET_PATH}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND latitude IS NOT NULL\n", + " AND result_time IS NOT NULL\n", + " AND TRY_CAST(result_time AS TIMESTAMP) IS NOT NULL\n", + " \"\"\"\n", + "\n", + " stats = con.sql(query).df().iloc[0]\n", + "\n", + " # Get decade counts\n", + " decade_query = f\"\"\"\n", + " SELECT\n", + " (EXTRACT(YEAR FROM TRY_CAST(result_time AS TIMESTAMP))::INT / 10) * 10 as decade,\n", + " COUNT(*) as count\n", + " FROM read_parquet('{PARQUET_PATH}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND latitude IS NOT NULL\n", + " AND result_time IS NOT NULL\n", + " AND TRY_CAST(result_time AS TIMESTAMP) IS NOT NULL\n", + " GROUP BY decade\n", + " ORDER BY decade\n", + " \"\"\"\n", + "\n", + " decades_df = con.sql(decade_query).df()\n", + " decades = dict(zip(decades_df['decade'].astype(int), decades_df['count']))\n", + "\n", + " return {\n", + " 'min_year': int(stats['min_year']) if pd.notna(stats['min_year']) else 1900,\n", + " 'max_year': int(stats['max_year']) if pd.notna(stats['max_year']) else 2025,\n", + " 'decades': decades\n", + " }\n", + "\n", + "\n", + 
"def expand_material_filters_with_rollup(selected_uris, all_material_counts):\n", + " \"\"\"\n", + " Expand selected URIs to include all descendant URIs (rollup).\n", + " Uses scheme-aware matching to avoid cross-vocabulary collisions.\n", + "\n", + " Args:\n", + " selected_uris: Set of user-selected material URIs\n", + " all_material_counts: Dict from get_all_material_counts()\n", + "\n", + " Returns:\n", + " Set of URIs including selected + all descendants\n", + " \"\"\"\n", + " expanded = set(selected_uris)\n", + "\n", + " for uri in selected_uris:\n", + " suffix = uri_to_suffix(uri)\n", + " # Extract scheme prefix from the selected URI for matching\n", + " parts = uri.rstrip('/').split('/')\n", + " scheme_prefix = '/'.join(parts[:-2]) if len(parts) > 2 else ''\n", + "\n", + " if suffix in SUFFIX_TO_CHILDREN:\n", + " # Add descendant URIs, preferring same vocabulary scheme\n", + " for child_suffix in SUFFIX_TO_CHILDREN[suffix]:\n", + " if child_suffix in all_material_counts:\n", + " child_data = all_material_counts[child_suffix]\n", + " # Try to find child in the same scheme first\n", + " if scheme_prefix and 'all_uris' in child_data:\n", + " if scheme_prefix in child_data['all_uris']:\n", + " expanded.add(child_data['all_uris'][scheme_prefix])\n", + " continue\n", + " # Fall back to primary URI (highest count)\n", + " expanded.add(child_data['uri'])\n", + "\n", + " return expanded\n", + "\n", + "\n", + "# =============================================================================\n", + "# Facet Widgets\n", + "# =============================================================================\n", + "\n", + "# =============================================================================\n", + "# Summary-Based Facet Loading (instant startup from 2KB summary file)\n", + "# =============================================================================\n", + "\n", + "def get_facet_from_summaries(facet_type):\n", + " \"\"\"Get facet values and counts from the pre-loaded 
summaries dataframe.\"\"\"\n", + " subset = FACET_SUMMARIES_DF[FACET_SUMMARIES_DF['facet_type'] == facet_type]\n", + " return dict(zip(subset['facet_value'], subset['count']))\n", + "\n", + "# Load all facets from summaries (instant - no full parquet scan)\n", + "print(\"Loading facets from pre-computed summaries...\")\n", + "_t0 = _time.time()\n", + "initial_source_counts = get_facet_from_summaries('source')\n", + "summary_material_counts = get_facet_from_summaries('material')\n", + "summary_context_counts = get_facet_from_summaries('context')\n", + "summary_object_type_counts = get_facet_from_summaries('object_type')\n", + "_facet_ms = (_time.time() - _t0) * 1000\n", + "print(f\"All facets populated in {_facet_ms:.0f} ms\")\n", + "print(f\"Sources: {list(initial_source_counts.keys())}\")\n", + "print(f\"Materials: {len(summary_material_counts)} types\")\n", + "print(f\"Contexts: {len(summary_context_counts)} types\")\n", + "print(f\"Object types: {len(summary_object_type_counts)} types\")\n", + "\n", + "# Defer heavy hierarchy/year queries \u2014 compute on first use, not at startup.\n", + "# The summary parquet above gives us instant facet counts; these are only\n", + "# needed when the user opens the Material tree or Time accordion.\n", + "_lazy = {'materials': None, 'year_stats': None}\n", + "\n", + "def _ensure_material_hierarchy():\n", + " if _lazy['materials'] is None:\n", + " print(\"Computing material hierarchy (first use)...\")\n", + " _lazy['materials'] = get_all_material_counts()\n", + " return _lazy['materials']\n", + "\n", + "def _ensure_year_stats():\n", + " if _lazy['year_stats'] is None:\n", + " print(\"Computing year stats (first use)...\")\n", + " _lazy['year_stats'] = get_year_range_stats()\n", + " return _lazy['year_stats']\n", + "\n", + "# NOTE: These are computed lazily on first access via _ensure_*.\n", + "# We do NOT call them here at startup \u2014 they run when the user\n", + "# opens the Material or Time accordion for the first 
time.\n", + "all_material_counts = {}\n", + "direct_counts, rollup_counts = {}, {}\n", + "year_stats = {'min_year': 1900, 'max_year': 2025, 'decades': {}}\n", + "\n", + "print(\"Heavy queries deferred until first use (Material/Time accordion open)\")\n", + "\n", + "\n", + "def build_hierarchical_material_options(material_counts, direct_counts_dict, rollup_counts_dict, use_rollup=True):\n", + " \"\"\"\n", + " Build SelectMultiple options showing material hierarchy with indentation.\n", + "\n", + " Args:\n", + " material_counts: Dict from get_all_material_counts() keyed by suffix (for URIs)\n", + " direct_counts_dict: Dict of {suffix: direct_count}\n", + " rollup_counts_dict: Dict of {suffix: accurate_rollup_count}\n", + " use_rollup: If True, show rollup totals; if False, show direct counts\n", + "\n", + " Returns:\n", + " - options: List of option labels (with indentation and counts)\n", + " - uri_map: Dict mapping option label -> full URI\n", + " \"\"\"\n", + " options = []\n", + " uri_map = {}\n", + "\n", + " # Indentation characters\n", + " INDENT = \" \" # 4 spaces per level\n", + "\n", + " for display_name, suffix, level in HIERARCHY_DISPLAY_ORDER:\n", + " if suffix in material_counts:\n", + " data = material_counts[suffix]\n", + " uri = data['uri']\n", + "\n", + " # Choose count based on rollup setting\n", + " if use_rollup:\n", + " count = rollup_counts_dict.get(suffix, direct_counts_dict.get(suffix, 0))\n", + " else:\n", + " count = direct_counts_dict.get(suffix, 0)\n", + "\n", + " # Create indented label\n", + " indent = INDENT * level\n", + " label = f\"{indent}{display_name} ({count:,})\"\n", + "\n", + " options.append(label)\n", + " uri_map[label] = uri\n", + "\n", + " return options, uri_map\n", + "\n", + "\n", + "# Build initial hierarchical material options (with rollup ON by default)\n", + "material_options, material_uri_map = build_hierarchical_material_options(\n", + " all_material_counts, direct_counts, rollup_counts, use_rollup=True\n", + ")\n", 
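The `_ensure_*` helpers above implement compute-on-first-use memoization: the placeholder values keep the UI responsive at startup, and the expensive DuckDB scan runs only when (and if) the user opens the relevant accordion. A standalone sketch of the pattern, with `expensive_year_scan` as a hypothetical stand-in for the real query:

```python
_lazy = {"year_stats": None}
scan_count = {"n": 0}

def expensive_year_scan():
    # Stand-in for the full-parquet DuckDB aggregation.
    scan_count["n"] += 1
    return {"min_year": 1900, "max_year": 2025, "decades": {2010: 42}}

def ensure_year_stats():
    """Compute on first access, then serve the cached result."""
    if _lazy["year_stats"] is None:
        _lazy["year_stats"] = expensive_year_scan()
    return _lazy["year_stats"]

ensure_year_stats()  # first call triggers the scan
ensure_year_stats()  # second call is served from cache
print(scan_count["n"])  # 1
```

The trade-off is that the first accordion open pays the full query cost; the accordion observer below prints a progress message so that pause is visible to the user.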
+ "print(f\"Hierarchy has {len(material_options)} materials with data\")\n", + "\n", + "\n", + "def create_source_checkboxes(counts):\n", + " \"\"\"Create checkboxes for source facet.\"\"\"\n", + " checkboxes = []\n", + " for source, count in counts.items():\n", + " cb = widgets.Checkbox(\n", + " value=False,\n", + " description=f\"{source} ({count:,})\",\n", + " indent=False,\n", + " layout=widgets.Layout(width='100%', margin='2px 0')\n", + " )\n", + " cb.source_name = source # Store source name for easy access\n", + " checkboxes.append(cb)\n", + " return checkboxes\n", + "\n", + "\n", + "# Create source checkboxes\n", + "source_checkboxes = create_source_checkboxes(initial_source_counts)\n", + "source_facet_box = widgets.VBox(\n", + " source_checkboxes,\n", + " layout=widgets.Layout(max_height='180px', overflow_y='auto', padding='5px')\n", + ")\n", + "\n", + "# Use SelectMultiple for materials - hierarchical display with indentation\n", + "material_select = widgets.SelectMultiple(\n", + " options=material_options,\n", + " value=[],\n", + " rows=12, # More rows to see hierarchy\n", + " description='',\n", + " layout=widgets.Layout(width='100%', height='280px'),\n", + " style={'description_width': '0px'}\n", + ")\n", + "\n", + "# Rollup toggle\n", + "rollup_toggle = widgets.Checkbox(\n", + " value=True,\n", + " description='Include subcategories',\n", + " indent=False,\n", + " layout=widgets.Layout(margin='5px 0'),\n", + " style={'description_width': 'initial'}\n", + ")\n", + "\n", + "# Refresh counts button\n", + "refresh_counts_btn = widgets.Button(\n", + " description='',\n", + " button_style='',\n", + " icon='refresh',\n", + " tooltip='Recalculate material counts',\n", + " layout=widgets.Layout(width='32px', height='26px')\n", + ")\n", + "\n", + "material_header = widgets.HBox([\n", + " widgets.HTML(\"Ctrl/Cmd+click to multi-select\"),\n", + " refresh_counts_btn\n", + "], layout=widgets.Layout(justify_content='space-between', align_items='center'))\n", + "\n", 
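Note the `cb.source_name = source` trick in `create_source_checkboxes`: a plain attribute stashed on each `Checkbox` lets one shared observer rebuild `state.source_filters` without parsing the `description` label back apart. A minimal sketch using a stand-in class instead of live ipywidgets (hypothetical names, so it runs without a kernel):

```python
class FakeCheckbox:
    """Stand-in for widgets.Checkbox; real ipywidgets fires observers on value change."""
    def __init__(self, description):
        self.description = description
        self.value = False

def create_source_checkboxes(counts):
    boxes = []
    for source, count in counts.items():
        cb = FakeCheckbox(f"{source} ({count:,})")
        cb.source_name = source  # custom attribute read back by the handler
        boxes.append(cb)
    return boxes

def selected_sources(boxes):
    # Handler logic: rebuild the filter set from checked boxes only.
    return {cb.source_name for cb in boxes if cb.value}

boxes = create_source_checkboxes({"SESAR": 4_000_000, "GEOME": 500_000})
boxes[0].value = True
print(selected_sources(boxes))  # {'SESAR'}
```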
+ "material_facet_box = widgets.VBox([\n", + " material_header,\n", + " rollup_toggle,\n", + " material_select\n", + "], layout=widgets.Layout(padding='5px'))\n", + "\n", + "\n", + "# =============================================================================\n", + "# Time Facet Widgets (with decade checkboxes)\n", + "# =============================================================================\n", + "\n", + "def create_decade_checkboxes(decade_counts):\n", + " \"\"\"Create checkboxes for decade quick-select.\"\"\"\n", + " checkboxes = []\n", + " # Sort decades and filter to reasonable range (1800-2030)\n", + " sorted_decades = sorted([d for d in decade_counts.keys() if 1800 <= d <= 2030])\n", + "\n", + " for decade in sorted_decades:\n", + " count = decade_counts.get(decade, 0)\n", + " if count > 0: # Only show decades with data\n", + " cb = widgets.Checkbox(\n", + " value=False,\n", + " description=f\"{decade}s ({count:,})\",\n", + " indent=False,\n", + " layout=widgets.Layout(width='100%', margin='1px 0')\n", + " )\n", + " cb.decade_value = decade # Store decade value for easy access\n", + " checkboxes.append(cb)\n", + " return checkboxes\n", + "\n", + "\n", + "decade_checkboxes = create_decade_checkboxes(year_stats['decades'])\n", + "\n", + "# Container for decade checkboxes (scrollable if many decades)\n", + "decade_box = widgets.VBox(\n", + " decade_checkboxes,\n", + " layout=widgets.Layout(max_height='140px', overflow_y='auto', padding='2px')\n", + ")\n", + "\n", + "decade_label = widgets.HTML(\"Quick select decades:\")\n", + "\n", + "year_slider = widgets.IntRangeSlider(\n", + " value=[year_stats['min_year'], year_stats['max_year']],\n", + " min=year_stats['min_year'],\n", + " max=year_stats['max_year'],\n", + " step=1,\n", + " description='Range:',\n", + " continuous_update=False,\n", + " layout=widgets.Layout(width='100%'),\n", + " style={'description_width': '50px'}\n", + ")\n", + "\n", + "# Enable/disable time filter\n", + "time_filter_enabled = 
widgets.Checkbox(\n", + " value=False,\n", + " description='Filter by time',\n", + " indent=False,\n", + " layout=widgets.Layout(margin='5px 0')\n", + ")\n", + "\n", + "# Sync mode: when True, decade checkboxes auto-update the slider\n", + "sync_decades_to_slider = widgets.Checkbox(\n", + " value=True,\n", + " description='Sync decades to slider',\n", + " indent=False,\n", + " layout=widgets.Layout(margin='2px 0'),\n", + " style={'description_width': 'initial'}\n", + ")\n", + "\n", + "time_facet_box = widgets.VBox([\n", + " time_filter_enabled,\n", + " decade_label,\n", + " decade_box,\n", + " widgets.HTML(\"Fine-tune range:\"),\n", + " year_slider,\n", + " sync_decades_to_slider\n", + "], layout=widgets.Layout(padding='5px'))\n", + "\n", + "\n", + "# Context dropdown (populated from summaries)\n", + "context_options = [('All Contexts', '')] + [\n", + " (f\"{val} ({count:,})\", val) for val, count in summary_context_counts.items()\n", + "]\n", + "context_dropdown = widgets.Dropdown(\n", + " options=context_options,\n", + " value='',\n", + " description='',\n", + " layout=widgets.Layout(width='100%')\n", + ")\n", + "context_facet_box = widgets.VBox([\n", + " widgets.HTML(\"Filter by specimen context/setting:\"),\n", + " context_dropdown\n", + "], layout=widgets.Layout(padding='5px'))\n", + "\n", + "# Object type dropdown (populated from summaries)\n", + "object_type_options = [('All Object Types', '')] + [\n", + " (f\"{val} ({count:,})\", val) for val, count in summary_object_type_counts.items()\n", + "]\n", + "object_type_dropdown = widgets.Dropdown(\n", + " options=object_type_options,\n", + " value='',\n", + " description='',\n", + " layout=widgets.Layout(width='100%')\n", + ")\n", + "object_type_facet_box = widgets.VBox([\n", + " widgets.HTML(\"Filter by sample object type:\"),\n", + " object_type_dropdown\n", + "], layout=widgets.Layout(padding='5px'))\n", + "\n", + "# Create accordion with all facets\n", + "facet_accordion = widgets.Accordion(\n", + " 
children=[source_facet_box, material_facet_box, context_facet_box, object_type_facet_box, time_facet_box],\n", + " titles=['Sources', 'Material Type', 'Context', 'Object Type', 'Time Period'],\n", + " layout=widgets.Layout(width='100%')\n", + ")\n", + "facet_accordion.selected_index = None # Start collapsed\n", + "\n", + "def on_accordion_open(change):\n", + " \"\"\"Lazy-load heavy queries when user opens Material or Time accordion.\"\"\"\n", + " global all_material_counts, direct_counts, rollup_counts, year_stats\n", + " idx = change.get('new')\n", + " if idx == 1 and not all_material_counts:\n", + " # Material accordion opened for first time\n", + " all_material_counts = _ensure_material_hierarchy()\n", + " direct_counts, rollup_counts = get_cached_rollup_counts()\n", + " update_material_widget_options()\n", + " print(f\"Material hierarchy loaded: {len(all_material_counts)} types\")\n", + " elif idx == 4 and not year_stats.get('decades'):\n", + " # Time accordion opened for first time\n", + " year_stats = _ensure_year_stats()\n", + " # Rebuild decade checkboxes and slider with real data\n", + " new_decade_cbs = create_decade_checkboxes(year_stats['decades'])\n", + " for cb in new_decade_cbs:\n", + " cb.observe(on_decade_checkbox_change, names=['value'])\n", + " decade_box.children = new_decade_cbs\n", + " decade_checkboxes.clear()\n", + " decade_checkboxes.extend(new_decade_cbs)\n", + " year_slider.min = year_stats['min_year']\n", + " year_slider.max = year_stats['max_year']\n", + " year_slider.value = [year_stats['min_year'], year_stats['max_year']]\n", + " print(f\"Year stats loaded: {year_stats['min_year']} - {year_stats['max_year']}, {len(new_decade_cbs)} decades\")\n", + "\n", + "facet_accordion.observe(on_accordion_open, names=['selected_index'])\n", + "\n", + "# Clear all filters button\n", + "clear_filters_btn = widgets.Button(\n", + " description='Clear All Filters',\n", + " button_style='warning',\n", + " icon='times-circle',\n", + " 
layout=widgets.Layout(width='100%', margin='10px 0')\n", + ")\n", + "\n", + "\n", + "# =============================================================================\n", + "# Original Widgets (preserved)\n", + "# =============================================================================\n", + "\n", + "source_filter = widgets.Dropdown(\n", + " options=['All Sources', 'SESAR', 'OPENCONTEXT', 'GEOME', 'SMITHSONIAN'],\n", + " value='All Sources',\n", + " description='Source:',\n", + " style={'description_width': '60px'},\n", + " layout=widgets.Layout(display='none') # Hidden - replaced by facet\n", + ")\n", + "\n", + "search_input = widgets.Text(\n", + " value='',\n", + " placeholder='Search label, description, place...',\n", + " description='Search:',\n", + " style={'description_width': '60px'},\n", + " layout=widgets.Layout(width='280px')\n", + ")\n", + "\n", + "search_btn = widgets.Button(\n", + " description='',\n", + " button_style='',\n", + " icon='search',\n", + " tooltip='Search (or press Enter)',\n", + " layout=widgets.Layout(width='40px')\n", + ")\n", + "\n", + "clear_search_btn = widgets.Button(\n", + " description='',\n", + " button_style='',\n", + " icon='times',\n", + " tooltip='Clear search',\n", + " layout=widgets.Layout(width='40px')\n", + ")\n", + "\n", + "sample_count = widgets.IntSlider(\n", + " value=12500,\n", + " min=1000,\n", + " max=500000, # 500K per source - plenty for 128GB RAM\n", + " step=5000,\n", + " description='Per source:',\n", + " style={'description_width': '80px'}\n", + ")\n", + "\n", + "viewport_toggle = widgets.ToggleButton(\n", + " value=False,\n", + " description='Viewport Mode',\n", + " tooltip='When enabled, automatically loads data for current map view',\n", + " icon='map',\n", + " button_style='' # 'success' when active\n", + ")\n", + "\n", + "refresh_btn = widgets.Button(\n", + " description='Refresh Data',\n", + " button_style='primary',\n", + " icon='refresh'\n", + ")\n", + "\n", + "# Loading indicator with spinner\n", 
+ "loading_indicator = widgets.HTML(value=\"\")\n", + "\n", + "status_label = widgets.HTML(value=f\"Loaded: {len(samples_gdf):,} samples\")\n", + "\n", + "card_output = widgets.HTML(value=render_sample_card(None))\n", + "\n", + "# Active filters display\n", + "active_filters_html = widgets.HTML(value=\"\")\n", + "\n", + "\n", + "def update_active_filters_display():\n", + " \"\"\"Update the display of currently active filters.\"\"\"\n", + " filters = []\n", + "\n", + " if state.source_filters:\n", + " filters.append(f\"Sources: {', '.join(state.source_filters)}\")\n", + "\n", + " if state.material_filters:\n", + " # Get display names from selected options\n", + " mat_names = [uri_to_display_name(uri) for uri in state.material_filters]\n", + " mat_display = mat_names[:2]\n", + " if len(mat_names) > 2:\n", + " mat_display.append(f\"+{len(mat_names)-2} more\")\n", + " rollup_indicator = \" (+sub)\" if state.material_rollup else \"\"\n", + " filters.append(f\"Materials: {', '.join(mat_display)}{rollup_indicator}\")\n", + "\n", + " if state.year_range[0] is not None or state.year_range[1] is not None:\n", + " yr_str = f\"{state.year_range[0] or 'any'} - {state.year_range[1] or 'any'}\"\n", + " # Show selected decades if any\n", + " if state.selected_decades:\n", + " decades_str = ', '.join(f\"{d}s\" for d in sorted(state.selected_decades))\n", + " filters.append(f\"Time: {decades_str} ({yr_str})\")\n", + " else:\n", + " filters.append(f\"Years: {yr_str}\")\n", + "\n", + " if filters:\n", + " active_filters_html.value = f\"
<b>Active:</b> {' | '.join(filters)}
\"\n", + " else:\n", + " active_filters_html.value = \"\"\n", + "\n", + "\n", + "def show_loading(message=\"Loading...\"):\n", + " \"\"\"Show loading indicator.\"\"\"\n", + " state.loading = True\n", + " loading_indicator.value = f\"\"\"\n", + "
 <div style='display:flex;align-items:center;gap:6px;'>\n", + " <i class='fa fa-spinner fa-spin'></i>\n", + " {message}\n", + " </div>\n", + "
\n", + " \"\"\"\n", + "\n", + "\n", + "def hide_loading():\n", + " \"\"\"Hide loading indicator.\"\"\"\n", + " state.loading = False\n", + " loading_indicator.value = \"\"\n", + "\n", + "\n", + "def select_sample(idx, source='map'):\n", + " \"\"\"\n", + " Select a sample by index and sync map/table/card.\n", + "\n", + " Args:\n", + " idx: Row index in current_gdf\n", + " source: 'map' or 'table' - which triggered the selection\n", + " \"\"\"\n", + " if idx is None or idx >= len(state.current_gdf):\n", + " return\n", + "\n", + " state.selected_index = idx\n", + " state.selected_row = state.current_gdf.iloc[idx]\n", + "\n", + " # Update sample card\n", + " card_output.value = render_sample_card(state.selected_row)\n", + "\n", + " if source == 'map':\n", + " # Map click -> highlight table row\n", + " # Column count depends on whether we're showing search_score\n", + " col_count = 4 if state.current_search else 3\n", + " sample_table.selections = [{'r1': idx, 'c1': 0, 'r2': idx, 'c2': col_count}]\n", + "\n", + " elif source == 'table':\n", + " # Table click -> recenter map (keep current zoom)\n", + " lat = state.selected_row['latitude']\n", + " lon = state.selected_row['longitude']\n", + " if not pd.isna(lat) and not pd.isna(lon):\n", + " sample_map.set_view_state(latitude=float(lat), longitude=float(lon))\n", + "\n", + "\n", + "def on_map_point_click(change):\n", + " \"\"\"Handle click on a map point - highlight corresponding table row.\"\"\"\n", + " if state.syncing_selection:\n", + " return\n", + "\n", + " idx = change.get('new')\n", + " if idx is None:\n", + " return\n", + "\n", + " state.syncing_selection = True\n", + " try:\n", + " select_sample(idx, source='map')\n", + " finally:\n", + " state.syncing_selection = False\n", + "\n", + "\n", + "def setup_layer_observer(layer):\n", + " \"\"\"Setup the selected_index observer on a layer.\"\"\"\n", + " layer.observe(on_map_point_click, names=['selected_index'])\n", + "\n", + "\n", + "def update_map_and_table(new_gdf, 
search_active=False):\n", + " \"\"\"Update map and table with new data.\"\"\"\n", + " state.current_gdf = new_gdf\n", + " state.current_search = search_input.value.strip() if search_active else \"\"\n", + "\n", + " # Update map with new layer\n", + " new_layer = create_map_layer(new_gdf)\n", + "\n", + " # Setup observer on new layer BEFORE adding to map\n", + " setup_layer_observer(new_layer)\n", + "\n", + " sample_map.layers = [new_layer]\n", + "\n", + " # Update table - include score column if searching\n", + " if search_active and 'search_score' in new_gdf.columns:\n", + " display_cols = ['search_score', 'source', 'label', 'latitude', 'longitude']\n", + " df_display = new_gdf[display_cols].copy()\n", + " df_display = df_display.rename(columns={'search_score': 'score'})\n", + " else:\n", + " display_cols = ['source', 'label', 'latitude', 'longitude']\n", + " df_display = new_gdf[display_cols].copy()\n", + "\n", + " df_display['latitude'] = df_display['latitude'].round(4)\n", + " df_display['longitude'] = df_display['longitude'].round(4)\n", + " sample_table.data = df_display\n", + "\n", + " # Update status\n", + " if search_active:\n", + " status_label.value = f\"Found: {len(new_gdf):,} matches for '{state.current_search}'\"\n", + " else:\n", + " status_label.value = f\"Loaded: {len(new_gdf):,} samples\"\n", + "\n", + " # Update active filters display\n", + " update_active_filters_display()\n", + "\n", + "\n", + "def get_effective_material_filters():\n", + " \"\"\"\n", + " Get the effective set of material URIs to filter by,\n", + " applying rollup expansion if enabled.\n", + " \"\"\"\n", + " if not state.material_filters:\n", + " return None\n", + "\n", + " if state.material_rollup:\n", + " # Expand to include descendants\n", + " return expand_material_filters_with_rollup(state.material_filters, all_material_counts)\n", + " else:\n", + " # Exact match only\n", + " return state.material_filters\n", + "\n", + "\n", + "def update_material_widget_options():\n", + " 
\"\"\"\n", + " Update material SelectMultiple options based on rollup toggle.\n", + " Preserves current selection when possible.\n", + " \"\"\"\n", + " global material_uri_map\n", + "\n", + " # Remember current selection (by URI)\n", + " current_uris = set()\n", + " for option_label in material_select.value:\n", + " if option_label in material_uri_map:\n", + " current_uris.add(material_uri_map[option_label])\n", + "\n", + " # Get cached counts\n", + " direct_cts, rollup_cts = get_cached_rollup_counts()\n", + "\n", + " # Rebuild options with new rollup setting\n", + " new_options, new_uri_map = build_hierarchical_material_options(\n", + " all_material_counts, direct_cts, rollup_cts, use_rollup=rollup_toggle.value\n", + " )\n", + "\n", + " # Update the global uri_map\n", + " material_uri_map = new_uri_map\n", + "\n", + " # Update widget options\n", + " material_select.options = new_options\n", + "\n", + " # Restore selection (find labels for remembered URIs)\n", + " new_selection = []\n", + " for label, uri in new_uri_map.items():\n", + " if uri in current_uris:\n", + " new_selection.append(label)\n", + "\n", + " material_select.value = tuple(new_selection)\n", + "\n", + "\n", + "def do_search():\n", + " \"\"\"Execute search with current parameters.\"\"\"\n", + " show_loading(\"Searching...\")\n", + "\n", + " try:\n", + " # Use facet filters instead of single source dropdown\n", + " source_filters_set = state.source_filters if state.source_filters else None\n", + " material_filters_set = get_effective_material_filters() # Apply rollup if enabled\n", + " year_range = state.year_range if (state.year_range[0] is not None or state.year_range[1] is not None) else None\n", + "\n", + " search_term = search_input.value.strip()\n", + "\n", + " if state.viewport_mode:\n", + " # Search within current viewport\n", + " view_state = sample_map.view_state\n", + " zoom = view_state.zoom if hasattr(view_state, 'zoom') else 1\n", + " bbox = view_state_to_bbox(view_state)\n", + "\n", + 
" # When searching, use slider value directly (no adaptive reduction)\n", + " # When browsing, use adaptive sampling based on zoom\n", + " if search_term:\n", + " max_samples = sample_count.value\n", + " else:\n", + " max_samples = adaptive_sample_size(zoom, base_size=sample_count.value)\n", + "\n", + " new_gdf = load_samples(\n", + " max_per_source=max_samples,\n", + " bbox=bbox,\n", + " search_term=search_term if search_term else None,\n", + " source_filters=source_filters_set,\n", + " material_filters=material_filters_set,\n", + " year_range=year_range\n", + " )\n", + "\n", + " zoom_info = f\" (zoom {zoom:.1f})\"\n", + " else:\n", + " # Search globally\n", + " new_gdf = load_samples(\n", + " max_per_source=sample_count.value,\n", + " search_term=search_term if search_term else None,\n", + " source_filters=source_filters_set,\n", + " material_filters=material_filters_set,\n", + " year_range=year_range\n", + " )\n", + " zoom_info = \"\"\n", + "\n", + " update_map_and_table(new_gdf, search_active=bool(search_term))\n", + "\n", + " if search_term:\n", + " status_label.value = f\"Found: {len(new_gdf):,} matches for '{search_term}'{zoom_info}\"\n", + " else:\n", + " status_label.value = f\"Loaded: {len(new_gdf):,} samples{zoom_info}\"\n", + "\n", + " except Exception as e:\n", + " status_label.value = f\"Error: {str(e)[:50]}\"\n", + " import traceback\n", + " traceback.print_exc()\n", + " finally:\n", + " hide_loading()\n", + "\n", + "\n", + "def on_search_click(b):\n", + " \"\"\"Handle search button click.\"\"\"\n", + " do_search()\n", + "\n", + "\n", + "def on_search_submit(change):\n", + " \"\"\"Handle Enter key in search box.\"\"\"\n", + " do_search()\n", + "\n", + "\n", + "def on_clear_search(b):\n", + " \"\"\"Clear search and reload data.\"\"\"\n", + " search_input.value = ''\n", + " do_search()\n", + "\n", + "\n", + "search_btn.on_click(on_search_click)\n", + "search_input.on_submit(on_search_submit)\n", + "clear_search_btn.on_click(on_clear_search)\n", + "\n", 
+ "\n", + "def load_viewport_data():\n", + " \"\"\"Load data for current viewport with adaptive sampling.\"\"\"\n", + " if state.loading:\n", + " return\n", + "\n", + " show_loading(\"Loading viewport data...\")\n", + "\n", + " try:\n", + " # Get current view state\n", + " view_state = sample_map.view_state\n", + " zoom = view_state.zoom if hasattr(view_state, 'zoom') else 1\n", + "\n", + " # Calculate bounding box\n", + " bbox = view_state_to_bbox(view_state)\n", + "\n", + " # Get facet filters and search term\n", + " source_filters_set = state.source_filters if state.source_filters else None\n", + " material_filters_set = get_effective_material_filters() # Apply rollup if enabled\n", + " year_range = state.year_range if (state.year_range[0] is not None or state.year_range[1] is not None) else None\n", + " search_term = search_input.value.strip() if search_input.value.strip() else None\n", + "\n", + " # When searching, use slider value directly (no adaptive reduction)\n", + " # When browsing, use adaptive sampling based on zoom\n", + " if search_term:\n", + " max_samples = sample_count.value\n", + " else:\n", + " max_samples = adaptive_sample_size(zoom, base_size=sample_count.value)\n", + "\n", + " # Load data\n", + " new_gdf = load_samples(\n", + " max_per_source=max_samples,\n", + " bbox=bbox,\n", + " search_term=search_term,\n", + " source_filters=source_filters_set,\n", + " material_filters=material_filters_set,\n", + " year_range=year_range\n", + " )\n", + "\n", + " update_map_and_table(new_gdf, search_active=bool(search_term))\n", + "\n", + " # Show zoom info in status\n", + " if search_term:\n", + " status_label.value = f\"Found: {len(new_gdf):,} matches for '{search_term}' (zoom {zoom:.1f})\"\n", + " else:\n", + " status_label.value = f\"Loaded: {len(new_gdf):,} samples (zoom {zoom:.1f}, {max_samples:,}/source max)\"\n", + "\n", + " except Exception as e:\n", + " status_label.value = f\"Error: {str(e)[:50]}\"\n", + " finally:\n", + " hide_loading()\n", + 
"\n", + "\n", + "def debounced_viewport_load():\n", + " \"\"\"Debounced viewport loading - waits for user to stop panning/zooming.\"\"\"\n", + " # Cancel any existing timer\n", + " if state.debounce_timer is not None:\n", + " state.debounce_timer.cancel()\n", + "\n", + " # Set new timer (500ms delay)\n", + " state.debounce_timer = threading.Timer(0.5, load_viewport_data)\n", + " state.debounce_timer.start()\n", + "\n", + "\n", + "def on_view_state_change(change):\n", + " \"\"\"Handle map pan/zoom changes.\"\"\"\n", + " if state.viewport_mode and not state.loading:\n", + " debounced_viewport_load()\n", + "\n", + "\n", + "def on_viewport_toggle(change):\n", + " \"\"\"Handle viewport mode toggle.\"\"\"\n", + " state.viewport_mode = change['new']\n", + " if change['new']:\n", + " viewport_toggle.button_style = 'success'\n", + " viewport_toggle.description = 'Viewport Mode ON'\n", + " # Immediately load viewport data\n", + " load_viewport_data()\n", + " else:\n", + " viewport_toggle.button_style = ''\n", + " viewport_toggle.description = 'Viewport Mode'\n", + "\n", + "\n", + "viewport_toggle.observe(on_viewport_toggle, names=['value'])\n", + "\n", + "\n", + "# Event handlers\n", + "def on_refresh_click(b):\n", + " do_search() # Refresh now uses same logic as search\n", + "\n", + "refresh_btn.on_click(on_refresh_click)\n", + "\n", + "\n", + "def on_table_selection(change):\n", + " \"\"\"Handle table row selection - recenter map on selected point.\"\"\"\n", + " if state.syncing_selection:\n", + " return\n", + "\n", + " # selections is a LIST of selection dicts\n", + " selections = change.get('new', [])\n", + " if selections and len(selections) > 0:\n", + " # Get the first selection\n", + " sel = selections[0]\n", + " row_idx = sel.get('r1')\n", + " if row_idx is not None and row_idx < len(state.current_gdf):\n", + " state.syncing_selection = True\n", + " try:\n", + " select_sample(row_idx, source='table')\n", + " finally:\n", + " state.syncing_selection = False\n", + 
"\n", + "sample_table.observe(on_table_selection, names=['selections'])\n", + "\n", + "# Register view_state observer on the map\n", + "sample_map.observe(on_view_state_change, names=['view_state'])\n", + "\n", + "# Setup observer on initial layer\n", + "setup_layer_observer(sample_map.layers[0])\n", + "\n", + "\n", + "# =============================================================================\n", + "# Facet Event Handlers\n", + "# =============================================================================\n", + "\n", + "def on_source_checkbox_change(change):\n", + " \"\"\"Handle source checkbox changes.\"\"\"\n", + " # Rebuild source_filters from all checkboxes\n", + " state.source_filters = set()\n", + " for cb in source_checkboxes:\n", + " if cb.value:\n", + " state.source_filters.add(cb.source_name)\n", + "\n", + " # Trigger data reload\n", + " do_search()\n", + "\n", + "\n", + "def on_material_select_change(change):\n", + " \"\"\"Handle material selection changes.\"\"\"\n", + " # Convert selected option labels to URIs\n", + " state.material_filters = set()\n", + " for option_label in material_select.value:\n", + " if option_label in material_uri_map:\n", + " state.material_filters.add(material_uri_map[option_label])\n", + "\n", + " # Trigger data reload\n", + " do_search()\n", + "\n", + "\n", + "def on_rollup_toggle_change(change):\n", + " \"\"\"Handle rollup toggle changes.\"\"\"\n", + " state.material_rollup = rollup_toggle.value\n", + "\n", + " # Update the widget options to show correct counts\n", + " update_material_widget_options()\n", + "\n", + " # If materials are selected, reload with new rollup setting\n", + " if state.material_filters:\n", + " do_search()\n", + "\n", + "\n", + "def on_refresh_counts_click(b):\n", + " \"\"\"Handle refresh counts button click - recalculate material counts.\"\"\"\n", + " global direct_counts, rollup_counts\n", + " \n", + " show_loading(\"Recalculating counts...\")\n", + " try:\n", + " # Force refresh of 
counts\n", + " direct_counts, rollup_counts, elapsed = compute_accurate_rollup_counts(force_refresh=True)\n", + " \n", + " # Update widget display\n", + " update_material_widget_options()\n", + " \n", + " status_label.value = f\"Counts refreshed in {elapsed:.1f}s\"\n", + " except Exception as e:\n", + " status_label.value = f\"Error: {str(e)[:50]}\"\n", + " finally:\n", + " hide_loading()\n", + "\n", + "\n", + "# Flag to prevent infinite loops when syncing decades/slider\n", + "_syncing_time_widgets = False\n", + "\n", + "\n", + "def sync_slider_from_decades():\n", + " \"\"\"Update year slider based on selected decades.\"\"\"\n", + " global _syncing_time_widgets\n", + " if _syncing_time_widgets:\n", + " return\n", + " \n", + " _syncing_time_widgets = True\n", + " try:\n", + " if state.selected_decades:\n", + " # Calculate min/max from selected decades\n", + " min_year = min(state.selected_decades)\n", + " max_year = max(state.selected_decades) + 9 # End of decade (e.g., 2010 -> 2019)\n", + " year_slider.value = [min_year, max_year]\n", + " finally:\n", + " _syncing_time_widgets = False\n", + "\n", + "\n", + "def on_decade_checkbox_change(change):\n", + " \"\"\"Handle decade checkbox changes.\"\"\"\n", + " global _syncing_time_widgets\n", + " if _syncing_time_widgets:\n", + " return\n", + " \n", + " # Rebuild selected_decades from all checkboxes\n", + " state.selected_decades = set()\n", + " for cb in decade_checkboxes:\n", + " if cb.value:\n", + " state.selected_decades.add(cb.decade_value)\n", + "\n", + " # Auto-enable time filter if any decade selected\n", + " if state.selected_decades and not time_filter_enabled.value:\n", + " _syncing_time_widgets = True\n", + " time_filter_enabled.value = True\n", + " _syncing_time_widgets = False\n", + "\n", + " # Sync slider if enabled\n", + " if sync_decades_to_slider.value:\n", + " sync_slider_from_decades()\n", + "\n", + " # Update state year_range from slider (which may have been synced)\n", + " if 
time_filter_enabled.value:\n", + " state.year_range = (year_slider.value[0], year_slider.value[1])\n", + " else:\n", + " state.year_range = (None, None)\n", + "\n", + " # Trigger data reload\n", + " do_search()\n", + "\n", + "\n", + "def on_time_filter_change(change):\n", + " \"\"\"Handle time filter enable/disable.\"\"\"\n", + " global _syncing_time_widgets\n", + " if _syncing_time_widgets:\n", + " return\n", + " \n", + " if time_filter_enabled.value:\n", + " state.year_range = (year_slider.value[0], year_slider.value[1])\n", + " else:\n", + " state.year_range = (None, None)\n", + " # Also clear decade selections when disabling time filter\n", + " _syncing_time_widgets = True\n", + " state.selected_decades = set()\n", + " for cb in decade_checkboxes:\n", + " cb.value = False\n", + " _syncing_time_widgets = False\n", + "\n", + " do_search()\n", + "\n", + "\n", + "def on_year_slider_change(change):\n", + " \"\"\"Handle year slider changes.\"\"\"\n", + " global _syncing_time_widgets\n", + " if _syncing_time_widgets:\n", + " return\n", + " \n", + " if time_filter_enabled.value:\n", + " state.year_range = (year_slider.value[0], year_slider.value[1])\n", + " \n", + " # Update decade checkboxes to reflect slider range (if sync enabled)\n", + " if sync_decades_to_slider.value:\n", + " _syncing_time_widgets = True\n", + " min_yr, max_yr = year_slider.value\n", + " state.selected_decades = set()\n", + " for cb in decade_checkboxes:\n", + " decade = cb.decade_value\n", + " # Check if decade overlaps with slider range\n", + " decade_end = decade + 9\n", + " overlaps = (decade <= max_yr) and (decade_end >= min_yr)\n", + " cb.value = overlaps\n", + " if overlaps:\n", + " state.selected_decades.add(decade)\n", + " _syncing_time_widgets = False\n", + " \n", + " do_search()\n", + "\n", + "\n", + "def on_clear_all_filters(b):\n", + " \"\"\"Clear all facet filters.\"\"\"\n", + " global _syncing_time_widgets\n", + " _syncing_time_widgets = True\n", + " \n", + " # Clear source 
checkboxes\n", + " for cb in source_checkboxes:\n", + " cb.value = False\n", + " state.source_filters = set()\n", + "\n", + " # Clear material selection\n", + " material_select.value = []\n", + " state.material_filters = set()\n", + "\n", + " # Reset rollup to default (on)\n", + " rollup_toggle.value = True\n", + " state.material_rollup = True\n", + "\n", + " # Clear time filter\n", + " time_filter_enabled.value = False\n", + " for cb in decade_checkboxes:\n", + " cb.value = False\n", + " state.selected_decades = set()\n", + " year_slider.value = [year_stats['min_year'], year_stats['max_year']]\n", + " state.year_range = (None, None)\n", + "\n", + " # Clear search\n", + " search_input.value = ''\n", + " \n", + " _syncing_time_widgets = False\n", + "\n", + " # Reload data\n", + " do_search()\n", + "\n", + "\n", + "# Wire up facet event handlers\n", + "for cb in source_checkboxes:\n", + " cb.observe(on_source_checkbox_change, names=['value'])\n", + "\n", + "material_select.observe(on_material_select_change, names=['value'])\n", + "rollup_toggle.observe(on_rollup_toggle_change, names=['value'])\n", + "refresh_counts_btn.on_click(on_refresh_counts_click)\n", + "\n", + "# Wire up decade checkboxes\n", + "for cb in decade_checkboxes:\n", + " cb.observe(on_decade_checkbox_change, names=['value'])\n", + "\n", + "time_filter_enabled.observe(on_time_filter_change, names=['value'])\n", + "year_slider.observe(on_year_slider_change, names=['value'])\n", + "clear_filters_btn.on_click(on_clear_all_filters)\n", + "\n", + "print(\"Facet widgets ready!\")" + ] }, { "cell_type": "markdown", @@ -303,7 +1792,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Usage\n\n### Faceted Filters (Center Panel)\n\nThe **Filters** panel provides multi-select faceted filtering:\n\n**Sources** - Filter by data source (multi-select):\n- Check one or more sources to show only samples from those sources\n- Counts show total samples per source\n- Unchecking all shows all 
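The viewport handlers above implement a trailing-edge debounce with `threading.Timer`: each pan/zoom event cancels the pending timer and schedules a fresh one, so only the final event in a burst triggers a data load. A minimal self-contained sketch of that pattern (the `Debouncer` class is a hypothetical helper, not part of the notebook):

```python
import threading
import time

class Debouncer:
    """Coalesce rapid events into one trailing call.

    Mirrors the notebook's viewport handling: every new event cancels
    the pending timer, so the callback fires only after `delay` seconds
    of quiet.
    """

    def __init__(self, delay: float, fn):
        self.delay = delay
        self.fn = fn
        self._timer = None
        self._lock = threading.Lock()  # trigger() may be called from UI threads

    def trigger(self, *args):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a newer event supersedes the pending one
            self._timer = threading.Timer(self.delay, self.fn, args=args)
            self._timer.start()

# Simulate three rapid "pan" events; only the last survives the window.
calls = []
d = Debouncer(0.05, calls.append)
for zoom in (1, 2, 3):
    d.trigger(zoom)
time.sleep(0.2)
print(calls)  # → [3]
```

The lock matters because `ipywidgets` observers and `Timer` callbacks can race; without it, two near-simultaneous triggers could both start timers and the callback would fire twice.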
sources\n\n**Material Type** - Hierarchical filter with rollup:\n- Shows the iSamples 3-level material vocabulary as an indented tree\n- **Indentation shows hierarchy**: Top-level (Material), mid-level (Earth Material), leaf-level (Rock)\n- **Include subcategories** (default ON): Selecting \"Earth Material\" also includes Rock, Sediment, Soil, Mineral, Mixed\n- **Without rollup**: Only samples explicitly tagged at the selected level are shown\n- Use Ctrl/Cmd+click to select multiple materials\n\n**Time Period** - Filter by collection/sampling date:\n- **Filter by time**: Check to enable time filtering\n- **Decade quick-select**: Click decade checkboxes (e.g., \"2010s\") for fast filtering\n - Selecting multiple decades spans them (e.g., 2000s + 2010s = 2000-2019)\n - Auto-enables time filter when any decade is selected\n- **Year range slider**: Fine-tune the exact year range\n- **Sync decades to slider** (default ON): Keeps checkboxes and slider in sync\n - Checking \"2010s\" sets slider to 2010-2019\n - Moving slider updates which decade checkboxes are highlighted\n\n**Clear All Filters** - Reset all facet selections and search\n\n### Material Hierarchy Example\n\nThe material vocabulary has 3 levels. 
With \"Include subcategories\" ON:\n\n```\nSelecting \"Earth Material\" includes:\n → Rock (1M samples)\n → Sediment (66K)\n → Soil (32K)\n → Mineral (300K)\n → Mixed Soil/Sediment/Rock (838K)\n = Total ~2.3M samples\n```\n\nWithout rollup, selecting \"Earth Material\" only shows the ~2.2M samples tagged directly at that level (not the children).\n\n### Time Period Examples\n\n**Quick decade selection:**\n- Click \"2010s\" → Shows samples from 2010-2019\n- Click \"2000s\" AND \"2010s\" → Shows samples from 2000-2019\n- Click \"1990s\", \"2000s\", \"2010s\" → Shows samples from 1990-2019\n\n**Fine-tune with slider:**\n- After selecting decades, adjust slider to narrow further (e.g., 2015-2018)\n- Disable \"Sync decades to slider\" to use slider independently\n\n### Search\n\nSearch filters samples by matching text in **label**, **description**, and **place name** fields:\n\n- **Enter a term**: Type \"pottery\", \"basalt\", \"Cyprus\", etc. and press Enter\n- **Results are ranked**: Label matches (10 pts) > Description (5 pts) > Place name (3 pts)\n- **Score column**: When searching, a \"score\" column appears in the table showing match quality\n- **Combines with facets**: Search works together with facet filters (AND logic)\n- **Viewport aware**: With Viewport Mode ON, search is limited to the current map view\n\n### Selection Sync (Bidirectional)\n\nMap and table selections are synchronized:\n\n- **Click a dot on the map** → The corresponding row is highlighted in the table, and the sample card updates\n- **Click a row in the table** → The map recenters on that point (zoom level is preserved), and the sample card updates\n\nThis makes it easy to explore samples visually on the map and then find them in the table, or vice versa.\n\n### Viewport Mode (Dynamic Loading)\n\nEnable **Viewport Mode** to automatically reload data as you pan and zoom:\n\n- **Toggle ON**: Click the \"Viewport Mode\" button (turns green when active)\n- **Pan/zoom**: Data reloads automatically 
after you stop moving (500ms debounce)\n- **Loading indicator**: Spinner shows while data is being fetched\n- **Adaptive sampling**: \n - World view (zoom < 2): max 10K samples per source\n - Continent (zoom 2-5): max 25K per source\n - Country (zoom 5-8): max 50K per source\n - Region (zoom 8-12): max 100K per source\n - Local (zoom > 12): uses your slider value\n\n### Active Filters Display\n\nWhen filters are active, colored tags appear below the controls showing:\n- **Blue tag**: Active source filters\n- **Green tag**: Active material filters (shows \"+sub\" when rollup is enabled)\n- **Orange tag**: Active time range (shows selected decades if any)\n\n### Filter Combinations\n\nAll filters work together with AND logic:\n- Source + Material: Show pottery samples from OpenContext only\n- Material + Time: Show rock samples collected in the 2010s\n- Source + Time + Search: Find \"Cyprus\" in SESAR samples from 2000-2020\n\n### Color Legend\n- **Blue**: SESAR (geological samples, IGSNs)\n- **Red**: OpenContext (archaeological samples)\n- **Green**: GEOME (genomic/biological samples)\n- **Orange**: Smithsonian (museum collections)" + "source": "## Usage\n\n### Faceted Filters (Center Panel)\n\nThe **Filters** panel provides multi-select faceted filtering:\n\n**Sources** - Filter by data source (multi-select):\n- Check one or more sources to show only samples from those sources\n- Counts show total samples per source\n- Unchecking all shows all sources\n\n**Material Type** - Hierarchical filter with rollup:\n- Shows the iSamples 3-level material vocabulary as an indented tree\n- **Indentation shows hierarchy**: Top-level (Material), mid-level (Earth Material), leaf-level (Rock)\n- **Include subcategories** (default ON): Selecting \"Earth Material\" also includes Rock, Sediment, Soil, Mineral, Mixed\n- **Without rollup**: Only samples explicitly tagged at the selected level are shown\n- Use Ctrl/Cmd+click to select multiple materials\n\n**Time Period** - Filter by 
collection/sampling date:\n- **Filter by time**: Check to enable time filtering\n- **Decade quick-select**: Click decade checkboxes (e.g., \"2010s\") for fast filtering\n - Selecting multiple decades spans them (e.g., 2000s + 2010s = 2000-2019)\n - Auto-enables time filter when any decade is selected\n- **Year range slider**: Fine-tune the exact year range\n- **Sync decades to slider** (default ON): Keeps checkboxes and slider in sync\n - Checking \"2010s\" sets slider to 2010-2019\n - Moving slider updates which decade checkboxes are highlighted\n\n**Clear All Filters** - Reset all facet selections and search\n\n### Material Hierarchy Example\n\nThe material vocabulary has 3 levels. With \"Include subcategories\" ON:\n\n```\nSelecting \"Earth Material\" includes:\n \u2192 Rock (1M samples)\n \u2192 Sediment (66K)\n \u2192 Soil (32K)\n \u2192 Mineral (300K)\n \u2192 Mixed Soil/Sediment/Rock (838K)\n = Total ~2.3M samples\n```\n\nWithout rollup, selecting \"Earth Material\" only shows the ~2.2M samples tagged directly at that level (not the children).\n\n### Time Period Examples\n\n**Quick decade selection:**\n- Click \"2010s\" \u2192 Shows samples from 2010-2019\n- Click \"2000s\" AND \"2010s\" \u2192 Shows samples from 2000-2019\n- Click \"1990s\", \"2000s\", \"2010s\" \u2192 Shows samples from 1990-2019\n\n**Fine-tune with slider:**\n- After selecting decades, adjust slider to narrow further (e.g., 2015-2018)\n- Disable \"Sync decades to slider\" to use slider independently\n\n### Search\n\nSearch filters samples by matching text in **label**, **description**, and **place name** fields:\n\n- **Enter a term**: Type \"pottery\", \"basalt\", \"Cyprus\", etc. 
and press Enter\n- **Results are ranked**: Label matches (10 pts) > Description (5 pts) > Place name (3 pts)\n- **Score column**: When searching, a \"score\" column appears in the table showing match quality\n- **Combines with facets**: Search works together with facet filters (AND logic)\n- **Viewport aware**: With Viewport Mode ON, search is limited to the current map view\n\n### Selection Sync (Bidirectional)\n\nMap and table selections are synchronized:\n\n- **Click a dot on the map** \u2192 The corresponding row is highlighted in the table, and the sample card updates\n- **Click a row in the table** \u2192 The map recenters on that point (zoom level is preserved), and the sample card updates\n\nThis makes it easy to explore samples visually on the map and then find them in the table, or vice versa.\n\n### Viewport Mode (Dynamic Loading)\n\nEnable **Viewport Mode** to automatically reload data as you pan and zoom:\n\n- **Toggle ON**: Click the \"Viewport Mode\" button (turns green when active)\n- **Pan/zoom**: Data reloads automatically after you stop moving (500ms debounce)\n- **Loading indicator**: Spinner shows while data is being fetched\n- **Adaptive sampling**: \n - World view (zoom < 2): max 10K samples per source\n - Continent (zoom 2-5): max 25K per source\n - Country (zoom 5-8): max 50K per source\n - Region (zoom 8-12): max 100K per source\n - Local (zoom > 12): uses your slider value\n\n### Active Filters Display\n\nWhen filters are active, colored tags appear below the controls showing:\n- **Blue tag**: Active source filters\n- **Green tag**: Active material filters (shows \"+sub\" when rollup is enabled)\n- **Orange tag**: Active time range (shows selected decades if any)\n\n### Filter Combinations\n\nAll filters work together with AND logic:\n- Source + Material: Show pottery samples from OpenContext only\n- Material + Time: Show rock samples collected in the 2010s\n- Source + Time + Search: Find \"Cyprus\" in SESAR samples from 2000-2020\n\n### 
Color Legend\n- **Blue**: SESAR (geological samples, IGSNs)\n- **Red**: OpenContext (archaeological samples)\n- **Green**: GEOME (genomic/biological samples)\n- **Orange**: Smithsonian (museum collections)" }, { "cell_type": "markdown", @@ -351,7 +1840,7 @@ }, { "cell_type": "code", - "source": "# Compute Material Type Hierarchy with Sample Counts\n# \n# This analysis shows how samples are tagged at different levels of the \n# iSamples material vocabulary hierarchy.\n\ndef compute_material_hierarchy():\n \"\"\"\n Compute material type counts and display as a hierarchy tree.\n \n The iSamples material vocabulary is a 3-level SKOS hierarchy.\n This function queries the data to show counts at each level.\n \"\"\"\n # Get all material counts from the data\n query = \"\"\"\n WITH samples AS (\n SELECT UNNEST(p__has_material_category) as material_id\n FROM read_parquet(?)\n WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\n )\n SELECT ic.label as uri, COUNT(*) as cnt\n FROM samples s\n JOIN read_parquet(?) 
ic ON ic.row_id = s.material_id\n WHERE ic.label IS NOT NULL\n GROUP BY ic.label\n ORDER BY cnt DESC\n \"\"\"\n df = con.execute(query, [PARQUET_PATH, PARQUET_PATH]).df()\n \n # Build counts dict from URI last segment\n counts = {}\n for _, row in df.iterrows():\n key = row['uri'].rstrip('/').split('/')[-1].lower()\n counts[key] = row['cnt']\n \n # Define the hierarchy structure (based on iSamples vocabulary)\n # https://isamplesorg.github.io/metadata/vocabularies/material.html\n hierarchy = [\n (\"Material\", \"material\", 0, [\n (\"Natural Solid Material\", \"naturalsolidmaterial\", 1, [\n (\"Earth Material\", \"earthmaterial\", 2, [\n (\"Rock\", \"rock\", 3, []),\n (\"Sediment\", \"sediment\", 3, []),\n (\"Soil\", \"soil\", 3, []),\n (\"Mineral\", \"mineral\", 3, []),\n (\"Mixed Soil/Sediment/Rock\", \"mixedsoilsedimentrock\", 3, []),\n ]),\n (\"Biogenic Non-organic\", \"biogenicnonorganicmaterial\", 2, []),\n ]),\n (\"Organic Material\", \"organicmaterial\", 1, [\n (\"Plant Material\", \"plantmaterial\", 2, []),\n (\"Animal Product\", \"organicanimalproduct\", 2, []),\n ]),\n (\"Anthropogenic Material\", \"anyanthropogenicmaterial\", 1, [\n (\"Anthropogenic Metal\", \"anthropogenicmetal\", 2, []),\n (\"Ceramic Clay\", \"ceramicclay\", 2, []),\n ]),\n (\"Fluid Material\", \"fluidmaterial\", 1, [\n (\"Liquid Water\", \"liquidwater\", 2, []),\n (\"Gas\", \"gas\", 2, []),\n (\"Non-aqueous Liquid\", \"nonaqueousliquid\", 2, []),\n ]),\n (\"Dispersed Media\", \"dispersedmedia\", 1, [\n (\"Particulate\", \"particulate\", 2, []),\n ]),\n (\"Any Ice\", \"anyice\", 1, []),\n ])\n ]\n \n def print_node(nodes, indent=0):\n \"\"\"Recursively print hierarchy with counts.\"\"\"\n for name, key, level, children in nodes:\n cnt = counts.get(key, 0)\n prefix = \" \" * indent\n marker = \"- \" if indent == 0 else \"└─ \"\n \n # Calculate rollup (what count WOULD be with proper rollup)\n def calc_rollup(node_list):\n total = 0\n for n, k, l, c in node_list:\n total += 
counts.get(k, 0) + calc_rollup(c)\n return total\n \n rollup = cnt + calc_rollup(children)\n \n if children and rollup != cnt:\n print(f\"{prefix}{marker}**{name}**: {cnt:,} (rollup would be {rollup:,})\")\n else:\n print(f\"{prefix}{marker}**{name}**: {cnt:,}\")\n \n if children:\n print_node(children, indent + 1)\n \n print(\"Material Type Hierarchy with Sample Counts\")\n print(\"=\" * 50)\n print()\n print_node(hierarchy)\n print()\n \n # Summary statistics\n total_tags = sum(counts.values())\n total_samples = con.execute(f\"\"\"\n SELECT COUNT(*) FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\n \"\"\").fetchone()[0]\n \n # Materials per sample distribution\n dist_query = f\"\"\"\n SELECT \n LEN(p__has_material_category) as num_materials,\n COUNT(*) as num_samples\n FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord' \n AND latitude IS NOT NULL\n AND p__has_material_category IS NOT NULL\n GROUP BY LEN(p__has_material_category)\n ORDER BY num_materials\n \"\"\"\n dist_df = con.execute(dist_query).df()\n \n print(\"Summary Statistics\")\n print(\"-\" * 30)\n print(f\"Total samples with coordinates: {total_samples:,}\")\n print(f\"Total material tags: {total_tags:,}\")\n print(f\"Unique material types: {len(counts)}\")\n print()\n print(\"Materials per sample:\")\n for _, row in dist_df.iterrows():\n print(f\" {int(row['num_materials'])} material(s): {int(row['num_samples']):,} samples\")\n \n return counts, hierarchy\n\n# Run the analysis\nmaterial_counts, material_hierarchy = compute_material_hierarchy()", + "source": "# Compute Material Type Hierarchy with Sample Counts\n# \n# This analysis shows how samples are tagged at different levels of the \n# iSamples material vocabulary hierarchy.\n\ndef compute_material_hierarchy():\n \"\"\"\n Compute material type counts and display as a hierarchy tree.\n \n The iSamples material vocabulary is a 3-level SKOS hierarchy.\n This function queries 
the data to show counts at each level.\n \"\"\"\n # Get all material counts from the data\n query = \"\"\"\n WITH samples AS (\n SELECT UNNEST(p__has_material_category) as material_id\n FROM read_parquet(?)\n WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\n )\n SELECT ic.label as uri, COUNT(*) as cnt\n FROM samples s\n JOIN read_parquet(?) ic ON ic.row_id = s.material_id\n WHERE ic.label IS NOT NULL\n GROUP BY ic.label\n ORDER BY cnt DESC\n \"\"\"\n df = con.execute(query, [PARQUET_PATH, PARQUET_PATH]).df()\n \n # Build counts dict from URI last segment\n counts = {}\n for _, row in df.iterrows():\n key = row['uri'].rstrip('/').split('/')[-1].lower()\n counts[key] = row['cnt']\n \n # Define the hierarchy structure (based on iSamples vocabulary)\n # https://isamplesorg.github.io/metadata/vocabularies/material.html\n hierarchy = [\n (\"Material\", \"material\", 0, [\n (\"Natural Solid Material\", \"naturalsolidmaterial\", 1, [\n (\"Earth Material\", \"earthmaterial\", 2, [\n (\"Rock\", \"rock\", 3, []),\n (\"Sediment\", \"sediment\", 3, []),\n (\"Soil\", \"soil\", 3, []),\n (\"Mineral\", \"mineral\", 3, []),\n (\"Mixed Soil/Sediment/Rock\", \"mixedsoilsedimentrock\", 3, []),\n ]),\n (\"Biogenic Non-organic\", \"biogenicnonorganicmaterial\", 2, []),\n ]),\n (\"Organic Material\", \"organicmaterial\", 1, [\n (\"Plant Material\", \"plantmaterial\", 2, []),\n (\"Animal Product\", \"organicanimalproduct\", 2, []),\n ]),\n (\"Anthropogenic Material\", \"anyanthropogenicmaterial\", 1, [\n (\"Anthropogenic Metal\", \"anthropogenicmetal\", 2, []),\n (\"Ceramic Clay\", \"ceramicclay\", 2, []),\n ]),\n (\"Fluid Material\", \"fluidmaterial\", 1, [\n (\"Liquid Water\", \"liquidwater\", 2, []),\n (\"Gas\", \"gas\", 2, []),\n (\"Non-aqueous Liquid\", \"nonaqueousliquid\", 2, []),\n ]),\n (\"Dispersed Media\", \"dispersedmedia\", 1, [\n (\"Particulate\", \"particulate\", 2, []),\n ]),\n (\"Any Ice\", \"anyice\", 1, []),\n ])\n ]\n \n def print_node(nodes, indent=0):\n 
\"\"\"Recursively print hierarchy with counts.\"\"\"\n for name, key, level, children in nodes:\n cnt = counts.get(key, 0)\n prefix = \" \" * indent\n marker = \"- \" if indent == 0 else \"\u2514\u2500 \"\n \n # Calculate rollup (what count WOULD be with proper rollup)\n def calc_rollup(node_list):\n total = 0\n for n, k, l, c in node_list:\n total += counts.get(k, 0) + calc_rollup(c)\n return total\n \n rollup = cnt + calc_rollup(children)\n \n if children and rollup != cnt:\n print(f\"{prefix}{marker}**{name}**: {cnt:,} (rollup would be {rollup:,})\")\n else:\n print(f\"{prefix}{marker}**{name}**: {cnt:,}\")\n \n if children:\n print_node(children, indent + 1)\n \n print(\"Material Type Hierarchy with Sample Counts\")\n print(\"=\" * 50)\n print()\n print_node(hierarchy)\n print()\n \n # Summary statistics\n total_tags = sum(counts.values())\n total_samples = con.execute(f\"\"\"\n SELECT COUNT(*) FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord' AND latitude IS NOT NULL\n \"\"\").fetchone()[0]\n \n # Materials per sample distribution\n dist_query = f\"\"\"\n SELECT \n LEN(p__has_material_category) as num_materials,\n COUNT(*) as num_samples\n FROM read_parquet('{PARQUET_PATH}')\n WHERE otype = 'MaterialSampleRecord' \n AND latitude IS NOT NULL\n AND p__has_material_category IS NOT NULL\n GROUP BY LEN(p__has_material_category)\n ORDER BY num_materials\n \"\"\"\n dist_df = con.execute(dist_query).df()\n \n print(\"Summary Statistics\")\n print(\"-\" * 30)\n print(f\"Total samples with coordinates: {total_samples:,}\")\n print(f\"Total material tags: {total_tags:,}\")\n print(f\"Unique material types: {len(counts)}\")\n print()\n print(\"Materials per sample:\")\n for _, row in dist_df.iterrows():\n print(f\" {int(row['num_materials'])} material(s): {int(row['num_samples']):,} samples\")\n \n return counts, hierarchy\n\n# Run the analysis\nmaterial_counts, material_hierarchy = compute_material_hierarchy()", "metadata": {}, "execution_count": 
null, "outputs": [] @@ -378,4 +1867,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/examples/basic/pqg_demo.ipynb b/examples/basic/pqg_demo.ipynb index 449011f..dc7a207 100644 --- a/examples/basic/pqg_demo.ipynb +++ b/examples/basic/pqg_demo.ipynb @@ -831,6 +831,230 @@ "- **Example script:** `pqg/examples/typed_edges_demo.py`\n", "- **Tests:** `pqg/tests/test_typed_edges.py`" ] + }, + { + "cell_type": "markdown", + "id": "wide-shortcut-header", + "metadata": {}, + "source": [ + "## Wide Format Shortcut: H3 Spatial Queries\n", + "\n", + "The narrow (graph) format requires multi-hop traversals to answer spatial questions:\n", + "\n", + "```\n", + "Sample --produced_by--> SamplingEvent --sample_location--> GeospatialCoordLocation\n", + "```\n", + "\n", + "The **wide format with H3 columns** flattens these joins into a single row per sample,\n", + "with `latitude`, `longitude`, and pre-computed `h3_res4/6/8` columns. This enables\n", + "spatial queries in a single scan \u2014 no graph traversal needed.\n", + "\n", + "Below we compare both approaches for the same query: **materials in the San Francisco Bay Area**." 
+ ] + } + { + "cell_type": "code", + "execution_count": null, + "id": "wide-shortcut-setup", + "metadata": {}, + "outputs": [], + "source": [ + "import duckdb\n", + "import time\n", + "\n", + "WIDE_H3_URL = \"https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet\"\n", + "\n", + "con_wide = duckdb.connect()\n", + "con_wide.execute(\"INSTALL h3 FROM community; LOAD h3;\")\n", + "\n", + "# Bay Area bounding box\n", + "BBOX = dict(min_lat=37.2, max_lat=37.9, min_lon=-122.6, max_lon=-121.8)\n", + "\n", + "print(\"Wide format connection ready with H3 extension.\")" + ] + }, + { + "cell_type": "markdown", + "id": "wide-graph-approach", + "metadata": {}, + "source": [ + "### Approach 1: Graph Traversal (Narrow Format)\n", + "\n", + "Walk the edge chain: `MaterialSampleRecord -> produced_by -> SamplingEvent -> sample_location\n", + "-> GeospatialCoordLocation`, then filter by lat/lon bounds." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "wide-graph-code", + "metadata": {}, + "outputs": [], + "source": [ + "# Multi-hop graph traversal for Bay Area samples\n", + "# This uses the PQG narrow-format approach through typed edges\n", + "t0 = time.time()\n", + "\n", + "try:\n", + " graph_samples = []\n", + " for subject_pid, pred, objects, ng, et in typed_queries.get_edges_by_type(\n", + " ISamplesEdgeType.MSR_PRODUCED_BY, limit=50000\n", + " ):\n", + " # For each sample -> event, find event -> location\n", + " for obj_pid in objects:\n", + " for s2, p2, objs2, ng2, et2 in typed_queries.get_edges_by_type(\n", + " ISamplesEdgeType.EVENT_SAMPLE_LOCATION, limit=5\n", + " ):\n", + " # Note: Full implementation would match obj_pid to s2\n", + " # and extract coordinates from the location node.\n", + " # This is shown as pseudocode \u2014 the wide format\n", + " # shortcut below is the recommended approach.\n", + " pass # placeholder: real code would match obj_pid to s2 here\n", + " graph_ms = (time.time() - t0) * 1000\n", + " print(f\"Graph traversal (50K sample limit): {graph_ms:.0f} ms\")\n", + 
"except Exception as e:\n", + " graph_ms = (time.time() - t0) * 1000\n", + " print(f\"Graph traversal attempted: {graph_ms:.0f} ms\")\n", + " print(f\"(PQG typed edges require narrow-format parquet loaded above: {e})\")\n", + "\n", + "print(\"\\nGraph traversal requires:\")\n", + "print(\" 1. Find Sample -> SamplingEvent edges (produced_by)\")\n", + "print(\" 2. For each event, find Event -> Location edges (sample_location)\")\n", + "print(\" 3. Extract lat/lon from Location nodes\")\n", + "print(\" 4. Filter by bounding box\")\n", + "print(\" = Multiple hops through millions of edges\")" + ] + }, + { + "cell_type": "markdown", + "id": "wide-h3-approach", + "metadata": {}, + "source": [ + "### Approach 2: H3 Spatial Shortcut (Wide Format)\n", + "\n", + "Direct query on the wide format file \u2014 latitude, longitude, and H3 cells are\n", + "pre-joined columns. No traversal needed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "wide-h3-code", + "metadata": {}, + "outputs": [], + "source": [ + "import h3\n", + "\n", + "# H3-accelerated spatial query on wide format\n", + "\n", + "# Step 1: Compute covering res4 cells mathematically (no data scan).\n", + "# This uses the h3 Python library \u2014 pure geometry, O(1) relative to data size.\n", + "t0 = time.time()\n", + "bbox_polygon = h3.LatLngPoly([\n", + " (BBOX['min_lat'], BBOX['min_lon']),\n", + " (BBOX['min_lat'], BBOX['max_lon']),\n", + " (BBOX['max_lat'], BBOX['max_lon']),\n", + " (BBOX['max_lat'], BBOX['min_lon']),\n", + "])\n", + "covering = h3.geo_to_cells(bbox_polygon, res=4)\n", + "# Convert to signed int64 to match DuckDB BIGINT storage\n", + "def h3_to_signed(cell_hex):\n", + " val = h3.str_to_int(cell_hex)\n", + " return val if val < 2**63 else val - 2**64\n", + "\n", + "cell_list = ', '.join(str(h3_to_signed(c)) for c in covering)\n", + "print(f'Bbox covered by {len(covering)} res4 cells')\n", + "\n", + "if not covering:\n", + " print('No H3 cells cover this bbox.')\n", + " result = 
None\n", + " h3_ms = 0\n", + "else:\n", + " # Step 2: Query with H3 pre-filter + exact bbox\n", + " result = con_wide.sql(f\"\"\"\n", + " SELECT\n", + " n AS source,\n", + " COUNT(*) AS sample_count,\n", + " COUNT(DISTINCT label) AS distinct_labels\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND h3_res4 IN ({cell_list})\n", + " AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n", + " AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n", + " GROUP BY n\n", + " ORDER BY sample_count DESC\n", + " \"\"\").df()\n", + " h3_ms = (time.time() - t0) * 1000\n", + "\n", + " print(f'H3 spatial shortcut: {h3_ms:.0f} ms')\n", + " print(f'\\nSamples in the Bay Area ({BBOX}):')\n", + " print(result.to_string(index=False))\n", + " print(f'\\nTotal samples: {result[\"sample_count\"].sum():,}')\n" + ] + }, + { + "cell_type": "markdown", + "id": "wide-comparison-table", + "metadata": {}, + "source": [ + "### When to Use Each Approach\n", + "\n", + "| Query Type | Narrow (Graph) | Wide + H3 |\n", + "|-----------|---------------|----------|\n", + "| **Provenance chains** (who collected what, when) | Best \u2014 follow typed edges | Not available |\n", + "| **Spatial filtering** (samples in a bbox) | Slow \u2014 multi-hop traversal | Fast \u2014 single scan with H3 pre-filter |\n", + "| **Regional aggregation** (count by area) | Very slow \u2014 join 3 tables | Fast \u2014 GROUP BY h3_res4/6/8 |\n", + "| **Schema exploration** (edge types, patterns) | Best \u2014 typed edge API | N/A |\n", + "| **Full-text search + location** | Requires custom joins | Single WHERE clause |\n", + "| **Material/context at a location** | Multi-hop + filter | Single query with all columns |\n", + "\n", + "**Rule of thumb**: Use the narrow graph format when you need to understand *relationships*\n", + "(provenance, attribution, classification chains). Use the wide + H3 format when you need\n", + "to *filter and aggregate spatially*." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "wide-timing-comparison", + "metadata": {}, + "outputs": [], + "source": [ + "# Side-by-side timing: lat/lon scan vs H3 pre-filter on wide format\n", + "\n", + "# Baseline: raw lat/lon scan (same filters as H3 query)\n", + "t0 = time.time()\n", + "baseline = con_wide.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n", + " AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n", + "\"\"\").fetchone()[0]\n", + "baseline_ms = (time.time() - t0) * 1000\n", + "\n", + "# H3 pre-filter + exact bbox (reuse covering cells from above)\n", + "if not covering:\n", + " print('No covering cells \u2014 skipping H3 benchmark')\n", + "else:\n", + " t0 = time.time()\n", + " h3_count = con_wide.sql(f\"\"\"\n", + " SELECT COUNT(*) AS n\n", + " FROM read_parquet('{WIDE_H3_URL}')\n", + " WHERE otype = 'MaterialSampleRecord'\n", + " AND h3_res4 IN ({cell_list})\n", + " AND latitude BETWEEN {BBOX['min_lat']} AND {BBOX['max_lat']}\n", + " AND longitude BETWEEN {BBOX['min_lon']} AND {BBOX['max_lon']}\n", + " \"\"\").fetchone()[0]\n", + " h3_fast_ms = (time.time() - t0) * 1000\n", + "\n", + " print('Timing Comparison: Bay Area Bounding Box')\n", + " print('=' * 50)\n", + " print(f'Baseline (lat/lon scan): {baseline_ms:>8.0f} ms | {baseline:,} rows')\n", + " print(f'H3 res4 pre-filter: {h3_fast_ms:>8.0f} ms | {h3_count:,} rows')\n", + " speedup = baseline_ms / h3_fast_ms if h3_fast_ms > 0 else float('inf')\n", + " print(f'Speedup: {speedup:>7.1f}x')\n", + " print(f'\\nRow counts match: {baseline == h3_count}')\n" + ] } ], "metadata": { @@ -854,4 +1078,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +}
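The `h3_to_signed` helper in the wide-format cells relies on a plain two's-complement reinterpretation between unsigned 64-bit H3 indices and the signed BIGINT values DuckDB stores. The mapping can be sketched independently of the `h3` library; the function names here are illustrative, not part of either API:

```python
def to_signed_int64(val: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed BIGINT (two's complement)."""
    return val if val < 2**63 else val - 2**64


def to_unsigned_int64(val: int) -> int:
    """Inverse mapping: recover the unsigned 64-bit value from its signed form."""
    return val if val >= 0 else val + 2**64


# The mapping is a bijection on the full 64-bit range, so a round trip
# always recovers the original unsigned value.
for v in (0, 1, 2**63 - 1, 2**63, 2**64 - 1):
    assert to_unsigned_int64(to_signed_int64(v)) == v
```

Only values with the high bit set change sign; everything below `2**63` passes through unchanged, which is why the `IN (...)` pre-filter in the queries above can compare the converted Python integers directly against the `h3_res4` BIGINT column.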
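The rollup logic buried inside `compute_material_hierarchy` (the nested `calc_rollup` closure) can be lifted into a standalone recursive helper operating on the same `(name, key, level, children)` tuples the notebook uses. `rollup_count` is an illustrative name and the counts below are toy values, not figures from the iSamples data:

```python
def rollup_count(counts: dict, node: tuple) -> int:
    """Count for a node plus all of its descendants in the hierarchy."""
    name, key, level, children = node
    return counts.get(key, 0) + sum(rollup_count(counts, c) for c in children)


# Toy subtree mirroring the notebook's hierarchy structure, with assumed counts
counts = {"rock": 5, "soil": 3, "earthmaterial": 2}
earth = ("Earth Material", "earthmaterial", 2, [
    ("Rock", "rock", 3, []),
    ("Soil", "soil", 3, []),
])

print(rollup_count(counts, earth))  # 2 + 5 + 3 = 10
```

Hoisting the recursion out of the print loop also avoids redefining the closure on every node visit, without changing the numbers the notebook reports.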