Skip to content

[Optimization 4/4] H3 spatial joins in pqg_demo.ipynb #5

@rdhyee

Description

@rdhyee

Priority: 8 (lowest — tutorial enhancement)

Context

The PQG demo notebook shows multi-hop graph traversals for spatial queries. With H3 columns pre-computed in the wide format, we can demonstrate spatial aggregation without traversal — a significant simplification.

Data Files on R2

File URL Size
Wide + H3 https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet 292 MB
Facet summaries https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_summaries.parquet 2 KB

File to Modify

examples/basic/pqg_demo.ipynb

Current Behavior

Demonstrates multi-hop traversals for spatial queries:

Sample → produced_by → SamplingEvent → sample_location → GeospatialCoordLocation

This is correct for the narrow (graph) format but verbose.

Desired Changes

1. Add "Wide format shortcut" section

After the existing graph traversal section, add a new section showing the equivalent query using the wide format with H3:

import duckdb

con = duckdb.connect()
wide_h3_url = "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet"

# "What materials are found in the Bay Area?"
# With H3: single query, no traversal needed
bay_area_samples = con.sql(f"""
    WITH bay_area AS (
        SELECT DISTINCT h3_res6
        FROM read_parquet('{wide_h3_url}')
        WHERE otype = 'MaterialSampleRecord'
        AND latitude BETWEEN 37.0 AND 38.5 AND longitude BETWEEN -123.0 AND -121.5
    )
    SELECT
        c.label as material,
        COUNT(*) as count
    FROM read_parquet('{wide_h3_url}') w
    JOIN read_parquet('{wide_h3_url}') c
        ON c.row_id = ANY(w.p__has_material_category)
    WHERE w.h3_res6 IN (SELECT h3_res6 FROM bay_area)
    AND w.otype = 'MaterialSampleRecord'
    AND c.otype = 'IdentifiedConcept'
    GROUP BY c.label
    ORDER BY count DESC
""").df()

2. Add comparison table

Show when to use graph traversal (narrow) vs H3 shortcut (wide):

Use Case Best Format Why
"Show me the full provenance chain" Narrow (graph) Need explicit edge traversal
"What materials are in this area?" Wide + H3 Single query, no joins
"How many samples per source per region?" Wide + H3 Aggregation with spatial filter

3. Performance comparison

Time the graph traversal vs the H3 shortcut for the same spatial query.

Acceptance Criteria

  • New section demonstrating H3 spatial shortcut
  • Side-by-side with existing graph traversal
  • Comparison table: when to use which approach
  • Performance timing for both approaches
  • Notebook runs end-to-end

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions