[Optimization 2/4] Summary-first facets in isamples_explorer.ipynb #3

@rdhyee

Description

Priority: 5

Context

The interactive Jupyter explorer re-scans 6.7M rows on every widget interaction, which takes 3-8 s each time. Pre-computed summaries (a 2 KB file) can provide facet counts almost instantly.

Data Files on R2

File | URL | Size
Facet summaries | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_summaries.parquet | 2 KB
Facet cross-tab | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_cross.parquet | 1 KB
Wide + H3 | https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet | 292 MB

Facet summaries schema: facet_type (source/material/context/object_type), facet_value, scheme, count

File to Modify

examples/basic/isamples_explorer.ipynb

Current Behavior

Uses ipywidgets + Lonboard. Every facet/filter interaction re-queries the full parquet:

# Slow: scans full file every time
results = con.sql(f"SELECT ... FROM read_parquet('{url}') WHERE ...").df()

Desired Changes

1. Load summaries at startup

import duckdb

con = duckdb.connect()
summaries_url = "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_facet_summaries.parquet"

# Instant: 2KB download
facets = con.sql(f"SELECT * FROM read_parquet('{summaries_url}')").df()

source_options = facets[facets.facet_type == 'source'][['facet_value', 'count']].to_dict('records')
material_options = facets[facets.facet_type == 'material'][['facet_value', 'count']].to_dict('records')
# etc.
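
The `# etc.` above could be generalized so that all four facet types are handled in one pass. The following is a minimal sketch, not the notebook's actual code: `group_facets` is a hypothetical helper, and the sample rows are made-up placeholders standing in for `facets.to_dict('records')`.

```python
from collections import defaultdict


def group_facets(rows):
    """Group summary rows by facet_type, largest counts first.

    `rows` is assumed to be a list of dicts with the keys from the
    facet summaries schema: facet_type, facet_value, scheme, count.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["facet_type"]].append(
            {"facet_value": row["facet_value"], "count": int(row["count"])}
        )
    # Sort each facet's values so the most common ones appear first.
    return {
        facet_type: sorted(values, key=lambda r: r["count"], reverse=True)
        for facet_type, values in grouped.items()
    }


# Placeholder rows for illustration only (not real iSamples counts).
# In the notebook this would be: facets.to_dict('records')
sample_rows = [
    {"facet_type": "source", "facet_value": "example_source_a", "scheme": None, "count": 10},
    {"facet_type": "source", "facet_value": "example_source_b", "scheme": None, "count": 25},
    {"facet_type": "material", "facet_value": "example_material", "scheme": None, "count": 7},
]

options_by_facet = group_facets(sample_rows)
source_options = options_by_facet.get("source", [])
material_options = options_by_facet.get("material", [])
```

Using `.get(..., [])` keeps the notebook from failing if a facet type is missing from the summary file.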

2. Populate widgets from summaries

import ipywidgets as widgets

source_dropdown = widgets.Dropdown(
    options=[(f"{r['facet_value']} ({r['count']:,})", r['facet_value']) for r in source_options],
    description='Source:'
)

material_dropdown = widgets.Dropdown(
    options=[('All', None)] + [(f"{r['facet_value']} ({r['count']:,})", r['facet_value']) for r in material_options],
    description='Material:'
)
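
Since the same label formatting is repeated for each dropdown, it could be factored into a small helper. This is a sketch under the assumption that each record has `facet_value` and `count` keys as in step 1; `dropdown_options` and its `include_all` parameter are hypothetical names, and `"rock"` is a placeholder value.

```python
def dropdown_options(records, include_all=True):
    """Build (label, value) pairs suitable for widgets.Dropdown(options=...).

    Labels show the facet value with its count, e.g. "rock (1,234)".
    When include_all is True, an ("All", None) entry is placed first.
    """
    options = [
        (f"{r['facet_value']} ({r['count']:,})", r["facet_value"])
        for r in records
    ]
    if include_all:
        options.insert(0, ("All", None))
    return options


# Placeholder record for illustration only.
example = dropdown_options([{"facet_value": "rock", "count": 1234}])
print(example)  # [('All', None), ('rock (1,234)', 'rock')]
```

The result can be passed directly as `options=` to `widgets.Dropdown`, so context and object_type dropdowns can be added without duplicating the list comprehension.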

3. Only query full parquet on explicit "Search" action

Keep full parquet queries for the actual sample results table and map, but facet counts should come from summaries.
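
One way to defer the expensive query is to build it only inside a "Search" button handler. The sketch below covers just the query construction. It assumes the wide parquet has columns named `source`, `material`, `context`, and `object_type` (not confirmed by this issue); `build_search_query` and the `limit` parameter are hypothetical.

```python
# Assumed column names in the wide parquet; adjust to the real schema.
ALLOWED_COLUMNS = {"source", "material", "context", "object_type"}


def build_search_query(parquet_url, filters, limit=1000):
    """Return (sql, params) for a filtered scan of the full parquet.

    `filters` maps column name -> selected dropdown value; a value of
    None means "All" and adds no condition. Values are passed as bound
    parameters rather than interpolated into the SQL string.
    """
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unexpected filter column: {column}")
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = f"SELECT * FROM read_parquet('{parquet_url}'){where} LIMIT {int(limit)}"
    return sql, params


sql, params = build_search_query(
    "https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202601_wide_h3.parquet",
    {"source": "example_source_a", "material": None},
)

# In the notebook, the handler attached with search_button.on_click(...)
# would then run something like:
#     results = con.execute(sql, params).df()
```

Because nothing here runs until the handler fires, changing a dropdown only updates widget state and never touches the 292 MB file.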

Acceptance Criteria

  • Widgets populated from 2KB summary file (< 1s)
  • Material, context, and object_type facets now available (new!)
  • Full parquet only queried when user searches/filters for records
  • Lonboard map still works
  • Notebook runs end-to-end
