mcp: add data_url field to _table_catalog to enable mechanical table refresh #60

@StefanSteiner

Summary

The _table_catalog source_url field currently stores a human-readable
page URL (e.g. https://ourworldindata.org/grapher/population). There is
no structured field for the actual data file URL, so an automated
"refresh this table from its source" operation cannot be driven purely from
catalog metadata: the download URL has to be parsed out of free-text
notes or re-entered by hand.

Proposed change

Add a new data_url field to _table_catalog and expose it as a
parameter in set_table_metadata. This is additive and non-breaking:
source_url continues to hold the human-readable page/reference URL,
source_description is unchanged, and data_url carries the machine-
actionable download URL.

Example for an OWID table:

Field               Value
source_url          https://ourworldindata.org/grapher/population (chart page, unchanged)
source_description  "OWID historical population, 10 000 BCE – 2023…" (unchanged)
data_url            https://ourworldindata.org/grapher/population.csv?v=1&csvType=full&useColumnShortNames=true ← new
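
To illustrate why the change is additive, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for _table_catalog. The column list and the use of SQLite are assumptions made for illustration only; the real catalog schema and storage engine may differ.

```python
import sqlite3

# Stand-in for the real catalog: an in-memory SQLite table holding only the
# fields named in this issue (assumption: the real schema has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _table_catalog ("
    " table_name TEXT PRIMARY KEY,"
    " source_url TEXT,"
    " source_description TEXT)"
)
conn.execute(
    "INSERT INTO _table_catalog VALUES (?, ?, ?)",
    (
        "owid_population",
        "https://ourworldindata.org/grapher/population",
        "OWID historical population, 10 000 BCE – 2023…",
    ),
)

# Additive change: a new nullable column. Existing rows simply get NULL.
conn.execute("ALTER TABLE _table_catalog ADD COLUMN data_url TEXT")

# Record the download URL for one table; source_url is left untouched.
conn.execute(
    "UPDATE _table_catalog SET data_url = ? WHERE table_name = ?",
    (
        "https://ourworldindata.org/grapher/population.csv"
        "?v=1&csvType=full&useColumnShortNames=true",
        "owid_population",
    ),
)

row = conn.execute(
    "SELECT source_url, data_url FROM _table_catalog WHERE table_name = ?",
    ("owid_population",),
).fetchone()
```

Readers that only select source_url or source_description keep working unchanged, which is the sense in which the new field is non-breaking.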

Why this matters — mechanical refresh

With data_url and the existing load_params (which already records
mode, schema, merge_key, database, etc.), a refresh becomes fully
mechanical with no prose parsing:

1. SELECT data_url, load_params
   FROM _table_catalog
   WHERE table_name = 'owid_population'

2. Infer format from data_url extension (.csv / .parquet / .json / etc.)

3. Download data_url → temp file

4. load_file(path=<temp_file>, **load_params)
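
The four steps above can be sketched roughly as follows. This is not the proposed implementation: it assumes the catalog is reachable through a DB-API style connection (sqlite3 is used as a stand-in), that load_params is stored as JSON text, and that the download and load steps are passed in as the hypothetical callables fetch and load_file.

```python
import json
import os
import sqlite3
import tempfile
from typing import Callable
from urllib.parse import urlparse


def refresh_table(
    conn: sqlite3.Connection,
    table_name: str,
    fetch: Callable[[str], bytes],
    load_file: Callable[..., None],
) -> str:
    """Re-ingest one table from catalog metadata; returns the URL's extension."""
    # Step 1: read data_url and load_params from the catalog.
    row = conn.execute(
        "SELECT data_url, load_params FROM _table_catalog WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    if row is None or row[0] is None:
        raise LookupError(f"no data_url recorded for {table_name!r}")
    data_url = row[0]
    load_params = json.loads(row[1] or "{}")  # assumption: JSON-encoded text

    # Step 2: take the extension from the URL path, ignoring the query string.
    suffix = os.path.splitext(urlparse(data_url).path)[1].lower()

    # Step 3: download to a temp file that keeps the extension.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(fetch(data_url))
        path = tmp.name

    # Step 4: hand the file to the loader with the recorded parameters.
    try:
        load_file(path=path, **load_params)
    finally:
        os.remove(path)
    return suffix.lstrip(".")
```

Passing fetch and load_file in as arguments keeps the sketch independent of any particular HTTP client or of the real load_file signature; a caller could supply, for example, `lambda url: urllib.request.urlopen(url).read()` as fetch.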

This would enable:

  • A refresh_table tool that re-ingests any table from its source in one
    call.
  • Bulk refresh of all tables that have a data_url set:
    SELECT table_name FROM _table_catalog WHERE data_url IS NOT NULL.
  • Scheduled refresh without any human-maintained refresh scripts.
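
A bulk refresh could then be a thin loop over the query above. The sketch below assumes a DB-API style connection (sqlite3 as a stand-in) and a hypothetical refresh_one callback playing the role of the future refresh_table tool; the choice to collect errors per table rather than abort the batch is also an assumption.

```python
import sqlite3
from typing import Callable, Dict, Optional


def refresh_all(
    conn: sqlite3.Connection,
    refresh_one: Callable[[str], None],
) -> Dict[str, Optional[str]]:
    """Refresh every table that has a data_url.

    Returns a mapping of table name to None on success or the error text
    on failure.
    """
    names = [
        r[0]
        for r in conn.execute(
            "SELECT table_name FROM _table_catalog "
            "WHERE data_url IS NOT NULL ORDER BY table_name"
        )
    ]
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            refresh_one(name)
            results[name] = None
        except Exception as exc:
            # One unreachable source should not stop the rest of the batch.
            results[name] = str(exc)
    return results
```

A scheduler would only need to call refresh_all periodically and report the non-None entries.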

Format inference rule

Infer from the last path segment's extension before any query string
— the same rule load_file already uses for local paths:

URL ending        Format
.csv              CSV
.parquet          Parquet
.json / .jsonl    JSON / NDJSON
.arrow / .ipc     Arrow IPC

For URLs where the extension is absent or ambiguous (e.g. a presigned S3
URL with a UUID key), a source_format hint in load_params or an
additional data_format field could carry the override.
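
A minimal sketch of this rule, including the override, might look like the following. The format name strings, the pairing of .json with JSON and .jsonl with NDJSON, and the decision to let a source_format key in load_params take precedence are assumptions; load_file's actual inference code is not shown in this issue.

```python
import posixpath
from typing import Mapping, Optional
from urllib.parse import urlparse

# Extension table from this issue; the format name strings are hypothetical.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "ndjson",
    ".arrow": "arrow",
    ".ipc": "arrow",
}


def infer_format(
    data_url: str,
    load_params: Optional[Mapping[str, object]] = None,
) -> str:
    """Return the file format for data_url, honouring an explicit override."""
    # An explicit hint wins, which covers URLs with no usable extension.
    override = (load_params or {}).get("source_format")
    if override:
        return str(override)

    # urlparse separates the query string, so only the path is inspected.
    last_segment = posixpath.basename(urlparse(data_url).path)
    ext = posixpath.splitext(last_segment)[1].lower()
    if ext not in EXTENSION_FORMATS:
        raise ValueError(
            f"cannot infer format from {last_segment!r}; set source_format"
        )
    return EXTENSION_FORMATS[ext]
```

With this sketch, the OWID URL from the example resolves to CSV despite its query string, while an extensionless presigned URL raises unless the hint is supplied.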

Related issue

This pairs with the rename-drops-metadata issue: both are about making
_table_catalog fully machine-actionable rather than just human-readable.
A refresh_table tool (future) would depend on both fixes — metadata
surviving renames, and a structured data_url to fetch from.

Environment

  • hyper-rust-api version: 0.2.3.re93d08d2
  • Observed while loading 17 OWID datasets into the persistent database
    and noting that refresh requires prose parsing of notes.

Labels

enhancement