
docs: surface the declarative ingestor flow from top-level docs#144

Merged
saadqbal merged 2 commits into develop from docs/143-ingest-discovery
May 20, 2026

Conversation

@saadqbal
Contributor

@saadqbal saadqbal commented May 20, 2026

Summary

The declarative ingestor went live with client chart v1.3.5 and the data-ingestors v0.3.0 image, but the top-level docs didn't mention it. A customer landing on this repo had no discovery path to ingestor/README.md: README.md and docs/INSTALL.md had zero mentions of "ingest".

This PR fixes the discovery problem and rolls in two stale-docs fixes I tripped over while in the files.

Changes

  1. README.md — new "Ingest a dataset" section between Deploy and Links. Shows the canonical command + a sample ingest.yaml. Two new rows in the topic table:

    • Ingest a dataset (declarative YAML) → ingestor/README.md
    • Available ingestion categories + example YAMLs → tracebloc/data-ingestors templates
  2. docs/INSTALL.md — new "Next: ingest your first dataset" continuation section at the end, so operators finishing the parent chart install land on the obvious next step.

  3. Drive-by fixes (caught while in the files; mentioned in ticket #143, "docs: add 'Ingest a dataset' discovery path from top-level README + INSTALL"):

    • README.md: "currently v1.3.1" → "currently v1.3.5"
    • README.md + INSTALL.md: helm install ... tracebloc/tracebloc → tracebloc/client (the actual chart name in the published helm repo)
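
The chart-name fix above lends itself to a scripted check. What follows is a minimal sketch, not something this PR ships: the helper name `find_stale_chart_refs` and the sample release name `my-release` are hypothetical, and only the two chart names come from the description.

```python
STALE_CHART = "tracebloc/tracebloc"  # old chart reference replaced by this PR


def find_stale_chart_refs(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for helm install lines naming the old chart."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Treat markdown backticks as whitespace so inline code is tokenized too.
        tokens = line.replace("`", " ").split()
        if "helm" in tokens and "install" in tokens and STALE_CHART in tokens:
            hits.append((number, line.strip()))
    return hits


# Hypothetical doc excerpt; "my-release" is a placeholder release name.
sample = "\n".join([
    "helm install my-release tracebloc/tracebloc",
    "helm install my-release tracebloc/client",
])
for number, line in find_stale_chart_refs(sample):
    print(f"line {number}: {line}")
```

Matching on whole tokens rather than substrings keeps the check from flagging unrelated mentions of the organisation or repository name.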

Test plan

  • Both files render correctly in markdown preview
  • All helm install ... tracebloc/client ... commands match the actual chart name (helm search repo tracebloc confirms tracebloc/client and tracebloc/ingestor)
  • Sample ingest.yaml snippet matches the schema (apiVersion, kind, category, table, intent, csv, images, label)
  • Internal markdown links (ingestor/README.md, docs/INSTALL.md) resolve
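
The schema item in the test plan could be approximated with a key-presence check. This is a rough sketch under two assumptions: that all eight names listed above are top-level keys of ingest.yaml, and that the file has already been parsed into a dict. The values are placeholders, and the real schema (ingest.v1.json) is likely to enforce more than key presence.

```python
REQUIRED_KEYS = (
    "apiVersion", "kind", "category", "table",
    "intent", "csv", "images", "label",
)


def missing_keys(config: dict) -> list[str]:
    """Return the required keys absent from a parsed ingest.yaml."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical parsed config: key names come from the test plan,
# every value is a placeholder rather than a real setting.
sample = dict.fromkeys(REQUIRED_KEYS, "<placeholder>")
print(missing_keys(sample))
```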

Out of scope (separate PR)

tracebloc/data-ingestors/Readme.md and the 10 template READMEs still describe the legacy Python-script-based pattern as primary. That's a separate PR against tracebloc/data-ingestors; until it lands, a customer landing on PyPI or the data-ingestors repo would still get confused. Filing the ticket + PR there next.

Closes

#143

🤖 Generated with Claude Code


Note

Low Risk
Low risk documentation-only changes; main impact is correcting Helm chart names/versions in install commands, which could affect copy/paste installs if incorrect.

Overview
Surfaces the declarative dataset ingestion workflow by adding an "Ingest a dataset" section to README.md and a "Next: ingest your first dataset" follow-up section to docs/INSTALL.md, including an example IngestConfig YAML and helm install tracebloc/ingestor command.

Also fixes stale install guidance by updating Helm commands from tracebloc/tracebloc to tracebloc/client and bumping the referenced chart version to v1.3.5, plus expanding ingestor/README.md with explicit shared-PVC data staging instructions (including a kubectl cp/pvc-shell pattern).

Reviewed by Cursor Bugbot for commit 57d773e.

The new declarative ingestor (tracebloc/ingestor 0.2.0, the
data-ingestors v0.3.0 image) is live and the customer-facing install
command works end-to-end, but a customer landing on this repo has no
discovery path — README.md and docs/INSTALL.md don't mention
"ingest" at all.

This commit:

1. Adds an "Ingest a dataset" section to README.md between Deploy and
   Links, with the canonical helm install command + a sample
   ingest.yaml + a link to ingestor/README.md.
2. Adds a "Next: ingest your first dataset" continuation section at
   the end of docs/INSTALL.md, so operators finishing the parent
   chart install land on the obvious next step.
3. Adds two new rows to the topic table pointing at ingestor/README.md
   and the data-ingestors per-category templates.
4. Fixes stale "currently v1.3.1" → v1.3.5.
5. Fixes stale `helm install ... tracebloc/tracebloc` → tracebloc/client
   (the chart's actual name in the helm repo; see
   `helm search repo tracebloc`).

Closes #143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka
Contributor

👋 Heads-up — Code review queue is at 19 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…raming

A reviewer flagged that "ingest a cats-vs-dogs dataset already staged
on the shared PVC at /data/shared/cats-dogs/" is factually wrong — the
chart doesn't transport data into the cluster, customers stage it
themselves before running helm install. The docs glossed over that
precondition entirely.

Three changes:

1. ingestor/README.md (the canonical chart doc) gets a new section
   "Stage your data on the shared PVC" between Prerequisites and
   "What this chart owns". Explains the PVC name (client-pvc), the
   in-pod mount path (/data/shared/), and walks through the
   kubectl cp + pvc-shell pod pattern with a copy-pasteable manifest.
   Mentions the init-container-with-cloud-sync pattern for production.

2. README.md "Ingest a dataset" section now opens with the staging
   prerequisite as the first of two steps. The cats-dogs example is
   still there but no longer pretends the data magically exists.

3. docs/INSTALL.md "Next: ingest" section adds the same staging
   prerequisite paragraph + reworks the cats-dogs line from
   "already staged on the shared PVC at /data/shared/cats-dogs/" to
   "once you've staged a ... dataset under /data/shared/cats-dogs/".

Same PR (#144) since the discovery sections in README.md and
INSTALL.md are pointing at ingestor/README.md — coupling them keeps
the whole story self-consistent in a single review.
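
To make the staging pattern described above concrete, here is a hedged sketch of what a pvc-shell Pod manifest might look like, generated as JSON. The PVC name (client-pvc), the mount path (/data/shared) and the Pod name (pvc-shell) come from this commit message. The `busybox` image, the container and volume names, and the `sleep` command are assumptions, and the manifest in ingestor/README.md may differ.

```python
import json


def pvc_shell_pod(claim_name: str = "client-pvc",
                  mount_path: str = "/data/shared",
                  image: str = "busybox") -> dict:
    """Build a Pod manifest that mounts the shared PVC and idles."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "pvc-shell"},
        "spec": {
            "containers": [{
                "name": "shell",  # hypothetical container name
                "image": image,   # assumption: any image providing sleep and tar
                "command": ["sleep", "3600"],  # keep the Pod alive for copying
                "volumeMounts": [{"name": "shared", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "shared",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


print(json.dumps(pvc_shell_pod(), indent=2))
```

The JSON output can be piped to `kubectl apply -f -` in the namespace that holds the PVC. Once the Pod is running, something like `kubectl cp ./cats-dogs pvc-shell:/data/shared/cats-dogs` would copy a local directory onto the volume (the local path is a placeholder).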

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
A reviewer flagged that phrases like "Stage the data on the PVC at
/data/shared/<prefix>/" implied the chart could put data there. It
can't — customers stage data themselves, the chart only reads from
wherever the cluster's shared PVC is already mounted.

Adds an explicit "**Prerequisite:** the chart doesn't transport data
into the cluster" callout at the top of each template's Quickstart,
pointing at the staging recipe in tracebloc/client/ingestor/README.md
(which a parallel PR — tracebloc/client#144 — adds with a working
kubectl cp / pvc-shell Pod manifest and a production-pattern note).

Also re-frames the top-level Readme.md's Quickstart to introduce
staging as step 2 of the customer flow (between "add the chart repo"
and "write your ingest.yaml") rather than glossing past it.

Touches:
- Readme.md (top-level — staging step added explicitly)
- All 10 template READMEs (Prerequisite callout in each Quickstart)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this May 20, 2026
@saadqbal saadqbal merged commit 28e9009 into develop May 20, 2026
1 check passed
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
* docs(#110): lead with declarative YAML; templates point at helm chart

The README and 10 template READMEs described the legacy Python+Dockerfile
pattern as primary, but the declarative YAML path (via
`helm install tracebloc/ingestor`) is now the recommended customer
flow — chart published to tracebloc.github.io/client as of v1.3.5,
validated end-to-end on EKS, 8 lines of YAML per dataset.

Changes:

1. `Readme.md` — re-frame Quickstart to lead with the declarative path:
   `helm repo add tracebloc ...` → write 8 lines of YAML →
   `helm install tracebloc/ingestor`. Move the Python+Docker pattern to
   an "Advanced: custom processors (legacy Python pattern)" section.
   Refresh the Supported data types table — was missing keypoint,
   semantic_segmentation, masked_language_modeling.

2. Per-category template READMEs (10 of them: image_classification,
   object_detection, keypoint_detection, semantic_segmentation,
   text_classification, masked_language_modeling, tabular_classification,
   tabular_regression, time_series_forecasting, time_to_event_prediction)
   — add a "Quickstart — declarative (recommended)" section at the top
   referencing examples/yaml/<category>.yaml with a copy-pasteable
   ingest.yaml + helm install command. The existing Python-script Usage
   stays as-is for the custom-processor case.

3. In image_classification and object_detection, also rename "Usage" →
   "Advanced: custom processor script" so the heading itself signals
   primary vs advanced. The remaining 8 templates keep "Usage" as-is for
   now — the Quickstart section at the top is the strong discovery
   signal; relabeling the Usage headers in those 8 is hygiene we can do
   as a follow-up.

A customer landing on PyPI's `tracebloc-ingestor` page, the repo
root, or any individual template's directory now sees the declarative
path as the recommended option, with a copy-pasteable YAML + helm
install command, before any Python-script content.

Closes #110.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify the chart doesn't stage data; link to staging recipe

A reviewer flagged that phrases like "Stage the data on the PVC at
/data/shared/<prefix>/" implied the chart could put data there. It
can't — customers stage data themselves, the chart only reads from
wherever the cluster's shared PVC is already mounted.

Adds an explicit "**Prerequisite:** the chart doesn't transport data
into the cluster" callout at the top of each template's Quickstart,
pointing at the staging recipe in tracebloc/client/ingestor/README.md
(which a parallel PR — tracebloc/client#144 — adds with a working
kubectl cp / pvc-shell Pod manifest and a production-pattern note).

Also re-frames the top-level Readme.md's Quickstart to introduce
staging as step 2 of the customer flow (between "add the chart repo"
and "write your ingest.yaml") rather than glossing past it.

Touches:
- Readme.md (top-level — staging step added explicitly)
- All 10 template READMEs (Prerequisite callout in each Quickstart)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request May 20, 2026
docs: surface the declarative ingestor flow from top-level docs
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
* Add release-time smoke test for ingestor image (#96)

* ci(#95): smoke-test ingestor image and sdist before publishing

v0.3.0-rc1 of ghcr.io/tracebloc/ingestor shipped without the bundled
schema/ingest.v1.json because tracebloc_ingestor/schema/ had no
__init__.py and find_packages() silently dropped it. docker build,
cosign, and SBOM all passed; the bug only surfaced at customer-side
first use. Packaging fix lands separately; this is the CI guardrail
so a green release can never again ship without the schema.
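
The failure mode described above can be reproduced in miniature. The sketch below is a simplified imitation of the rule `find_packages()` applies (a directory counts as a package only if it contains `__init__.py`), not the setuptools implementation. The temporary tree mirrors the paths named in this message. Adding the missing file is shown only to illustrate the rule, since the actual packaging fix landed separately.

```python
import os
import tempfile


def packages_with_init(root: str) -> list[str]:
    """List dotted package names under root, keeping only dirs with __init__.py."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue
        if "__init__.py" not in filenames:
            dirnames[:] = []  # do not descend into a non-package directory
            continue
        found.append(os.path.relpath(dirpath, root).replace(os.sep, "."))
    return sorted(found)


def _touch(path: str) -> None:
    """Create an empty file at path."""
    open(path, "w").close()


def demo() -> tuple[list[str], list[str]]:
    """Return discovered packages before and after schema/ gets an __init__.py."""
    with tempfile.TemporaryDirectory() as root:
        schema_dir = os.path.join(root, "tracebloc_ingestor", "schema")
        os.makedirs(schema_dir)
        _touch(os.path.join(root, "tracebloc_ingestor", "__init__.py"))
        _touch(os.path.join(schema_dir, "ingest.v1.json"))
        before = packages_with_init(root)
        _touch(os.path.join(schema_dir, "__init__.py"))
        after = packages_with_init(root)
    return before, after


before, after = demo()
print("before:", before)
print("after: ", after)
```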

release-image.yml: after `docker buildx build --push`, before cosign
signs anything, run two probes against the built digest inside the
image: `_load_schema()` (the exact path that crashed in v0.3.0-rc1)
and an invocation of the tracebloc-ingest console script. The CLI
has no --help; main() exits with a friendly INGEST_CONFIG error
when called bare, which proves the console_scripts metadata was
installed and the wrapper can import main() — broken entrypoint
wiring or an import-time failure produces a different output and
fails the grep. Both probes run before cosign/release-notes so a
regression aborts cleanly.
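
The console-script probe amounts to running the command bare and searching its output for a known marker. Below is a minimal Python sketch of that idea. The workflow itself uses a shell grep, the exact error text is not shown in this thread, and the two stand-in commands are hypothetical substitutes for the real tracebloc-ingest script.

```python
import subprocess
import sys

EXPECTED_MARKER = "INGEST_CONFIG"  # substring the friendly error is expected to contain


def cli_probe_passes(argv: list[str], marker: str = EXPECTED_MARKER) -> bool:
    """Run argv and report whether its combined output mentions the marker."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return marker in (result.stdout + result.stderr)


# Stand-ins for the console script: one exits with a friendly error,
# the other fails at import time, as broken entrypoint wiring would.
healthy = [sys.executable, "-c", "import sys; sys.exit('error: INGEST_CONFIG is not set')"]
broken = [sys.executable, "-c", "import module_that_does_not_exist"]

print("healthy passes:", cli_probe_passes(healthy))
print("broken passes: ", cli_probe_passes(broken))
```

The exit code is deliberately ignored: the bare invocation is expected to fail, so only the content of the output distinguishes a healthy install from a broken one.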

publish-{master,dev}.yml: sdists go through MANIFEST.in, a separate
packaging code path from the wheel's package_data, and the schema
bug surfaced there too. Replace the bare `import tracebloc_ingestor`
post-install check with the same `_load_schema()` probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(#95): bypass docker-entrypoint.sh in smoke step with --entrypoint

The image's ENTRYPOINT is docker-entrypoint.sh, which exits 64 when
MYSQL_HOST is unset — before the shell ever reaches `exec tracebloc-
ingest`. Without --entrypoint, both `docker run` calls in the smoke
step hit the MYSQL_HOST guard, fail under `set -euo pipefail`, and
the workflow aborts every release.

Override with --entrypoint python / --entrypoint tracebloc-ingest so
the smoke probes hit the Python package directly. We're testing
packaging, not the MySQL-wait wrapper, so standing up a MySQL sidecar
for a 5-second sanity check is not worth it.
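
A small sketch of how such an override is laid out on the command line, assuming the usual `docker run` ordering: options before the image, arguments for the entrypoint after it. The helper name and the digest placeholder are hypothetical. The real step calls `_load_schema()`, whose import path is not shown in this thread, so a plain package import stands in for it here.

```python
def smoke_argv(image: str, entrypoint: str, *args: str) -> list[str]:
    """Build a docker run argv that bypasses the image's ENTRYPOINT."""
    # --entrypoint must come before the image; everything after the image
    # is passed as arguments to the overriding entrypoint.
    return ["docker", "run", "--rm", "--entrypoint", entrypoint, image, *args]


IMAGE = "ghcr.io/tracebloc/ingestor@sha256:<digest>"  # placeholder digest

# Probe 1: hit the Python package directly (stand-in for the schema load).
print(smoke_argv(IMAGE, "python", "-c", "import tracebloc_ingestor"))
# Probe 2: invoke the console script bare, with no arguments.
print(smoke_argv(IMAGE, "tracebloc-ingest"))
```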

Caught by Cursor Bugbot on PR #96.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: lead with declarative YAML; templates point at helm chart (#111)

* docs: point staging-recipe links at develop branch (not main) (#112)

The new "Stage your data on the shared PVC" section in
tracebloc/client/ingestor/README.md lives on PR #144's branch; it
will be on develop once #144 merges, and on main only after the
next develop→main sync. The links in PR #111 pointed at /blob/main/
which won't resolve to the right section until that sync runs.

Switch all 22 references from /blob/main/ to /blob/develop/ so the
anchor resolves as soon as PR #144 lands on develop (single click,
imminent). Develop is a stable enough ref long-term that the URLs
remain durable after the eventual sync.

If we later want to move back to /blob/main/ for the "stable URL"
aesthetic, that's a single bulk find-and-replace; not blocking.
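
A bulk rewrite of this kind can be scoped so that only links into tracebloc/client are touched. The sketch below assumes the links have the form github.com/tracebloc/client/blob/<ref>/...; the helper name and the sample text are hypothetical.

```python
import re

# Assumption: staging-recipe links look like
# https://github.com/tracebloc/client/blob/main/ingestor/README.md#...
LINK = re.compile(r"(github\.com/tracebloc/client/blob/)main/")


def retarget_links(text: str, ref: str = "develop") -> tuple[str, int]:
    """Point tracebloc/client blob links at ref; return (new_text, count)."""
    return LINK.subn(lambda match: match.group(1) + ref + "/", text)


sample = (
    "See https://github.com/tracebloc/client/blob/main/ingestor/README.md\n"
    "and https://github.com/example/other/blob/main/README.md\n"
)
updated, count = retarget_links(sample)
print(count)
print(updated)
```

Returning the replacement count makes it easy to compare against the expected number of references before committing the change.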

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal deleted the docs/143-ingest-discovery branch May 25, 2026 07:56
