docs: surface the declarative ingestor flow from top-level docs #144
Merged
The new declarative ingestor (tracebloc/ingestor 0.2.0, the data-ingestors v0.3.0 image) is live and the customer-facing install command works end-to-end, but a customer landing on this repo has no discovery path: README.md and docs/INSTALL.md don't mention "ingest" at all.

This commit:

1. Adds an "Ingest a dataset" section to README.md between Deploy and Links, with the canonical helm install command + a sample ingest.yaml + a link to ingestor/README.md.
2. Adds a "Next: ingest your first dataset" continuation section at the end of docs/INSTALL.md, so operators finishing the parent chart install land on the obvious next step.
3. Adds two new rows to the topic table pointing at ingestor/README.md and the data-ingestors per-category templates.
4. Fixes stale "currently v1.3.1" → v1.3.5.
5. Fixes stale `helm install ... tracebloc/tracebloc` → tracebloc/client (the chart's actual name in the helm repo; see `helm search repo tracebloc`).

Closes #143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
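The chart-name fix in item 5 can be checked mechanically. The following is a minimal sketch, assuming `helm search repo tracebloc -o json` prints a JSON array of objects that each carry a `name` field. The function name `missing_charts` and the sample payload are illustrative and not part of this PR.

```python
import json


def missing_charts(search_json: str, expected: set[str]) -> set[str]:
    """Return the expected chart names absent from `helm search repo ... -o json` output."""
    found = {entry["name"] for entry in json.loads(search_json)}
    return expected - found


# Hypothetical sample of what `helm search repo tracebloc -o json` might print.
sample = json.dumps([
    {"name": "tracebloc/client", "version": "1.3.5"},
    {"name": "tracebloc/ingestor", "version": "0.2.0"},
])

# Both published chart names are present, so nothing is missing.
print(missing_charts(sample, {"tracebloc/client", "tracebloc/ingestor"}))
# The stale name from the old docs is reported as missing.
print(missing_charts(sample, {"tracebloc/tracebloc"}))
```

In practice the JSON string would come from running the helm command itself; the sample above only stands in for that output.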
Contributor
👋 Heads-up: the Code review queue is at 19 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
…raming

A reviewer flagged that "ingest a cats-vs-dogs dataset already staged on the shared PVC at /data/shared/cats-dogs/" is factually wrong: the chart doesn't transport data into the cluster, customers stage it themselves before running helm install. The docs glossed over that precondition entirely.

Three changes:

1. ingestor/README.md (the canonical chart doc) gets a new section "Stage your data on the shared PVC" between Prerequisites and "What this chart owns". Explains the PVC name (client-pvc), the in-pod mount path (/data/shared/), and walks through the kubectl cp + pvc-shell pod pattern with a copy-pasteable manifest. Mentions the init-container-with-cloud-sync pattern for production.
2. README.md "Ingest a dataset" section now opens with the staging prerequisite as the first of two steps. The cats-dogs example is still there but no longer pretends the data magically exists.
3. docs/INSTALL.md "Next: ingest" section adds the same staging prerequisite paragraph + reworks the cats-dogs line from "already staged on the shared PVC at /data/shared/cats-dogs/" to "once you've staged a ... dataset under /data/shared/cats-dogs/".

Same PR (#144) since the discovery sections in README.md and INSTALL.md are pointing at ingestor/README.md; coupling them keeps the whole story self-consistent in a single review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
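The kubectl cp + pvc-shell pattern from item 1 could look roughly like the sketch below. The PVC name (client-pvc), the mount path (/data/shared/), and the Pod name (pvc-shell) come from the commit message; the container image, the `sleep` command, and the function name `pvc_shell_manifest` are assumptions, and the actual manifest in ingestor/README.md may differ.

```python
import json


def pvc_shell_manifest(pvc_name: str = "client-pvc", mount_path: str = "/data/shared") -> dict:
    """Build a throwaway Pod that mounts the shared PVC so files can be copied in."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "pvc-shell"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "shell",
                # Assumption: any small image that ships a shell and tar (kubectl cp needs tar).
                "image": "busybox:1.36",
                # Keep the Pod alive long enough to copy data in.
                "command": ["sleep", "3600"],
                "volumeMounts": [{"name": "shared", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "shared",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(pvc_shell_manifest(), indent=2))
```

Once the Pod is running in the same namespace as the PVC, something like `kubectl cp ./cats-dogs pvc-shell:/data/shared/cats-dogs` would stage the dataset, and `kubectl delete pod pvc-shell` would clean up afterwards. The local directory name here is a placeholder.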
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
A reviewer flagged that phrases like "Stage the data on the PVC at /data/shared/<prefix>/" implied the chart could put data there. It can't: customers stage data themselves, the chart only reads from wherever the cluster's shared PVC is already mounted.

Adds an explicit "**Prerequisite:** the chart doesn't transport data into the cluster" callout at the top of each template's Quickstart, pointing at the staging recipe in tracebloc/client/ingestor/README.md (which a parallel PR, tracebloc/client#144, adds with a working kubectl cp / pvc-shell Pod manifest and a production-pattern note).

Also re-frames the top-level Readme.md's Quickstart to introduce staging as step 2 of the customer flow (between "add the chart repo" and "write your ingest.yaml") rather than glossing past it.

Touches:

- Readme.md (top-level; staging step added explicitly)
- All 10 template READMEs (Prerequisite callout in each Quickstart)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shujaatTracebloc approved these changes May 20, 2026
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
* docs(#110): lead with declarative YAML; templates point at helm chart

  The README and 10 template READMEs described the legacy Python+Dockerfile pattern as primary, but the declarative YAML path (via `helm install tracebloc/ingestor`) is now the recommended customer flow: chart published to tracebloc.github.io/client as of v1.3.5, validated end-to-end on EKS, 8 lines of YAML per dataset.

  Changes:

  1. `Readme.md`: re-frame Quickstart to lead with the declarative path: `helm repo add tracebloc ...` → write 8 lines of YAML → `helm install tracebloc/ingestor`. Move the Python+Docker pattern to an "Advanced: custom processors (legacy Python pattern)" section. Refresh the Supported data types table, which was missing keypoint, semantic_segmentation, masked_language_modeling.
  2. Per-category template READMEs (10 of them: image_classification, object_detection, keypoint_detection, semantic_segmentation, text_classification, masked_language_modeling, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction): add a "Quickstart — declarative (recommended)" section at the top referencing examples/yaml/<category>.yaml with a copy-pasteable ingest.yaml + helm install command. The existing Python-script Usage stays as-is for the custom-processor case.
  3. In image_classification and object_detection, also rename "Usage" → "Advanced: custom processor script" so the heading itself signals primary vs advanced. The remaining 8 templates keep "Usage" as-is for now; the Quickstart section at the top is the strong discovery signal, and relabeling the Usage headers in those 8 is hygiene we can do as a follow-up.

  A customer landing on PyPI's `tracebloc-ingestor` page, the repo root, or any individual template's directory now sees the declarative path as the recommended option, with a copy-pasteable YAML + helm install command, before any Python-script content.

  Closes #110.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify the chart doesn't stage data; link to staging recipe

  A reviewer flagged that phrases like "Stage the data on the PVC at /data/shared/<prefix>/" implied the chart could put data there. It can't: customers stage data themselves, the chart only reads from wherever the cluster's shared PVC is already mounted.

  Adds an explicit "**Prerequisite:** the chart doesn't transport data into the cluster" callout at the top of each template's Quickstart, pointing at the staging recipe in tracebloc/client/ingestor/README.md (which a parallel PR, tracebloc/client#144, adds with a working kubectl cp / pvc-shell Pod manifest and a production-pattern note).

  Also re-frames the top-level Readme.md's Quickstart to introduce staging as step 2 of the customer flow (between "add the chart repo" and "write your ingest.yaml") rather than glossing past it.

  Touches:

  - Readme.md (top-level; staging step added explicitly)
  - All 10 template READMEs (Prerequisite callout in each Quickstart)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request May 20, 2026
docs: surface the declarative ingestor flow from top-level docs
saadqbal added a commit to tracebloc/data-ingestors that referenced this pull request May 20, 2026
* Add release-time smoke test for ingestor image (#96)

  * ci(#95): smoke-test ingestor image and sdist before publishing

    v0.3.0-rc1 of ghcr.io/tracebloc/ingestor shipped without the bundled schema/ingest.v1.json because tracebloc_ingestor/schema/ had no __init__.py and find_packages() silently dropped it. docker build, cosign, and SBOM all passed; the bug only surfaced at customer-side first use. Packaging fix lands separately; this is the CI guardrail so a green release can never again ship without the schema.

    release-image.yml: after `docker buildx build --push`, before cosign signs anything, run two probes against the built digest inside the image: `_load_schema()` (the exact path that crashed in v0.3.0-rc1) and an invocation of the tracebloc-ingest console script. The CLI has no --help; main() exits with a friendly INGEST_CONFIG error when called bare, which proves the console_scripts metadata was installed and the wrapper can import main(). Broken entrypoint wiring or an import-time failure produces a different output and fails the grep. Both probes run before cosign/release-notes so a regression aborts cleanly.

    publish-{master,dev}.yml: sdists go through MANIFEST.in, a separate packaging code path from the wheel's package_data, and the schema bug surfaced there too. Replace the bare `import tracebloc_ingestor` post-install check with the same `_load_schema()` probe.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  * ci(#95): bypass docker-entrypoint.sh in smoke step with --entrypoint

    The image's ENTRYPOINT is docker-entrypoint.sh, which exits 64 when MYSQL_HOST is unset, before the shell ever reaches `exec tracebloc-ingest`. Without --entrypoint, both `docker run` calls in the smoke step hit the MYSQL_HOST guard, fail under `set -euo pipefail`, and the workflow aborts every release.

    Override with --entrypoint python / --entrypoint tracebloc-ingest so the smoke probes hit the Python package directly. We're testing packaging, not the MySQL-wait wrapper, so standing up a MySQL sidecar for a 5-second sanity check is not worth it.

    Caught by Cursor Bugbot on PR #96.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: lead with declarative YAML; templates point at helm chart (#111)

  * docs(#110): lead with declarative YAML; templates point at helm chart

    The README and 10 template READMEs described the legacy Python+Dockerfile pattern as primary, but the declarative YAML path (via `helm install tracebloc/ingestor`) is now the recommended customer flow: chart published to tracebloc.github.io/client as of v1.3.5, validated end-to-end on EKS, 8 lines of YAML per dataset.

    Changes:

    1. `Readme.md`: re-frame Quickstart to lead with the declarative path: `helm repo add tracebloc ...` → write 8 lines of YAML → `helm install tracebloc/ingestor`. Move the Python+Docker pattern to an "Advanced: custom processors (legacy Python pattern)" section. Refresh the Supported data types table, which was missing keypoint, semantic_segmentation, masked_language_modeling.
    2. Per-category template READMEs (10 of them: image_classification, object_detection, keypoint_detection, semantic_segmentation, text_classification, masked_language_modeling, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction): add a "Quickstart — declarative (recommended)" section at the top referencing examples/yaml/<category>.yaml with a copy-pasteable ingest.yaml + helm install command. The existing Python-script Usage stays as-is for the custom-processor case.
    3. In image_classification and object_detection, also rename "Usage" → "Advanced: custom processor script" so the heading itself signals primary vs advanced. The remaining 8 templates keep "Usage" as-is for now; the Quickstart section at the top is the strong discovery signal, and relabeling the Usage headers in those 8 is hygiene we can do as a follow-up.

    A customer landing on PyPI's `tracebloc-ingestor` page, the repo root, or any individual template's directory now sees the declarative path as the recommended option, with a copy-pasteable YAML + helm install command, before any Python-script content.

    Closes #110.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  * docs: clarify the chart doesn't stage data; link to staging recipe

    A reviewer flagged that phrases like "Stage the data on the PVC at /data/shared/<prefix>/" implied the chart could put data there. It can't: customers stage data themselves, the chart only reads from wherever the cluster's shared PVC is already mounted.

    Adds an explicit "**Prerequisite:** the chart doesn't transport data into the cluster" callout at the top of each template's Quickstart, pointing at the staging recipe in tracebloc/client/ingestor/README.md (which a parallel PR, tracebloc/client#144, adds with a working kubectl cp / pvc-shell Pod manifest and a production-pattern note).

    Also re-frames the top-level Readme.md's Quickstart to introduce staging as step 2 of the customer flow (between "add the chart repo" and "write your ingest.yaml") rather than glossing past it.

    Touches:

    - Readme.md (top-level; staging step added explicitly)
    - All 10 template READMEs (Prerequisite callout in each Quickstart)

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: point staging-recipe links at develop branch (not main) (#112)

  The new "Stage your data on the shared PVC" section in tracebloc/client/ingestor/README.md lives on PR #144's branch; it will be on develop once #144 merges, and on main only after the next develop→main sync. The links in PR #111 pointed at /blob/main/ which won't resolve to the right section until that sync runs.

  Switch all 22 references from /blob/main/ to /blob/develop/ so the anchor resolves as soon as PR #144 lands on develop (single click, imminent). Develop is a stable enough ref long-term that the URLs remain durable after the eventual sync. If we later want to move back to /blob/main/ for the "stable URL" aesthetic, that's a one-character bulk-replace; not blocking.

  Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
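The two smoke probes from the #96 commit messages can be sketched as plain `docker run` argument lists. This is an illustration under stated assumptions, not the workflow's actual step: the commit text names `_load_schema()` but not the module it lives in, so the module is passed in as a parameter, and the helper names `smoke_probe_commands` and `cli_probe_passed` are hypothetical.

```python
def smoke_probe_commands(image: str, schema_module: str) -> list[list[str]]:
    """Build the two `docker run` invocations that bypass docker-entrypoint.sh."""
    # `schema_module` is a placeholder: the source does not say where _load_schema() is defined.
    schema_probe = f"from {schema_module} import _load_schema; _load_schema()"
    return [
        # Probe 1: run Python directly and load the bundled schema.
        ["docker", "run", "--rm", "--entrypoint", "python", image, "-c", schema_probe],
        # Probe 2: call the console script bare; it is expected to exit with an error.
        ["docker", "run", "--rm", "--entrypoint", "tracebloc-ingest", image],
    ]


def cli_probe_passed(output: str) -> bool:
    """The bare CLI should fail with a message that mentions INGEST_CONFIG."""
    return "INGEST_CONFIG" in output


for cmd in smoke_probe_commands("ghcr.io/tracebloc/ingestor@<digest>", "<schema_module>"):
    print(" ".join(cmd))
```

Overriding `--entrypoint` keeps the MYSQL_HOST guard in docker-entrypoint.sh out of the picture, which matches the reasoning in the second ci(#95) commit: the probes exercise packaging only.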
Summary
The declarative ingestor went live with v1.3.5 + v0.3.0, but the top-level docs didn't mention it. A customer landing on this repo had no discovery path to `ingestor/README.md`: `README.md` and `docs/INSTALL.md` had zero mentions of "ingest".

This PR fixes the discovery problem and rolls in two stale-docs fixes I tripped over while in the files.

Changes

- README.md: new "Ingest a dataset" section between Deploy and Links. Shows the canonical command + a sample ingest.yaml. Two new rows in the topic table:
  - `ingestor/README.md`
  - `tracebloc/data-ingestors` templates
- docs/INSTALL.md: new "Next: ingest your first dataset" continuation section at the end, so operators finishing the parent chart install land on the obvious next step.
- Drive-by fixes (caught while in the files; mentioned in the ticket "docs: add 'Ingest a dataset' discovery path from top-level README + INSTALL" #143):
  - "currently `v1.3.1`" → "currently `v1.3.5`"
  - `helm install ... tracebloc/tracebloc` → `tracebloc/client` (the actual chart name in the published helm repo)

Test plan

- `helm install ... tracebloc/client ...` commands match the actual chart name (`helm search repo tracebloc` confirms `tracebloc/client` and `tracebloc/ingestor`)
- `ingest.yaml` snippet matches the schema (apiVersion, kind, category, table, intent, csv, images, label)
- Links (`ingestor/README.md`, `docs/INSTALL.md`) resolve

Out of scope (separate PR)

`tracebloc/data-ingestors/Readme.md` and the 9 template READMEs still describe the legacy Python-script-based pattern as primary. That's a separate PR against tracebloc/data-ingestors; a customer landing on PyPI / the data-ingestors repo would still get confused. Filing the ticket + PR there next.

Closes

#143
🤖 Generated with Claude Code
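The second test-plan item (the `ingest.yaml` snippet matching the schema) can be approximated with a key-presence check. This is a rough sketch that assumes the eight listed fields are top-level keys of the parsed file; the real schema may nest some of them, and every value below except `kind` and `category` is a placeholder rather than anything taken from the chart.

```python
REQUIRED_KEYS = ("apiVersion", "kind", "category", "table", "intent", "csv", "images", "label")


def missing_keys(config: dict) -> list[str]:
    """List the schema fields named in the test plan that the config lacks."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical config, as it might look after parsing ingest.yaml into a dict.
example = {
    "apiVersion": "<api-version>",
    "kind": "IngestConfig",
    "category": "image_classification",
    "table": "<table-name>",
    "intent": "<intent>",
    "csv": "/data/shared/cats-dogs/<labels-file>",
    "images": "/data/shared/cats-dogs/<images-dir>",
    "label": "<label-column>",
}

print(missing_keys(example))                   # nothing missing
print(missing_keys({"kind": "IngestConfig"}))  # everything except "kind"
```

A check like this only catches dropped fields; validating types and allowed values would need the bundled JSON schema itself.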
Note
Low Risk
Low risk documentation-only changes; main impact is correcting Helm chart names/versions in install commands, which could affect copy/paste installs if incorrect.
Overview
Surfaces the declarative dataset ingestion workflow by adding an "Ingest a dataset" section to `README.md` and a "Next: ingest your first dataset" follow-up section to `docs/INSTALL.md`, including an example `IngestConfig` YAML and `helm install tracebloc/ingestor` command.

Also fixes stale install guidance by updating Helm commands from `tracebloc/tracebloc` to `tracebloc/client` and bumping the referenced chart version to `v1.3.5`, plus expanding `ingestor/README.md` with explicit shared-PVC data staging instructions (including a `kubectl cp` / pvc-shell pattern).

Reviewed by Cursor Bugbot for commit 57d773e. Bugbot is set up for automated code reviews on this repo.