From c7d111847d6cec984a5f948fa5025b41407c7052 Mon Sep 17 00:00:00 2001
From: Divya
Date: Mon, 1 Jun 2026 15:10:50 +0530
Subject: [PATCH] docs: make declarative-ingest staging self-contained (issue #131 B/C)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes the docs side of data-ingestors#131:

- B2 / B1: section 2 of the declarative path linked out for the staging
  recipe and described `kubectl cp` while the Detailed Setup section
  (further down on the same page) prescribed a host-path `cp -R`.
  Replaced section 2 with an inline host-path recipe that matches the
  Detailed Setup section, and demoted `kubectl cp` to a Note for
  multi-node / EKS deployments. The recipe now uses a `` subdirectory
  so the path lines up with the `/data/shared//...` style used in
  ingest.yaml examples.
- C2: section 4 was silent on where CLIENT_ID / CLIENT_PASSWORD come
  from in the declarative path. Added a sentence noting the ingestor
  Pod inherits them from the Kubernetes Secret the parent
  tracebloc/client chart creates in at install time — no creds are
  passed on the `helm install` line.
- C5: section 4 mentioned the run-twice rule only as a trailing
  parenthetical. Promoted it to bolded prose and added a worked
  train + test pair (two `helm install` invocations, distinct release
  names + `table:` + `intent:`) so the rule is concrete.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 create-use-case/prepare-dataset.mdx | 32 +++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/create-use-case/prepare-dataset.mdx b/create-use-case/prepare-dataset.mdx
index 05d77a3..3133093 100644
--- a/create-use-case/prepare-dataset.mdx
+++ b/create-use-case/prepare-dataset.mdx
@@ -50,7 +50,23 @@ Append `--version ` to pin a specific chart version.
 
 ### 2. Stage your data on the cluster's shared PVC
 
-The chart **doesn't transport data into the cluster** — it points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there. The simplest pattern for a small dataset is a throwaway `kubectl cp` Pod that mounts the PVC; for production you'd typically use an init container with cloud-storage sync. Full staging recipe and manifests live in the [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc).
+The chart **doesn't transport data into the cluster** — it points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there.
+
+For a single-node workspace (the default install), the PVC is backed by a host directory the installer created at `~/.tracebloc//data/`. Drop your files into a per-dataset subdirectory:
+
+```bash
+# Host path on the machine where the tracebloc client is installed.
+# Pick a per dataset — it becomes the path you reference in ingest.yaml.
+mkdir -p ~/.tracebloc//data/
+cp -R LOCAL_PATH/images ~/.tracebloc//data//
+cp LOCAL_PATH/labels.csv ~/.tracebloc//data//
+```
+
+Inside the ingestor Pod those files appear at `/data/shared//...` — that's what you'll put in `ingest.yaml` below.
+
+
+For multi-node or EKS deployments where the PVC isn't backed by a local host path, use a throwaway `kubectl cp` Pod or a cloud-storage init container instead. See the [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc) for those recipes.
+
 
 ### 3. Write your `ingest.yaml`
 
@@ -71,13 +87,21 @@ The top-level shape (`apiVersion`, `kind`, `category`, `table`, `intent`, `label
 
 ### 4. Install once per dataset
 
+The ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. **Run it twice per dataset** — once with `intent: train`, once with `intent: test` — using distinct `table:` names. The example below shows both releases:
+
 ```bash
-helm install my-cats-dogs tracebloc/ingestor \
+# Train release — points at the ingest.yaml from step 3 (table: cats_dogs_train, intent: train)
+helm install cats-dogs-train tracebloc/ingestor \
+  --namespace \
+  --set-file ingestConfig=./ingest-train.yaml
+
+# Test release — same shape, with table: cats_dogs_test and intent: test
+helm install cats-dogs-test tracebloc/ingestor \
   --namespace \
-  --set-file ingestConfig=./ingest.yaml
+  --set-file ingestConfig=./ingest-test.yaml
 ```
 
-The ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset (one helm release per dataset, with different `table:` and `intent:` for train and test).
+Each `helm install` is a separate release (the first argument is the release name), so the two runs don't collide. The ingestor Pod picks up `CLIENT_ID` / `CLIENT_PASSWORD` automatically from the Kubernetes Secret the parent `tracebloc/client` chart created in `` at install time — you don't pass credentials on the `helm install` command.
 
 Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md).