tracebloc · divyasinghds · Jun 1, 2026
diff --git a/create-use-case/prepare-dataset.mdx b/create-use-case/prepare-dataset.mdx
@@ -1,18 +1,18 @@
 ---
 title: "Prepare Data"
 description: "Learn how to prepare and ingest your datasets into tracebloc using containerized data ingestors. Complete guide for CSV, image, and text data with Kubernetes deployment steps."
 ---

 ## Overview

 Make your data available to the Kubernetes cluster so it can be used for training and evaluation. Whether your client runs on Azure, AWS, Google Cloud, or a local Minikube setup, dataset ingestion works the same way.

 The data ingestor is a lightweight service that bridges your raw data and the cluster's persistent storage. It comes with ready-made templates (CSV, images, text) that you can use as starting points and customize for your own dataset. By containerizing the ingestion step, the ingestor validates data format and schema, enforces consistency, and transfers the dataset securely into the cluster's SQL storage, where it becomes accessible to all training and evaluation jobs.

 This guide covers:
 - Customizing ingestor templates for different data types (CSV, images, text)
 - Deploying the data ingestor for training and test data using Kubernetes
 - Managing datasets through the tracebloc interface

 **IMPORTANT** Make sure that the data format and ML task are supported and that data standards are met by reviewing the [docs](/create-use-case/prerequisites). You must run the process twice: once to ingest the training data and once to ingest the test data.

@@ -20,14 +20,14 @@

 You can ingest data into your client in two ways:

 - **Declarative YAML (recommended, simpler)** — describe your dataset in ~8 lines of `ingest.yaml`, then `helm install`. No Dockerfile, no custom Python script. The official ingestor image runs it for you. Use this for any dataset that fits a supported category.
 - **Custom Python template + Kubernetes Job (advanced)** — clone the [data-ingestors repo](https://github.com/tracebloc/data-ingestors), pick a per-category template script, edit it, build and push a Docker image, then `kubectl apply` an `ingestor-job.yaml`. Use this when the declarative schema can't express what your data needs — e.g. non-trivial preprocessing, a custom validator, or a `BaseProcessor` subclass.

 Start with the declarative method below. Drop down to the custom-template flow only if you need it.

 ## Declarative YAML (recommended)

 Describe your dataset in ~8 lines of YAML, then `helm install`. The official ingestor image (published as `ghcr.io/tracebloc/ingestor`) runs it. No Dockerfile, no Python script.

 ### 1. Add the chart repo (one-time)

@@ -36,7 +36,7 @@
 helm repo update
 ```

 The `tracebloc/client` parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The `tracebloc/ingestor` subchart submits per-dataset ingestion runs against it.

 <Note>
 If you installed the client via the one-liner (`bash <(curl -fsSL https://tracebloc.io/i.sh)`), use `--reset-then-reuse-values` so the helm upgrade doesn't drop the values the installer applied:
@@ -50,11 +50,27 @@
 
 ### 2. Stage your data on the cluster's shared PVC
 
-The chart **doesn't transport data into the cluster** — it points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there. The simplest pattern for a small dataset is a throwaway `kubectl cp` Pod that mounts the PVC; for production you'd typically use an init container with cloud-storage sync. Full staging recipe and manifests live in the [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc).
+The chart **doesn't transport data into the cluster** — it points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there.
+
+For a single-node workspace (the default install), the PVC is backed by a host directory the installer created at `~/.tracebloc/<workspace>/data/`. Drop your files into a per-dataset subdirectory:
+
+```bash
+# Host path on the machine where the tracebloc client is installed.
+# Pick a <prefix> per dataset — it becomes the path you reference in ingest.yaml.
+mkdir -p ~/.tracebloc/<workspace>/data/<prefix>
+cp -R LOCAL_PATH/images   ~/.tracebloc/<workspace>/data/<prefix>/
+cp    LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/<prefix>/
+```
+
+Inside the ingestor Pod those files appear at `/data/shared/<prefix>/...` — that's what you'll put in `ingest.yaml` below.
+
+<Note>
+For multi-node or EKS deployments where the PVC isn't backed by a local host path, use a throwaway `kubectl cp` Pod or a cloud-storage init container instead. See the [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc) for those recipes.
+</Note>
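
For the single-node layout described above, the host-to-Pod path translation can be sketched in a few lines. This is a hypothetical helper, not part of the tracebloc tooling: the function name and the explicit `home` argument are assumptions made for illustration, and only the `~/.tracebloc/<workspace>/data/` and `/data/shared/` locations come from this guide.

```python
from pathlib import PurePosixPath

# Mount point of the shared PVC inside the ingestor Pod, as stated in this guide.
POD_MOUNT = PurePosixPath("/data/shared")


def to_pod_path(host_path: str, workspace: str, home: str) -> str:
    """Translate a staged host file path into the path seen inside the Pod.

    Hypothetical helper: `home` is the home directory of the user who ran
    the installer, for example "/home/me".
    """
    data_root = PurePosixPath(home) / ".tracebloc" / workspace / "data"
    # relative_to raises ValueError when the file is outside the staged directory.
    relative = PurePosixPath(host_path).relative_to(data_root)
    return str(POD_MOUNT / relative)
```

Calling `to_pod_path("/home/me/.tracebloc/demo/data/cats/labels.csv", "demo", "/home/me")` returns `/data/shared/cats/labels.csv`, which is the kind of path that goes into `ingest.yaml`.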
 
 ### 3. Write your `ingest.yaml`
 
 The example below is for `image_classification`. **Other categories require different fields** — e.g. `tabular_classification` has no `images:` and instead needs a typed `schema:` block. Don't copy this one blindly; grab the matching file from [`examples/yaml/`](https://github.com/tracebloc/data-ingestors/tree/master/examples/yaml) (one per category) and edit from there. Per-category sample data and READMEs live under [`templates/`](https://github.com/tracebloc/data-ingestors/tree/master/templates).

 ```yaml
 apiVersion: tracebloc.io/v1
@@ -67,34 +83,42 @@
 label: label
 ```

 The top-level shape (`apiVersion`, `kind`, `category`, `table`, `intent`, `label`) is the same for every category; the `category` field picks the validator set, file-extension defaults, and column conventions. The data-source fields (`csv:`, `images:`, `schema:`, …) vary per category. The paths are *paths inside the ingestor Pod*, which is the PVC mount you populated in step 2.
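
Because the top-level shape is the same for every category, it is easy to check before installing. The sketch below is a hypothetical pre-flight helper, not part of the tracebloc tooling. It assumes the `ingest.yaml` has already been parsed into a plain dict (how you parse it, for example with a YAML library, is up to you) and only reports which of the six top-level keys listed above are missing. It does not validate the category-specific data-source fields.

```python
# Top-level keys named in this guide as common to every category.
REQUIRED_KEYS = ("apiVersion", "kind", "category", "table", "intent", "label")


def missing_top_level_keys(config: dict) -> list[str]:
    """Return the required top-level keys that are absent from a parsed config."""
    return [key for key in REQUIRED_KEYS if key not in config]
```

An empty list means the common shape is complete. The official ingestor still runs its own validation, so treat this only as an early sanity check.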
 
 ### 4. Install once per dataset
 
+The ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. **Run it twice per dataset** — once with `intent: train`, once with `intent: test` — using distinct `table:` names. The example below shows both releases:
+
 ```bash
-helm install my-cats-dogs tracebloc/ingestor \
+# Train release: uses the step 3 config saved as ingest-train.yaml (table: cats_dogs_train, intent: train)
+helm install cats-dogs-train tracebloc/ingestor \
+  --namespace <workspace> \
+  --set-file ingestConfig=./ingest-train.yaml
+
+# Test release — same shape, with table: cats_dogs_test and intent: test
+helm install cats-dogs-test tracebloc/ingestor \
   --namespace <workspace> \
-  --set-file ingestConfig=./ingest.yaml
+  --set-file ingestConfig=./ingest-test.yaml
 ```
 
-The ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset (one helm release per dataset, with different `table:` and `intent:` for train and test).
+Each `helm install` is a separate release (the first argument is the release name), so the two runs don't collide. The ingestor Pod picks up `CLIENT_ID` / `CLIENT_PASSWORD` automatically from the Kubernetes Secret the parent `tracebloc/client` chart created in `<workspace>` at install time — you don't pass credentials on the `helm install` command.
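
Since the train and test configs differ only in `table:` and `intent:`, one way to avoid drift between `ingest-train.yaml` and `ingest-test.yaml` is to derive one from the other. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, it edits the YAML as plain text with the standard library rather than parsing it, and it assumes `table:` and `intent:` each appear exactly once as unindented top-level lines.

```python
import re


def _set_field(text: str, key: str, value: str) -> str:
    """Replace the value of a single unindented `key:` line in YAML text."""
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash interpretation in `value`.
    updated, count = pattern.subn(lambda _match: f"{key}: {value}", text)
    if count != 1:
        raise ValueError(f"expected exactly one top-level '{key}:' line, found {count}")
    return updated


def with_table_and_intent(yaml_text: str, table: str, intent: str) -> str:
    """Return a copy of the config text with new `table:` and `intent:` values."""
    if intent not in ("train", "test"):  # the two intents used in this guide
        raise ValueError(f"unexpected intent: {intent!r}")
    return _set_field(_set_field(yaml_text, "table", table), "intent", intent)
```

For example, reading `ingest-train.yaml`, calling `with_table_and_intent(text, "cats_dogs_test", "test")`, and writing the result to `ingest-test.yaml` yields the second file used above. All other fields are carried over unchanged.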
 
 Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md).
 
 ## Custom Python template (advanced)

 Use this flow when the declarative schema can't express what your data needs — typically when you have non-trivial preprocessing logic, a custom validator, or a `BaseProcessor` subclass. The sections below — Quick Setup and Detailed Setup — both describe this advanced path.

 ## Quick Setup

 Use this quick setup if you already have an ingestor configured and just want to switch datasets or toggle between training and testing. If you are setting up for the first time, go to the next section for the detailed walkthrough.

 ### Steps

 1. Pick a template script and edit it. E.g. `/templates/tabular_classification/tabular_classification.py`
 - Update csv options and data_path
 - Only for tabular data: Update schema
 - Set `schema` and `CSVIngestor()` parameters such as category, intent, and label_column to match the data type, task, and train/test purpose

 ```python
 ingestor = CSVIngestor(
@@ -128,9 +152,9 @@

 ### 1. Configure a Template

 This section walks you through the step-by-step setup of a data ingestor. You will clone the repository, select the right template for your data type, and customize it to match your task. Follow this guide if you are setting up an ingestor for the first time or need full control beyond the quick setup.

 ### Clone the Data Ingestor Repository

 Clone the public [Data Ingestor GitHub repository](https://github.com/tracebloc/data-ingestors):

@@ -195,14 +219,14 @@
        ...
 ```

 Database, APIClient, and the other values are configured automatically from the environment variables defined in `ingestor-job.yaml`.

 - `config.LABEL_FILE`: Path to local csv label file
 - `config.BATCH_SIZE`: Batch size used during ingestion

 ### Customize a Template

 Templates provide a starting point, but every dataset has its own format and labels. In this step you adapt the template to your data by tuning CSV ingestion options and setting the ingestor parameters (category, label column, intent, data path and schema). The following example in `templates/tabular_classification/tabular_classification.py` shows how to ingest a tabular dataset, but the setup works the same way for image or text data.

 #### Needed for Tabular Data: Define Schema

@@ -255,7 +279,7 @@
 ```

 #### Set CSV ingestion options
 Customize parsing, memory handling, and data cleaning with the `csv_options` dictionary:

 ```python
 csv_options = {
@@ -270,9 +294,9 @@
 }
 ```

 #### Set Up the Ingestor

 Define the Ingestor instance with the required configuration. See the tabular data example below:

 ```python
 ingestor = CSVIngestor(
@@ -304,7 +328,7 @@

 ### Docker Hub Setup (first-time users)

 The cluster pulls your ingestor image from a public Docker registry, so you need an account before you can push. If you already have one, skip to [Edit Dockerfile](#edit-dockerfile).

 1. **Create a Docker Hub account** at [hub.docker.com/signup](https://hub.docker.com/signup) and verify your email.
 2. **Log in from your terminal** so the `docker push` command can authenticate:
@@ -313,18 +337,18 @@
   docker login
   ```

 3. **Push the data ingestor image** to your account using the build/push commands in the next section. The image name takes the form `<your-docker-username>/<image-name>:<tag>` — the username segment must match the account you just created.
 4. **Make the image public** so the cluster can pull it without credentials:
   - Go to [hub.docker.com/repositories](https://hub.docker.com/repositories), open the repository you just pushed.
   - Click **Settings → Visibility settings → Make public**.

   Keeping the image private is also fine, but then you must create a Kubernetes `imagePullSecret` named `regcred` in the client namespace (the `ingestor-job.yaml` already references it).

 ### Place data files on the client host

 Datasets are **not** baked into the Docker image. They live on the client host in the per-workspace data directory and are mounted into the ingestor pod through the shared PVC (`client-pvc` → `/data/shared`).

 Copy your dataset into the client's data directory, where `<workspace>` is the workspace name you chose during client install (which is also the Helm release name and the Kubernetes namespace — the chart uses the same value for all three). The directory `~/.tracebloc/<workspace>/data/` is created automatically by the installer; just drop your files into it:

 ```bash
 # Host path on the machine where the tracebloc client is installed.
@@ -333,20 +357,20 @@
 cp    LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/
 ```

 Inside the ingestor pod this directory is mounted at `/data/shared`, so the same files appear as `/data/shared/images/...` and `/data/shared/labels.csv`. Set `SRC_PATH` and `LABEL_FILE` in `ingestor-job.yaml` to point at those in-pod paths (see [Configure Kubernetes](#3-configure-kubernetes) below).

 For tabular data the same rule applies — drop the single `labels.csv` (with features and labels) into `~/.tracebloc/<workspace>/data/`.

 ### Edit Dockerfile

 The Dockerfile only needs to package the ingestion script — the dataset is mounted at runtime, so do **not** `COPY` data into the image:

 ```dockerfile
 # Copy the ingestion script into /app
 COPY templates/tabular_classification/tabular_classification.py /app/ingestor.py
 ```

 If the cluster enforces the `restricted` Pod Security Standard (see [Run as non-root](#run-as-non-root) below), also add a non-root user to the Dockerfile, **before** the `# Set the entrypoint` line:

 ```dockerfile
 RUN groupadd -g 1000 app && \
@@ -445,14 +469,14 @@
 - `image`, your Docker image (`imagePullPolicy: Always` for Docker Hub, `IfNotPresent` for local images)
 - `CLIENT_ID`, `CLIENT_PASSWORD` from the [tracebloc client view](https://ai.tracebloc.io/clients)
 - `TABLE_NAME`, unique per dataset and without spaces. Train and test data must use different table names
 - `LABEL_FILE`, path inside the ingestor pod (under `/data/shared`) to the CSV with file paths and labels — must match the location of the file you placed in `~/.tracebloc/<workspace>/data/`
 - `SRC_PATH`, root inside the pod where the dataset directory is mounted (`/data/shared`)
 - `BATCH_SIZE`, the number of entries sent to the server per request. Optional, defaults to 4000. Keep it consistent across data types. The right value depends on available system memory rather than on per-item size such as image dimensions, and a value that is too large can exhaust memory. It was tested up to 10,000, but 5,000 is a safe value for most systems.
 - `LOG_LEVEL`, "WARNING" for all warnings and errors, "INFO" for all logs, "ERROR" for errors only
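
The rules in the list above can be checked before applying the Job. The sketch below is a hypothetical pre-flight function, not part of the data-ingestors repository. It assumes you pass in the values you intend to set in `ingestor-job.yaml` and only covers the constraints stated here: table names without spaces and distinct for train and test, `LABEL_FILE` located under `SRC_PATH`, and a positive `BATCH_SIZE`.

```python
from pathlib import PurePosixPath


def check_job_settings(
    train_table: str,
    test_table: str,
    label_file: str,
    src_path: str = "/data/shared",  # in-pod mount point from this guide
    batch_size: int = 4000,          # default stated in this guide
) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for purpose, name in (("train", train_table), ("test", test_table)):
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"{purpose} TABLE_NAME {name!r} must be non-empty and contain no spaces")
    if train_table == test_table:
        problems.append("train and test TABLE_NAME values must differ")
    # is_relative_to compares whole path components (Python 3.9+).
    if not PurePosixPath(label_file).is_relative_to(src_path):
        problems.append(f"LABEL_FILE {label_file!r} is not under {src_path!r}")
    if batch_size <= 0:
        problems.append("BATCH_SIZE must be a positive integer")
    return problems
```

For instance, `check_job_settings("cats_dogs_train", "cats_dogs_test", "/data/shared/labels.csv")` returns an empty list. The ingestor's own validation step remains the authority; this only catches obvious mistakes early.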

 ### 4. Deploy

 Run the ingestor as a Kubernetes Job:

 ```bash
 kubectl apply -f ingestor-job.yaml -n <workspace>
@@ -468,7 +492,7 @@

 ### Run as non-root

 If the namespace enforces the `restricted` [Pod Security Standard](https://kubernetes.io/docs/concepts/security/pod-security-standards/), `kubectl apply` will be admitted but the pod will be rejected with a warning like:

 ```text
 Warning: would violate PodSecurity "restricted:latest":
@@ -494,7 +518,7 @@
    type: RuntimeDefault
 ```

 **2. Run the container as a non-root user.** Add the following to the Dockerfile **before** the `# Set the entrypoint` line so the image ships with a UID that satisfies `runAsNonRoot: true`:

 ```dockerfile
 RUN groupadd -g 1000 app && \
@@ -506,7 +530,7 @@

 Rebuild and push the image, then re-apply the job.

 The data ingestor always runs a validation step before it ingests data and moves files.


 #### Verify Deployment
@@ -528,7 +552,7 @@
 **Interface displays:**
 - Dataset name, ID, and record count
 - Data type (Tabular, Image, Text) and purpose (Training/Testing)
 - Namespace and GPU requirements

 ## Best Practices
 - Deploy jobs for training and testing simultaneously using different job names