32 changes: 28 additions & 4 deletions create-use-case/prepare-dataset.mdx
@@ -1,18 +1,18 @@
---
title: "Prepare Data"
description: "Learn how to prepare and ingest your datasets into tracebloc using containerized data ingestors. Complete guide for CSV, image, and text data with Kubernetes deployment steps."
---

## Overview

Make your data available to the Kubernetes cluster so it can be used for training and evaluation. Wherever your client runs (Azure, AWS, Google Cloud, or a local Minikube setup), the process of ingesting datasets works the same way.

The data ingestor is a lightweight service that bridges your raw data and the cluster's persistent storage. It comes with ready-made templates (CSV, images, text) that you can use as starting points and customize for your own dataset. By containerizing the ingestion step, the ingestor validates data format and schema, enforces consistency, and transfers the dataset securely into the cluster's SQL storage, where it becomes accessible to all training and evaluation jobs.

This guide covers:
- Customizing ingestor templates for different data types (CSV, images, text)
- Deploying the data ingestor for training and test data using Kubernetes
- Managing datasets through the tracebloc interface

**IMPORTANT** Make sure that the data format and ML task are supported and that data standards are met by reviewing the [docs](/create-use-case/prerequisites). You must run the process twice: once to ingest training data and once to ingest testing data.

@@ -20,14 +20,14 @@

You can ingest data into your client in two ways:

- **Declarative YAML (recommended, simpler)** — describe your dataset in ~8 lines of `ingest.yaml`, then `helm install`. No Dockerfile, no custom Python script. The official ingestor image runs it for you. Use this for any dataset that fits a supported category.

- **Custom Python template + Kubernetes Job (advanced)** — clone the [data-ingestors repo](https://github.com/tracebloc/data-ingestors), pick a per-category template script, edit it, build and push a Docker image, then `kubectl apply` an `ingestor-job.yaml`. Use this when the declarative schema can't express what your data needs — e.g. non-trivial preprocessing, a custom validator, or a `BaseProcessor` subclass.

Start with the declarative method below. Drop down to the custom-template flow only if you need it.

## Declarative YAML (recommended)

Describe your dataset in ~8 lines of YAML, then `helm install`. The official ingestor image (published as `ghcr.io/tracebloc/ingestor`) runs it. No Dockerfile, no Python script.

### 1. Add the chart repo (one-time)

@@ -36,7 +36,7 @@
helm repo update
```

The `tracebloc/client` parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The `tracebloc/ingestor` subchart submits per-dataset ingestion runs against it.

<Note>
If you installed the client via the one-liner (`bash <(curl -fsSL https://tracebloc.io/i.sh)`), use `--reset-then-reuse-values` so the helm upgrade doesn't drop the values the installer applied:
@@ -50,11 +50,27 @@

### 2. Stage your data on the cluster's shared PVC

The chart **doesn't transport data into the cluster**. It points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Before installing, get your raw files there.

For a single-node workspace (the default install), the PVC is backed by a host directory the installer created at `~/.tracebloc/<workspace>/data/`. Drop your files into a per-dataset subdirectory:

```bash
# Host path on the machine where the tracebloc client is installed.
# Pick a <prefix> per dataset — it becomes the path you reference in ingest.yaml.
mkdir -p ~/.tracebloc/<workspace>/data/<prefix>
cp -R LOCAL_PATH/images ~/.tracebloc/<workspace>/data/<prefix>/
cp LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/<prefix>/
```

Inside the ingestor Pod those files appear at `/data/shared/<prefix>/...` — that's what you'll put in `ingest.yaml` below.
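
As a rough illustration of the path mapping described above, the Python sketch below converts a staged host file into the path the ingestor Pod would see. It assumes the default single-node layout (`~/.tracebloc/<workspace>/data/` on the host, `/data/shared` in the Pod). The helper names `host_data_dir` and `pod_path` are hypothetical and not part of any tracebloc tooling.

```python
from pathlib import Path, PurePosixPath

# Assumption: single-node install, where ~/.tracebloc/<workspace>/data/ on the
# host backs the PVC that the ingestor Pod mounts at /data/shared.
POD_MOUNT = PurePosixPath("/data/shared")


def host_data_dir(workspace: str) -> Path:
    """Host directory that backs the shared PVC for this workspace."""
    return Path.home() / ".tracebloc" / workspace / "data"


def pod_path(workspace: str, host_file: Path) -> str:
    """Translate a staged host file into the path seen inside the Pod.

    Raises ValueError if the file is not inside the workspace data directory.
    """
    relative = host_file.relative_to(host_data_dir(workspace))
    return str(POD_MOUNT.joinpath(*relative.parts))
```

For a file staged under a `cats_dogs` prefix, `pod_path("demo", host_data_dir("demo") / "cats_dogs" / "labels.csv")` yields `/data/shared/cats_dogs/labels.csv`, which is the kind of value that goes into `ingest.yaml`.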

<Note>
For multi-node or EKS deployments where the PVC isn't backed by a local host path, use a throwaway `kubectl cp` Pod or a cloud-storage init container instead. See the [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md#stage-your-data-on-the-shared-pvc) for those recipes.

</Note>

### 3. Write your `ingest.yaml`

The example below is for `image_classification`. **Other categories require different fields** — e.g. `tabular_classification` has no `images:` and instead needs a typed `schema:` block. Don't copy this one blindly; grab the matching file from [`examples/yaml/`](https://github.com/tracebloc/data-ingestors/tree/master/examples/yaml) (one per category) and edit from there. Per-category sample data and READMEs live under [`templates/`](https://github.com/tracebloc/data-ingestors/tree/master/templates).

```yaml
apiVersion: tracebloc.io/v1
@@ -67,34 +83,42 @@
label: label
```

The top-level shape (`apiVersion`, `kind`, `category`, `table`, `intent`, `label`) is the same for every category; the `category` field picks the validator set, file-extension defaults, and column conventions. The data-source fields (`csv:`, `images:`, `schema:`, …) vary per category. The paths are *paths inside the ingestor Pod*, which is the PVC mount you populated in step 2.
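
A quick pre-flight check can catch a missing top-level field before you install the chart. The sketch below is not the ingestor's own validation. It only checks the six top-level keys named above on a config that has already been parsed into a dict (for example with a YAML library), and it assumes `intent` is either `train` or `test`. The function names are illustrative.

```python
from __future__ import annotations

# Top-level keys listed in the text above; category-specific data-source
# fields (csv:, images:, schema:, ...) are deliberately not checked here.
REQUIRED_KEYS = ("apiVersion", "kind", "category", "table", "intent", "label")
ALLOWED_INTENTS = {"train", "test"}  # assumption: only these two intents are used


def check_top_level(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    intent = config.get("intent")
    if intent is not None and intent not in ALLOWED_INTENTS:
        problems.append(f"intent must be 'train' or 'test', got {intent!r}")
    return problems


def check_pair(train: dict, test: dict) -> list[str]:
    """Check that a train/test pair does not share a table name."""
    if train.get("table") == test.get("table"):
        return ["train and test must use distinct table names"]
    return []
```

Running `check_top_level` on each parsed file and `check_pair` on the train/test pair gives an early warning without touching the cluster.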

### 4. Install once per dataset

The ingestor runs once: validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. **Run it twice per dataset** — once with `intent: train`, once with `intent: test` — using distinct `table:` names. The example below shows both releases:

```bash
# Train release: points at the ingest.yaml from step 3 (table: cats_dogs_train, intent: train)
helm install cats-dogs-train tracebloc/ingestor \
  --namespace <workspace> \
  --set-file ingestConfig=./ingest-train.yaml

# Test release: same shape, with table: cats_dogs_test and intent: test
helm install cats-dogs-test tracebloc/ingestor \
  --namespace <workspace> \
  --set-file ingestConfig=./ingest-test.yaml
```

Each `helm install` is a separate release (the first argument is the release name), so the two runs don't collide. The ingestor Pod picks up `CLIENT_ID` / `CLIENT_PASSWORD` automatically from the Kubernetes Secret the parent `tracebloc/client` chart created in `<workspace>` at install time, so you don't pass credentials on the `helm install` command.
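
If you script the two installs, it can help to build both commands from one place so the release names and config files stay in sync. The following Python sketch only assembles the argument lists for the `helm install` form used in this section. The `<dataset>-<intent>` release naming and the `ingest-<intent>.yaml` file naming follow the example and are otherwise an assumption.

```python
from __future__ import annotations

import shlex

CHART = "tracebloc/ingestor"


def helm_install_command(release: str, workspace: str, config_file: str) -> list[str]:
    """Build the argv for one ingestor release; nothing is executed here."""
    return [
        "helm", "install", release, CHART,
        "--namespace", workspace,
        "--set-file", f"ingestConfig={config_file}",
    ]


def train_test_commands(dataset: str, workspace: str) -> list[list[str]]:
    """One command per intent, each with its own release name and config file."""
    return [
        helm_install_command(f"{dataset}-{intent}", workspace, f"./ingest-{intent}.yaml")
        for intent in ("train", "test")
    ]


# Print the commands for review instead of running them.
for command in train_test_commands("cats-dogs", "my-workspace"):
    print(shlex.join(command))
```

To actually run a command, pass the list to `subprocess.run(command, check=True)` on a machine where `helm` is configured for the cluster.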

Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → [client ingestor README](https://github.com/tracebloc/client/blob/develop/ingestor/README.md).

## Custom Python template (advanced)

Use this flow when the declarative schema can't express what your data needs — typically when you have non-trivial preprocessing logic, a custom validator, or a `BaseProcessor` subclass. The sections below — Quick Setup and Detailed Setup — both describe this advanced path.

## Quick Setup

Use this quick setup if you already have an ingestor configured and just want to switch datasets or toggle between training and testing. If you are setting up for the first time, go to the next section for the detailed walkthrough.

### Steps

1. Pick a template script and edit it. E.g. `/templates/tabular_classification/tabular_classification.py`
- Update the CSV options (`csv_options`) and `data_path`
- Only for tabular data: Update schema
- Set `schema` and the `CSVIngestor()` parameters such as `category`, `intent`, and `label_column` to match the data type, task, and train/test purpose

```python
ingestor = CSVIngestor(
@@ -128,9 +152,9 @@

### 1. Configure a Template

This section walks you through the step-by-step setup of a data ingestor. You will clone the repository, select the right template for your data type, and customize it to match your task. Follow this guide if you are setting up an ingestor for the first time or need full control beyond the quick setup.

### Clone the Data Ingestor Repository

Clone the public [Data Ingestor GitHub repository](https://github.com/tracebloc/data-ingestors):

@@ -195,14 +219,14 @@
...
```

Database, APIClient, and other values are configured automatically from the environment variables defined in `ingestor-job.yaml`.

- `config.LABEL_FILE`: Path to the local CSV label file
- `config.BATCH_SIZE`: Batch size used during ingestion
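
To illustrate what a batch size means in practice, here is a minimal, generic batching sketch. It is not the ingestor's implementation: `batched` is a hypothetical helper that groups any iterable of rows into lists of at most `batch_size` items, so that each list could be sent or inserted in one request.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size rows.

    The input is consumed lazily, so only one batch is held in memory at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

With ten rows and a batch size of four, this produces batches of four, four, and two rows. A larger batch size means fewer requests but more rows held in memory at once.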

### Customize a Template

Templates provide a starting point, but every dataset has its own format and labels. In this step you adapt the template to your data by tuning CSV ingestion options and setting the ingestor parameters (category, label column, intent, data path and schema). The following example in `templates/tabular_classification/tabular_classification.py` shows how to ingest a tabular dataset, but the setup works the same way for image or text data.

#### Needed for Tabular Data: Define Schema

@@ -255,7 +279,7 @@
```

#### Set CSV ingestion options
Customize parsing, memory handling, and data cleaning with the `csv_options` dictionary:

```python
csv_options = {
@@ -270,9 +294,9 @@
}
```

#### Set Up the Ingestor

Define the Ingestor instance with the required configuration. See the tabular data example below:

```python
ingestor = CSVIngestor(
@@ -304,7 +328,7 @@

### Docker Hub Setup (first-time users)

The cluster pulls your ingestor image from a public Docker registry, so you need an account before you can push. If you already have one, skip to [Edit Dockerfile](#edit-dockerfile).

1. **Create a Docker Hub account** at [hub.docker.com/signup](https://hub.docker.com/signup) and verify your email.
2. **Log in from your terminal** so the `docker push` command can authenticate:
@@ -313,18 +337,18 @@
docker login
```

3. **Push the data ingestor image** to your account using the build/push commands in the next section. The image name takes the form `<your-docker-username>/<image-name>:<tag>` — the username segment must match the account you just created.

4. **Make the image public** so the cluster can pull it without credentials:
- Go to [hub.docker.com/repositories](https://hub.docker.com/repositories), open the repository you just pushed.
- Click **Settings → Visibility settings → Make public**.

Keeping the image private is also fine, but then you must create a Kubernetes `imagePullSecret` named `regcred` in the client namespace (the `ingestor-job.yaml` already references it).

### Place data files on the client host

Datasets are **not** baked into the Docker image. They live on the client host in the per-workspace data directory and are mounted into the ingestor pod through the shared PVC (`client-pvc` → `/data/shared`).

Copy your dataset into the client's data directory, where `<workspace>` is the workspace name you chose during client install (which is also the Helm release name and the Kubernetes namespace — the chart uses the same value for all three). The directory `~/.tracebloc/<workspace>/data/` is created automatically by the installer; just drop your files into it:

```bash
# Host path on the machine where the tracebloc client is installed.
@@ -333,20 +357,20 @@
cp LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/
```

Inside the ingestor pod this directory is mounted at `/data/shared`, so the same files appear as `/data/shared/images/...` and `/data/shared/labels.csv`. Set `SRC_PATH` and `LABEL_FILE` in `ingestor-job.yaml` to point at those in-pod paths (see [Configure Kubernetes](#3-configure-kubernetes) below).

For tabular data the same rule applies — drop the single `labels.csv` (with features and labels) into `~/.tracebloc/<workspace>/data/`.
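
Before building the image, you may want to confirm that every file referenced in the label CSV was actually staged. The sketch below is one way to do that locally and is not part of the ingestor. It assumes the CSV has a header row and a column holding paths relative to the data directory; the function name and the column name `image_path` used in the usage note are hypothetical.

```python
from __future__ import annotations

import csv
from pathlib import Path


def missing_files(label_file: str | Path, data_root: str | Path, path_column: str) -> list[str]:
    """Return the relative paths listed in the label CSV that do not exist
    under data_root. An empty list means every referenced file was found."""
    root = Path(data_root)
    with open(label_file, newline="") as handle:
        reader = csv.DictReader(handle)
        return [
            row[path_column]
            for row in reader
            if not (root / row[path_column]).is_file()
        ]
```

For example, `missing_files(data_dir / "labels.csv", data_dir, "image_path")`, where `data_dir` is your `~/.tracebloc/<workspace>/data/` directory, lists any referenced images that were not copied over.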

### Edit Dockerfile

The Dockerfile only needs to package the ingestion script — the dataset is mounted at runtime, so do **not** `COPY` data into the image:

```dockerfile
# Copy the ingestion script into /app
COPY templates/tabular_classification/tabular_classification.py /app/ingestor.py
```

If the cluster enforces the `restricted` Pod Security Standard (see [Run as non-root](#run-as-non-root) below), also add a non-root user to the Dockerfile, **before** the `# Set the entrypoint` line:

```dockerfile
RUN groupadd -g 1000 app && \
@@ -445,14 +469,14 @@
- `image`: your Docker image (`imagePullPolicy: Always` for Docker Hub, `IfNotPresent` for a local image)
- `CLIENT_ID`, `CLIENT_PASSWORD`: credentials from the [tracebloc client view](https://ai.tracebloc.io/clients)
- `TABLE_NAME`: unique per dataset, no spaces. Train and test data must use different table names
- `LABEL_FILE`: path inside the ingestor pod (under `/data/shared`) to the CSV with file paths and labels. It must match the location of the file you placed in `~/.tracebloc/<workspace>/data/`
- `SRC_PATH`: root inside the pod where the dataset directory is mounted (`/data/shared`)
- `BATCH_SIZE`: number of entries sent to the server per request. Optional; defaults to 4000. Keep it consistent across data types. The right value depends on available CPU memory rather than on, for example, image size, and too large a value can exhaust memory. Values up to 10,000 have been tested, and 5,000 is a safe choice for most systems.
- `LOG_LEVEL`: "INFO" for all logs, "WARNING" for all warnings and errors, "ERROR" for errors only
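
The rules in this list can be expressed as a small settings loader. This is a sketch under assumptions, not the ingestor's actual configuration code: the `JobSettings` class and `load_settings` function are hypothetical, and the fallbacks for `SRC_PATH` (`/data/shared`) and `LOG_LEVEL` (`WARNING`) are chosen for illustration. Only the `BATCH_SIZE` default of 4000 comes from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

POD_MOUNT = "/data/shared"  # where the shared PVC is mounted in the pod
LOG_LEVELS = {"INFO", "WARNING", "ERROR"}


@dataclass(frozen=True)
class JobSettings:
    table_name: str
    label_file: str
    src_path: str
    batch_size: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> JobSettings:
    """Read and sanity-check the job's environment variables."""
    env = os.environ if env is None else env

    table_name = env["TABLE_NAME"]
    if not table_name or " " in table_name:
        raise ValueError("TABLE_NAME must be non-empty and contain no spaces")

    label_file = env["LABEL_FILE"]
    if not label_file.startswith(POD_MOUNT + "/"):
        raise ValueError(f"LABEL_FILE must live under {POD_MOUNT}")

    batch_size = int(env.get("BATCH_SIZE", "4000"))  # documented default
    if batch_size < 1:
        raise ValueError("BATCH_SIZE must be a positive integer")

    log_level = env.get("LOG_LEVEL", "WARNING").upper()  # assumed fallback
    if log_level not in LOG_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {sorted(LOG_LEVELS)}")

    return JobSettings(
        table_name=table_name,
        label_file=label_file,
        src_path=env.get("SRC_PATH", POD_MOUNT),  # assumed fallback
        batch_size=batch_size,
        log_level=log_level,
    )
```

Passing a plain dict instead of the real environment makes it easy to try different combinations before editing `ingestor-job.yaml`.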

### 4. Deploy

Run the ingestor as a Kubernetes Job:

```bash
kubectl apply -f ingestor-job.yaml -n <workspace>
@@ -468,7 +492,7 @@

### Run as non-root

If the namespace enforces the `restricted` [Pod Security Standard](https://kubernetes.io/docs/concepts/security/pod-security-standards/), `kubectl apply` will be admitted but the pod will be rejected with a warning like:

```text
Warning: would violate PodSecurity "restricted:latest":
@@ -494,7 +518,7 @@
type: RuntimeDefault
```

**2. Run the container as a non-root user.** Add the following to the Dockerfile **before** the `# Set the entrypoint` line so the image ships with a UID that satisfies `runAsNonRoot: true`:

```dockerfile
RUN groupadd -g 1000 app && \
@@ -506,7 +530,7 @@

Rebuild and push the image, then re-apply the job.

The data ingestor always runs a validation step before it ingests data and moves files.


#### Verify Deployment
@@ -528,7 +552,7 @@
**Interface displays:**
- Dataset name, ID, and record count
- Data type (Tabular, Image, Text) and purpose (Training/Testing)
- Namespace and GPU requirements

## Best Practices
- Deploy jobs for training and testing simultaneously using different job names