diff --git a/README.md b/README.md
index 787c113..014ba0e 100644
--- a/README.md
+++ b/README.md
@@ -50,7 +50,7 @@ For the threat model, defense layers, per-platform caveats, operator responsibil
 
 ## Deploy
 
-This repo ships the **tracebloc** unified Helm chart (currently `v1.3.1`) — one chart for AKS, EKS, bare-metal, and OpenShift.
+This repo ships the **tracebloc** unified Helm chart (currently `v1.3.5`) — one chart for AKS, EKS, bare-metal, and OpenShift.
 
 ### Quick install
 
@@ -77,16 +77,48 @@ For existing Kubernetes clusters:
 ```bash
 helm repo add tracebloc https://tracebloc.github.io/client
 helm repo update
-helm install my-tracebloc tracebloc/tracebloc \
+helm install my-tracebloc tracebloc/client \
   --namespace tracebloc --create-namespace \
   -f my-values.yaml
 ```
 
 Full deployment guide → **[docs/INSTALL.md](docs/INSTALL.md)** (prerequisites, required values, upgrade & rollback, air-gapped install).
 
+## Ingest a dataset
+
+Once the client is running, get a dataset into your cluster's local MySQL with ~8 lines of YAML and a single `helm install`. No Dockerfile, no Python script — the platform owns the official image, you describe what you want ingested.
+
+The flow is two steps. **First**, stage your raw files on the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). The chart doesn't transport data into the cluster — it points at data the cluster can already see. The simplest pattern is a throwaway `kubectl cp` Pod that mounts the PVC; the chart README links the manifest.
+
+**Second**, describe the dataset and install:
+
+```yaml
+# my-cats-dogs.yaml
+apiVersion: tracebloc.io/v1
+kind: IngestConfig
+category: image_classification
+table: cats_dogs_train
+intent: train
+csv: /data/shared/cats-dogs/labels.csv
+images: /data/shared/cats-dogs/images/
+label: label
+```
+
+```bash
+helm install my-cats-dogs tracebloc/ingestor \
+  --namespace tracebloc \
+  --set-file ingestConfig=./my-cats-dogs.yaml
+```
+
+The ingestor runs once, validates the data, copies files into the destination directory on the PVC, inserts rows into the cluster's MySQL, sends metadata to the tracebloc backend — then exits. The chart artifacts (ConfigMap + post-install hook Job) become inert; nothing keeps running. Repeat per dataset.
+
+Full ingestor docs → **[ingestor/README.md](ingestor/README.md)** (data staging patterns, every supported category, the schema, the update model, verification, override knobs).
+
 | Topic | Where to look |
 |---|---|
 | Production install + required values | [docs/INSTALL.md](docs/INSTALL.md) |
+| Ingest a dataset (declarative YAML) | [ingestor/README.md](ingestor/README.md) |
+| Available ingestion categories + example YAMLs | [tracebloc/data-ingestors templates](https://github.com/tracebloc/data-ingestors/tree/master/templates) |
 | Threat model & operator responsibilities | [docs/SECURITY.md](docs/SECURITY.md) |
 | Migrating from `eks-1.0.x` / `aks-*` charts to `client-1.x` | [docs/MIGRATIONS.md](docs/MIGRATIONS.md) |
 | Per-tenant migration runbook | [docs/migration-tools/README.md](docs/migration-tools/README.md) |
diff --git a/docs/INSTALL.md b/docs/INSTALL.md
index 81690c1..56363cf 100644
--- a/docs/INSTALL.md
+++ b/docs/INSTALL.md
@@ -33,7 +33,7 @@ helm repo add tracebloc https://tracebloc.github.io/client
 helm repo update
 
 # Install with a release name and namespace
-helm install my-tracebloc tracebloc/tracebloc \
+helm install my-tracebloc tracebloc/client \
   --namespace tracebloc \
   --create-namespace \
   -f my-values.yaml
@@ -104,7 +104,7 @@ For platform-specific settings (AKS, EKS, bare-metal, OpenShift), see `client/ci
 ```bash
 # Upgrade to a new chart version (repo install)
 helm repo update
-helm upgrade my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
+helm upgrade my-tracebloc tracebloc/client -n tracebloc -f my-values.yaml
 
 # Upgrade when using a tgz
 helm upgrade my-tracebloc ./tracebloc-2.0.1.tgz -n tracebloc -f my-values.yaml
@@ -228,7 +228,7 @@ After that, users can run:
 
 ```bash
 helm repo add tracebloc https://tracebloc.github.io/client
-helm install my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
+helm install my-tracebloc tracebloc/client -n tracebloc -f my-values.yaml
 ```
 
 ---
@@ -241,3 +241,37 @@ helm install my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
 - [ ] Namespace created or `--create-namespace` used.
 - [ ] Resource requests/limits and storage sizes reviewed in `values.yaml` (e.g. `pvc.mysql`, `pvc.logs`, `pvc.data`).
 - [ ] Lint and template checked: `helm lint ./client -f my-values.yaml` and `helm template my-tracebloc ./client -f my-values.yaml`.
+
+---
+
+## Next: ingest your first dataset
+
+With the client running, the typical follow-up is to land a dataset in the cluster's local MySQL so training jobs can read it. The `tracebloc/ingestor` subchart wraps that flow — customers describe the dataset in ~8 lines of YAML and run a single `helm install`. No Dockerfile, no Python script.
+
+The chart **does not transport data into the cluster** — it points at data already accessible on the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Stage your CSV + image / text / annotation files there first; the ingestor chart README documents the `kubectl cp` pattern and production sync alternatives.
+
+Example: once you've staged a cats-vs-dogs image classification dataset under `/data/shared/cats-dogs/` on the PVC, an `IngestConfig` file (here `my-cats-dogs.yaml`) describes what's there:
+
+```yaml
+# my-cats-dogs.yaml
+apiVersion: tracebloc.io/v1
+kind: IngestConfig
+category: image_classification
+table: cats_dogs_train
+intent: train
+csv: /data/shared/cats-dogs/labels.csv
+images: /data/shared/cats-dogs/images/
+label: label
+```
+
+```bash
+helm install my-cats-dogs tracebloc/ingestor \
+  --namespace tracebloc \
+  --set-file ingestConfig=./my-cats-dogs.yaml
+```
+
+The ingestor runs once: validates the data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset.
+
+Full ingestor documentation, including the schema for every supported category, the auto-update model that keeps the ingestor image current without per-install overrides, and verification commands → **[ingestor/README.md](../ingestor/README.md)**.
+
+Category-specific YAML examples (image classification, object detection, tabular regression, semantic segmentation, text classification, masked language modeling, etc.) → **[tracebloc/data-ingestors templates](https://github.com/tracebloc/data-ingestors/tree/master/templates)**.
diff --git a/ingestor/README.md b/ingestor/README.md
index 10e375d..736f73f 100644
--- a/ingestor/README.md
+++ b/ingestor/README.md
@@ -24,6 +24,75 @@ The SA is shared by every `tracebloc/ingestor` release in the namespace
 which broke as soon as a second ingestor release tried to install
 ([tracebloc/client#129](https://github.com/tracebloc/client/issues/129)).
 
+## Stage your data on the shared PVC
+
+This chart **does not transport data into the cluster.** It points at data already accessible to the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside every pod that uses it, including the ingestor Pod that jobs-manager spawns).
+
+Before running `helm install tracebloc/ingestor`, you need your raw files (the CSV plus any images / texts / annotations / masks / sequences the category requires) under `/data/shared//` on that PVC. The `csv:`, `images:` (etc.) paths in your `ingest.yaml` are paths *inside the ingestor Pod's filesystem*, which is the PVC mount.
+
+How to stage depends on dataset size and your environment. Two common patterns:
+
+### Pattern 1: `kubectl cp` via a pvc-shell pod (small datasets, one-off)
+
+Spin up a throwaway pod that mounts the PVC, copy files in, tear it down:
+
+```yaml
+# /tmp/pvc-shell.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pvc-shell
+  namespace: tracebloc
+spec:
+  restartPolicy: Never
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 65534
+    seccompProfile:
+      type: RuntimeDefault
+  containers:
+    - name: shell
+      image: alpine:3.19
+      command: ["sleep", "3600"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
+      volumeMounts:
+        - name: shared
+          mountPath: /data/shared
+  volumes:
+    - name: shared
+      persistentVolumeClaim:
+        claimName: client-pvc
+```
+
+```bash
+kubectl apply -f /tmp/pvc-shell.yaml
+kubectl -n tracebloc wait --for=condition=Ready pod/pvc-shell --timeout=60s
+
+kubectl -n tracebloc exec pvc-shell -- \
+  mkdir -p /data/shared/my-dataset
+
+kubectl -n tracebloc cp ./local-images pvc-shell:/data/shared/my-dataset/images
+kubectl -n tracebloc cp ./local-labels.csv pvc-shell:/data/shared/my-dataset/labels.csv
+
+# Verify what landed
+kubectl -n tracebloc exec pvc-shell -- ls /data/shared/my-dataset/
+
+kubectl -n tracebloc delete pod pvc-shell
+```
+
+Now `csv: /data/shared/my-dataset/labels.csv` + `images: /data/shared/my-dataset/images/` in your `ingest.yaml` will resolve.
+
+### Pattern 2: Init container with cloud-storage sync (production / large datasets)
+
+For datasets too large to `kubectl cp` (and any production workflow with versioned data), run a one-shot Pod whose init or main container pulls from S3 / GCS / Azure Blob into the PVC. Customers typically wire this into their CI / GitOps tool so the data syncs before the ingestion `helm install` runs. The chart itself stays out of this — it's a precondition, not a chart responsibility.
+
+### Where the PVC name comes from
+
+The default `client-pvc` is set by the parent client chart's PVC block (see `values.yaml#pvc`). If your install renamed it, the ingestor Pod will mount whatever the parent chart configured via `CLIENT_PVC` on jobs-manager. In the rare case of a custom name, `kubectl -n tracebloc get pvc` shows what's actually bound, and that's the value to use as `claimName:` in the pvc-shell manifest above.
+
 ## What this chart owns
 
 | Resource | Owner | Lifecycle |