Merged
36 changes: 34 additions & 2 deletions README.md
@@ -50,7 +50,7 @@ For the threat model, defense layers, per-platform caveats, operator responsibil

## Deploy

-This repo ships the **tracebloc** unified Helm chart (currently `v1.3.1`) — one chart for AKS, EKS, bare-metal, and OpenShift.
+This repo ships the **tracebloc** unified Helm chart (currently `v1.3.5`) — one chart for AKS, EKS, bare-metal, and OpenShift.

### Quick install

@@ -77,16 +77,48 @@ For existing Kubernetes clusters:
```bash
helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
-helm install my-tracebloc tracebloc/tracebloc \
+helm install my-tracebloc tracebloc/client \
  --namespace tracebloc --create-namespace \
  -f my-values.yaml
```

Full deployment guide → **[docs/INSTALL.md](docs/INSTALL.md)** (prerequisites, required values, upgrade & rollback, air-gapped install).

## Ingest a dataset

Once the client is running, you can load a dataset into your cluster's local MySQL with ~8 lines of YAML and a single `helm install`. No Dockerfile, no Python script: the platform owns the official image, and you describe what you want ingested.

The flow is two steps. **First**, stage your raw files on the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). The chart doesn't transport data into the cluster; it points at data the cluster can already see. The simplest pattern is a throwaway Pod that mounts the PVC and serves as a `kubectl cp` target; the ingestor chart README includes the manifest.

**Second**, describe the dataset and install:

```yaml
# my-cats-dogs.yaml
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
```

```bash
helm install my-cats-dogs tracebloc/ingestor \
  --namespace tracebloc \
  --set-file ingestConfig=./my-cats-dogs.yaml
```

The ingestor runs once: it validates the data, copies files into the destination directory on the PVC, inserts rows into the cluster's MySQL, sends metadata to the tracebloc backend, and then exits. The chart artifacts (ConfigMap + post-install hook Job) become inert; nothing keeps running. Repeat per dataset.
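
Because the `csv:` and `images:` paths are resolved inside the ingestor Pod, a path that points outside the PVC mount would only surface once the Job is already running. A local pre-flight check can catch that earlier. The sketch below is not part of the chart: `check_ingest_paths` is a hypothetical helper, and it assumes the flat `key: value` layout of the example above and that every `csv:` / `images:` path should sit under `/data/shared/`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the chart.
# Assumes the flat "key: value" layout shown in the example config and
# that csv:/images: paths must live under the PVC mount point.
check_ingest_paths() {
  local file="$1" mount="${2:-/data/shared/}" status=0 line key value
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      csv:*|images:*)
        key="${line%%:*}"
        value="${line#*:}"
        # Trim leading whitespace after the colon.
        value="${value#"${value%%[![:space:]]*}"}"
        case "$value" in
          "$mount"*) ;;
          *) echo "$key path is outside $mount: $value" >&2; status=1 ;;
        esac
        ;;
    esac
  done < "$file"
  return "$status"
}
```

One possible use is to gate the install on it, for example `check_ingest_paths ./my-cats-dogs.yaml && helm install my-cats-dogs tracebloc/ingestor ...`. The check only looks at path prefixes; it does not confirm that the files exist on the PVC.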

Full ingestor docs → **[ingestor/README.md](ingestor/README.md)** (data staging patterns, every supported category, the schema, the update model, verification, override knobs).

| Topic | Where to look |
|---|---|
| Production install + required values | [docs/INSTALL.md](docs/INSTALL.md) |
| Ingest a dataset (declarative YAML) | [ingestor/README.md](ingestor/README.md) |
| Available ingestion categories + example YAMLs | [tracebloc/data-ingestors templates](https://github.com/tracebloc/data-ingestors/tree/master/templates) |
| Threat model & operator responsibilities | [docs/SECURITY.md](docs/SECURITY.md) |
| Migrating from `eks-1.0.x` / `aks-*` charts to `client-1.x` | [docs/MIGRATIONS.md](docs/MIGRATIONS.md) |
| Per-tenant migration runbook | [docs/migration-tools/README.md](docs/migration-tools/README.md) |
40 changes: 37 additions & 3 deletions docs/INSTALL.md
@@ -33,7 +33,7 @@ helm repo add tracebloc https://tracebloc.github.io/client
helm repo update

# Install with a release name and namespace
-helm install my-tracebloc tracebloc/tracebloc \
+helm install my-tracebloc tracebloc/client \
  --namespace tracebloc \
  --create-namespace \
  -f my-values.yaml
@@ -104,7 +104,7 @@ For platform-specific settings (AKS, EKS, bare-metal, OpenShift), see `client/ci
```bash
# Upgrade to a new chart version (repo install)
helm repo update
-helm upgrade my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
+helm upgrade my-tracebloc tracebloc/client -n tracebloc -f my-values.yaml

# Upgrade when using a tgz
helm upgrade my-tracebloc ./tracebloc-2.0.1.tgz -n tracebloc -f my-values.yaml
@@ -228,7 +228,7 @@ After that, users can run:

```bash
helm repo add tracebloc https://tracebloc.github.io/client
-helm install my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
+helm install my-tracebloc tracebloc/client -n tracebloc -f my-values.yaml
```

---
@@ -241,3 +241,37 @@ helm install my-tracebloc tracebloc/tracebloc -n tracebloc -f my-values.yaml
- [ ] Namespace created or `--create-namespace` used.
- [ ] Resource requests/limits and storage sizes reviewed in `values.yaml` (e.g. `pvc.mysql`, `pvc.logs`, `pvc.data`).
- [ ] Lint and template checked: `helm lint ./client -f my-values.yaml` and `helm template my-tracebloc ./client -f my-values.yaml`.

---

## Next: ingest your first dataset

With the client running, the typical follow-up is to land a dataset in the cluster's local MySQL so training jobs can read it. The `tracebloc/ingestor` subchart wraps that flow — customers describe the dataset in ~8 lines of YAML and run a single `helm install`. No Dockerfile, no Python script.

The chart **does not transport data into the cluster** — it points at data already accessible on the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside the ingestor Pod). Stage your CSV + image / text / annotation files there first; the ingestor chart README documents the `kubectl cp` pattern and production sync alternatives.

Example: once you've staged a cats-vs-dogs image classification dataset under `/data/shared/cats-dogs/` on the PVC, the ingest config (saved here as `my-cats-dogs.yaml`) describes what's there:

```yaml
# my-cats-dogs.yaml
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
```

```bash
helm install my-cats-dogs tracebloc/ingestor \
  --namespace tracebloc \
  --set-file ingestConfig=./my-cats-dogs.yaml
```

The ingestor runs once: validates the data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Repeat per dataset.

Full ingestor documentation, including the schema for every supported category, the auto-update model that keeps the ingestor image current without per-install overrides, and verification commands → **[ingestor/README.md](../ingestor/README.md)**.

Category-specific YAML examples (image classification, object detection, tabular regression, semantic segmentation, text classification, masked language modeling, etc.) → **[tracebloc/data-ingestors templates](https://github.com/tracebloc/data-ingestors/tree/master/templates)**.
69 changes: 69 additions & 0 deletions ingestor/README.md
@@ -24,6 +24,75 @@ The SA is shared by every `tracebloc/ingestor` release in the namespace
which broke as soon as a second ingestor release tried to install
([tracebloc/client#129](https://github.com/tracebloc/client/issues/129)).

## Stage your data on the shared PVC

This chart **does not transport data into the cluster.** It points at data already available on the cluster's shared PVC (`client-pvc` by default, mounted at `/data/shared/` inside every pod that uses it, including the ingestor Pod that jobs-manager spawns).

Before running `helm install tracebloc/ingestor`, you need your raw files (the CSV plus any images / texts / annotations / masks / sequences the category requires) under `/data/shared/<your-prefix>/` on that PVC. The `csv:`, `images:` (etc.) paths in your `ingest.yaml` are paths *inside the ingestor Pod's filesystem*, which is the PVC mount.

How to stage depends on dataset size and your environment. Two common patterns:

### Pattern 1: `kubectl cp` via a pvc-shell pod (small datasets, one-off)

Spin up a throwaway pod that mounts the PVC, copy files in, tear it down:

```yaml
# /tmp/pvc-shell.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-shell
  namespace: tracebloc
spec:
  restartPolicy: Never
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: shell
      image: alpine:3.19
      command: ["sleep", "3600"]
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: shared
          mountPath: /data/shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: client-pvc
```

```bash
kubectl apply -f /tmp/pvc-shell.yaml
kubectl -n tracebloc wait --for=condition=Ready pod/pvc-shell --timeout=60s

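kubectl -n tracebloc exec pvc-shell -- \
  mkdir -p /data/shared/my-dataset

# Copy the directory to a destination that does not exist yet. If the
# destination already exists as a directory, kubectl cp nests the source
# directory inside it (for example images/local-images/).
kubectl -n tracebloc cp ./local-images pvc-shell:/data/shared/my-dataset/images
kubectl -n tracebloc cp ./local-labels.csv pvc-shell:/data/shared/my-dataset/labels.csv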

# Verify what landed
kubectl -n tracebloc exec pvc-shell -- ls /data/shared/my-dataset/

kubectl -n tracebloc delete pod pvc-shell
```

Now `csv: /data/shared/my-dataset/labels.csv` + `images: /data/shared/my-dataset/images/` in your `ingest.yaml` will resolve.
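
Before copying, it can help to confirm locally that every file the CSV refers to is actually present, since a mismatch is cheaper to find on a workstation than inside the cluster. The following is only a sketch: `missing_images` is a hypothetical helper, and it assumes the first CSV column holds the image file name and the first row is a header. The real column layout depends on the category schema, so adjust the `cut` field accordingly.

```bash
# Hypothetical pre-copy check, not part of the chart.
# Assumption: column 1 of the CSV is the image file name, row 1 is a header.
# Prints one line per file that the CSV lists but the directory lacks.
missing_images() {
  local csv="$1" dir="$2" name
  tail -n +2 "$csv" | tr -d '\r' | cut -d, -f1 |
    while IFS= read -r name || [ -n "$name" ]; do
      [ -n "$name" ] || continue
      [ -f "$dir/$name" ] || echo "$name"
    done
}
```

Running `missing_images ./local-labels.csv ./local-images` prints nothing when the CSV and the directory agree. It does not handle quoted CSV fields that contain commas.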

### Pattern 2: Init container with cloud-storage sync (production / large datasets)

For datasets too large to `kubectl cp` (and any production workflow with versioned data), run a one-shot Pod whose init or main container pulls from S3 / GCS / Azure Blob into the PVC. Customers typically wire this into their CI / GitOps tool so the data syncs before the ingestion `helm install` runs. The chart itself stays out of this — it's a precondition, not a chart responsibility.
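
As one illustration of this pattern, the sketch below prints a one-shot Job manifest that syncs an S3 prefix into the PVC. Nothing here ships with the chart: `render_sync_job`, the `stage-<prefix>` Job name, and the choice of the `amazon/aws-cli` image are assumptions made for the example. Credentials (for instance a ServiceAccount with read access to the bucket) are assumed to be configured separately, and a namespace that enforces restricted Pod Security would also need a `securityContext` like the one in the pvc-shell manifest.

```bash
# Hypothetical generator, not part of the chart.
# $1: S3 URI to sync from, $2: prefix under /data/shared/ (lowercase letters,
# digits, and '-' only, because it is reused in the Job name),
# $3: PVC claim name (defaults to client-pvc).
render_sync_job() {
  local s3_uri="$1" prefix="$2" claim="${3:-client-pvc}"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: stage-${prefix}
  namespace: tracebloc
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: sync
          # Assumed image; pin a specific tag in real use.
          image: amazon/aws-cli:latest
          args: ["s3", "sync", "${s3_uri}", "/data/shared/${prefix}/"]
          volumeMounts:
            - name: shared
              mountPath: /data/shared
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: ${claim}
EOF
}
```

A pipeline could run `render_sync_job s3://my-bucket/cats-dogs cats-dogs | kubectl apply -f -`, then `kubectl -n tracebloc wait --for=condition=complete job/stage-cats-dogs --timeout=30m` before the ingestion `helm install`. The bucket name and timeout are placeholders.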

### Where the PVC name comes from

The default `client-pvc` is set by the parent client chart's PVC block (see `values.yaml#pvc`). If your install renamed it, the ingestor Pod will mount whatever the parent chart configured via `CLIENT_PVC` on jobs-manager. In the rare case of a custom name, `kubectl -n tracebloc get pvc` shows what's actually bound, and that's the value to use as `claimName:` in the pvc-shell manifest above.
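
If the claim does have a custom name, the pvc-shell manifest can be patched on the fly instead of edited by hand. This is a small sketch with a hypothetical helper name, `set_claim_name`; it assumes the manifest contains a `claimName:` line as in the example above, and that the new name comes from something like `kubectl -n tracebloc get pvc -o name`.

```bash
# Hypothetical helper: reads a manifest on stdin and rewrites every
# "claimName:" value to $1, keeping the original indentation.
set_claim_name() {
  sed "s/^\([[:space:]]*claimName:\).*/\1 $1/"
}
```

For example, `set_claim_name my-custom-pvc < /tmp/pvc-shell.yaml | kubectl apply -f -` applies the manifest with the replaced claim name, where `my-custom-pvc` is a placeholder.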

## What this chart owns

| Resource | Owner | Lifecycle |