Problem
Today's declarative-YAML ingestor (helm install tracebloc/ingestor --set-file ingestConfig=./ingest.yaml) is a real improvement over the per-customer-Dockerfile pattern, but it serves exactly one persona well: a Kubernetes-fluent ML engineer at a customer who already has data staged on a shared cluster filesystem. For everyone else — data scientists with local data, ML engineers with cloud sources, platform engineers running GitOps, air-gapped operators, and anyone iterating fast on many datasets — the chart hands them at least three foreign mental models (Helm, Kubernetes PVCs, YAML schema) plus a "stage data on the PVC yourself" prerequisite that doesn't scale beyond the cats/dogs sample.
This epic introduces a CLI (tracebloc) that collapses the customer's mental load into:
tracebloc dataset push ./my-data \
  --table cats_dogs_train \
  --category image_classification \
  --intent train \
  --label-column label
The CLI handles authentication, cluster discovery, data staging, schema validation, submission, watching, and summary reporting. The customer never touches Helm, never edits YAML, never runs kubectl cp.
The chart stays — it's still the right surface for some workflows (GitOps-managed installs, batch deploys via Helm wrappers). The CLI is a sibling, not a replacement.
Goals (v0.1)
- One command, zero prerequisites beyond kubectl access. A customer who can kubectl get pods against the cluster should be able to tracebloc dataset push ./my-data against it.
- Local-first validation. Schema errors surface on the customer's laptop in milliseconds, not after a multi-second cluster round trip.
- Real-time feedback. Live progress during staging + ingestion; final summary in the same terminal session.
- Multi-platform single binary. Linux/macOS/Windows, x86_64/arm64. Brew, scoop, direct curl install, GitHub Releases.
- No new credentials. Reuses the customer's existing kubeconfig + the parent client release's ingestor ServiceAccount token (via TokenRequest API or static token from the namespace).
Non-goals (v0.1)
- Cloud-storage source providers (S3/GCS/HTTPS). Deferred to v0.2.
- Custom processors / Python script ingestion. Deferred indefinitely — that's the legacy escape hatch.
- GitOps CRD (kind: IngestionRun). Separate epic.
- Web UI. Separate epic.
- Auto-update mechanism (tracebloc upgrade). v0.2.
- Multi-tenancy / multi-context support. v0.2.
- All categories other than image_classification. Image first because it's the dominant case + we validated it end-to-end this week. Other categories come as one-PR additions in v0.2.
Decisions
| Decision | Choice | Why |
| --- | --- | --- |
| Repo | New: tracebloc/cli | Different language + release cadence from chart/data-ingestors; matches kubectl/helm/gh organizational pattern |
| Language | Go | Matches k8s ecosystem, single static binary, trivial cross-compile, excellent client-go |
| Auth | Kubeconfig + ingestor SA token via TokenRequest API | Zero new credentials; reuses today's TokenReview validation in jobs-manager |
| MVP scope | tracebloc login + tracebloc dataset push <path> + watch | Smallest shippable that solves the dominant case |
| Distribution | Homebrew tap + GitHub Releases (cross-compiled binaries) + install.sh | Matches gh/kubectl distribution model |
Architecture
┌────────────────────────────────────────────────────────┐
│ tracebloc CLI (Go single binary, ~10MB)                │
│                                                        │
│ Commands:                                              │
│   tracebloc login                                      │
│   tracebloc dataset push <path> [flags]                │
│   tracebloc ingest validate <path>  (local-only)       │
│   tracebloc version                                    │
│   tracebloc completion {bash|zsh|fish}                 │
│                                                        │
│ Embedded:                                              │
│   - ingest.v1.json schema (validated at build)         │
│   - Default values per category                        │
│                                                        │
│ Talks to:                                              │
│   - Customer's kubeconfig (k8s API for discovery,      │
│     TokenRequest, pod exec, log streaming)             │
│   - jobs-manager directly (POST submit-ingestion-run)  │
└────────────────────────────────────────────────────────┘
                            │
                            │  Same protocol as the chart
                            │  (POST /internal/submit-ingestion-run,
                            │   validated by jobs-manager's existing
                            │   TokenReview + ingest.v1.json check)
                            ▼
             ┌─────────────────────────────┐
             │  jobs-manager (unchanged)   │
             └─────────────────────────────┘
The CLI and the chart are sibling interfaces. Both POST the same body. Both are validated by the same jobs-manager logic. The protocol is the stable contract.
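To make the shared contract concrete, here is a minimal sketch of how the CLI might build and POST the body. It uses only the two field names this plan mentions (ingest_config, idempotency_key) and the chart's <release>-<unix-epoch> key pattern. The "table" field inside ingest_config, the bearer-token header scheme, and the echo server standing in for jobs-manager are assumptions for illustration, not the actual API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// submitBody carries the two top-level fields this plan names for the request.
type submitBody struct {
	IngestConfig   map[string]any `json:"ingest_config"`
	IdempotencyKey string         `json:"idempotency_key"`
}

// newIdempotencyKey follows the chart's <release>-<unix-epoch> pattern.
func newIdempotencyKey(release string, epoch int64) string {
	return fmt.Sprintf("%s-%d", release, epoch)
}

// encodeBody serializes the request body to JSON.
func encodeBody(cfg map[string]any, key string) ([]byte, error) {
	return json.Marshal(submitBody{IngestConfig: cfg, IdempotencyKey: key})
}

// submit POSTs the payload and returns the status code and raw response.
// Sending the ServiceAccount token as a bearer token is an assumption.
func submit(client *http.Client, baseURL, token string, payload []byte) (int, string, error) {
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/internal/submit-ingestion-run", bytes.NewReader(payload))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(raw), err
}

func main() {
	// Stand-in for jobs-manager: echoes the request body back.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(w, r.Body)
	}))
	defer srv.Close()

	// The key is stamped immediately before the POST.
	key := newIdempotencyKey("client", time.Now().Unix())
	payload, err := encodeBody(map[string]any{"table": "cats_dogs_train"}, key) // hypothetical field
	if err != nil {
		panic(err)
	}
	status, body, err := submit(srv.Client(), srv.URL, "example-token", payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(status, body)
}
```

Keeping the body construction in one small function would let the chart's template and the CLI be compared against the same golden JSON in tests.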
Implementation phases
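Each phase = 1-3 PRs. Estimate ~2.5-3.5 weeks of focused work for v0.1 if phases overlap; done strictly in sequence, the per-phase estimates below add up to ~23 working days.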
Phase 0 — Repo bootstrap (~2 days)
- Create tracebloc/cli repo
- Go module, cmd/tracebloc/main.go, cobra for command tree
- CI: go vet, go test, golangci-lint, cross-compile to 5 platforms (linux amd64/arm64, darwin amd64/arm64, windows amd64)
- License (Apache-2.0), README skeleton, CONTRIBUTING, CODEOWNERS, kanban routing workflows mirroring the other repos
- Skeleton tracebloc version command (proves binary builds + runs)
- Smoke test in CI: build, run tracebloc version, assert it prints
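A minimal sketch of what the skeleton version command could print, assuming the release workflow stamps the version through the Go linker's -X flag. The variable name and output format are illustrative choices, and the cobra wiring is left out so the example stays stdlib-only.

```go
package main

import (
	"fmt"
	"runtime"
)

// version defaults to "dev" and can be overridden at build time, for example:
//   go build -ldflags "-X main.version=v0.1.0" ./cmd/tracebloc
var version = "dev"

// versionString reports the CLI version plus the platform the binary was built for.
func versionString() string {
	return fmt.Sprintf("tracebloc %s (%s/%s)", version, runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Println(versionString())
}
```

Printing GOOS/GOARCH gives the CI smoke test something to assert per platform in the cross-compile matrix.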
Phase 1 — Local schema validator (~3 days)
- Vendor ingest.v1.json from tracebloc/data-ingestors at build time (go:embed)
- Implement validator using github.com/santhosh-tekuri/jsonschema/v6 (a widely used Go JSON Schema library)
- tracebloc ingest validate <path> reads the YAML, validates, prints errors with JSON-pointer paths
- Error formatting mirrors data-ingestors' _format_errors output for consistency
- Unit tests against every example YAML in data-ingestors/examples/yaml/
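To illustrate the "errors with JSON-pointer paths" output, here is a small stdlib-only sketch. It is a hand-rolled stand-in, not the jsonschema library and not the real ingest.v1.json rules. The required keys (table, category, intent) are guessed from the CLI flags, and the message wording is hypothetical; only the RFC 6901 pointer escaping is standard.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pointer builds an RFC 6901 JSON Pointer from path segments,
// escaping "~" as "~0" first and then "/" as "~1".
func pointer(segments ...string) string {
	var b strings.Builder
	for _, s := range segments {
		s = strings.ReplaceAll(s, "~", "~0")
		s = strings.ReplaceAll(s, "/", "~1")
		b.WriteString("/" + s)
	}
	return b.String()
}

// validate checks a decoded config against a few illustrative rules and
// returns one "<pointer>: <message>" line per problem, sorted for stable output.
// The rules are assumptions; the real CLI would delegate to the embedded schema.
func validate(cfg map[string]any) []string {
	var errs []string
	for _, key := range []string{"table", "category", "intent"} {
		v, ok := cfg[key]
		if !ok {
			errs = append(errs, pointer(key)+": required property is missing")
			continue
		}
		if _, isString := v.(string); !isString {
			errs = append(errs, pointer(key)+": expected a string")
		}
	}
	// v0.1 scopes categories down to image_classification.
	if c, ok := cfg["category"].(string); ok && c != "image_classification" {
		errs = append(errs, pointer("category")+": unsupported value "+fmt.Sprintf("%q", c))
	}
	sort.Strings(errs)
	return errs
}

func main() {
	cfg := map[string]any{"table": "cats_dogs_train", "category": "tabular"}
	for _, e := range validate(cfg) {
		fmt.Println(e)
	}
}
```

Sorting the messages keeps the output deterministic, which makes it easier to compare against data-ingestors' formatting in unit tests.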
Phase 2 — Cluster discovery + auth (~4 days)
- Read kubeconfig (~/.kube/config or $KUBECONFIG)
- Discover the tracebloc parent release in the current namespace: list Helm releases, find the one whose chart name is client
- Mint a token for the ingestor ServiceAccount via the k8s TokenRequest API (or read a pre-existing token if the cluster's RBAC doesn't allow TokenRequest)
- tracebloc cluster info diagnostic: prints discovered cluster, namespace, release, jobs-manager endpoint, SA name, token expiration
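The kubeconfig lookup in the first step can be sketched with the standard library alone. In practice client-go's clientcmd loading rules already implement this precedence, so the function below (a hypothetical helper, not part of any library) only shows the behaviour the CLI should end up with: $KUBECONFIG wins and may list several files, otherwise fall back to ~/.kube/config.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeconfigPaths returns candidate kubeconfig files in precedence order.
// envValue is the raw $KUBECONFIG value, which may hold several paths joined
// by the OS path-list separator (":" on Unix, ";" on Windows).
func kubeconfigPaths(envValue, homeDir string) []string {
	var paths []string
	for _, p := range filepath.SplitList(envValue) {
		if p != "" {
			paths = append(paths, p)
		}
	}
	if len(paths) > 0 {
		return paths
	}
	// Default location when $KUBECONFIG is unset or empty.
	return []string{filepath.Join(homeDir, ".kube", "config")}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot resolve home directory:", err)
		return
	}
	for _, p := range kubeconfigPaths(os.Getenv("KUBECONFIG"), home) {
		fmt.Println(p)
	}
}
```

Surfacing the resolved paths in tracebloc cluster info would make it obvious which file the CLI actually read.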
Phase 3 — Data staging (~5 days)
- Create an ephemeral Pod in the cluster's namespace mounting the shared PVC (client-pvc by default, discoverable from the parent chart's values)
- kubectl cp semantics: stream local files to the Pod's /data/shared/<table>/ path
- Progress bar (use github.com/schollz/progressbar/v3)
- Cleanup of the ephemeral Pod on exit (including SIGINT)
- Test against the cats/dogs sample we used this week
- Document the size limit (kubectl cp + client-go memory profile): anything above ~1GB needs the cloud-source story deferred to v0.2
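kubectl cp works by piping a tar stream into a tar process running in the Pod, so "kubectl cp semantics" here mostly means producing that stream. The sketch below writes a directory to any io.Writer as tar, copying file contents through rather than buffering whole files. The exec plumbing via client-go is omitted. The function names and the destPrefix layout are assumptions, and demo exists only to exercise the round trip locally.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path"
	"path/filepath"
)

// writeTar walks srcDir and writes every regular file to w as a tar stream.
// Entry names are relative to srcDir and placed under destPrefix.
// It returns the number of files written.
func writeTar(w io.Writer, srcDir, destPrefix string) (int, error) {
	tw := tar.NewWriter(w)
	count := 0
	err := filepath.WalkDir(srcDir, func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, and special files
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(srcDir, p)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = path.Join(destPrefix, filepath.ToSlash(rel))
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		// io.Copy streams in chunks, so memory use does not grow with file size.
		if _, err := io.Copy(tw, f); err != nil {
			return err
		}
		count++
		return nil
	})
	if err != nil {
		return count, err
	}
	return count, tw.Close()
}

// demo stages two small files in a temp dir, tars them, and reads the names back.
func demo() (int, []string, error) {
	dir, err := os.MkdirTemp("", "tracebloc-stage")
	if err != nil {
		return 0, nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.MkdirAll(filepath.Join(dir, "sub"), 0o755); err != nil {
		return 0, nil, err
	}
	for _, name := range []string{"a.jpg", filepath.Join("sub", "b.jpg")} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("x"), 0o644); err != nil {
			return 0, nil, err
		}
	}
	var buf bytes.Buffer // in the CLI this would be the exec session's stdin
	n, err := writeTar(&buf, dir, "cats_dogs_train")
	if err != nil {
		return n, nil, err
	}
	var names []string
	tr := tar.NewReader(&buf)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return n, names, err
		}
		names = append(names, hdr.Name)
	}
	return n, names, nil
}

func main() {
	n, names, err := demo()
	fmt.Println(n, names, err)
}
```

Because the writer is a plain io.Writer, a progress bar can wrap it without changing the walk logic.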
Phase 4 — Submit + watch (~4 days)
- Build the body.json the chart builds today (ingest_config + idempotency_key)
- POST to jobs-manager with the SA token; show response
- Watch the resulting ingestor Job (named in the response) via k8s API
- Stream the Pod's logs to the customer's terminal
- Parse the ingestion summary; print rows-ingested + files-transferred + any failures
- Exit code maps to outcome (0 = success; each non-zero code identifies a failure category)
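One way the exit-code mapping could look, as a sketch. The failure categories and their numeric values are hypothetical, since the plan does not enumerate them. The technique shown is carrying a category on a wrapped error and recovering it with errors.As at the top of main.

```go
package main

import (
	"errors"
	"fmt"
)

// category identifies a class of failure. Names and numbers are illustrative.
type category int

const (
	catValidation category = iota + 2 // 2; exit code 1 is left for uncategorized errors
	catAuth                           // 3
	catStaging                        // 4
	catIngestion                      // 5
)

// runError attaches a failure category to an underlying error.
type runError struct {
	cat category
	err error
}

func (e *runError) Error() string { return e.err.Error() }
func (e *runError) Unwrap() error { return e.err }

// fail creates a categorized error.
func fail(cat category, msg string) error {
	return &runError{cat: cat, err: errors.New(msg)}
}

// exitCode maps an error to a process exit code: 0 for nil, the category
// number for categorized errors (even when wrapped), and 1 otherwise.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var re *runError
	if errors.As(err, &re) {
		return int(re.cat)
	}
	return 1
}

func main() {
	err := fmt.Errorf("push failed: %w", fail(catStaging, "staging pod never became ready"))
	// A real main would call os.Exit(exitCode(err)); the demo only prints the code.
	fmt.Println(exitCode(nil), exitCode(err), exitCode(errors.New("unexpected")))
}
```

Stable per-category codes let wrapper scripts distinguish, say, a local validation failure from a cluster-side ingestion failure without parsing log text.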
Phase 5 — Distribution (~2 days)
- GitHub Releases workflow: triggered by v*.*.* tags, builds all 5 platform binaries, signs with cosign keyless OIDC (matching the data-ingestors image release flow), attaches to release
- Homebrew tap (separate repo tracebloc/homebrew-tap): brew install tracebloc/tap/tracebloc
- install.sh for curl-bash install: hosted at install.tracebloc.io (S3 + CloudFront)
- Verify signatures in install.sh before unpacking
Phase 6 — First customer test + iteration (~3 days)
- End-to-end on the same EKS cluster we validated this week
- One internal customer dogfoods it
- Fix whatever shakes out
- Tag v0.1.0
Open questions
- Schema-version negotiation. Today jobs-manager validates against its embedded schema. If the CLI ships with a newer schema than the cluster's jobs-manager, what happens? Need either (a) jobs-manager exposes its supported schema version via a /v1/version endpoint the CLI queries, or (b) CLI ships pinned to a specific schema version and we accept slight churn. Recommend (a), though it requires a small jobs-manager addition.
- TokenRequest vs static SA secret. TokenRequest requires the customer's kubeconfig user to have create permission on serviceaccounts/token. Most clusters' default ClusterRole-binding doesn't grant this to humans. Fallback: read the existing Secret of type kubernetes.io/service-account-token if one exists (older k8s, which auto-created these Secrets before v1.24); or have the parent client chart create a long-lived token secret for ingestion (modern k8s, opt-in). Decide before Phase 2.
- Backend-side auth. v0.1 reuses the SA-token-to-jobs-manager flow, which doesn't need backend auth at all (the ingestor Job mints its own backend token via clientId/password env). So tracebloc login is a no-op in v0.1, purely a placeholder for future OAuth. Document this so users aren't surprised.
- Concurrency model. What if two tracebloc dataset push invocations race against the same table? Today the chart handles this via idempotency-key uniqueness; the CLI should generate a fresh key per invocation (matching the chart's <release>-<unix-epoch> pattern, but stamped at the moment of POST).
- tracebloc dataset push vs tracebloc ingest. Command naming. "Dataset push" feels more product-like (matches docker push, git push). "Ingest" matches the API surface (/internal/submit-ingestion-run). Recommend "dataset push" for the customer surface, "ingest" as an alias for advanced users.
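For the schema-version question, option (a) could be sketched as below. The /v1/version path comes from the question itself; the response field name, the integer versioning, and the rule that a CLI with an older-or-equal schema is accepted are all assumptions. The httptest server stands in for jobs-manager.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// cliSchemaVersion is the schema version embedded in this CLI build (illustrative).
const cliSchemaVersion = 1

// versionInfo is a hypothetical response shape for GET /v1/version.
type versionInfo struct {
	IngestSchemaVersion int `json:"ingest_schema_version"`
}

// fetchServerVersion asks jobs-manager which schema version it supports.
func fetchServerVersion(client *http.Client, baseURL string) (int, error) {
	resp, err := client.Get(baseURL + "/v1/version")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("version endpoint returned %s", resp.Status)
	}
	var info versionInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return 0, err
	}
	return info.IngestSchemaVersion, nil
}

// checkCompat rejects a CLI whose schema is newer than the server's.
// Accepting older CLI schemas assumes jobs-manager stays backward compatible.
func checkCompat(cli, server int) error {
	if cli > server {
		return fmt.Errorf("CLI schema v%d is newer than jobs-manager's v%d", cli, server)
	}
	return nil
}

func main() {
	// Stand-in for jobs-manager reporting schema version 1.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(versionInfo{IngestSchemaVersion: 1})
	}))
	defer srv.Close()

	server, err := fetchServerVersion(srv.Client(), srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("server schema:", server, "compatible:", checkCompat(cliSchemaVersion, server) == nil)
}
```

Running this check before staging any data would fail fast on a mismatch instead of after a long upload.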
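Acceptance criteria for v0.1
- A customer can brew install tracebloc/tap/tracebloc
- They can run tracebloc dataset push ./cats-dogs --table cats_dogs_train --category image_classification --intent train --label-column label against an EKS cluster where they have kubectl access
- The existing chart (tracebloc/ingestor) keeps working unchanged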
Future epics (post-v0.1)
- Cloud-storage source providers: S3/GCS/HTTPS in the schema + CLI plumbing. Solves the staging-crisis pain point.
- tracebloc dataset list / show / delete: observability for existing datasets.
- GitOps CRD: a kind: IngestionRun controller; ArgoCD/Flux managed datasets.
- Web UI for datasets: for users who shouldn't touch CLI/YAML.
- Custom processors without Docker: Python uploaded as ConfigMap, mounted into the standard ingestor image.
Reasoning notes
- The protocol (POST + ingest.v1.json) is the stable contract. Multiple interfaces (chart, CLI, future CRD, future Web UI) translate to the same call. This means adding the CLI doesn't deprecate anything, doesn't break anything, and sets up the layered-interface architecture we'd want for everything else.
- Go was chosen over Python for distribution simplicity (single static binary, no runtime dependency). Data scientists who already have Python can still install via brew or curl install.sh | sh.
- v0.1 deliberately scopes out cloud sources, multi-category, GitOps, and Web UI. Smallest shippable thing that exercises the layered-interface idea + serves the dominant case. Real customer usage informs v0.2.