Epic: tracebloc CLI — tracebloc dataset push for one-command ingestion #147

@saadqbal

Problem

Today's declarative-YAML ingestor (helm install tracebloc/ingestor --set-file ingestConfig=./ingest.yaml) is a real improvement over the per-customer-Dockerfile pattern, but it serves exactly one persona well: a Kubernetes-fluent ML engineer at a customer who already has data staged on a shared cluster filesystem. For everyone else — data scientists with local data, ML engineers with cloud sources, platform engineers running GitOps, air-gapped operators, and anyone iterating fast on many datasets — the chart hands them at least three foreign mental models (Helm, Kubernetes PVCs, YAML schema) plus a "stage data on the PVC yourself" prerequisite that doesn't scale beyond the cats/dogs sample.

This epic introduces a CLI (tracebloc) that collapses the customer's mental load into:

tracebloc dataset push ./my-data \
  --table cats_dogs_train \
  --category image_classification \
  --intent train \
  --label-column label

The CLI handles authentication, cluster discovery, data staging, schema validation, submission, watching, and summary reporting. The customer never touches Helm, never edits YAML, never runs kubectl cp.
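As a rough illustration of the command surface above, the sketch below parses the same arguments with Go's standard `flag` package. The real CLI is planned on cobra, so this is only a dependency-free stand-in. The `PushOptions` struct, the `parsePushArgs` helper, and the decision to require `--table` are assumptions for illustration; the v0.1 restriction to `image_classification` comes from the non-goals.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// supportedCategory is the only category v0.1 accepts, per the non-goals.
const supportedCategory = "image_classification"

// PushOptions mirrors the flags in the example command.
// The struct and its field names are hypothetical.
type PushOptions struct {
	Path        string
	Table       string
	Category    string
	Intent      string
	LabelColumn string
}

// parsePushArgs parses the arguments that follow "tracebloc dataset push".
// The stdlib flag package stops at the first non-flag argument, so the
// leading <path> is taken off before the flags are parsed.
func parsePushArgs(args []string) (PushOptions, error) {
	var o PushOptions
	if len(args) == 0 || strings.HasPrefix(args[0], "-") {
		return o, errors.New("usage: tracebloc dataset push <path> [flags]")
	}
	o.Path = args[0]

	fs := flag.NewFlagSet("dataset push", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // errors are returned to the caller, not printed
	fs.StringVar(&o.Table, "table", "", "destination table name")
	fs.StringVar(&o.Category, "category", "", "dataset category")
	fs.StringVar(&o.Intent, "intent", "", "dataset intent, e.g. train")
	fs.StringVar(&o.LabelColumn, "label-column", "", "name of the label column")
	if err := fs.Parse(args[1:]); err != nil {
		return o, err
	}
	if fs.NArg() > 0 {
		return o, fmt.Errorf("unexpected argument %q", fs.Arg(0))
	}
	if o.Table == "" { // assumption: --table is mandatory
		return o, errors.New("--table is required")
	}
	if o.Category != supportedCategory {
		return o, fmt.Errorf("category %q is not supported in v0.1 (only %s)",
			o.Category, supportedCategory)
	}
	return o, nil
}

func main() {
	opts, err := parsePushArgs([]string{
		"./my-data",
		"--table", "cats_dogs_train",
		"--category", "image_classification",
		"--intent", "train",
		"--label-column", "label",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Rejecting unsupported categories at parse time keeps the failure local, before any cluster interaction.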

The chart stays — it's still the right surface for some workflows (GitOps-managed installs, batch deploys via Helm wrappers). The CLI is a sibling, not a replacement.

Goals (v0.1)

  1. One command, zero prerequisites beyond kubectl access. A customer who can kubectl get pods against the cluster should be able to tracebloc dataset push ./my-data against it.
  2. Local-first validation. Schema errors surface on the customer's laptop in milliseconds, not after a multi-second cluster round trip.
  3. Real-time feedback. Live progress during staging + ingestion; final summary in the same terminal session.
  4. Multi-platform single binary. Linux/macOS/Windows, x86_64/arm64. Brew, scoop, direct curl install, GitHub Releases.
  5. No new credentials. Reuses the customer's existing kubeconfig + the parent client release's ingestor ServiceAccount token (via TokenRequest API or static token from the namespace).

Non-goals (v0.1)

  • Cloud-storage source providers (S3/GCS/HTTPS). Deferred to v0.2.
  • Custom processors / Python script ingestion. Deferred indefinitely — that's the legacy escape hatch.
  • GitOps CRD (kind: IngestionRun). Separate epic.
  • Web UI. Separate epic.
  • Auto-update mechanism (tracebloc upgrade). v0.2.
  • Multi-tenancy / multi-context support. v0.2.
  • All categories other than image_classification. Image first because it's the dominant case + we validated it end-to-end this week. Other categories come as one-PR additions in v0.2.

Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Repo | New: tracebloc/cli | Different language + release cadence from chart/data-ingestors; matches kubectl/helm/gh organizational pattern |
| Language | Go | Matches k8s ecosystem, single static binary, trivial cross-compile, excellent client-go |
| Auth | Kubeconfig + ingestor SA token via TokenRequest API | Zero new credentials; reuses today's TokenReview validation in jobs-manager |
| MVP scope | tracebloc login + tracebloc dataset push <path> + watch | Smallest shippable that solves the dominant case |
| Distribution | Homebrew tap + GitHub Releases (cross-compiled binaries) + install.sh | Matches gh/kubectl distribution model |

Architecture

┌──────────────────────────────────────────────────────┐
│ tracebloc CLI (Go single binary, ~10MB)              │
│                                                      │
│ Commands:                                            │
│   tracebloc login                                    │
│   tracebloc dataset push <path> [flags]              │
│   tracebloc ingest validate <path>  (local-only)     │
│   tracebloc version                                  │
│   tracebloc completion {bash|zsh|fish}               │
│                                                      │
│ Embedded:                                            │
│   - ingest.v1.json schema (validated at build)       │
│   - Default values per category                      │
│                                                      │
│ Talks to:                                            │
│   - Customer's kubeconfig (k8s API for discovery,    │
│     TokenRequest, pod exec, log streaming)            │
│   - jobs-manager directly (POST submit-ingestion-run) │
└──────────────────────────────────────────────────────┘
                          │
                          │ Same protocol as the chart
                          │ (POST /internal/submit-ingestion-run,
                          │  validated by jobs-manager's existing
                          │  TokenReview + ingest.v1.json check)
                          ▼
              ┌─────────────────────────────┐
              │ jobs-manager (unchanged)    │
              └─────────────────────────────┘

The CLI and the chart are sibling interfaces. Both POST the same body. Both are validated by the same jobs-manager logic. The protocol is the stable contract.

Implementation phases

Each phase = 1-3 PRs. Estimate ~2.5-3.5 weeks of focused work for v0.1.

Phase 0 — Repo bootstrap (~2 days)

  • Create tracebloc/cli repo
  • Go module, cmd/tracebloc/main.go, cobra for command tree
  • CI: go vet, go test, golangci-lint, cross-compile to 5 platforms (linux amd64/arm64, darwin amd64/arm64, windows amd64)
  • License (Apache-2.0), README skeleton, CONTRIBUTING, CODEOWNERS, kanban routing workflows mirroring the other repos
  • Skeleton tracebloc version command (proves binary builds + runs)
  • Smoke test in CI: build, run tracebloc version, assert it prints

Phase 1 — Local schema validator (~3 days)

  • Vendor ingest.v1.json from tracebloc/data-ingestors at build time (go:embed)
  • Implement validator using github.com/santhosh-tekuri/jsonschema/v6 (a widely used Go JSON Schema library)
  • tracebloc ingest validate <path> reads the YAML, validates, prints errors with JSON-pointer paths
  • Error formatting mirrors data-ingestors' _format_errors output for consistency
  • Unit tests against every example YAML in data-ingestors/examples/yaml/
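The "errors with JSON-pointer paths" item can be sketched with the standard library alone. The example below renders a path of object keys and array indices as an RFC 6901 JSON Pointer and prints one error per line. The `ValidationError` type, the `formatErrors` helper, and the field names in `main` are hypothetical. The output layout is not claimed to match data-ingestors' `_format_errors`, and the schema library would supply the real error structure.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ValidationError is a hypothetical, simplified record of one schema violation.
type ValidationError struct {
	Path    []any // object keys (string) and array indices (int)
	Message string
}

// jsonPointer renders path segments as an RFC 6901 JSON Pointer.
// An empty path yields "", which refers to the whole document.
func jsonPointer(path []any) string {
	var b strings.Builder
	for _, seg := range path {
		b.WriteByte('/')
		switch s := seg.(type) {
		case int:
			b.WriteString(strconv.Itoa(s))
		case string:
			// Escape "~" first so the "~1" produced for "/" is not re-escaped.
			s = strings.ReplaceAll(s, "~", "~0")
			s = strings.ReplaceAll(s, "/", "~1")
			b.WriteString(s)
		default:
			b.WriteString(fmt.Sprint(s))
		}
	}
	return b.String()
}

// formatErrors prints one "<pointer>: <message>" line per error.
func formatErrors(errs []ValidationError) string {
	var b strings.Builder
	for _, e := range errs {
		ptr := jsonPointer(e.Path)
		if ptr == "" {
			ptr = "(root)"
		}
		fmt.Fprintf(&b, "%s: %s\n", ptr, e.Message)
	}
	return b.String()
}

func main() {
	// Field names below are made up for the example.
	fmt.Print(formatErrors([]ValidationError{
		{Path: []any{"label_column"}, Message: "is required"},
		{Path: []any{"sources", 0, "path"}, Message: "must be a string"},
	}))
}
```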

Phase 2 — Cluster discovery + auth (~4 days)

  • Read kubeconfig (~/.kube/config or $KUBECONFIG)
  • Discover the tracebloc parent release in the current namespace: list Helm releases, find the one whose chart name is client
  • Mint a token for the ingestor ServiceAccount via the k8s TokenRequest API (or read a pre-existing token if the cluster's RBAC doesn't allow TokenRequest)
  • tracebloc cluster info diagnostic: prints discovered cluster, namespace, release, jobs-manager endpoint, SA name, token expiration
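The first step of this phase, locating the kubeconfig, might look like the sketch below. It uses only the standard library and assumes the usual kubectl convention that `$KUBECONFIG` may hold several paths joined by the OS path-list separator. The `kubeconfigPaths` helper is a hypothetical name. In the real CLI, client-go's loading rules would most likely handle this, including merging multiple files.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeconfigPaths returns the candidate kubeconfig files in priority order.
// envValue is the content of $KUBECONFIG; homeDir is the user's home directory.
func kubeconfigPaths(envValue, homeDir string) []string {
	var paths []string
	// $KUBECONFIG may list several files separated by ":" (";" on Windows).
	for _, p := range filepath.SplitList(envValue) {
		if p != "" {
			paths = append(paths, p)
		}
	}
	if len(paths) > 0 {
		return paths
	}
	// Fall back to the default location.
	return []string{filepath.Join(homeDir, ".kube", "config")}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	for _, p := range kubeconfigPaths(os.Getenv("KUBECONFIG"), home) {
		fmt.Println(p)
	}
}
```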

Phase 3 — Data staging (~5 days)

  • Create an ephemeral Pod in the cluster's namespace mounting the shared PVC (client-pvc by default, discoverable from the parent chart's values)
  • kubectl cp semantics: stream local files to the Pod's /data/shared/<table>/ path
  • Progress bar (use github.com/schollz/progressbar/v3)
  • Cleanup of the ephemeral Pod on exit (including SIGINT)
  • Test against the cats/dogs sample we used this week
  • Document the size limit (kubectl cp + client-go memory profile): anything above ~1GB needs the cloud-source story deferred to v0.2
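`kubectl cp` works by piping a tar stream into a `tar` process running in the target container, so the local half of the staging step can be sketched with `archive/tar`. In the example below, `tarDir` writes every regular file under a directory to an `io.Writer`, which in the real CLI would be the stdin of a pod exec session. `roundTripSample` exists only to exercise it against a temporary directory. Both names are hypothetical, and the pod creation, exec plumbing, and progress bar are left out.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// tarDir streams every regular file under root into w as a tar archive.
// Entry names are relative to root and use forward slashes.
func tarDir(root string, w io.Writer) (int, error) {
	tw := tar.NewWriter(w)
	count := 0
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		if _, err := io.Copy(tw, f); err != nil {
			return err
		}
		count++
		return nil
	})
	if err != nil {
		return count, err
	}
	return count, tw.Close()
}

// roundTripSample builds a tiny sample tree, tars it into memory,
// and returns the entry names read back from the archive.
func roundTripSample() ([]string, error) {
	dir, err := os.MkdirTemp("", "push-sample")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"cat/1.jpg", "dog/2.jpg"} {
		p := filepath.Join(dir, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(p, []byte("fake image bytes"), 0o644); err != nil {
			return nil, err
		}
	}
	var buf bytes.Buffer
	if _, err := tarDir(dir, &buf); err != nil {
		return nil, err
	}
	var names []string
	tr := tar.NewReader(&buf)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
	return names, nil
}

func main() {
	names, err := roundTripSample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

Because files are copied straight into the tar writer, local memory use stays roughly constant regardless of dataset size. The sketch says nothing about memory behavior on the exec side.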

Phase 4 — Submit + watch (~4 days)

  • Build the body.json the chart builds today (ingest_config + idempotency_key)
  • POST to jobs-manager with the SA token; show response
  • Watch the resulting ingestor Job (named in the response) via k8s API
  • Stream the Pod's logs to the customer's terminal
  • Parse the ingestion summary; print rows-ingested + files-transferred + any failures
  • Exit code maps to outcome (0 = success, non-zero = failure category)
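The submit step above could be sketched as follows. The endpoint path and the two body fields (`ingest_config`, `idempotency_key`) come from this issue. The rest is assumed for illustration: that `ingest_config` is sent as a JSON object, that the SA token travels in an `Authorization: Bearer` header, and the `newSubmitRequest` helper name. The `httptest` server in `main` merely stands in for jobs-manager, and its response text is invented.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// submitBody mirrors the two fields named in this issue.
// Treating ingest_config as a JSON object is an assumption.
type submitBody struct {
	IngestConfig   map[string]any `json:"ingest_config"`
	IdempotencyKey string         `json:"idempotency_key"`
}

// newSubmitRequest builds the POST to jobs-manager.
// Sending the SA token as a Bearer header is an assumption.
func newSubmitRequest(baseURL, token string, cfg map[string]any, key string) (*http.Request, error) {
	payload, err := json.Marshal(submitBody{IngestConfig: cfg, IdempotencyKey: key})
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/internal/submit-ingestion-run"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// Stand-in for jobs-manager; the real service's response format is not shown here.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var got submitBody
		if err := json.NewDecoder(r.Body).Decode(&got); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "received key=%s", got.IdempotencyKey)
	}))
	defer srv.Close()

	req, err := newSubmitRequest(srv.URL, "example-token",
		map[string]any{"table": "cats_dogs_train"}, "client-1700000000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```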

Phase 5 — Distribution (~2 days)

  • GitHub Releases workflow: triggered by v*.*.* tags, builds all 5 platform binaries, signs with cosign keyless OIDC (matching the data-ingestors image release flow), attaches to release
  • Homebrew tap (separate repo tracebloc/homebrew-tap): brew install tracebloc/tap/tracebloc
  • install.sh for curl-bash install: hosted at install.tracebloc.io (S3 + Cloudfront)
  • Verify signatures in install.sh before unpacking

Phase 6 — First customer test + iteration (~3 days)

  • End-to-end on the same EKS cluster we validated this week
  • One internal customer dogfoods it
  • Fix whatever shakes out
  • Tag v0.1.0

Open questions

  1. Schema-version negotiation. Today jobs-manager validates against its embedded schema. If the CLI ships with a newer schema than the cluster's jobs-manager, what happens? Need either (a) jobs-manager exposes its supported schema version via a /v1/version endpoint the CLI queries, or (b) CLI ships pinned to a specific schema version and we accept slight churn. Recommend (a) but it's a small jobs-manager addition.

  2. TokenRequest vs static SA secret. TokenRequest requires the customer's kubeconfig user to have create permission on serviceaccounts/token. Most clusters' default ClusterRoleBindings don't grant this to humans. Fallback: read the existing Secret of type kubernetes.io/service-account-token if one exists (older k8s); or have the parent client chart create a long-lived token secret for ingestion (modern k8s, opt-in). Decide before Phase 2.

  3. Backend-side auth. v0.1 reuses the SA-token-to-jobs-manager flow, which doesn't need backend auth at all (the ingestor Job mints its own backend token via clientId/password env). So tracebloc login is a no-op in v0.1, purely a placeholder for future OAuth. Document this so users aren't surprised.

  4. Concurrency model. What if two tracebloc dataset push invocations race against the same table? Today the chart handles this via idempotency-key uniqueness; the CLI should generate a fresh key per invocation (matching the chart's <release>-<unix-epoch> pattern, but stamped at the moment of POST).

  5. tracebloc dataset push vs tracebloc ingest. Command naming. "Dataset push" reads more like a product verb (matches docker push, git push). "Ingest" matches the API surface (/internal/submit-ingestion-run). Recommend "dataset push" for the customer surface, "ingest" as an alias for advanced users.
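For question 4, the `<release>-<unix-epoch>` key pattern is small enough to show directly. This is a minimal sketch, assuming the release name is already known from cluster discovery. The `idempotencyKey` helper and the release name `client` used in `main` are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// idempotencyKey follows the chart's <release>-<unix-epoch> pattern.
// The caller passes the epoch so the key can be stamped right before the POST.
func idempotencyKey(release string, epochSeconds int64) string {
	return fmt.Sprintf("%s-%d", release, epochSeconds)
}

func main() {
	// Stamp at the moment of submission, not at CLI start-up.
	key := idempotencyKey("client", time.Now().Unix())
	fmt.Println(key)
}
```

With second-level resolution, two pushes issued within the same second against the same release would produce identical keys. Whether that should count as a duplicate or as a collision is part of what this question needs to settle.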

Acceptance criteria for v0.1

  • A new user installs the CLI via brew install tracebloc/tap/tracebloc
  • They can run tracebloc dataset push ./cats-dogs --table cats_dogs_train --category image_classification --intent train --label-column label against an EKS cluster where they have kubectl access
  • The command stages the data, submits, watches, and reports "576 rows ingested" within ~60 seconds (for a small sample)
  • Schema errors in the customer's input are caught locally before any cluster interaction
  • Non-image categories give a clear "not supported in v0.1" error
  • Repository tracebloc/cli exists with CI green
  • The chart (tracebloc/ingestor) keeps working unchanged

Future epics (post-v0.1)

  • Cloud-storage source providers — S3/GCS/HTTPS in the schema + CLI plumbing. Solves the staging-crisis pain point.
  • tracebloc dataset list / show / delete — observability for existing datasets.
  • GitOps CRD (kind: IngestionRun) with a controller; ArgoCD/Flux-managed datasets.
  • Web UI for datasets — for users who shouldn't touch CLI/YAML.
  • Custom processors without Docker — Python uploaded as ConfigMap, mounted into the standard ingestor image.

Reasoning notes

  • The protocol (POST + ingest.v1.json) is the stable contract. Multiple interfaces (chart, CLI, future CRD, future Web UI) translate to the same call. This means adding the CLI doesn't deprecate anything, doesn't break anything, and sets up the layered-interface architecture we'd want for everything else.
  • Go was chosen over Python for distribution simplicity (single static binary, no runtime dependency). Data scientists who already have Python can still install via brew or curl install.sh | sh.
  • v0.1 deliberately scopes out cloud sources, multi-category, GitOps, and Web UI. Smallest shippable thing that exercises the layered-interface idea + serves the dominant case. Real customer usage informs v0.2.
