Epic: tracebloc CLI — `tracebloc dataset push` for one-command ingestion

## Problem

Today's declarative-YAML ingestor (`helm install tracebloc/ingestor --set-file ingestConfig=./ingest.yaml`) is a real improvement over the per-customer-Dockerfile pattern, but it serves exactly one persona well: a Kubernetes-fluent ML engineer at a customer who already has data staged on a shared cluster filesystem. For everyone else — data scientists with local data, ML engineers with cloud sources, platform engineers running GitOps, air-gapped operators, and anyone iterating fast on many datasets — the chart hands them at least three foreign mental models (Helm, Kubernetes PVCs, YAML schema) plus a "stage data on the PVC yourself" prerequisite that doesn't scale beyond the cats/dogs sample.

This epic introduces a CLI (`tracebloc`) that collapses the customer's mental load into:

```bash
tracebloc dataset push ./my-data \
  --table cats_dogs_train \
  --category image_classification \
  --intent train \
  --label-column label
```

The CLI handles authentication, cluster discovery, data staging, schema validation, submission, watching, and summary reporting. The customer never touches Helm, never edits YAML, never runs `kubectl cp`.

The chart stays — it's still the right surface for some workflows (GitOps-managed installs, batch deploys via Helm wrappers). The CLI is a sibling, not a replacement.

## Goals (v0.1)

1. **One command, zero prerequisites beyond kubectl access.** A customer who can `kubectl get pods` against the cluster should be able to `tracebloc dataset push ./my-data` against it.
2. **Local-first validation.** Schema errors surface on the customer's laptop in milliseconds, not after a multi-second cluster round trip.
3. **Real-time feedback.** Live progress during staging + ingestion; final summary in the same terminal session.
4. **Multi-platform single binary.** Linux and macOS on x86_64 and arm64, Windows on x86_64 (the five targets cross-compiled in Phase 0). Brew, scoop, direct curl install, GitHub Releases.
5. **No new credentials.** Reuses the customer's existing kubeconfig + the parent client release's `ingestor` ServiceAccount token (via TokenRequest API or static token from the namespace).

## Non-goals (v0.1)

- Cloud-storage source providers (S3/GCS/HTTPS). Deferred to v0.2.
- Custom processors / Python script ingestion. Deferred indefinitely — that's the legacy escape hatch.
- GitOps CRD (`kind: IngestionRun`). Separate epic.
- Web UI. Separate epic.
- Auto-update mechanism (`tracebloc upgrade`). v0.2.
- Multi-tenancy / multi-context support. v0.2.
- All categories other than image_classification. Image first because it's the dominant case + we validated it end-to-end this week. Other categories come as one-PR additions in v0.2.

## Decisions

| Decision | Choice | Why |
|---|---|---|
| Repo | New: `tracebloc/cli` | Different language + release cadence from chart/data-ingestors; matches kubectl/helm/gh organizational pattern |
| Language | Go | Matches k8s ecosystem, single static binary, trivial cross-compile, excellent client-go |
| Auth | Kubeconfig + ingestor SA token via TokenRequest API | Zero new credentials; reuses today's TokenReview validation in jobs-manager |
| MVP scope | `tracebloc login` + `tracebloc dataset push <path>` + watch | Smallest shippable that solves the dominant case |
| Distribution | Homebrew tap + GitHub Releases (cross-compiled binaries) + install.sh | Matches gh/kubectl distribution model |

## Architecture

```
┌──────────────────────────────────────────────────────┐
│ tracebloc CLI (Go single binary, ~10MB)              │
│                                                      │
│ Commands:                                            │
│   tracebloc login                                    │
│   tracebloc dataset push <path> [flags]              │
│   tracebloc ingest validate <path>  (local-only)     │
│   tracebloc version                                  │
│   tracebloc completion {bash|zsh|fish}               │
│                                                      │
│ Embedded:                                            │
│   - ingest.v1.json schema (validated at build)       │
│   - Default values per category                      │
│                                                      │
│ Talks to:                                            │
│   - Customer's kubeconfig (k8s API for discovery,    │
│     TokenRequest, pod exec, log streaming)           │
│   - jobs-manager directly (POST submit-ingestion-run)│
└──────────────────────────────────────────────────────┘
                          │
                          │ Same protocol as the chart
                          │ (POST /internal/submit-ingestion-run,
                          │  validated by jobs-manager's existing
                          │  TokenReview + ingest.v1.json check)
                          ▼
              ┌─────────────────────────────┐
              │ jobs-manager (unchanged)    │
              └─────────────────────────────┘
```

The CLI and the chart are sibling interfaces. Both POST the same body. Both are validated by the same jobs-manager logic. The protocol is the stable contract.
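As a rough illustration of that contract, the request the CLI would send might be built like this. The path and the two body fields (`ingest_config`, `idempotency_key`) come from this epic; the `Bearer` header, the base URL parameter, and the assumption that `ingest_config` is a nested JSON object are guesses, and `submitBody` / `newSubmitRequest` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// submitBody mirrors the two fields named in this epic. Any further
// fields the chart sends today are not shown here.
type submitBody struct {
	IngestConfig   json.RawMessage `json:"ingest_config"`
	IdempotencyKey string          `json:"idempotency_key"`
}

// newSubmitRequest builds the POST to jobs-manager. baseURL and the
// Bearer scheme are assumptions; the path is taken from the diagram.
func newSubmitRequest(baseURL, token, idemKey string, ingestConfig []byte) (*http.Request, error) {
	// json.Marshal rejects ingestConfig if it is not valid JSON.
	payload, err := json.Marshal(submitBody{
		IngestConfig:   ingestConfig,
		IdempotencyKey: idemKey,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/internal/submit-ingestion-run", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}
```

Keeping request construction separate from sending makes it easy to compare the CLI's body byte-for-byte with what the chart produces.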

## Implementation phases

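Each phase = 1-3 PRs. Estimate ~2.5-3.5 weeks of focused work for v0.1. Note that the per-phase figures below sum to 23 working days (about 4.5 weeks) if run strictly in sequence, so the shorter estimate only holds if phases overlap or run in parallel.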

### Phase 0 — Repo bootstrap (~2 days)

- Create `tracebloc/cli` repo
- Go module, `cmd/tracebloc/main.go`, `cobra` for command tree
- CI: `go vet`, `go test`, `golangci-lint`, cross-compile to 5 platforms (linux amd64/arm64, darwin amd64/arm64, windows amd64)
- License (Apache-2.0), README skeleton, CONTRIBUTING, CODEOWNERS, kanban routing workflows mirroring the other repos
- Skeleton `tracebloc version` command (proves binary builds + runs)
- Smoke test in CI: build, run `tracebloc version`, assert it prints
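A minimal sketch of what the `version` skeleton and the target matrix could look like, using only the standard library. The `-ldflags` injection is a common Go convention rather than something this epic specifies, and `releaseTargets` / `isReleaseTarget` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

// version is meant to be overridden at build time, for example:
//   go build -ldflags "-X main.version=v0.1.0" ./cmd/tracebloc
var version = "dev"

// releaseTargets lists the five GOOS/GOARCH pairs named in Phase 0.
var releaseTargets = [][2]string{
	{"linux", "amd64"}, {"linux", "arm64"},
	{"darwin", "amd64"}, {"darwin", "arm64"},
	{"windows", "amd64"},
}

// versionString is what `tracebloc version` would print.
func versionString() string {
	return fmt.Sprintf("tracebloc %s (%s/%s)", version, runtime.GOOS, runtime.GOARCH)
}

// isReleaseTarget reports whether a GOOS/GOARCH pair is in the matrix.
func isReleaseTarget(goos, goarch string) bool {
	for _, t := range releaseTargets {
		if t[0] == goos && t[1] == goarch {
			return true
		}
	}
	return false
}
```

The CI smoke test then only needs to run the built binary and check that the output starts with `tracebloc`.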

### Phase 1 — Local schema validator (~3 days)

- Vendor `ingest.v1.json` from tracebloc/data-ingestors at build time (go:embed)
- Implement validator using `github.com/santhosh-tekuri/jsonschema/v6` (a widely used JSON Schema implementation for Go)
- `tracebloc ingest validate <path>` reads the YAML, validates, prints errors with JSON-pointer paths
- Error formatting mirrors data-ingestors' `_format_errors` output for consistency
- Unit tests against every example YAML in data-ingestors/examples/yaml/
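For the error-reporting bullet, here is a standard-library sketch of rendering violations with JSON-pointer paths (RFC 6901). The `validationError` shape is hypothetical and stands in for whatever the schema library returns; the output format is also a guess, since the exact `_format_errors` layout is not reproduced in this epic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// validationError is a hypothetical shape for one schema violation.
type validationError struct {
	Path    []string // one segment per object key or array index
	Message string
}

// jsonPointer renders segments as an RFC 6901 JSON Pointer.
// "~" must be escaped to "~0" before "/" is escaped to "~1".
func jsonPointer(segments []string) string {
	var b strings.Builder
	for _, s := range segments {
		s = strings.ReplaceAll(s, "~", "~0")
		s = strings.ReplaceAll(s, "/", "~1")
		b.WriteString("/")
		b.WriteString(s)
	}
	return b.String()
}

// formatErrors returns one "<pointer>: <message>" line per violation,
// sorted so output is stable across runs.
func formatErrors(errs []validationError) string {
	lines := make([]string, 0, len(errs))
	for _, e := range errs {
		p := jsonPointer(e.Path)
		if p == "" {
			p = "(root)"
		}
		lines = append(lines, fmt.Sprintf("%s: %s", p, e.Message))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}
```

Sorting the lines keeps the unit tests against the example YAMLs deterministic.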

### Phase 2 — Cluster discovery + auth (~4 days)

- Read kubeconfig (`~/.kube/config` or `$KUBECONFIG`)
- Discover the tracebloc parent release in the current namespace: list Helm releases, find the one whose chart name is `client`
- Mint a token for the `ingestor` ServiceAccount via the k8s TokenRequest API (or read a pre-existing token if the cluster's RBAC doesn't allow TokenRequest)
- `tracebloc cluster info` diagnostic: prints discovered cluster, namespace, release, jobs-manager endpoint, SA name, token expiration
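The first and third bullets can be sketched without client-go to show what is actually being resolved and sent. `$KUBECONFIG` is a path list, so it is split rather than used verbatim; the TokenRequest path and body follow the Kubernetes `authentication.k8s.io/v1` API. The function names are hypothetical, and a real implementation would more likely call client-go's `CoreV1().ServiceAccounts(ns).CreateToken(...)` than hand-build the request.

```go
package main

import (
	"encoding/json"
	"path/filepath"
)

// kubeconfigPaths returns the files to load: every entry in $KUBECONFIG
// (split on the OS path-list separator), else ~/.kube/config.
func kubeconfigPaths(envValue, homeDir string) []string {
	var paths []string
	for _, p := range filepath.SplitList(envValue) {
		if p != "" {
			paths = append(paths, p)
		}
	}
	if len(paths) == 0 {
		paths = []string{filepath.Join(homeDir, ".kube", "config")}
	}
	return paths
}

type tokenRequestSpec struct {
	Audiences         []string `json:"audiences,omitempty"`
	ExpirationSeconds int64    `json:"expirationSeconds,omitempty"`
}

type tokenRequest struct {
	APIVersion string           `json:"apiVersion"`
	Kind       string           `json:"kind"`
	Spec       tokenRequestSpec `json:"spec"`
}

// tokenRequestPath is the token subresource of a ServiceAccount.
func tokenRequestPath(namespace, serviceAccount string) string {
	return "/api/v1/namespaces/" + namespace +
		"/serviceaccounts/" + serviceAccount + "/token"
}

// tokenRequestBody builds the JSON body to POST to tokenRequestPath.
func tokenRequestBody(audiences []string, expirationSeconds int64) ([]byte, error) {
	return json.Marshal(tokenRequest{
		APIVersion: "authentication.k8s.io/v1",
		Kind:       "TokenRequest",
		Spec: tokenRequestSpec{
			Audiences:         audiences,
			ExpirationSeconds: expirationSeconds,
		},
	})
}
```

The token's expiry comes back in the response status, which is what `tracebloc cluster info` would print as the token expiration.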

### Phase 3 — Data staging (~5 days)

- Create an ephemeral Pod in the cluster's namespace mounting the shared PVC (`client-pvc` by default, discoverable from the parent chart's values)
- `kubectl cp` semantics: stream local files to the Pod's `/data/shared/<table>/` path
- Progress bar (use `github.com/schollz/progressbar/v3`)
- Cleanup of the ephemeral Pod on exit (including SIGINT)
- Test against the cats/dogs sample we used this week
- Document the size limit (driven by `kubectl cp` semantics and client-go's memory profile): anything above ~1GB needs the cloud-source story planned for v0.2
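Two pieces of the staging step lend themselves to a small sketch: mapping local relative paths to their destination under `/data/shared/<table>/`, and enforcing the ~1GB guideline before any bytes move. The traversal check and the function names are additions for illustration, and the limit constant is the rough figure from this epic, not a measured threshold.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const stageRoot = "/data/shared"    // mount path from the Phase 3 description
const maxStageBytes int64 = 1 << 30 // ~1GB guideline, not a measured limit

// remoteDest maps a slash-separated relative path to its location under
// /data/shared/<table>/ and rejects anything that would escape it.
func remoteDest(table, rel string) (string, error) {
	if table == "" || table == "." || table == ".." || strings.ContainsAny(table, "/\\") {
		return "", fmt.Errorf("invalid table name %q", table)
	}
	base := path.Join(stageRoot, table)
	dest := path.Join(base, rel) // Join also cleans ".." segments
	if !strings.HasPrefix(dest, base+"/") {
		return "", fmt.Errorf("path %q does not resolve inside %s", rel, base)
	}
	return dest, nil
}

// totalWithinLimit sums file sizes and fails once the guideline is exceeded.
func totalWithinLimit(sizes []int64) (int64, error) {
	var total int64
	for _, s := range sizes {
		total += s
	}
	if total > maxStageBytes {
		return total, fmt.Errorf("dataset is %d bytes; staging is limited to about %d", total, maxStageBytes)
	}
	return total, nil
}
```

Running both checks while walking the local directory means an oversized or malformed dataset fails before the ephemeral Pod is even created.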

### Phase 4 — Submit + watch (~4 days)

- Build the body.json the chart builds today (ingest_config + idempotency_key)
- POST to jobs-manager with the SA token; show response
- Watch the resulting ingestor Job (named in the response) via k8s API
- Stream the Pod's logs to the customer's terminal
- Parse the ingestion summary; print rows-ingested + files-transferred + any failures
- Exit code maps to outcome (0 = success, non-zero = failure class)
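One way the exit-code bullet could work is a typed error that carries its failure class, unwrapped at the top of `main`. The specific classes and numbers below are placeholders; this epic does not fix them.

```go
package main

import "errors"

// failureClass doubles as the process exit code. Values are hypothetical.
type failureClass int

const (
	classValidation failureClass = iota + 2 // 2: local schema errors
	classAuth                               // 3: kubeconfig or token problems
	classStaging                            // 4: copy to the PVC failed
	classSubmit                             // 5: jobs-manager rejected the POST
	classIngestion                          // 6: the ingestor Job failed
)

// cliError tags an underlying error with its failure class.
type cliError struct {
	Class failureClass
	Err   error
}

func (e *cliError) Error() string { return e.Err.Error() }
func (e *cliError) Unwrap() error { return e.Err }

// newCLIError is a small constructor for classified errors.
func newCLIError(c failureClass, msg string) error {
	return &cliError{Class: c, Err: errors.New(msg)}
}

// exitCode returns 0 for nil, the class for classified errors,
// and 1 for anything unclassified.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var ce *cliError
	if errors.As(err, &ce) {
		return int(ce.Class)
	}
	return 1
}
```

Because `errors.As` walks wrapped errors, intermediate layers can add context with `%w` without losing the class.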

### Phase 5 — Distribution (~2 days)

- GitHub Releases workflow: triggered by `v*.*.*` tags, builds all 5 platform binaries, signs with cosign keyless OIDC (matching the data-ingestors image release flow), attaches to release
- Homebrew tap (separate repo `tracebloc/homebrew-tap`): `brew install tracebloc/tap/tracebloc`
- `install.sh` for curl-bash install: hosted at `install.tracebloc.io` (S3 + CloudFront)
- Verify signatures in install.sh before unpacking

### Phase 6 — First customer test + iteration (~3 days)

- End-to-end on the same EKS cluster we validated this week
- One internal customer dogfoods it
- Fix whatever shakes out
- Tag v0.1.0

## Open questions

1. **Schema-version negotiation.** Today jobs-manager validates against its embedded schema. If the CLI ships with a newer schema than the cluster's jobs-manager, what happens? Need either (a) jobs-manager exposes its supported schema version via a `/v1/version` endpoint the CLI queries, or (b) CLI ships pinned to a specific schema version and we accept slight churn. Recommend (a) but it's a small jobs-manager addition.

2. **TokenRequest vs static SA secret.** TokenRequest requires the customer's kubeconfig user to have `create` permission on `serviceaccounts/token`. Most clusters' default ClusterRoleBindings don't grant this to human users. Fallback: read the existing `Secret` of type `kubernetes.io/service-account-token` if one exists (older k8s, which auto-created one per ServiceAccount before v1.24); or have the parent client chart create a long-lived token secret for ingestion (modern k8s, opt-in). Decide before Phase 2.

3. **Backend-side auth.** v0.1 reuses the SA-token-to-jobs-manager flow, which doesn't need backend auth at all (the ingestor Job mints its own backend token via clientId/password env). So `tracebloc login` is a no-op in v0.1, purely a placeholder for future OAuth. Document this so users aren't surprised.

4. **Concurrency model.** What if two `tracebloc dataset push` invocations race against the same table? Today the chart handles this via idempotency-key uniqueness; the CLI should generate a fresh key per invocation (matching the chart's `<release>-<unix-epoch>` pattern, but stamped at the moment of POST).

5. **`tracebloc dataset push` vs `tracebloc ingest`.** Command naming. "Dataset push" feels more product (matches `docker push`, `git push`). "Ingest" matches the API surface (`/internal/submit-ingestion-run`). Recommend "dataset push" for the customer surface, "ingest" as an alias for advanced users.
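Questions 1 and 4 are small enough to sketch. The key format follows the chart's `<release>-<unix-epoch>` pattern described above; note that with one-second resolution, two invocations for the same release in the same second produce the same key, which may or may not be the desired behavior. The schema check assumes option (a), with a hypothetical list of schema identifiers returned by the proposed `/v1/version` endpoint; the identifier strings themselves are made up.

```go
package main

import (
	"fmt"
	"strconv"
)

// idempotencyKey follows the chart's <release>-<unix-epoch> pattern.
// Callers would pass time.Now().Unix() taken immediately before the POST.
func idempotencyKey(release string, unixSeconds int64) string {
	return release + "-" + strconv.FormatInt(unixSeconds, 10)
}

// checkSchemaSupport compares the CLI's embedded schema identifier with
// the list the cluster's jobs-manager reports (hypothetical response).
func checkSchemaSupport(cliSchema string, serverSchemas []string) error {
	for _, s := range serverSchemas {
		if s == cliSchema {
			return nil
		}
	}
	return fmt.Errorf("jobs-manager does not advertise schema %q (supports %v)",
		cliSchema, serverSchemas)
}
```

Running the schema check right after cluster discovery would surface a CLI/cluster mismatch before any data is staged.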

## Acceptance criteria for v0.1

- [ ] A new user installs the CLI via `brew install tracebloc/tap/tracebloc`
- [ ] They can run `tracebloc dataset push ./cats-dogs --table cats_dogs_train --category image_classification --intent train --label-column label` against an EKS cluster where they have kubectl access
- [ ] The command stages the data, submits, watches, and reports "576 rows ingested" within ~60 seconds (for a small sample)
- [ ] Schema errors in the customer's input are caught locally before any cluster interaction
- [ ] Non-image categories give a clear "not supported in v0.1" error
- [ ] Repository tracebloc/cli exists with CI green
- [ ] The chart (`tracebloc/ingestor`) keeps working unchanged
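The flag surface and the "not supported in v0.1" criterion can be sketched as follows. This uses the standard library `flag` package to stay dependency-free, whereas Phase 0 plans on `cobra`; which flags are mandatory is an assumption, and `pushOptions` / `parsePush` are hypothetical names. Because `flag` stops at the first positional argument, the path is taken from `args[0]` and only the remainder is parsed.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"strings"
)

// pushOptions holds the arguments of `tracebloc dataset push`.
type pushOptions struct {
	Path, Table, Category, Intent, LabelColumn string
}

// parsePush parses "<path> [flags]" and applies the v0.1 category gate.
func parsePush(args []string) (pushOptions, error) {
	var o pushOptions
	if len(args) == 0 || strings.HasPrefix(args[0], "-") {
		return o, errors.New("usage: tracebloc dataset push <path> [flags]")
	}
	o.Path = args[0]

	fs := flag.NewFlagSet("dataset push", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // let the caller decide how to print errors
	fs.StringVar(&o.Table, "table", "", "destination table name")
	fs.StringVar(&o.Category, "category", "", "dataset category")
	fs.StringVar(&o.Intent, "intent", "", "dataset intent, e.g. train")
	fs.StringVar(&o.LabelColumn, "label-column", "", "name of the label column")
	if err := fs.Parse(args[1:]); err != nil {
		return o, err
	}

	if o.Table == "" {
		return o, errors.New("--table is required")
	}
	if o.Category != "image_classification" {
		return o, fmt.Errorf("category %q is not supported in v0.1 (only image_classification)", o.Category)
	}
	return o, nil
}
```

Doing this check in argument parsing keeps the rejection purely local, in line with the criterion that errors are caught before any cluster interaction.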

## Future epics (post-v0.1)

- **Cloud-storage source providers** — S3/GCS/HTTPS in the schema + CLI plumbing. Solves the staging-crisis pain point.
- **`tracebloc dataset list / show / delete`** — observability for existing datasets.
- **GitOps CRD** — `kind: IngestionRun` controller; ArgoCD/Flux managed datasets.
- **Web UI for datasets** — for users who shouldn't touch CLI/YAML.
- **Custom processors without Docker** — Python uploaded as ConfigMap, mounted into the standard ingestor image.

## Reasoning notes

- The protocol (POST + ingest.v1.json) is the stable contract. Multiple interfaces (chart, CLI, future CRD, future Web UI) translate to the same call. This means adding the CLI doesn't deprecate anything, doesn't break anything, and sets up the layered-interface architecture we'd want for everything else.
- Go was chosen over Python for distribution simplicity (single static binary, no runtime dependency). Data scientists who already have Python can still install via `brew` or by piping the `install.sh` script hosted at `install.tracebloc.io` into `sh`.
- v0.1 deliberately scopes out cloud sources, multi-category, GitOps, and Web UI. Smallest shippable thing that exercises the layered-interface idea + serves the dominant case. Real customer usage informs v0.2.
