Sync develop → main for v0.1.0-alpha#22
Merged
Merged
Conversation
Phase 1: embed ingest.v1.json + `tracebloc ingest validate`
* feat(cluster): kubeconfig discovery, parent release detection, SA token Phase 2 of the v0.1 roadmap. Adds the plumbing the future `tracebloc dataset push` flow needs: where does the customer's kubeconfig point, which tracebloc release lives there, and how do we authenticate to its jobs-manager. End-to-end validated against the dev EKS cluster (tb-client-dev-templates): discovers chart 1.3.5, resolves jobs-manager.tracebloc-templates.svc:8080, reads INGESTOR_IMAGE_DIGEST out of the deployment env, mints a 10-minute SA token via TokenRequest. What lands: - internal/cluster/kubeconfig.go — Load() that honors --kubeconfig, $KUBECONFIG, ~/.kube/config (via clientcmd's full default loading rules — *not* an empty ExplicitPath, which silently refuses to fall back to defaults; that was the first bug the real-cluster smoke caught). - internal/cluster/discover.go — DiscoverParentRelease() finds the tracebloc/client release in a namespace by listing chart-managed Deployments and filtering by name suffix (-jobs-manager). The chart shares app.kubernetes.io/name=client across mysql/jobs- manager/requests-proxy, so suffix matching is what distinguishes jobs-manager. Returns a friendly multi-release error when ambiguous, with remediation text in the message. - internal/cluster/token.go — MintIngestorToken() tries the modern TokenRequest path first, falls back to a static service-account-token Secret on RBAC denial / older clusters / SA missing. Errors propagate verbatim on non-recoverable failures (network, context cancellation) so customers see the real problem instead of a misleading "static fallback also failed." - internal/cli/cluster.go — `tracebloc cluster info` command. Prints context, server, namespace, parent release info, SA + token state (with SHA256(token)[:8] instead of the raw bytes — token must never appear in scrollback). Exit codes 3 (kubeconfig issue) / 4 (no parent release) / 5 (token mint failed). Tests: - 5 new test files covering happy path, multi-release ambiguity, service name fallback, the sibling-deployment filter regression (mysql + requests-proxy + jobs-manager all share chart-level labels — discovery must pick jobs-manager by name suffix), TokenRequest happy path, static-secret fallback, non-recoverable error pass-through, and the combined-failure remediation message. Coverage: internal/cluster 83.8%, internal/cli 59.5% (cluster info itself is hard to unit-test without a real cluster — Phase 3+ adds integration tests against a kind cluster). Library footprint: brings in k8s.io/client-go + apimachinery + api (@v0.31.0) and sigs.k8s.io/yaml. Cross-compiled binaries grow from ~10MB to ~30MB; cost is acceptable for the customer-experience upside of "your kubeconfig is all you need." Closes tracebloc/client#150. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): resolve Go version + dep conflicts for lint to typecheck CI on PR #2 hit a lint typecheck error ("undefined: jsonschema / yaml (typecheck)") that took a few iterations to diagnose. Root cause: the k8s.io/client-go v0.31 line pulls in transitive deps (sigs.k8s.io/structured-merge-diff/v6 vs v4) that fight in go.mod, and golangci-lint v1.61's bundled Go SDK can't typecheck a module whose `go` directive is newer than what the linter supports. Resolution: 1. Bump k8s.io/client-go + api + apimachinery from v0.31.0 to v0.36.1 (latest stable). Fixes the structured-merge-diff version split — v0.36 uses v6 consistently across the dependency chain. 2. Accept whatever `go mod tidy` writes to go.mod's `go` directive (currently 1.26 on this dev machine, 1.24 on others — same either way since Go modules are forward-compatible). Stop fighting tidy; pinning a stale version produces typecheck errors instead of real findings. 3. Bump golangci-lint in the workflow from v1.61.0 to v1.64.7, the first version that handles the Go 1.24+ source the dep tree now requires. 4. Update .golangci.yml `run.go: "1.24"` to match go.mod's effective minimum. 5. Refresh the go.mod comment so future readers understand why the version directive isn't pinned low. Local validation: `make vet test fmt-check schema-check` all green; cluster-info smoke against the dev EKS still discovers chart 1.3.5 + mints a TokenRequest token. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: trim linters that whole-program-analyze the k8s.io dep tree CI's Lint job has now failed three runs in a row with "The runner has received a shutdown signal" after ~2 minutes of golangci-lint actually running. Not a flaky runner — reproducible. Root cause: `staticcheck` and `unused` do full-program SSA analysis across every transitive dep. With k8s.io/client-go + apimachinery + api in the graph (≈80 indirect modules), that exceeds whatever budget the standard 4-CPU GitHub-hosted runner allots for the job. The runner gets preempted before lint completes. Drop `staticcheck` and `unused` from the active linter set. Keep the cheap per-file linters that catch the bugs we've actually hit this week (errcheck, govet, ineffassign, gofmt, goimports, misspell, unconvert). Filed v0.2 ticket to bring them back via either (a) a self-hosted larger runner, (b) `-skip-files=k8s.io/*` patterns that don't exist in golangci-lint v1.64 but do in v2.x, or (c) split the lint job to run staticcheck only on `./internal/...` (our own code) and skip module-cache packages. The dropped linters' value relative to CI cycles spent debugging this: - staticcheck SA-checks are valuable but redundant with `govet` for the most-likely-to-bite cases (printf, lock copies, etc.) - `unused` rarely fires on a brand-new codebase where every symbol is just-introduced. Pragmatic tradeoff for v0.1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: replace golangci-lint-action with standalone errcheck + gofmt -s Four consecutive Lint failures on PR #2, all hitting the same "shutdown signal received" at ~2 minutes (15:17, 15:24, 15:33, 15:35). Not lint config (the trim from staticcheck/unused didn't help), not a flaky runner (reproducible), not an OOM (no resource warning). golangci-lint-action@v6 + the k8s.io/* dep tree appears to be the incompatible combination in early 2026's GitHub Actions environment. Rather than spend another iteration debugging the action, replace it with standalone tools that already work locally and have predictable behavior: - errcheck v1.7.0 (the bug class we've actually hit this week) - gofmt -s (the formatting check; matches what `make fmt-check` does) - ineffassign v0.1 (cheap dead-assignment detection) - misspell (typo guard) Combined runtime in standalone mode: ~10s. golangci-lint's value-add beyond these was staticcheck + unused — both already deferred to #6 as v0.2 work pending a strategy for the dep tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: unpin lint-tool versions; their old tags don't build with current Go errcheck v1.7.0 + ineffassign v0.1.0 + misspell v0.3.4 all transitively import a golang.org/x/tools version (v0.17 era) that fails to compile under current Go ("invalid array length -delta * delta" in tokeninternal.go). Pinning was the right instinct for reproducibility, but the upstream tools haven't shipped current-Go-compat tags yet. Use @latest for now; reproducibility tradeoff is acceptable given these are lint tools, not runtime deps. Document in #6 as "pin once upstream tags newer versions" follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lint): explicit-discard 20 unchecked fmt.Fprintf in cluster.go errcheck caught 20 unchecked Fprintf/Fprintln returns in runClusterInfo — same class of finding that was previously caught + fixed in internal/cli/ingest.go and cmd/tracebloc/main.go. I missed cluster.go when I added the explicit-discard pattern there. Same rationale as the other sites: the exit code is the contract; a pipe-write failure shouldn't convert a successful diagnostic into a non-zero exit. Wrap each call with `_, _ =`. Now caught by CI thanks to the standalone-errcheck swap from the previous commit. The whole reason for the lint-job rework was to catch this exact bug class earlier in the loop — we just had to trade the golangci-lint-action for a working setup first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lint): explicit-discard the stderr Fprintln in main.go too Preempts the next errcheck cycle. Same explicit-discard rationale as the cluster.go fixes — stderr unreachable shouldn't change the exit code we propagate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cluster): admit hardcoded ingestor SA name + add --ingestor-sa override Bugbot caught a contradiction in PR #2's Phase 2 code: - The comment said "Read INGESTOR_IMAGE_DIGEST + ingestor SA name from the Deployment's pod-spec env" - The struct doc said "customers can override; the value comes from jobs-manager's environment" - But the switch only handled INGESTOR_IMAGE_DIGEST. SA name was hardcoded to "ingestor", silently ignoring any customer override. Truthful fix: 1. Update the comment and struct doc to admit the limitation. 2. Add a `--ingestor-sa` flag on `cluster info` so customers who set `ingestionAuthz.serviceAccountName` to a non-default value in the parent client chart can still use the CLI today. 3. Plumb the override through `runClusterInfo` -> applied to the discovered ParentRelease before token mint. 4. Drop the now-only-INGESTOR_IMAGE_DIGEST switch statement to a plain `if` — clearer, errcheck-friendlier, and signals there's only one env var being read. File #7 in tracebloc/cli for the proper fix: discover the SA name from the chart-rendered ingestionAuthz ConfigMap so the flag becomes unnecessary. v0.2 work, not blocking Phase 2 ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cluster): defer to apierrors stdlib; close go.mod/lint Go-version drift Cursor Bugbot's re-review on 123b56a flagged two new issues in the Phase 2 (#150) tree: 1. internal/cluster/token.go reimplemented isForbidden / isNotFound / isMethodNotSupported / statusCode against Status.Code (numeric HTTP code) when k8s.io/apimachinery/pkg/api/errors already exports IsForbidden / IsNotFound / IsMethodNotSupported that key off Status.Reason (the typed enum). The two can diverge silently for non-standard status errors. The test file already imports apierrors and constructs fake errors via apierrors.NewForbidden(), so deferring to the stdlib is both safer AND removes ~20 lines of homegrown code. The four token tests still pass at 82.1% pkg coverage because NewForbidden() sets both Code and Reason fields. 2. go.mod's top-of-file comment claimed "Minimum Go is 1.22" but the actual `go 1.26.0` directive (forced by k8s.io/* v0.36.x deps) contradicted it, and .golangci.yml pinned `go: "1.24"` — also stale. Rewrote the go.mod comment to admit reality + tell future-me to bump both together, and bumped the lint config to "1.26" to match. The third inline comment from Bugbot's re-review is a stale carry-over of the SA-name finding fixed in 123b56a (same bug ID 5e4b5df0…, GitHub auto-shifted its anchor onto the new lines). Bugbot's own review-body count confirms 2 new findings, not 3. Local: go vet, go test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset): pre-flight for `dataset push` (PR-a of #151) Phase 3 (tracebloc/client#151) lands in two PRs to keep diffs reviewable. PR-a (this one) is the no-op-safe path: synthesize the ingest spec from flags, validate against the embedded ingest.v1 schema, walk the local directory, discover the cluster's parent release + shared PVC. Print a summary, then either stop (--dry-run) or fail cleanly with "wait for PR-b" so a customer who pulled the PR-a binary doesn't see "0 files transferred" confusion. PR-b adds the novel-engineering piece: ephemeral stage Pod (alpine 3.20 pinned by digest), client-go remotecommand SPDY executor + tar stream, schollz/progressbar/v3, SIGINT-safe cleanup. It plugs into the "TODO: PR-b" branch at the end of runDatasetPush() without touching anything upstream. Surface area added: internal/push/ (new package) spec.go SpecArgs -> ingest.v1.json-conforming map. Path is leave-validation-to-schema; the schema is the single source of truth. walk.go Discover() walks <root>/labels.csv + <root>/images/, enforces v0.1 size caps (1 GiB total, 500 MiB per file), accepts {.jpg,.jpeg,.png,.webp} case-insensitively. internal/cluster/pvc.go (new file) DiscoverSharedPVC reads the chart's `client-pvc` (hardcoded by _helpers.tpl, not parameterized) in the namespace, verifies it is Bound, returns access modes. IsReadWriteMany() drives the stage-Pod scheduling warning in PR-b. internal/cli/dataset.go (new file) `tracebloc dataset push <local-path>` command. Order: 1. flag->spec synthesize + schema-validate (stderr diagnostic, exit 2 — same wording style as ingest validate) 2. walk local layout + size caps (exit 3) 3. load kubeconfig + clientset (exit 3) 4. discover parent release (exit 4) 5. discover shared PVC (exit 4) 6. print pre-flight summary 7. --dry-run stop, or exit 6 "wait for PR-b" internal/cli/root.go (1-line wire-up) Register the new command. Why image_classification only: the epic (tracebloc/client#147) v0.1 non-goals defer other categories to v0.2 as one-PR additions. The --category flag does NOT pre-validate so the schema's enum check produces the canonical error (rather than CLI-side drift). Why exit code 6 for "not implemented yet": distinct from the existing schema (2), local (3), discovery (4), and Phase-2's token-mint (5) codes. Tests assert the exact code so the contract sticks across PR-b's refactor. Tests: spec_test.go Build() ↔ embedded schema contract (the core "this must produce something the validator accepts" pin) walk_test.go Layout + size caps, all happy + sad paths (case-insensitive ext, skip .DS_Store, missing files, byte-sum sanity) pvc_test.go Fake-clientset coverage: happy path, not-found diagnostic, Pending-phase diagnostic, mixed-mode IsReadWriteMany dataset_test.go CLI integration: schema-fail exit 2, walk-fail exit 3, kubeconfig-fail exit 3, cobra args check Locally: vet, test -race -cover, gofmt -s, errcheck — all green. Coverage: push 81.0%, cluster 83.2%, schema 80.7%, cli 52.0%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dataset): drop duplicate pluralS, reuse existing plural helper Cursor Bugbot (Low severity) on PR #8: pluralS() in dataset.go was an exact duplicate of plural() already defined in ingest.go — same cli package, identical body. Removed pluralS, the schema-error diagnostic now calls plural() directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): reject path-traversal table names (Bugbot, Medium) Cursor Bugbot on PR #8 (commit 4240097), Medium severity: the `table` value flows unsanitized into the /data/shared/<table>/ PVC path. The embedded ingest.v1 schema only enforces minLength:1 on `table` — no pattern — so --table=../../etc would resolve (via path-cleaning) to /etc, and PR-b's stage Pod would write outside the intended subtree, potentially clobbering another table's staged data. Fix: - push.ValidateTableName(table): rejects anything not matching ^[A-Za-z0-9_]+$ — the intersection of "valid unquoted MySQL identifier" and "safe single path segment". Called as step 1 of runDatasetPush, before SpecArgs.Build(). - StagedPrefix hardened: drops path.Join (whose ".."-cleaning is the silent footgun) for plain concatenation, and panics if handed an unsafe name. Callers MUST ValidateTableName first; the panic is a defense-in-depth backstop so a PR-b call site that forgets validation fails loudly in tests instead of silently escaping the prefix. - Build() doc gains the precondition note. The upstream half — adding a `pattern` constraint to the schema's `table` field so the helm flow + jobs-manager get the same guard — is filed as tracebloc/data-ingestors#116. Once that lands and the CLI re-syncs the schema, ValidateTableName collapses to a thin wrapper. Tests: - TestValidateTableName_Accepts / _RejectsUnsafe: the security regression pin (../../etc, ../foo, slashes, dots, etc.) - TestStagedPrefix_PanicsOnUnsafeName: the backstop - TestDatasetPush_TraversalTableName_ExitsTwo: CLI-layer, --table=../../etc → exit 2 before any cluster work Locally: vet, test -race -cover, gofmt -s, errcheck — all green. Coverage: push 83.3%, cluster 83.2%, schema 80.7%, cli 52.2%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(push): export HumanBytes, drop the copy in dataset.go Cursor Bugbot on PR #8 (commit 837157f), Low severity: humanBytesForSummary in dataset.go was a line-for-line copy of humanBytes in walk.go — both new in this PR. A future tweak (TiB support, precision change) would likely patch one and not the other, drifting the size shown in an over-cap error from the size shown in the dry-run summary. Fix: export push.HumanBytes as the single implementation; delete the dataset.go copy. Both the over-cap diagnostics and the pre-flight summary now format through the same function. Locally: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): reject a directory named labels.csv in Discover Cursor Bugbot on PR #8 (commit 9915866), Low severity: Discover checked imagesStat.IsDir() for the images/ entry but never the reverse for labels.csv. A directory literally named "labels.csv" passes os.Stat, so the pre-flight would accept it and PR-b's tar stream would later fail confusingly trying to read a directory as a CSV. Added the symmetric labelsStat.IsDir() guard with a clear error. TestDiscover_LabelsCSVIsDirectory pins it. Locally: vet, test -race -cover, gofmt -s, errcheck — all green. Coverage: push 83.8%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset): stage Pod + tar-over-exec stream (PR-b of #151) Completes Phase 3 (tracebloc/client#151). PR-a delivered the no-op-safe pre-flight; this PR-b plugs in the actual file staging — ephemeral Pod, client-go SPDY executor, tar stream, SIGINT-safe cleanup. ## What lands internal/push/ (5 new files): - pod.go Stage Pod spec builder + Create/Wait/Delete lifecycle. Alpine 3.20 pinned by digest, PSA- restricted security context, activeDeadline- Seconds=600 in-cluster self-kill defense-in-depth, random 8-hex-char suffix for parallel-push collision avoidance. - stream.go Executor interface (SPDYExecutor in prod, fakeExecutor in tests). StreamLayout wires a tar.Writer through an io.Pipe to the exec stdin, running `tar -xf - -C /data/shared/<table>` in the Pod. Captures remote stderr into the error surface so customers see "no space left on device" verbatim, not a generic exec failure. - progress.go schollz/progressbar/v3 wrapper, TTY-detected via golang.org/x/term. No-op in CI / non-TTY output so the bar doesn't pollute log streams. - orphan.go Scans for stage Pods labeled managed-by=tracebloc-cli older than 5 min; renders an actionable warning with the kubectl-delete command. Warn-only in v0.1 (auto-cleanup is v0.2 — can't distinguish "crashed previous push" from "still-running parallel push from another workstation"). - stage.go The Stage() orchestrator. Order: orphan scan → CreateStagePod → defer DeleteStagePod with a FRESH context (SIGINT-safe — defers fire even when the parent ctx is cancelled) → WaitForReady → StreamLayout → cleanup. On any error past Create, the deferred delete still runs. internal/cli/dataset.go: - Replace the exit-6 stub with push.Stage(...) wired to cluster-discovered ns/PVC/SA + the new SPDYExecutor. - Add v0.1 category gate (Medium-1 from PR-a self-review). Runs BEFORE schema validation so unsupported-but-schema- valid categories like tabular_classification get the actionable "v0.1 supports only image_classification" message instead of the schema's confusing "missing property 'schema'". - Add --stage-pod-image flag for air-gapped customers. - Update exit-code doc: drop exit 6 (PR-b stub), add exit 7 (staging-step failed, distinct from pre-flight 3/4). internal/push/stage_test.go covers Medium-2 from PR-a review: TestStage_IngestorSANameFlowsToPod pins that the --ingestor-sa override actually lands on the stage Pod's ServiceAccountName. ## Tricky bits Pipe deadlock guard: when the executor returns early (ctx cancel, remote tar dies immediately), the tar-write goroutine would block on pipe.Write forever because nothing reads. Fix: close the pipe Reader after exec.Exec returns, which unblocks the writer with io.ErrClosedPipe. Caught by TestStage_CancelledContext_StillCleansUp. SIGINT safety: the deferred DeleteStagePod uses a fresh ctx with a 30s deadline rather than the (possibly cancelled) parent ctx. Without this, Ctrl-C right after pod-create leaks an orphan Pod. activeDeadlineSeconds=600 is the in-cluster backstop for the truly-pathological case (hard-kill, network partition between laptop and cluster). errcheck cleanup: writeLayoutTar uses a named return + deferred close that promotes a tar trailer-write error if and only if the function otherwise succeeded — silent truncation would be much worse than a noisy error. ## Test plan - [x] go vet ./... — green - [x] go test -race -cover ./... — green (push 79.8%, cluster 83.2%, schema 80.7%, cli 52.2%) - [x] gofmt -s -l . — no drift - [x] errcheck ./... — green - [x] go build — binary builds - [x] Local smoke: --category=tabular_classification → exit 2 with "v0.1 supports only image_classification" message (the new gate's actionable diagnostic) - [ ] Real EKS smoke (manual; out of CI scope) ## Open items deferred Per PR-a review's Low/Nit list: - runDatasetPushArgs.Context shadowing context.Context - printPushPreflight rendering AccessModes with %v - testutil package consolidation for imgcDir / imgcLayout - dataset_test kubeconfig path traversal via subtests - Discover hint for nested-image-dirs These are tracked for v0.2 cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): DNS-1123-safe Pod names + wire SIGINT to ctx (Bugbot, High+Med) Two real bugs caught by Bugbot's re-review on PR-b: ## 1. High — Stage Pod name breaks DNS-1123 BuildStagePodSpec used the raw table name as a Pod name segment. ValidateTableName accepts [A-Za-z0-9_]+ (MySQL identifier rules), but K8s Pod names follow DNS-1123 subdomain rules: lowercase alnum + hyphen, must start/end alnum. The canonical example throughout tracebloc docs (cats_dogs_train, snake_case) would fail Pod create post-pre-flight — worst-of-both-worlds UX (pre-flight says "good!" then create fails). Fix: transform the table name into a DNS-1123-safe segment for the Pod name only — lowercase, _→-, trim leading/trailing hyphens, cap length, fallback to "tbl" for the pathological all-underscore case. The original verbatim name still goes in the tracebloc.io/table label so orphan warnings stay traceable. TestDNS1123SafeTableSegment covers cats_dogs → cats-dogs, MyTable → mytable, _leading_underscore → leading-underscore, _ → tbl, 50-char → 30-char truncation. Each result is cross-validated against an inline DNS-1123 regex check. ## 2. Medium — SIGINT skips Pod cleanup push.Stage documented SIGINT-safe cleanup via context cancellation + fresh-ctx deferred DeleteStagePod, but cmd/tracebloc/main.go called Execute() without signal.NotifyContext. Default Go runtime behaviour on SIGINT is to exit without running defers — so every Ctrl-C during staging leaked an orphan Pod until activeDeadlineSeconds (10 min) fired. The docstring was false advertising. Fix: signal.NotifyContext(ctx, SIGINT, SIGTERM) in main, passed to ExecuteContext. First Ctrl-C cancels the cobra ctx → push.Stage's in-flight HTTP cancels → deferred cleanup runs (with its own fresh ctx, also kept). Second Ctrl-C does normal hard-kill (stop() in defer un-registers the handler) — important if the cleanup itself hangs. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): surface NotFound/Forbidden + propagate rand error (Bugbot) Two new findings from Bugbot's re-review on commit 35ae044: ## Medium — WaitForStagePodReady polled through terminal errors The poll callback returned (false, nil) for EVERY Pods.Get error, which meant Pod-deleted-out-of-band (NotFound) or RBAC-revoked-mid- push (Forbidden) waited the full 60s timeout despite the situation being unrecoverable. Customer-facing impact: 60s of "Waiting for stage Pod to be Ready..." for an error that should surface in <1s. Fix: classify the error. NotFound + Forbidden terminate the poll immediately (return (false, err)); everything else stays transient (network blip, brief API unavailability). The apierrors.Is{NotFound,Forbidden} helpers were already imported from the previous fixes, so this is a single-block change. Two new tests pin the contract via t.Now() bounds — if a regression makes either case transient again, the test waits the full 60s and the assertion catches it. ## Low — BuildStagePodSpec swallowed crypto/rand error `suffix, _ := randomSuffix(4)` — if crypto/rand failed (rare but possible on systems with exhausted entropy), the suffix was empty and the Pod name became `tracebloc-stage-cats-dogs-` with a bare trailing hyphen. DNS-1123 rejects that → opaque API error from kube-apiserver instead of a clear local diagnostic. Fix: change BuildStagePodSpec signature to (*Pod, error) and propagate the rand failure. CreateStagePod (the prod caller) already returns an error; tests use a mustBuildStagePod helper that t.Fatal's on the rare path. 8 test call sites updated via find-and-replace, plus the override-image test by hand. The third inline comment ("SIGINT skips pod cleanup", bug_id 24ab9106) is a stale carry-over of the prior-round finding I already fixed in 35ae044 — same bug_id, GitHub auto-shifted the anchor onto the new commit. Replied in-thread pointing at the fix. Bugbot's review-body confirms "2 NEW findings", not 3. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): close symlink size-cap bypass + Phase=Failed short-circuit Round 4 of Bugbot fixes on PR-b. Two new findings on commit 7baa321 (plus 3 stale carry-overs from earlier rounds — same bug IDs as already-fixed findings, GitHub keeps re-anchoring them onto each new commit; Bugbot's review-body confirms "2 NEW findings"). ## Medium-Security — symlink images bypassed size caps The vulnerability: Discover sized image entries via DirEntry.Info() (Lstat-equivalent: returns the symlink's own ~100-byte size). writeTarFile read them via os.Stat + os.Open (which follow symlinks). So a customer with `images/evil.jpg` symlinked to `/var/log/system.log` (multi-GB) or `~/.ssh/id_rsa` could: 1. Pass Discover's MaxSingleFileBytes + MaxTotalBytes caps trivially (cap is 1 GiB; the symlink itself is ~100 bytes). 2. Stream the target's FULL contents into the cluster PVC, where the cluster admin can read them via `kubectl exec`. This is both a size-cap bypass AND an arbitrary-local-file disclosure. Worse, it can fire unintentionally — a customer who ran `ln -s ~/datasets/big-images images` to share data across projects would silently stream gigabytes during what they thought was a small test push. Fix in three layers: 1. walk.go: os.Stat → os.Lstat for the labels.csv check (so the mode bits include ModeSymlink) + new rejectSymlink() helper called for both labels.csv and each image entry. v0.1 rejects symlinks outright with a clear "v0.1 doesn't allow symlinks (security: ...)" message pointing at `cp -L` for materializing and the v0.2 cloud-source story for distributed data. 2. stream.go: writeTarFile's os.Stat → os.Lstat + symlink check too. Defense-in-depth — Discover is the primary fix, but a future refactor that calls writeTarFile directly (e.g. a new resume-from-partial-transfer code path in v0.2) would re-introduce the hole without this layer. 3. New tests pin both rejections (labels.csv-as-symlink + image- as-symlink). Skipped on Windows where symlinks require admin. ## Medium — WaitForStagePodReady didn't short-circuit on Phase=Failed Companion to the NotFound/Forbidden short-circuit landed in the previous commit. The poll positively-terminated on Ready=True but never negatively-terminated on Phase=Failed (container crashed at startup: PSA rejection, image crashloop, OOM at container init) or Phase=Succeeded (stage container's sleep exited unexpectedly). Same 60s timeout symptom for an outcome that's actually decided in <5s. Fix: in the poll body, check Phase after the conditions loop; return a structured error including the podReadyTimeoutHint output (container-status reason + message) so the customer sees "terminated in phase Failed (OOMKilled — ...)" instead of a generic timeout. New test TestWaitForStagePodReady_FailedPhaseIsTerminal pins the <3s elapsed contract for an OOMKilled startup. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): Lstat images dir + table-len cap + hermetic re-push + progress defer Round 5 of Bugbot fixes on PR-b. The "5 potential issues" review-body count corresponds to 5 NEW findings; the other 5 inline comments are stale carry-overs of bug IDs already fixed in earlier rounds (24ab9106, 15e29a3e, 258869d3, 5c577688, 7fcf137f — GitHub auto-anchors them onto each new commit). ## Medium — Symlinked images/ directory bypassed all checks Round 4 fixed symlinked FILES inside images/ (Lstat on each entry). It missed the case where images/ ITSELF is a symlink — `ln -s /etc images` passed os.Stat (which follows symlinks), the IsDir check succeeded, and the loop walked /etc. Same disclosure pattern as the round-4 finding, one layer up. Fix: os.Stat → os.Lstat on imagesDir + rejectSymlink before the IsDir check. Symmetric with the labels.csv treatment from round 4. New test TestDiscover_RejectsSymlinkedImagesDir pins the rejection. ## Medium — Table label could exceed Kubernetes 63-char limit ValidateTableName accepted [A-Za-z0-9_]+ without a length cap, but the stage Pod's tracebloc.io/table label carries the raw name. K8s label values cap at 63 chars (DNS-1123 label rule), so a 100-char name passed pre-flight then failed Pod creation with an opaque label-validation error. Fix: cap at 63 chars (also matches MySQL's 64-char identifier limit with 1 char of headroom for any future ingestion-run-id suffix). New MaxTableNameLength const + boundary test pin the cap. ## Medium — Re-push left stale PVC files The remote command was `mkdir -p <dest> && tar -xf - -C <dest>`. If a previous push had 3 images and the new push has 2, the PVC ends up with the union — and labels.csv now disagrees with the stage directory, silently breaking ingestion. Fix: `rm -rf <dest> && mkdir -p <dest> && tar -xf - -C <dest>`. Safe because dest = StagedPrefix(table) = /data/shared/<table> and ValidateTableName has ensured `table` is a single safe segment, so rm -rf can only nuke that one per-table subdir, never the parent /data/shared or sibling tables. TestStreamLayout_RemoteCommand updated to assert the rm AND its ordering before mkdir. ## Low — Progress bar not Finish'd on early Stage failure Stage's deferred Finish lives inside StreamLayout, so a failure earlier in the lifecycle (CreateStagePod, WaitForStagePodReady) returned without clearing the TTY bar. Visual artifact on the customer's terminal after a Pod-create failure. Fix: defer progress.Finish() at the construction site in runDatasetPush. Schollz Finish is idempotent so double-call on happy path is a no-op. ## Hard-link bypass — documented, not fixed Bugbot also flagged that hard links to outside files aren't rejected. Filesystem mode bits don't distinguish a hard link from a regular file the way ModeSymlink distinguishes symlinks, and a high-Nlink check has too many false positives. The implicit security boundary is the CLI's process-level read permissions: a customer can only hard-link files they already have read access to, so this isn't a privilege escalation. v0.2 may add openat(O_NOFOLLOW) sandboxing if customers need harder isolation. Documented in rejectSymlink's docstring as a known limitation alongside the v0.2 plan. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): 30m activeDeadline + portable tar paths (Bugbot round 6) Round 6. Review-body says "2 NEW + 3 unresolved" — Bugbot itself is flagging the 9 carry-overs at this point. Two new findings, both real: ## High — Pod activeDeadlineSeconds (600s) cut too close The deadline starts at Pod CREATION, but image pull (up to 60s) + WaitForStagePodReady ceiling (60s) + worst-case stream (1 GiB at 2 MB/s ≈ 8.5 min) eat the budget. Near-cap customers on slow uplinks could see the kubelet terminate mid-transfer with no useful diagnostic. Fix: bump StagePodActiveDeadline from 600 → 1800 (30 min). Budget breakdown in the const's docstring; comfortable margin for variance. Cost is "an idle alpine Pod with sleep idles for ~22min after a successful push" — ~5 MiB cluster RAM, zero CPU, deleted seconds later by the defer'd cleanup anyway. Test pin: TestStagePodActiveDeadline_CoversFullLifecycle asserts the floor at 1500s so a future regression to 600 is caught. ## Medium — Windows tar paths used backslashes `filepath.Join("images", filepath.Base(abs))` produces `images\file.jpg` on Windows. USTAR / POSIX tar requires forward slashes; the Linux stage Pod's `tar -xf` either rejects or extracts as a flat-named file (collapsing the images/ subdir the ingest spec expects). Breaks the Windows-built CLI. Fix: switch to `path.Join` for the tar HEADER name. Keep `filepath.Join` everywhere else (where the OS-native separator is the right thing). Test pin TestStreamLayout_TarPathsAreForwardSlash asserts no backslashes in any tar entry name, regardless of host OS. ## Carry-overs Bugbot's review body now explicitly says "There are 3 total unresolved" — it knows the carry-overs aren't on the new commit anymore. Counts I'm tracking as known-fixed across earlier rounds: 24ab9106 (SIGINT, r3), 15e29a3e (Pod-Get spin, r3), 258869d3 (rand error, r3), 5c577688 (Failed Pod, r4), 7fcf137f (symlink files, r4), de426248 (hard links, r5 — documented limitation), cd462de9 (symlink dir, r5), a8e9e5c7 (label len, r5), 0b296807 (progress finish, r5). Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): wait-error labels + orphan-scan precision + transactional re-push Round 7 of Bugbot fixes on PR-b. Review-body confirmed "4 NEW + ...". All four real, all four fixed: ## Medium — Wait errors mislabeled as timeout WaitForStagePodReady wrapped EVERY poll exit (NotFound, Forbidden, Phase=Failed, ctx.Canceled from SIGINT) in "did not become Ready within 60s." Misleading: the customer who hit Ctrl-C 2 seconds into the wait would see a "timed out" diagnostic. Fix: branch on errors.Is(err, context.DeadlineExceeded). True timeout keeps the "did not become Ready within %s" wording; early-exit cases surface as "did not reach Ready state: %w" with the actual cause front and center. ## Medium — Orphan-delete hint would nuke parallel pushes FormatOrphansWarning's kubectl-delete command used the `managed-by=tracebloc-cli,component=stage-pod` label selector, which matches every stage Pod in the namespace including LEGITIMATE RUNNING ONES from parallel pushes (other workstations, or even this one's just-started Pod). A customer copy-pasting the suggested cleanup could silently kill someone else's in-progress push. Fix: list specific orphan Pod names in the delete command — `kubectl delete pod -n <ns> <name1> <name2> ...`. Test regression- pins the absence of the label-selector form. ## Medium — Re-push deleted before transfer succeeded Round 5's hermetic-re-push fix (rm -rf $DEST && mkdir + tar) satisfied "no stale files" but not "preserve on failure." A tar mid-stream failure (Ctrl-C, network drop, container OOM) left the customer with NOTHING on the PVC — the previous push's data already nuked, the new push aborted before completing. Fix: extract to <dest>.staging, swap on success: rm -rf <dest>.staging # cleanup any prior failure mkdir -p <dest>.staging && tar -xf - -C <dest>.staging rm -rf <dest> && mv <dest>.staging <dest> # swap on success rm -rf <dest>.staging # defensive cleanup The shell's && sequencing means swap only fires if tar succeeded. Lost the prior `exec /bin/tar` micro-optimization (can't exec mid-chain) — fine, the shell process is tiny. The window between the destination rm and the mv is sub-ms; closing it fully would need a double-mv (old/new/cleanup) which is v0.2 territory. stream_test.go pins: - mkdir-p of .staging happens - tar extracts to .staging (not directly to dest) - mv from .staging to dest exists - tar happens BEFORE any rm of $DEST (transactional property) - no rm of the parent /data/shared (sibling-table safety) ## Medium — Orphan scan flagged active pushes FindOrphanStagePods's only filter was age > 5min. But pod.go itself budgets ~8.5 min for a 1 GiB transfer — a legitimately-running near-cap push would be mislabeled as orphan by the same customer's next concurrent push, and the (newly-fixed) delete hint would now correctly target that specific running Pod by name, killing it. Fix: skip Phase=Running pods entirely. A Running stage Pod is presumed to be doing real work; activeDeadlineSeconds is its cluster-side safety net. Pods in non-Running phases past grace (Pending stuck on image pull, Failed from crash) still flag — those are the genuine orphan shapes. Two new tests: - TestFindOrphanStagePods_SkipsRunningPods: 30-min-old Running Pod doesn't surface as orphan - TestFindOrphanStagePods_FlagsNonRunningPastGrace: 30-min-old Failed Pod DOES surface (regression guard — narrowing the filter should reduce false positives, not eliminate the warning entirely) Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): unique staging suffix + stream-time size-cap recheck Round 8 of Bugbot fixes. Two new findings on commit 42c5e93, both real: ## Medium — Parallel pushes corrupted staging Two concurrent `dataset push` runs for the same `--table` shared one `$DEST.staging` path on the PVC. Each Pod's `rm -rf $DEST.staging`, `tar -xf -`, `mv` sequence raced against the other, with no coordination. Worst case: push A's `rm -rf` wipes push B's mid-extraction state, B's tar then writes into a partially-removed dir, customer ends up with an interleaved- corrupted-and-half-committed dataset. Fix: each invocation generates a fresh 8-hex-char suffix for the staging dir name (`$DEST.staging-<8hex>`), reusing pod.go's randomSuffix. Two parallel pushes now have distinct staging dirs — no interleaved-write hole. The FINAL `rm $DEST && mv $STAGING $DEST` step is still last-write-wins under concurrent commits, but that's an acceptable v0.1 semantic (concurrent pushes for the same table are inherently undefined; whichever commits last "wins" with a COHERENT dataset, not an interleaved one). New test TestStreamLayout_StagingSuffixIsUniquePerInvocation pins the contract by calling StreamLayout twice and asserting distinct staging suffixes. ## Medium — Stream skipped size-cap enforcement (TOCTOU) Pre-flight Discover checked MaxSingleFileBytes + MaxTotalBytes, but writeTarFile / writeLayoutTar streamed files later via os.Lstat → io.Copy with NO re-check. A file that grew between pre-flight and stream (legitimate dataset prep racing the push, or adversarial overwrites) would silently upload past the 500 MiB / 1 GiB advertised caps. Fix in three layers: 1. writeTarFile re-checks the single-file cap (os.Lstat size vs MaxSingleFileBytes) right before WriteHeader. A file that grew gets a clear "exceeds v0.1 single-file cap" error using the same sizeError formatter Discover uses. 2. writeTarFile now returns (int64, error) — the declared header size. writeLayoutTar accumulates this into a running total and aborts mid-stream if it exceeds MaxTotalBytes. Tests added below to pin the running-total contract. 3. io.Copy → io.CopyN(tw, f, st.Size()) caps the actual byte transfer at the declared header size. Without the cap, a file that grew between Lstat and io.Copy would overflow the tar header — header says N bytes, body has > N → tar archive corruption. CopyN-and-trust truncates instead. Closes both the metadata-side (header size) and body-side (byte stream) halves of the stream-time TOCTOU window. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): hoist randomSuffix above goroutine + clean orphan staging dirs Round 9 of Bugbot fixes. Both findings real: ## Medium — Tar goroutine could deadlock on suffix error The randomSuffix call landed AFTER the tar-write goroutine spawned. If crypto/rand failed (rare but possible on entropy-starved systems — same scenario the round-3 BuildStagePodSpec fix handles), the function returned without closing the pipe or waiting on the goroutine. The goroutine would then block once the ~64 KB pipe buffer filled, leaking forever. Fix: hoist randomSuffix (and the entire remoteCmd composition) ABOVE the io.Pipe + goroutine setup. The function now bails on suffix failure before touching the pipe at all, so there's nothing to deadlock on. ## Medium — Failed pushes leaked .staging-<hex> dirs Round 8's unique-per-invocation suffix fixed parallel-push interleaving but created a new leak: if THIS push fails before the final `mv` step, the .staging-<hex> dir lingers on the PVC. Round 8's `rm -rf $STAGING` at the start only cleans up THIS invocation's path, not the previous-failed one. Repeated failed pushes accumulate unbounded storage on the shared PVC. Fix: append a defensive cleanup pass to the remote script: find $(dirname $DEST) -maxdepth 1 -name "$(basename $DEST).staging-*" \\ -mmin +60 -exec rm -rf {} + 2>/dev/null || true Constraints baked into the pattern: - -mmin +60 (1 hour) is 2x activeDeadlineSeconds — anything older HAS to be from a dead stage Pod, so we can't race with a parallel push's in-progress staging - -name pattern scopes to THIS table's staging siblings only; other tables' .staging-* dirs are none of our business - 2>/dev/null || true — find failures don't fail the whole stream (defensive cleanup is best-effort, the customer's actual push already committed before this runs) - Uses ';' (semicolon) instead of '&&' so it runs even if the main push sequence failed somehow — orphan cleanup should happen regardless Updated TestStreamLayout_RemoteCommand asserts the find pattern is present with the exact -mmin window and table-scoped -name. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): backup-and-rollback dir swap (Bugbot r10) Bugbot's tenth-round single finding: the previous r7 pattern of `rm $DEST && mv $STAGING $DEST` had a small but real failure window — if the rm succeeded and the mv failed (ENOSPC, fs-level rename error, transient kernel weirdness), the customer's previous dataset was already gone and the new data was stranded under .staging-<hex> where the ingestor wouldn't see it. Fix: backup-and-rollback. Rename the existing $DEST to a unique $DEST.old-<hex> sibling FIRST; only then mv $STAGING into $DEST; on the latter's failure, restore the backup. On success, drop the backup. { [ -e $DEST ] && mv $DEST $DEST.old-<hex> || true; } && { mv $STAGING $DEST || { [ -e $DEST.old-<hex> ] && mv $DEST.old-<hex> $DEST; exit 1; }; } && rm -rf $DEST.old-<hex> Properties this gives us: - tar fails: $DEST untouched (transactional from r7) - backup mv fails: $DEST untouched - main mv fails: backup at .old-<hex> survives; customer recovers via `kubectl exec -- mv .old-<hex> $DEST` - final rm fails: just leaves an .old-<hex> cruft, picked up by the r9 find -mmin +60 sweep (extended to also catch .old-* siblings) - first-ever push (no pre-existing $DEST): `[ -e $DEST ] && ...` short-circuits, backup mv silently skipped via `|| true` The .old-<hex> suffix reuses the same random as .staging-<hex> so two parallel pushes can't collide on the backup name. Both .staging-* AND .old-* now flow through the orphan-cleanup find pattern (with the `-o` alternation), so r9's leak-prevention covers both halves of the swap. Test updates pin: - Backup mv (DEST → .old) appears BEFORE primary mv (.staging → DEST) — rollback contract - Backup and staging suffixes MUST agree (rollback-target correctness) - find pattern includes both -name "...staging-*" and -name "...old-*" Local: vet, test -race -cover, gofmt -s, errcheck — all green. This is round 10. The new-finding rate has been decaying: 5→2→2→2→1. We're approaching the asymptote. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(push): propagate remote-script failures + prefer tar error (Bugbot r11) Two new findings on r10's commit, both real: ## Medium — Remote script masked failures via POSIX `;` + `|| true` The previous shell used `&&-chain ; find ... || true`. POSIX `;` runs the next command regardless of prior exit status, and the trailing `|| true` then forces exit 0. A failed tar or mv earlier in the && chain would set $? to non-zero, then `;` would clobber it back to 0 via the find's `|| true`. Result: a remote push that actually failed would return exit 0 to the exec subprocess, and StreamLayout would happily report "Staged N files" on a catastrophic failure. Fix: rewrite the script using `set -e` + explicit newline- separated statements + an explicit `if ! mv; then rollback; exit 1; fi` for the swap. With set -e: - Any non-guarded non-zero exits the script with that status. - `cmd || true` continues to suppress (find cleanup stays best-effort). - `if ! cmd; then ...; fi` is treated as guarded, so set -e doesn't preempt the explicit rollback handler. Multi-line shell-c args work across busybox sh (alpine 3.20), dash, and bash equally. No portability regression. ## Medium — Stream error masked the upstream tar error After r8 added stream-time MaxSingleFileBytes / MaxTotalBytes rechecks in writeLayoutTar, the LOCAL tar build can legitimately fail mid-stream. When it does, the goroutine's CloseWithError propagates to the pipe reader; exec sees broken-pipe and returns a generic "exec stream against ns/pod failed" error. The previous code checked streamErr FIRST and returned the exec-flavored framing. Customer saw "streaming files failed" instead of the actually-actionable "dataset exceeded v0.1 total cap of 1.00 GiB after streaming ..." diagnostic from the tar side. Fix: swap the order. Check tarErr first — when both are non-nil, the tar side is the upstream cause; the broken-pipe streamErr is downstream noise. streamErr-only (no tarErr) is still the real network/RBAC/remote-tar case and gets surfaced with the exec wording. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Round 11 finding rate: 2 new (1+2+5+2+4+2+2+1+2 → still trending roughly downward, though not monotone). One more clean round and we're at the asymptote. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: doc/cleanup cross-cutting drift from 11 rounds of bugbot fixes Independent review pass over PR #9 (post-bugbot-r11). Five findings, none caught by bugbot — all cross-cutting doc/style: 1. stream.go: dead comment block above remoteCmd composition. The block was a holdover from the && era; it claimed "&& sequencing means the swap only fires if tar succeeded" but we switched to `set -e` in r11. The canonical pipeline doc lives below the composition; collapsed the upstairs block into a single semantic- guarantees summary (HERMETIC/TRANSACTIONAL/PARALLEL-SAFE/EXIT- FAITHFUL) and removed the contradictory && reference. 2. stage.go: StageOptions docstring referenced a non-existent `OrphanLogger` field. Stale from an early design where orphan warnings were emitted via a logger callback rather than the integrated FormatOrphansWarning we ended up with. Corrected to the actual nil-able fields (Progress, Out). 3. orphan.go: OrphanGracePeriod comment claimed "5 minutes is generous: a healthy stage Pod is fully done in ~30 seconds... under a couple minutes at the 1 GiB cap." That directly contradicts pod.go's StagePodActiveDeadline budget (~8.5 min for a 1 GiB transfer at 2 MB/s, plus image pull, plus ready wait). Rewrote to reflect the post-r7 semantic: Running Pods are never flagged regardless of age; the 5-min grace targets the Pending/Failed/Unknown shapes that are genuinely stuck. 4. orphan.go: joinNames was a 9-line wrapper around `strings.Join( names, " ")`. The comment justifying it ("single point of change for tests") doesn't hold — the tests assert on the final output string, not on this helper's surface. Inlined. 5. cli/dataset.go: runDatasetPush docstring still said "the PR-a slim implementation. It performs every pre-flight check... the actual file staging is gated behind a clear 'not yet implemented' error." PR-b (this PR) actually implements the staging — the doc lagged the code. Updated to reflect the complete Phase 3 flow. No behavioral change. Tests + lint green. The doc fixes matter because future readers (humans and the next bugbot pass) get contradictory signals when the comments disagree with the code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…52) (#10) * feat(dataset): Phase 4 — submit to jobs-manager + watch + summary (#152) Closes the v0.1 happy path. After Phase 3 stages files on the PVC, Phase 4 POSTs the ingest spec to jobs-manager, watches the spawned ingestor Job, and renders the parsed INGESTION SUMMARY panel. Customer sees a single end-to-end flow: $ tracebloc dataset push ./cats-dogs --table=cats_dogs_train ... Local dataset: ... Target cluster: ... Synthesized ingest spec: ... Created stage Pod tracebloc/tracebloc-stage-... Streaming N files (X MiB) to /data/shared/cats_dogs_train/... Staged N files to /data/shared/cats_dogs_train Submitted: jobs-manager spawned ingestor Job tracebloc/ingestor-... Streaming logs from Job tracebloc/ingestor-...: [ingestor's own log output streams here, verbatim] ┌─ Ingestion summary ──────────────────────────┐ │ Ingestor ID: run-abc │ │ Total records: 1,234 │ │ Inserted: 1,200 │ │ ... etc │ └──────────────────────────────────────────────┘ ## What lands internal/submit/ (5 new files): - body.go SubmitRequest struct (ingest_config + idempotency_key + optional image_digest); BuildRequest auto-generates a 16-byte hex idempotency key per invocation unless --idempotency-key overrides. - client.go HTTPSubmitter with bearer-token auth; 4xx/5xx surface jobs-manager's body verbatim via SubmitError; IsAuthError(err) for the 401/403 exit-code branch. - watch.go Poll-based watch (per the design discussion): wait for the Job's Pod to exist + Running, stream logs via GetLogs(Follow=true), terminal poll on Job conditions to distinguish Succeeded/Failed/Unknown. - summary.go Streaming parser for the 📊 INGESTION SUMMARY banner. Strips ANSI codes, accumulates fields by prefix match, finalizes on the closing ═-rule. RenderPanel formats the parsed Summary as a box-drawn panel. - submit.go Run() orchestrator: BuildRequest → POST → announce (Spawned vs Replayed) → either --detach exit or watch loop. SIGINT during watch produces Outcome=Detached with a kubectl-logs reconnect hint. internal/cli/dataset.go: wired after push.Stage. Mints a 1-hour SA token (vs cluster info's 10 min — the watch loop can run that long). Adds --detach, --idempotency-key, --image-digest flags. Maps Submit + Watch outcomes to exit codes 5/8/9. ## Design choices (per pre-build discussion) - Poll every 2s for Job status (not client-go Watch). Matches the Phase 3 WaitForStagePodReady pattern; simpler test surface, no long-lived connection to babysit. ~30 API calls/min is negligible. - SIGINT-after-201 → graceful detach. The cluster has already accepted the run; Ctrl-C just stops watching. Customer sees the kubectl logs reconnect hint and exit 0. Matches kubectl logs behavior. ## Exit codes (full set, v0.1) 0 — success (or --detach success: 201 came back clean) 2 — schema / v0.1-unsupported-category 3 — local-layout or kubeconfig 4 — cluster discovery (parent release or shared PVC missing) 5 — SA token couldn't be obtained, OR jobs-manager 401/403 7 — Phase 3 stage step failed 8 — jobs-manager 4xx/5xx other than auth 9 — ingestion Job exited non-zero, OR summary reports row failures ## Tests submit/ ≥85% line coverage across 5 files. Per-file: - body_test.go idempotency-key generation, override, uniqueness, JSON shape pinning (image_digest omitempty) - client_test.go httptest.Server: 201 happy path, replay path, 4xx body framing, 401/403 → IsAuthError, 5xx ≠ auth, missing-job_name, network drop, ctx-cancel - summary_test.go Real ingestor banner end-to-end, byte-by-byte streamed feed, no-banner path (early failure), post-banner lines ignored, ANSI stripping, panel rendering, comma formatting - watch_test.go Fake clientset: Running Pod surfaces, fast- completion path, Forbidden short-circuits, no-Pod timeout, Job Conditions to outcome, parserWriter contract - submit_test.go Run orchestrator: --detach happy path, replay, SubmitError propagation, request fields plumb through, nil-Out defaults to Discard, IsAuthError Local: vet, test -race -cover, gofmt -s, errcheck — all green. Phase 5 (#153: GitHub Releases, Homebrew tap, install.sh) is the last item on the v0.1 epic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): enforce JobWatchTimeout + flush parser + typed watch errors + Unknown exit code Four Bugbot findings on PR #10's initial commit, all real: ## Medium — JobWatchTimeout was declared but never enforced The constant existed and documented "absolute cap on a single dataset-push's watch phase," but WatchJob never used it. Per-step timeouts (PodReadyTimeout, finalJobStatus's 30s) bounded sub-loops, but the log-stream phase ran indefinitely against ctx.Done() with no overall cap. A stuck ingestor would hang the CLI forever. Fix: context.WithTimeout(ctx, JobWatchTimeout) at the top of WatchJob. Cancel via defer. Cap fires as ctx.DeadlineExceeded through the inner loops. ## Medium — Watch failures used the submit exit code submit.Run wrapped both POST errors AND watch errors in the same return path. cli/dataset.go's switch mapped IsAuthError → 5 and "everything else" → 8 (submit-failed). But waitForJobPod / streamPodLogsAndParse / finalJobStatus failures aren't submit failures — jobs-manager already accepted the run, the cluster is doing the work, the CLI just lost the ability to follow along. Exit 8 misled customers into thinking the submit was rejected. Fix: introduce a WatchError type wrapping watch-phase errors; add IsWatchError predicate. cli/dataset.go now routes: IsAuthError → 5 (token / auth) IsWatchError → 9 (ingest-side; cluster keeps running) else → 8 (submit-side rejection) Test TestIsWatchError pins the typing including wrapped errors. ## Low — Log parser didn't flush partial last lines streamPodLogsAndParse used bufio.Scanner for line-based stream display, but the SummaryParser's own internal buffer (for chunked-write line assembly) was never flushed at end-of-stream. If the Pod terminated without a trailing '\n' on its final stdout write — rare but possible if the container's stdout was killed mid-write — the final line would be lost, potentially including the closing ═-rule that finalizes the banner. Fix: parser.FlushLine() at end-of-stream AND on non-EOF scanner errors. ## Medium — Outcome=Unknown exited zero The exit-code switch only handled Failed + Succeeded. Outcome= Unknown (returned by finalJobStatus when the 30s post-stream poll doesn't see a terminal condition) fell through to the default `return nil` → exit 0. But "we don't know if it worked" is not success. Fix: add an explicit Unknown branch that exits 9 with a diagnostic pointing the customer at `kubectl get job ...` to check the actual outcome. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): zero-metric finalize + namespace check + fresh ctx for final status Round 2 of Bugbot fixes on PR #10. Four findings, all real — three are second-order effects of the r1 fixes: ## Medium — Zero-metric banner never finalized The hasParsedAnyField check (which gates the closing ═-rule's finalization) required ANY metric counter to be > 0. But an ingestion that ran-then-died-early can legitimately produce a banner where ALL counts are 0 — the ingestor still prints the structure, just with zeroes. The parser would never finalize on that case, leaving the parser open to perturbation from post-banner shutdown logs. Fix: track a dedicated sawAnyField bool that latches on ANY successful field-line parse, regardless of the value. The finalization check now keys off "did we see any banner field" instead of "did we see a non-zero one." ## Medium — Missing submit response namespace not rejected client.go rejected 2xx responses with no job_name but happily accepted responses with no namespace. A malformed-but-parsing response would then drive the watch loop's k8s API calls at the empty namespace, plus produce a broken `kubectl logs` reconnect hint. Fix: parallel check for empty Namespace, same error shape as the job_name check. ## Medium — Watch timeout starved finalJobStatus R1 wrapped the WHOLE WatchJob in ctx.WithTimeout(JobWatchTimeout) to enforce the 1-hour cap. Side effect: finalJobStatus (which runs AFTER log streaming) inherited the SAME ctx — so a 59-minute log stream left finalJobStatus with 1 minute of budget, and a 60-minute stream left it with 0. Successful slow ingestions misreported as Unknown → exit 9. Fix: split the ctx hierarchy. - customerCtx — original input, carries SIGINT cancellation - watchCtx = WithTimeout(customerCtx, JobWatchTimeout) — caps the pod-wait + log-stream phases - finalCtx = WithTimeout(customerCtx, 30s) — fresh budget derived from customerCtx, not from watchCtx The customerCtx-vs-watchCtx distinction also fixes the detach detection: check customerCtx.Err() not ctx.Err() so a JobWatchTimeout expiry doesn't masquerade as a SIGINT-Detach. ## Medium — SIGINT after logs yielded exit 9 After log streaming ended cleanly, Ctrl-C during the brief finalJobStatus poll returned a watch error → exit 9. But the contract is "post-201 SIGINT = graceful detach, exit 0" — the cluster already has the work, the customer is just done watching. Fix: if customerCtx is Canceled inside the finalJobStatus error branch, return Outcome=Detached with the same reconnect-hint path the log-stream-cancel case uses. Consistent detach semantics across both watch sub-phases. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): port-forward to jobs-manager from off-cluster (Bugbot r3) Major architectural finding from Bugbot's third PR #10 review: the CLI runs on the customer's laptop (or off-cluster CI), but the discovered JobsManagerService URL is a *.svc.cluster.local name that the laptop can't resolve. The submit POST would always fail. The original Phase 4 design quietly assumed in-cluster network reachability — wrong for the dominant v0.1 use case. The chart's existing flow doesn't have this problem because the helm hook runs as a Pod IN the cluster. ## Fix: in-process port-forward, same mechanism as kubectl pf New internal/submit/portforward.go: - PortForwardJobsManager opens an SPDY tunnel via the customer's kubeconfig-authenticated apiserver connection. - pickServicePod resolves the jobs-manager Service to a Running Pod backing it (Services aren't directly port-forwardable; client-go's portforward API speaks Pod names). - ForwardedConnection.Close tears it down on defer; idempotent so the orchestrator's defer-Close + any inner cleanup don't risk a double-close panic. Wiring in cli/dataset.go: - After token mint, open the port-forward to JobsManagerServiceName:JobsManagerPort (8080 today). - Defer Close — runs on every path, including the SIGINT-detach + error paths from submit.Run. - HTTPSubmitter targets http://localhost:<allocated-port>; the in-cluster URL is no longer used at all from the CLI. cluster/discover.go: expose the bare Service name + port alongside the existing FQDN URL. Phase 2 still constructs the FQDN for the diagnostic output of `cluster info`, but Phase 4 needs the unqualified pieces for the port-forward setup. ## Tests internal/submit/portforward_test.go covers the pickServicePod resolution (5 cases: happy path, skip-non-Running, no-matching- Pod, missing-Service, selector-less). The full port-forward loop needs a real apiserver — covered by EKS smoke. internal/cluster/discover_test.go: updated the struct-literal comparison in TestDiscoverParentRelease_HappyPath to include the two new fields. Local: vet, test -race -cover, gofmt -s, errcheck — all green. ## What we got wrong in the initial design Designing Phase 4, I read the chart's existing flow (which runs in-cluster) + assumed the CLI could use the same URL. That's exactly the "cargo-cult the URL" mistake. The right framing would have been: "Phase 4 needs to reach jobs-manager from the laptop — what mechanism does kubectl have for that?" → port- forward. The fix is correct + well-tested where unit-testable. The full loop will get exercised on the EKS smoke test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): pick most-recent useful-phase Pod, not items[0] (Bugbot r4) Bugbot's fourth PR #10 finding: waitForJobPod picked pods.Items[0] from an unordered List API response. A Job with backoffLimit > 0 (or any scenario where jobs-manager re-spawned the Pod for retry, node drain, etc.) can have multiple Pods bearing the same `job-name=<j>` label. Picking items[0] could grab the old Failed Pod from a prior retry while the current Running one waits — the CLI then streams stale logs and reports the prior-attempt outcome. Fix: iterate all Pods, prefer the most recent (by creationTimestamp) in a useful phase (Running | Succeeded | Failed). Pending Pods don't count — no log stream yet, so we keep polling rather than "return a Pending pod's name." Behaviorally: - Single Pod (common): same as before — that one Pod returns. - Old Failed + new Running: returns the Running. Correct. - Old Failed + new Pending: returns nothing yet; keeps polling until Pending → Running. Correct. - Multiple Running (parallelism > 1): returns the most recent — defensible choice; either would work for log streaming. Two new tests: - TestWaitForJobPod_PicksMostRecentNotFirst: seeds an older Failed + newer Running; asserts the Running's name returns. - TestWaitForJobPod_AllPendingKeepsPolling: seeds an only- Pending Pod; asserts DeadlineExceeded on a tight ctx (no early return with the Pending name). Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): Pod-wait timeout = Detached (not failed); non-blocking errCh send Round 5 Bugbot findings on PR #10: ## Medium — Pod-wait timeout reported ingest failure waitForJobPod times out after PodReadyTimeout (5min) if the ingestor Pod doesn't reach Running — slow image pull, scheduling backlog, PSA still rejecting. The previous flow wrapped that as WatchError → exit 9 ("ingestion failed"). That's a false positive: jobs-manager already accepted the submit, the cluster will run the ingestion (eventually), the CLI just gave up observing within the timeout. Fix: distinguish errors.Is(err, context.DeadlineExceeded) from waitForJobPod and map to Outcome=Detached (same semantic as SIGINT-mid-watch). Customer sees the kubectl-logs reconnect hint and exits 0. The submit was successful; the ingestion runs. New test TestWatchJob_PodWaitTimeoutMapsToDetach pins the contract via a tight outer ctx that triggers DeadlineExceeded through the same code path. ## Medium — Defensive non-blocking errCh send in portforward.go Bugbot's read: "unbuffered handoff can block the goroutine indefinitely." False on the literal code (errCh is make(chan error, 1), buffered) — but the spirit of the concern is reasonable: a future refactor that adds a second send path, or changes the buffer size, could re-introduce the leak. Fix: keep the cap-1 buffer AND wrap the send in a non-blocking select with default. Two layers of safety, so the goroutine never blocks regardless of how the channel is used downstream. Closes out the finding structurally rather than relying on the code reader to spot that the buffer was already sized correctly. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): JobWatchTimeout-during-stream = Detached, not exit 9 Bugbot PR #10 r6 caught the inconsistency r5 introduced: the PodReadyTimeout path was mapped to Detached (commit 0b02cd4), but the SAME JobWatchTimeout constant capping the log-stream phase still mapped to exit 9 ("ingestion failed") if it expired during streaming. Both should mean the same thing: "the cluster keeps running; the CLI just gave up observing." Fix: accept context.DeadlineExceeded as well as context.Canceled in the post-stream branch. If watchCtx (the 1-hour-capped one) expired but customerCtx (the SIGINT-aware parent) didn't, that's the JobWatchTimeout-hit case — map to Detached, same UX as the SIGINT path. Customer sees the kubectl-logs reconnect hint and exits 0. The watchCtx-vs-customerCtx distinction (from r3) is what lets us tell SIGINT from JobWatchTimeout at this branch point — without it, the only way to detect timeout-vs-cancel would be the inner error. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(submit): per-reason detach message (Bugbot r7) R5 + r6 expanded "Detached" to cover PodReadyTimeout and JobWatchTimeout in addition to the original SIGINT case, but the orchestrator's diagnostic still said "Detached on signal" for all three. Customers who hit the 5-min Pod-ready cap or 1-hour watch cap aren't pressing Ctrl-C; the framing was misleading. Fix: WatchResult gains a DetachReason field (DetachReasonSignal, DetachReasonPodWaitTimeout, DetachReasonWatchCap, or None). Every detach branch in WatchJob sets the appropriate reason. submit.Run's print-detach block now switches on the reason: Signal → "Detached on signal." PodWaitTimeout → "Detached: the ingestor Pod didn't reach Ready within the observation window..." WatchCap → "Detached: the watch window (1 hour) elapsed while the ingestion was still running." Common reconnect hint (`kubectl logs -f`) follows in all cases. The customerCtx-vs-watchCtx distinction (introduced in r3) makes the reason classification trivially correct: customerCtx.Canceled takes precedence over watchCtx.DeadlineExceeded if both fired, so a customer hitting Ctrl-C exactly at the 1-hour cap still gets the "on signal" message. Local: vet, test -race -cover, gofmt -s, errcheck — all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormula (#153) (#11) * feat(release): GitHub Release workflow, install scripts, Homebrew formula (#153) Closes the v0.1 epic (#147). Phase 5 is the distribution infrastructure that makes the CLI installable without `go install`. ## What lands in this PR ### .github/workflows/release.yml Triggered by `v*.*.*` tags (and workflow_dispatch for manual re-runs). Two-job pipeline: - `release` (matrix): cross-compiles linux/{amd64,arm64}, darwin/{amd64,arm64}, windows/amd64. Each binary gets a cosign keyless OIDC signature (.sig + .cert), per-platform SHA256, and a stamped version string via -ldflags. - `publish`: aggregates SHA256SUMS, stages the install scripts, creates the GitHub Release with auto-generated notes + every artifact attached. prerelease=true when the tag contains a hyphen (e.g. v0.1.0-rc1). A third `bump-homebrew-tap` job is wired but gated `if: false` until the tracebloc/homebrew-tap repo + HOMEBREW_TAP_TOKEN secret are set up. The path is: flip the gate, secret is added, formula gets updated automatically on each tag. RELEASE_CHECKLIST documents the one-time setup. ### scripts/install.sh POSIX-compatible (tested against dash, busybox sh, bash) — the customer's distro might not have bash. Detects OS+arch, resolves the latest tag via the /releases/latest redirect-trail (no rate-limited API call), downloads binary + SHA256SUMS, verifies the checksum, optionally verifies cosign signature if cosign is on PATH, installs to /usr/local/bin (falls back to ~/.local/bin with PATH advice if not writable). One-liner: `curl -fsSL https://github.com/tracebloc/cli/releases/latest/download/install.sh | sh` ### scripts/install.ps1 PowerShell 5.1+ (ships with Windows 10 21H1+). amd64 only for v0.1; arm64 errors with a clear "file an issue" pointer. Strict-mode + halt-on-error so half-installs don't happen. Same SHA256 + cosign verification surface as install.sh. Installs to $LOCALAPPDATA\Programs\tracebloc and PATH-adds at user scope. One-liner: `irm https://github.com/tracebloc/cli/releases/latest/download/install.ps1 | iex` ### scripts/homebrew-formula.rb.tmpl Formula template the bump-homebrew-tap job renders per release. Substitutes __VERSION__, __TAG__, and four per-platform __SHA_*__ placeholders via sed (no Ruby in the runner needed). Ships pre-built binaries rather than building from source — the k8s.io transitive dep tree is too heavy to build on customer laptops. ### scripts/RELEASE_CHECKLIST.md Per-release on-call doc. Distinguishes "what's automated" (everything in the workflow) from "what needs one-time setup" (homebrew-tap repo, eventually install.tracebloc.io DNS) from "what's manual per release" (the actual tag push + sanity-test of each install path). ## Hosting choice GitHub Release raw URLs (per user decision). Zero infrastructure needed; the bootstrap URL works the day v0.1.0 ships. install. tracebloc.io is a v0.2 follow-up via CNAME / Cloudflare worker. ## Out of this PR - Creating the tracebloc/homebrew-tap repo (separate; documented in RELEASE_CHECKLIST.md) - DNS for install.tracebloc.io (v0.2) - Tagging v0.1.0 (post-merge, after the Phase 6 dogfood) Local: vet, test -race -cover, gofmt -s, errcheck — all green. After this merges, the v0.1 epic (#147) is mechanically complete — `tag v0.1.0` produces the full release pipeline output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(install.sh): no sha256sum/shasum = hard error, not silent skip Bugbot PR #11 round 1: the previous flow printed "✓ checksum matches" unconditionally after the verify block, even when neither sha256sum nor shasum was on PATH (the no-tool branch set actual="" and the mismatch check guarded on -n "$actual", so it was silently skipped). Two failures in one: 1. Misleading log: claimed verification succeeded when it didn't run. 2. Security regression: an attacker could MITM the binary and the customer would never notice on a tool-missing host. Fix: refuse to install when no SHA256 tool is available. Error message lists the per-distro install commands so the customer knows the fix. Trade-off: a "warn-but-continue" mode would be friendlier on exotic hosts, but the whole point of this script is to verify a binary the customer pulls off the internet. "Friendly" defaults that skip the security check are exactly the install-script anti-pattern that's bitten the industry repeatedly. Hard error is the right call. `sh -n` syntax check clean. * fix(install): cosign-fail-installs (.ps1) + custom-prefix-ignored (.sh) Two Bugbot PR #11 r2 findings, both real: ## Medium — install.ps1: failed cosign verify treated as "skip" The previous flow's try-block contained BOTH the .sig/.cert download AND the cosign verify + Write-Error. With $ErrorActionPreference = 'Stop', Write-Error throws an exception caught by the SAME catch that handled missing-sig downloads — so a verified-failed binary went through the "couldn't download .sig/.cert — release may pre-date signing" branch and continued to install. Fix: separate the two concerns. Download in a try/catch (failures = "predates signing, skip"). Verify OUTSIDE the try via & cosign (external process, doesn't interact with $ErrorActionPreference); $LASTEXITCODE check + Write-Host + explicit exit 1 if non-zero. A failed signature now correctly refuses to install. ## Medium — install.sh: custom --prefix silently ignored if missing POSIX `test -w` returns false for nonexistent paths. The previous flow used `[ ! -w "$PREFIX" ]` to decide whether to fall back to ~/.local/bin — but that meant a legitimate `--prefix /opt/tracebloc` (a dir that doesn't exist yet) silently triggered the fallback, overriding the customer's explicit choice. Fix: try `mkdir -p "$PREFIX"` first; if THAT succeeds AND the resulting dir is -w, use it. Only fall back if mkdir fails (no write perms on parent) OR the existing dir is unwriteable. The dominant default-prefix case (`/usr/local/bin` without sudo) still routes to ~/.local/bin; custom prefixes get respected. Local: `sh -n scripts/install.sh` clean; full Go test suite + gofmt + errcheck green. * fix(install.ps1): Fail helper for clean errors + null-PATH guard (Bugbot r3) Two more PowerShell-specific Bugbot findings on PR #11: ## Medium — Write-Error + exit 1 = dead exit, ugly UX With $ErrorActionPreference = 'Stop', every Write-Error call throws a terminating error BEFORE any subsequent statement runs, so the `exit 1` lines after them were dead code. The result: a customer hitting an error saw PowerShell's full error record (stack trace + script line numbers + ErrorCategoryInfo) instead of a clean one-line message. The cosign branch was fixed in r2 with Write-Host + explicit exit. The earlier error paths (Get-Arch, Resolve-Tag, SHA256 mismatch, missing SHA256SUMS entry) still used the broken Write-Error pattern. Fix: add a Fail helper that does `Write-Host "Error: $msg" -ForegroundColor Red; exit 1` and replace every Write-Error call site with `Fail "..."`. DRY + consistent UX. ## Medium-Security — Null user PATH = leading semicolon = PATH-injection GetEnvironmentVariable('Path', 'User') returns $null on fresh Windows installs (or accounts that never set a user-scope PATH). The naive `"$userPath;$InstallPrefix"` interpolation then produced `";C:\Users\...\Programs\tracebloc"` — a leading semicolon = empty PATH entry, which Windows resolves as the CURRENT WORKING DIRECTORY. That's a well-known PATH-injection vector: any binary planted in the user's cwd runs ahead of real PATH entries. Fix: null-guard $userPath before concatenation. If $userPath is falsy (null or empty), use just $InstallPrefix as the new value. The `$existingEntries -notcontains` check now also handles the null case correctly via the `if ($userPath) { ... } else { @() }` fallback. Local: go test green, gofmt + errcheck clean. This is r3 on PR #11 — the install scripts have surfaced the most findings because they're shell+PowerShell where bugbot's language-specific coverage is sharpest. Each finding has been real. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first real-cluster run of `dataset push` surfaced three issues that the mock-only test suite couldn't catch. All three are now validated end-to-end against a live tracebloc/client release (image_classification, 6/6 records, 100% success, rows confirmed in MySQL). 1. Staging collided with the ingestor's DEST_PATH (the blocker). The CLI staged source files into /data/shared/<table>, which is exactly the ingestor's DEST_PATH (data-ingestors config.DEST_PATH = STORAGE_PATH/TABLE_NAME). Its DuplicateValidator rejects a pre-existing, non-empty destination, so the CLI's own staging tripped the check on every run. StagedPrefix now stages under /data/shared/.tracebloc-staging/<table>, leaving the destination fresh. Added FinalDestPrefix for the real destination shown in pre-flight. 2. Image target_size had no override. image_classification defaults to 512x512 and the ingestor VALIDATES the resolution (it does not resize), so any other size hard-failed the Image Resolution Validator. Added --target-size WxH, plus auto-detect from the first image when the flag is omitted, emitted as spec.file_options.target_size. 3. Watch treated a broken log stream as fatal. A Pod replaced/restarted/ deleted mid-follow (e.g. a backoffLimit retry) returned exit 9 even when the Job ultimately succeeded. The Job's terminal status is now the source of truth: a non-ctx stream error is only fatal if the Job outcome can't be determined. Tests: new push/detect_test.go (ParseTargetSize, DetectImageSize); spec_test.go (staging != dest regression, target_size passes schema); watch_test.go (Job status wins on stream break); stream_test.go updated to derive paths from StagedPrefix. go build / vet / test all green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…#13) Extends `dataset push` from image_classification-only to also cover tabular_classification, tabular_regression, time_series_forecasting, and time_to_event_prediction. The Python ingestor already supports these; this adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL). - Category dispatch (push/category.go): image vs tabular families, mirroring data-ingestors' conventions.py groupings. - Tabular local layout (push/tabular.go): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list). - Schema: auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers don't hand-write one; --schema col:TYPE,... overrides. Reserved framework columns (id, data_id, ...) are skipped so a CSV carrying an id column doesn't produce a schema the ingestor rejects (the #135b guard). - Label: string form for tabular_classification; object form with policy=bucket (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added --label-policy and --time-column. - Build() branches by category (push/spec.go); pre-flight is category-aware (data CSV + column count for tabular). Tests: push/tabular_test.go (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema); spec_test.go (tabular Build passes the schema for all three label shapes, regression defaults to bucket); updated the unsupported-category gate test. go build / vet / test green. Stacked on cli#12 (the dataset-push live-ingestion fixes). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…#14) Adds text_classification and masked_language_modeling, and generalizes the staging machinery from images-only to arbitrary sidecar directories plus extra root files — the shared piece the remaining families need. - Generic sidecar staging (walk.go, stream.go): LocalLayout gains Sidecars (dir name -> files, staged under "<name>/") and ExtraFiles (dest -> src, staged at the table root), plus FileCount(). The tar writer packages them after images, sorted for determinism. Images stays for image_classification. - text.go: DiscoverText for text_classification (labels.csv + texts/) and masked_language_modeling (labels.csv + sequences/ + a required tokenizer.json at root, staged as an ExtraFile). discoverSidecarFiles is a reusable walker (object detection / segmentation will reuse it). - Build (spec.go): the text branch emits texts/ or sequences/ + a label (text_classification only; MLM has none). - dataset.go: category dispatch + gate now accept the text family; pre-flight is text-aware (sidecar file count + tokenizer line). Validated live: text_classification — staged labels.csv + 5 texts/ -> ingestor Job 100% (5/5), rows confirmed in MySQL. MLM note: code-complete + unit-tested + accepted by the current data-ingestors schema/engine (proven by the e2e), but the *deployed* ingdemo jobs-manager carries a stale embedded schema predating MLM, so a live MLM push is rejected server-side (HTTP 400) until that image is refreshed. Not a CLI issue — the CLI surfaces the 400 cleanly. Tests: push/text_test.go (DiscoverText for both categories incl. MLM-requires-tokenizer + missing-dir errors; text Build passes the schema, MLM has no label); updated the unsupported-category gate test. go build / vet / test green. Stacked on cli#13 (tabular family) -> cli#12 (live-ingestion fixes). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…e-sync) (#15) Adds the two remaining engine-supported image categories, taking the CLI to 9/10 modalities (only semantic_segmentation remains, blocked on the ingestor — data-ingestors#136). - object_detection: reuses the generic sidecar walker for annotations/ (.xml). Validated live end-to-end — 128 records (bounding boxes) ingested, rows confirmed in MySQL. - keypoint_detection: labels.csv + images/ (keypoint coords live in the CSV's Annotation column, read server-side). Adds --number-of-keypoints (required; no default). Emits target_size + number_of_keypoints as TOP-LEVEL fields, which the schema's keypoint conditional requires. - Re-synced the embedded schema from data-ingestors develop. The vendored copy was stale: it lacked keypoint's top-level target_size + number_of_keypoints and their required-for-keypoint conditional, so the CLI couldn't validate a keypoint spec at all. `ingest validate` and dataset push now validate keypoint correctly. Schema-skew findings (deployment/release hygiene, NOT CLI bugs): * sync-schema.sh defaults to data-ingestors *master*, which is stale (lacks both MLM and keypoint); the current schema is on *develop*. Repoint the sync source to develop, or promote develop -> master. (sync --check vs master flags this drift — pre-existing, surfaced here.) * The deployed ingdemo client runs jobs-manager and the ingestor on DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's top-level fields; jobs-manager (older) REJECTS them as additional properties. So keypoint can't be ingested there until both components are refreshed to a matching schema. The CLI's emission is correct against the current/consistent schema (unit-verified). OD is unaffected (no new fields). Tests: push/image_extras_test.go (DiscoverObjectDetection + missing-annotations); spec_test.go (OD emits annotations; keypoint emits top-level target_size + number_of_keypoints; both pass the schema); updated the unsupported-category gate test (now segmentation only). go build / vet / test green. Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
) Lifts unit coverage on cheap-to-test, previously-0% functions that don't need a cluster, and beefs up the CI smoke test to exercise the real binary's schema validation. - progress_test.go: NewProgress on a non-TTY returns the no-op sink; isTTY false cases. - coverage_test.go (cli): printPushPreflight renders the key facts; exitError.Error/Code; runClusterInfo exits 3 on a bad kubeconfig. - watcherr_test.go (submit): WatchError.Unwrap + IsWatchError classification (drives the exit-9 vs exit-8 mapping). - CI smoke test now runs `ingest validate` against committed fixtures (valid -> exit 0; missing-images -> exit 2) + `dataset push --help`, so a schema/validation/wiring regression is caught on the built binary without a cluster. Per-package coverage after: cli 61%, cluster 84%, push 84%, schema 81%, submit 66%. The remaining gaps are the real-I/O seams (SPDYExecutor.Exec, PortForwardJobsManager, NewClientset) — those need a real cluster and are covered by the integration harness (separate PR). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The unit suite mocks every cluster boundary, leaving the highest-risk
code — the SPDYExecutor tar-over-exec stream — at 0% coverage. Every
live bug this project hit (staging<->DEST collision, validator drift,
schema skew) was only findable by running it by hand. This adds an
automated integration harness against a real cluster.
- test/integration/ (//go:build integration):
* Connectivity — cluster.Load + NewClientset against a live API server.
* StageAndVerify — creates a throwaway namespace + PVC + stage pod,
streams a dataset via the REAL push.SPDYExecutor tar-over-exec, then
exec's back into the pod to confirm the files landed. Covers
SPDYExecutor.Exec + StreamLayout + CreateStagePod /
WaitForStagePodReady — the seams with zero unit coverage.
- Makefile: `make test-integration` (ambient kubeconfig; -count=1,
build-tagged).
- .github/workflows/e2e.yml: boots kind + runs the suite. Gated to
nightly / manual dispatch / `e2e`-labeled PRs (kind boot is too heavy
to run per-PR).
Validated locally against a live k3d cluster: both tests pass; the
stream lands files on the PVC and the exec-back confirms content. The
files are build-tag-gated, so normal `go test ./...` and the CI test
job are unaffected. The kind CI job gets its first real run on the
nightly schedule / manual dispatch / an `e2e`-labeled PR.
Follow-ups (test/integration/README.md): a PortForwardJobsManager
fixture; a full `dataset push` through a real jobs-manager + ingestor
(needs the chart in-cluster) to catch cross-component schema skew e2e.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
README was stuck at Phase 0; updates Status (9/10 modalities), Customer-experience/install caveats, Go 1.26 floor, and the roadmap. Also fixes three user-facing help/output strings (cluster info 'coming in Phase 3'; dataset and dataset push --help referencing Phase 4 / PR-b / image-only) that contradicted the shipped binary. Docs only. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The tracebloc/homebrew-tap repo isn't set up yet and Homebrew is deferred past the v0.1 alpha. Removes the brew install line + softens the Status/roadmap mentions (framed as a deferred follow-up). GitHub Release + install.sh/ps1 paths remain. Docs only; bump-homebrew-tap was already if:false. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The root command's Long help still claimed the binary "implements only version and completion" and pointed at the roadmap for future phases — stale since v0.1 shipped the dataset/ingest/cluster commands. PR #18 refreshed the README and the three other user-facing --help strings but missed this one (it scoped to cluster info / dataset / dataset push). Describe the shipped command set instead. Docs/strings only; no behavior change. Closes #19 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
|
👋 Heads-up — Code review queue is at 11 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
saadqbal
approved these changes
Jun 3, 2026
v0.1.0-alpha is published (pre-release). The README still said "first release pending" / "no tagged release exists yet" and advertised `curl https://install.tracebloc.io | sh` — but that vanity domain isn't wired up, and GitHub's `latest` redirect skips pre-releases, so neither the domain shortcut nor a `latest`-based one-liner can install the alpha. - Status: "first release pending" -> v0.1.0-alpha published (pre-release). - Install: replace the dead install.tracebloc.io commands with the working explicit-tag invocation + a link to the signed release assets; note the domain / `latest` / Homebrew all await the first stable tag. - Roadmap: Phase 5 "awaiting first v* tag" -> alpha published. Docs only; no behavior change. Closes #23 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1. target_size: emit [height, width] (was [width, height]) to match the ingest.v1 schema + ingestor; only affected non-square datasets. Added TestBuild_TargetSize_EmittedHeightWidth (non-square) regression test. 2. Makefile: make ci lint now runs the same standalone tools as CI (errcheck + ineffassign + misspell) instead of golangci-lint, restoring the local==CI invariant while golangci-lint-action stays disabled (#6); golangci-lint moved to make lint-full. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@BugBot run. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 534d75b. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Roll-up sync of
develop→mainfor the v0.1.0-alpha release. Bringsmain(last atddbe987, an early Phase-0/1 state) up to the full v0.1 feature set: the completetracebloc dataset pushflow across 9/10 modalities, the release pipeline, and the test/integration hardening. No divergence —mainis an ancestor ofdevelop(17 commits).Changelog (
main..develop)Core ingestion CLI
tracebloc ingest validate#1): embedingest.v1.json+tracebloc ingest validate— offline, local schema check.tracebloc/clientrelease) + ingestor SA token via TokenRequest.dataset push(PR-a of #151) #8, feat(dataset): stage Pod + tar-over-exec stream (PR-b of #151) #9):dataset push— fail-fast pre-flight (spec synth, schema validation, layout walk, PVC discovery) → ephemeral stage Pod + tar-over-exec stream + SIGINT-safe cleanup.install.sh/install.ps1+ Homebrew formula template.Modality expansion —
dataset pushnow covers 9 of 10 categories--target-size+ auto-detect, watch resilience).text_classification,masked_language_modeling) + generic sidecar-staging engine.object_detection+keypoint_detection+ embedded-schema re-sync.Testing
SPDYExecutortar-over-exec).Docs
--helprefresh for the shipped v0.1 surface.--helpstring #21: refresh the stale root--helplong string.Review fixes
target_sizenow emitted as[height, width](the schema order; was[width, height], which affected non-square datasets), plus a non-square regression test;make cilint realigned to the CI tool set (errcheck/ineffassign/misspell), with golangci-lint moved tomake lint-full.Release: v0.1.0-alpha
First tagged release, cut from
develop(a661a9a) and validated end-to-end:SHA256SUMSand the install scripts.release.yml) succeeded on its first run; downloaded binary verifies againstSHA256SUMSand self-reportstracebloc 0.1.0-alpha.traceblocname (provisional — the stablev0.1.0will carry the final product name). Homebrew deferred.🤖 Generated with Claude Code