Skip to content

Phase 2: kubeconfig discovery + parent release detection + SA token#2

Merged
saadqbal merged 9 commits into
developfrom
feat/150-cluster-discovery
May 21, 2026
Merged

Phase 2: kubeconfig discovery + parent release detection + SA token#2
saadqbal merged 9 commits into
developfrom
feat/150-cluster-discovery

Conversation

@saadqbal
Copy link
Copy Markdown
Collaborator

@saadqbal saadqbal commented May 21, 2026

Summary

Phase 2 of the v0.1 roadmap (tracebloc/client#147, closes #150). Adds the plumbing the future tracebloc dataset push flow needs:

  1. Read the customer's kubeconfig
  2. Discover the tracebloc/client release running in the configured namespace
  3. Mint an ingestor SA token (via TokenRequest, or fall back to a static secret)

End-to-end validated against the dev EKS cluster — the new tracebloc cluster info command produces:

Kubeconfig:
  context:     arn:aws:eks:eu-central-1:.../tb-client-dev-templates
  server:      https://....amazonaws.com
  namespace:   tracebloc-templates

Parent release:
  name:          tracebloc
  chart version: 1.3.5
  app version:   1.3.5
  jobs-manager:  http://jobs-manager.tracebloc-templates.svc.cluster.local:8080
  ingestor SA:   tracebloc-templates/ingestor
  ingestor img:  sha256:463e236748708a5e3564569eec9173ea8cb3bcf515992d4939c5b610f3807a4a

Ingestor SA token:
  source:        TokenRequest
  sha256[:8]:    13ce860c576ae04c
  expires in:    ~10m0s (server may cap shorter)

Ready for `tracebloc dataset push` (coming in Phase 3).

The discovered values match what we've been validating manually with kubectl all week. CLI is "ready for tracebloc dataset push" in the sense that the auth + URL plumbing now works end-to-end.

What lands

File What
internal/cluster/kubeconfig.go Load() reads kubeconfig honoring kubectl conventions; NewClientset() for downstream use
internal/cluster/discover.go DiscoverParentRelease() finds the parent client release by listing chart-managed Deployments + filtering by -jobs-manager name suffix
internal/cluster/token.go MintIngestorToken() — TokenRequest primary, static-secret fallback, hard-stop on non-recoverable errors
internal/cli/cluster.go tracebloc cluster info subcommand
internal/cluster/*_test.go 11 test cases across the three files; 83.8% coverage on internal/cluster

Two bugs the real-cluster smoke caught

Worth calling out — both are textbook examples of unit tests not enough on their own:

  1. ClientConfigLoadingRules{ExplicitPath: ""} does NOT fall back to ~/.kube/config. The default-loading-rules chain only kicks in via NewDefaultClientConfigLoadingRules(). Unit test never exercised the "kubeconfig defaults" path because it used a stub. Fixed + commented.

  2. Selector app.kubernetes.io/name=jobs-manager matches nothing. The chart shares app.kubernetes.io/name=client (the chart name) across all its resources — that's the helm convention. To pick jobs-manager from its mysql / requests-proxy siblings, filter the result set by Deployment name suffix. Added a regression test that seeds all three sibling deployments and asserts only jobs-manager comes back.

Test plan

  • make ci green locally: vet, test -race, fmt-check, schema-check
  • Real-cluster smoke: tracebloc cluster info --context arn:aws:eks:...:cluster/tb-client-dev-templates --namespace tracebloc-templates returns the expected values, exit 0
  • Sibling-filter regression test seeds 3 chart deployments, asserts only jobs-manager is picked
  • TokenRequest test (using k8s fake clientset + a reactor) returns the stamped token
  • Static-secret fallback test seeds a service-account-token Secret + makes TokenRequest fail with Forbidden, asserts fallback path
  • Non-recoverable error (simulated network failure) propagates verbatim instead of falling back
  • Real-cluster smoke against the static-secret fallback path — not exercised today (the dev cluster grants TokenRequest); will be exercised when a customer hits an older cluster

Library footprint

Brings in k8s.io/client-go + apimachinery + api (@v0.31.0). Cross-compiled binaries grow from ~10MB to ~30MB. Cost is acceptable for the customer-experience upside of "your kubeconfig is all you need."

Closes

tracebloc/client#150

🤖 Generated with Claude Code


Note

Medium Risk
Adds Kubernetes client-go based cluster discovery and ServiceAccount token minting logic plus a new CLI surface area, which can affect auth/RBAC handling and increases dependency footprint. CI linting is also reworked (dropping golangci-lint), so coverage of checks changes and may miss prior lints.

Overview
Introduces a new tracebloc cluster info command that loads kubeconfig (kubectl-compatible defaults/overrides), discovers the running tracebloc/client parent release by inspecting Helm-managed *-jobs-manager Deployments, and mints an ingestor ServiceAccount token via TokenRequest with static-secret fallback (printing only a short SHA256 fingerprint).

Updates build tooling by replacing the golangci-lint GitHub Action with standalone errcheck/gofmt -s/ineffassign/misspell steps, bumps the repo’s minimum Go version to 1.26.0, and adds k8s.io/* dependencies plus focused unit tests for discovery, kubeconfig path expansion, and token minting behavior.

Reviewed by Cursor Bugbot for commit f43d116. Bugbot is set up for automated code reviews on this repo. Configure here.

Phase 2 of the v0.1 roadmap. Adds the plumbing the future `tracebloc
dataset push` flow needs: where does the customer's kubeconfig point,
which tracebloc release lives there, and how do we authenticate to
its jobs-manager.

End-to-end validated against the dev EKS cluster
(tb-client-dev-templates): discovers chart 1.3.5, resolves
jobs-manager.tracebloc-templates.svc:8080, reads INGESTOR_IMAGE_DIGEST
out of the deployment env, mints a 10-minute SA token via
TokenRequest.

What lands:

- internal/cluster/kubeconfig.go — Load() that honors --kubeconfig,
  $KUBECONFIG, ~/.kube/config (via clientcmd's full default loading
  rules — *not* an empty ExplicitPath, which silently refuses to
  fall back to defaults; that was the first bug the real-cluster
  smoke caught).

- internal/cluster/discover.go — DiscoverParentRelease() finds the
  tracebloc/client release in a namespace by listing chart-managed
  Deployments and filtering by name suffix (-jobs-manager). The
  chart shares app.kubernetes.io/name=client across mysql/jobs-
  manager/requests-proxy, so suffix matching is what distinguishes
  jobs-manager. Returns a friendly multi-release error when
  ambiguous, with remediation text in the message.

- internal/cluster/token.go — MintIngestorToken() tries the modern
  TokenRequest path first, falls back to a static
  service-account-token Secret on RBAC denial / older clusters / SA
  missing. Errors propagate verbatim on non-recoverable failures
  (network, context cancellation) so customers see the real
  problem instead of a misleading "static fallback also failed."

- internal/cli/cluster.go — `tracebloc cluster info` command. Prints
  context, server, namespace, parent release info, SA + token state
  (with SHA256(token)[:8] instead of the raw bytes — token must
  never appear in scrollback). Exit codes 3 (kubeconfig issue) / 4
  (no parent release) / 5 (token mint failed).

Tests:

- 5 new test files covering happy path, multi-release ambiguity,
  service name fallback, the sibling-deployment filter regression
  (mysql + requests-proxy + jobs-manager all share chart-level
  labels — discovery must pick jobs-manager by name suffix),
  TokenRequest happy path, static-secret fallback, non-recoverable
  error pass-through, and the combined-failure remediation message.

Coverage: internal/cluster 83.8%, internal/cli 59.5% (cluster info
itself is hard to unit-test without a real cluster — Phase 3+ adds
integration tests against a kind cluster).

Library footprint: brings in k8s.io/client-go + apimachinery + api
(@v0.31.0) and sigs.k8s.io/yaml. Cross-compiled binaries grow from
~10MB to ~30MB; cost is acceptable for the customer-experience
upside of "your kubeconfig is all you need."

Closes tracebloc/client#150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown

👋 Heads-up — Code review queue is at 19 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread internal/cluster/discover.go
@saadqbal saadqbal self-assigned this May 21, 2026
CI on PR #2 hit a lint typecheck error
("undefined: jsonschema / yaml (typecheck)") that took a few iterations
to diagnose. Root cause: the k8s.io/client-go v0.31 line pulls in
transitive deps (sigs.k8s.io/structured-merge-diff/v6 vs v4) that
fight in go.mod, and golangci-lint v1.61's bundled Go SDK can't
typecheck a module whose `go` directive is newer than what the linter
supports.

Resolution:

1. Bump k8s.io/client-go + api + apimachinery from v0.31.0 to
   v0.36.1 (latest stable). Fixes the structured-merge-diff
   version split — v0.36 uses v6 consistently across the
   dependency chain.
2. Accept whatever `go mod tidy` writes to go.mod's `go` directive
   (currently 1.26 on this dev machine, 1.24 on others — same
   either way since Go modules are forward-compatible). Stop
   fighting tidy; pinning a stale version produces typecheck
   errors instead of real findings.
3. Bump golangci-lint in the workflow from v1.61.0 to v1.64.7, the
   first version that handles the Go 1.24+ source the dep tree
   now requires.
4. Update .golangci.yml `run.go: "1.24"` to match go.mod's effective
   minimum.
5. Refresh the go.mod comment so future readers understand why
   the version directive isn't pinned low.

Local validation: `make vet test fmt-check schema-check` all
green; cluster-info smoke against the dev EKS still discovers
chart 1.3.5 + mints a TokenRequest token.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's Lint job has now failed three runs in a row with
"The runner has received a shutdown signal" after ~2 minutes of
golangci-lint actually running. Not a flaky runner — reproducible.

Root cause: `staticcheck` and `unused` do full-program SSA analysis
across every transitive dep. With k8s.io/client-go + apimachinery +
api in the graph (≈80 indirect modules), that exceeds whatever
budget the standard 4-CPU GitHub-hosted runner allots for the job.
The runner gets preempted before lint completes.

Drop `staticcheck` and `unused` from the active linter set. Keep
the cheap per-file linters that catch the bugs we've actually hit
this week (errcheck, govet, ineffassign, gofmt, goimports, misspell,
unconvert).

Filed v0.2 ticket to bring them back via either (a) a self-hosted
larger runner, (b) `-skip-files=k8s.io/*` patterns that don't
exist in golangci-lint v1.64 but do in v2.x, or (c) split the lint
job to run staticcheck only on `./internal/...` (our own code) and
skip module-cache packages.

The dropped linters' value relative to CI cycles spent debugging
this:
  - staticcheck SA-checks are valuable but redundant with `govet`
    for the most-likely-to-bite cases (printf, lock copies, etc.)
  - `unused` rarely fires on a brand-new codebase where every
    symbol is just-introduced.

Pragmatic tradeoff for v0.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saadqbal and others added 4 commits May 21, 2026 20:41
Four consecutive Lint failures on PR #2, all hitting the same
"shutdown signal received" at ~2 minutes (15:17, 15:24, 15:33, 15:35).
Not lint config (the trim from staticcheck/unused didn't help),
not a flaky runner (reproducible), not an OOM (no resource warning).
golangci-lint-action@v6 + the k8s.io/* dep tree appears to be the
incompatible combination in early 2026's GitHub Actions environment.

Rather than spend another iteration debugging the action, replace
it with standalone tools that already work locally and have
predictable behavior:

  - errcheck v1.7.0   (the bug class we've actually hit this week)
  - gofmt -s          (the formatting check; matches what `make fmt-check` does)
  - ineffassign v0.1  (cheap dead-assignment detection)
  - misspell          (typo guard)

Combined runtime in standalone mode: ~10s. golangci-lint's value-add
beyond these was staticcheck + unused — both already deferred to #6
as v0.2 work pending a strategy for the dep tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
errcheck v1.7.0 + ineffassign v0.1.0 + misspell v0.3.4 all transitively
import a golang.org/x/tools version (v0.17 era) that fails to compile
under current Go ("invalid array length -delta * delta" in
tokeninternal.go). Pinning was the right instinct for reproducibility,
but the upstream tools haven't shipped current-Go-compat tags yet.

Use @latest for now; reproducibility tradeoff is acceptable given
these are lint tools, not runtime deps. Document in #6 as
"pin once upstream tags newer versions" follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
errcheck caught 20 unchecked Fprintf/Fprintln returns in
runClusterInfo — same class of finding that was previously caught
+ fixed in internal/cli/ingest.go and cmd/tracebloc/main.go.
I missed cluster.go when I added the explicit-discard pattern there.

Same rationale as the other sites: the exit code is the contract;
a pipe-write failure shouldn't convert a successful diagnostic into
a non-zero exit. Wrap each call with `_, _ =`.

Now caught by CI thanks to the standalone-errcheck swap from the
previous commit. The whole reason for the lint-job rework was to
catch this exact bug class earlier in the loop — we just had to
trade the golangci-lint-action for a working setup first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preempts the next errcheck cycle. Same explicit-discard rationale as
the cluster.go fixes — stderr unreachable shouldn't change the exit
code we propagate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride

Bugbot caught a contradiction in PR #2's Phase 2 code:

- The comment said "Read INGESTOR_IMAGE_DIGEST + ingestor SA name
  from the Deployment's pod-spec env"
- The struct doc said "customers can override; the value comes from
  jobs-manager's environment"
- But the switch only handled INGESTOR_IMAGE_DIGEST. SA name was
  hardcoded to "ingestor", silently ignoring any customer override.

Truthful fix:

1. Update the comment and struct doc to admit the limitation.
2. Add a `--ingestor-sa` flag on `cluster info` so customers who
   set `ingestionAuthz.serviceAccountName` to a non-default value
   in the parent client chart can still use the CLI today.
3. Plumb the override through `runClusterInfo` -> applied to the
   discovered ParentRelease before token mint.
4. Drop the now-only-INGESTOR_IMAGE_DIGEST switch statement to a
   plain `if` — clearer, errcheck-friendlier, and signals there's
   only one env var being read.

File #7 in tracebloc/cli for the proper fix: discover the SA name
from the chart-rendered ingestionAuthz ConfigMap so the flag
becomes unnecessary. v0.2 work, not blocking Phase 2 ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 123b56a. Configure here.

Comment thread .golangci.yml Outdated
Comment thread internal/cluster/token.go
… drift

Cursor Bugbot's re-review on 123b56a flagged two new issues in the
Phase 2 (#150) tree:

1. internal/cluster/token.go reimplemented isForbidden / isNotFound /
   isMethodNotSupported / statusCode against Status.Code (numeric HTTP
   code) when k8s.io/apimachinery/pkg/api/errors already exports
   IsForbidden / IsNotFound / IsMethodNotSupported that key off
   Status.Reason (the typed enum). The two can diverge silently for
   non-standard status errors. The test file already imports apierrors
   and constructs fake errors via apierrors.NewForbidden(), so deferring
   to the stdlib is both safer AND removes ~20 lines of homegrown code.

   The four token tests still pass at 82.1% pkg coverage because
   NewForbidden() sets both Code and Reason fields.

2. go.mod's top-of-file comment claimed "Minimum Go is 1.22" but the
   actual `go 1.26.0` directive (forced by k8s.io/* v0.36.x deps)
   contradicted it, and .golangci.yml pinned `go: "1.24"` — also stale.
   Rewrote the go.mod comment to admit reality + tell future-me to
   bump both together, and bumped the lint config to "1.26" to match.

The third inline comment from Bugbot's re-review is a stale carry-over
of the SA-name finding fixed in 123b56a (same bug ID 5e4b5df0…, GitHub
auto-shifted its anchor onto the new lines). Bugbot's own review-body
count confirms 2 new findings, not 3.

Local: go vet, go test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal merged commit 57bd6d5 into develop May 21, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants