docs(rfc-0001): retag cluster_id anchor to its own ticket (backend#883) #97
Merged
saadqbal merged 1 commit into docs/rfc-0001-cli-auth-provisioning on Jun 24, 2026
…(backend#883)

The cluster_id anchor (field + get-or-create + cross-account 409 + adopt-backfill) was attributed to backend#836 in §6.3 and C.3, but #836/#862 ship only namespace validation + per-action RBAC. Split the anchor out to its own ticket so the critical-path lynchpin is tracked:

- §6.3: retag the cluster_id sub-bullet to backend#883 (split out of #836).
- C.3: heading now credits #836 (namespace + RBAC) and #883 (the [NEW] cluster_id items) separately, so the doc no longer self-contradicts §6.3.
- Ref-links for backend#862 (PR) and backend#883 (issue).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
saadqbal approved these changes on Jun 24, 2026
ab0ee88 into docs/rfc-0001-cli-auth-provisioning
4 checks passed
saadqbal pushed a commit that referenced this pull request on Jul 3, 2026

…(backend#883) (#97)

The cluster_id anchor (field + get-or-create + cross-account 409 + adopt-backfill) was attributed to backend#836 in §6.3 and C.3, but #836/#862 ship only namespace validation + per-action RBAC. Split the anchor out to its own ticket so the critical-path lynchpin is tracked:

- §6.3: retag the cluster_id sub-bullet to backend#883 (split out of #836).
- C.3: heading now credits #836 (namespace + RBAC) and #883 (the [NEW] cluster_id items) separately, so the doc no longer self-contradicts §6.3.
- Ref-links for backend#862 (PR) and backend#883 (issue).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request on Jul 3, 2026
…ecord (#55) * docs: add DRAFT RFC 0001 — browser auth & client provisioning Design epic for replacing copy-pasted Client ID + password onboarding with a device-flow (RFC 8628) browser sign-in + auto-provisioning. Refs #54. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): derive-once-freeze namespace + soft-required location - §6.6: derive namespace slug from display name ONCE then freeze (k8s namespaces are immutable); collision-suffix + empty-slug guard + --namespace override; backfill leaves existing slugs untouched. - §6.7: location is soft-required (required but pre-filled); never accept a silent empty (reads as carbon-free); explicit "set later" path; keep DB blank=True for back-compat, enforce at UX layer. - Appendix B: name→slug reference algorithm + prototype validation table + manage.py query to validate against production namespaces. Refs #54. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): rev 2 — lead with client lifecycle; settle setup/credential/handle decisions Refresh the RFC against the current code and the cross-repo review on backend#830. The auth handshake turned out to be the easy half; the design now leads with the client lifecycle on a machine, which is where the real bugs are. - New §0: settle three product decisions — silent/auto setup (zero prompts: name=hostname, location=auto-detect, surfaced not asked); the machine credential is never shown (written to the cluster secret 0600, never to stdout/scrollback/~/.tracebloc); clients are referred to by slug + arrow-key picker, never the UUID/username/password. - New §3.2: two operational contexts (account vs client) + command map. - New §7: client-lifecycle loopholes and their resolutions — idempotent create + machine→client anchor, selected-vs-connected, guarded delete, cross-account pointer scoping (fixes logout leaving the active client set), orphan resume, auth/expiry, manage-by-name + rotation. 
- Refresh §4 "what exists" to today: auth scaffold merged (cli#83), client commands in flight (cli#84/#92), dataset commands target a cluster via kubeconfig flags and never read the active pointer (§4.6). - §12 records the backend#830 resolutions of the old §11 open questions (air-gap out, namespace name→slug→both, location future-only, RBAC read/write split, multi-client free / re-parenting deferred, reuse web IdP). Two product calls flagged for owner confirmation. - Rewrite §8 UX (zero-prompt flows) and §9 security (credential never shown; where it lives; rotation = delete + recreate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): cluster is the idempotency anchor (1:1 client↔cluster) A client attaches 1:1 to a cluster, so client identity is per-cluster. `client create` becomes get-or-create keyed on the cluster identity (proposed: kube-system namespace UID), which is readable before install — closing the config-lost / pre-install orphan gap a machine-id key couldn't. - §3.1: state the 1:1 client↔cluster invariant up front. - §6.3: new backend cluster_id field (unique=True) + get-or-create-by- cluster on POST /edge-device/. - §7.2: rewrite around the cluster anchor; demote the -2/-3 suffix to cross-cluster-only — kills the same-cluster duplicate the current PR #92 derive would mint on re-run. - §7.9: orphan recovery keyed on cluster_id; password-reset fallback when the credential was lost before Helm consumed it. - §6.4 / §6.6 / §8.2 / §12-Q5 / §13: align everything to one-client-per-cluster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): add §14 Risks & dependencies; reframe §11 around the critical path Capture the bottlenecks around the flow (not the in-flow bugs of §7): - R1 critical path crosses backend#835 + backend#836 + an unowned frontend /activate page; the CLI is gated on them, not the reverse. 
- R2 the user token (account-scoped, long-lived, 0600 on every box, logout is local-only) is the real blast radius — not the machine credential D2 hid. - R3 the cluster anchor needs k8s up at create time, and the kube-system UID changes on a cluster rebuild. - R4 the namespace-uniqueness migration can hit k8s namespace immutability — destroy+rebuild, not rename. - R5 fleet provisioning is a thundering herd on the unique constraint. §11 reframed to show the backend + frontend → CLI dependency order. Appendix A: replace the slug-check sketch with a complete READ-ONLY collision check that reports the (account, namespace) duplicates that would block backend#863 — runnable against staging now (R4). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): fold in review — cross-account, fleet backfill, logout revoke + polish Address the code-grounded review's blocking + polish findings. Blocking: - R6 (new): account-scope get-or-create — a cluster_id bound to another account is a 409, never a silent adoption (the kube-system UID isn't a secret) (§6.3/§7.2). - R7 (new): the existing fleet has null cluster_id, so a naive re-run would double-provision and orphan the live client. Backfill cluster_id via the heartbeat, plus a §7.2 step-2a "never mint over a live in-namespace release" guard for the pre-backfill window (§4.5/§10). - Heartbeat must report cluster_id (§4.5) — powers the connected check and backfills R7. Correctness/security: - logout revokes server-side now via backend#845, not Phase 2 (§7.5/§9/R2). - Orphan password-reset gated on heartbeat recency, not just "no values file" — a once-connected client still has a running pod (§7.9). - client create operates against an already-reachable cluster (§6.2); add a reaper/teardown hook for rebuild orphans (R3). Clarity: - Fix the Appendix A "authoritative" contradiction: cluster_id is authoritative for idempotency; namespace-unique is cosmetic dedup (§12). 
- Rename is cosmetic; the handle stays the frozen slug (§7.1/§8.1). - §9: device-code phishing mitigations + etcd-at-rest note. - §0 + rev note (Rev 3); §12 marks cluster-id + Q1 + Q5 confirmed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): ground R4/§4.2/§6.6 in backend code (namespace is unguarded today) Verified against the backend tree: namespace is stored client-reported and verbatim (common/utils/edge_device_utils.py) — no slug derivation, no format validation, no uniqueness anywhere (the only slugify is for Competition titles). - §4.2: state that the heartbeat stores client_info.namespace as-is. - §6.6: callout that the slug rule, set-at-create, and the §6.3 constraint are all net-new; the slug rule lives only in the CLI + Appendix A, not the backend. - R4: collisions are *unprevented* today, not merely possible. The code already proves they're possible; only the data shows whether any exist — so the collision check is the implementer's pre-migration step, not an RFC blocker (no staging access needed to finalize the RFC). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): hardening pass — supply-chain, audit, multi-env, version, uninstall Second adversarial review (independent security + product reviewers + own pass) against the five goals. The auth/idempotency core held; the gaps clustered in bootstrap trust, compliance, and the operational failure tail: - §9: bootstrap supply-chain (R8), audit trail (R9), machine-credential revoke in phase 1, authenticated cluster_id claims, explicit tenancy boundary. - §14: R8 supply-chain, R9 audit, R10 multi-env config clobber (a confirmed bug — login --env strands the old env's ActiveClientID), R11 version negotiation, R12 uninstall/offboarding; + watch-items (FL threat model, data-residency, emoji glyphs, --token re-apply). - §8.5: silent-failure flow — --verbose, persistent install log, resume command, cluster doctor auth/config check, streamed rollout progress. 
- §7.2: adopt keeps the existing namespace (never re-derive from hostname). - §7.4: refuse delete on a running experiment, not just an advisory heartbeat. - §7.5: scope the active client to env, not just account (the R10 fix). - §11 / §13: phases + cross-repo work breakdown updated (cli / backend / installer). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): add Appendix C — implementation-grade API & data contracts Pin the cross-repo wire contracts so the RFC can be built from directly, without a separate SDD. Shapes are grounded in the shipped CLI client (internal/api/client.go) — the backend must match them or reconcile deliberately; [NEW] marks net-new work. - C.1 conventions: env base URLs, Bearer auth, CLI version header (R11). - C.2 device grant (backend#835): /device/code, /device/token error model, /userinfo/, /activate. - C.3 provisioning (backend#836): /edge-device/ get-or-create with cluster_id — 201 mint / 200 adopt / 409 cross-account; ProvisionedClient; list / admins / delete. - C.4 heartbeat cluster_id + authenticity rule (R7 / §9). - C.5 audit event schema (R9). C.6 token revoke (backend#845). - C.7 data model: cluster_id field + global-unique + the namespace constraint, with the strict migration order (R4 / R7). C.8 env-scoped config v2 (R10). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): lock data-verb naming (ingest, not push) + tie flows to the #877–880 trackers (#96) Folds in two decisions made with Lukas (2026-06-23) that postdate Rev 3: - Naming: `dataset push|list|rm` → `data ingest|list|delete`. **ingest, not push** (data is loaded into the client's own on-prem cluster and never leaves it; "push" implies egress to a remote and undermines the core trust message). **delete, not rm** (spelled-out, consistent with `client delete`). `dataset`/`push`/`rm` kept as hidden aliases for one deprecation cycle. Swapped across §3.2/§4.6/§6.2/§6.4/§7.3/§13 + a rationale note in §6.2. 
- §8: framed the drafted flows as the four acceptance families (#877–880) under the 2-phase shape (one human gate → unattended idempotent convergence) + the 7 design principles. Additive only — does NOT touch the cluster_id anchor design. Two round-2 review residuals remain for Rev 4: the heartbeat `cluster_id` backfill names a carrier (jobs-manager) that lacks RBAC to read the kube-system UID, and §7.2 step-2a adopts the in-cluster credential before the cross-account check (a 409 bypass). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(rfc-0001): Rev 6 — close the two anchor residuals flagged on #96 Carrier decision: cluster_id is set/backfilled by the CLI/installer (the kubeconfig-holder that can read the kube-system UID), NOT the heartbeat — whose sender (jobs-manager) has no `namespaces` RBAC and can't read it. - Residual 1 (carrier): §4.5 / §6.3 / §10 / R7 / C.4 / C.7 — the heartbeat no longer carries cluster_id; the CLI PATCHes it on adopt (new C.3 PATCH /edge-device/<id>/), with the §7.2 step-2a live-release guard covering the pre-backfill window. §9: since the authenticated CLI sets it from the real UID, the self-report spoofing surface is gone. - Residual 2 (ordering): §7.2 — the account-scoped backend check now gates ADOPTION itself, so reading a live TB_CLIENT_ID off the cluster can't bypass the cross-account 409 (previously step-2a adopted before the check). - §13 + Appendix C updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): retag §6.3 + C.3 cluster_id anchor to its own ticket (backend#883) (#97) The cluster_id anchor (field + get-or-create + cross-account 409 + adopt- backfill) was attributed to backend#836 in §6.3 and C.3, but #836/#862 ship only namespace validation + per-action RBAC. Split the anchor out to its own ticket so the critical-path lynchpin is tracked: - §6.3: retag the cluster_id sub-bullet to backend#883 (split out of #836). 
- C.3: heading now credits #836 (namespace + RBAC) and #883 (the [NEW] cluster_id items) separately, so the doc no longer self-contradicts §6.3. - Ref-links for backend#862 (PR) and backend#883 (issue). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(rfc-0001): Rev 7 — correct the `logout` server-side-revoke overclaim (FR finding) FR'ing the connect/install flow on dev (#877) confirmed `tracebloc logout` is **local-only** today: it clears the local token but a copied/leaked token still authenticates afterward. The RFC claimed "logout revokes server-side (backend#845)" as shipped fact across §6.3, §7.5, §9, §13, and Appendix C.6 — that overstated it. Reframed consistently: server-side revoke is **pending the `POST /auth/revoke` endpoint (backend#887, not built) + a CLI `logout`→revoke call**; backend#845 shipped only the underlying `revoke()` primitive. Added a Rev 7 changelog note. Evidence + tracking: backend#887 (endpoint) carries the FR evidence; the earlier cli#55 resolution-map reply is corrected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc-0001): mark ACCEPTED — implemented in v0.4.0 (close out the draft) The design shipped in CLI v0.4.0 (#107) and epic #54 is closed. Flip the status header from DRAFT to ACCEPTED and reconcile it with what actually landed, so the RFC can merge as the design-of-record rather than sit as a perpetual draft. Rev history retained as the convergence record. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Stacked on the RFC branch (the cli#96 pattern) — @saadqbal to merge into the RFC PR (cli#55).
What
The `cluster_id` anchor (the field, get-or-create keyed on cluster identity, the cross-account 409, and the adopt-backfill PATCH) is the idempotency core of §7.2 and the lynchpin of the whole connect → converge flow. The RFC attributed it to `backend#836` in §6.3 and Appendix C.3, but #836 (and its PR #862) ship only namespace validation + per-action RBAC + ask-an-admin. So the anchor had no real ticket; it was hiding behind a reference.

This retags it to its own ticket, backend#883 (filed with the full scope from §6.3/§7.2/§9/C.3/C.7):

- §6.3: the `cluster_id` sub-bullet now points to `backend#883` (split out of #836).
- C.3: the heading now credits `#836` (namespace + RBAC) and `#883` (the `[NEW]` `cluster_id` items) separately, so the contract section no longer contradicts §6.3.
- Ref-links for `backend#862` (the impl PR) and `backend#883`.

No design change, pure attribution. Verified surgical (only the cluster_id attribution moved; the Q4/RBAC `#836` tag is untouched).

Why now

This came out of the #862 review + an RFC-conformance pass. The anchor is the biggest single gap on the RFC critical path (the CLI `client create` revision and the installer reorder #838 both block on it), so it needed a real home before roadmap planning.

🤖 Generated with Claude Code
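For readers without the RFC open, the anchor semantics that backend#883 now tracks (get-or-create keyed on cluster identity, with a cross-account 409) can be sketched roughly as below. This is an in-memory illustration only, not the backend implementation: `EdgeDevice`, `Registry`, and `get_or_create` are hypothetical names, and the status codes follow the "201 mint / 200 adopt / 409 cross-account" summary quoted from C.3 in the commit history above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class EdgeDevice:
    """Hypothetical stand-in for the backend's client record."""
    id: str
    account_id: str
    cluster_id: str


@dataclass
class Registry:
    """In-memory stand-in for a table with a globally unique cluster_id."""
    by_cluster: Dict[str, EdgeDevice] = field(default_factory=dict)

    def get_or_create(
        self, account_id: str, cluster_id: str
    ) -> Tuple[int, Optional[EdgeDevice]]:
        existing = self.by_cluster.get(cluster_id)
        if existing is None:
            # First time this cluster is seen: mint a new client.
            device = EdgeDevice(str(uuid.uuid4()), account_id, cluster_id)
            self.by_cluster[cluster_id] = device
            return 201, device
        if existing.account_id != account_id:
            # The cluster is bound to another account: refuse, never adopt silently.
            return 409, None
        # Same account re-running create: return the existing client.
        return 200, existing
```

Re-running the call with the same account and cluster returns the same record instead of minting a duplicate, which is the idempotency property the PR description refers to.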