
docs(rfc-0001): retag cluster_id anchor to its own ticket (backend#883) #97

Merged
saadqbal merged 1 commit into docs/rfc-0001-cli-auth-provisioning from docs/rfc-retag-cluster-anchor
Jun 24, 2026

Conversation

@LukasWodka

Contributor

Stacked on the RFC branch (the cli#96 pattern) — @saadqbal to merge into the RFC PR (cli#55).

What

The cluster_id anchor — the field + get-or-create keyed on cluster identity + cross-account 409 + adopt-backfill PATCH — is the idempotency core of §7.2 and the lynchpin of the whole connect → converge flow. The RFC attributed it to backend#836 in §6.3 and Appendix C.3, but #836 (and its PR #862) ship only namespace validation + per-action RBAC + ask-an-admin. So the anchor had no real ticket — it was hiding behind a reference.

This retags it to its own ticket, backend#883 (filed with the full scope from §6.3/§7.2/§9/C.3/C.7):

  • §6.3 — the cluster_id sub-bullet now points to backend#883 (split out of #836).
  • Appendix C.3 — the heading now credits #836 (namespace + RBAC) and #883 (the [NEW] cluster_id items) separately, so the contract section no longer contradicts §6.3.
  • Ref-link defs for backend#862 (the impl PR) and backend#883.

No design change — pure attribution. Verified surgical (only the cluster_id attribution moved; the Q4/RBAC #836 tag is untouched).

Why now

This came out of the #862 review + an RFC-conformance pass. The anchor is the biggest single gap on the RFC critical path (the CLI client create revision and the installer reorder #838 both block on it), so it needed a real home before roadmap planning.

🤖 Generated with Claude Code

…(backend#883)

The cluster_id anchor (field + get-or-create + cross-account 409 + adopt-
backfill) was attributed to backend#836 in §6.3 and C.3, but #836/#862 ship
only namespace validation + per-action RBAC. Split the anchor out to its own
ticket so the critical-path lynchpin is tracked:

- §6.3: retag the cluster_id sub-bullet to backend#883 (split out of #836).
- C.3: heading now credits #836 (namespace + RBAC) and #883 (the [NEW]
  cluster_id items) separately, so the doc no longer self-contradicts §6.3.
- Ref-links for backend#862 (PR) and backend#883 (issue).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal merged commit ab0ee88 into docs/rfc-0001-cli-auth-provisioning Jun 24, 2026
4 checks passed
saadqbal pushed a commit that referenced this pull request Jul 3, 2026
…(backend#883) (#97)


Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request Jul 3, 2026
…ecord (#55)

* docs: add DRAFT RFC 0001 — browser auth & client provisioning

Design epic for replacing copy-pasted Client ID + password onboarding
with a device-flow (RFC 8628) browser sign-in + auto-provisioning.

Refs #54.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
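
In the device flow named above (RFC 8628), the client polls a token endpoint until the user approves the request in the browser. The following is a minimal sketch of that polling loop, not the CLI's actual code. The error codes and the 5-second `slow_down` back-off come from RFC 8628; `poll_for_token`, `poll_once`, and `DeviceFlowError` are hypothetical names, and `poll_once` stands in for the real HTTP request.

```python
import time
from typing import Callable, Dict


class DeviceFlowError(Exception):
    """Raised when the grant ends without a token."""


def poll_for_token(
    poll_once: Callable[[], Dict[str, str]],  # hypothetical: one token-endpoint request
    interval: float = 5.0,
    expires_in: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the user approves, following RFC 8628 section 3.5."""
    waited = 0.0
    while waited < expires_in:
        reply = poll_once()
        if "access_token" in reply:
            return reply["access_token"]
        error = reply.get("error")
        if error == "authorization_pending":
            pass  # user has not finished in the browser yet; keep polling
        elif error == "slow_down":
            interval += 5.0  # RFC 8628: add 5 seconds to the polling interval
        else:
            # access_denied, expired_token, or anything unexpected is terminal
            raise DeviceFlowError(error or "unknown_error")
        sleep(interval)
        waited += interval
    raise DeviceFlowError("expired_token")
```

Injecting `sleep` keeps the loop testable without real waiting; a caller would normally leave it at the default.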

* docs(rfc-0001): derive-once-freeze namespace + soft-required location

- §6.6: derive namespace slug from display name ONCE then freeze (k8s
  namespaces are immutable); collision-suffix + empty-slug guard +
  --namespace override; backfill leaves existing slugs untouched.
- §6.7: location is soft-required (required but pre-filled); never accept
  a silent empty (reads as carbon-free); explicit "set later" path; keep
  DB blank=True for back-compat, enforce at UX layer.
- Appendix B: name→slug reference algorithm + prototype validation table
  + manage.py query to validate against production namespaces.

Refs #54.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
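
The derive-once-then-freeze rule described for §6.6 can be sketched as follows. The normalization is an assumption and is not the Appendix B algorithm: it only enforces the Kubernetes DNS-1123 label rules (lowercase alphanumerics and `-`, alphanumeric at both ends, at most 63 characters). `slugify`, `derive_namespace`, the `client` fallback for the empty-slug guard, and the numeric suffix scheme are hypothetical choices made for illustration.

```python
import re
from typing import Optional, Set

MAX_LABEL = 63  # Kubernetes namespace names are DNS-1123 labels


def slugify(display_name: str) -> str:
    """Lowercase, collapse every other character run to '-', trim to a valid label."""
    slug = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    return slug[:MAX_LABEL].rstrip("-")


def derive_namespace(
    display_name: str,
    taken: Set[str],
    override: Optional[str] = None,  # stands in for the --namespace flag
    fallback: str = "client",        # hypothetical empty-slug guard value
) -> str:
    """Derive a namespace once; the caller stores it and never re-derives."""
    if override is not None:
        return override
    base = slugify(display_name) or fallback
    if base not in taken:
        return base
    n = 2
    while True:
        suffix = f"-{n}"
        # Trim the base so the suffixed name still fits in 63 characters.
        candidate = base[: MAX_LABEL - len(suffix)].rstrip("-") + suffix
        if candidate not in taken:
            return candidate
        n += 1
```

Freezing then means a later rename of the display name does not call `derive_namespace` again; the stored value is reused as-is.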

* docs(rfc-0001): rev 2 — lead with client lifecycle; settle setup/credential/handle decisions

Refresh the RFC against the current code and the cross-repo review on
backend#830. The auth handshake turned out to be the easy half; the
design now leads with the client lifecycle on a machine, which is where
the real bugs are.

- New §0: settle three product decisions — silent/auto setup (zero
  prompts: name=hostname, location=auto-detect, surfaced not asked); the
  machine credential is never shown (written to the cluster secret 0600,
  never to stdout/scrollback/~/.tracebloc); clients are referred to by
  slug + arrow-key picker, never the UUID/username/password.
- New §3.2: two operational contexts (account vs client) + command map.
- New §7: client-lifecycle loopholes and their resolutions — idempotent
  create + machine→client anchor, selected-vs-connected, guarded
  delete, cross-account pointer scoping (fixes logout leaving the active
  client set), orphan resume, auth/expiry, manage-by-name + rotation.
- Refresh §4 "what exists" to today: auth scaffold merged (cli#83),
  client commands in flight (cli#84/#92), dataset commands target a
  cluster via kubeconfig flags and never read the active pointer (§4.6).
- §12 records the backend#830 resolutions of the old §11 open questions
  (air-gap out, namespace name→slug→both, location future-only, RBAC
  read/write split, multi-client free / re-parenting deferred, reuse web
  IdP). Two product calls flagged for owner confirmation.
- Rewrite §8 UX (zero-prompt flows) and §9 security (credential never
  shown; where it lives; rotation = delete + recreate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): cluster is the idempotency anchor (1:1 client↔cluster)

A client attaches 1:1 to a cluster, so client identity is per-cluster.
`client create` becomes get-or-create keyed on the cluster identity
(proposed: kube-system namespace UID), which is readable before install —
closing the config-lost / pre-install orphan gap a machine-id key
couldn't.

- §3.1: state the 1:1 client↔cluster invariant up front.
- §6.3: new backend cluster_id field (unique=True) + get-or-create-by-
  cluster on POST /edge-device/.
- §7.2: rewrite around the cluster anchor; demote the -2/-3 suffix to
  cross-cluster-only — kills the same-cluster duplicate the current PR #92
  derive would mint on re-run.
- §7.9: orphan recovery keyed on cluster_id; password-reset fallback when
  the credential was lost before Helm consumed it.
- §6.4 / §6.6 / §8.2 / §12-Q5 / §13: align everything to
  one-client-per-cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): add §14 Risks & dependencies; reframe §11 around the critical path

Capture the bottlenecks around the flow (not the in-flow bugs of §7):
- R1 critical path crosses backend#835 + backend#836 + an unowned
  frontend /activate page; the CLI is gated on them, not the reverse.
- R2 the user token (account-scoped, long-lived, 0600 on every box,
  logout is local-only) is the real blast radius — not the machine
  credential D2 hid.
- R3 the cluster anchor needs k8s up at create time, and the kube-system
  UID changes on a cluster rebuild.
- R4 the namespace-uniqueness migration can hit k8s namespace
  immutability — destroy+rebuild, not rename.
- R5 fleet provisioning is a thundering herd on the unique constraint.

§11 reframed to show the backend + frontend → CLI dependency order.
Appendix A: replace the slug-check sketch with a complete READ-ONLY
collision check that reports the (account, namespace) duplicates that
would block backend#863 — runnable against staging now (R4).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
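
The read-only collision check described for Appendix A amounts to a group-by over (account, namespace) pairs. Below is a sketch over plain in-memory rows, assuming each row carries `account` and `namespace` keys. The real check would run against the backend's tables, whose model and field names are not given here, so `find_namespace_collisions` and the row shape are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def find_namespace_collisions(
    rows: Iterable[Dict[str, str]],
) -> List[Tuple[str, str, int]]:
    """Return (account, namespace, count) for every pair that occurs more than once."""
    counts = Counter((row["account"], row["namespace"]) for row in rows)
    return sorted(
        (account, namespace, n)
        for (account, namespace), n in counts.items()
        if n > 1
    )
```

The function only reads its input, which matches the read-only intent: it reports the duplicates a uniqueness migration would trip over and changes nothing.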

* docs(rfc-0001): fold in review — cross-account, fleet backfill, logout revoke + polish

Address the code-grounded review's blocking + polish findings.

Blocking:
- R6 (new): account-scope get-or-create — a cluster_id bound to another
  account is a 409, never a silent adoption (the kube-system UID isn't a
  secret) (§6.3/§7.2).
- R7 (new): the existing fleet has null cluster_id, so a naive re-run would
  double-provision and orphan the live client. Backfill cluster_id via the
  heartbeat, plus a §7.2 step-2a "never mint over a live in-namespace
  release" guard for the pre-backfill window (§4.5/§10).
- Heartbeat must report cluster_id (§4.5) — powers the connected check and
  backfills R7.

Correctness/security:
- logout revokes server-side now via backend#845, not Phase 2 (§7.5/§9/R2).
- Orphan password-reset gated on heartbeat recency, not just "no values
  file" — a once-connected client still has a running pod (§7.9).
- client create operates against an already-reachable cluster (§6.2); add a
  reaper/teardown hook for rebuild orphans (R3).

Clarity:
- Fix the Appendix A "authoritative" contradiction: cluster_id is
  authoritative for idempotency; namespace-unique is cosmetic dedup (§12).
- Rename is cosmetic; the handle stays the frozen slug (§7.1/§8.1).
- §9: device-code phishing mitigations + etcd-at-rest note.
- §0 + rev note (Rev 3); §12 marks cluster-id + Q1 + Q5 confirmed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): ground R4/§4.2/§6.6 in backend code (namespace is unguarded today)

Verified against the backend tree: namespace is stored client-reported and
verbatim (common/utils/edge_device_utils.py) — no slug derivation, no format
validation, no uniqueness anywhere (the only slugify is for Competition titles).

- §4.2: state that the heartbeat stores client_info.namespace as-is.
- §6.6: callout that the slug rule, set-at-create, and the §6.3 constraint are
  all net-new; the slug rule lives only in the CLI + Appendix A, not the backend.
- R4: collisions are *unprevented* today, not merely possible. The code already
  proves they're possible; only the data shows whether any exist — so the
  collision check is the implementer's pre-migration step, not an RFC blocker
  (no staging access needed to finalize the RFC).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): hardening pass — supply-chain, audit, multi-env, version, uninstall

Second adversarial review (independent security + product reviewers + own pass)
against the five goals. The auth/idempotency core held; the gaps clustered in
bootstrap trust, compliance, and the operational failure tail:

- §9: bootstrap supply-chain (R8), audit trail (R9), machine-credential revoke in
  phase 1, authenticated cluster_id claims, explicit tenancy boundary.
- §14: R8 supply-chain, R9 audit, R10 multi-env config clobber (a confirmed bug —
  login --env strands the old env's ActiveClientID), R11 version negotiation, R12
  uninstall/offboarding; + watch-items (FL threat model, data-residency, emoji
  glyphs, --token re-apply).
- §8.5: silent-failure flow — --verbose, persistent install log, resume command,
  cluster doctor auth/config check, streamed rollout progress.
- §7.2: adopt keeps the existing namespace (never re-derive from hostname).
- §7.4: refuse delete on a running experiment, not just an advisory heartbeat.
- §7.5: scope the active client to env, not just account (the R10 fix).
- §11 / §13: phases + cross-repo work breakdown updated (cli / backend / installer).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): add Appendix C — implementation-grade API & data contracts

Pin the cross-repo wire contracts so the RFC can be built from directly, without a
separate SDD. Shapes are grounded in the shipped CLI client (internal/api/client.go)
— the backend must match them or reconcile deliberately; [NEW] marks net-new work.

- C.1 conventions: env base URLs, Bearer auth, CLI version header (R11).
- C.2 device grant (backend#835): /device/code, /device/token error model,
  /userinfo/, /activate.
- C.3 provisioning (backend#836): /edge-device/ get-or-create with cluster_id —
  201 mint / 200 adopt / 409 cross-account; ProvisionedClient; list / admins / delete.
- C.4 heartbeat cluster_id + authenticity rule (R7 / §9).
- C.5 audit event schema (R9).  C.6 token revoke (backend#845).
- C.7 data model: cluster_id field + global-unique + the namespace constraint, with
  the strict migration order (R4 / R7).  C.8 env-scoped config v2 (R10).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
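
The C.3 status split (201 mint / 200 adopt / 409 cross-account) can be sketched as a get-or-create keyed on `cluster_id`. This is an in-memory illustration and not the backend's code: `Registry`, `Client`, and `get_or_create` are hypothetical names, and a real endpoint would rely on a database unique constraint rather than a dict.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Client:
    id: str
    account: str
    cluster_id: str


@dataclass
class Registry:
    by_cluster: Dict[str, Client] = field(default_factory=dict)

    def get_or_create(self, account: str, cluster_id: str) -> Tuple[int, Optional[Client]]:
        """Return (status, client) using the 201 / 200 / 409 split."""
        existing = self.by_cluster.get(cluster_id)
        if existing is None:
            # No client is anchored to this cluster yet: mint one.
            client = Client(id=str(uuid.uuid4()), account=account, cluster_id=cluster_id)
            self.by_cluster[cluster_id] = client
            return 201, client
        if existing.account != account:
            # Bound to another account: refuse, never adopt silently.
            return 409, None
        # Same account, same cluster: a re-run returns the existing client.
        return 200, existing
```

Calling `get_or_create` twice with the same account and cluster returns the same client, which is the idempotency property the anchor exists to provide.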

* docs(rfc-0001): lock data-verb naming (ingest, not push) + tie flows to the #877–880 trackers (#96)

Folds in two decisions made with Lukas (2026-06-23) that postdate Rev 3:

- Naming: `dataset push|list|rm` → `data ingest|list|delete`. **ingest, not push**
  (data is loaded into the client's own on-prem cluster and never leaves it; "push"
  implies egress to a remote and undermines the core trust message). **delete, not
  rm** (spelled-out, consistent with `client delete`). `dataset`/`push`/`rm` kept as
  hidden aliases for one deprecation cycle. Swapped across §3.2/§4.6/§6.2/§6.4/§7.3/§13
  + a rationale note in §6.2.
- §8: framed the drafted flows as the four acceptance families (#877–880) under the
  2-phase shape (one human gate → unattended idempotent convergence) + the 7 design
  principles.

Additive only — does NOT touch the cluster_id anchor design. Two round-2 review
residuals remain for Rev 4: the heartbeat `cluster_id` backfill names a carrier
(jobs-manager) that lacks RBAC to read the kube-system UID, and §7.2 step-2a adopts
the in-cluster credential before the cross-account check (a 409 bypass).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rfc-0001): Rev 6 — close the two anchor residuals flagged on #96

Carrier decision: cluster_id is set/backfilled by the CLI/installer (the
kubeconfig-holder that can read the kube-system UID), NOT the heartbeat — whose
sender (jobs-manager) has no `namespaces` RBAC and can't read it.

- Residual 1 (carrier): §4.5 / §6.3 / §10 / R7 / C.4 / C.7 — the heartbeat no longer
  carries cluster_id; the CLI PATCHes it on adopt (new C.3 PATCH /edge-device/<id>/),
  with the §7.2 step-2a live-release guard covering the pre-backfill window. §9: since
  the authenticated CLI sets it from the real UID, the self-report spoofing surface
  is gone.
- Residual 2 (ordering): §7.2 — the account-scoped backend check now gates ADOPTION
  itself, so reading a live TB_CLIENT_ID off the cluster can't bypass the
  cross-account 409 (previously step-2a adopted before the check).
- §13 + Appendix C updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
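
The two Rev 6 fixes can be sketched together: the account check runs before any adoption, and the authenticated caller backfills a missing `cluster_id` on adopt. All names below (`adopt`, `Record`, `AdoptionRefused`) are hypothetical. The sketch models the ordering only, not the real PATCH /edge-device/<id>/ handler, and refusing a mismatched anchor is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class AdoptionRefused(Exception):
    """Stands in for a refusal such as the cross-account 409."""


@dataclass
class Record:
    client_id: str
    account: str
    cluster_id: Optional[str] = None  # null for the pre-backfill fleet


def adopt(record: Record, caller_account: str, observed_cluster_uid: str) -> Record:
    """Adopt an existing client found on the cluster, backfilling its anchor."""
    # Gate first: a live client ID read off the cluster proves nothing about ownership.
    if record.account != caller_account:
        raise AdoptionRefused(f"{record.client_id}: owned by another account")
    if record.cluster_id is None:
        record.cluster_id = observed_cluster_uid  # the adopt-backfill step
    elif record.cluster_id != observed_cluster_uid:
        # Assumption: a conflicting anchor is refused rather than overwritten.
        raise AdoptionRefused(f"{record.client_id}: anchored to a different cluster")
    return record
```

Because the ownership check precedes the write, a refused adoption leaves the record untouched.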

* docs(rfc-0001): retag §6.3 + C.3 cluster_id anchor to its own ticket (backend#883) (#97)

The cluster_id anchor (field + get-or-create + cross-account 409 + adopt-
backfill) was attributed to backend#836 in §6.3 and C.3, but #836/#862 ship
only namespace validation + per-action RBAC. Split the anchor out to its own
ticket so the critical-path lynchpin is tracked:

- §6.3: retag the cluster_id sub-bullet to backend#883 (split out of #836).
- C.3: heading now credits #836 (namespace + RBAC) and #883 (the [NEW]
  cluster_id items) separately, so the doc no longer self-contradicts §6.3.
- Ref-links for backend#862 (PR) and backend#883 (issue).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(rfc-0001): Rev 7 — correct the `logout` server-side-revoke overclaim (FR finding)

FR'ing the connect/install flow on dev (#877) confirmed `tracebloc logout` is
**local-only** today: it clears the local token but a copied/leaked token still
authenticates afterward. The RFC claimed "logout revokes server-side (backend#845)"
as shipped fact across §6.3, §7.5, §9, §13, and Appendix C.6 — that overstated it.

Reframed consistently: server-side revoke is **pending the `POST /auth/revoke`
endpoint (backend#887, not built) + a CLI `logout`→revoke call**; backend#845
shipped only the underlying `revoke()` primitive. Added a Rev 7 changelog note.

Evidence + tracking: backend#887 (endpoint) carries the FR evidence; the earlier
cli#55 resolution-map reply is corrected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rfc-0001): mark ACCEPTED — implemented in v0.4.0 (close out the draft)

The design shipped in CLI v0.4.0 (#107) and epic #54 is closed. Flip the
status header from DRAFT to ACCEPTED and reconcile it with what actually
landed, so the RFC can merge as the design-of-record rather than sit as a
perpetual draft. Rev history retained as the convergence record.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>