ops: enable stateless requests-proxy in staging then prod #99

@saadqbal


Context

Cutover task for the stateless requests-proxy auth design. Code and chart changes land first behind a feature flag (requestsProxy.statelessTokens); this ticket tracks the rollout itself.

Parent feature: tracebloc/client-runtime#14

Depends on:

Plan

  1. Pre-check — confirm CR-1, CR-2, HC-1 merged and the new chart version is published.
  2. Staging: helm upgrade the staging release with requestsProxy.statelessTokens: true. Both deployments roll. Bake for 2 weeks.
    • Watch: pod-side 401/503 rates, proxy CPU, jobs-manager error logs, revoke ConfigMap size.
    • Run a controlled proxy-restart drill: kill the proxy pod, confirm in-flight training jobs continue without 401s.
    • Run a controlled jobs-manager-restart drill while pods are mid-job: confirm tokens still verify (no recovery needed since they're self-contained now).
  3. Prod — same helm upgrade against prod. Stage one client at a time if multi-tenant rollout makes sense.
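One way to make the drill outcome unambiguous is to reduce it to a pass/fail check over the status codes pods saw while the proxy or jobs-manager was restarting. The sketch below is illustrative only: where the status codes come from (proxy access logs, pod-side metrics) is not specified in this ticket, and the 503 budget is an assumed knob, not an agreed threshold.

```python
from collections import Counter

# Status codes the plan says to watch on the pod side.
WATCHED = (401, 503)


def watched_rates(status_codes):
    """Return {code: fraction of all responses} for each watched status code."""
    total = len(status_codes)
    counts = Counter(status_codes)
    return {code: (counts[code] / total if total else 0.0) for code in WATCHED}


def drill_passed(during_restart, max_503_rate=0.0):
    """Assumed pass rule: no 401s at all, and 503s within the chosen budget."""
    rates = watched_rates(during_restart)
    return rates[401] == 0.0 and rates[503] <= max_503_rate


# Hypothetical sample: responses observed by in-flight jobs during a restart.
sample = [200, 200, 503, 200]
print(watched_rates(sample))
print(drill_passed(sample, max_503_rate=0.25))
```

A brief burst of 503s while the pod is down may be acceptable; any 401 means a token failed to verify, which is the failure the stateless design is meant to remove.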

Rollback

Flip requestsProxy.statelessTokens: false and helm upgrade. Both code paths coexist in CR-1 / CR-2 to support this.
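Rollout and rollback are the same helm upgrade with the flag value flipped. A minimal sketch of building that command follows; the release, chart, namespace, and values-file names are placeholders, not the real ones. Only `requestsProxy.statelessTokens` comes from this ticket.

```python
import shlex

FLAG = "requestsProxy.statelessTokens"


def helm_upgrade_cmd(release, chart, namespace, stateless, values_files=()):
    """Build the argv for a helm upgrade that sets the feature flag."""
    cmd = ["helm", "upgrade", release, chart, "--namespace", namespace]
    # Pass the values files the release normally uses so other overrides survive.
    for path in values_files:
        cmd += ["-f", path]
    cmd += ["--set", f"{FLAG}={'true' if stateless else 'false'}"]
    return cmd


# Placeholder names; substitute the real release, chart, and namespace.
rollout = helm_upgrade_cmd("my-release", "my-repo/my-chart", "staging", True,
                           values_files=["values-staging.yaml"])
rollback = helm_upgrade_cmd("my-release", "my-repo/my-chart", "staging", False,
                            values_files=["values-staging.yaml"])

print(shlex.join(rollout))
print(shlex.join(rollback))
```

The argv list can be handed to `subprocess.run(cmd, check=True)`. Supplying the usual values files matters because a `helm upgrade` that passes `--set` without them falls back to chart defaults for everything else.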

Caveat: JWTs already minted into the environment of running pods keep working only while the proxy still verifies them against its mounted keys. In legacy mode the proxy ignores the keys entirely and falls back to its registered-token map, which would be empty. A mid-flight rollback therefore strands those pods until their Jobs finish (roughly the same failure mode this whole change was designed to fix, but bounded to a one-off rollback event). Document this caveat in the runbook.
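For the runbook, the caveat can be illustrated with a toy model of the two verification modes. This is not the proxy's actual code. It assumes an HMAC-signed token as a stand-in for the real JWT (the real keys may well be asymmetric) and models the registered-token map as a plain set; all names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_token(claims: dict, key: bytes) -> str:
    """Mint a self-contained signed token (simplified stand-in for a JWT)."""
    body = _b64(json.dumps(claims, sort_keys=True).encode())
    sig = _b64(hmac.new(key, body.encode(), hashlib.sha256).digest())
    return f"{body}.{sig}"


def verify_stateless(token: str, key: bytes) -> bool:
    """Stateless mode: check the signature; no proxy-side state is needed."""
    body, _, sig = token.rpartition(".")
    expected = _b64(hmac.new(key, body.encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)


def verify_legacy(token: str, registered: set) -> bool:
    """Legacy mode: accept only tokens present in the in-memory map."""
    return token in registered


key = b"demo-key"  # hypothetical mounted key
token = mint_token({"job": "train-1"}, key)

print(verify_stateless(token, key))  # True: a restarted proxy still accepts it
print(verify_legacy(token, set()))   # False: after rollback the map is empty
```

The second result is the stranded-pod case: the token is still valid, but a proxy in legacy mode never looks at the signature.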

Acceptance criteria

  • Staging running on stateless mode for ≥2 weeks with no proxy-auth incidents.
  • Prod running on stateless mode for ≥1 week before HC-2 (horizontal scale-up) is queued.
  • A short post-rollout note (paragraph in the runbook or the chart's CHANGELOG) capturing what got measured and any surprises.

Out of scope

  • Removing legacy code paths from CR-1 / CR-2. That's a follow-up after stateless mode has been the default in prod for a full quarter.
  • Moving SB connection strings into a mounted Secret (deferred — separate ticket batch).
