Context
Cutover task for the stateless requests-proxy auth design. Code and chart changes land first behind a feature flag (requestsProxy.statelessTokens); this ticket tracks the rollout itself.
Parent feature: tracebloc/client-runtime#14
Depends on:
Plan
- Pre-check: confirm CR-1, CR-2, and HC-1 are merged and the new chart version is published.
- Staging: helm upgrade the staging release with requestsProxy.statelessTokens: true. Both deployments roll. Bake for 2 weeks.
- Watch: pod-side 401/503 rates, proxy CPU, jobs-manager error logs, revoke ConfigMap size.
- Run a controlled proxy-restart drill: kill the proxy pod, confirm in-flight training jobs continue without 401s.
- Run a controlled jobs-manager-restart drill while pods are mid-job: confirm tokens still verify (no recovery needed since they're self-contained now).
- Prod: same helm upgrade against prod. Stage one client at a time if multi-tenant rollout makes sense.
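A minimal sketch of how the upgrade invocation could be assembled, so the same helper serves both staging and prod and only the flag value changes. The release name, chart reference, namespace, and chart version below are hypothetical placeholders, not values from this ticket. The helper only builds and prints the command; it does not run helm. Any values files the release already uses would still need to be passed with -f.

```python
import shlex


def helm_upgrade_argv(release, chart, namespace, chart_version, stateless):
    """Build the argv for a helm upgrade that sets requestsProxy.statelessTokens.

    All arguments except `stateless` are hypothetical placeholders supplied by
    the operator; nothing here is read from the real chart.
    """
    flag = "true" if stateless else "false"
    return [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "--version", chart_version,
        "--set", f"requestsProxy.statelessTokens={flag}",
        "--wait",  # block until the rolled deployments report ready
    ]


if __name__ == "__main__":
    # Hypothetical staging values, for illustration only.
    argv = helm_upgrade_argv(
        release="client-staging",
        chart="example-repo/client",
        namespace="staging",
        chart_version="0.0.0",
        stateless=True,
    )
    print(shlex.join(argv))
```

Printing the command before running it keeps the step reviewable, which matters when prod is staged one client at a time.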
Rollback
Flip requestsProxy.statelessTokens: false and helm upgrade. Both code paths coexist in CR-1 / CR-2 to support this.
Caveat: pods that are already running with JWTs in their env keep working only while the proxy is in stateless mode with the keys mounted. In legacy mode the proxy ignores the keys entirely and falls back to its registered-token map, which would be empty. A mid-flight rollback therefore strands those pods until their Jobs finish. This is roughly the same failure mode this whole change was designed to fix, but bounded to a one-off rollback event. Document this caveat in the runbook.
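The caveat can be illustrated with a simplified sketch. It uses an HMAC-signed token as a stand-in for the real JWTs; the actual signing scheme, claim set, and proxy code are not described in this ticket, so every function name here is hypothetical. The point is only the difference between the two verification paths: the stateless check needs nothing but the key, while the legacy check needs the token to be present in an in-memory map.

```python
import base64
import hashlib
import hmac
import json


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_token(key: bytes, claims: dict) -> str:
    """Mint a self-contained token: payload plus HMAC-SHA256 signature."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode())
    sig = _b64(hmac.new(key, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"


def verify_stateless(key: bytes, token: str) -> bool:
    """Stateless path: recompute the signature from the key alone."""
    payload, _, sig = token.partition(".")
    expected = _b64(hmac.new(key, payload.encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)


def verify_legacy(registered: set, token: str) -> bool:
    """Legacy path: the token must already be in the registered-token map."""
    return token in registered


if __name__ == "__main__":
    key = b"demo-key"  # placeholder key, not a real secret
    token = mint_token(key, {"job": "train-1"})

    # Proxy restart in stateless mode: a fresh process with the key still verifies.
    print("stateless after restart:", verify_stateless(key, token))

    # Rollback to legacy mode: the map is empty, so the same token is rejected.
    print("legacy after rollback:", verify_legacy(set(), token))
```

Under these assumptions the first check passes and the second fails, which is the stranded-pod behaviour the runbook entry should describe.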
Acceptance criteria
- Staging running on stateless mode for ≥2 weeks with no proxy-auth incidents.
- Prod running on stateless mode for ≥1 week before HC-2 (horizontal scale-up) is queued.
- A short post-rollout note (paragraph in the runbook or the chart's CHANGELOG) capturing what got measured and any surprises.
Out of scope
- Removing legacy code paths from CR-1 / CR-2. That's a follow-up after stateless mode has been the default in prod for a full quarter.
- Moving SB connection strings into a mounted Secret (deferred — separate ticket batch).