Impact
On every client install with networkPolicy.training.enabled, training pods cannot reach the requests-proxy on port 8888, so they fail to report epoch results → Connection refused → CrashLoopBackOff. All experiments fail.
Observed live on a fresh client (tracebloc-amazon ns, k3d): training pods looped on
HTTPConnectionPool(host='requests-proxy-service', port=8888): ... [Errno 111] Connection refused
at the first epoch finalize. jobs-manager (not subject to the policy) reached the proxy fine (HTTP 401) — the only differentiator was this egress policy.
Root cause
templates/network-policy-training.yaml denies all pod-to-pod / ClusterIP egress (rule 2 excepts the cluster CIDRs) and then explicitly re-permits only MySQL (rule 3). When the requests-proxy architecture shipped (pods POST results to requests-proxy-service:8888 instead of holding SB creds), this template was not updated to re-permit egress to the proxy. So the proxy — the entire intended egress path — is blocked.
The template comment even says: "This blocks pod-to-pod, ClusterIPs, jobs-manager, K8s API. MySQL is explicitly re-permitted by the next rule." — the proxy needs the same treatment.
Fix
Add a 4th egress rule mirroring the MySQL one: allow TCP/8888 to podSelector app=requests-proxy (same namespace). The proxy Service selector is app=requests-proxy, port 8888 (templates/requests-proxy-service.yaml).
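A minimal sketch of what the added rule could look like, assuming the template uses a standard networking.k8s.io/v1 NetworkPolicy egress list and that the proxy pods themselves listen on 8888 (NetworkPolicy ports are generally matched against the destination pod's port, so if the Service's targetPort differs from 8888, that value would be needed instead). The surrounding rules are elided; only the new entry is shown.

```yaml
# Hypothetical fragment for templates/network-policy-training.yaml.
# Existing rules 1-3 (including the MySQL re-permit) stay unchanged.
egress:
  # ... rules 1-3 ...

  # Rule 4: re-permit egress to the requests-proxy.
  - to:
      # A podSelector with no namespaceSelector matches pods in the
      # policy's own namespace only.
      - podSelector:
          matchLabels:
            app: requests-proxy   # same label the proxy Service selects on
    ports:
      - protocol: TCP
        port: 8888                # assumed to equal the proxy container port
```

Because NetworkPolicy rules are additive, this entry only widens what is allowed; it does not change how rule 2 treats the rest of the cluster CIDRs.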
Interim mitigation (already applied on the affected cluster)
- Live-patched the NetworkPolicy with the egress rule (verified: training-labeled probe pod now connects; real pods progressing past epoch 0; jobs completing).
- Suspended the auto-upgrade CronJob so its helm upgrade --reuse-values doesn't revert the live patch. Re-enable once this chart fix is released.
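Suspending a CronJob comes down to setting spec.suspend, a standard field of the batch/v1 CronJob API. A minimal sketch of the patch body, assuming the CronJob object is edited directly; how the suspension was actually applied on the affected cluster is not recorded here.

```yaml
# Patch fragment for the auto-upgrade CronJob (illustrative only).
spec:
  suspend: true   # set back to false once the fixed chart is released
```

While suspended, no new Jobs are scheduled, but a Job that has already started keeps running.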
Acceptance