From e8d4f2b0443c34bdad1f7e66d9a552d12e2f29e7 Mon Sep 17 00:00:00 2001
From: Asad Iqbal
Date: Thu, 4 Jun 2026 14:27:18 +0500
Subject: [PATCH] fix(#196): allow training-pod egress to the requests-proxy (8888)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The training-egress NetworkPolicy denies all pod-to-pod / ClusterIP
egress (rule 2 excepts the cluster CIDRs) and re-permits only MySQL
(rule 3). When the requests-proxy architecture shipped — training pods
POST epoch results / FLOPs to requests-proxy-service:8888 instead of
holding Service Bus credentials — this template was never updated to
re-permit egress to the proxy. Result on every install with the policy
enabled: pods hit "requests-proxy-service:8888 ... [Errno 111]
Connection refused" at the first epoch finalize → CrashLoopBackOff →
all experiments fail.

Add rule 4 mirroring the MySQL rule: TCP/8888 to podSelector
app=requests-proxy (same namespace). Service selector + port from
templates/requests-proxy-service.yaml.

Verified: `helm template -f ci/bm-values.yaml --show-only
templates/network-policy-training.yaml` renders the new rule as valid
YAML.

Found live on a fresh client (tracebloc-amazon / k3d): jobs-manager
reached the proxy (HTTP 401) while training pods got
connection-refused — the only differentiator was this egress policy.
Interim: live-patched the cluster + suspended its auto-upgrade CronJob
(so reuse-values wouldn't revert the patch); re-enable once this lands
+ releases.

Closes #196.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 client/templates/network-policy-training.yaml | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/client/templates/network-policy-training.yaml b/client/templates/network-policy-training.yaml
index c0f0fd1..643d96a 100644
--- a/client/templates/network-policy-training.yaml
+++ b/client/templates/network-policy-training.yaml
@@ -85,4 +85,19 @@ spec:
       ports:
         - port: 3306
           protocol: TCP
+    # 4. requests-proxy — training pods POST epoch results / FLOPs to the
+    #    in-namespace requests-proxy on 8888 (so they never hold Service Bus
+    #    credentials). Rule 2 blocks this ClusterIP egress, so re-permit it
+    #    explicitly, exactly like MySQL above. Without this rule every
+    #    experiment CrashLoopBackOffs at the first epoch finalize with
+    #    "requests-proxy-service:8888 ... Connection refused" (client#196).
+    #    Service selector + port: templates/requests-proxy-service.yaml
+    #    (app=requests-proxy, 8888).
+    - to:
+        - podSelector:
+            matchLabels:
+              app: requests-proxy
+      ports:
+        - port: 8888
+          protocol: TCP
 {{- end }}