training-egress NetworkPolicy blocks training pods from the requests-proxy (8888) #196

@saadqbal

Impact

On every client install with networkPolicy.training.enabled set, training pods cannot reach the requests-proxy on port 8888, so they fail to report epoch results (Connection refused) and end up in CrashLoopBackOff. All experiments fail.

Observed live on a fresh client (tracebloc-amazon namespace, k3d): training pods looped on
HTTPConnectionPool(host='requests-proxy-service', port=8888): ... [Errno 111] Connection refused
when finalizing the first epoch. jobs-manager, which is not subject to the policy, reached the proxy fine (it received an HTTP 401, so the connection itself succeeded); the only differentiator was this egress policy.

Root cause

templates/network-policy-training.yaml denies all pod-to-pod / ClusterIP egress (rule 2 excepts the cluster CIDRs) and then explicitly re-permits only MySQL (rule 3). When the requests-proxy architecture shipped (pods POST results to requests-proxy-service:8888 instead of holding SB creds), this template was not updated to re-permit egress to the proxy. So the proxy — the entire intended egress path — is blocked.

The template comment even says: "This blocks pod-to-pod, ClusterIPs, jobs-manager, K8s API. MySQL is explicitly re-permitted by the next rule." — the proxy needs the same treatment.

Fix

Add a 4th egress rule mirroring the MySQL one: allow TCP/8888 to podSelector app=requests-proxy (same namespace). The proxy Service selector is app=requests-proxy, port 8888 (templates/requests-proxy-service.yaml).
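A minimal sketch of what the added rule could look like, assuming the policy's rules live under spec.egress and the proxy container listens on 8888 itself (NetworkPolicy ports are generally matched against the pod port, so a different targetPort on the Service would need that value instead). The "existing rules" comment is a placeholder, not the template's actual content.

```yaml
# Sketch only: fragment of the egress list in templates/network-policy-training.yaml.
egress:
  # ... existing rules (including the MySQL re-permit) stay unchanged ...

  # Proposed 4th rule: allow training pods to reach the requests-proxy.
  # A podSelector without a namespaceSelector matches pods in the
  # policy's own namespace only.
  - to:
      - podSelector:
          matchLabels:
            app: requests-proxy
    ports:
      - protocol: TCP
        port: 8888
```

Because egress rules in a NetworkPolicy are additive, this rule re-permits the proxy without loosening the CIDR exceptions in rule 2.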

Interim mitigation (already applied on the affected cluster)

  • Live-patched the NetworkPolicy with the egress rule (verified: training-labeled probe pod now connects; real pods progressing past epoch 0; jobs completing).
  • Suspended the auto-upgrade CronJob so its helm upgrade --reuse-values doesn't revert the live patch. Re-enable once this chart fix is released.

Acceptance

  • training-egress netpol permits TCP/8888 → app=requests-proxy
  • rule present whenever the proxy is deployed
  • chart release; affected cluster un-suspends auto-upgrade and self-heals
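For the "rule present whenever the proxy is deployed" criterion, one option is to gate the rule on the same condition that renders the proxy. The sketch below assumes a values key named requestsProxy.enabled, which is hypothetical and not taken from the chart; substitute whatever condition templates/requests-proxy-service.yaml actually uses.

```yaml
{{- /* Hypothetical guard: .Values.requestsProxy.enabled is an assumed key. */}}
{{- if .Values.requestsProxy.enabled }}
  - to:
      - podSelector:
          matchLabels:
            app: requests-proxy
    ports:
      - protocol: TCP
        port: 8888
{{- end }}
```

If the proxy is always deployed with the chart, the guard can be dropped and the rule emitted unconditionally.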
