training-egress NetworkPolicy blocks training pods from the requests-proxy (8888)

## Impact

On every client install with `networkPolicy.training.enabled`, training pods **cannot reach the requests-proxy on port 8888**, so they fail to report epoch results → `Connection refused` → `CrashLoopBackOff`. All experiments fail.

Observed live on a fresh client (`tracebloc-amazon` ns, k3d): training pods looped on
`HTTPConnectionPool(host='requests-proxy-service', port=8888): ... [Errno 111] Connection refused`
at the first epoch finalize. jobs-manager (not subject to the policy) reached the proxy fine (HTTP 401) — the only differentiator was this egress policy.

## Root cause

`templates/network-policy-training.yaml` denies all pod-to-pod / ClusterIP egress (rule 2 `except`s the cluster CIDRs) and then **explicitly re-permits only MySQL** (rule 3). When the requests-proxy architecture shipped (pods POST results to `requests-proxy-service:8888` instead of holding SB creds), this template was **not** updated to re-permit egress to the proxy. So the proxy — the entire intended egress path — is blocked.

The template comment even says: *"This blocks pod-to-pod, ClusterIPs, jobs-manager, K8s API. MySQL is explicitly re-permitted by the next rule."* — the proxy needs the same treatment.

## Fix

Add a 4th egress rule mirroring the MySQL one: allow TCP/8888 to `podSelector app=requests-proxy` (same namespace). The proxy Service selector is `app=requests-proxy`, port 8888 (`templates/requests-proxy-service.yaml`).

## Interim mitigation (already applied on the affected cluster)
- Live-patched the NetworkPolicy with the egress rule (verified: training-labeled probe pod now connects; real pods progressing past epoch 0; jobs completing).
- Suspended the `auto-upgrade` CronJob so its `helm upgrade --reuse-values` doesn't revert the live patch. **Re-enable once this chart fix is released.**

## Acceptance
- [ ] training-egress netpol permits TCP/8888 → `app=requests-proxy`
- [ ] rule present whenever the proxy is deployed
- [ ] chart release; affected cluster un-suspends auto-upgrade and self-heals

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

training-egress NetworkPolicy blocks training pods from the requests-proxy (8888) #196

Impact

Root cause

Fix

Interim mitigation (already applied on the affected cluster)

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

training-egress NetworkPolicy blocks training pods from the requests-proxy (8888) #196

Description

Impact

Root cause

Fix

Interim mitigation (already applied on the affected cluster)

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions