docs: drop stale tracebloc-helm-charts references in INSTALL.md #106
Merged
* Add NetworkPolicy locking down training-pod egress
Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:
- Denies all ingress (nothing should connect TO a training pod).
- Allows DNS to the cluster DNS service.
- Allows external TCP/443 only; blocks all in-cluster traffic (pod-to-pod
and ClusterIP) via ipBlock with cluster-CIDR exclusions.
Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.
Per-platform defaults:
AKS: enabled=true (requires Azure NPM or Calico at cluster create)
EKS: enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
to explicitly disable than silently have no effect)
BM: enabled=true (requires Calico / Cilium / kube-router)
OC: enabled=true (OVN-Kubernetes enforces by default; custom DNS
selector and OpenShift pod/service CIDRs)
The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).
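For illustration, a policy of the shape described above might look like the sketch below; the selectors mirror the description, but the resource name and CIDR values are examples, not the chart's exact rendered output.

```yaml
# Illustrative sketch only; names and CIDRs are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress
spec:
  podSelector:
    matchLabels:
      tracebloc.io/workload: training
  policyTypes: [Ingress, Egress]   # no ingress rules listed => all ingress denied
  egress:
    # 1. DNS to the cluster DNS service (selector falls back to k8s-app: kube-dns)
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    # 2. External HTTPS only: everything except the cluster CIDRs
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8      # example pod CIDR
              - 172.16.0.0/12   # example service CIDR
      ports:
        - { protocol: TCP, port: 443 }
```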
- templates/network-policy-training.yaml: new policy (gated on
networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
rendering, ingress denial, DNS allow, external HTTPS allow, cluster
CIDR blocking, and the OpenShift selector override
No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add optional Namespace resource with Pod Security Admission labels (#43)
* Add optional Namespace resource with Pod Security Admission labels
Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.
When namespace.create is true, the chart templates a Namespace with:
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
helm.sh/resource-policy: keep
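A minimal sketch of such a Namespace, assuming an illustrative name (the real template derives the name from the release and is gated on namespace.create):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tracebloc            # illustrative; the template uses the release namespace name
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # enforce deliberately not set (see below)
  annotations:
    helm.sh/resource-policy: keep
```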
Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.
Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:
- mysql init containers run as UID 0 (needed to chown the PVC
before the main container -- UID 999 -- starts)
- resource-monitor DaemonSet mounts hostPath /proc and /sys
Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.
helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
on, keep annotation, labels, version strings, enforce omitted when
empty, enforce present when set, baseline override, namespace name
respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
paths with copy-pasteable kubectl label commands
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix kubeVersion constraint to accept cloud pre-release suffixes
Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.
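The resulting Chart.yaml constraint is a one-liner, shown here as a fragment for reference:

```yaml
# Chart.yaml (fragment)
kubeVersion: ">=1.24.0-0"   # the trailing -0 admits pre-release builds like 1.34.4-eks-f69f56f
```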
Surfaced while dry-run-installing PR #43 against a dev EKS cluster.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
* Add consolidated SECURITY.md covering the training-pod sandbox (#44)
Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.
Sections:
1. Threat model: trusted platform, untrusted external-data-
scientist submissions. Explicit in-scope / out-of-scope.
2. Seven design goals (G1-G7) for the training-pod sandbox,
each mapped to current status on new-arch vs. legacy.
3. Architecture overview.
4. Defense layers -- credential isolation, network egress,
K8s API access, container runtime hardening, storage
isolation, cross-tenant forgeability, admission tripwire.
5. Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
bare-metal/OpenShift), PSA version requirements, OpenShift
DNS selector override, runAsUser + arbitrary UIDs, bare-
metal hostPath note.
6. What operators must do themselves -- rotate secrets, verify
CNI enforces, label existing namespaces, monitor audit,
upgrade ordering, refactor path for enforce: restricted.
7. Verification -- copy-pasteable kubectl snippets for each
defense layer.
8. Residual risks with explicit ownership -- global SB conn
strings (backend), HTTPS egress (platform endgame), token
TTL (backend), legacy arch (migration team), PSA enforce
(chart refactor), CNI silent no-op (operator), kernel
escape (out of scope), resource DoS (out of scope).
9. Compromise response playbook.
10. Where each defense is implemented (code-path map for
reviewers).
11. Document history.
Also:
- README.md: add Security subsection under Deployment Guide
linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.
No code changes; documentation only.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)
Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).
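As a rough illustration of the distinction: the keep annotation only protects a PVC once it comes from the chart template, and therefore lands in the stored release manifest, along these lines (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data                      # placeholder name
  annotations:
    helm.sh/resource-policy: keep       # must come from the chart, not kubectl annotate
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi                     # placeholder size
```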
* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)
Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.
Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)
* feat(mysql): drop root init-containers, add PSA-restricted securityContext
Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.
The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:
- On existing PVCs (already owned 999:999 from the prior init-container
chown) OnRootMismatch sees the correct root ownership and skips the
recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
creation.
Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
default user, and the entrypoint skips its root-to-mysql gosu re-exec
when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault
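Taken together, the pod- and container-level settings described above look roughly like the following fragment (the container name is illustrative):

```yaml
# pod template fragment
spec:
  securityContext:
    fsGroup: 999
    fsGroupChangePolicy: OnRootMismatch
  containers:
    - name: mysql-client
      image: mysql:5.7.41
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        runAsGroup: 999
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
```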
Scope: client chart only (now the universal chart covering eks/aks/bm/oc).
Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
(EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
object-backed drivers do not; chart docs should flag this in a
follow-up.
Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).
* fix(mysql): restore chown init-container for hostPath (bare-metal)
kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.
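A sketch of what the gated bootstrap could look like, assuming the values key from the description; the container name, image, and command are illustrative:

```yaml
{{- if .Values.hostPath.enabled }}
initContainers:
  - name: init-mysql-data
    image: busybox
    command: ["sh", "-c", "chown -R 999:999 /var/lib/mysql"]
    securityContext:
      runAsUser: 0        # hostPath volumes do not get fsGroup applied by kubelet
    volumeMounts:
      - name: mysql-data
        mountPath: /var/lib/mysql
{{- end }}
```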
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)
* Move tracebloc-resource-monitor to dedicated privileged namespace
Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would reject the DaemonSet, and
`warn=restricted` + `audit=restricted` already spam violation warnings.
This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.
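Sketch of the dedicated namespace (default name from the description; the real template is gated on nodeAgents.namespace.create):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tracebloc-node-agents
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
    pod-security.kubernetes.io/audit: privileged
```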
Changes:
- templates/node-agents-namespace.yaml: new, gated on
nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
(OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases
Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Mirror secrets into node-agents ns; keep namespace RBAC in release ns
Two follow-ups from review of the namespace-split change:
1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
cannot `secretKeyRef` a Secret that only exists in the release
namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
secret, both of which template only into `.Release.Namespace`, so
pods would have failed to start with CreateContainerConfigError.
templates/secrets.yaml and templates/docker-registry-secret.yaml now
template a second copy into `nodeAgents.namespace.name` when:
resourceMonitor != false AND node-agents ns != release ns
The mirror is skipped when the two namespaces collide (e.g. operator
points nodeAgents.namespace.name back at the release namespace) so
Helm does not try to create two resources with the same name.
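Expressed as a Helm template guard, the mirror condition is roughly the following; the exact expression and the secret name in the chart may differ:

```yaml
{{- if and .Values.resourceMonitor (ne .Values.nodeAgents.namespace.name .Release.Namespace) }}
apiVersion: v1
kind: Secret
metadata:
  name: tracebloc-client                            # illustrative; the chart uses its secret name helper
  namespace: {{ .Values.nodeAgents.namespace.name }}
type: Opaque
# data: same CLIENT_ID / CLIENT_PASSWORD keys as the release-namespace copy
{{- end }}
```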
2. When clusterScope: false, the Role must live in the RELEASE
namespace because that is where the monitored workloads run — a
namespace-scoped Role only grants access to its own namespace.
Previously this PR put the Role in `tracebloc-node-agents`, which
would have silently broken the resource-monitor for anyone not
using ClusterRole. Role + RoleBinding are now back in
`.Release.Namespace`; the RoleBinding subject still points at the
ServiceAccount in the node-agents namespace (cross-namespace
subjects in RoleBindings are valid).
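A minimal sketch of the corrected placement, using illustrative resource names (at this point in the chart's history they were still literal; they become release-scoped in a later change):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tracebloc-resource-monitor       # Role + RoleBinding stay in the release namespace
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tracebloc-resource-monitor
subjects:
  - kind: ServiceAccount
    name: tracebloc-resource-monitor
    namespace: tracebloc-node-agents      # cross-namespace subject, valid for RoleBindings
```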
Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns
Two review fixes from the PSA hardening change:
1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
which now resolves to the node-agents namespace (where the DaemonSet
pods live) instead of the release namespace (where the monitored
workloads live). Replace with the literal Release.Namespace so the
monitor continues to watch the right namespace regardless of where
its own pods run.
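In container-spec terms the change is roughly:

```yaml
env:
  # before: Downward API, which now resolves to the node-agents namespace
  # - name: NAMESPACE
  #   valueFrom:
  #     fieldRef:
  #       fieldPath: metadata.namespace
  # after: pinned to the release namespace
  - name: NAMESPACE
    value: {{ .Release.Namespace | quote }}
```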
2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
release namespace if an operator set nodeAgents.namespace.name to the
release namespace (and with namespace.create=true it would render two
Namespace docs with the same name — a render-time collision). Add an
equality guard so the template is a no-op in that configuration.
Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)
Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:
- /var/run/mysqld pid file + unix socket
- /tmp temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files default secure_file_priv dir (touched at start)
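A fragment sketching the shape of the change (volume names are illustrative):

```yaml
containers:
  - name: mysql-client
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - { name: run-mysqld, mountPath: /var/run/mysqld }       # pid file + unix socket
      - { name: tmp, mountPath: /tmp }                         # temp tables, sort buffers, LOAD DATA staging
      - { name: mysql-files, mountPath: /var/lib/mysql-files } # secure_file_priv dir
volumes:
  - { name: run-mysqld, emptyDir: {} }
  - { name: tmp, emptyDir: {} }
  - { name: mysql-files, emptyDir: {} }
```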
Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)
- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
forcing the chart to render a privileged init-mysql-data chown
container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
updated to document the CSI-default / bare-metal-override split.
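The bare-metal override amounts to something like this values fragment (only the enforce key is overridden; warn and audit keep their defaults):

```yaml
# ci/bm-values.yaml (fragment)
namespace:
  podSecurity:
    enforce: ""   # kubelet does not apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411)
```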
Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(installer): slim k3d and add dev overrides for local testing (#54)
The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.
Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.
Also add two env vars for local-chart testing in install-client-helm.sh:
- TRACEBLOC_CHART_PATH: install from a local chart path instead of the
  published tracebloc/client Helm repo (skips helm repo add/update)
- TRACEBLOC_VALUES_FILE: use the caller-supplied values file as-is and
  skip the clientId/password prompts + values.yaml generation
With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(client): harden image pinning and credentials (v1.0.4) (#53)
Address the High-severity findings from the client chart security review:
- Add digest support to tracebloc.image helper and images.* values for
jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
set, the image is rendered as repo@sha256:... and imagePullPolicy drops
to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
defaults to "prod". The schema rejects "latest" outright; operators
wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
does not apply fsGroup to hostPath volumes, k8s#138411), but now with
drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
The defaults are now empty strings; the schema and template both reject
empty values and <...> placeholder patterns so deployments fail fast
instead of silently encoding a placeholder into the Secret.
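A hedged sketch of the digest knob in values terms; only images.mysqlClient.digest is named above, the surrounding keys are assumed for illustration:

```yaml
images:
  mysqlClient:
    tag: prod     # "latest" is rejected by the schema
    digest: ""    # set to "sha256:..." to render repo@sha256:... and drop imagePullPolicy to IfNotPresent
```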
Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)
The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.
- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
create args. k3s bundles metrics-server; the earlier comment claiming
the chart "ships its own" was wrong — the DaemonSet is a consumer of
metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
`lookup` that fails the release up front when resourceMonitor is true
but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
(resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.
Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.
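A sketch of the kind of pre-flight guard described, using Helm's lookup; the exact probe and error wording in the chart may differ. Because lookup returns empty results offline, plain helm template skips the check:

```yaml
{{- if .Values.resourceMonitor }}
{{- if lookup "v1" "Namespace" "" "kube-system" }}   {{/* only non-empty against a live cluster */}}
{{- if not (lookup "apiregistration.k8s.io/v1" "APIService" "" "v1beta1.metrics.k8s.io") }}
{{- fail "resourceMonitor is enabled but v1beta1.metrics.k8s.io is not registered; install metrics-server or set resourceMonitor: false" }}
{{- end }}
{{- end }}
{{- end }}
```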
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)
The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).
kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.
Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact
* Chore/merge main into develop (#58)
* Update README.md
* Add narrow CODEOWNERS for security-sensitive paths
* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1 so the resource-monitor DaemonSet, which relies on the metrics API, keeps working; aligns with the earlier change that made metrics-server a requirement.
---------
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning
fix(client): pin resource-monitor by digest (v1.0.7)
* chore: add auto-add to engineer kanban workflow (#45)
* Add auto-add to engineer kanban workflow
* fix(ci): pin actions/add-to-project to v1.0.2
@v1 is not a valid tag — the action publishes full semver tags only. Pin to v1.0.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)
When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.
Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure
Credit: bug bot finding.
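The template-side guard is roughly of this shape (message wording illustrative):

```yaml
{{- if .Values.networkPolicy.training.enabled }}
{{- if not .Values.networkPolicy.training.clusterCidrs }}
{{- fail "networkPolicy.training.clusterCidrs must contain at least one CIDR when the training policy is enabled" }}
{{- end }}
# ...egress rule renders cidr: 0.0.0.0/0 with the CIDRs as except: entries...
{{- end }}
```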
* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade
Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.
- Read the digest via nested `default (dict)` so a missing `images` map
AND a missing `resourceMonitor` entry both fall through to "" safely.
`dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
`images: null` and asserts the DaemonSet still renders with the tag
fallback.
Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
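One way to express the nested-default pattern described above (a sketch, not the chart's exact source):

```yaml
{{- /* both a missing `images` map and a missing `resourceMonitor` entry fall through to "" */}}
{{- $images := .Values.images | default dict }}
{{- $rm := index $images "resourceMonitor" | default dict }}
{{- $digest := index $rm "digest" | default "" }}
```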
* fix(client): scope clusterCidrs minItems guard to enabled=true only
Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.
Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.
Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
---------
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Merge pull request #62 from tracebloc/fix/release-workflow-lint
Enhance CI workflows and fix MySQL resource management issues
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b
  docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight
* chore: add default CODEOWNERS for auto-reviewer assignment (#73)
* ci: add kanban closure-routing caller workflow (#75)
* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)
Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The second
`helm install` failed with:
Error: ServiceAccount "tracebloc-resource-monitor" in namespace
"tracebloc-node-agents" exists and cannot be imported into the current
release: invalid ownership metadata; ... must equal "hasan-prod": current
value is "stg".
Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.
What changed:
- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name helpers
  (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label app=, and
  spec.template.spec.serviceAccountName all now go through the helper. The
  selector + pod label have to move together because DaemonSet selectors are
  namespace-scoped: two DaemonSets in tracebloc-node-agents both selecting
  `app: tracebloc-resource-monitor` would each grab the other's pods, which is
  worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through the
  helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding metadata.name
  were already release-scoped (`tracebloc-resource-monitor-<release>`) and stay
  as-is to avoid an unnecessary ClusterRole rename for upgrading installs. Only
  the *subject* names in (Cluster)RoleBinding change to point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via tracebloc.secretName /
  tracebloc.registrySecretName so they did not collide. Their `app` label was
  the literal value, which is harmless on uniquely-named resources but
  inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump signals
  that operators should review.
Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names
Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.
Upgrade path from 1.1.0: The DaemonSet and ServiceAccount rename triggers a
Helm three-way merge that DELETEs the old `tracebloc-resource-monitor`
resource and CREATEs the new release-scoped one. ~30-60s gap on each node
where resource metrics are not collected.
DaemonSet selector is immutable, so the delete-then-create path is what we
want — helm upgrade handles this automatically because the names diverge in
the stored manifest. No manual orphan cleanup needed.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)
The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed with
errno 111 and the Job CrashLoopBackOff'd before the first batch:
Database connection failed: 2003 (HY000): Can't connect to MySQL server on
'mysql-client:3306' (111)
RuntimeError: Database connection is not available for
load_dataframe_from_sql_table
Surfaced on a fresh client install (k3d / k3s, which enforces policy via the
built-in kube-router) where jobs-manager could reach mysql but every training
Job spawned with tracebloc.io/workload=training could not.
Add a third egress rule scoped to podSelector {app: mysql-client} on TCP/3306.
Same-namespace by default (no namespaceSelector), so it stays tight to the
chart's own mysql pod and does not open the namespace generally. The egress[1]
/32 ipBlock comment is updated to note that MySQL is now explicitly
re-permitted by egress[2].
Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod with the
training label was refused; post-fix it connects.
* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)
* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x
Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants (bmw,
cisco, charite) and any future tenant on the legacy chart family.
What's here:
- README.md walks the workflow + recommended ordering for the pending set +
  skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per tenant,
  /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml. Refuses to
  expand placeholder __FOO__ rows so an operator running generate.sh against
  the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is non-destructive
  (mysqldump-then-chunked-cp, AWS Backup on-demand recovery point, dry-run
  render). `phase2` is one-shot per tenant (helm uninstall, claimRef clear, SC
  re-create, PVC pre-create with release-scoped Helm ownership stamp, helm
  install, verify mysql data + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.
No real secrets in any committed file:
- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at runtime,
  no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env (only the
  .example variant is tracked)
Operational notes:
- Every kubectl/helm call passes --context explicitly. The 2026-04-27 prod run
  hit a context-drift bug mid-migration; the explicit form is a hard
  requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate PR).
  Until then the shared SA in tracebloc-node-agents collides with the stg
  release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and one-shot
  per tenant. Operators should pause and eyeball Phase 1 outputs before
  running Phase 2 — that's deliberately not automated.
Once all four pending tenants are on client-1.x, this directory is historical.
client-1.x -> client-1.y upgrades follow plain `helm upgrade` because the new
chart already templates `helm.sh/resource-policy: keep` on PVCs, so the
migration protocol isn't needed for routine upgrades.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(migration-tools): address bugbot review feedback on PR #74
Three issues flagged by Cursor Bugbot on the migration scripts:
* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer. Add
  portable `_md5` and `_size` helpers that pick md5sum on Linux, fall back to
  md5(1) on macOS, and use `wc -c` instead of stat for size.
* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD +
  PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD (MEDIUM).
  Literal `__FOO__` placeholders silently rendered into values.yaml/pvcs.yaml
  and only blew up at kubectl apply / helm install time. Iterate over every
  per-row field, plus a one-shot global check for DOCKER_PASSWORD before the
  loop. Error messages now name the offending field.
* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging install
  (image-pull error, mysql kill-loop recurrence, missing PVC binding) hung the
  script forever instead of surfacing the failure. Add a wall-clock deadline —
  default 600s, override via READY_TIMEOUT — and exit 1 with the last-seen pod
  state on timeout.
* fix(migration-tools): address bugbot follow-up on PR #74
Two more issues raised on the previous fix commit:
* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible yet
  caused `grep -c .` to exit 1, killing the script on the very first iteration
  before the wall-clock deadline could ever fire — defeating the bounded-wait
  intent. Guard the empty case explicitly. `wc -l` alone is also wrong because
  `echo ""` prints a newline.
* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD, CLIENT_*,
  and PV_* now have (LOW). An operator who copied the example without editing
  this row passed the non-empty gate, then the literal __LEGACY_MYSQL_ROOT_PW__
  went into mysqldump and Phase 1 blew up partway through with an opaque
  "Access denied" inside kubectl exec. Add the same `*__*__*` case guard right
  after the non-empty check.
* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)
The pre-source assignment EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"
was a no-op: `source "$CONFIG"` runs immediately after and the example config
(and any real tenant-config.env derived from it) unconditionally sets
EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every time.
Operators thinking they were targeting a non-default EFS would silently start
AWS Backup on-demand jobs against the hard-coded prod filesystem. Move the
override knob to AFTER source where env genuinely wins, drop the hard-coded
fallback, and require EFS_FS to be set somewhere (config or override) before
continuing.
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(client): release-scope SCC SA refs (v1.2.2) (#78)
Bugbot caught a High-severity miss in v1.2.0's release-scoping work (PR #72).
The OpenShift SCC template was the one resource-monitor file not updated when
the literal `tracebloc-resource-monitor` ServiceAccount name moved to
`<Release.Name>-resource-monitor`. On OpenShift the SCC granted access to a SA
name that no longer existed, so the resource-monitor DaemonSet pods would fail
to launch (no SCC -> can't mount hostPath /proc and /sys for node metrics).
The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name were
ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through — casual
reading suggested it was already done.
Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for consistency
  with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the literal
  name doesn't appear there either (cosmetic, but easier to audit "no literal
  references" with a single grep).
Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at release-scoped
  SA, ClusterRoleBinding subject does too, and two releases (stg +
  cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking the OLD
  literal name in users[0]; updated to the new release-scoped form
  (`RELEASE-NAME-resource-monitor`, helm-unittest's default release name when
  none is set).
- 98 -> 102 passing.
Verified end-to-end with two side-by-side `helm template` runs:
- stg -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor
Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that v1.2.0
inadvertently broke).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)
Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:
1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet display,
   while the actual DaemonSet name has been `<release>-resource-monitor` since
   v1.2.0. Operators see one name in the post-install banner and a different
   name when they `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.
2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0` on
   every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3, and
   operators running generate.sh today get PVC labels stuck at 1.1.0 even
   though the install ahead is 1.2.3. Helm adoption itself is unaffected (it
   keys on meta.helm.sh/release-name, not the chart label), but the labels lie
   until a subsequent upgrade reconciles them, and
   `kubectl get pvc -L helm.sh/chart` is misleading during migration debugging.
   Fixed by reading name + version from client/Chart.yaml at generate time.
Plus a few stale prose references caught while auditing the same path (no
functional impact, but the doc was directing operators at "client fix in
1.2.0" as if it were still pending):
- generate.sh inline comment on `resourceMonitor: false` rephrased from "until
  client-1.2.0 is published" to "until you have verified the chart you're
  installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to "mysql spec
  sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to reflect
  that 1.2.0+ has shipped — operators on >=1.2.0 can flip it to true without
  colliding with the stg release
Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper isn't
  accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare pre-1.2.3
  form, while still letting the SCC line `tracebloc-resource-monitor-<release>`
  further down the file pass)
- `resourceMonitor: false` removes the line entirely
End-to-end smoke of generate.sh confirms PVCs ship with the live chart version
(`helm.sh/chart: client-1.2.3` after this commit, verified against
/tmp/tracebloc-migration-<demo>/pvcs.yaml).
Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains the SCC
SA-ref rename. Once #78 lands the diff against develop will reduce to just
this commit.
Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix + tooling
correctness).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs(claude): require @saadqbal as PR assignee (#79)
Convention captured after a session-end ask. Every PR Claude opens for this
repo must be assigned to saadqbal — orphaned PRs without an assignee fall
through the review queue. Pass --assignee @me on `gh pr create` (or
--assignee saadqbal if running unauthenticated). No exceptions.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
chore(client): bump chart to 1.2.3 for release
…loses #70) (#83) (#84)
The chart unification (4 per-platform charts -> unified client/ chart) shipped
in v1.1.0; the unified chart has now been at v1.2.x in production across stg +
hasan-prod for several releases. Time to retire the legacy artifacts.
Removed:
- aks/, bm/, eks/, oc/ chart directories — 75 files, ~330KB. Each had a
  DEPRECATED.md pointing at the unified chart for ~6 months.
- 7 stale .tgz tarballs at repo root (aks-1.0.3, aks-1.0.4, bm-1.0.3, bm-1.0.4,
  eks-1.0.3, eks-1.0.4, oc-1.0.4). The release workflow publishes via gh-pages;
  these checked-in builds were dead weight.
- Root index.yaml — stale snapshot listing only 1.0.3/1.0.4 of the legacy
  charts. The live index served at tracebloc.github.io/client is on the
  gh-pages branch and is the source of truth.
- mysql.yaml at repo root — orphaned PVC manifest with hardcoded volume UUID
  and namespace. Audited: zero references anywhere in the repo.
Other:
- Added *.tgz to .gitignore so chart packages don't sneak back in.
- Updated client/MIGRATION.md Rollback section. The old "the legacy charts
  remain in aks/, bm/, eks/, oc/ and can be used at any time" was about to
  become a lie. Replaced with instructions to recover the directory from git
  history if anyone genuinely needs the old chart.
Verification:
- helm lint --strict ./client -f client/ci/eks-values.yaml — clean (same
  invocation the release workflow runs on every tag)
- helm unittest client — 105/105 still passing
- helm package ./client -d /tmp — produces a valid client-1.2.3.tgz
Net diff: 86 files changed, 17 insertions(+), 3447 deletions(-).
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Prod: Implement self-upgrade CronJob for Helm chart automation
The Deploy section opened with `docker pull tracebloc/client:latest`, but this
repo ships a Helm chart — the actual install is `helm install`. External
walkthrough URLs (`/local-linux`, `/local-macos`, `/aws`,
`/deployment-overview`) didn't match any path in the tracebloc/docs tree, so
they 404. The in-repo documentation (`docs/INSTALL.md`, `docs/MIGRATIONS.md`,
`docs/migration-tools/README.md`, `client/MIGRATION.md`) was never linked from
the README despite being the operational source of truth.
Surgical change — the rest of the README stays as-is:
- Replace `docker pull` with `helm repo add` + `helm install` (matches
  docs/INSTALL.md)
- Call out chart version (v1.3.1) and platform support (AKS / EKS / bare-metal
  / OpenShift) up front
- Table linking every in-repo operational doc
- Fix external URLs to match actual tracebloc/docs paths
  (local-deployment-guide-linux, local-deployment-guide-macos,
  eks-client-deployment-guide, azure-deployment-guide)
- Pull NetworkPolicy/CNI prerequisite into a callout
Closes #101
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: fix README Deploy section (Helm not docker), surface in-repo docs
The standalone installer (bash <(curl -fsSL tracebloc.io/i.sh) /
irm tracebloc.io/i.ps1 | iex) is the one-command path for evaluation, local
dev, and first-time installs — it provisions a cluster, detects GPU drivers,
and deploys the client. Today it isn't documented anywhere reachable from this
repo, so readers see the multi-step helm install flow as the only option.
README:
- New "Quick install" subsection at the top of Deploy with macOS/Linux and
  Windows commands, brief description of what it does, and a pointer to the
  local helper scripts under scripts/
- Existing helm flow relabeled as "Helm install (production)" — now positioned
  as the option for existing production clusters
docs/INSTALL.md:
- Top-of-doc callout pointing at the standalone installer for non-production
  users
- Production-focused content untouched
Closes #103
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous wording ("Best for evaluation, local dev, and first-time
installs" / "Just trying it out? For local dev or a quick evaluation")
implied the standalone installer produces a lesser/demo client. It
doesn't — it produces the same full client, just on a cluster the
script provisions for you.
Reframes the differentiator around cluster ownership instead of install
quality:
- README: "Use this when you don't already have a cluster — the result
is a full client install, not a demo." Helm subsection retitled
from "Helm install (production)" to just "Helm install" with
"For existing Kubernetes clusters".
- INSTALL.md: callout opens with "Don't have a Kubernetes cluster
yet?" and emphasizes "a full tracebloc client".
Refs #103
curl and PowerShell's irm both default to HTTP when no scheme is specified, so `curl -fsSL tracebloc.io/i.sh` and `irm tracebloc.io/i.ps1` issue plaintext requests. The downloaded body is piped straight into bash / iex, so a network-level attacker between the user and tracebloc.io could MITM the response and inject arbitrary code. Add explicit `https://` to every installer URL in README.md and docs/INSTALL.md so the request is encrypted from the first byte. Refs #103
docs: surface standalone installer in README and INSTALL.md
The chart was renamed and consolidated into tracebloc/client. The
"Publishing the chart (maintainers)" section still referenced
tracebloc-helm-charts as a possible alternate development repo, which
no longer exists.
- Step 1: simplify "the repo that hosts the chart (e.g. tracebloc/client
or tracebloc-helm-charts)" → "the tracebloc/client repo".
- Step 5 ("If you develop in a different repo..."): removed entirely —
there is no other repo.
- Trailing **Note:** about the cross-repo workflow: removed for the
same reason.
Closes #105
saqlainsyed007 approved these changes (May 6, 2026)
divyasinghds approved these changes (May 6, 2026)
Summary
The "Publishing the chart (maintainers)" section in `docs/INSTALL.md` still referenced `tracebloc-helm-charts` as a possible alternate development repo. That repo was renamed and consolidated into `tracebloc/client`; the references are misleading.
Net: 1 line added, 5 removed.
Closes #105
Test plan
🤖 Generated with Claude Code
Note: Low risk. Documentation-only change that removes outdated cross-repo publishing instructions; no runtime or deployment logic is modified.
Overview
Updates docs/INSTALL.md to reflect that Helm chart publishing is done solely from tracebloc/client, simplifying step 1 and removing the now-invalid guidance about developing/releasing from a separate tracebloc-helm-charts repo.
Reviewed by Cursor Bugbot for commit 0c42323.