feat(client): mysql kill-loop fix — resources, PriorityClass, PDB (v1.1.0)#66

Merged
saadqbal merged 3 commits into develop from feature/mysql-stability-and-qos
Apr 27, 2026

Conversation

@saadqbal
Contributor

@saadqbal saadqbal commented Apr 27, 2026

Summary

Fixes the prod issue where mysql-client pods crash-loop every ~12 minutes (~102s alive each) under cluster load, with Reason: Error / Exit 1 and no OOMKilled status — characteristic of a kernel-level OOM SIGKILL (cgroup OOM is reported as OOMKilled). Root cause: jobs-manager pods were deployed without any resource requests (BestEffort QoS) and leaked node memory under load; the kernel OOM killer then picked mysqld as the largest victim while it was allocating its InnoDB buffer pool during init.

Changes

  • mysql resources: requests.memory == limits.memory == 1Gi pins the cgroup memory budget so mysqld is no longer the oom_score_adj winner at the node level. CPU limit deliberately removed — throttled CPU causes InnoDB lock-wait timeouts and is a far more common cause of mysql crashes than a noisy CPU neighbour. CPU request bumped to 250m.
  • mysql probes: livenessProbe.timeoutSeconds: 5, failureThreshold: 5. The default 1s timeout falsely fires under any CPU contention because mysqladmin ping itself needs >1s to spawn. Readiness/startup probes get explicit timeoutSeconds for symmetry.
  • jobs-manager api: requests memory 256Mi → 512Mi, limit 512Mi → 1Gi. 256Mi was too tight under load. Pods-monitor sidecar bumped 128/256 → 256/512.
  • PriorityClass tracebloc-data-plane (value 1,000,000), referenced by mysql. Above default user workloads (so the scheduler preempts noisy training jobs to keep mysql scheduled), well below system-cluster-critical. Cluster-scoped, helm.sh/resource-policy: keep. Toggle via priorityClass.create.
  • mysql PDB gated on podDisruptionBudget.mysql.create (default true), minAvailable: 1.
  • jobs-manager PDB: the existing template set maxUnavailable: 1 on a 1-replica deployment, which is a no-op (the only pod can always go away). Switched to minAvailable: 1 with the same toggle pattern. Unrelated to the kill-loop, but the old PDB provided effectively no protection.
  • All resource numbers exposed under resources.{mysql,jobsManager,podsMonitor} so operators managing tighter budgets can override without forking.
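
Taken together, the new defaults correspond to a values.yaml shape roughly like the following sketch. The `resources.{mysql,jobsManager,podsMonitor}` paths and the numbers come from this PR; the exact nesting and the placement of the probe overrides are assumptions.

```yaml
# Sketch only — resources.* paths follow the PR text; probe key placement is assumed.
resources:
  mysql:
    requests:
      cpu: 250m
      memory: 1Gi     # requests == limits pins the cgroup memory budget
    limits:
      memory: 1Gi     # no cpu limit on purpose: CPU throttling causes InnoDB lock-wait timeouts
  jobsManager:
    requests:
      memory: 512Mi
    limits:
      memory: 1Gi
  podsMonitor:
    requests:
      memory: 256Mi
    limits:
      memory: 512Mi

mysql:
  livenessProbe:
    timeoutSeconds: 5    # default 1s falsely fires under CPU contention
    failureThreshold: 5
```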

Chart bumped 1.0.8 → 1.1.0 (minor, not patch): switching mysql to a PriorityClass and changing default resource shapes is a behaviour change for existing installs — operators with custom overrides should re-review before upgrading.

Test plan

  • helm lint clean
  • helm unittest client — 90/90 pass (was 76; new tests cover memory parity + no cpu limit, liveness timeout, priorityClassName conditional, PriorityClass + both PDB templates, jobs-manager memory floors)
  • Fresh install on k3d: PriorityClass created cluster-scoped with keep annotation, both PDBs render minAvailable: 1, mysql pod has the new resource shape and probe timing
  • Upgrade path: existing PDB flips from maxUnavailable to minAvailable cleanly
  • Production rollout — pick one tenant namespace first, watch kubectl describe pod <mysql> for Reason: Error / Exit 1 to disappear from Last State. Re-evaluate after 24h before rolling to remaining tenants.

Out of scope (follow-ups)

  • jobs-manager / pods-monitor still have no liveness/readiness probes (Low finding from the original security review). Adding sensible probes needs an HTTP health endpoint or a defensible exec probe — separate PR.
  • MySQL config tuning (innodb_buffer_pool_size, max_connections, performance_schema = OFF) — touches the existing mysql-client-config ConfigMap behaviour and could affect performance characteristics for working installs; warrants its own change with a feature flag.
  • Operational: production releases on bmw, cisco, charite, tracebloc-templates-prod, tracebloc-templates-stg are still on a pre-v1.0.4 chart — upgrading to v1.1.0 picks up everything in this PR plus the prior 1.0.4–1.0.8 fixes.

Note

Medium Risk
Changes Kubernetes scheduling/eviction behavior and default resource/probe settings for stateful MySQL and core services, which can affect cluster capacity and upgrade behavior even though the changes are configuration-scoped.

Overview
Bumps the Helm chart to 1.1.0 and changes default runtime behavior to reduce MySQL crash loops under cluster pressure.

This introduces configurable resources for mysql, jobsManager, and podsMonitor, with higher default memory for jobs-manager/pods-monitor and MySQL set to memory requests==limits (and no default CPU limit). MySQL probes are made more tolerant via increased timeoutSeconds/failureThreshold.

Adds an optional cluster-scoped PriorityClass (kept on uninstall) referenced by the MySQL pod, and gates new/updated PodDisruptionBudgets behind podDisruptionBudget.{mysql,jobsManager}.create, switching jobs-manager from maxUnavailable to minAvailable for single-replica protection.

Reviewed by Cursor Bugbot for commit 61af0d1.

Prod investigation showed all four mysql instances crash-looping every
~12 minutes, ~102s alive each. Reason: Error / Exit 1, no probe
failures, no OOMKilled status — characteristic of kernel-level OOM
SIGKILL (cgroup OOM is reported as OOMKilled). Root cause: jobs-manager
pods deployed without resource requests at all (BestEffort QoS) leaked
node memory under load; the kernel OOM scanner picked mysqld as the
fattest victim while it was allocating its InnoDB buffer pool during
init.

Chart-side fixes:

- mysql resources: requests.memory == limits.memory == 1Gi pins the
  cgroup memory budget so mysqld is no longer the oom_score_adj winner
  at node level. CPU limit deliberately *removed* — throttled CPU
  causes InnoDB lock-wait timeouts and is a far more common cause of
  mysql crashes than a noisy CPU neighbour. cpu request bumped to 250m.

- mysql liveness probe: timeoutSeconds: 5, failureThreshold: 5. The
  default 1s timeout falsely fires under any CPU contention because
  `mysqladmin ping` needs >1s to even spawn. readiness/startup probes
  also get explicit timeoutSeconds for symmetry.

- jobs-manager api: requests memory 256Mi -> 512Mi, limit 512Mi -> 1Gi.
  256Mi was too tight under load; the Service Bus client + pod-event
  watcher push RSS over 256Mi within minutes on a busy cluster.
  pods-monitor sidecar bumped 128/256 -> 256/512 likewise.

- New PriorityClass `tracebloc-data-plane` (value 1,000,000), referenced
  by mysql. Sits above default user workloads so the scheduler
  preempts noisy training jobs to keep mysql scheduled, well below
  system-cluster-critical so it cannot starve cluster-essential pods.
  Cluster-scoped, helm.sh/resource-policy: keep so a release uninstall
  doesn't yank it from sibling releases. Toggle via priorityClass.create.

- mysql PodDisruptionBudget gated on podDisruptionBudget.mysql.create
  (default true). minAvailable: 1 blocks voluntary disruptions of the
  single mysql replica.

- jobs-manager PDB: existing template had maxUnavailable: 1 on a
  1-replica deployment, which is a no-op (the only pod can always go
  away). Switched to minAvailable: 1 + same toggle pattern. This was
  unrelated to the kill-loop but was effectively no protection.

- All resource numbers exposed under `resources.{mysql,jobsManager,
  podsMonitor}` in values.yaml + values.schema.json so operators
  managing tighter budgets can override without forking.
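
For reference, the PriorityClass and mysql PDB bullets above would render to manifests along these lines. The class name, value, and keep annotation are from the PR; the resource name and selector labels are assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tracebloc-data-plane
  annotations:
    helm.sh/resource-policy: keep   # survives `helm uninstall`; shared by sibling releases
value: 1000000   # above default user workloads, far below system-cluster-critical (2000000000)
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-client   # assumed name
spec:
  minAvailable: 1      # blocks voluntary disruptions of the single replica
  selector:
    matchLabels:
      app: mysql-client   # assumed label; the chart's real selector may differ
```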

Chart bumped 1.0.8 -> 1.1.0 (minor, not patch): switching mysql to a
PriorityClass and changing default resource shapes is a behaviour
change for existing installs — operators with custom overrides should
re-review before upgrading.

Tests: 76 -> 90 (mysql memory parity + no-cpu-limit, liveness timeout,
priorityClassName conditional, PriorityClass + both PDB templates,
jobs-manager memory floors). All 90 pass. helm lint clean. End-to-end
smoke install on k3d verified PriorityClass created cluster-scoped,
both PDBs render with minAvailable: 1, mysql pod has the new resource
shape and probe timing. Upgrade path also verified — existing PDB
flips from maxUnavailable to minAvailable cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread client/templates/mysql-deployment.yaml
Comment thread client/values.schema.json
Two issues from automated review of #65 (mysql kill-loop fix):

1. priorityClassName was lost when using externally-managed PriorityClass.
   The mysql template gated `priorityClassName` on `priorityClass.create`,
   which conflated "chart manages this resource" with "pod references this
   resource." Operators on GitOps / shared-platform setups (create: false,
   PriorityClass managed out-of-band) silently lost the OOM protection
   the rest of the v1.1.0 work depends on. Now gated on
   `priorityClass.name` being non-empty instead, with an `allOf` schema
   guard so create=true + empty name fails fast at install time. Empty
   name with create=false is the explicit "no PriorityClass" opt-out.

2. Schema advertised mysql.limits.cpu but the template ignored it.
   `resources.mysql.limits.cpu` passed schema validation and was
   silently dropped at render — operators flipping it on saw no error
   and no effect. Now rendered conditionally via `with` so the default
   stays "no CPU limit" (the whole point of the v1.1.0 mysql tuning)
   but explicit overrides land in the manifest.
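
A minimal sketch of both fixes in Helm template terms — the value paths match the PR; the surrounding deployment structure is assumed:

```yaml
spec:
  {{- if .Values.priorityClass.name }}
  priorityClassName: {{ .Values.priorityClass.name }}  # gated on name, not on create
  {{- end }}
  containers:
    - name: mysql
      resources:
        limits:
          memory: {{ .Values.resources.mysql.limits.memory }}
          {{- with .Values.resources.mysql.limits.cpu }}
          cpu: {{ . }}  # rendered only when explicitly set; default stays "no CPU limit"
          {{- end }}
```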

Tests: 90 -> 93. New cases cover:
- create=false + named PriorityClass keeps priorityClassName on the pod
- create=false + empty name is the only path that drops priorityClassName
- create=true + empty name fails schema validation
- explicit mysql.limits.cpu lands in the rendered manifest

values.yaml comment for priorityClass rewritten to spell out the three
cases (create+name, externally-managed, disabled).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 29da809.

Comment thread client/values.schema.json Outdated
Bugbot follow-up on the bugbot follow-up. Draft-07 `properties` only
validates keys that are *present*; without `required: [create]`, omitting
`create` passes the `if` vacuously and unconditionally enforces the
`then` (name minLength: 1). Helm merges defaults so this is harmless in
practice today, but the schema was technically incorrect and out of step
with the dockerRegistry conditional in the same file. Aligned both
patterns.
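
Concretely, the aligned draft-07 pattern looks roughly like this (field names follow the discussion above; the surrounding schema context is assumed):

```json
{
  "priorityClass": {
    "type": "object",
    "allOf": [
      {
        "if": {
          "required": ["create"],
          "properties": { "create": { "const": true } }
        },
        "then": {
          "required": ["name"],
          "properties": { "name": { "type": "string", "minLength": 1 } }
        }
      }
    ]
  }
}
```

Without `required: ["create"]` inside the `if`, an object that omits `create` satisfies the `if` vacuously, so the `then` (non-empty `name`) would be enforced unconditionally.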

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@saadqbal saadqbal merged commit 2d9d013 into develop Apr 27, 2026
1 check passed
@saadqbal saadqbal self-assigned this Apr 28, 2026
@saadqbal saadqbal deleted the feature/mysql-stability-and-qos branch May 13, 2026 14:01