chore: add PR template + customer bump + stale auto-close #65

Merged
saadqbal merged 4 commits into develop from chore/process-bootstrap on Apr 27, 2026

Conversation

@LukasWodka
Contributor

Summary

Adds three small process files (PR template + 2 workflows) to bring this repo onto the new engineering process.

Files

| File | What it does |
| --- | --- |
| .github/pull_request_template.md | Standardizes every PR: summary / type / test plan / checklist |
| .github/workflows/customer-priority-bump.yml | When the from:customer label is applied to an issue, auto-bumps Priority=P1 on the kanban (calls a reusable workflow in tracebloc/.github) |
| .github/workflows/stale-backlog.yml | Runs Mondays 00:00 UTC: issues idle 6+ weeks get a warning, issues idle 8+ weeks are auto-closed. Exempts keep-open and blocked. Does NOT touch PRs. |
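
As a rough illustration of the customer-bump row, a caller workflow could look like the sketch below. The PR text only says that a reusable workflow in tracebloc/.github is called; the reusable workflow's file name, the `@main` ref, and the use of `secrets: inherit` are assumptions made for the example, not details taken from the merged file.

```yaml
# .github/workflows/customer-priority-bump.yml (sketch, not the merged file)
name: Customer priority bump

on:
  issues:
    types: [labeled]

jobs:
  bump:
    # Only react when the label that was just applied is from:customer.
    if: github.event.label.name == 'from:customer'
    # ASSUMPTION: the file name and ref of the reusable workflow are hypothetical.
    uses: tracebloc/.github/.github/workflows/customer-priority-bump.yml@main
    # ASSUMPTION: the reusable workflow needs a token to edit the project board.
    secrets: inherit
```

Keeping the board logic in one reusable workflow means each repo only carries this thin trigger file.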

Activation

PR template: active immediately on merge for any new PR.
Customer bump: requires from:customer label (already created in this repo).
Stale auto-close: first run on next Monday. Use Run workflow in Actions tab to test manually.
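
For orientation, the stale auto-close behaviour described above could be expressed with the public `actions/stale` action roughly as follows. Whether the merged workflow actually uses `actions/stale` is an assumption, and the message texts are placeholders; the schedule, the 6/8 week thresholds, the exempt labels, and the PR exclusion come from the description.

```yaml
# .github/workflows/stale-backlog.yml (sketch, not the merged file)
name: Stale backlog

on:
  schedule:
    - cron: "0 0 * * 1"    # Mondays 00:00 UTC
  workflow_dispatch: {}     # enables "Run workflow" in the Actions tab

permissions:
  issues: write

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      # ASSUMPTION: actions/stale is used; the PR text does not name the action.
      - uses: actions/stale@v9
        with:
          days-before-issue-stale: 42    # 6 weeks idle: post the warning
          days-before-issue-close: 14    # 2 further idle weeks: close (8 weeks total)
          exempt-issue-labels: "keep-open,blocked"
          days-before-pr-stale: -1       # never mark PRs stale
          days-before-pr-close: -1       # never close PRs
          # Placeholder wording, not the repo's actual messages.
          stale-issue-message: "Idle for 6 weeks; will be closed in 2 weeks unless updated."
          close-issue-message: "Closed after 8 weeks without activity."
```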

🤖 Generated with Claude Code

LukasWodka requested a review from saadqbal on April 25, 2026 16:43
saadqbal added a commit that referenced this pull request Apr 27, 2026
Two issues from automated review of #65 (mysql kill-loop fix):

1. priorityClassName was lost when using externally-managed PriorityClass.
   The mysql template gated `priorityClassName` on `priorityClass.create`,
   which conflated "chart manages this resource" with "pod references this
   resource." Operators on GitOps / shared-platform setups (create: false,
   PriorityClass managed out-of-band) silently lost the OOM protection
   the rest of the v1.1.0 work depends on. Now gated on
   `priorityClass.name` being non-empty instead, with an `allOf` schema
   guard so create=true + empty name fails fast at install time. Empty
   name with create=false is the explicit "no PriorityClass" opt-out.

2. Schema advertised mysql.limits.cpu but the template ignored it.
   `resources.mysql.limits.cpu` passed schema validation and was
   silently dropped at render — operators flipping it on saw no error
   and no effect. Now rendered conditionally via `with` so the default
   stays "no CPU limit" (the whole point of the v1.1.0 mysql tuning)
   but explicit overrides land in the manifest.
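
A minimal sketch of how the two fixes might look in the mysql template. The value paths `priorityClass.name` and `resources.mysql.limits.cpu` come from the message above; the remaining value paths, the container name, and the surrounding structure are assumptions for illustration.

```yaml
# mysql workload template fragment (sketch; surrounding fields are assumed)
spec:
  template:
    spec:
      # Gate on the name, not on priorityClass.create, so an externally
      # managed PriorityClass is still referenced by the pod.
      {{- with .Values.priorityClass.name }}
      priorityClassName: {{ . | quote }}
      {{- end }}
      containers:
        - name: mysql
          resources:
            requests:
              # ASSUMPTION: requests live next to limits under resources.mysql.
              memory: {{ .Values.resources.mysql.requests.memory | quote }}
              cpu: {{ .Values.resources.mysql.requests.cpu | quote }}
            limits:
              memory: {{ .Values.resources.mysql.limits.memory | quote }}
              # Rendered only when an operator sets limits.cpu explicitly;
              # with the default (unset) no cpu limit appears in the manifest.
              {{- with .Values.resources.mysql.limits.cpu }}
              cpu: {{ . | quote }}
              {{- end }}
```

`with` skips its body for empty values, which is what lets an empty `priorityClass.name` act as the opt-out.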

Tests: 90 -> 93. New cases cover:
- create=false + named PriorityClass keeps priorityClassName on the pod
- create=false + empty name is the only path that drops priorityClassName
- create=true + empty name fails schema validation
- explicit mysql.limits.cpu lands in the rendered manifest

values.yaml comment for priorityClass rewritten to spell out the three
cases (create+name, externally-managed, disabled).
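
The three cases can be pictured as three alternative override files, shown here as separate YAML documents. The keys `priorityClass.create` and `priorityClass.name` are from the message; the external class name `platform-data-plane` is a made-up example.

```yaml
# Case 1: chart creates the PriorityClass and the mysql pod references it.
priorityClass:
  create: true
  name: tracebloc-data-plane
---
# Case 2: PriorityClass is managed out-of-band (GitOps / shared platform);
# the chart creates nothing but the pod still gets priorityClassName.
priorityClass:
  create: false
  name: platform-data-plane   # hypothetical, externally managed class
---
# Case 3: explicit opt-out; no PriorityClass is created and the pod
# carries no priorityClassName.
priorityClass:
  create: false
  name: ""
```

The fourth combination, `create: true` with an empty name, is the one the schema guard rejects at install time.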

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
saadqbal merged commit 39b995c into develop on Apr 27, 2026
saadqbal added a commit that referenced this pull request Apr 27, 2026
….1.0) (#66)

* feat(client): mysql kill-loop fix — resources, PriorityClass, PDB (v1.1.0)

Prod investigation showed all four mysql instances crash-looping every
~12 minutes, ~102s alive each. Reason: Error / Exit 1, no probe
failures, no OOMKilled status — characteristic of kernel-level OOM
SIGKILL (cgroup OOM is reported as OOMKilled). Root cause: jobs-manager
pods deployed without resource requests at all (BestEffort QoS) leaked
node memory under load; the kernel OOM scanner picked mysqld as the
fattest victim while it was allocating its InnoDB buffer pool during
init.

Chart-side fixes:

- mysql resources: requests.memory == limits.memory == 1Gi pins the
  cgroup memory budget so mysqld is no longer the oom_score_adj winner
  at node level. CPU limit deliberately *removed* — throttled CPU
  causes InnoDB lock-wait timeouts and is a far more common cause of
  mysql crashes than a noisy CPU neighbour. cpu request bumped to 250m.

- mysql liveness probe: timeoutSeconds: 5, failureThreshold: 5. The
  default 1s timeout falsely fires under any CPU contention because
  `mysqladmin ping` needs >1s to even spawn. readiness/startup probes
  also get explicit timeoutSeconds for symmetry.

- jobs-manager api: requests memory 256Mi -> 512Mi, limit 512Mi -> 1Gi.
  256Mi was too tight under load; the Service Bus client + pod-event
  watcher push RSS over 256Mi within minutes on a busy cluster.
  pods-monitor sidecar bumped 128/256 -> 256/512 likewise.

- New PriorityClass `tracebloc-data-plane` (value 1,000,000), referenced
  by mysql. Sits above default user workloads so the scheduler
  preempts noisy training jobs to keep mysql scheduled, well below
  system-cluster-critical so it cannot starve cluster-essential pods.
  Cluster-scoped, helm.sh/resource-policy: keep so a release uninstall
  doesn't yank it from sibling releases. Toggle via priorityClass.create.

- mysql PodDisruptionBudget gated on podDisruptionBudget.mysql.create
  (default true). minAvailable: 1 blocks voluntary disruptions for as
  long as mysql runs as a single replica.

- jobs-manager PDB: existing template had maxUnavailable: 1 on a
  1-replica deployment, which is a no-op (the only pod can always go
  away). Switched to minAvailable: 1 + same toggle pattern. This was
  unrelated to the kill-loop but was effectively no protection.

- All resource numbers exposed under `resources.{mysql,jobsManager,
  podsMonitor}` in values.yaml + values.schema.json so operators
  managing tighter budgets can override without forking.
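
To make the first two bullets concrete, the rendered mysql container could look roughly like this. The resource figures and the liveness settings are taken from the bullets above; the container name and the exact probe command arguments are assumptions.

```yaml
# Rendered mysql container fragment (sketch)
containers:
  - name: mysql                 # hypothetical container name
    resources:
      requests:
        memory: 1Gi             # equal to the limit, per the first bullet
        cpu: 250m
      limits:
        memory: 1Gi
        # no cpu key: the CPU limit is deliberately left out
    livenessProbe:
      exec:
        # ASSUMPTION: the real probe may pass host/credential flags.
        command: ["mysqladmin", "ping"]
      timeoutSeconds: 5         # replaces the 1s default
      failureThreshold: 5
```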

Chart bumped 1.0.8 -> 1.1.0 (minor, not patch): switching mysql to a
PriorityClass and changing default resource shapes is a behaviour
change for existing installs — operators with custom overrides should
re-review before upgrading.
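
An operator reviewing custom overrides before the upgrade might end up with a file along these lines. The top-level keys follow the `resources.{mysql,jobsManager,podsMonitor}` and `podDisruptionBudget.mysql.create` names given above; the nested `requests`/`limits` layout for `jobsManager` and `podsMonitor` and all numbers in this file are illustrative assumptions, not recommendations.

```yaml
# my-overrides.yaml (hypothetical file for a memory-constrained cluster)
resources:
  mysql:
    requests:
      memory: 768Mi
      cpu: 250m
    limits:
      memory: 768Mi     # keep equal to requests.memory to preserve parity
  jobsManager:
    requests:
      memory: 512Mi
    limits:
      memory: 1Gi
  podsMonitor:
    requests:
      memory: 256Mi
    limits:
      memory: 512Mi
podDisruptionBudget:
  mysql:
    create: true        # chart default; set false to drop the PDB
```

Such a file would be passed to `helm upgrade` with `-f my-overrides.yaml`.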

Tests: 76 -> 90 (mysql memory parity + no-cpu-limit, liveness timeout,
priorityClassName conditional, PriorityClass + both PDB templates,
jobs-manager memory floors). All 90 pass. helm lint clean. End-to-end
smoke install on k3d verified PriorityClass created cluster-scoped,
both PDBs render with minAvailable: 1, mysql pod has the new resource
shape and probe timing. Upgrade path also verified — existing PDB
flips from maxUnavailable to minAvailable cleanly.
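
For reference, the objects checked in the smoke install would look approximately like the manifests below. The class name, value, keep policy, and `minAvailable: 1` are from the message; the PDB name, the selector labels, and the description text are assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tracebloc-data-plane
  annotations:
    helm.sh/resource-policy: keep   # survives uninstall of a single release
value: 1000000                      # far below system-cluster-critical (2000000000)
globalDefault: false
description: "Data-plane workloads such as mysql"   # placeholder text
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql                       # hypothetical name
spec:
  # With one replica, minAvailable: 1 allows zero voluntary evictions,
  # whereas maxUnavailable: 1 would allow the only pod to be evicted.
  minAvailable: 1
  selector:
    matchLabels:
      app: mysql                    # hypothetical label
```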

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): address bugbot review on v1.1.0 PriorityClass + mysql cpu

Two issues from automated review of #65 (mysql kill-loop fix):

1. priorityClassName was lost when using externally-managed PriorityClass.
   The mysql template gated `priorityClassName` on `priorityClass.create`,
   which conflated "chart manages this resource" with "pod references this
   resource." Operators on GitOps / shared-platform setups (create: false,
   PriorityClass managed out-of-band) silently lost the OOM protection
   the rest of the v1.1.0 work depends on. Now gated on
   `priorityClass.name` being non-empty instead, with an `allOf` schema
   guard so create=true + empty name fails fast at install time. Empty
   name with create=false is the explicit "no PriorityClass" opt-out.

2. Schema advertised mysql.limits.cpu but the template ignored it.
   `resources.mysql.limits.cpu` passed schema validation and was
   silently dropped at render — operators flipping it on saw no error
   and no effect. Now rendered conditionally via `with` so the default
   stays "no CPU limit" (the whole point of the v1.1.0 mysql tuning)
   but explicit overrides land in the manifest.

Tests: 90 -> 93. New cases cover:
- create=false + named PriorityClass keeps priorityClassName on the pod
- create=false + empty name is the only path that drops priorityClassName
- create=true + empty name fails schema validation
- explicit mysql.limits.cpu lands in the rendered manifest

values.yaml comment for priorityClass rewritten to spell out the three
cases (create+name, externally-managed, disabled).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): add required:[create] to PriorityClass conditional schema

Bugbot follow-up on the bugbot follow-up. Draft-07 `properties` only
validates keys that are *present*; without `required: [create]`, a
values object that omits `create` satisfies the `if` vacuously, so the
`then` (name minLength: 1) is enforced even though `create` is not
true. Helm merges defaults so this is harmless in practice today, but
the schema was technically incorrect and out of step with the
dockerRegistry conditional in the same file. Aligned both patterns.
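
The corrected conditional can be sketched as follows. It is written in YAML notation for consistency with the other examples here; the chart's values.schema.json would hold the equivalent JSON, and the exact surrounding schema is an assumption.

```yaml
# priorityClass section of the values schema (sketch, YAML rendering of JSON Schema)
priorityClass:
  type: object
  properties:
    create:
      type: boolean
    name:
      type: string
  allOf:
    - if:
        properties:
          create:
            const: true
        # Without this, an object lacking `create` matches the `if`
        # vacuously and the `then` below is applied to it.
        required: [create]
      then:
        properties:
          name:
            minLength: 1
```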

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>