feat(k8s/vpa): expose controlledValues + fix controlledResources nesting #280
Two related fixes to cloudExtras.vpa generation, found while
rightsizing the PAY-SPACE org's GKE Autopilot clusters.
1. Add ControlledValues field to VPAConfig.
K8s VPA's controlledValues knob ("RequestsAndLimits" default,
"RequestsOnly" optional) is currently not exposed by SC. Without
it, VPA always scales the CPU limit proportionally with the
request. Lowering minAllowed.cpu below ~250m therefore shrinks the
container's CPU limit far enough that Django/gunicorn-style cold
starts CPU-throttle, fail the startup probe, and get SIGKILL'd by
kubelet — even though the actual workload has plenty of headroom
in steady state.
With controlledValues: RequestsOnly, VPA rewrites only requests
at admission and leaves the deployment template's limits alone,
so cold-start bursts use the (higher) template limit.
2. Move controlledResources from resourcePolicy into the
containerPolicy entry.
Per the VPA CRD (autoscaling.k8s.io/v1), controlledResources is
a per-container field — it lives at
resourcePolicy.containerPolicies[*].controlledResources, not at
resourcePolicy.controlledResources. Before this commit SC wrote
it at the wrong nesting level; k8s silently dropped it on
admission. Verified by `kubectl explain
vpa.spec.resourcePolicy.containerPolicies.controlledResources`
and by reading a live VPA on a PAY-SPACE cluster that had
controlledResources set in client.yaml but missing from the
in-cluster spec.
The new ControlledValues field is placed inside containerPolicy
in the same fix.
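To make the nesting in fix 2 concrete, here is a minimal Go sketch. It is not SC's actual createVPA code: the VPAConfig struct shown here, the buildVPASpec helper, and the "web-app" container name are illustrative assumptions. Only the field paths are taken from the autoscaling.k8s.io/v1 CRD as described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VPAConfig is a hypothetical stand-in for SC's VPAConfig. Only the fields
// relevant to this fix are shown; the real struct may differ.
type VPAConfig struct {
	UpdateMode          string
	MinAllowed          map[string]string
	MaxAllowed          map[string]string
	ControlledResources []string
	ControlledValues    string // "RequestsAndLimits" (VPA default) or "RequestsOnly"
}

// buildVPASpec (hypothetical helper) returns a VPA spec fragment with both
// knobs nested inside the containerPolicies entry, not at resourcePolicy level.
func buildVPASpec(containerName string, cfg VPAConfig) map[string]any {
	policy := map[string]any{"containerName": containerName}
	if len(cfg.MinAllowed) > 0 {
		policy["minAllowed"] = cfg.MinAllowed
	}
	if len(cfg.MaxAllowed) > 0 {
		policy["maxAllowed"] = cfg.MaxAllowed
	}
	if len(cfg.ControlledResources) > 0 {
		policy["controlledResources"] = cfg.ControlledResources
	}
	if cfg.ControlledValues != "" {
		policy["controlledValues"] = cfg.ControlledValues
	}
	return map[string]any{
		"updatePolicy": map[string]any{"updateMode": cfg.UpdateMode},
		"resourcePolicy": map[string]any{
			"containerPolicies": []any{policy},
		},
	}
}

func main() {
	spec := buildVPASpec("web-app", VPAConfig{
		UpdateMode:          "Auto",
		MinAllowed:          map[string]string{"cpu": "50m", "memory": "64Mi"},
		MaxAllowed:          map[string]string{"cpu": "2", "memory": "4Gi"},
		ControlledResources: []string{"cpu", "memory"},
		ControlledValues:    "RequestsOnly",
	})
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Leaving ControlledValues empty omits the key entirely, so the VPA default (RequestsAndLimits) applies and existing configs keep their current behavior.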
No schema break — these are additive struct fields and a
containerPolicy reshuffle that previously did nothing. Existing
tests in TestStackConfigCompose_Copy/VPA_configuration_in_CloudExtras
extended to cover controlledValues round-trip.
Usage in client.yaml after this lands:
    cloudExtras:
      vpa:
        enabled: true
        updateMode: "Auto"
        minAllowed: { cpu: "50m", memory: "64Mi" }
        maxAllowed: { cpu: "2", memory: "4Gi" }
        controlledResources: ["cpu", "memory"]
        controlledValues: "RequestsOnly"
This unblocks PAY-SPACE/crypto#853's hotfix pattern: the 250m CPU
floor we currently hold across the org to avoid the proportional
shrink can drop to 50m once consumers adopt controlledValues:
RequestsOnly. Frees ~15 CPU on the production cluster, which is
currently at the 64-CPU global quota cap.
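The proportional shrink can be illustrated with a little arithmetic. The Go sketch below assumes a hypothetical deployment template with a 250m CPU request and a 1000m CPU limit (these template values are not from the PR) and models VPA as preserving the template's limit/request ratio in RequestsAndLimits mode. Integer division is a simplification; real VPA rounding may differ.

```go
package main

import "fmt"

// effectiveLimitMilli returns the CPU limit (in millicores) a container ends
// up with after VPA rewrites its request to newRequest.
//   - RequestsOnly: the template limit is left untouched.
//   - otherwise (RequestsAndLimits): the limit is scaled so that the
//     template's limit/request ratio is preserved.
func effectiveLimitMilli(controlledValues string, templateRequest, templateLimit, newRequest int64) int64 {
	if controlledValues == "RequestsOnly" || templateRequest == 0 {
		return templateLimit
	}
	return newRequest * templateLimit / templateRequest
}

func main() {
	// Hypothetical template values: request 250m, limit 1000m (ratio 4:1).
	const templateRequest, templateLimit = 250, 1000
	for _, mode := range []string{"RequestsAndLimits", "RequestsOnly"} {
		limit := effectiveLimitMilli(mode, templateRequest, templateLimit, 50)
		fmt.Printf("%s: request 50m, limit %dm\n", mode, limit)
	}
}
```

Under these assumptions a 50m request leaves only a 200m limit in RequestsAndLimits mode, while RequestsOnly keeps the full 1000m available for cold-start bursts.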
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Semgrep Scan Results: scanned at 2026-05-20 19:23 UTC
Security Scan Results: scanned at 2026-05-20 19:23 UTC
Reviewed by Codex + Gemini (independent runs).
Codex: no findings.
Gemini: 3 actionable findings.
Style notes (not addressed)
Pass-through smoke test that constructs a SimpleContainer with VPA configured for the full surface (minAllowed, maxAllowed, controlledResources, controlledValues) and verifies that resource creation succeeds without error. Pairs with the existing TestNewSimpleContainer_WithVPA, which only exercises the minimal enabled+updateMode shape.

The exact in-cluster VPA spec shape (controlledResources and controlledValues living inside containerPolicy rather than at resourcePolicy level) is covered by reading and trusting createVPA in simple_container.go, verified against the live K8s VPA CRD via kubectl explain, rather than by a test assertion.

Addresses the Gemini review feedback on the parent commit: prior to this, no kubernetes-package test exercised the new ControlledValues code path at all.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
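The spec shape that this commit leaves to code reading could, in principle, be checked mechanically. The following Go sketch shows one way to do that against a spec decoded into generic maps. The checkVPAShape helper and the sample specs are hypothetical and are not part of this PR or of SC's test suite.

```go
package main

import (
	"errors"
	"fmt"
)

// checkVPAShape (hypothetical helper) returns an error if controlledResources
// or controlledValues appear at resourcePolicy level, or if either is missing
// from the first containerPolicies entry.
func checkVPAShape(spec map[string]any) error {
	rp, ok := spec["resourcePolicy"].(map[string]any)
	if !ok {
		return errors.New("resourcePolicy missing")
	}
	keys := []string{"controlledResources", "controlledValues"}
	for _, key := range keys {
		if _, misplaced := rp[key]; misplaced {
			return fmt.Errorf("%s set at resourcePolicy level; expected inside containerPolicies", key)
		}
	}
	policies, ok := rp["containerPolicies"].([]any)
	if !ok || len(policies) == 0 {
		return errors.New("containerPolicies missing or empty")
	}
	first, ok := policies[0].(map[string]any)
	if !ok {
		return errors.New("containerPolicies[0] is not an object")
	}
	for _, key := range keys {
		if _, present := first[key]; !present {
			return fmt.Errorf("containerPolicies[0].%s not set", key)
		}
	}
	return nil
}

func main() {
	// Sample spec with both knobs in the per-container policy.
	good := map[string]any{
		"resourcePolicy": map[string]any{
			"containerPolicies": []any{map[string]any{
				"containerName":       "web-app",
				"controlledResources": []string{"cpu", "memory"},
				"controlledValues":    "RequestsOnly",
			}},
		},
	}
	// Sample spec reproducing the old nesting, with the knob one level too high.
	bad := map[string]any{
		"resourcePolicy": map[string]any{
			"controlledResources": []string{"cpu", "memory"},
			"containerPolicies":   []any{map[string]any{"containerName": "web-app"}},
		},
	}
	fmt.Println("good:", checkVPAShape(good))
	fmt.Println("bad:", checkVPAShape(bad))
}
```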
Summary
Two related fixes to `cloudExtras.vpa` generation, found while rightsizing the PAY-SPACE org's GKE Autopilot clusters.

1. Add `ControlledValues` to `VPAConfig`

K8s VPA's `controlledValues` knob (`RequestsAndLimits` default, `RequestsOnly` optional) is currently not exposed by SC. Without it, VPA always scales the CPU limit proportionally with the request.

In practice this means lowering `minAllowed.cpu` below ~250m shrinks the container's CPU limit far enough that Django/gunicorn-style cold starts CPU-throttle, fail the startup probe, and get SIGKILL'd by kubelet, even though the actual workload has plenty of headroom in steady state. We hit this in production: PAY-SPACE/crypto#852 dropped the floor to 50m and tripped a `CrashLoopBackOff` on web-app that took a hotfix (#853) to revert.

With `controlledValues: RequestsOnly`, VPA rewrites only requests at admission and leaves the deployment template's limits alone, so cold-start bursts can still use the (higher) template limit.

2. Move `controlledResources` into `containerPolicy`

Per the VPA CRD, `controlledResources` is a per-container field: it lives at `resourcePolicy.containerPolicies[*].controlledResources`, not at `resourcePolicy.controlledResources`. Before this commit SC wrote it at the wrong nesting level; k8s silently dropped it on admission.

Verified by:
- `kubectl explain vpa.spec.resourcePolicy.containerPolicies.controlledResources`
- reading a live VPA on a PAY-SPACE cluster that had `controlledResources` set in client.yaml but missing from the in-cluster spec

The new `ControlledValues` field is placed inside `containerPolicy` in the same fix so both knobs live where the CRD expects.

Usage after this lands

Same client.yaml shape as in the commit message above: `controlledResources` and `controlledValues` sit alongside `minAllowed` and `maxAllowed` under `cloudExtras.vpa`.
Why this matters operationally
The PAY-SPACE production cluster is currently at its 64-CPU global quota cap (Google declined the recent quota increase request). About 15 CPUs of that ceiling are tied up because we have to hold `minAllowed.cpu` at 250m across the org just to keep VPA from shrinking limits below the cold-start threshold. Once consumers can flip `controlledValues: RequestsOnly`, the floor drops to 50m everywhere and we get ~15 CPU back without any cold-start risk.

Compatibility
- `ControlledResources` consumers who relied on the old (broken) behavior now actually get what they wrote.

Test plan
- `go build ./...`
- `go test ./pkg/clouds/pulumi/kubernetes/... -count=1`: passes
- `go test ./pkg/api/... -count=1 -run TestStackConfigCompose_Copy`: extended `VPA_configuration_in_CloudExtras` case now covers `controlledValues` round-trip
- […] `scripts/bump-sc-version.sh`), then one client.yaml PR per repo flipping `controlledValues: RequestsOnly`. Verify VPA's `spec.resourcePolicy.containerPolicies[0]` shows both `controlledResources` and `controlledValues` populated on the cluster.

Refs: PAY-SPACE/crypto#852, PAY-SPACE/crypto#853 (hotfix that established the 250m floor workaround); PAY-SPACE org-wide rightsize ClickUp 86exmtq3v