Skip to content

feat(k8s/vpa): expose controlledValues + fix controlledResources nesting#280

Merged
Cre-eD merged 3 commits into
mainfrom
fix/vpa-controlled-values
May 20, 2026
Merged

feat(k8s/vpa): expose controlledValues + fix controlledResources nesting#280
Cre-eD merged 3 commits into
mainfrom
fix/vpa-controlled-values

Conversation

@Cre-eD
Copy link
Copy Markdown
Contributor

@Cre-eD Cre-eD commented May 20, 2026

Summary

Two related fixes to cloudExtras.vpa generation, found while rightsizing the PAY-SPACE org's GKE Autopilot clusters.

1. Add ControlledValues to VPAConfig

K8s VPA's controlledValues knob (RequestsAndLimits default, RequestsOnly optional) is currently not exposed by SC. Without it, VPA always scales the CPU limit proportionally with the request.

In practice this means lowering minAllowed.cpu below ~250m shrinks the container's CPU limit far enough that Django/gunicorn-style cold starts CPU-throttle, fail the startup probe, and get SIGKILL'd by kubelet — even though the actual workload has plenty of headroom in steady state. We hit this in production: PAY-SPACE/crypto#852 dropped the floor to 50m and tripped a CrashLoopBackOff on web-app that took a hotfix (#853) to revert.

With controlledValues: RequestsOnly, VPA rewrites only requests at admission and leaves the deployment template's limits alone, so cold-start bursts can still use the (higher) template limit.

2. Move controlledResources into containerPolicy

Per the VPA CRD, controlledResources is a per-container field — it lives at resourcePolicy.containerPolicies[*].controlledResources, not at resourcePolicy.controlledResources. Before this commit SC wrote it at the wrong nesting level; k8s silently dropped it on admission.

Verified by:

$ kubectl explain vpa.spec.resourcePolicy.containerPolicies.controlledResources
FIELD: controlledResources <[]string>
   Specifies the type of recommendations that will be computed (and possibly
   applied) by VPA. If not specified, the default of [ResourceCPU, ResourceMemory]
   will be used.

# Live VPA on a PAY-SPACE cluster — the client.yaml had `controlledResources: [cpu, memory]`
# but the spec on the cluster has no such field anywhere:
$ kubectl -n landing get vpa landing-vpa -o yaml | yq '.spec'
resourcePolicy:
  containerPolicies:
  - containerName: '*'
    maxAllowed: { cpu: 2, memory: 4Gi }
    minAllowed: { cpu: 500m, memory: 1.5Gi }
updatePolicy:
  updateMode: Auto

The new ControlledValues field is placed inside containerPolicy in the same fix so both knobs live where the CRD expects.

Usage after this lands

cloudExtras:
  vpa:
    enabled: true
    updateMode: "Auto"
    minAllowed: { cpu: "50m", memory: "64Mi" }     # back to the real Autopilot floor
    maxAllowed: { cpu: "2", memory: "4Gi" }
    controlledResources: ["cpu", "memory"]
    controlledValues: "RequestsOnly"                # NEW — don't touch limits

Why this matters operationally

The PAY-SPACE production cluster is currently at 80 of a 64-CPU global quota (Google declined the recent quota increase request). About 15 CPUs of that ceiling are tied up because we have to hold minAllowed.cpu at 250m across the org just to keep VPA from shrinking limits below the cold-start threshold. Once consumers can flip controlledValues: RequestsOnly, the floor drops to 50m everywhere and we get ~15 CPU back without any cold-start risk.

Compatibility

  • Additive struct field + a containerPolicy reshuffle that previously did nothing (k8s dropped the misplaced field). No client.yaml will break.
  • ControlledResources consumers who relied on the old (broken) behavior now actually get what they wrote.

Test plan

  • go build ./...
  • go test ./pkg/clouds/pulumi/kubernetes/... -count=1 — passes
  • go test ./pkg/api/... -count=1 -run TestStackConfigCompose_Copy — extended VPA_configuration_in_CloudExtras case now covers controlledValues round-trip
  • After merge: bump SC in PAY-SPACE wrappers (scripts/bump-sc-version.sh), then one client.yaml PR per repo flipping controlledValues: RequestsOnly. Verify VPA's spec.resourcePolicy.containerPolicies[0] shows both controlledResources and controlledValues populated on the cluster.

Refs: PAY-SPACE/crypto#852, PAY-SPACE/crypto#853 (hotfix that established the 250m floor workaround); PAY-SPACE org-wide rightsize ClickUp 86exmtq3v

Two related fixes to cloudExtras.vpa generation, found while
rightsizing the PAY-SPACE org's GKE Autopilot clusters.

1. Add ControlledValues field to VPAConfig.

   K8s VPA's controlledValues knob ("RequestsAndLimits" default,
   "RequestsOnly" optional) is currently not exposed by SC. Without
   it, VPA always scales the CPU limit proportionally with the
   request. Lowering minAllowed.cpu below ~250m therefore shrinks the
   container's CPU limit far enough that Django/gunicorn-style cold
   starts CPU-throttle, fail the startup probe, and get SIGKILL'd by
   kubelet — even though the actual workload has plenty of headroom
   in steady state.

   With controlledValues: RequestsOnly, VPA rewrites only requests
   at admission and leaves the deployment template's limits alone,
   so cold-start bursts use the (higher) template limit.

2. Move controlledResources from resourcePolicy into the
   containerPolicy entry.

   Per the VPA CRD (autoscaling.k8s.io/v1), controlledResources is
   a per-container field — it lives at
   resourcePolicy.containerPolicies[*].controlledResources, not at
   resourcePolicy.controlledResources. Before this commit SC wrote
   it at the wrong nesting level; k8s silently dropped it on
   admission. Verified by `kubectl explain
   vpa.spec.resourcePolicy.containerPolicies.controlledResources`
   and by reading a live VPA on a PAY-SPACE cluster that had
   controlledResources set in client.yaml but missing from the
   in-cluster spec.

   The new ControlledValues field is placed inside containerPolicy
   in the same fix.

No schema break — these are additive struct fields and a
containerPolicy reshuffle that previously did nothing. Existing
tests in TestStackConfigCompose_Copy/VPA_configuration_in_CloudExtras
extended to cover controlledValues round-trip.

Usage in client.yaml after this lands:

    cloudExtras:
      vpa:
        enabled: true
        updateMode: "Auto"
        minAllowed: { cpu: "50m", memory: "64Mi" }
        maxAllowed: { cpu: "2", memory: "4Gi" }
        controlledResources: ["cpu", "memory"]
        controlledValues: "RequestsOnly"

This unblocks PAY-SPACE/crypto#853's hotfix pattern: the 250m CPU
floor we currently hold across the org to avoid the proportional
shrink can drop to 50m once consumers adopt controlledValues:
RequestsOnly. Frees ~15 CPU on the production cluster, which is
currently at the 64-CPU global quota cap.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD requested a review from smecsia as a code owner May 20, 2026 12:38
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Semgrep Scan Results

Repository: api | Commit: ce5311b

Check Status Details
⚠️ Semgrep Warning 10 warning(s), 10 total

Scanned at 2026-05-20 19:23 UTC

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Security Scan Results

Repository: api | Commit: ce5311b

Check Status Details
✅ Secret Scan Pass No secrets detected
✅ Dependencies (Trivy) Pass 0 total (no critical/high)
✅ Dependencies (Grype) Pass 0 total (no critical/high)
📦 SBOM Generated 528 components (CycloneDX)

Scanned at 2026-05-20 19:23 UTC

@Cre-eD
Copy link
Copy Markdown
Contributor Author

Cre-eD commented May 20, 2026

Reviewed by Codex + Gemini (independent runs).

Codex

The changes correctly add controlledValues support and move VPA controlledResources into the per-container policy where the VPA CRD expects it. The relevant package tests pass, and I did not find a discrete regression introduced by the patch.

No findings.

Gemini (3 actionable)

  1. P1 backward-compat — addressed in this comment. Anyone who set controlledResources to a non-default value (e.g. ["cpu"]) on the old code had it silently dropped by k8s; after this PR it takes effect, which could change VPA recommendations for those configs. Spot-check across PAY-SPACE: every existing client.yaml that sets the field uses ["cpu", "memory"] (= the K8s default), so behavior is unchanged for our consumers. External users with a non-default value should validate after upgrade.

    Suggested release note:

    cloudExtras.vpa.controlledResources is now correctly applied at the per-container level (where the VPA CRD expects it). Previously it was silently dropped by Kubernetes. Configurations that explicitly set this to a non-default value (e.g. ["cpu"]) will now scope VPA recommendations accordingly — verify before upgrading.

  2. P2 missing generation test — addressed in follow-up commit. Added TestNewSimpleContainer_WithVPA_ControlledValues that exercises the full VPA surface (minAllowed, maxAllowed, controlledResources, controlledValues) through the Pulumi resource lifecycle. The exact CRD shape (controlledResources + controlledValues inside containerPolicy) is locked in by createVPA and verified manually against kubectl explain vpa.spec.resourcePolicy.containerPolicies.

  3. Docs drift — flagged separately. Running cmd/schema-gen docs/schemas regenerates the schemas but also wipes ephemeralVolumes and several unrelated fields from kubernetescloudextras.json — the generator has drifted from the codebase well beyond this PR's scope. Better fixed in a dedicated schema-regen PR than smuggled in here.

Style notes (not addressed)

  • Gemini: lo.FromPtr(args.VPA.ControlledValues) inside a nil-check is "redundant".
  • Counterpoint: the surrounding createVPA uses lo.FromPtr everywhere (CPU, Memory, EphemeralStorage min/max). Kept the existing pattern for consistency.

Pass-through smoke test that constructs a SimpleContainer with VPA
configured for the full surface (minAllowed, maxAllowed,
controlledResources, controlledValues) and verifies the resource
creation succeeds without error. Pairs with the existing
TestNewSimpleContainer_WithVPA, which only exercises the minimal
enabled+updateMode shape.

The exact in-cluster VPA spec shape (controlledResources and
controlledValues living inside containerPolicy rather than at
resourcePolicy level) is asserted by reading and trusting createVPA
in simple_container.go — verified against the live K8s VPA CRD via
kubectl explain.

Addresses the Gemini review feedback on the parent commit: prior to
this, no kubernetes-package test exercised the new ControlledValues
code path at all.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD merged commit 903150b into main May 20, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants