Description
Observed Behavior:
Since upgrading from v1.4.0 to v1.5.0, we see increased scale-up latency, with pods frequently stuck in Pending for over 10 minutes. The disruption controller repeatedly logs the following panic:
{"level":"ERROR","time":"2025-06-17T22:06:16.876Z","logger":"controller","message":"Observed a panic","controller":"disruption","namespace":"","name":"","reconcileID":"c34c94ca-8cac-4913-87be-8fd86357e1c6","panic":"attempted to over-reserve an offering with reservation id \"cr-09ac5c8da40022bfa\"","stacktrace":"goroutine 494 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x4773930, 0xc07b6a65a0}, {0x3a59c40, 0xc075ca9880})\n\tk8s.io/apimachinery@v0.32.3/pkg/util/runtime/runtime.go:107 +0xbc\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile.func1()\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:108 +0x112\npanic({0x3a59c40?, 0xc075ca9880?})\n\truntime/panic.go:792 +0x132\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*ReservationManager).Reserve(0xc0484ed810, {0xc08e458740, 0x1b}, {0xc0873eafb8, 0x1, 0xc06ab36ba0?})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/reservationmanager.go:76 +0x285\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*NodeClaim).Add(0xc07a87f088, 0xc024dd5208, 0xc04a197398?, 0xc06ab36ba0, {0xc0873eafa8?, 0x1?, 0x1?}, {0xc0873eafb8, 0x1, 0x1})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/nodeclaim.go:171 +0x353\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*Scheduler).addToInflightNode(0xc0065801c0, {0x4773930, 0xc08666b140}, 0xc024dd5208)\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/scheduler.go:528 +0x25f\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*Scheduler).add(0xc0065801c0, {0x4773930, 0xc08666b140}, 0xc024dd5208)\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/scheduler.go:453 +0x8d\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*Scheduler).trySchedule(0xc0065801c0, {0x4773930, 0xc08666b140}, 0xc024dd5208)\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/scheduler.go:401 +0x90\nsigs.k8s.io/karpenter/pkg/controllers/provisioning/scheduling.(*Scheduler).Solve(0xc0065801c0, {0x4773930, 0xc08666b140}, {0xc0abc38dc8, 0x39, 0x57})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/provisioning/scheduling/scheduler.go:368 +0x8ef\nsigs.k8s.io/karpenter/pkg/controllers/disruption.SimulateScheduling({0x47739d8, 0xc01bc7ad90}, {0x4788480, 0xc0007da870}, 0xc0002638c8, 0xc000a7ac00, {0xc01bd6a008, 0xb, 0xc09ab3ef60?})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/helpers.go:115 +0x151d\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*consolidation).computeConsolidation(0xc0007aa980, {0x47739d8, 0xc01bc7ad90}, {0xc01bd6a008, 0xb, 0x1fb})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/consolidation.go:137 +0xe5\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*MultiNodeConsolidation).firstNConsolidationOption(0xc0007aa980, {0x4773930, 0xc07b6a6600}, {0xc01bd6a008, 0x15, 0x1fb}, 0x15)\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/multinodeconsolidation.go:136 +0x2e5\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*MultiNodeConsolidation).ComputeCommand(0xc0007aa980, {0x4773930, 0xc07b6a6600}, 0xc05a603e00, {0xc0577d8608, 0x1fb, 
0x318})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/multinodeconsolidation.go:88 +0x3eb\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*Controller).disrupt(0xc0007aaa80, {0x4773930, 0xc07b6a6600}, {0x47769c0, 0xc0007aa980})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/controller.go:184 +0x5dd\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*Controller).Reconcile(0xc0007aaa80, {0x4773930, 0xc07b6a65a0})\n\tsigs.k8s.io/karpenter@v1.4.1-0.20250523044835-349487633193/pkg/controllers/disruption/controller.go:146 +0x885\nsigs.k8s.io/karpenter/pkg/controllers/disruption.(*Controller).Register.AsReconciler.func1({0x4773930?, 0xc07b6a65a0?}, {{{0x0?, 0x0?}, {0x416fa5b?, 0x5?}}})\n\tgithub.com/awslabs/operatorpkg@v0.0.0-20250425180727-b22281cd8057/singleton/controller.go:26 +0x2f\nsigs.k8s.io/controller-runtime/pkg/reconcile.TypedFunc[...].Reconcile(...)\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/reconcile/reconcile.go:124\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile(0xc0a7bc3f20?, {0x4773930?, 0xc07b6a65a0?}, {{{0x0?, 0x0?}, {0x0?, 0x0?}}})\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119 +0xbf\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler(0x479f340, {0x4773968, 0xc0000495e0}, {{{0x0, 0x0}, {0x0, 0x0}}}, 0x0)\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334 +0x3ad\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem(0x479f340, {0x4773968, 0xc0000495e0})\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294 +0x21b\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2()\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255 +0x85\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2 in goroutine 303\n\tsigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:251 +0x6e8\n"}
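For context, here is a minimal, hypothetical sketch of the failure mode suggested by the panic message and stack trace. This is not Karpenter's actual code; the types, fields, and the assumption that a later consolidation simulation re-reserves an offering that was already fully consumed are our own illustration.

```go
// Hypothetical, simplified illustration of the over-reserve panic -- not
// Karpenter's implementation. It assumes remaining capacity is tracked per
// reservation ID and that reserving past zero panics, matching the message
// "attempted to over-reserve an offering with reservation id ...".
package main

import "fmt"

// reservationManager tracks how many instances remain in each capacity reservation.
type reservationManager struct {
	remaining map[string]int // reservation ID -> unreserved capacity
}

// reserve consumes one slot from the reservation and panics on over-reservation.
func (r *reservationManager) reserve(id string) {
	if r.remaining[id] == 0 {
		panic(fmt.Sprintf("attempted to over-reserve an offering with reservation id %q", id))
	}
	r.remaining[id]--
}

func main() {
	rm := &reservationManager{remaining: map[string]int{"cr-09ac5c8da40022bfa": 1}}
	// An initial provisioning pass reserves the last remaining slot.
	rm.reserve("cr-09ac5c8da40022bfa")
	// A later simulation that does not account for the earlier reservation
	// tries to reserve the same offering again and panics.
	rm.reserve("cr-09ac5c8da40022bfa")
}
```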
Expected Behavior:
No regression in pod scheduling latency.
Reproduction Steps (Please include YAML):
We see this in a production environment when a new ReplicaSet with 500+ pods is rolled out on a cluster using partially fulfilled reserved capacity (~20% of capacity). Reverting to v1.4.0 resolves the issue. An illustrative approximation of the setup is shown below.
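The manifests below are an approximation rather than our exact production YAML; all names, the reservation ID, image, and sizes are placeholders, and the capacity-reservation fields reflect our understanding of the v1.5 reserved-capacity support.

```yaml
# Illustrative approximation of the setup (placeholders, not the real manifests).
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: reserved-example
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  capacityReservationSelectorTerms:
    - id: cr-xxxxxxxxxxxxxxxxx   # partially fulfilled capacity reservation (placeholder ID)
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: reserved-example
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: reserved-example
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["reserved", "on-demand"]
---
# Large rollout that triggers the scale-up: a new ReplicaSet with 500+ pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: big-rollout
spec:
  replicas: 500
  selector:
    matchLabels:
      app: big-rollout
  template:
    metadata:
      labels:
        app: big-rollout
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```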
Versions:
- Chart Version: v1.5.0
- Kubernetes Version (kubectl version): v1.29.15