GPU node group may fail to trigger scale-up from 0

The relevant code (cluster-autoscaler-1.26.6):

<img width="896" alt="Image" src="https://github.com/user-attachments/assets/9c15f013-27c4-4033-9889-58f9bc989f80" />

Assume the NVIDIA device plugin (nvdp) cannot start, for example because its image is not found. When a GPU node then joins the node group, `p.nodeInfoCache` caches a node that does not advertise `nvidia.com/gpu`. If the node group is scaled down to 0 at this point, the cache entry still exists in cluster-autoscaler.

On the next scale-up attempt, even though nvdp is now healthy, the stale cache entry prevents the scale-up from being triggered. Describing the pending pod shows:

<img width="1254" alt="Image" src="https://github.com/user-attachments/assets/5ffa180d-7c66-4455-b853-a8b24034ca60" />
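To make the sequence above concrete, here is a minimal, self-contained Go sketch of the suspected behavior. It is a simplified model, not the cluster-autoscaler source: `templateCache`, `observe`, `templateFor`, `fits`, and the group name `gpu-group` are all hypothetical names invented for illustration, and the lookup order (cached entry first, cloud provider template second) is an assumption based on the description in this issue.

```go
package main

import "fmt"

// Resources maps a resource name (e.g. "nvidia.com/gpu") to an allocatable quantity.
type Resources map[string]int64

// templateCache is a hypothetical stand-in for the per-node-group cache that
// this issue refers to as p.nodeInfoCache. It is not the real type.
type templateCache struct {
	byGroup map[string]Resources
}

func newTemplateCache() *templateCache {
	return &templateCache{byGroup: map[string]Resources{}}
}

// observe stores a copy of a live node's allocatable resources as the
// template for its node group.
func (c *templateCache) observe(group string, allocatable Resources) {
	copied := Resources{}
	for name, qty := range allocatable {
		copied[name] = qty
	}
	c.byGroup[group] = copied
}

// templateFor returns the cached template when one exists and otherwise falls
// back to the template advertised by the cloud provider. Assumption: nothing
// invalidates the entry when the group scales down to 0.
func (c *templateCache) templateFor(group string, providerTemplate Resources) Resources {
	if cached, ok := c.byGroup[group]; ok {
		return cached
	}
	return providerTemplate
}

// fits reports whether the template offers every resource the pod requests.
func fits(request, template Resources) bool {
	for name, want := range request {
		if template[name] < want {
			return false
		}
	}
	return true
}

// simulate compares an empty cache with a cache poisoned by a node that
// joined while the device plugin was broken.
func simulate() (freshFits, staleFits bool) {
	provider := Resources{"cpu": 8, "nvidia.com/gpu": 1}
	pod := Resources{"cpu": 1, "nvidia.com/gpu": 1}

	// No cache entry: the provider template is used and the GPU pod fits.
	fresh := newTemplateCache()
	freshFits = fits(pod, fresh.templateFor("gpu-group", provider))

	// A node joined without advertising nvidia.com/gpu and was cached.
	stale := newTemplateCache()
	stale.observe("gpu-group", Resources{"cpu": 8})
	// The group then scales down to 0, but the entry is still returned.
	staleFits = fits(pod, stale.templateFor("gpu-group", provider))
	return freshFits, staleFits
}

func main() {
	freshFits, staleFits := simulate()
	fmt.Println("scale-up possible without cache entry:", freshFits)
	fmt.Println("scale-up possible with stale cache entry:", staleFits)
}
```

In this model the pending GPU pod fits the provider template but not the stale cached one, which matches the symptom reported here: the node group is judged unable to host the pod even though a freshly provisioned node would advertise `nvidia.com/gpu`.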





GPU node group may fail to trigger scale-up from 0 #8123


suqinglee commented on May 13, 2025

adrianmoisey commented on May 13, 2025

chansuke commented on May 13, 2025
