Open
Description
The relevant code is in cluster-autoscaler-1.26.6.

Assume nvdp (the NVIDIA device plugin) cannot start, for example because its image is not found. When a GPU node then joins the nodegroup, p.nodeInfoCache caches a node without nvidia.com/gpu. If a scale-down to 0 is triggered at this moment, the cache item still exists in cluster-autoscaler.
When the next scale-up is triggered, even though nvdp is now healthy, the stale cache item prevents the scale-up. Describing the pending pod shows:
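
The caching behaviour reported above can be modelled with a small toy program. This is only a sketch of the suspected mechanism, not the actual cluster-autoscaler code: `templateCache`, `observe`, `canScaleUp`, the `resources` type, and the group name `gpu-group` are all hypothetical names invented for illustration. It assumes the cache keeps one entry per nodegroup and that nothing evicts the entry when the group shrinks to 0.

```go
package main

import "fmt"

// gpuResource is the extended resource name mentioned in the report.
const gpuResource = "nvidia.com/gpu"

// resources maps a resource name to a quantity (hypothetical, simplified).
type resources map[string]int64

// templateCache is a hypothetical stand-in for a per-nodegroup node cache.
// It is NOT the real cluster-autoscaler type.
type templateCache struct {
	byGroup map[string]resources
}

func newTemplateCache() *templateCache {
	return &templateCache{byGroup: make(map[string]resources)}
}

// observe stores the allocatable resources of a live node as the
// template for its nodegroup, overwriting any previous entry.
func (c *templateCache) observe(group string, allocatable resources) {
	c.byGroup[group] = allocatable
}

// canScaleUp reports whether a new node built from the cached template
// would satisfy every requested resource.
func (c *templateCache) canScaleUp(group string, request resources) bool {
	tmpl, ok := c.byGroup[group]
	if !ok {
		return false
	}
	for name, want := range request {
		if tmpl[name] < want { // a missing resource reads as 0
			return false
		}
	}
	return true
}

func main() {
	cache := newTemplateCache()

	// The device plugin is down, so the node registers without GPUs.
	cache.observe("gpu-group", resources{"cpu": 8})

	// The group is scaled down to 0 here; in this model nothing evicts
	// the entry, so the GPU-less template stays in the cache.

	pod := resources{gpuResource: 1}
	fmt.Println(cache.canScaleUp("gpu-group", pod)) // false

	// For contrast: a template that includes the GPU resource fits the pod.
	cache.observe("gpu-group", resources{"cpu": 8, gpuResource: 1})
	fmt.Println(cache.canScaleUp("gpu-group", pod)) // true
}
```

In this model the first check fails only because the cached entry was recorded while the plugin was unavailable, which matches the symptom described: with the group at 0 there is no live node to refresh the entry.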

Activity
Title changed from "gpu ndoegroup may cant trigger scale-up from 0" to "gpu nodegroup may cant trigger scale-up from 0"
suqinglee commented on May 13, 2025
This may be the same as #5278.
adrianmoisey commented on May 13, 2025
/area cluster-autoscaler
chansuke commented on May 13, 2025
/assign