gpu nodegroup may cant trigger scale-up from 0 #8123

Open
@suqinglee

Description

@suqinglee

Look at this code (cluster-autoscaler-1.26.6):

Image

Assume the NVIDIA device plugin (nvdp) cannot start, for example because its image is not found. When a GPU node then joins the node group, p.nodeInfoCache caches a node without nvidia.com/gpu. If the node group is scaled down to 0 at this moment, this cache item still exists in cluster-autoscaler.

When the next scale-up is triggered, even though nvdp is now healthy, the stale cache item prevents the scale-up. Describing the pending pod shows:

Image
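
The sequence described above can be illustrated with a minimal, self-contained sketch. It is not the cluster-autoscaler code: `templateCache`, `observe`, and `fits` are hypothetical names that stand in for a per-node-group cache such as p.nodeInfoCache, and treating a missing entry as "does not fit" is a simplification made only for this example.

```go
package main

import "fmt"

// gpuResource is the extended resource name from the report above.
const gpuResource = "nvidia.com/gpu"

// nodeTemplate is a hypothetical, simplified stand-in for a cached node info.
// Only allocatable resources matter for this sketch.
type nodeTemplate struct {
	allocatable map[string]int64
}

// templateCache is a hypothetical stand-in for a per-node-group cache.
// It is an assumption for illustration, not the real implementation.
type templateCache struct {
	byGroup map[string]nodeTemplate
}

func newTemplateCache() *templateCache {
	return &templateCache{byGroup: make(map[string]nodeTemplate)}
}

// observe stores a copy of a live node's allocatable resources for the group,
// overwriting any older entry.
func (c *templateCache) observe(group string, allocatable map[string]int64) {
	copied := make(map[string]int64, len(allocatable))
	for name, qty := range allocatable {
		copied[name] = qty
	}
	c.byGroup[group] = nodeTemplate{allocatable: copied}
}

// fits reports whether the cached template for the group satisfies every
// request. A missing entry is treated as "does not fit" (sketch simplification).
func (c *templateCache) fits(group string, requests map[string]int64) bool {
	tmpl, ok := c.byGroup[group]
	if !ok {
		return false
	}
	for name, want := range requests {
		if tmpl.allocatable[name] < want {
			return false
		}
	}
	return true
}

func main() {
	cache := newTemplateCache()
	gpuPod := map[string]int64{"cpu": 1, gpuResource: 1}

	// 1. A GPU node joins while the device plugin is down, so the node
	//    advertises no nvidia.com/gpu and that view is cached.
	cache.observe("gpu-group", map[string]int64{"cpu": 8})

	// 2. The group scales down to 0. Nothing in this sketch evicts the entry.

	// 3. The device plugin is healthy again, but the check still uses the
	//    stale entry, so a GPU pod does not fit.
	fmt.Println("fits with stale entry:", cache.fits("gpu-group", gpuPod))

	// 4. For contrast: had the cached node advertised the GPU, the same
	//    check would pass.
	cache.observe("gpu-group", map[string]int64{"cpu": 8, gpuResource: 1})
	fmt.Println("fits with GPU entry:", cache.fits("gpu-group", gpuPod))
}
```

In this sketch the only way to correct the cached view is to observe a node that advertises the GPU, which cannot happen while the group stays at 0 nodes. That is the deadlock this report describes.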

Activity

changed the title from "gpu ndoegroup may cant trigger scale-up from 0" to "gpu nodegroup may cant trigger scale-up from 0" on May 13, 2025

suqinglee commented on May 13, 2025

@suqinglee
Author

This may be the same as #5278.

adrianmoisey commented on May 13, 2025

@adrianmoisey
Member

/area cluster-autoscaler

chansuke commented on May 13, 2025

@chansuke
Member

/assign

linked a pull request that will close this issue on May 18, 2025

Participants

@chansuke, @adrianmoisey, @k8s-ci-robot, @suqinglee

      gpu nodegroup may cant trigger scale-up from 0 · Issue #8123 · kubernetes/autoscaler