New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Fix placement group GPU assignment bug #15049
Conversation
ResourceInstanceCapacities *node_instances; | ||
local_resources_.predefined_resources.resize(PredefinedResources_MAX); | ||
if (kCPU_ResourceLabel == resource_name) { | ||
node_instances = &local_resources_.predefined_resources[CPU]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use &local_resources_.
here? (Why do you use & in the beginning of this attribute?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To get a pointer. In an ideal world, c++ would let us do ResourceInstanceCapacities &node_instances
then initialize it in the if statements, but this is the next best thing.
ASAN test failures (use after free) |
@wuisawesome can you add this to the release blocker spreadsheet please (along with ETA and any tests that need to be re-run). |
Small+large test failure is a CI bug. Looks fine in buildkite. Merging now. cc @amogkam |
Why are these changes needed?
When the new scheduler was implemented, we introduced a bug by assuming that all dynamic resources (including placement group resources) were non-unit resources. This assumption was fine for everything except GPUs. This PR fixes that by allowing dynamic, unit-resources.
A lot of this came from @clay4444 (especially the test cases!) in #14891
Related issue number
#14759
Checks
scripts/format.sh
to lint the changes in this PR.