[core] Fix placement group GPU assignment bug #15049

Merged
merged 4 commits into from Apr 2, 2021

Contributor

wuisawesome commented Mar 31, 2021

Why are these changes needed?

When the new scheduler was implemented, we introduced a bug by assuming that all dynamic resources (including placement group resources) were non-unit resources. This assumption was fine for everything except GPUs. This PR fixes that by allowing dynamic, unit-resources.
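The distinction between unit and non-unit resources can be illustrated with a minimal sketch. This is not the code from this PR: `InstanceCapacities`, `BaseResourceName`, `IsUnitInstanceResource`, and `MakeInstances` are hypothetical names, and the `_group_` naming convention for placement group resources is an assumption made for illustration. The idea is that a unit resource such as GPU is tracked as one slot per device, so a dynamically created resource derived from GPU needs the same per-device layout rather than a single aggregate number.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-resource capacity record (not Ray's type).
struct InstanceCapacities {
  std::vector<double> total;
  std::vector<double> available;
};

// Assumption: a placement group resource is named "<base>_group_<suffix>".
// Returns the base name, or the name unchanged when no such marker exists.
std::string BaseResourceName(const std::string &name) {
  const std::string marker = "_group_";
  const std::size_t pos = name.find(marker);
  return pos == std::string::npos ? name : name.substr(0, pos);
}

// Assumption: GPU is the only unit resource in this sketch.
bool IsUnitInstanceResource(const std::string &name) {
  return BaseResourceName(name) == "GPU";
}

// Builds the instance layout for a resource. The sketch assumes that unit
// resources are given as whole numbers.
InstanceCapacities MakeInstances(const std::string &name, double quantity) {
  InstanceCapacities caps;
  if (IsUnitInstanceResource(name)) {
    // Unit resource: one slot of capacity 1 per device.
    caps.total.assign(static_cast<std::size_t>(quantity), 1.0);
  } else {
    // Non-unit resource: a single slot holding the whole quantity.
    caps.total.assign(1, quantity);
  }
  caps.available = caps.total;
  return caps;
}
```

Under these assumptions, `MakeInstances("GPU_group_abc", 2)` yields two slots of capacity 1, while `MakeInstances("custom_group_abc", 2)` yields one slot of capacity 2. Treating the first case like the second is the kind of mistake the description refers to.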

A lot of this came from @clay4444 (especially the test cases!) in #14891

Related issue number

#14759

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

src/ray/raylet/scheduling/cluster_resource_data.cc (outdated, resolved)
src/ray/raylet/scheduling/cluster_resource_scheduler.cc (outdated, resolved)
ResourceInstanceCapacities *node_instances;
local_resources_.predefined_resources.resize(PredefinedResources_MAX);
if (kCPU_ResourceLabel == resource_name) {
node_instances = &local_resources_.predefined_resources[CPU];
Contributor

Why do you use &local_resources_. here? (Why do you use & in the beginning of this attribute?)

Contributor Author

To get a pointer. In an ideal world, C++ would let us declare ResourceInstanceCapacities &node_instances and then initialize it in the if statements, but this is the next best thing.
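The point about references can be shown with a small self-contained sketch. The types and names here (`Capacities`, `LocalResources`, `PredefinedIndex`, `SelectWithPointer`, `SelectWithReference`) are hypothetical and only mirror the shape of the snippet under review. A C++ reference must be bound where it is declared, so it cannot be declared first and assigned in later branches; a pointer can. One possible alternative, not something this PR does, is to bind a reference from an immediately invoked lambda.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins that mirror the shape of the reviewed snippet.
struct Capacities {
  std::vector<double> total;
};

enum PredefinedIndex { kCpuIndex = 0, kGpuIndex = 1, kPredefinedMax = 2 };

struct LocalResources {
  std::vector<Capacities> predefined;  // indexed by PredefinedIndex
  Capacities custom;
};

// Pointer version: declare first, then choose the target in the branches.
// Writing `Capacities &selected;` here would not compile, because a
// reference has to be bound at its declaration.
Capacities *SelectWithPointer(LocalResources &local, const std::string &name) {
  Capacities *selected;
  if (name == "CPU") {
    selected = &local.predefined[kCpuIndex];
  } else if (name == "GPU") {
    selected = &local.predefined[kGpuIndex];
  } else {
    selected = &local.custom;
  }
  return selected;
}

// Alternative: bind a reference once, using an immediately invoked lambda
// to hold the branching logic.
Capacities &SelectWithReference(LocalResources &local, const std::string &name) {
  Capacities &selected = [&]() -> Capacities & {
    if (name == "CPU") return local.predefined[kCpuIndex];
    if (name == "GPU") return local.predefined[kGpuIndex];
    return local.custom;
  }();
  return selected;
}
```

Both functions select the same object. The pointer form keeps the branching inline at the cost of a nullable, reseatable handle, while the lambda form keeps a reference but moves the branches into a closure.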

Contributor

ericl commented Mar 31, 2021

ASAN test failures (use after free)

ericl added the @author-action-required label Mar 31, 2021
rkooo567 added the release-blocker label Apr 1, 2021
Contributor

amogkam commented Apr 1, 2021

@wuisawesome can you add this to the release blocker spreadsheet please (along with ETA and any tests that need to be re-run).

@wuisawesome
Contributor Author

The small+large test failure is a CI bug; it looks fine in Buildkite. Merging now.

cc @amogkam

Labels
@author-action-required: The PR author is responsible for the next step. Remove tag to send back to the reviewer.
release-blocker: P0 issue that blocks the release

5 participants