Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Fix spread resource prefix tasks with no CPU requested. #22017

Conversation

clarkzinzow
Copy link
Contributor

When applying the _spread_resouce_prefix hack, don't make the CPU resource a required resource when num_cpus=0 is requested.

Related issue number

Closes #22003

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

resources={"foo": 100}, _system_config={"max_direct_call_object_size": 0}
)
cluster.add_node(resources={"bar:1": 100})
cluster.add_node(resources={"bar:2": 100}, num_cpus=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add GPU as well?

Copy link
Contributor Author

@clarkzinzow clarkzinzow Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter for this test. This test is ensuring that a task with the spread resource prefix and with num_cpus=0 is able to be placed on a node with 0 CPUs advertised, so it's orthogonal to whether or not the node has any GPUs.

@clarkzinzow
Copy link
Contributor Author

Note that this fixes the num_cpus=0 request issue with the spread_resource_prefix_ hack, but does not fix the underlying anti-affinity problem where GPU nodes advertise 0 CPUs to repel CPU workloads, yet we want to load the validation dataset on GPU nodes.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 31, 2022
@clarkzinzow clarkzinzow added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Feb 1, 2022
@clarkzinzow
Copy link
Contributor Author

Datasets tests ok, ready to merge.

@ericl ericl merged commit b3fd3c6 into ray-project:master Feb 1, 2022
scv119 pushed a commit that referenced this pull request Feb 9, 2022
…22017)

When applying the `_spread_resouce_prefix` hack, don't make the CPU resource a required resource when `num_cpus=0` is requested.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dataset][Bug] _spread_resource_prefix doesn't work with GPU nodes
3 participants