[RLlib + Tune] Add placement group support to RLlib. #14289
Conversation
Generally looks great; let's just move the check a bit up, and then we don't need to build a proper resources object anymore.
python/ray/tune/trial.py
Outdated
counts = defaultdict(int)
for bundle in self.placement_group_factory._bundles:
    for device, c in bundle.items():
        counts[device.lower()] += c
custom_resources = {
    k: c
    for k, c in counts.items() if k not in ["cpu", "gpu"]
}
self.resources = Resources(
    cpu=counts["cpu"],
    gpu=counts["gpu"],
    custom_resources=custom_resources,
    has_placement_group=True,
)
We don't need this anymore - when using placement group factories, `trial.resources` is not used at all (`None` is passed and it is initialized as `Resources(cpu=1, gpu=0)`). Assuming that RLlib doesn't read `trial.resources`, we can remove this part.
I think we should move this to line 235, like this:

if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
else:
    # Set placement group factory to None for backwards compatibility
    placement_group_factory = None
    resources = default_resources

We can then leave the rest of the code as-is.
Also, in line 229, let's check if the placement group factory is set, too.
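As a standalone sanity check, the suggested branching can be sketched like this (the `PlacementGroupFactory` below is a hypothetical stand-in for Tune's real class, just enough to drive the `isinstance` check):

```python
class PlacementGroupFactory:
    """Hypothetical stand-in for ray.tune's PlacementGroupFactory."""

    def __init__(self, bundles):
        self._bundles = bundles


def split_default_resources(default_resources):
    # New way: Trainable returned a placement group factory.
    if isinstance(default_resources, PlacementGroupFactory):
        return default_resources, None
    # Set placement group factory to None for backwards compatibility.
    return None, default_resources


pgf, resources = split_default_resources(
    PlacementGroupFactory([{"CPU": 1, "GPU": 1}]))
print(pgf is not None, resources)  # True None
```

With this split done up front, the rest of the constructor only ever sees either a placement group factory or a legacy resources object, never both.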
Ah, cool, even simpler :) So we don't need `self.resources` at all anymore, then.
done
rllib/agents/impala/impala.py
Outdated
# from RolloutWorkers (n rollout workers map to m
# aggregation workers, where m < n).
"CPU": cf["num_cpus_for_driver"] +
cf["num_cpus_per_worker"] * cf["num_aggregation_workers"],
Wasn't this just one CPU per aggregation worker before?
you are right! Good catch.
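To make the accounting concrete, here is a sketch with illustrative numbers (the `cf` values are made up, not real IMPALA defaults), assuming one CPU per aggregation worker as the review comment suggests:

```python
# Illustrative trainer config values (hypothetical, not IMPALA defaults).
cf = {
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 2,
    "num_aggregation_workers": 2,
}

# Before the fix: each aggregation worker was charged num_cpus_per_worker.
cpus_as_posted = (cf["num_cpus_for_driver"] +
                  cf["num_cpus_per_worker"] * cf["num_aggregation_workers"])

# After the fix suggested above: one CPU per aggregation worker.
cpus_fixed = cf["num_cpus_for_driver"] + cf["num_aggregation_workers"]

print(cpus_as_posted, cpus_fixed)  # 5 3
```

With these example numbers, the original diff would have reserved two extra CPUs in the driver bundle.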
done
python/ray/tune/trial.py
Outdated
# If Trainable returns resources, do not allow manual overrid via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
there's a lot of special-casing here in this file diff -- can we please avoid this?
python/ray/tune/trial.py
Outdated
# If Trainable returns resources, do not allow manual overrid via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
        "Resources for {} have been automatically set to {} "
        "by its `default_resource_request()` method. Please "
        "clear the `resources_per_trial` option.".format(
            trainable_cls, default_resources))

# New way: Trainable returns a PlacementGroupFactory object.
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
# Set placement group factory to None for backwards compatibility.
else:
    placement_group_factory = None
Suggested change. Replace:

# If Trainable returns resources, do not allow manual overrid via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
        "Resources for {} have been automatically set to {} "
        "by its `default_resource_request()` method. Please "
        "clear the `resources_per_trial` option.".format(
            trainable_cls, default_resources))
# New way: Trainable returns a PlacementGroupFactory object.
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
# Set placement group factory to None for backwards compatibility.
else:
    placement_group_factory = None

with:

# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources:
    if resources or placement_group_factory:
        raise ValueError(
            "Resources for {} have been automatically set to {} "
            "by its `default_resource_request()` method. Please "
            "clear the `resources_per_trial` option.".format(
                trainable_cls, default_resources))
    # New way: Trainable returns a PlacementGroupFactory object.
    if isinstance(default_resources, PlacementGroupFactory):
        placement_group_factory = default_resources
        resources = None
    # Set placement group factory to None for backwards compatibility.
    else:
        placement_group_factory = None
        resources = default_resources
(And remove the line below this.)
We need to keep the indent here, and with the suggested change we keep the same logic as before.
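The override guard itself is easy to exercise standalone. A minimal sketch of the check from the suggestion above, with a hypothetical `trainable_cls` placeholder string:

```python
def guard_override(default_resources, resources, placement_group_factory,
                   trainable_cls="MyTrainable"):
    # If Trainable returns resources, do not allow a manual override
    # via `resources_per_trial` by the user.
    if default_resources and (resources or placement_group_factory):
        raise ValueError(
            "Resources for {} have been automatically set to {} "
            "by its `default_resource_request()` method. Please "
            "clear the `resources_per_trial` option.".format(
                trainable_cls, default_resources))


# No user override: the guard passes silently.
guard_override({"CPU": 1}, None, None)

# A user-supplied `resources_per_trial` alongside defaults raises.
caught = False
try:
    guard_override({"CPU": 1}, {"CPU": 2}, None)
except ValueError:
    caught = True
print(caught)  # True
```

Nesting the subsequent factory/resources split under `if default_resources:`, as suggested, keeps this guard and the split on the same code path.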
done
python/ray/tune/trial.py
Outdated
if isinstance(resources, PlacementGroupFactory):
    self.placement_group_factory = resources
else:
    self.placement_group_factory = placement_group_factory
Suggested change. Replace:

if isinstance(resources, PlacementGroupFactory):
    self.placement_group_factory = resources
else:
    self.placement_group_factory = placement_group_factory

with:

self.placement_group_factory = placement_group_factory
We don't need this block anymore (`placement_group_factory` is an argument of the constructor). With the changes above, `resources` will never hold anything other than `None` or a `Resources` object.
removed.
Looks great!
…ement_groups_support (# Conflicts: rllib/agents/trainer.py)
Revert "…ject#14289)". This reverts commit 6cd0cd3.
This PR adds support for placement groups to RLlib by:

Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.