[RLlib] Re-enable placement group support for RLlib. #14384
Conversation
There have been some semantic changes we should address.
python/ray/tune/trial.py
Outdated
# Placement groups are force-disabled via env variable.
if int(os.getenv("TUNE_PLACEMENT_GROUP_AUTO_DISABLED", "0")):
    if self.placement_group_factory:
        self.resources = pg_factory_to_resources(
            self.placement_group_factory)
        self.placement_group_factory = None

# Placement groups are not disabled, but none is given.
# Produce one automatically from self.resources.
elif not self.placement_group_factory:
Ah, this semantically alters the previous behavior, where TUNE_PLACEMENT_GROUP_AUTO_DISABLED only disabled the automatic conversion of resources to placement groups but, importantly, still allowed manual opt-in to placement groups when they were passed in via resources_per_trial. See https://docs.ray.io/en/master/tune/user-guide.html#environment-variables
Generally, if a user passes a placement group factory, or provides a PGF as a default resource request, we would still want to use placement groups (otherwise they just shouldn't pass them).
If we want to enable users to use the old-style Resources for RLlib trainables, we should move the conversion of PGF to Resources into the default_resource_request() methods of the Trainer (and maybe use a different environment variable? That's up to you).
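To make the manual opt-in path concrete, here is a minimal sketch, assuming the PlacementGroupFactory API and import path available around the time of this PR; the trainable name, env, and bundle sizes are placeholders, not values from this PR:

import os

from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory

# Even with the automatic Resources -> placement-group conversion disabled,
# a user could previously still opt in to placement groups explicitly by
# passing a PlacementGroupFactory as `resources_per_trial`.
os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"

pgf = PlacementGroupFactory(
    # One bundle for the trainable itself plus two worker bundles.
    [{"CPU": 1, "GPU": 0}] + [{"CPU": 2}] * 2,
    strategy="PACK")

tune.run(
    "PPO",  # placeholder trainable
    config={"env": "CartPole-v0"},
    resources_per_trial=pgf)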
Makes sense. Thanks for the catch.
        counts[resource_type.lower()] += count
    num_cpus = counts.pop("cpu", 1)
    num_gpus = counts.pop("gpu", 0)
    return Resources(cpu=num_cpus, gpu=num_gpus, **counts)
This is probably too coarse. The cpu/gpu fields should have the CPUs/GPUs of the first bundle, and all other bundles' CPUs/GPUs should go into extra_cpu/extra_gpu. memory and object_store_memory (though deprecated) should be popped as well, and the remaining resources have to be passed in an extra-resources dict.
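For illustration, a rough sketch of that mapping, assuming `bundles` is the factory's list of resource dicts (first bundle = trial itself, remaining bundles = workers) and using the field names of the ray.tune.resources.Resources namedtuple from that release; this is not the final code of this PR:

from ray.tune.resources import Resources


def pg_factory_to_resources_sketch(bundles):
    # First bundle drives the trial itself -> cpu/gpu/memory fields.
    head = dict(bundles[0])
    cpu = head.pop("CPU", 1)
    gpu = head.pop("GPU", 0)
    memory = head.pop("memory", 0)
    object_store_memory = head.pop("object_store_memory", 0)

    # All remaining bundles are aggregated into the extra_* fields.
    extra_cpu = 0
    extra_gpu = 0
    extra_custom = {}
    for bundle in bundles[1:]:
        bundle = dict(bundle)
        extra_cpu += bundle.pop("CPU", 0)
        extra_gpu += bundle.pop("GPU", 0)
        bundle.pop("memory", 0)
        bundle.pop("object_store_memory", 0)
        for name, amount in bundle.items():
            extra_custom[name] = extra_custom.get(name, 0) + amount

    return Resources(
        cpu=cpu,
        gpu=gpu,
        memory=memory,
        object_store_memory=object_store_memory,
        extra_cpu=extra_cpu,
        extra_gpu=extra_gpu,
        custom_resources=head,  # leftover custom resources of the first bundle
        extra_custom_resources=extra_custom)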
done
Are we still using this?
extra_gpu=cf["num_gpus_per_worker"] * cf["num_workers"]) | ||
|
||
eval_config = cf["evaluation_config"] | ||
|
Instead of converting, we could just have an if/else clause here to decide between old-style Resources and new-style PGF returns.
But that would mean a new config parameter. Don't we want to deprecate Resources anyway?
So this is just to keep backwards compatibility. Once we deprecate Resources for good (in 1-2 releases), we will get rid of this.
Let's not use a config parameter here, but rather an environment variable. The default behavior should be to use placement groups, but people should be able to opt out. I think you can re-use TUNE_PLACEMENT_GROUP_AUTO_DISABLED for this. Setting it to 1 would then mean that RLlib's default resources do not use placement groups, and resource requests are not automatically converted. If someone deliberately sets this env variable to 1 and still wants to use PGs in some cases, they'll have to take care of that themselves. (I doubt that many people will want to mix anyway.)
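A sketch of that env-variable-driven opt-out, written as a hypothetical standalone helper rather than the actual Trainer.default_resource_request() classmethod; the config keys mirror RLlib's trainer config, and import paths assume the Ray version current at the time of this PR:

import os

from ray.tune.resources import Resources
from ray.tune.utils.placement_groups import PlacementGroupFactory


def default_resources_for(cf):
    # Opt-out: with TUNE_PLACEMENT_GROUP_AUTO_DISABLED=1, keep returning
    # old-style Resources and do not touch placement groups at all.
    if int(os.getenv("TUNE_PLACEMENT_GROUP_AUTO_DISABLED", "0")):
        return Resources(
            cpu=cf["num_cpus_for_driver"],
            gpu=cf["num_gpus"],
            extra_cpu=cf["num_cpus_per_worker"] * cf["num_workers"],
            extra_gpu=cf["num_gpus_per_worker"] * cf["num_workers"])

    # Default: one bundle for the driver/trainer plus one per rollout worker.
    return PlacementGroupFactory(
        [{
            "CPU": cf["num_cpus_for_driver"],
            "GPU": cf["num_gpus"],
        }] + [{
            "CPU": cf["num_cpus_per_worker"],
            "GPU": cf["num_gpus_per_worker"],
        }] * cf["num_workers"],
        strategy="PACK")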
gpu=cf["num_gpus"], | ||
extra_cpu=cf["num_cpus_per_worker"] * num_workers, | ||
extra_gpu=cf["num_gpus_per_worker"] * num_workers) | ||
# Return PlacementGroupFactory containing all needed resources |
(same here)
see above
python/ray/tune/trial.py
Outdated
## Placement group are force-disabled via env variable.
#if int(os.getenv("TUNE_PLACEMENT_GROUP_AUTO_DISABLED", "0")):
#    if self.placement_group_factory:
#        self.resources = pg_factory_to_resources(
#            self.placement_group_factory)
#        self.placement_group_factory = None

## Placement groups are not disabled, but none is given.
## Produce one automatically from self.resources.
#elif not self.placement_group_factory:
#    try:
#        self.placement_group_factory = resource_dict_to_pg_factory(
#            self.resources)
#    except ValueError as exc:
#        if log_always or log_once("tune_pg_extra_resources"):
#            logger.warning(exc)
#        self.placement_group_factory = None
Can we remove these comments?
        counts[resource_type.lower()] += count
    num_cpus = counts.pop("cpu", 1)
    num_gpus = counts.pop("gpu", 0)
    return Resources(cpu=num_cpus, gpu=num_gpus, **counts)
Are we still using this?
python/ray/tune/trial.py
Outdated
# If Trainable returns placement group factory, only use it
# if no manual `placement_group_factory` are has been provided
# by user.
Ah, so this means that users could overwrite default resource requests with their own placement groups. Is this what we want here? The old behavior was just to disallow any resources_per_trial when a default resource request exists. (Both options are sketched after the suggestion below.)
Suggested change:
# If Trainable returns placement group factory, only use it
# if no manual `placement_group_factory` has been provided
# by user.
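For the precedence question, a tiny sketch of the two policies under discussion (a hypothetical helper for illustration, not code from this PR):

def resolve_trial_resources(user_pgf, default_request, allow_override=True):
    # Policy A (this PR): a user-provided PlacementGroupFactory wins over
    # the Trainable's default resource request.
    if allow_override:
        return user_pgf if user_pgf is not None else default_request
    # Policy B (old behavior): reject any user-provided resources when the
    # Trainable declares a default resource request.
    if user_pgf is not None and default_request is not None:
        raise ValueError(
            "resources_per_trial is not allowed when the Trainable "
            "declares a default resource request.")
    return user_pgf or default_request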
[RLlib] Re-enable placement group support for RLlib.
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.