
[RLlib + Tune] Add placement group support to RLlib. #14289

Merged: 19 commits merged into ray-project:master on Feb 25, 2021

Conversation

@sven1977 (Contributor) commented Feb 23, 2021

This PR adds support for placement groups to RLlib by:

  • Allowing Trainable.default_resource_request() to return a PlacementGroupFactory (as an alternative to returning a Resources object); see the sketch after this list.
  • Tune then uses the PlacementGroupFactory to derive a Resources object and uses the placement group for bundling resources.
  • Alternatively, one can still manually provide a placement group factory via tune.run().
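
For illustration, a minimal sketch of the first two bullets. This is hedged: `MyTrainable` and `num_workers` are hypothetical names, the bundle sizes are made up, and the `PlacementGroupFactory` import path is assumed for Ray around the time of this PR.

```python
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory


class MyTrainable(tune.Trainable):  # hypothetical Trainable subclass
    @classmethod
    def default_resource_request(cls, config):
        # One bundle for the driver, plus one 1-CPU bundle per worker
        # (bundle sizes here are invented for the example).
        return PlacementGroupFactory(
            [{"CPU": 1, "GPU": 0}] +
            [{"CPU": 1}] * config.get("num_workers", 0))
```

Per the third bullet, a factory could presumably also be passed manually, e.g. `tune.run(MyTrainable, resources_per_trial=PlacementGroupFactory([{"CPU": 1}]))`; the exact keyword is an assumption based on this PR's error message, not a verified call.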

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke (Contributor) left a comment:

Generally looks great! Let's just move the check up a bit; then we don't need to build a proper Resources object anymore.

Comment on lines 244 to 257

```python
counts = defaultdict(int)
for bundle in self.placement_group_factory._bundles:
    for device, c in bundle.items():
        counts[device.lower()] += c
custom_resources = {
    k: c
    for k, c in counts.items() if k not in ["cpu", "gpu"]
}
self.resources = Resources(
    cpu=counts["cpu"],
    gpu=counts["gpu"],
    custom_resources=custom_resources,
    has_placement_group=True,
)
```
Contributor:

We don't need this anymore. When using placement group factories, trial.resources is not used at all (None is passed and it is initialized as Resources(cpu=1, gpu=0)). Assuming that RLlib doesn't read trial.resources, we can remove this part.
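
For reference, a tiny sketch of the fallback described above (simplified, and the `Resources` import path is assumed for Ray at the time of this PR):

```python
from ray.tune.resources import Resources

# When a placement group factory is used, no Resources object is
# passed to the trial, so it falls back to a minimal 1-CPU default.
resources = None
trial_resources = resources or Resources(cpu=1, gpu=0)
print(trial_resources.cpu, trial_resources.gpu)  # 1 0
```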

Contributor:

I think we should move this to line 235, like this:

```python
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
else:
    # Set placement group factory to None for backwards compatibility.
    placement_group_factory = None
    resources = default_resources
```

We can then leave the rest of the code as-is.

Also, in line 229, let's check if the placement group factory is set, too.

@sven1977 (Author):

Ah, cool, even simpler :) So we don't need self.resources at all anymore, then.

@sven1977 (Author):

done

```python
# from RolloutWorkers (n rollout workers map to m
# aggregation workers, where m < n).
"CPU": cf["num_cpus_for_driver"] +
       cf["num_cpus_per_worker"] * cf["num_aggregation_workers"],
```
Contributor:

Wasn't this just one CPU per aggregation worker before?

@sven1977 (Author):

You are right, good catch!

@sven1977 (Author):

done
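
To make the fix concrete, here is a small self-contained sketch of the corrected driver-bundle arithmetic, assuming (as the exchange above suggests) that the fix restored the previous one-CPU-per-aggregation-worker accounting. The config values are hypothetical.

```python
# Hypothetical config values for illustration only.
cf = {
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 2,
    "num_aggregation_workers": 3,
}

# Before the fix: num_cpus_per_worker CPUs per aggregation worker,
# i.e. 1 + 2 * 3 = 7 CPUs in the driver bundle.
before = (cf["num_cpus_for_driver"]
          + cf["num_cpus_per_worker"] * cf["num_aggregation_workers"])

# After the fix: one CPU per aggregation worker, i.e. 1 + 3 = 4.
after = cf["num_cpus_for_driver"] + cf["num_aggregation_workers"]

print(before, after)  # 7 4
```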

Comment on lines 227 to 231

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
```

Contributor:

There's a lot of special-casing here in this file diff -- can we please avoid this?

Comment on lines 227 to 243

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
        "Resources for {} have been automatically set to {} "
        "by its `default_resource_request()` method. Please "
        "clear the `resources_per_trial` option.".format(
            trainable_cls, default_resources))

# New way: Trainable returns a PlacementGroupFactory object.
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
# Set placement group factory to None for backwards compatibility.
else:
    placement_group_factory = None
```

Contributor:

Suggested change (replacing the snippet above):

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources:
    if resources or placement_group_factory:
        raise ValueError(
            "Resources for {} have been automatically set to {} "
            "by its `default_resource_request()` method. Please "
            "clear the `resources_per_trial` option.".format(
                trainable_cls, default_resources))

    # New way: Trainable returns a PlacementGroupFactory object.
    if isinstance(default_resources, PlacementGroupFactory):
        placement_group_factory = default_resources
        resources = None
    # Set placement group factory to None for backwards compatibility.
    else:
        placement_group_factory = None
        resources = default_resources
```

(and remove the line under this).

We need to keep the indent here, and with the suggested change we keep the same logic as before.

@sven1977 (Author):

done

Comment on lines 249 to 253

```python
if isinstance(resources, PlacementGroupFactory):
    self.placement_group_factory = resources
else:
    self.placement_group_factory = placement_group_factory
```

Contributor:

Suggested change (replacing the block above):

```python
self.placement_group_factory = placement_group_factory
```

Contributor:

We don't need this block anymore (placement_group_factory is an argument of the constructor). With the changes above, resources will never hold anything other than None or a Resources object.

@sven1977 (Author):

removed.

@krfricke (Contributor) left a comment:

Looks great!

Branch-update commit: …ement_groups_support

# Conflicts:
#	rllib/agents/trainer.py
@sven1977 merged commit 6cd0cd3 into ray-project:master on Feb 25, 2021.
richardliaw added a commit to richardliaw/ray that referenced this pull request on Feb 25, 2021.
richardliaw added a commit that referenced this pull request on Feb 25, 2021.
@sven1977 deleted the placement_groups_support branch on March 27, 2021.