
[RLlib + Tune] Add placement group support to RLlib. #14289

Merged: 19 commits merged into ray-project:master on Feb 25, 2021

Conversation

@sven1977 (Contributor) commented Feb 23, 2021

This PR adds support for placement groups to RLlib by:

  • Allowing Trainable.default_resource_request() to return a PlacementGroupFactory (as an alternative to returning a Resources object); see the sketch after this list.
  • Tune then uses the PlacementGroupFactory to derive a Resources object and uses the placement group for bundling resources.
  • Alternatively, one can still manually provide a placement group factory via tune.run().
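
For illustration, a minimal sketch of the first two bullets. This is hedged: `MyTrainable` and `num_workers` are hypothetical names, the bundle sizes are made up, and the `PlacementGroupFactory` import path is assumed for Ray around the time of this PR.

```python
from ray import tune
from ray.tune.utils.placement_groups import PlacementGroupFactory


class MyTrainable(tune.Trainable):  # hypothetical Trainable subclass
    @classmethod
    def default_resource_request(cls, config):
        # One bundle for the driver, plus one 1-CPU bundle per worker
        # (bundle sizes here are invented for the example).
        return PlacementGroupFactory(
            [{"CPU": 1, "GPU": 0}] +
            [{"CPU": 1}] * config.get("num_workers", 0))
```

Per the third bullet, a factory could presumably also be passed manually, e.g. `tune.run(MyTrainable, resources_per_trial=PlacementGroupFactory([{"CPU": 1}]))`; the exact keyword is an assumption based on this PR's error message, not a verified call.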

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke (Contributor) left a comment:

Generally looks great! Let's just move the check up a bit; then we don't need to build a proper Resources object anymore.

Comment on lines 244 to 257

```python
counts = defaultdict(int)
for bundle in self.placement_group_factory._bundles:
    for device, c in bundle.items():
        counts[device.lower()] += c
custom_resources = {
    k: c
    for k, c in counts.items() if k not in ["cpu", "gpu"]
}
self.resources = Resources(
    cpu=counts["cpu"],
    gpu=counts["gpu"],
    custom_resources=custom_resources,
    has_placement_group=True,
)
```
Contributor:

We don't need this anymore. When using placement group factories, trial.resources is not used at all (None is passed and it is initialized as Resources(cpu=1, gpu=0)). Assuming that RLlib doesn't read trial.resources, we can remove this part.
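
For reference, a tiny sketch of the fallback described above (simplified, and the `Resources` import path is assumed for Ray at the time of this PR):

```python
from ray.tune.resources import Resources

# When a placement group factory is used, no Resources object is
# passed to the trial, so it falls back to a minimal 1-CPU default.
resources = None
trial_resources = resources or Resources(cpu=1, gpu=0)
print(trial_resources.cpu, trial_resources.gpu)  # 1 0
```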

Contributor:

I think we should move this to line 235, like this:

```python
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
else:
    # Set placement group factory to None for backwards compatibility.
    placement_group_factory = None
    resources = default_resources
```

We can then leave the rest of the code as-is.

Also, in line 229, let's check if the placement group factory is set, too.

@sven1977 (Author):

Ah, cool, even simpler :) So we don't need self.resources at all anymore, then.

@sven1977 (Author):

done

```python
# from RolloutWorkers (n rollout workers map to m
# aggregation workers, where m < n).
"CPU": cf["num_cpus_for_driver"] +
       cf["num_cpus_per_worker"] * cf["num_aggregation_workers"],
```
Contributor:

Wasn't this just one CPU per aggregation worker before?

@sven1977 (Author):

You are right, good catch!

@sven1977 (Author):

done
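
To make the fix concrete, here is a small self-contained sketch of the corrected driver-bundle arithmetic, assuming (as the exchange above suggests) that the fix restored the previous one-CPU-per-aggregation-worker accounting. The config values are hypothetical.

```python
# Hypothetical config values for illustration only.
cf = {
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 2,
    "num_aggregation_workers": 3,
}

# Before the fix: num_cpus_per_worker CPUs per aggregation worker,
# i.e. 1 + 2 * 3 = 7 CPUs in the driver bundle.
before = (cf["num_cpus_for_driver"]
          + cf["num_cpus_per_worker"] * cf["num_aggregation_workers"])

# After the fix: one CPU per aggregation worker, i.e. 1 + 3 = 4.
after = cf["num_cpus_for_driver"] + cf["num_aggregation_workers"]

print(before, after)  # 7 4
```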

Comment on lines 227 to 231

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
```

Contributor:

There's a lot of special-casing here in this file diff -- can we please avoid this?

Comment on lines 227 to 243

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources and (resources or placement_group_factory):
    raise ValueError(
        "Resources for {} have been automatically set to {} "
        "by its `default_resource_request()` method. Please "
        "clear the `resources_per_trial` option.".format(
            trainable_cls, default_resources))

# New way: Trainable returns a PlacementGroupFactory object.
if isinstance(default_resources, PlacementGroupFactory):
    placement_group_factory = default_resources
    resources = None
# Set placement group factory to None for backwards compatibility.
else:
    placement_group_factory = None
```

Contributor:

Suggested change (replacing the snippet above):

```python
# If Trainable returns resources, do not allow manual override via
# `resources_per_trial` by the user.
if default_resources:
    if resources or placement_group_factory:
        raise ValueError(
            "Resources for {} have been automatically set to {} "
            "by its `default_resource_request()` method. Please "
            "clear the `resources_per_trial` option.".format(
                trainable_cls, default_resources))

    # New way: Trainable returns a PlacementGroupFactory object.
    if isinstance(default_resources, PlacementGroupFactory):
        placement_group_factory = default_resources
        resources = None
    # Set placement group factory to None for backwards compatibility.
    else:
        placement_group_factory = None
        resources = default_resources
```

(and remove the line under this).

We need to keep the indent here, and with the suggested change we keep the same logic as before.

@sven1977 (Author):

done

Comment on lines 249 to 253

```python
if isinstance(resources, PlacementGroupFactory):
    self.placement_group_factory = resources
else:
    self.placement_group_factory = placement_group_factory
```

Contributor:

Suggested change (replacing the block above):

```python
self.placement_group_factory = placement_group_factory
```

Contributor:

We don't need this block anymore (placement_group_factory is an argument of the constructor). With the changes above, resources will never hold anything other than None or a Resources object.

@sven1977 (Author):

removed.

@krfricke (Contributor) left a comment:

Looks great!

Branch-update commit: …ement_groups_support

# Conflicts:
#	rllib/agents/trainer.py
@sven1977 merged commit 6cd0cd3 into ray-project:master on Feb 25, 2021.
richardliaw added a commit to richardliaw/ray that referenced this pull request on Feb 25, 2021.
richardliaw added a commit that referenced this pull request on Feb 25, 2021.
@sven1977 deleted the placement_groups_support branch on March 27, 2021.