
[serve] Add replica placement group support #37830

Merged: 35 commits into ray-project:master on Aug 10, 2023

Conversation

@edoakes (Contributor) commented Jul 26, 2023

Why are these changes needed?

Adds support for `placement_group_bundles` and `placement_group_strategy` in the deployment config. This enables creating a placement group per replica of a deployment, a feature requested by users who orchestrate multiple actors within a replica (e.g., to perform model-parallel inference).

The replica actor will be created in the bundle with index 0 (following the precedent set in Ray Train and Ray Tune).
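
For illustration, a deployment using the new options might look like the following (the bundle shapes, strategy, and deployment name here are illustrative, not taken from this PR's tests):

from ray import serve

# Each replica gets its own placement group built from these bundles and
# this strategy. The replica actor itself is placed in bundle 0, so
# bundle 0 must cover the replica's own resources (num_cpus=1 here).
@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 1},
    placement_group_bundles=[{"CPU": 1}, {"GPU": 1}],
    placement_group_strategy="STRICT_PACK",
)
class ModelParallel:
    def __call__(self, request) -> str:
        return "ok"

app = ModelParallel.bind()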

TODO before merging:

  • Add necessary docstrings.
  • Better handling of invalid placement groups (currently the deployment scheduler retries forever if there's an exception while scheduling a replica).
  • Add test for deploying via config (valid and invalid).
  • Add unit test to test_deployment_scheduler.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl (Contributor) commented Jul 26, 2023

cc @Yard1

@edoakes edoakes self-assigned this Aug 1, 2023
@edoakes edoakes changed the title [WIP][serve] Add replica placement group support [serve] Add replica placement group support Aug 8, 2023
@edoakes edoakes requested review from jjyao and a team August 8, 2023 16:13
@edoakes (Contributor, Author) commented Aug 8, 2023

@jjyao this is ready for review (note there is a little bit of cleanup pending as noted in the description).

@jjyao (Contributor) left a comment:

overall lgtm

Comment on lines +46 to +49
# Placement group bundles and strategy *for this replica*.
# These are optional: by default replicas do not have a placement group.
placement_group_bundles: Optional[List[Dict[str, float]]] = None
placement_group_strategy: Optional[str] = None
@jjyao (Contributor):

I think we can create a PlacementGroupDeploymentSchedulingPolicy that contains placement_group_bundles and placement_group_strategy and register this policy during on_deployment_created.

@edoakes (Contributor, Author):

I thought more about this, and I don't think it actually changes anything related to the DeploymentSchedulingPolicy. The placement group is only relevant to each replica itself. We probably still want to SPREAD the different placement groups relative to each other, like the existing policy does (and maintain things like max_replicas_per_node).

@jjyao (Contributor):

Yea, makes sense. Currently there is no way to spread PGs.

If you think it's a valid use case, do you mind filing an enhancement issue so I can track it on my side?
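
For context, a rough sketch of what per-replica placement group scheduling amounts to in Ray terms (simplified; create_replica_with_pg is a hypothetical helper, not the Serve scheduler's actual code):

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

def create_replica_with_pg(actor_cls, bundles, strategy):
    # One placement group per replica, built from the deployment's options.
    pg = placement_group(bundles, strategy=strategy)
    return actor_cls.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            # The replica actor lands in bundle 0 (per this PR).
            placement_group_bundle_index=0,
        )
    ).remote()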

Resolved (outdated) review thread on python/ray/serve/tests/test_replica_placement_group.py
@shrekris-anyscale (Contributor) left a comment:

Nice work so far! Most of my suggestions are nits to improve docstrings.

I left a few questions about co-scheduling Serve replicas with Ray actors/tasks using PGs, but based on the unit tests those don't seem as relevant. Feel free to resolve them if they're not.

Resolved review threads on:
  • python/ray/serve/_private/deploy_utils.py
  • python/ray/serve/_private/deployment_state.py (x4)
  • python/ray/serve/_private/utils.py
  • python/ray/serve/api.py
  • python/ray/serve/schema.py (x2)
@@ -76,9 +82,18 @@ def requires_long_poll_broadcast(self, new_version):
)

def compute_hashes(self):
# If this changes, the controller will directly restart all existing replicas.
# If these change, the controller will perform a rolling upgrade of existing replicas.
@jjyao (Contributor):

For my knowledge: is it a Ray Core requirement, or some other consideration, that we have to do a rolling upgrade after the placement group changes?

@edoakes (Contributor, Author):

You can't modify a placement group in place; you need to remove it and create another (similar to changing an actor's resource requirements).
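
For background, a minimal sketch of this using the public placement group API (the bundles below are arbitrary):

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

# Placement groups are immutable once created: to "change" one, you must
# remove it and create a replacement. This is why Serve does a rolling
# upgrade of replicas when bundles/strategy change.
pg = placement_group([{"CPU": 1}], strategy="PACK")
ray.get(pg.ready())

remove_placement_group(pg)
new_pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="STRICT_PACK")
ray.get(new_pg.ready())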

["serve", "status", "-a", "http://localhost:52365/"]
)
status = yaml.safe_load(cli_output)["applications"]
# TODO(zcin): fix error handling in the application state manager for
@edoakes (Contributor, Author):

@zcin here is the issue I discussed with you on Slack. I plan to merge this as-is; you can take it as a follow-up.

@shrekris-anyscale (Contributor) left a comment:

Nice work! This change looks good to me.

Review thread on python/ray/serve/_private/deployment_state.py
@jjyao (Contributor) left a comment:

lgtm for deployment state and scheduler parts.


Resolved (outdated) review thread on python/ray/serve/_private/deployment_state.py
@sihanwang41 (Contributor) left a comment:

Nice work! Two questions (non-blockers):

  1. Do we add the placement group info to serve status?
  2. Should we add a test spawning a pure Ray actor or Ray task from a Serve replica, to make sure they are under the same PG? (If we already have one, ignore this.)

@edoakes (Contributor, Author) commented Aug 9, 2023

@sihanwang41 yes, we should; I can file a follow-up issue for that. And yes, I'm verifying that the replica gets spawned in the right PG, but not actually spawning a task/actor from it; I'll add that.

serve.run(Infeasible.bind())


def test_coschedule_actors_and_tasks(serve_instance):
@edoakes (Contributor, Author):

@sihanwang41 I added the test you asked about (a rough sketch of the idea follows).
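
For reference, a rough sketch of the idea behind the test (names, bundle shapes, and assertions are illustrative, not the PR's exact code; it assumes child tasks/actors are captured into the replica's placement group by default):

import ray
from ray import serve
from ray.util import get_current_placement_group

@ray.remote(num_cpus=1)
def get_task_pg():
    return get_current_placement_group()

@ray.remote(num_cpus=1)
class PGReporter:
    def get_pg(self):
        return get_current_placement_group()

@serve.deployment(
    ray_actor_options={"num_cpus": 1},
    placement_group_bundles=[{"CPU": 1}, {"CPU": 2}],
    placement_group_strategy="STRICT_PACK",
)
class CoScheduled:
    async def __call__(self) -> bool:
        # The task and actor spawned here should land in the replica's PG.
        replica_pg = get_current_placement_group()
        task_pg = await get_task_pg.remote()
        actor = PGReporter.remote()
        actor_pg = await actor.get_pg.remote()
        return replica_pg.id == task_pg.id == actor_pg.id

handle = serve.run(CoScheduled.bind())
assert ray.get(handle.remote())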

@edoakes edoakes merged commit 6fef803 into ray-project:master Aug 10, 2023
126 of 128 checks passed
Commits in downstream forks later referenced this pull request, each cherry-picking the change with the PR description above:

  • shrekris-anyscale pushed a commit to shrekris-anyscale/ray (Aug 10, 2023)
  • NripeshN pushed a commit to NripeshN/ray (Aug 15, 2023)
  • harborn pushed two commits to harborn/ray (Aug 17, 2023)
  • arvind-chandra pushed a commit to lmco/ray (Aug 31, 2023)
  • vymao pushed a commit to vymao/ray (Oct 11, 2023)