[Data] Add function to dynamically generate `ray_remote_args` for Map APIs (#45143)

Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
    # For each new actor, get new scheduling strategy by
    # calling the generation fn.
    remote_args["scheduling_strategy"] = self._scheduling_strategy_fn()
    self._cls = ray.remote(**remote_args)(_MapWorker)
We should also handle the actors created during scale-up, right? https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py#L241-L247
good catch, added it.
python/ray/data/dataset.py (Outdated)

@@ -316,6 +318,8 @@ def parse_filename(row: Dict[str, Any]) -> Dict[str, Any]:
    concurrency: The number of Ray workers to use concurrently. For a fixed-sized
        worker pool of size ``n``, specify ``concurrency=n``. For an autoscaling
        worker pool from ``m`` to ``n`` workers, specify ``concurrency=(m, n)``.
    scheduling_strategy_fn: A function that returns a ``SchedulingStrategy``
        used to initialize the actor. Only valid if ``fn`` is a callable class.
actor -> worker, to be consistent with rest of docstring.
@@ -110,6 +110,7 @@ def create(
    # TODO(ekl): slim down ComputeStrategy to only specify the compute
    # config and not contain implementation code.
    compute_strategy: Optional[ComputeStrategy] = None,
    scheduling_strategy_fn: Optional[Callable[[], Any]] = None,
nit: Any -> SchedulingStrategyT
python/ray/data/dataset.py (Outdated)

@@ -785,6 +796,7 @@ def flat_map(
    num_cpus: Optional[float] = None,
    num_gpus: Optional[float] = None,
    concurrency: Optional[Union[int, Tuple[int, int]]] = None,
    scheduling_strategy_fn: Optional[Callable[[], "SchedulingStrategyT"]] = None,
Discussed with @c21 offline: it would be more generalizable to add a `ray_remote_args_fn`, because we may want to dynamically generate other args as well. The semantics could be that if both `ray_remote_args_fn` and the static `ray_remote_args` kwargs are specified, the kwargs take higher priority.
Also, this usage is a bit too advanced; maybe let's mark it as experimental for now.
@@ -205,7 +218,7 @@ def __call__(self, args):
        self.i %= len(self.locs)
        return args

-        self._ray_remote_args_factory = RoundRobinAssign(locs)
+        self._ray_remote_args_factory_actor_locality = RoundRobinAssign(locs)
rename to avoid similar naming with ray_remote_args_fn
    new_remote_args = self._ray_remote_args_fn()

    # Override args from user-defined remote args function.
    new_and_overriden_remote_args = {}
We always override with the params from `ray_remote_args_fn`, because Ray Data will use default parameters for `scheduling_strategy` and insert them into `ray_remote_args`. We want to always override with the scheduling strategy provided by the user-defined function, and IMO we should do the same for other parameters if the user provides `ray_remote_args_fn`.
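To make the override semantics concrete, here is a minimal sketch (the helper name `merge_remote_args` is illustrative, not the PR's actual code): args returned by `ray_remote_args_fn` always win over the static `ray_remote_args`, including the default `scheduling_strategy` Ray Data inserts.

```python
# Hypothetical helper illustrating the override semantics discussed above.
def merge_remote_args(ray_remote_args: dict, ray_remote_args_fn) -> dict:
    merged = dict(ray_remote_args)       # start from the static kwargs
    merged.update(ray_remote_args_fn())  # fn output overrides on conflict
    return merged

# Example: the default scheduling_strategy inserted by Ray Data is replaced
# by the one returned from the user-defined function; other args pass through.
static_args = {"num_cpus": 1, "scheduling_strategy": "DEFAULT"}
merged = merge_remote_args(static_args, lambda: {"scheduling_strategy": "SPREAD"})
# merged == {"num_cpus": 1, "scheduling_strategy": "SPREAD"}
```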
python/ray/data/dataset.py (Outdated)

    passed to each map worker. This function will be called each time prior
    to initializing the worker. Args returned by this function will always
    override the args in ``ray_remote_args``. Note: this is an advanced,
    experimental feature.
nit: maybe add a note that the purpose of this argument is to allow generating dynamic arguments for each actor/task.
@@ -204,6 +204,12 @@ def _can_fuse(self, down_op: PhysicalOperator, up_op: PhysicalOperator) -> bool:
        ):
            return False

        # Only fuse if at most one op specifies a `_ray_remote_args_fn`.
I tend to not fuse if either op has the fn, because the fn can return args that may be incompatible with the other op.
Yes, that's what I was thinking as well. But in the case of Read->Map with `ray_remote_args_fn` on the Map, these two operators would not be fused, which would lead to a performance drop.
Do you have a concrete example of how bad the perf drop is? I guess at least for the vLLM integration it's not an issue, right? Because the actors are GPU actors, they cannot be fused anyway.
Despite the perf drop in some cases, I still think, for correctness, we shouldn't fuse them, unless we have a way to tell whether the returned args are compatible with the previous op.
"at least for vLLM integration, it's not an issue, right? because the actors are GPU actors, they cannot be fused anyway."

Good point. I don't have a concrete example of a perf drop; I just remember from the earlier streaming executor discussion that fusing Read->Map cases was described as a must-have. But since that doesn't apply to this vLLM fix, I think it should be OK.
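The conservative fusion rule agreed on above can be sketched as follows (the class and function names are stand-ins for illustration, not Ray's actual internals): skip fusion whenever either operator carries a `_ray_remote_args_fn`, since its returned args may be incompatible with the other operator.

```python
class FakeOp:
    """Stand-in for a physical operator; models only the attribute we check."""
    def __init__(self, name, ray_remote_args_fn=None):
        self.name = name
        self._ray_remote_args_fn = ray_remote_args_fn

def can_fuse(down_op: FakeOp, up_op: FakeOp) -> bool:
    # Conservative rule: never fuse if either op has a _ray_remote_args_fn,
    # because the fn may return args incompatible with the other op.
    if down_op._ray_remote_args_fn is not None or up_op._ray_remote_args_fn is not None:
        return False
    return True

read = FakeOp("Read")
plain_map = FakeOp("Map")
dynamic_map = FakeOp("Map", ray_remote_args_fn=lambda: {"num_gpus": 1})
# can_fuse(read, plain_map) -> True; can_fuse(read, dynamic_map) -> False
```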
@@ -604,12 +660,18 @@ def _kill_all_running_actors(self):

    def _kill_running_actor(self, actor: ray.actor.ActorHandle):
        """Kill the provided actor and remove it from the pool."""
        pg = self._actor_to_placement_groups.pop(actor, None)
        if pg:
            ray.util.remove_placement_group(pg)
I don't think we should remove the PG here, because other bundles in the PG can still be in use. Also, I think it's a bit hacky to add special logic for PGs here.
I have 2 alternative solutions in mind:
- Let the application handle PG removal after execution finishes. The drawback is that scaling down will be delayed a bit.
- Instead of exposing `ray_remote_args_fn`, expose something lower level, such as "on_creating_actor" / "on_actor_killed" hooks.
I slightly prefer (1). But since vLLM should eventually move away from PGs, adding a temporary internal hook is acceptable as well.
cc @c21
"let the application handle PG removal after execution finishes. The drawback is scaling down will be delayed a bit."

Got it, +1 for option (1). The user-defined function can record the created PGs in a list, and remove all PGs at the end of the user program. WDYT?
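Option (1) could look like the following sketch of user-program code (nothing here is Ray Data internals; the function names and bundle shapes are illustrative): the user-defined fn records every placement group it creates, and the application removes them all after execution finishes.

```python
# Sketch of option (1): record PGs created by ray_remote_args_fn and clean
# them up at the end of the user program.
created_pgs = []

def ray_remote_args_fn():
    # Called once per new worker; create a fresh PG and remember it.
    # Bundle shape (2 GPUs, STRICT_PACK) is just an example.
    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    pg = ray.util.placement_group([{"GPU": 1}] * 2, strategy="STRICT_PACK")
    created_pgs.append(pg)
    return {"scheduling_strategy": PlacementGroupSchedulingStrategy(pg)}

def cleanup_placement_groups():
    # Run after dataset execution finishes.
    import ray
    for pg in created_pgs:
        ray.util.remove_placement_group(pg)
    created_pgs.clear()
```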
    ActorClass,
    concurrency=3,
    ray_remote_args_fn=_generate_ray_remote_args_with_scheduling_strategy,
).take_all()
nit: just to make this example more complete, let's also remove the PGs
LGTM
Why are these changes needed?

Adds a new parameter `ray_remote_args_fn` to Map APIs (`map()`, `map_batches()`, `flat_map()`, `filter()`), which allows the user to specify a function that returns a dict of Ray remote args to be passed to an actor initialized from ActorPoolMapOperator. This function is called each time a worker is initialized, allowing the user to specify the parameters for every worker (e.g. setting the scheduling strategy at runtime).

Currently, Ray Data only allows passing static Ray remote args, which has the limitation of sharing the placement group across all actors. This feature allows users to create different placement groups for each actor. For example, this will enable users to use Ray Data with vLLM with tensor parallel size > 1.
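As a small illustration of the behavior described above, the sketch below shows a `ray_remote_args_fn` being re-invoked per worker (the resource labels and the `Model` class are made up for the example; only the `ray_remote_args_fn` parameter comes from this PR):

```python
import itertools

# Hypothetical: alternate new workers across two custom resource labels to
# show that the fn is called again for each worker. Resource names are made up.
_slots = itertools.cycle(["slot_a", "slot_b"])

def ray_remote_args_fn():
    return {"resources": {next(_slots): 0.001}}

first = ray_remote_args_fn()   # requests slot_a
second = ray_remote_args_fn()  # requests slot_b

# Usage (assumes ds is a ray.data.Dataset and Model is a callable class):
# ds.map_batches(Model, concurrency=2, ray_remote_args_fn=ray_remote_args_fn)
```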
Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.