[Data] Introduce `concurrency` argument to replace ComputeStrategy in map-like APIs #41461

c21 · 2023-11-28T22:22:14Z

Why are these changes needed?

Generated doc for review - https://anyscale-ray--41461.com.readthedocs.build/en/41461/data/transforming-data.html#transforming-with-python-class .

This PR is to add an extra concurrency argument into all map-like APIs (map_batches, map, filter, flat_map, add_column, drop_columns, select_columns), with the motivation to deprecate compute argument.

The typing for new concurrency is

Optional[Union[int, Tuple[int, int]]]

So it allows user to set a fixed-sized actors pool, or an auto-scaling actors pool. For 2.9, the compute argument would still work, but will print out a warning message for users to migrate to use concurrency. So this PR does not break any existing code and maintains backward compatibility.

Several other alternatives:

Use two arguments min_concurrency, max_concurrency: max_concurrency is already a reserved parameter for Ray Core. This represents the number of concurrent actor tasks in Ray Core. So this would introduce extra confusion for users. In addition, we are recommending our users to use a fixed-sized actors pool for now. These two arguments are only useful for auto-scaling actors pool.
Introduce a class like ConcurrencyOption: Do not see a need right now, and it would go back to have same issue with ActorPoolStrategy. We can always overload the type of concurrency and add more new types later, without breaking backward compatibility.
Overload the type of existing argument compute: This would also work and requires minimal change from user side. The naming of compute is more vague than concurrency though.

Related issue number

#40725

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

c21 · 2023-11-28T22:24:01Z

Will update all documentation + unit test code after we agree on the API change (o.w. too many places to change).
PTAL and let me know your thoughts, thanks @raulchen, @ericl, @stephanie-wang and @amogkam.

stephanie-wang · 2023-11-29T18:14:00Z

Sorry, I don't have the context here.

it would go back to have same issue with ActorPoolStrategy.

What is the issue with ActorPoolStrategy?

c21 · 2023-11-29T18:24:07Z

What is the issue with ActorPoolStrategy?

This was coming from batch inference CUJ feedback, users found it's unnecessary to learn a separate class and import ActorPoolStrategy, and we don't really need to have a class here. This would also make code simpler.

python/ray/data/dataset.py

raulchen · 2023-11-29T18:54:24Z

What is the issue with ActorPoolStrategy?

This was coming from batch inference CUJ feedback, users found it's unnecessary to learn a separate class and import ActorPoolStrategy, and we don't really need to have a class here. This would also make code simpler.

Also, normal Data users may not know the differences between Ray tasks and actors. Would be better to not expose the implementation details.

python/ray/data/_internal/util.py

stephanie-wang

Nice, really like this API improvement!

Can you add unit tests to check for the concurrency set but fn is not callable class cases?

python/ray/data/_internal/util.py

python/ray/data/dataset.py

c21 · 2023-11-29T23:44:55Z

Can you add unit tests to check for the concurrency set but fn is not callable class cases?

@stephanie-wang, yes plan to add all unit test for the new concurrency argument. Looks like we have a general consensus on the API change now. Will start adding unit test.

doc/source/data/transforming-data.rst

python/ray/data/dataset.py

c21 · 2023-11-30T06:29:41Z

doc/source/data/transforming-data.rst

@@ -115,19 +100,87 @@ uses tasks by default.
        .map_batches(increase_brightness)
    )

-.. _transforming_data_actors:
+.. _configure_batch_format:


Github somehow shows the change, but this section (Configuring batch format) is not changed.

c21 · 2023-11-30T06:30:06Z

doc/source/data/transforming-data.rst

+                .map_batches(drop_nas, batch_format="pandas")
+            )
+
+Configuring batch size


Github somehow shows the change, but this section (Configuring batch size) is not changed.

doc/source/data/transforming-data.rst

c21 · 2023-11-30T17:25:20Z

All comments are addressed, please take another look whenever you have time, thanks!

c21 · 2023-11-30T19:38:55Z

python/ray/data/_internal/util.py

+    # Check if `fn` is a function or not.
+    # NOTE: use `inspect.isfunction(fn)` instead of `instanceof(fn, CallableClass)`,
+    # because latter returns False for an object instance of callable class.
+    if inspect.isfunction(fn):


This is the major behavior change. Previously we rely on checking if isinstance(fn, CallableClass), but it's not right when creating an object for the callable class. Changed to use inspect.isfunction(fn) to check if it's a function, which is more stable. Added a unit test to cover it.

import ray from ray.data.block import CallableClass def foo(x): return x class BarClass: def __init__(self, x): self._x = x def __call__(self, x): return x >>> isinstance(foo, CallableClass) False >>> isinstance(BarClass, CallableClass) True >>> isinstance(BarClass(1), CallableClass) False # <-- This does not work with current logic

Isn't this intended though? The issue is that usually BarClass(1) is a bug, since it means the user is instantiating the model on the driver. It also requires being able to serialize the instance, which is usually problematic.

By allowing this it seems that we are opening the API to more confusing / performance issues.

@ericl - ah that makes sense. This was coming from one of test failure - https://github.com/ray-project/ray/blob/master/doc/source/ray-core/_examples/datasets_train/datasets_train.py#L702 . I changed the test code instead.

btw on master, we always allow to run this kind of object instance (as a function):

>>> class BarClass: ... def __init__(self, x): ... self._x = x ... def __call__(self, x): ... return x ... >>> >>> fn = BarClass(1) >>> ds = ray.data.range(10) >>> ds = ds.map(fn) >>> ds.take_all() [{'id': 0}, {'id': 2}, {'id': 3}, {'id': 5}, {'id': 7}, {'id': 8}, {'id': 9}, {'id': 1}, {'id': 4}, {'id': 6}]

Are you thinking we should explicitly disallow and throw error if user passes an object instance? I am a little hesitating given this is always allowed.

I see, in this case, I think we should probably raise an error since it's definitely not doing what the user is expecting--- the __init__ won't be cached even though it looks like they're passing in an actor class.

@ericl - updated to throw error, and added a unit test.

Actually hit more unit test failure after making the change. Found out the inspect.isfunction() only works for plain function, but not works for class method, and partial function. For all of our Preprocessor classes, the function is passed as class method - Preprocessor._transform_pandas/_transform_numpy. Example of test failure is here

So I reverted the change to not throw error here, and added a TODO. Prefer to handle this case in the future.

For the Preprocessor classes, you could wrap the class method in a lambda to fix?

for our own code, we can work around with lambda, given we know the function signature, and what arguments to pass. What about user-defined arbitrary class method? Also I notice inspect.isfunction() does not work for partial function. I am open to adopt another approach if anything better I am not aware of.

python/ray/data/dataset.py

python/ray/data/tests/test_map.py

doc/source/data/transforming-data.rst

ericl

I don't think we should allow callable instances, for the reasons above.

Signed-off-by: Cheng Su <scnju13@gmail.com>

doc/source/data/working-with-images.rst

python/ray/data/dataset.py

ericl

One more request: with this API change, can we disallow specifying unbounded autoscaling, e.g., None? The reason is that the vast majority of the time, users get a better experience using a fixed sized pool. So we want to encourage this as the default, and make sure that autoscaling is an advanced feature users explicitly opt into.

The only reason this wasn't done before was backwards compatibility.

c21 · 2023-12-01T00:43:46Z

One more request: with this API change, can we disallow specifying unbounded autoscaling, e.g., None? The reason is that the vast majority of the time, users get a better experience using a fixed sized pool. So we want to encourage this as the default, and make sure that autoscaling is an advanced feature users explicitly opt into.

@ericl - what's the good default value you are thinking? I am thinking we can use the fixed actor pool size with 1 actor. But this would be quite slow, and not usable, so users have to set a concurrency by themselves anyway.

Another option is to always request users to provide a concurrency setting. For 2.9, either compute or concurrency has to be set (to maintain backward compatibility). For 2.10, concurrency has to be set by itself.

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 · 2023-12-01T23:43:22Z

Changed to require concurrency or compute to be set per offline discussion with @ericl. So we no longer have the default currency=None behavior.

c21 · 2023-12-02T00:01:48Z

All comments are addressed except for this request - #41461 (comment) . Can you help take another look? Thanks @ericl and @stephanie-wang.

stephanie-wang

Approve docs changes.

doc/source/data/transforming-data.rst

python/ray/data/dataset.py

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

ericl

LGTM

ericl · 2023-12-04T19:49:24Z

python/ray/data/tests/test_map.py

+    # Test concurrency not set.
+    result = ds.map(udf).take_all()
+    assert sorted(extract_values("id", result)) == list(range(10)), result
+    error_message = "``concurrency`` must be specified when using a callable class."


This PR is to fix the unit tests failure (was marked as flaky so not shown up in previous PR) - #41461. Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 requested review from ericl, scv119, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and Zandew as code owners November 28, 2023 22:22

c21 assigned ericl, stephanie-wang, raulchen and amogkam Nov 28, 2023

raulchen reviewed Nov 29, 2023

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ericl reviewed Nov 29, 2023

View reviewed changes

python/ray/data/_internal/util.py Show resolved Hide resolved

stephanie-wang reviewed Nov 29, 2023

View reviewed changes

stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 29, 2023

c21 added the ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) label Nov 30, 2023

raulchen reviewed Nov 30, 2023

View reviewed changes

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

c21 commented Nov 30, 2023

View reviewed changes

c21 added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 30, 2023

c21 commented Nov 30, 2023

View reviewed changes

raulchen approved these changes Nov 30, 2023

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

python/ray/data/tests/test_map.py Outdated Show resolved Hide resolved

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

ericl requested changes Nov 30, 2023

View reviewed changes

c21 added 5 commits November 30, 2023 16:10

Change to use inspect.isfunction to check if it's a function

10081c0

Signed-off-by: Cheng Su <scnju13@gmail.com>

Revert back change to check CallableClass and address all comments

e57dd90

Signed-off-by: Cheng Su <scnju13@gmail.com>

Address all comments

5b010f3

Signed-off-by: Cheng Su <scnju13@gmail.com>

Fix lint

0af91d2

Signed-off-by: Cheng Su <scnju13@gmail.com>

Fix doc link

d700bc2

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 force-pushed the concurrency branch from abdecc1 to d700bc2 Compare December 1, 2023 00:10

ericl reviewed Dec 1, 2023

View reviewed changes

doc/source/data/working-with-images.rst Outdated Show resolved Hide resolved

ericl reviewed Dec 1, 2023

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ericl reviewed Dec 1, 2023

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 1, 2023

c21 added 2 commits December 1, 2023 10:22

Address comments

fdbb8e6

Signed-off-by: Cheng Su <scnju13@gmail.com>

Revert back the change of throwing error on check function

1b852fc

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 2, 2023

stephanie-wang approved these changes Dec 2, 2023

View reviewed changes

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

stephanie-wang added 4 commits December 1, 2023 16:24

Update doc/source/data/transforming-data.rst

d524a9b

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Update doc/source/data/transforming-data.rst

c43f5e3

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Update doc/source/data/transforming-data.rst

0b94619

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Update python/ray/data/dataset.py

d140f5d

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

ericl approved these changes Dec 4, 2023

View reviewed changes

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2023

c21 merged commit c71f43c into ray-project:master Dec 4, 2023
16 checks passed

c21 deleted the concurrency branch December 4, 2023 19:52

c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2023

Zandew mentioned this pull request Dec 5, 2023

[CI] linux://python/ray/data:test_streaming_integration is failing/flaky on master. #41176

Open

c21 mentioned this pull request Dec 5, 2023

[Data] Fix unit tests in test_streaming_integration.py #41623

Merged

8 tasks

c21 added a commit that referenced this pull request Dec 5, 2023

[Data] Fix unit tests in test_streaming_integration.py (#41623)

cde9bf7

This PR is to fix the unit tests failure (was marked as flaky so not shown up in previous PR) - #41461. Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 mentioned this pull request Dec 6, 2023

[WIP][Data] Get rid of ActorPoolStrategy from tests and docs #41649

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Introduce `concurrency` argument to replace ComputeStrategy in map-like APIs #41461

[Data] Introduce `concurrency` argument to replace ComputeStrategy in map-like APIs #41461

c21 commented Nov 28, 2023 •

edited

Loading

c21 commented Nov 28, 2023

stephanie-wang commented Nov 29, 2023

c21 commented Nov 29, 2023

raulchen commented Nov 29, 2023

stephanie-wang left a comment

c21 commented Nov 29, 2023

c21 Nov 30, 2023

c21 Nov 30, 2023

c21 commented Nov 30, 2023

c21 Nov 30, 2023

ericl Nov 30, 2023

c21 Nov 30, 2023

ericl Dec 1, 2023

c21 Dec 1, 2023

c21 Dec 1, 2023

stephanie-wang Dec 2, 2023

c21 Dec 2, 2023 •

edited

Loading

ericl left a comment

ericl left a comment •

edited

Loading

c21 commented Dec 1, 2023 •

edited

Loading

c21 commented Dec 1, 2023

c21 commented Dec 2, 2023

stephanie-wang left a comment

ericl left a comment

ericl Dec 4, 2023

[Data] Introduce concurrency argument to replace ComputeStrategy in map-like APIs #41461

[Data] Introduce concurrency argument to replace ComputeStrategy in map-like APIs #41461

Conversation

c21 commented Nov 28, 2023 • edited Loading

Why are these changes needed?

Related issue number

Checks

c21 commented Nov 28, 2023

stephanie-wang commented Nov 29, 2023

c21 commented Nov 29, 2023

raulchen commented Nov 29, 2023

stephanie-wang left a comment

Choose a reason for hiding this comment

c21 commented Nov 29, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

c21 commented Nov 30, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

c21 Dec 2, 2023 • edited Loading

Choose a reason for hiding this comment

ericl left a comment

Choose a reason for hiding this comment

ericl left a comment • edited Loading

Choose a reason for hiding this comment

c21 commented Dec 1, 2023 • edited Loading

c21 commented Dec 1, 2023

c21 commented Dec 2, 2023

stephanie-wang left a comment

Choose a reason for hiding this comment

ericl left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

[Data] Introduce `concurrency` argument to replace ComputeStrategy in map-like APIs #41461

[Data] Introduce `concurrency` argument to replace ComputeStrategy in map-like APIs #41461

c21 commented Nov 28, 2023 •

edited

Loading

c21 Dec 2, 2023 •

edited

Loading

ericl left a comment •

edited

Loading

c21 commented Dec 1, 2023 •

edited

Loading