Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Introduce concurrency argument to replace ComputeStrategy in map-like APIs #41461

Merged
merged 19 commits into from
Dec 4, 2023

Conversation

c21
Copy link
Contributor

@c21 c21 commented Nov 28, 2023

Why are these changes needed?

Generated doc for review - https://anyscale-ray--41461.com.readthedocs.build/en/41461/data/transforming-data.html#transforming-with-python-class .

This PR is to add an extra concurrency argument into all map-like APIs (map_batches, map, filter, flat_map, add_column, drop_columns, select_columns), with the motivation to deprecate compute argument.

The typing for new concurrency is

Optional[Union[int, Tuple[int, int]]]

So it allows user to set a fixed-sized actors pool, or an auto-scaling actors pool. For 2.9, the compute argument would still work, but will print out a warning message for users to migrate to use concurrency. So this PR does not break any existing code and maintains backward compatibility.

Several other alternatives:

  • Use two arguments min_concurrency, max_concurrency: max_concurrency is already a reserved parameter for Ray Core. This represents the number of concurrent actor tasks in Ray Core. So this would introduce extra confusion for users. In addition, we are recommending our users to use a fixed-sized actors pool for now. These two arguments are only useful for auto-scaling actors pool.

  • Introduce a class like ConcurrencyOption: Do not see a need right now, and it would go back to have same issue with ActorPoolStrategy. We can always overload the type of concurrency and add more new types later, without breaking backward compatibility.

  • Overload the type of existing argument compute: This would also work and requires minimal change from user side. The naming of compute is more vague than concurrency though.

Related issue number

#40725

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21
Copy link
Contributor Author

c21 commented Nov 28, 2023

Will update all documentation + unit test code after we agree on the API change (o.w. too many places to change).
PTAL and let me know your thoughts, thanks @raulchen, @ericl, @stephanie-wang and @amogkam.

@stephanie-wang
Copy link
Contributor

Sorry, I don't have the context here.

it would go back to have same issue with ActorPoolStrategy.

What is the issue with ActorPoolStrategy?

@c21
Copy link
Contributor Author

c21 commented Nov 29, 2023

What is the issue with ActorPoolStrategy?

This was coming from batch inference CUJ feedback, users found it's unnecessary to learn a separate class and import ActorPoolStrategy, and we don't really need to have a class here. This would also make code simpler.

@raulchen
Copy link
Contributor

What is the issue with ActorPoolStrategy?

This was coming from batch inference CUJ feedback, users found it's unnecessary to learn a separate class and import ActorPoolStrategy, and we don't really need to have a class here. This would also make code simpler.

Also, normal Data users may not know the differences between Ray tasks and actors. Would be better to not expose the implementation details.

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, really like this API improvement!

Can you add unit tests to check for the concurrency set but fn is not callable class cases?

python/ray/data/_internal/util.py Outdated Show resolved Hide resolved
python/ray/data/_internal/util.py Outdated Show resolved Hide resolved
python/ray/data/_internal/util.py Outdated Show resolved Hide resolved
python/ray/data/_internal/util.py Outdated Show resolved Hide resolved
python/ray/data/_internal/util.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 29, 2023
@c21
Copy link
Contributor Author

c21 commented Nov 29, 2023

Can you add unit tests to check for the concurrency set but fn is not callable class cases?

@stephanie-wang, yes plan to add all unit test for the new concurrency argument. Looks like we have a general consensus on the API change now. Will start adding unit test.

@c21 c21 added the ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) label Nov 30, 2023
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@@ -115,19 +100,87 @@ uses tasks by default.
.map_batches(increase_brightness)
)

.. _transforming_data_actors:
.. _configure_batch_format:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github somehow shows the change, but this section (Configuring batch format) is not changed.

.map_batches(drop_nas, batch_format="pandas")
)

Configuring batch size
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github somehow shows the change, but this section (Configuring batch size) is not changed.

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
@c21 c21 added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 30, 2023
@c21
Copy link
Contributor Author

c21 commented Nov 30, 2023

All comments are addressed, please take another look whenever you have time, thanks!

# Check if `fn` is a function or not.
# NOTE: use `inspect.isfunction(fn)` instead of `instanceof(fn, CallableClass)`,
# because latter returns False for an object instance of callable class.
if inspect.isfunction(fn):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major behavior change. Previously we rely on checking if isinstance(fn, CallableClass), but it's not right when creating an object for the callable class. Changed to use inspect.isfunction(fn) to check if it's a function, which is more stable. Added a unit test to cover it.

import ray
from ray.data.block import CallableClass

def foo(x):
    return x

class BarClass:
    def __init__(self, x):
        self._x = x
    def __call__(self, x):
        return x

>>> isinstance(foo, CallableClass)
False
>>> isinstance(BarClass, CallableClass)
True
>>> isinstance(BarClass(1), CallableClass)
False # <-- This does not work with current logic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this intended though? The issue is that usually BarClass(1) is a bug, since it means the user is instantiating the model on the driver. It also requires being able to serialize the instance, which is usually problematic.

By allowing this it seems that we are opening the API to more confusing / performance issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl - ah that makes sense. This was coming from one of test failure - https://github.com/ray-project/ray/blob/master/doc/source/ray-core/_examples/datasets_train/datasets_train.py#L702 . I changed the test code instead.

btw on master, we always allow to run this kind of object instance (as a function):

>>> class BarClass:
...     def __init__(self, x):
...         self._x = x
...     def __call__(self, x):
...         return x
... 
>>> 
>>> fn = BarClass(1)
>>> ds = ray.data.range(10)
>>> ds = ds.map(fn)
>>> ds.take_all()
[{'id': 0}, {'id': 2}, {'id': 3}, {'id': 5}, {'id': 7}, {'id': 8}, {'id': 9}, {'id': 1}, {'id': 4}, {'id': 6}]

Are you thinking we should explicitly disallow and throw error if user passes an object instance? I am a little hesitating given this is always allowed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, in this case, I think we should probably raise an error since it's definitely not doing what the user is expecting--- the __init__ won't be cached even though it looks like they're passing in an actor class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl - updated to throw error, and added a unit test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually hit more unit test failure after making the change. Found out the inspect.isfunction() only works for plain function, but not works for class method, and partial function. For all of our Preprocessor classes, the function is passed as class method - Preprocessor._transform_pandas/_transform_numpy. Example of test failure is here

So I reverted the change to not throw error here, and added a TODO. Prefer to handle this case in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Preprocessor classes, you could wrap the class method in a lambda to fix?

Copy link
Contributor Author

@c21 c21 Dec 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for our own code, we can work around with lambda, given we know the function signature, and what arguments to pass. What about user-defined arbitrary class method? Also I notice inspect.isfunction() does not work for partial function. I am open to adopt another approach if anything better I am not aware of.

python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_map.py Outdated Show resolved Hide resolved
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should allow callable instances, for the reasons above.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
python/ray/data/dataset.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more request: with this API change, can we disallow specifying unbounded autoscaling, e.g., None? The reason is that the vast majority of the time, users get a better experience using a fixed sized pool. So we want to encourage this as the default, and make sure that autoscaling is an advanced feature users explicitly opt into.

The only reason this wasn't done before was backwards compatibility.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 1, 2023
@c21
Copy link
Contributor Author

c21 commented Dec 1, 2023

One more request: with this API change, can we disallow specifying unbounded autoscaling, e.g., None? The reason is that the vast majority of the time, users get a better experience using a fixed sized pool. So we want to encourage this as the default, and make sure that autoscaling is an advanced feature users explicitly opt into.

@ericl - what's the good default value you are thinking? I am thinking we can use the fixed actor pool size with 1 actor. But this would be quite slow, and not usable, so users have to set a concurrency by themselves anyway.

Another option is to always request users to provide a concurrency setting. For 2.9, either compute or concurrency has to be set (to maintain backward compatibility). For 2.10, concurrency has to be set by itself.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21
Copy link
Contributor Author

c21 commented Dec 1, 2023

Changed to require concurrency or compute to be set per offline discussion with @ericl. So we no longer have the default currency=None behavior.

@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 2, 2023
@c21
Copy link
Contributor Author

c21 commented Dec 2, 2023

All comments are addressed except for this request - #41461 (comment) . Can you help take another look? Thanks @ericl and @stephanie-wang.

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve docs changes.

doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
doc/source/data/transforming-data.rst Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# Test concurrency not set.
result = ds.map(udf).take_all()
assert sorted(extract_values("id", result)) == list(range(10)), result
error_message = "``concurrency`` must be specified when using a callable class."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2023
@c21 c21 merged commit c71f43c into ray-project:master Dec 4, 2023
16 checks passed
@c21 c21 deleted the concurrency branch December 4, 2023 19:52
@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2023
c21 added a commit that referenced this pull request Dec 5, 2023
This PR is to fix the unit tests failure (was marked as flaky so not shown up in previous PR) - #41461.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants