[Data] Introduce ``concurrency`` argument to replace ``ComputeStrategy`` in map-like APIs
#41461 (Merged)

Changes from 13 of 19 commits:
- 13c7117 Introduce argument to replace compute strategy in map-like APIs
- ecedee2 Address comments
- b278c21 Throw ValueError for function+concurrency
- 9ee0e5e Fix unit test
- 7bb9c63 Fix lint
- b38616b Address all comments
- adfb5fb Add unit test and fix lint
- bfe3f93 Undo notebook change
- 10081c0 Change to use inspect.isfunction to check if it's a function
- e57dd90 Revert back change to check CallableClass and address all comments
- 5b010f3 Address all comments
- 0af91d2 Fix lint
- d700bc2 Fix doc link
- fdbb8e6 Address comments
- 1b852fc Revert back the change of throwing error on check function
- d524a9b Update doc/source/data/transforming-data.rst (stephanie-wang)
- c43f5e3 Update doc/source/data/transforming-data.rst (stephanie-wang)
- 0b94619 Update doc/source/data/transforming-data.rst (stephanie-wang)
- d140f5d Update python/ray/data/dataset.py (stephanie-wang)
File: doc/source/data/transforming-data.rst
@@ -14,6 +14,7 @@ This guide shows you how to:

* :ref:`Transform rows <transforming_rows>`
* :ref:`Transform batches <transforming_batches>`
* :ref:`Stateful Transformation <stateful_transformation>`
* :ref:`Groupby and transform groups <transforming_groupby>`
* :ref:`Shuffle rows <shuffling_rows>`
* :ref:`Repartition data <repartitioning_data>`
@@ -84,22 +85,6 @@ Transforming batches

If your transformation is vectorized like most NumPy or pandas operations, transforming
batches is more performant than transforming rows.

Choosing between tasks and actors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Data transforms batches with either tasks or actors. Actors perform setup exactly
once. In contrast, tasks require setup for every batch. So, if your transformation
involves expensive setup like downloading model weights, use actors. Otherwise, use
tasks.

To learn more about tasks and actors, read the
:ref:`Ray Core Key Concepts <core-key-concepts>`.

Transforming batches with tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To transform batches with tasks, call :meth:`~ray.data.Dataset.map_batches`. Ray Data
uses tasks by default.

.. testcode::

    from typing import Dict
@@ -115,19 +100,90 @@ uses tasks by default.

        .map_batches(increase_brightness)
    )

.. _transforming_data_actors:
.. _configure_batch_format:

Configuring batch format
~~~~~~~~~~~~~~~~~~~~~~~~

Ray Data represents batches as dicts of NumPy ndarrays or pandas DataFrames. By
default, Ray Data represents batches as dicts of NumPy ndarrays.

To configure the batch type, specify ``batch_format`` in
:meth:`~ray.data.Dataset.map_batches`. You can return either format from your function.

.. tab-set::

    .. tab-item:: NumPy

        .. testcode::

            from typing import Dict
            import numpy as np
            import ray

            def increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
                batch["image"] = np.clip(batch["image"] + 4, 0, 255)
                return batch

            ds = (
                ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
                .map_batches(increase_brightness, batch_format="numpy")
            )

    .. tab-item:: pandas

        .. testcode::

            import pandas as pd
            import ray

            def drop_nas(batch: pd.DataFrame) -> pd.DataFrame:
                return batch.dropna()

            ds = (
                ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
                .map_batches(drop_nas, batch_format="pandas")
            )

Configuring batch size
~~~~~~~~~~~~~~~~~~~~~~

Increasing ``batch_size`` improves the performance of vectorized transformations like
NumPy functions and model inference. However, if your batch size is too large, your
program might run out of memory. If you encounter an out-of-memory error, decrease your
``batch_size``.

.. note::

    The default batch size depends on your resource type. If you're using CPUs,
    the default batch size is 4096. If you're using GPUs, you must specify an explicit
    batch size.
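As an aside, the batching behavior described above can be sketched in plain Python and NumPy. This is only an illustration of how a fixed ``batch_size`` slices a dataset, not Ray Data's implementation; ``iter_batches`` is an invented name:

```python
import numpy as np

def iter_batches(rows: np.ndarray, batch_size: int):
    # Yield consecutive slices of at most `batch_size` rows, the way a
    # dataset is split into batches before a vectorized transform runs.
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# 10 rows with batch_size=4 produce batches of 4, 4, and 2 rows.
sizes = [len(batch) for batch in iter_batches(np.arange(10), batch_size=4)]
print(sizes)  # [4, 4, 2]
```

A larger ``batch_size`` means fewer, bigger slices, which is why vectorized work speeds up but peak memory grows.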
.. _stateful_transformation:

Stateful Transformation
=======================

Transforming batches with actors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your transformation is stateful and requires expensive setup, such as downloading
model weights, use a callable Python class instead of a function. When you use a class,
Ray Data calls the ``__init__`` method to perform setup exactly once on each worker.
In contrast, functions are stateless, so any setup must be performed for every batch.

Internally, Ray Data uses tasks to execute functions and actors to execute classes.
To learn more about tasks and actors, read the
:ref:`Ray Core Key Concepts <core-key-concepts>`.

To transform data with a Python class, complete these steps:

1. Implement a class. Perform setup in ``__init__`` and transform data in ``__call__``.

2. Configure ``concurrency`` with the number of concurrent workers. Each worker
   transforms a partition of data in parallel. You can also pass a tuple of
   ``(min, max)`` to allow Ray Data to automatically scale the number of concurrent
   workers.

3. Call :meth:`~ray.data.Dataset.map_batches`, :meth:`~ray.data.Dataset.map`, or
   :meth:`~ray.data.Dataset.flat_map`.
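The setup-once behavior behind these steps can be sketched in plain Python. This is an illustration of the callable-class pattern, not Ray's implementation; ``Predictor`` and the counter are invented:

```python
setup_count = 0

class Predictor:
    """Callable class: setup in __init__, per-batch work in __call__."""

    def __init__(self):
        global setup_count
        setup_count += 1   # stands in for expensive setup, e.g. loading weights
        self.offset = 1    # stands in for the loaded model

    def __call__(self, batch):
        # Per-batch transformation that reuses the state built in __init__.
        return [x + self.offset for x in batch]

worker = Predictor()       # setup runs exactly once per worker
results = [worker(batch) for batch in ([1, 2], [3, 4], [5, 6])]
print(setup_count, results)  # 1 [[2, 3], [4, 5], [6, 7]]
```

With a plain function, the equivalent of ``__init__`` would have to run on every batch, which is the cost the class-based form avoids.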
.. tab-set::
@@ -154,7 +210,7 @@ To transform batches with actors, complete these steps:

            ds = (
                ray.data.from_numpy(np.ones((32, 100)))
                .map_batches(TorchPredictor, concurrency=2)
            )

        .. testcode::
@@ -188,7 +244,7 @@ To transform batches with actors, complete these steps:

            .map_batches(
                TorchPredictor,
                # Two workers with one GPU each
                concurrency=2,
                # Batch size is required if you're using GPUs.
                batch_size=4,
                num_gpus=1

@@ -200,65 +256,6 @@ To transform batches with actors, complete these steps:

    ds.materialize()
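One way to picture the fixed versus ``(min, max)`` forms of ``concurrency`` is as a bound on pool size. This is a toy model, not Ray Data's actual autoscaling policy; ``resolve_workers`` and ``pending_partitions`` are invented names:

```python
def resolve_workers(concurrency, pending_partitions):
    # Fixed pool: an int pins the number of concurrent workers.
    if isinstance(concurrency, int):
        return concurrency
    # Autoscaling: a (min, max) tuple lets the pool track demand
    # while staying inside the configured bounds.
    lo, hi = concurrency
    return max(lo, min(hi, pending_partitions))

print(resolve_workers(2, 50))       # 2 -> fixed pool size
print(resolve_workers((2, 8), 1))   # 2 -> never below min
print(resolve_workers((2, 8), 50))  # 8 -> never above max
```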
(Deleted here: the ``Configuring batch format`` and ``Configuring batch size`` sections, which move earlier in the file with their text unchanged.)
.. _transforming_groupby:

Groupby and transforming groups
Review comment: GitHub somehow shows the change, but this section (``Configuring batch format``) is not changed.