
[data] torch datasource streaming read #39554

Merged
merged 18 commits into from
Oct 4, 2023

Conversation

@Zandew (Contributor) commented Sep 11, 2023

Why are these changes needed?

Provides a parallel and streaming implementation for the from_torch read api.

In this PR, we set parallelism=1, so we only utilize the streaming portion of this api.

Not included in this PR:

  • Parallel read across multiple nodes
  • An option to preserve block order. Blocks may be reordered by parallel read tasks; not an issue here because we use parallelism=1.
  • Block metadata: there is no good way of getting the size/schema of a torch dataset, so we cannot autodetect parallelism well. Also not a problem for now because of parallelism=1.
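The streaming portion introduced here can be sketched in plain Python (iter_batches is a hypothetical helper used for illustration, not the PR's actual code in torch_datasource.py):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def iter_batches(dataset: Iterable, batch_size: int) -> Iterator[List]:
    """Lazily pull fixed-size batches from an iterable torch-style dataset.

    Items are fetched on demand, so the whole dataset is never
    materialized in memory at once -- the core of a streaming read.
    """
    it = iter(dataset)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# With parallelism=1, a single read task consumes the iterator end to end.
batches = list(iter_batches(range(10), batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```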

Benchmarks (For parallel read implementation, not streaming)

S3 Images

Image data at s3://air-example-data-2/20G-image-data-synthetic-raw/ was read with parallelism=-1 and iterated over with iter_torch_batches on 32 vCPU, 128 GiB nodes.

6 nodes
time: ~180s
cpu usage: peak ~56 cores

1 node
time: ~240s
cpu usage: peak ~27 cores

Speed up between cases: ~1.33x

The bottleneck here is deserializing the images afterwards on the driver. Parallelism does not help with this.

Fake Data

We can generate fake data roughly the same size as the S3 images using torch.randint(...) and spinning for some time (for i in range(int(1e7))) to emulate the scenario above without the need to deserialize the images.
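A minimal stand-in for that setup, using the standard library in place of torch (the shape and spin count below are illustrative, not the benchmark's actual values):

```python
import random

def make_fake_sample(shape=(8, 8), spin_iters=1_000):
    """Emulate torch.randint(...) plus per-sample CPU work.

    The benchmark spins for ~1e7 iterations per sample to stand in for
    image-decode cost; a much smaller count is used here for illustration.
    """
    n = 1
    for dim in shape:
        n *= dim
    data = [random.randint(0, 255) for _ in range(n)]  # fake pixel values
    acc = 0
    for i in range(spin_iters):  # busy loop emulating decode cost
        acc += i
    return data

sample = make_fake_sample()
# len(sample) == 64 for the default (8, 8) shape
```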

6 nodes
time: ~100s
cpu usage: peak ~170 cores

1 node
time: ~550s
cpu usage: peak ~27 cores

Speed up between cases: ~5.5x

Related issue number

Closes torch read #39287

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Zandew Zandew marked this pull request as ready for review September 13, 2023 01:14
Review comment on python/ray/data/datasource/torch_datasource.py:

    read_tasks.append(
        ReadTask(
            lambda subset=subsets[i], batch_size=self._batch_size: _read_subset(
This will serialize the subset as part of the read task function. I think there may be 2 issues:

  1. Some user-custom datasets may not be serializable.
  2. If the data is already in memory, serializing and copying it to a remote node would lead to even worse performance than just using from_items.

I'm thinking of 2 solutions:

  1. fall back to from_items in these cases.
  2. Pass a factory method as the parameter, instead of passing the concrete dataset object.
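Solution 2 can be sketched generically (make_read_task and read_task are hypothetical names used for illustration, not the PR's actual code):

```python
def make_read_task(dataset_factory, start, end):
    """Capture a zero-argument factory instead of the dataset itself.

    The factory (often just a lambda) is cheap to serialize; the dataset
    is constructed inside the read task, on the worker that runs it, so
    the in-memory data is never pickled and shipped with the task.
    """
    def read_task():
        ds = dataset_factory()  # built on the worker, not serialized as data
        return [ds[i] for i in range(start, end)]
    return read_task

# A list stands in for a map-style torch dataset supporting __getitem__.
task = make_read_task(lambda: list(range(100)), start=10, end=15)
# task() == [10, 11, 12, 13, 14]
```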

@Zandew Zandew changed the title [data] torch datasource parallel read [data] torch datasource streaming read Sep 29, 2023
@raulchen raulchen self-assigned this Oct 3, 2023
Andrew Xue added 18 commits October 3, 2023 16:43
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
@Zandew Zandew force-pushed the zandew-read-torch-datasource branch from 66e25c3 to 7951b68 Compare October 3, 2023 23:44
@raulchen raulchen merged commit 18cfead into master Oct 4, 2023
41 of 44 checks passed
@raulchen raulchen deleted the zandew-read-torch-datasource branch October 4, 2023 19:46
Zandew pushed a commit to Zandew/ray that referenced this pull request Oct 10, 2023
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
5 participants