
[data] torch datasource streaming read #39554

Merged
merged 18 commits into from
Oct 4, 2023

Conversation

@Zandew (Contributor) commented Sep 11, 2023

Why are these changes needed?

Provides a parallel and streaming implementation for the from_torch read api.

In this PR, we set parallelism=1, so we only utilize the streaming portion of this api.

Not included in this PR:

  • Parallel read across multiple nodes
  • An option to preserve block order. Blocks may be reordered by parallel read tasks; not an issue here because we use parallelism=1.
  • Block metadata: there is no good way of getting the size/schema of a torch dataset, so we cannot autodetect parallelism well. Also not a problem for now because of parallelism=1.
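The streaming portion introduced here can be sketched in plain Python (iter_batches is a hypothetical helper used for illustration, not the PR's actual code in torch_datasource.py):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def iter_batches(dataset: Iterable, batch_size: int) -> Iterator[List]:
    """Lazily pull fixed-size batches from an iterable torch-style dataset.

    Items are fetched on demand, so the whole dataset is never
    materialized in memory at once -- the core of a streaming read.
    """
    it = iter(dataset)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# With parallelism=1, a single read task consumes the iterator end to end.
batches = list(iter_batches(range(10), batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```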

Benchmarks (For parallel read implementation, not streaming)

S3 Images

Image data at s3://air-example-data-2/20G-image-data-synthetic-raw/ was read with parallelism=-1 and iterated over with iter_torch_batches on 32 vCPU, 128 GiB nodes.

6 nodes
time: ~180s
cpu usage: peak ~56 cores

1 node
time: ~240s
cpu usage: peak ~27 cores

Speed up between cases: ~1.33x

The bottleneck here is deserializing the images afterwards on the driver. Parallelism does not help with this.

Fake Data

We can generate fake data roughly the same size as the S3 images using torch.randint(...) and spinning for some time (for i in range(int(1e7))) to emulate the scenario above without the need to deserialize the images.
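A minimal stand-in for that setup, using the standard library in place of torch (the shape and spin count below are illustrative, not the benchmark's actual values):

```python
import random

def make_fake_sample(shape=(8, 8), spin_iters=1_000):
    """Emulate torch.randint(...) plus per-sample CPU work.

    The benchmark spins for ~1e7 iterations per sample to stand in for
    image-decode cost; a much smaller count is used here for illustration.
    """
    n = 1
    for dim in shape:
        n *= dim
    data = [random.randint(0, 255) for _ in range(n)]  # fake pixel values
    acc = 0
    for i in range(spin_iters):  # busy loop emulating decode cost
        acc += i
    return data

sample = make_fake_sample()
# len(sample) == 64 for the default (8, 8) shape
```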

6 nodes
time: ~100s
cpu usage: peak ~170 cores

1 node
time: ~550s
cpu usage: peak ~27 cores

Speed up between cases: ~5.5x

Related issue number

Closes torch read #39287

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Zandew Zandew marked this pull request as ready for review September 13, 2023 01:14
Review comment on python/ray/data/datasource/torch_datasource.py:

    read_tasks.append(
        ReadTask(
            lambda subset=subsets[i], batch_size=self._batch_size: _read_subset(
This will serialize the subset as part of the read task function. I think there may be 2 issues:

  1. Some user-custom datasets may not be serializable.
  2. If the data is already in memory, serializing and copying it to a remote node would lead to even worse performance than just using from_items.

I'm thinking of 2 solutions:

  1. fall back to from_items in these cases.
  2. Pass a factory method as the parameter, instead of passing the concrete dataset object.
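Solution 2 can be sketched generically (make_read_task and read_task are hypothetical names used for illustration, not the PR's actual code):

```python
def make_read_task(dataset_factory, start, end):
    """Capture a zero-argument factory instead of the dataset itself.

    The factory (often just a lambda) is cheap to serialize; the dataset
    is constructed inside the read task, on the worker that runs it, so
    the in-memory data is never pickled and shipped with the task.
    """
    def read_task():
        ds = dataset_factory()  # built on the worker, not serialized as data
        return [ds[i] for i in range(start, end)]
    return read_task

# A list stands in for a map-style torch dataset supporting __getitem__.
task = make_read_task(lambda: list(range(100)), start=10, end=15)
# task() == [10, 11, 12, 13, 14]
```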

@Zandew Zandew changed the title [data] torch datasource parallel read [data] torch datasource streaming read Sep 29, 2023
@raulchen raulchen self-assigned this Oct 3, 2023
Andrew Xue added 18 commits October 3, 2023 16:43
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
@Zandew Zandew force-pushed the zandew-read-torch-datasource branch from 66e25c3 to 7951b68 Compare October 3, 2023 23:44
@raulchen raulchen merged commit 18cfead into master Oct 4, 2023
41 of 44 checks passed
@raulchen raulchen deleted the zandew-read-torch-datasource branch October 4, 2023 19:46
Zandew pushed a commit to Zandew/ray that referenced this pull request Oct 10, 2023
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
5 participants