[data] torch datasource streaming read #39554
Merged
Conversation
scottjlee approved these changes on Sep 13, 2023
amogkam reviewed on Sep 13, 2023
scottjlee reviewed on Sep 14, 2023
scottjlee reviewed on Sep 26, 2023
raulchen reviewed on Sep 27, 2023
)
read_tasks.append(
    ReadTask(
        lambda subset=subsets[i], batch_size=self._batch_size: _read_subset(
This will serialize the subset as part of the read task function. I think there may be 2 issues:
- Some user-custom datasets may not be serializable.
- If the data is already in memory, serializing and copying it to a remote node would lead to even worse performance compared with just using `from_items`.

I'm thinking of 2 solutions:
- Fall back to `from_items` in these cases.
- Pass a factory method as the parameter, instead of passing the concrete dataset object.
amogkam reviewed on Sep 27, 2023
raulchen reviewed on Sep 29, 2023
raulchen approved these changes on Oct 3, 2023
Signed-off-by: Andrew Xue <andrewxue@anyscale.com>
Force-pushed from 66e25c3 to 7951b68 (Compare)
Zandew pushed a commit to Zandew/ray that referenced this pull request on Oct 10, 2023.
vymao pushed a commit to vymao/ray that referenced this pull request on Oct 11, 2023.
Why are these changes needed?
Provides a parallel and streaming implementation for the `from_torch` read API. In this PR, we set `parallelism=1`, so we only utilize the streaming portion of this API.

Not included in this PR:
- Parallel read across multiple nodes.
- Blocks may be reordered because of parallel read tasks; add an option to preserve order. Not an issue here because we use `parallelism=1`.
- Block metadata: there is no good way of getting the size/schema of a torch dataset, so we cannot autodetect parallelism well. Also not a problem for now because of `parallelism=1`.
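The streaming portion of the change can be illustrated with a plain-Python sketch (the general idea only, not the actual Ray Data implementation): a single read task iterates the torch-style dataset lazily and yields fixed-size batches, so the whole dataset is never materialized at once.

```python
# Minimal streaming sketch: consume any iterable (map- or
# iterable-style dataset) and yield fixed-size batches lazily.
def stream_batches(dataset, batch_size):
    batch = []
    for item in dataset:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch          # hand off a full batch, keep iterating
            batch = []
    if batch:
        yield batch              # final partial batch

batches = list(stream_batches(range(10), batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `parallelism=1` there is exactly one such consumer, which is why block ordering and metadata are not concerns yet.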
Benchmarks (for the parallel read implementation, not streaming)

S3 Images
Image data taken from `s3://air-example-data-2/20G-image-data-synthetic-raw/` was read with `parallelism=-1` and iterated over with `iter_torch_batches` on 32 vCPU, 128 GiB nodes.

6 nodes
(Screenshot: 6-node run metrics)
time: ~180s
cpu usage: peak ~56 cores
1 node
(Screenshot: 1-node run metrics)
time: ~240s
cpu usage: peak ~27 cores
Speed up between cases: ~1.33x
The bottleneck here is deserializing the images afterwards on the driver; parallelism does not help with this.
(Screenshot: driver-side profile)
Fake Data
We can generate fake data roughly the same size as the S3 images using `torch.randint(...)` and spinning for some time (`for i in range(1e7)`) to emulate the scenario above, but without the need to deserialize the images.

6 nodes
time: ~100s
cpu usage: peak ~170 cores
1 node
time: ~550s
cpu usage: peak ~27 cores
Speed up between cases: ~5.5x
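The fake-data setup above can be approximated with the standard library alone (random integers in place of `torch.randint`, and a much shorter spin loop), purely to show the shape of the benchmark workload, not to reproduce the numbers.

```python
import random

# Stdlib stand-in for the fake-data benchmark described above:
# random "pixel" rows replace torch.randint, and a short busy loop
# stands in for the spin that emulates per-item CPU work.
# All sizes here are illustrative, not the benchmark's actual sizes.
def fake_item(width=16):
    row = [random.randint(0, 255) for _ in range(width)]
    for _ in range(10_000):  # emulate per-item compute cost
        pass
    return row

items = [fake_item() for _ in range(8)]
print(len(items), len(items[0]))  # prints: 8 16
```

Because the fake items need no driver-side deserialization, the 6-node case scales close to linearly (~5.5x), unlike the image benchmark where the driver is the bottleneck.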
Related issue number
Closes #39287 (torch read)
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.