
[Data] Part of Datasources moved to _internal #46774

Closed
to3i opened this issue Jul 24, 2024 · 6 comments
Labels: data, enhancement, question

Comments


to3i commented Jul 24, 2024

What happened

In version 2.32, many datasources were moved to an _internal folder. We use some of these datasources to customize data loading in our pipeline.

What you expected to happen

Since some of these datasources (e.g., ImageDatasource) were marked with DeveloperAPI, we were expecting such changes to be announced more broadly (e.g., in a major release). Additionally, it is not clear what the reasoning behind this change is (the associated PR links to a Google Doc that is not publicly accessible) and what this implies for the future.

Will there be more breaking changes?
How is a Ray user supposed to use the datasources in the future?

Versions / Dependencies

ray[data] 2.32+

Reproduction script

from ray.data.datasource import ImageDatasource
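
A minimal sketch of how the break shows up (assuming Ray 2.32+, where the report says this public import path no longer resolves; the try/except is only for illustration):

    # Worked in earlier releases; per the report, the class now lives under an internal module.
    try:
        from ray.data.datasource import ImageDatasource
    except ImportError as exc:
        print(f"Public import no longer available: {exc}")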

Issue Severity

Low: It annoys or frustrates me.

@to3i added the bug and triage labels on Jul 24, 2024
@anyscalesam added the data label on Jul 29, 2024
@scottjlee (Contributor) commented

Our official policy:

Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases.

For this particular case with Datasources, we recommend using the high-level Read APIs (e.g. read_images()) when possible. Most Read APIs are PublicAPIs with higher stability. Out of curiosity, what's your use case for using the Datasource directly here?
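
For reference, a minimal sketch of the high-level route (the bucket path is a placeholder):

    import ray

    # Stable, high-level read API instead of instantiating ImageDatasource directly.
    ds = ray.data.read_images("s3://my-bucket/images/")  # placeholder path
    print(ds.schema())  # a single "image" column of decoded image arrays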

In terms of the motivation, I don't have full context on this (I couldn't find the Google Doc link). I will leave it to @bveeramani's comment when he returns from vacation.

@scottjlee added the question and enhancement labels and removed the bug and triage labels on Aug 21, 2024
@bveeramani (Member) commented

Hey @to3i, sorry about that. Breaking changes are always frustrating.

we were expecting such changes to be announced more broadly (major release).

Maybe this isn't discoverable enough, but in our documentation, we say we make breaking changes to developer APIs across minor releases: "Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases."

it is not clear what the reasoning behind this change is

We made datasource implementations internal for a few reasons:

  1. For consistency. Previously, datasource implementations could be internal, developer, or public. Now, they're all internal.
  2. To simplify the API surface and make the library easier to understand.
  3. To enable us to improve the library faster.

Will there be more breaking changes?

We don't have any interface changes planned for Datasource implementations as of now, although I can't guarantee anything.

How is a Ray user supposed to use the datasources in the future?

If your goal is to customize data loading and you can't achieve your customization with existing APIs (e.g. read_images or read_binary_files + map), then I'd recommend copying the ImageDatasource source code.
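
As an illustration of the read_binary_files + map route (the path is a placeholder, and the PIL-based decode function is our own sketch, not Ray's implementation):

    import io

    import numpy as np
    import ray
    from PIL import Image

    # Stream raw bytes, keeping the source path as a column, then decode in a map step.
    ds = ray.data.read_binary_files("s3://my-bucket/images/", include_paths=True)

    def decode(row):
        image = Image.open(io.BytesIO(row["bytes"]))
        return {"image": np.asarray(image), "path": row["path"]}

    ds = ds.map(decode)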

OOC, what were you doing with ImageDatasource that you couldn't do with read_images?


to3i commented Aug 28, 2024

Thank you for the clarification, @scottjlee and @bveeramani!

We found read_images/ImageDatasource to work more efficiently than using read_binary_files + map. Thus we extended the ImageDatasource with what we needed:

  • store the original image size in a column
  • store the storage account in a column (to identify items even when streaming from multiple storage accounts, because the path alone is insufficient)

I would prefer to use the more flexible read_binary_files + map approach. I wonder if I am missing something that would improve performance? Should one increase the _NUM_THREADS_PER_TASK similar to what ImageDatasource is doing?
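
Roughly the kind of customization meant above, sketched on top of read_binary_files + map instead of a subclassed ImageDatasource (the column names and the way the storage account is derived from the path are illustrative choices, not Ray APIs):

    import io

    import numpy as np
    import ray
    from PIL import Image

    def decode_with_metadata(row):
        image = Image.open(io.BytesIO(row["bytes"]))
        width, height = image.size
        return {
            "image": np.asarray(image),
            "original_width": width,    # original image size as extra columns
            "original_height": height,
            # illustrative: recover the storage account from the file URI
            "storage_account": row["path"].split("/")[0],
            "path": row["path"],
        }

    ds = ray.data.read_binary_files("container/images/", include_paths=True)  # placeholder path
    ds = ds.map(decode_with_metadata)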

BTW, is there a plan to allow FileBasedDatasource to handle multiple storage accounts? Right now I need to use one ImageDatasource per storage account and later merge them with union. While this works well for single items, loading tuples (like image and label) from multiple storage accounts becomes tricky with Ray Data. It seems that a zip(union([read_images] * num_storage), union([read_binary_files] * num_storage)) cannot be executed efficiently; a sketch of that pattern is shown below. Not sure if one of your issues is related to such a topic: #46049
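
For clarity, the shape of that pattern (paths are placeholders; per-account filesystem objects for Azure/GCS are omitted):

    import ray

    # One stream per storage account, then fan-in with union.
    image_streams = [ray.data.read_images(p) for p in ["acct-a/images/", "acct-b/images/"]]
    label_streams = [ray.data.read_binary_files(p) for p in ["acct-a/labels/", "acct-b/labels/"]]

    images = image_streams[0].union(*image_streams[1:])
    labels = label_streams[0].union(*label_streams[1:])

    # Zipping the two unions is where the forced materialization shows up.
    pairs = images.zip(labels)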

@bveeramani (Member) commented

I would prefer to use the more flexible read_binary_files + map approach. I wonder if I am missing something that would improve performance? Should one increase the _NUM_THREADS_PER_TASK similar to what ImageDatasource is doing?

Hmm... not sure why read_binary_files is performing worse than read_images. I'd expect them to perform similarly.

Increasing the number of threads might be worth a shot.

BTW is there a plan to allow FileBasedDatasource to handle multiple storage accounts?

No plans at the moment. By different storage accounts, do you mean different clouds or different accounts within the same cloud?


to3i commented Aug 29, 2024

BTW is there a plan to allow FileBasedDatasource to handle multiple storage accounts?

No plans at the moment. By different storage accounts, do you mean different clouds or different accounts within the same cloud?

Within the same cloud: streaming data from multiple storage accounts in Azure, or from multiple projects in GCS. I guess for AWS multiple buckets are not an issue, since no dedicated filesystem needs to be passed; see e.g. https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-cloud-storage

When loading from multiple storage accounts, we can use union to merge the different streams, but if we then want to zip label files to form a tuple, Ray first wants to materialize the union, which causes problems.

Fan-in with zip/union only seems to work without materialization for depth=1. Do you have any suggestions?

@bveeramani (Member) commented

Ah, gotcha.

I don't have any recommendations off the top of my head, but I'd be happy to discuss this more if you DM me on the Ray Slack.

For the time being, I'm going to close this issue because I think the question about moving the Datasource implementations has been resolved.
