[Data] Part of Datasources moved to _internal #46774
Comments
Our official policy:
For this particular case with Datasources, we recommend using the high-level Read APIs (e.g. In terms of the motivation, I don't have full context on this (couldn't find the Google Docs link). I will leave it to @bveeramani's comment when he returns from vacation.
Hey @to3i, sorry about that. Breaking changes are always frustrating.
Maybe this isn't discoverable enough, but in our documentation, we say we make breaking changes to developer APIs across minor releases: "Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases."
We made datasource implementations internal for a few reasons:
We don't have any interface changes planned for
If your goal is to customize data loading and you can't achieve your customization with existing APIs (e.g. OOC what were you doing with
Thank you for the clarification @scottjlee, @bveeramani ! We found
I would prefer to use the more flexible BTW is there a plan to allow
Hmm...not sure why Increasing the number of threads might be worth a shot.
No plans at the moment. By different storage accounts, do you mean different clouds or different accounts within the same cloud?
Within the same cloud: streaming data from multiple storage accounts in Azure or multiple projects in GCS. I guess multiple buckets are no issue on AWS, since no dedicated filesystem is passed. E.g., https://docs.ray.io/en/latest/data/loading-data.html#reading-files-from-cloud-storage When loading from multiple storage accounts, we can use union to merge the different streams, but if we then want to zip label files to form a tuple, Ray first wants to materialize the union, which causes problems. Fan-in with zip/union seems to work without materialization only for depth=1. Do you have any suggestions?
Ah, gotcha. I don't have any recommendations off the top of my head, but I'd be happy to discuss this more if you DM me on the Ray Slack. For the time being, I'm going to close this issue because I think the question about moving the
What happened
In version 2.32, many datasources were moved to an _internal folder. We use some of these datasources to customize the data loading in our pipeline.
What you expected to happen
Since some of these datasources (e.g., ImageDatasource) were marked with DeveloperAPI, we expected such changes to be announced more broadly (e.g., in a major release). Additionally, it is not clear what the reasoning behind this change is (the associated PR links to a Google Doc that is not publicly accessible) or what it implies for the future.
Will there be more breaking changes?
How is a ray user supposed to use the datasources in the future?
Versions / Dependencies
ray[data] 2.32+
Reproduction script
from ray.data.datasource import ImageDatasource
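For anyone hitting the same ImportError while migrating, here is a minimal sketch of a version-tolerant import. The helper name `import_first_available` and the candidate list are hypothetical, and the second module path is only an assumption about where the class ended up after the move; check it against your installed Ray version. Anything under `_internal` is not a supported API, so this is a stopgap rather than a recommendation.

```python
import importlib
from typing import Any, Iterable, Tuple


def import_first_available(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute importable from a list of (module, name) pairs."""
    errors = []
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{module_path}.{attr_name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported:\n" + "\n".join(errors)
    )


# The first entry is the pre-2.32 path from the reproduction script.
# The second entry is an ASSUMED post-move location, not confirmed in this thread.
IMAGE_DATASOURCE_CANDIDATES = [
    ("ray.data.datasource", "ImageDatasource"),
    ("ray.data._internal.datasource.image_datasource", "ImageDatasource"),
]


def load_image_datasource() -> Any:
    """Resolve ImageDatasource lazily so that importing this file does not require Ray."""
    return import_first_available(IMAGE_DATASOURCE_CANDIDATES)
```

Calling `load_image_datasource()` returns the class from whichever path resolves first and raises an ImportError listing every failed candidate otherwise.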
Issue Severity
Low: It annoys or frustrates me.