Generating references from files in S3 (using kerchunk + fsspec) #61
Comments
Wondering how we want to package up kwargs for all the different kerchunk file readers so we can pipe in reader-specific options. The netcdf3, netcdf4, tiff etc. readers all have a few options. ex:
|
Urgh that's messy - I'm not familiar at all with these so would defer to your judgement. But generally they should all go in one dict so that they can be cleanly separated from Xarray kwargs once open_virtual_dataset is replaced by xr.open_dataset(..., engine='virtualizarr'). |
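To make the "one dict" idea concrete, here is a minimal sketch. The function name `open_virtual_dataset_sketch` and the `reader_options` parameter are hypothetical names used only for illustration, and the URL and option values are placeholders. It is not the library's actual signature; it only shows how reader kwargs could stay separate from Xarray kwargs.

```python
from typing import Any, Dict, Optional


def open_virtual_dataset_sketch(
    filepath: str,
    reader_options: Optional[Dict[str, Any]] = None,
    **xarray_kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in for open_virtual_dataset (not the real function).

    Everything destined for the file reader travels in `reader_options`;
    everything else is treated as an Xarray kwarg.
    """
    # Copy so the caller's dict is never mutated.
    reader_options = dict(reader_options or {})
    # A real implementation would dispatch to a reader here; this sketch
    # only returns the two groups to show that they stay separate.
    return {
        "filepath": filepath,
        "reader_options": reader_options,
        "xarray_kwargs": xarray_kwargs,
    }


call = open_virtual_dataset_sketch(
    "s3://example-bucket/file.nc",  # placeholder URL
    reader_options={"storage_options": {"anon": True}},  # placeholder options
    drop_variables=["time_bnds"],  # an Xarray-style kwarg
)
```

Because the reader options never mix with the keyword arguments, moving to an `xr.open_dataset(..., engine='virtualizarr')` style call later would not require untangling them.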
Whilst I think the approach in #67 is fine for now, is it possible to imagine any other implementation of reading byte ranges from remote files that doesn't rely on fsspec / kerchunk? |
It's straightforward to abstract this via a wrapper class. Just imagine the API you'd like to use from the virtualizarr perspective, create an ABC interface for it, and then make an fsspec implementation. |
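A rough sketch of what such a wrapper could look like. The class names `ByteRangeReader`, `FsspecByteRangeReader` and `LocalByteRangeReader`, and the `read_range` method, are all hypothetical and not part of virtualizarr, kerchunk or fsspec. The fsspec-backed class only uses `url_to_fs` and `fs.open`; the local-file class is there to show that a second implementation can sit behind the same interface.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ByteRangeReader(ABC):
    """Hypothetical interface: fetch `length` bytes starting at `offset`."""

    @abstractmethod
    def read_range(self, url: str, offset: int, length: int) -> bytes:
        ...


class FsspecByteRangeReader(ByteRangeReader):
    """fsspec-backed implementation; fsspec is imported lazily."""

    def __init__(self, storage_options: Optional[Dict[str, Any]] = None):
        self.storage_options = storage_options or {}

    def read_range(self, url: str, offset: int, length: int) -> bytes:
        # Optional dependency, only needed when this class is actually used.
        from fsspec.core import url_to_fs

        fs, path = url_to_fs(url, **self.storage_options)
        with fs.open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


class LocalByteRangeReader(ByteRangeReader):
    """Standard-library implementation for local paths, no fsspec involved."""

    def read_range(self, url: str, offset: int, length: int) -> bytes:
        with open(url, "rb") as f:
            f.seek(offset)
            return f.read(length)


# Small demonstration against a temporary local file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")

reader: ByteRangeReader = LocalByteRangeReader()
chunk = reader.read_range(tmp.name, offset=2, length=4)
os.remove(tmp.name)
```

Calling code would depend only on `ByteRangeReader`, so swapping the fsspec implementation for something else later would not touch the callers.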
@TomNicholas that would be nice! As a stop-gap, I'm happy to continue working on #67, or if you want to pause and look for other file reader paths, that sounds good too. |
I really have no idea, since I've not worked with fsspec very much at all. Input welcome. I guess from the |
@TomNicholas curious about this comment #61 (comment)
since the current implementation relies on reading files using kerchunk - are you thinking about changing that implementation entirely? If keeping the current implementation, which relies on kerchunk modules, I think #67 makes sense as that PR just builds on that integration via the additional parameters kerchunk needs to open remote files. |
Well looking forward, we want to eventually use zarr chunk manifests for storing the byte ranges on disk, which leaves the only part of this library that actually depends on kerchunk being the kerchunk backends for reading the byte ranges. It's at least interesting to imagine what we could implement instead of using kerchunk to read into a format that we have to convert before using anyway. EDIT: IIUC with zarr v3 and the rust object-store crate we also wouldn't need fsspec to actually read data from disk/object storage either. (I wonder if it's possible to use the same crate to read just byte ranges?...) I'm only floating this as an idea, not a concrete plan though. |
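To illustrate the conversion step mentioned here, a minimal sketch follows. It assumes kerchunk-style references where a chunk key maps to a `[url, offset, length]` triple, and it assumes a manifest entry shaped like `{"path", "offset", "length"}`, which is my reading of the chunk manifest proposal rather than a settled format. The function name `refs_to_manifests` and the sample URL and numbers are made up for illustration.

```python
from typing import Any, Dict


def refs_to_manifests(refs: Dict[str, Any]) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Group kerchunk-style byte-range references by variable.

    Hypothetical helper: returns {variable: {chunk_key: entry}}, where each
    entry uses the assumed manifest shape {"path", "offset", "length"}.
    """
    manifests: Dict[str, Dict[str, Dict[str, Any]]] = {}
    for key, ref in refs.items():
        # Only [url, offset, length] triples describe byte ranges. Metadata
        # and inlined data are stored as strings and are skipped, as are
        # single-element [url] references to whole files.
        if not (isinstance(ref, list) and len(ref) == 3):
            continue
        variable, _, chunk_key = key.rpartition("/")
        url, offset, length = ref
        manifests.setdefault(variable, {})[chunk_key] = {
            "path": url,
            "offset": offset,
            "length": length,
        }
    return manifests


# Made-up references for a single variable with two chunks.
kerchunk_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "air/.zarray": '{"shape": [2, 4], "chunks": [2, 2]}',
    "air/0.0": ["s3://example-bucket/air.nc", 100, 50],
    "air/0.1": ["s3://example-bucket/air.nc", 150, 50],
}

manifests = refs_to_manifests(kerchunk_refs)
```

Nothing in this step touches the remote file; it only reshapes references that have already been generated.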
I would say it's ok to postpone S3 support until some of those things you mentioned (zarr chunk manifests, zarr v3 rust object-store) exist - wdyt @norlandrhagen? Is there already WIP |
The rust crate exists, but you'll have to ask @jhamman about the status of attempts to use it to load data into python. |
@abarciauskas-bgse actually see zarr-developers/zarr-python#1661 |
So the rust But to actually use
    if isinstance(h5f, str):
        fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
        input_file = fs.open(path, "rb")
        url = h5f
        _h5f = h5py.File(input_file, mode="r")
Is there any way to do this without using fsspec? |
Answering my own question, how about this |
Is this within scope for VirtualiZarr right now? Generating the references is a very separate problem from storing / manipulating them. I'd recommend continuing to rely on Kerchunk for the reference generation. No, it would not be easy to use h5py in this way without fsspec. But it is possible to re-implement the logic for finding the chunks from scratch in a much more efficient way. See hidefix for example. |
I agree it's fairly orthogonal, and not part of an MVP (which would rely on fsspec and kerchunk both for generating references and reading actual bytes later). But if we're thinking about how to create references that don't require fsspec to read bytes from (i.e. zarr stores with chunk manifests), then it's also interesting to think about how to generate those references without fsspec. But there is less of an obvious motivation for doing so other than reliability / performance / design simplicity (whereas zarr stores without fsspec get us an actual feature: language-agnostic data access). |
From my point of view, generating the references is a one-time cost. Yes, it could be more efficient. But that's generally not the bottleneck for people today. Our goal is to make it more flexible to manipulate those references once they are generated. |
I'm less concerned about efficiency and more about reliability / understandability.
Yes that's the main goal, but also (a) being more confident that the generated references are correct and (b) not using dependencies that are more complex than necessary (i.e. fsspec) would be nice secondary goals too. |
I think we've gone off track from the original issue here, so I've opened #78 to specifically discuss generating references from data in S3 without using fsspec (as opposed to reading bytes from known references to data in S3). I'll leave this issue to discuss the generating of references from files in S3 using kerchunk + fsspec. |
@abarciauskas-bgse @norlandrhagen it would be great to try to move forward with using kerchunk + fsspec for now (i.e. PR #67) with the understanding that eventually we would like to replace those dependencies with the ideas in #78. That would allow us to make this package useful for real cases faster. |
I think this was closed by @norlandrhagen in #67, with the other ideas being tracked in #78. |
Raised by @abarciauskas-bgse in #60.
Presumably kerchunk's backends opener functions can already do this? In which case we just need to pass the right kwargs / fsspec whatevers to those inside open_virtual_dataset.