Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring open_dataset via backend instances #8520

Open
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

headtr1ck
Copy link
Collaborator

@headtr1ck headtr1ck commented Dec 4, 2023

Support passing instances of BackendEntryPoints as the engine argument.

Closes #8447

Then instead of passing a long list of options to the open_dataset method directly, you can also configure the entrypoint in the constructor and pass it as the engine.

It would look something like this:

engine = NetCDF4BackendEntrypoint(mode="a", clobber=False)
ds = xr.open_dataset("some_file.nc", engine=engine)

While this is actually even more lines of code, the main advantage is to have better discoverability of the options.

TODO:

  • Adapt netcdf4 backend
  • Adapt h5netcdf backend
  • Adapt zarr backend
  • Adapt scipy backend
  • Adapt pydap backend
    • output_grid seems to be always set to True? is this intentional, why not remove it instead?
    • verify and user_charset are non-existent in pydap? > I still had pydap version 3.2, in 3.4 they exist...
    • typing is only my first impression. Not easy if upstream libs are untyped :/
  • Adapt pynio backend > Won't adapt because deprecated
  • Fix docstrings to include init options
  • Check if lock=True is allowed > Not allowed, otherwise scipy backend breaks
  • Change default to lock=True instead of None? Maybe a later PR?
  • Rename XXXBackendEntrypoint > XXXBackend ?
  • The autoclose argument seems to do nothing? > Actually it is used in BaseNetCDF4Array, all good
  • Move group to open_dataset instead of backend option? > Its not really a decoder either. Not sure, for now leave it in the init...
  • Improve _resolve_decoders_kwargs, this function has a lot of implicit assumtions? Maybe remove open_dataset_parameters alltogether?
  • Add tests for passing backend directly via engine argument
  • open_dataset now has **kwargs to support backwards compatibility. Probably we should raise if unsupported stuff is added (i.e. typos) otherwise this could be confusing? (i.e. see test in zarr that checks for deprecated auto_chunk)

@max-sixty
Copy link
Collaborator

+1 for the concept!

@TomNicholas
Copy link
Contributor

I actually wondered about this when I did the ChunkManagerEntrypoint stuff - you can currently pass the entrypoint class directly to the chunks kwarg if you want, but it's less useful so I didn't advertise it.

@headtr1ck
Copy link
Collaborator Author

Seems like there is support for this :)
Then I will continue to work on this!

Another related proposal: we could remove the XXXEntrypoint from the class names and only call them XXBackend since these classes are now not only used for entry points anymore.

@dcherian
Copy link
Contributor

dcherian commented Dec 5, 2023

NetCDF4BackendEntrypoint(mask_and_scale=False, use_cftime=True)

These are encode/decode options. I think they are mostly conceptually distinct from backend_kwargs which would be the thing to enable with this proposed API.

See issue about an API for encoding/decoding options.

@headtr1ck
Copy link
Collaborator Author

Ok, I was simply taking all options that the current backend had and moved them.

Maybe a good idea to split these up in this PR and move only backends options to the init and leave the encoding options still in the open_dataset method.
Then a follow-up PR can deal with an encoder class.

@dcherian
Copy link
Contributor

dcherian commented Dec 5, 2023

a good idea to split these up in this PR and move only backends options to the init and leave the encoding options still in the open_dataset method.

Yes. I think increasing discoverability of what goes in backend_kwargs would be a major win.

Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@headtr1ck I'm trying to add my 2cents for the h5netcdf backend.

Yes, from my perspective h5netcdf backend should be in line with netcdf4 backend (wrt autoclose and mode) as it uses BaseNetCDF4Array and other netcdf4-machinery under the hood consuming those options.

I've added a comment inline for the decode_vlen_strings option.

xarray/backends/h5netcdf_.py Outdated Show resolved Hide resolved
@headtr1ck
Copy link
Collaborator Author

Can anyone help with the pydap options (see description)?

@headtr1ck headtr1ck marked this pull request as ready for review December 12, 2023 20:24
@headtr1ck
Copy link
Collaborator Author

I think I will stop here and do additional work in a future PR, or probably collect further ideas first in an issue to discuss it properly.

@benbovy
Copy link
Member

benbovy commented Dec 19, 2023

Another related proposal: we could remove the XXXEntrypoint from the class names and only call them XXBackend since these classes are now not only used for entry points anymore.

+1 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve discoverability of backend engine options
7 participants