Shaping the future of Backends #8548

Open
headtr1ck opened this issue Dec 12, 2023 · 3 comments


@headtr1ck
Collaborator

What is your issue?

Backends in xarray are used to read and write files (or objects in general) and to transform them into useful xarray Datasets.

This issue will collect ideas on how to continuously improve them.

Current state

Along the reading and writing process there are many implicit and explicit configuration possibilities. There are many backend-specific options and many encoder- and decoder-specific options. Most of them are currently difficult or even impossible to discover.

There is the infamous open_dataset method which can do everything, but there are also some specialized methods like open_zarr or to_netcdf.
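
For illustration, a single open_dataset call today mixes backend-specific and decoding options with nothing in the signature to tell them apart (a minimal sketch; "file.nc" is a placeholder):

import xarray as xr

ds = xr.open_dataset(
    "file.nc",
    engine="netcdf4",
    group="data",          # backend option, forwarded to the netCDF4 backend
    mask_and_scale=False,  # decoding option, handled by the CF decoders
    decode_times=True,     # decoding option, handled by the CF decoders
)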

The only really formalized way to extend xarray's capabilities is via the BackendEntrypoint, which currently only supports reading files.
This has proven to work, and things are going so well that people are discussing getting rid of the specialized reading methods (#7495).
A major critique in this thread is again the discoverability of configuration options.
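
For context, the documented read-only interface looks roughly like this (a minimal sketch: the dummy data and the ".dummy" suffix are invented, and registration via the "xarray.backends" entry-point group is omitted):

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint

class DummyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # a real backend would parse filename_or_obj here and return the
        # raw "backend dataset", i.e. before any decoding is applied
        ds = xr.Dataset({"var": ("x", np.arange(3))})
        if drop_variables:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".dummy")

# a BackendEntrypoint subclass can be passed directly as the engine
ds = xr.open_dataset("data.dummy", engine=DummyBackendEntrypoint)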

Problems

To name a few:

  • Discoverability of configuration options is poor
  • No distinction between backend and encoding options
  • New options are simply added as another keyword argument to open_dataset
  • No writing support for backends

What has already improved

The future

After listing all the problems, let's see how we can improve the situation and make backends an all-round solution for reading and writing all kinds of files.

What happens behind the scenes

In general, reading and writing Datasets in xarray is a three-step process.

                       [ done by backend.open_dataset]
Dataset < chunking   < decoding < opening_in_store < file
Dataset > validating > encoding > storing_in_store > file

One could arguably combine chunking and decoding, as well as validating and encoding, into a single logical step in the pipeline. This view should help decide how to set up a future architecture for backends.

You can see that there is a common middle object in this process: an in-memory representation of the file on disk, sitting between en-/decoding and the abstract store. This is actually an xarray.Dataset and is internally called a "backend dataset".
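
This backend dataset can already be inspected today by switching decoding off and applying it manually as a separate step (a sketch; "file.nc" is a placeholder):

import xarray as xr

raw = xr.open_dataset("file.nc", decode_cf=False)  # the raw "backend dataset"
ds = xr.decode_cf(raw)                             # decoding as an explicit step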

write_dataset method

A quite natural extension of backends would be to implement a write_dataset method (name pending). This would allow backends to serve the writing row of the pipeline as well.
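
Nothing like this exists yet; purely as a hypothetical sketch, it could mirror open_dataset on the same entrypoint class:

from xarray.backends import BackendEntrypoint

class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        ...  # opening_in_store, as today

    def write_dataset(self, dataset, filename_or_obj, *, mode="w"):
        # hypothetical counterpart: receives the already-encoded dataset
        # and performs the storing_in_store step
        ...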

Transformer class

For lack of a common word for a class that handles both "encoding" and "decoding", I will call it a transformer here.

The process of en- and decoding is currently hardcoded into the respective open_dataset and to_netcdf methods.
One could imagine introducing a common class that handles both directions.

This class could handle the already-implemented CF or netCDF encoding conventions.
But it would also allow users to define their own storage conventions (why not a custom transformer that adds indexes based on variable attributes?).
The possibilities are endless, and an interface that fulfills all the requirements still has to be found; a rough sketch is given below.
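
Purely as a hypothetical sketch, such a class could expose symmetric decode/encode methods (all names here are invented; the subclass implements the index-from-attributes idea above):

import xarray as xr

class Transformer:
    def decode(self, backend_ds: xr.Dataset) -> xr.Dataset:
        raise NotImplementedError  # backend dataset -> user-facing dataset

    def encode(self, ds: xr.Dataset) -> xr.Dataset:
        raise NotImplementedError  # user-facing dataset -> backend dataset

class IndexFromAttrsTransformer(Transformer):
    def decode(self, backend_ds):
        # promote variables tagged with an "is_index" attribute to coordinates
        names = [name for name, var in backend_ds.variables.items()
                 if var.attrs.get("is_index")]
        return backend_ds.set_coords(names)

    def encode(self, ds):
        return ds  # the attributes already carry all needed information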

This would homogenize the reading and writing process to

Dataset <> Transformer <> Backend <> file

As a bonus, this would increase the discoverability of the decoding options (which would then be transformer arguments).

The new interface then could be

backend = Netcdf4BackendEntrypoint(group="data")
decoder = CFTransformer(cftime=True)
ds = xr.open_dataset("file.nc", engine=backend, decoder=decoder)

while of course still allowing all options to be passed simply as kwargs (since this is still the easiest way of telling beginners how to open files)

The final improvement here would be to add additional entrypoints for these transformers ;)

Disclaimer

Now, this issue is just a bunch of rough ideas that require quite some refinement, and some might even turn out to be nonsense.
So let's have an exciting discussion about these things :)
If you have something to add to the above points, I will include your ideas as well. This is meant as a collection of ideas on how to improve our backends :)

@keewis
Collaborator

keewis commented Dec 12, 2023

see also #5954 for a previous discussion of the write_dataset idea (the name I proposed there was xr.save_dataset to be symmetric with save_mfdataset)
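
For illustration, such a function would make writing symmetric to reading (hypothetical; no such function exists today):

import xarray as xr

ds = xr.Dataset({"var": ("x", [1, 2, 3])})

# hypothetical counterpart to xr.open_dataset, following the naming
# of the existing xr.save_mfdataset; this function does not exist today
xr.save_dataset(ds, "out.nc", engine="netcdf4")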

@TomNicholas TomNicholas added this to To do in Flexible Storage Backends via automation Dec 13, 2023
@TomNicholas
Contributor

> For lack of a common word for a class that handles both "encoding" and "decoding", I will call it a transformer here.
>
> The process of en- and decoding is currently hardcoded into the respective open_dataset and to_netcdf methods.
> One could imagine introducing a common class that handles both directions.
>
> This class could handle the already-implemented CF or netCDF encoding conventions.

Doesn't this already exist as xarray.coding.VariableCoder? It has .encode and .decode methods. Are we basically just talking about making it public, allowing users to pass in custom subclasses of VariableCoder, and generalizing xarray.conventions to be configurable for non-CF cases?

> (why not a custom transformer that adds indexes based on variable attributes?)

On the other hand, this suggestion seems to be something that the current VariableCoder design could not immediately handle.
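
For reference, a custom coder against the current internal API could look roughly like this (a sketch only: xarray.coding.variables is private and may change, and the "myscale" attribute is invented for illustration):

from xarray.coding.variables import VariableCoder
from xarray.core.variable import Variable

class MyScaleCoder(VariableCoder):
    def decode(self, variable: Variable, name=None) -> Variable:
        attrs = dict(variable.attrs)
        scale = attrs.pop("myscale", None)
        if scale is None:
            return variable
        # move the attribute into encoding so encode() can restore it later
        encoding = dict(variable.encoding, myscale=scale)
        return Variable(variable.dims, variable.data * scale, attrs, encoding)

    def encode(self, variable: Variable, name=None) -> Variable:
        encoding = dict(variable.encoding)
        scale = encoding.pop("myscale", None)
        if scale is None:
            return variable
        attrs = dict(variable.attrs, myscale=scale)
        return Variable(variable.dims, variable.data / scale, attrs, encoding)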

@dcherian
Contributor

Agree that these "transformers" are called "coders" ATM; linking this quite old proposal: #155
