Skip to content

Group together decoding options into a single argument #4490

Open
@shoyer

Description

@shoyer

Is your feature request related to a problem? Please describe.

open_dataset() currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see #4477), which might need to know how to handle all these arguments.

Describe the solution you'd like

To simple the interface, I propose to group together all the decoding options into a new DecodingOptions class. I'm thinking something like:

from dataclasses import dataclass, field, asdict
from typing import Optional, List

@dataclass(frozen=True)
class DecodingOptions:
    mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None

    @classmethods
    def disabled(cls):
        return cls(mask=False, scale=False, datetime=False, timedelta=False,
                  concat_characters=False, coords=False)

    def non_defaults(self):
        return {k: v for k, v in asdict(self).items() if v is not None}

    # add another method for creating default Variable Coder() objects,
    # e.g., those listed in encode_cf_variable()

The signature of open_dataset would then become:

def open_dataset(
    filename_or_obj,
    group=None,
    *
    engine=None,
    chunks=None,
    lock=None,
    cache=None,
    backend_kwargs=None,
    decode: Union[DecodingOptions, bool] = None, 
    **deprecated_kwargs
):
    if decode is None:
        decode = DecodingOptions()
    if decode is False:
        decode = DecodingOptions.disabled()
    # handle deprecated_kwargs...
    ...

Question: are decode and DecodingOptions the right names? Maybe these should still include the name "CF", e.g., decode_cf and CFDecodingOptions, given that these are specific to CF conventions?

Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)

Usage with the new interface would look like xr.open_dataset(filename, decode=False) or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False)).

This requires a little bit more typing than what we currently have, but it has a few advantages:

  1. It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
  2. It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the open_dataset interface. For example, I separated out mask and scale arguments, versus the current mask_and_scale argument.
  3. If a new backend plugin for open_dataset() needs to handle every option supported by open_dataset(), this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by the non_defaults() method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.

Describe alternatives you've considered

For the overall approach:

  1. We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
  2. We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions