Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When creating a DataTree from a Dataset with path-like variable, subgroups are expected to be created #311

Open
etienneschalk opened this issue Feb 9, 2024 · 7 comments · May be fixed by #314

Comments

@etienneschalk
Copy link
Contributor

When creating a DataTree from a Dataset with path-like variable, subgroups are expected to be created

from pathlib import Path

import numpy as np
import xarray as xr
import datatree as dt 
from xarray import open_zarr
# Set to True to get rich HTML representations in an interactive Notebook session
# Set to False to get textual representations ready to be converted to markdown for issue report

INTERACTIVE = False 

# Convert to markdown with
# jupyter nbconvert --to markdown notebooks/unexpected-behaviour-dt-from-ds.ipynb

Test Data Initialization

A Dataset containing a single variable, with a name containing slashes, representing a path.

xda = xr.DataArray(
    np.arange(3 * 18).reshape(3, 18),
    coords={"label": list("abc"), "z": list(range(18))},
)
xda = xda.chunk({"label": 2, "z": 4})
xds = xr.Dataset({"group/subgroup/my_variable": xda})
xds if INTERACTIVE else print(xds)
<xarray.Dataset>
Dimensions:                     (label: 3, z: 18)
Coordinates:
  * label                       (label) <U1 'a' 'b' 'c'
  * z                           (z) int64 0 1 2 3 4 5 6 ... 11 12 13 14 15 16 17
Data variables:
    group/subgroup/my_variable  (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>

This flat Dataset containing path-like variable name is expected to produce groups and subgroups
once injected into a DataTree.

Unfortunately, it does not happen. Instead it produces a flat DataTree with a single variable,
with an illegal name (containing slashes).

xdt = dt.DataTree(xds)
xdt if INTERACTIVE else print(xdt)
DataTree('None', parent=None)
    Dimensions:                     (label: 3, z: 18)
    Coordinates:
      * label                       (label) <U1 'a' 'b' 'c'
      * z                           (z) int64 0 1 2 3 4 5 6 ... 11 12 13 14 15 16 17
    Data variables:
        group/subgroup/my_variable  (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>

This is not only cosmetic. Indeed, trying to access this malformed variable name will result in an error:

try: 
    xdt["group/subgroup/my_variable"]
except KeyError as err:
    print(err)
'Could not find node at group/subgroup/my_variable'

The expected behaviour would be the one of using __setitem__:

xdt = dt.DataTree()
for varname, xda in xds.items():
    xdt[varname] = xda
xdt if INTERACTIVE else print(xdt)
DataTree('None', parent=None)
└── DataTree('group')
    └── DataTree('subgroup')
            Dimensions:      (label: 3, z: 18)
            Coordinates:
              * label        (label) <U1 'a' 'b' 'c'
              * z            (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
            Data variables:
                my_variable  (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>

Technical Hints

__setitem__ wraps the key into a NodePath:

path = NodePath(key)

Probably this section of the DataTree initialization logic would need to be adapted:

self._replace(

@etienneschalk
Copy link
Contributor Author

Note: this issue was generated from a notebook. You can use it to reproduce locally the bug.

@TomNicholas
Copy link
Collaborator

Thank you for spotting this!

This flat Dataset containing path-like variable name is expected to produce groups and subgroups once injected into a DataTree.

Yes, or a simpler solution would be just to raise an error if someone tries to pass a Dataset which has path-like variable names. We are then basically assuming that (a) any Dataset passed is supposed to represent a group, and (b) a single group cannot have path-like variable names.

I think there is also an argument to be made with round-tripping. I think we want this property to hold always:

ds  # start with a dataset
dt = DataTree(ds)
new_ds = dt.to_dataset()  # expect new_ds == ds

If ds contains path-like variable names, that are then automatically turned into different groups, this property won't work. If we just forbid it, we can make sure it works in all other cases.

@etienneschalk
Copy link
Contributor Author

#311 (comment)

Hello, thanks for your answer!

Indeed the round-trip capability seems the cleanest default behaviour. A Dataset is supposed to be a group. I think what I needed already exists: it's DataTree.from_dict .

def from_dict(

I can have a look to fix the bug (forbidding Datasets containing path-like variable names), and I can suggest an error message like this to help the user that probably wanted this "auto-creating-group" behaviour but did not knew about .from_dict:

Given Dataset contains path-like variable names. 
A Dataset represents a group, and a single group cannot have path-like variable names.
Consider `DataTree.from_dict` to automatically create groups from a mapping of paths to data objects

(if you have an idea to shorten and improve it I am interested!)

Here is what I really wanted to do when creating the issue:

from pathlib import PurePosixPath

# Note: paths represent variable names ; the mapping should only contain groups, hence `.parent`
# Note 2: the named DataArray is renamed to only keep the name part of the path.
mapping_path_to_dataarray = {
    PurePosixPath(str(varname)).parent: xda.rename(PurePosixPath(str(varname)).name)
    for varname, xda in xds.items()
}
mapping_path_to_dataarray
{PurePosixPath('group/subgroup'): <xarray.DataArray 'my_variable' (label: 3, z: 18)>
 dask.array<xarray-<this-array>, shape=(3, 18), dtype=int64, chunksize=(2, 4), chunktype=numpy.ndarray>
 Coordinates:
   * label    (label) <U1 'a' 'b' 'c'
   * z        (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17}

Though this logic will work, it seems a little bit over-complicated to me.
However, a Dataset carrying variables names containing paths is a little bit
a misuse of a Dataset. A DataTree should be used in the first place
if possible.

xdt = dt.DataTree.from_dict(mapping_path_to_dataarray)
xdt if INTERACTIVE else print(xdt)
DataTree('None', parent=None)
└── DataTree('group')
    └── DataTree('subgroup')
            Dimensions:      (label: 3, z: 18)
            Coordinates:
              * label        (label) <U1 'a' 'b' 'c'
              * z            (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
            Data variables:
                my_variable  (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>

Note: Pylance is not happy of usage of dict of PurePosixPaths. It only expects str as keys.
Maybe it could be worth it to make it accepts Pure Paths too?

Argument of type "dict[PurePosixPath, DataArray]" cannot be assigned to parameter "d" of type "MutableMapping[str, Dataset | DataArray | DataTree[Unknown] | None]" in function "from_dict"
  "dict[PurePosixPath, DataArray]" is incompatible with "MutableMapping[str, Dataset | DataArray | DataTree[Unknown] | None]"
    Type parameter "_KT@MutableMapping" is invariant, but "PurePosixPath" is not the same as "str"
    Type parameter "_VT@MutableMapping" is invariant, but "DataArray" is not the same as "Dataset | DataArray | DataTree[Unknown] | None"PylancereportArgumentType

Reconverting the DataTree gives an empty Dataset, as the root contains only groups
but no concrete variables:

reconverted = xdt.to_dataset()
reconverted if INTERACTIVE else print(reconverted)
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

Reconverting the subgroup containing the variable is successful

reconverted = xdt["group/subgroup"].to_dataset()
reconverted if INTERACTIVE else print(reconverted)
<xarray.Dataset>
Dimensions:      (label: 3, z: 18)
Coordinates:
  * label        (label) <U1 'a' 'b' 'c'
  * z            (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Data variables:
    my_variable  (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>

@etienneschalk
Copy link
Contributor Author

Complement:

After re-reading all of this, I think the most natural behaviour from a user perspective is the one of __setitem__:

xdt = dt.DataTree()
for varname, xda in xds.items():
    xdt[varname] = xda

The only issue I have is that this make the for loop unavoidable as it contains an assignment (imperative not declarative)

The over-complicated solution...

mapping_path_to_dataarray = {
    PurePosixPath(str(varname)).parent: xda.rename(PurePosixPath(str(varname)).name)
    for varname, xda in xds.items()
}

...was only to get a dict-comprehension rather than a for loop with assignments.

@TomNicholas
Copy link
Collaborator

(if you have an idea to shorten and improve it I am interested!)

That looks good - I think long error messages are good if they are adding more context, like yours is. Only thing I would add is an explicit list of the offending variable names.

Note: Pylance is not happy of usage of dict of PurePosixPaths. It only expects str as keys.
Maybe it could be worth it to make it accepts Pure Paths too?

The Posix part is to forbid backslashes, only allowing forward slashes. Maybe we can forbid those at runtime whilst relaxing the type though.


After re-reading all of this, I think the most natural behaviour from a user perspective is the one of __setitem__:

Why? What do you dislike about the "one dataset per group" logic?

@etienneschalk
Copy link
Contributor Author

Okay for adding the offending variable names!

I completely agree with the "one dataset per group" logic. This is more from a usage perspective where I find these 3 lines to be the simplest to convert a Dataset containing paths to a DataTree:

xdt = dt.DataTree()
for varname, xda in xds.items():
    xdt[varname] = xda

The user does not have to think that much: the assignment will take care of creating groups and renaming the DataArrays from the source Dataset accordingly. The only thing I quite dislike is having to make the for-loop and assignments, not having a declarative comprehension-like approach to achieve what these lines do, without becoming verbose. But I must admit this is mainly about aesthetics rather than a true need, so I don't think anything more needs to be done on this topic!

@TomNicholas
Copy link
Collaborator

TomNicholas commented Feb 14, 2024

I personally think this little loop is fine - it's clear, explicit and should't be buggy (once we fix the bug you reported in this issue!). If many other people appear to say that they would like a special function for ingesting this type of non-compliant dataset then we could revisit the idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants