Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly improve DataTree repr #9064

Merged
merged 8 commits into from
Jun 26, 2024
Merged

Slightly improve DataTree repr #9064

merged 8 commits into from
Jun 26, 2024

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jun 3, 2024

The DataTree header is now <xarray.DataTree> or <xarray.DataTree 'name'> for consistency with the Dataset and DataArray reprs.

The full path is now included for each DataTree node, rather than printing name and parent separately.

from xarray.core.datatree import DataTree
import xarray as xr

tree = DataTree.from_dict(
    {
        "/": xr.Dataset({"e": (("x",), [1, 2])}, coords={"x": [2, 3]}),
        "/b": xr.Dataset({"f": (("y",), [3])}),
        "/b/c": xr.Dataset(),
        "/b/d": xr.Dataset({"g": 4}),
    }
)

Before:

DataTree('None', parent=None)
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── DataTree('b')
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── DataTree('c')
    └── DataTree('d')
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

After:

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jun 5, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation Jun 5, 2024
@shoyer
Copy link
Member Author

shoyer commented Jun 13, 2024

This is ready for another review, after switching the repr of each node from DataTree('name') to DataTree: /path/to/name.

There are two alternatives that are worth considering:

  1. Only include name instead of the full path, e.g.,
<xarray.DataTree>
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── DataTree: b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── DataTree: c
    └── DataTree: d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

I think this is a little less readable, but maybe is a good idea if we think people will mostly be using very heavily nested trees.

  1. Use Group: in the repr instead of DataTree:, e.g.,
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Coordinates:
│     * x        (x) int64 16B 2 3
│   Data variables:
│       e        (x) int64 16B 1 2
└── Group: /b
    │   Dimensions:  (y: 1)
    │   Dimensions without coordinates: y
    │   Data variables:
    │       f        (y) int64 8B 3
    ├── Group: /b/c
    └── Group: /b/d
            Dimensions:  ()
            Data variables:
                g        int64 8B 4

I Iike the look of Group a little better than DataTree, so I would lean toward making this change. In fact, this already matches what we have in the HTML repr.

@TomNicholas
Copy link
Contributor

With this change there is now no information about the parent of the top node in the repr. This does matter - it affects what .root will point to, for example. But I could never think of a non-awkward way to include that information, which is why the top node had parent=...?

Only include name instead of the full path, e.g.,

This seems okay.

Use Group: in the repr instead of DataTree:, e.g.,

I tried to avoid using the word "Group" because now we have 3 words for the same thing: "node", "group", and arguably DataTree (since the structure is recursive).

Displaying the full path in each node might quickly get unweildy - e.g. /model_run/scenario/atmospheric_forcing/...

@shoyer
Copy link
Member Author

shoyer commented Jun 13, 2024

With this change there is now no information about the parent of the top node in the repr. This does matter - it affects what .root will point to, for example. But I could never think of a non-awkward way to include that information, which is why the top node had parent=...?

This is one reason why I added the full path -- it shows the parent in an integrated part of the repr. But if we go for only printing group names, we could also add a line showing only full path only for the root.

I tried to avoid using the word "Group" because now we have 3 words for the same thing: "node", "group", and arguably DataTree (since the structure is recursive).

I think "DataTree" and "Group" are separate concepts. DataTree is Xarray's name for a tree of groups, coordinates and data variables, in the same way that Dataset is a Xarray's name an aligned collection of coordinates and data variables, and DataArray is Xarray's name for a labeled variable. To give an analogy, "DataTree" is to "group" as "file system" is to "directory."

When you access a coordinate from a DataArray you get another DataArray, but we don't use that name in the repr. It's true that our implementation of DataTree stores nested DataTree objects (unlike the case for DataArray coordinates), but that's an implementation detail that should be opaque to users.

@TomNicholas
Copy link
Contributor

we could also add a line showing only full path only for the root.

I like the idea of having the full path to the root visible somewhere. It's more useful than just parent.

I think "DataTree" and "Group" are separate concepts.

Okay I'm convinced! 😁 I do wonder then whether we should try to replace occurrences of "node" with "group" in the docs...

@shoyer
Copy link
Member Author

shoyer commented Jun 13, 2024

Let's ask in the new DataTree meeting for opinions on only showing the name vs. the full path on each group.

@eschalkargans
Copy link

Regarding displaying the group name vs full group name, here is an example:

relative

    conditions
        geometry
        mask
            detector_footprint
                r10m
                r20m
                r60m
            l1c_classification
                r60m
        meteorology
            cams
            ecmwf
    measurements
        reflectance
            r10m
            r20m
            r60m
    quality
        mask
            r10m
            r20m
            r60m

absolute

/
    /conditions
        /conditions/geometry
        /conditions/mask
            /conditions/mask/detector_footprint
                /conditions/mask/detector_footprint/r10m
                /conditions/mask/detector_footprint/r20m
                /conditions/mask/detector_footprint/r60m
            /conditions/mask/l1c_classification
                /conditions/mask/l1c_classification/r60m
        /conditions/meteorology
            /conditions/meteorology/cams
            /conditions/meteorology/ecmwf
    /measurements
        /measurements/reflectance
            /measurements/reflectance/r10m
            /measurements/reflectance/r20m
            /measurements/reflectance/r60m
    /quality
        /quality/mask
            /quality/mask/r10m
            /quality/mask/r20m
            /quality/mask/r60m

Regarding my specific case, I don't have any opinion really on the default. However, users with long paths might be annoyed by the absolute paths for sure. But having the full complete path, copy-pastable has probably more value in the everyday life.

Copy link
Contributor

@owenlittlejohns owenlittlejohns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Quick question, probably also to @TomNicholas: is the term "Group" the word we are likely to settle on terminology-wise? (There was discussion of settling on one term for particular concepts in our call this week) I'm not sure if we wanted to decide that now, or if it can wait?

Also noting that we agreed in the same call on the full path being listed for each group/node/subtree, as this PR does.

@shoyer
Copy link
Member Author

shoyer commented Jun 26, 2024

I think "group" is the right terminology for the data in a node and its children. It's the same name already used by netCDF, HDF5 and Zarr.

@TomNicholas can you give this a formal review in GitHub? I'd love to get this merged.

Copy link
Contributor

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @shoyer. Only thing you could add is a test that checks that the group paths work properly in the case that the node you're printing is not the root of the tree (also a whatsnew entry 😉).

There is a .groups method which returns the whole path, which is consistent with the usage of the term in the repr here.

@shoyer
Copy link
Member Author

shoyer commented Jun 26, 2024

Only thing you could add is a test that checks that the group paths work properly in the case that the node you're printing is not the root of the tree

There's a test for this already inside test_repr

(also a whatsnew entry 😉).

I'm going to skip this. DataTree is not a released feature so I don't think users need to know about updates to it.

@shoyer shoyer merged commit 07b1756 into pydata:main Jun 26, 2024
28 checks passed
DataTree integration automation moved this from In progress to Done Jun 26, 2024
dcherian added a commit to dcherian/xarray that referenced this pull request Jun 28, 2024
* main:
  promote floating-point numeric datetimes to 64-bit before decoding (pydata#9182)
  also pin `numpy` in the all-but-dask CI (pydata#9184)
  temporarily remove `pydap` from CI (pydata#9183)
  temporarily pin `numpy<2` (pydata#9181)
  Change np.core.defchararray to np.char (pydata#9165) (pydata#9166)
  Fix example code formatting for CachingFileManager (pydata#9178)
  Slightly improve DataTree repr (pydata#9064)
  switch to unit `"D"` (pydata#9170)
  Docs: Add page with figure for navigating help resources (pydata#9147)
  Add test for pydata#9155 (pydata#9161)
  Remove mypy exclusions for a couple more libraries (pydata#9160)
  Include numbagg in type checks (pydata#9159)
  Improve zarr chunks docs (pydata#9140)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Development

Successfully merging this pull request may close these issues.

None yet

4 participants