Ignore missing dims when mapping over tree #67

Open
TomNicholas opened this issue Mar 3, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@TomNicholas
Collaborator

TomNicholas commented Mar 3, 2022

This tree has a dimension present in some nodes and not others (the "people" dimension).

DataTree('root', parent=None)
│   Dimensions:  (people: 2)
│   Coordinates:
│     * people   (people) <U5 'alice' 'bob'
│       species  <U5 'human'
│   Data variables:
│       heights  (people) float64 1.57 1.82
└── DataTree('simulation')
    ├── DataTree('coarse')
    │   Dimensions:  (x: 2, y: 3)
    │   Coordinates:
    │     * x        (x) int64 10 20
    │   Dimensions without coordinates: y
    │   Data variables:
    │       foo      (x, y) float64 0.1242 -0.2324 0.2469 0.5168 0.8391 0.8686
    │       bar      (x) int64 1 2
    │       baz      float64 3.142
    └── DataTree('fine')
        Dimensions:  (x: 6, y: 3)
        Coordinates:
          * x        (x) int64 10 12 14 16 18 20
        Dimensions without coordinates: y
        Data variables:
            foo      (x, y) float64 0.1242 -0.2324 0.2469 ... 0.5168 0.8391 0.8686
            bar      (x) float64 1.0 1.2 1.4 1.6 1.8 2.0
            baz      float64 3.142

If a user calls dt.mean(dim='people'), then at the moment this will raise an error. That's because it maps the .mean call over each group, and when it gets to either the 'coarse' group or the 'fine' group it will not find a dimension called 'people'.

However, the user might want to take the mean only over the groups where this makes sense, and ignore the rest.

I think the best solution is to have a missing_dims argument, like xarray's .isel already has. Then the user can do dt.mean(dim='people', missing_dims='ignore').

I think actually implementing this only requires changes in xarray, not here, because those changes should propagate down to datatree. See pydata/xarray#5030.
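As a rough illustration of the proposed semantics, here is a minimal sketch in plain Python. It does not use xarray or datatree: the `ToyNode` class and the `mean_over` function are hypothetical stand-ins, restricted to 1-D variables, and the `'raise'`/`'ignore'` options only mirror the idea of the `missing_dims` argument mentioned above.

```python
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class ToyNode:
    """Hypothetical stand-in for one tree node holding 1-D variables."""
    # Each variable is stored as (dimension name or None, values).
    variables: dict = field(default_factory=dict)

    @property
    def dims(self):
        # Scalars (dimension None) do not contribute a dimension.
        return {dim for dim, _ in self.variables.values() if dim is not None}


def mean_over(node, dim, missing_dims="raise"):
    """Average every variable along `dim`; optionally skip nodes lacking it."""
    if dim not in node.dims:
        if missing_dims == "ignore":
            return node  # nothing to reduce here, pass the node through
        raise ValueError(f"dimension {dim!r} not found in {sorted(node.dims)}")
    reduced = {}
    for name, (var_dim, values) in node.variables.items():
        if var_dim == dim:
            reduced[name] = (None, fmean(values))  # reduced to a scalar
        else:
            reduced[name] = (var_dim, values)
    return ToyNode(reduced)


root = ToyNode({"heights": ("people", [1.57, 1.82])})
coarse = ToyNode({"bar": ("x", [1, 2])})

root_mean = mean_over(root, "people")
coarse_mean = mean_over(coarse, "people", missing_dims="ignore")
```

With `missing_dims="ignore"` the 'coarse' node is returned unchanged, while the default still raises for a node without 'people'.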

@abkfenris

Continuing from related discussion in https://discourse.pangeo.io/t/xarray-and-collections-of-forecasts/3054/6

It would also be helpful to have it on .sel for my usage.

I haven't dug around in the guts of datatree enough to understand how it's mapping functions to each group, but would it be possible to add a missing_dims kwarg at the mapping level? Then use it to decide whether to catch KeyErrors from the underlying dataset methods or not?

If I'm understanding things right, Datatree uses a mixin (MappedDatasetMethodsMixin) to manage mapping methods to datasets. Could map_over_subtree pick missing_dims off the kwargs?

@TomNicholas
Collaborator Author

Hi @abkfenris !

datatree.mapping is where the guts of the mapping occur. The mixin just steals certain methods from xarray.Dataset and wraps them with map_over_subtree. The mapping code is basically just this:

def map_over_subtree(func, dt, *args, **kwargs):
    new_tree = ...
    for node in dt.subtree:
        # apply func to the dataset stored at each node
        result_ds = func(node.ds, *args, **kwargs)
        new_tree[node.path] = result_ds
    return new_tree

but generalised to potentially map over multiple trees simultaneously (e.g. for binary operations like __add__), with error checking, and usable as a decorator.
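To make the loop above concrete, here is a self-contained toy version. It assumes a tree can be modelled as a plain dict from node path to a dict of per-node data, which is not how datatree actually stores things; `toy_map_over_subtree` and `select_dim` are hypothetical names. It shows how a single node lacking a key aborts the whole mapping.

```python
def toy_map_over_subtree(func, tree, *args, **kwargs):
    """Apply `func` to every node of a toy tree (dict of path -> node data)."""
    new_tree = {}
    for path, node_data in tree.items():
        new_tree[path] = func(node_data, *args, **kwargs)
    return new_tree


def select_dim(node_data, dim):
    """Toy per-node operation: look up the values stored under `dim`."""
    return node_data[dim]  # raises KeyError when the node lacks `dim`


tree = {
    "/": {"people": ["alice", "bob"]},
    "/simulation/coarse": {"x": [10, 20]},
    "/simulation/fine": {"x": [10, 12, 14, 16, 18, 20]},
}

# Works for a function that is valid on every node.
dim_names = toy_map_over_subtree(lambda node_data: sorted(node_data), tree)

# Fails as soon as the loop reaches a node without "people".
try:
    toy_map_over_subtree(select_dim, tree, "people")
    failed = False
except KeyError:
    failed = True
```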

> would it be possible to add a missing_dims kwarg at the mapping level?

We could, but missing_dims wouldn't make sense for every function we might map - that's the challenge here. That's why I suggested we might want to add something to map_over_subtree that allows you to ignore any KeyError? Or another approach would be to modify .sel upstream.

@abkfenris

I wonder if ignoring KeyError might be too broad and could catch more than intended (I'm thinking Dask or fsspec KeyErrors bubbling up). Might be worth exploring getting more tightly defined errors upstream.
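One way to picture this trade-off is the sketch below. It is plain Python, not datatree or xarray: `MissingDimError`, `take`, and the `missing_dims` keyword on the toy mapper are all hypothetical. The idea is that a dedicated exception type lets the mapping skip nodes that lack a dimension while still surfacing unrelated KeyErrors.

```python
class MissingDimError(KeyError):
    """Hypothetical, narrowly scoped error for a dimension absent from a node."""


def take(node_data, dim):
    """Toy per-node operation that reports a missing dimension explicitly."""
    if dim not in node_data["dims"]:
        raise MissingDimError(dim)
    # Any other failed lookup still surfaces as a plain KeyError.
    return node_data["values"][dim]


def toy_map(func, tree, *args, missing_dims="raise", **kwargs):
    """Map `func` over a toy tree, optionally skipping nodes lacking a dim."""
    new_tree = {}
    for path, node_data in tree.items():
        try:
            new_tree[path] = func(node_data, *args, **kwargs)
        except MissingDimError:
            if missing_dims != "ignore":
                raise
            new_tree[path] = node_data  # leave the node untouched
    return new_tree


tree = {
    "/": {"dims": {"people"}, "values": {"people": ["alice", "bob"]}},
    "/simulation/coarse": {"dims": {"x"}, "values": {"x": [10, 20]}},
}
result = toy_map(take, tree, "people", missing_dims="ignore")

# A node whose metadata and data disagree raises an ordinary KeyError,
# which is deliberately not swallowed by missing_dims="ignore".
broken = {"/": {"dims": {"people"}, "values": {}}}
try:
    toy_map(take, broken, "people", missing_dims="ignore")
    unrelated_error_surfaced = False
except MissingDimError:
    unrelated_error_surfaced = False
except KeyError:
    unrelated_error_surfaced = True
```

Catching a bare KeyError in `toy_map` instead would also have hidden the failure in `broken`, which is the concern raised above.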

@TomNicholas
Collaborator Author

TomNicholas commented Jan 9, 2023 via email

@TomNicholas
Collaborator Author

See pydata/xarray#8949 for a much more thought-out solution to this problem
