Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply to dataset #4863

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

apply to dataset #4863

wants to merge 36 commits into from

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Feb 5, 2021

as discussed in #4837, this adds a method that applies a function to a DataArray by first converting it to a temporary dataset using _to_temp_dataset, applies the function and converts it back. I'm not really happy with the name but I can't find a better one.

This function is really similar to pipe, so I guess a keyword argument to pipe would work, too. The disadvantage of that is that pipe passes all kwargs to the passed function, which means we would shadow a specific kwarg.

@dcherian
Copy link
Contributor

dcherian commented Feb 5, 2021

Should this be a top-level

xr.apply_as_dataset(func_that_expects_datasets, dataset_or_dataarray, args=None, kwargs=None)

instead?

@keewis
Copy link
Collaborator Author

keewis commented Feb 5, 2021

sure, although I would use

xr.apply_as_dataset(func_that_expects_datasets, dataset_or_dataarray, *args, **kwargs)

Comment on lines 1165 to 1166
If a ``DataArray``, result will have the same name as ``obj`` but the single data
variable in the temporary ``Dataset`` will always have a generic name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this as simple as "DataArrays will retain their name"? If so, maybe we don't need any notes? (very possible I'm missing some of the complexity, as ever)

Copy link
Collaborator Author

@keewis keewis Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I might not have explained that correctly. The temporary Dataset generates always has a <this-array> variable, but the original name will be restored by _from_temp_dataset.

Edit: I rewrote it, is that easier to understand?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the name of the variable of the temporary dataset matter to the user though? To what extent is that just an implementation detail?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function will see the Dataset so it might be important to keep the note. For example, this would need to change the name in the units dict from None to <this-array> to work correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in core/parallel.py there's a dataarray_to_dataset (and inverse dataset_to_dataarray) function that preserves name if possible. I think name preservation is a good thing for a user-facing function.

@max-sixty
Copy link
Collaborator

I think this is great!

I agree the name doesn't sing, but also I can't think of a better one...

@max-sixty
Copy link
Collaborator

I reckon this is ready to merge!

One final pause on the name — apply isn't used in python much — would something with call work? But call_as_dataset sounds like we're call-ing a dataset rather than a function. call_with_dataset?. Unless anyone has any bright ideas, I think this is great as-is.

@keewis
Copy link
Collaborator Author

keewis commented Mar 11, 2021

call_on_dataset, maybe?

@max-sixty
Copy link
Collaborator

call_on_dataset, maybe?

I like that!

@keewis
Copy link
Collaborator Author

keewis commented Mar 15, 2021

I switched to dataarray_to_dataset instead of _to_temp_dataset. However, it might be important to be able to reference the data variable by name, which leaves us two choices: move _THIS_ARRAY to the public namespace or rename to a string. I chose the latter (not sure if there's a better choice) and modified the notes.

@max-sixty
Copy link
Collaborator

Looks great!

(no view on sentinel vs string though)

if isinstance(obj, DataArray):
ds = dataarray_to_dataset(obj)
if obj.name is None:
ds = ds.rename({_THIS_ARRAY: "<this-array>"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option would be to use backends.core.api.DATAARRAY_VARIABLE which is used when writing a DataArray to netcdf (I think). I don't feel strongly about this.

Copy link
Collaborator Author

@keewis keewis Apr 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I don't know which name would be better, either. THIS_ARRAY is not part of the public API so we can't use it, and None obviously doesn't work, either. Using a string seems like a good choice but the exact value will almost always be arbitrary. The advantage of "<this-array>" is that it is the string representation of THIS_ARRAY, but that's the only reason I chose that. DATAARRAY_VARIABLE or DATAARRAY_NAME have the value f"__xarray_dataarray_{type}__", but neither of them are actually part of the public API (I think?), which means they have the same issue as THIS_ARRAY (not sure if that's actually a problem, though: the simply reference a string).

Copy link
Collaborator

@max-sixty max-sixty Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought @keewis 's idea of self was good from #5493 (comment), to the extent that could apply here

Edit: but then @dcherian pointed out this will fail if there's a dim called self!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this in xarray-contrib/pint-xarray#110, and for that at least it's actually an advantage that I have to pass the name. Not sure if that's the same for every other use case though (but defining the name explicitly is not much overhead so it should be fine)

@keewis
Copy link
Collaborator Author

keewis commented Apr 10, 2021

upon further consideration, I think we have the choice between using a generic name, raising for unnamed DataArray objects or adding a name parameter, with a default to None / the THIS_ARRAY object which would be similar to saying "I don't care about the name" (and either always use that name or only for unnamed DataArray objects).

I think I would prefer option 3a (always use the passed name, even if the DataArray already had a name).

@max-sixty
Copy link
Collaborator

upon further consideration, I think we have the choice between using a generic name, raising for unnamed DataArray objects or adding a name parameter, with a default to None / the THIS_ARRAY object which would be similar to saying "I don't care about the name" (and either always use that name or only for unnamed DataArray objects).

I think I would prefer option 3a (always use the passed name, even if the DataArray already had a name).

IIUC people can do .rename rather than passing a name. Unless I'm missing something I would lightly vote to remove that arg.

But no strong preference and +1 to merging this.

@TomNicholas TomNicholas mentioned this pull request Jul 8, 2021
8 tasks
@keewis
Copy link
Collaborator Author

keewis commented Jul 23, 2021

IIUC people can do .rename rather than passing a name.

passing the name makes it cleaner, but we could also add a default value (THIS_ARRAY?).

@github-actions
Copy link
Contributor

Unit Test Results

         6 files           6 suites   53m 52s ⏱️
16 203 tests 14 468 ✔️ 1 735 💤 0 ❌
90 414 runs  82 238 ✔️ 8 176 💤 0 ❌

Results for commit 12400cb.

@max-sixty
Copy link
Collaborator

passing the name makes it cleaner, but we could also add a default value (THIS_ARRAY?).

Yes agree! No strong view on what it should be, THIS_ARRAY seems fine.

Would we remove the name before passing back the array?

@keewis
Copy link
Collaborator Author

keewis commented Aug 4, 2021

that name is temporary, which means it is only visible within the user function. call_on_dataset returns the original name (which I think means the user function shouldn't rename, because that would be silently undone)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

expose _to_temp_dataset / _from_temp_dataset as semi-public API?
3 participants