
ENH: use dask.array.apply_gufunc in xr.apply_ufunc #4060

Merged

merged 45 commits into pydata:master from fix-1815 on Aug 19, 2020

Conversation

@kmuehlbauer (Contributor) commented May 14, 2020

Use dask.array.apply_gufunc in xr.apply_ufunc for multiple outputs when dask='parallelized'; add/fix tests.

Remaining Issues:

  • find a fitting name for the current dask_gufunc_kwargs
  • rephrase the dask docs to fit the new behaviour
  • combine output_core_dims and output_sizes, e.g. xr.apply_ufunc(..., output_core_dims=[{"abc": 2}])
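As a rough sketch of what multiple outputs mean at the gufunc level (numpy only, hypothetical helper names; this mirrors what dask.array.apply_gufunc does per block when vectorize=True, without requiring dask):

```python
import numpy as np

def min_max(x):
    # one core dimension "n" in, two scalar outputs out
    return x.min(), x.max()

# np.vectorize with a gufunc signature is the numpy-level analogue of
# what dask.array.apply_gufunc applies per block for multiple outputs
vec_min_max = np.vectorize(min_max, signature="(n)->(),()")

arr = np.arange(6).reshape(2, 3)   # loop dim 2, core dim n=3
lo, hi = vec_min_max(arr)          # two arrays of shape (2,)
```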

@kmuehlbauer (Contributor Author)

This would need some docstring changes too. But first I want to check whether I've missed anything vital in the implementation.

@kmuehlbauer (Contributor Author)

This is ready for review from my side.

@dcherian dcherian added this to In progress in Automatic parallelization via automation May 14, 2020
@mathause (Collaborator)

  • It might be good to add a test with a reduction and one with vectorize=True.
  • Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.
  • dask.array.apply_gufunc does all sorts of stuff, e.g. inferring meta. This could potentially solve #4015 ("apply_ufunc gives wrong dtype with dask=parallelized and vectorize=True"; see PR #4022, "Fix/apply ufunc meta dtype") and simplify the call signature of apply_ufunc?

https://github.com/dask/dask/blob/3573b2ddca81aeb41a7def6dd4194020f853ab18/dask/array/gufunc.py#L175
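The meta inference mentioned here boils down to probing the function with zero-sized inputs and reading off the result's dtype. A minimal sketch of that idea (this is not dask's actual code; `infer_output_dtype` is a hypothetical helper):

```python
import numpy as np

def infer_output_dtype(func, *arrays):
    """Probe ``func`` with zero-sized arrays of the real input dtypes
    and read the output dtype from the (empty) result."""
    probes = [np.empty((0,), dtype=a.dtype) for a in arrays]
    return func(*probes).dtype

x = np.arange(3, dtype="int32")
dtype = infer_output_dtype(np.sin, x)   # sin promotes int32 to float64
```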

@kmuehlbauer (Contributor Author)

Thanks @mathause for your comments and for raising those questions. JFTR, I took the road from #1815, so my explicit use case was multiple (dask) outputs.

  • It might be good to add a test with a reduction and one with vectorize=True.

I'll try to add some tests for multiple outputs using dask.

  • Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.

AFAIK, apply_gufunc wasn't available at the time these functions were introduced. There's a good chance that apply_gufunc can be used for handling single-output dask too.

That's a good question. If you want me to go the long way, please be aware that I'm a novice in xarray as well as in dask. A complete refactor of apply_ufunc would be quite some challenge.

@mathause (Collaborator)

Ah yes, I see (#1815 (comment)). dask.array.apply_gufunc should also be able to handle a single output.

A complete refactor of apply_ufunc would be quite some challenge.

Indeed. I think it could simplify _apply_blockwise (and might make the meta keyword obsolete), but it would be good if someone with more dask experience could weigh in.

@dcherian @shoyer

@mathause mathause mentioned this pull request May 15, 2020
4 tasks
@shoyer (Member) commented May 15, 2020

  • Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.

AFAIK, apply_gufunc wasn't available at the time these functions were introduced. There's a good chance that apply_gufunc can be used for handling single-output dask too.

Exactly. It would be nice to remove the use of blockwise entirely in favor of apply_gufunc.

@kmuehlbauer (Contributor Author) commented May 19, 2020

I've given this a try, but it will need some design decisions.

  • Currently, vectorize is always applied if requested, before falling through the if/else.
    dask.array.apply_gufunc takes vectorize as a parameter, so for dask we do not need to apply vectorization ourselves. We would need to apply vectorize only for non-dask cases (maybe just before calling the final function).
  • Currently, dask is only handled if the dask keyword is given ('allowed' or 'parallelized'). From my perspective the dask keyword is not needed any more: we could just divert to apply_gufunc when dask-backed arrays are detected.

This will really have quite an impact on the code/tests. I'll come up with an updated pull request shortly, but any thoughts/remarks whatsoever are very much appreciated.
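The "detect dask-backed arrays" idea in the second bullet can be sketched with the dask collection protocol, which real dask collections expose via `__dask_graph__` (the helper name and the fake class below are illustrative only, so the sketch runs without dask installed):

```python
def any_dask_backed(*args):
    """Sketch of auto-detection: real dask collections implement the
    collection protocol method ``__dask_graph__``; numpy arrays and
    plain Python objects do not."""
    return any(hasattr(a, "__dask_graph__") for a in args)

class FakeDaskArray:
    # stand-in for a dask collection so this runs without dask
    def __dask_graph__(self):
        return {}
```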

@mathause (Collaborator)

Thanks! That would be quite cool!

dask.array.apply_gufunc does some fancy stuff with np.vectorize (to determine the output_dtypes), so I would not vectorize the function ourselves. On the other point I don't have a qualified opinion.

@pep8speaks commented May 20, 2020

Hello @kmuehlbauer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-19 05:41:15 UTC

@kmuehlbauer (Contributor Author)

@mathause @shoyer

First attempt at using dask.array.apply_gufunc in xr.apply_ufunc. I've added a list of problems to the topmost comment of this PR so we don't lose track of them. Please extend that list if needed.

Most problematic issue now: xr.dot doesn't work well with apply_gufunc with regard to core dimensions and chunking.

@kmuehlbauer (Contributor Author)

From looking at the tests, it seems that setting allow_rechunk=True would solve many issues. I have no idea about the implications for memory usage (docstring: "Warning: enabling this can increase memory usage significantly"). Might apply_gufunc not be suitable for processing e.g. dot?

@mathause (Collaborator)

Should you keep the dask="allowed" branch? That might solve some of the issues (instead of using allow_rechunk=True). For example, dask has its own einsum implementation (used in xr.dot), so it may not be necessary to pipe that through dask.array.apply_gufunc (from the docstring: this function is like np.vectorize, but for blocks of dask arrays). However, I am only speculating.
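For context, the operation behind xr.dot is a contraction over a shared core dimension, which can be written directly as an einsum; shown here with numpy only, as a sketch of why piping it through a block-wise gufunc isn't necessary when the backend has a native einsum:

```python
import numpy as np

# dot product over a shared core dimension "i", preserving loop dims;
# dask executes the same einsum natively with its own implementation
a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(2, 3)
dotted = np.einsum("...i,...i->...", a, b)   # shape (2,)
```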

@kmuehlbauer (Contributor Author)

@mathause Great! Seems like this works better, thanks! Will update the PR after some more tests etc.

@kmuehlbauer (Contributor Author)

@mathause Only one failing test, which is connected to the meta stuff you handle in #4022. Any suggestions on that topic? I've removed meta completely since it was only needed in dask.array.blockwise but not in dask.array.apply_gufunc.

@kmuehlbauer (Contributor Author)

@mathause All tests green, good starting point for review. Please notify other people who should have a look at this.

There are still things to discuss:

  • keywords of dask.array.apply_gufunc
  • how to properly handle deprecation (e.g. meta)
  • docstring
  • adding/revising tests

@shoyer (Member) commented May 20, 2020

The original motivation for requiring dask='allowed' is that I was concerned that users would put a function that coerces its arguments into NumPy arrays into apply_ufunc (e.g., like many functions from SciPy), which could have surprisingly bad performance when called on dask arrays due to automatic coercion.

Maybe this is too defensive/surprising, and could be relaxed. We don't really have any guard-rails like this elsewhere in xarray.

@mathause (Collaborator)

Maybe this is too defensive/surprising, and could be relaxed.

You would remove the dask="forbidden" branch and not dask="parallelized"?

For the functions that don't handle dask arrays gracefully, dask="parallelized" would be the better option?


Very cool - good progress.

  • I guess you'll have to properly deprecate meta, something along the lines of: "meta is no longer necessary and has no effect. It will be removed in a future version."
  • I think it would be good to pass allow_rechunk.

I'll only be able to look at it properly next week.

@shoyer (Member) commented May 20, 2020

Maybe this is too defensive/surprising, and could be relaxed.

You would remove the dask="forbidden" branch and not dask="parallelized"?

For the functions that don't handle dask arrays gracefully, dask="parallelized" would be the better option?

This is probably another good motivation: defaulting to dask='forbidden' forces users to make an explicit choice about whether or not use dask='parallelized'.

The problem is that we don't have any way to detect ahead of time whether the applied function already supports dask arrays (e.g., if it is built-up out of functions from dask.array). If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

@dcherian (Contributor)

If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

I think we still need all the current options for the dask kwarg. There can be disastrous consequences so it's good to make users explicitly choose the behaviour they want.

howto properly handle deprecation (eg. meta)

I don't think we should deprecate meta. Not all user functions can deal with zero shaped inputs, so automatically inferring meta need not always work. We've had to add a similar feature for map_blocks (#3575) so I think meta should stay.
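The "zero shaped inputs" problem can be seen with a tiny numpy example (illustrative only): a function that is perfectly fine on real blocks fails on the zero-sized probe array used for automatic meta/dtype inference, which is why an explicit meta escape hatch stays useful.

```python
import numpy as np

def first_element(x):
    # fine on real blocks, but indexing fails on the zero-shaped
    # probe array used for automatic meta/dtype inference
    return x[0]

value = first_element(np.array([7, 8, 9]))
try:
    first_element(np.empty((0,), dtype=float))  # meta-style probe
    inference_possible = True
except IndexError:
    inference_possible = False
```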

keywords of dask.array.apply_gufunc

Shall we add a new dask_gufunc_kwargs and pass that down to apply_gufunc?

@mathause (Collaborator)

I don't see meta listed in the docs, which is why I thought it's not needed. But if it is handled in dask.array.apply_gufunc, it can of course stay.

I only realised the exact distinction between "allowed" and "parallelized" today, i.e. that "parallelized" is kind of the dask equivalent of np.vectorize. I can suggest something for the docstring (e.g. prefer "allowed" if ``func`` natively handles dask arrays, or so).
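The distinction can be captured in a toy model (plain Python, hypothetical helper name): "parallelized" applies the function block by block, the way np.vectorize applies it element by element, whereas "allowed" hands the whole chunked array to the function.

```python
def parallelized_sketch(func, chunks):
    """Toy model of dask='parallelized': apply ``func`` to each block
    independently, like np.vectorize but over blocks instead of
    elements. Under dask='allowed', ``func`` would instead receive
    the whole (chunked) array and manage chunking itself."""
    return [func(block) for block in chunks]

block_sums = parallelized_sketch(sum, [[1, 2, 3], [4, 5, 6]])
```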

@dcherian (Contributor)

Good point @mathause. Looks like apply_gufunc tries to make blockwise infer meta and then does its own thing when that fails:
https://github.com/dask/dask/blob/3b92efff2e779f59e95e05af9b8f371d56227d02/dask/array/gufunc.py#L417-L440. I don't understand how this can work in all possible cases.

@kmuehlbauer (Contributor Author)

I'll only be able to look at it properly next week.

@mathause I'll leave the PR unchanged and catch up with you next week.

@shoyer @dcherian Thanks for your comments. Please let me know, which tests should be added to check for any possible surprises with this change to apply_gufunc.

The problem is that we don't have any way to detect ahead of time whether the applied function already supports dask arrays (e.g., if it is built-up out of functions from dask.array). If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

(Attn: not a native English speaker here, so bear with me if something sounds clunky or doesn't match exactly.)
Then we would have to keep dask='forbidden' as the default, as well as 'parallelized' and 'allowed', to force the decision onto the user. Maybe the keyword values themselves could be a bit clearer. In the 'allowed' case the function in question has to natively support dask arrays, so I would use dask='native' there. For the 'parallelized' case this PR proposes to use dask.array.apply_gufunc (generalized ufunc). So either we stick with 'parallelized' or we try to find a better-fitting name (e.g. dask='gufunc').

For the keywords I think @dcherian's proposal of something like dask_gufunc_kwargs (or gufunc_kwargs) is useful (it would match dask='gufunc'), although only two keywords seem worth feeding through (keepdims, allow_rechunk).
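The pass-through could look roughly like this (a hypothetical sketch, not xarray's actual implementation; the allowed-key set is an assumption based on the keywords discussed in this thread):

```python
# keywords discussed in this thread as candidates for forwarding
_ALLOWED = {"keepdims", "allow_rechunk", "output_sizes", "meta"}

def check_dask_gufunc_kwargs(dask_gufunc_kwargs=None):
    """Validate a dask_gufunc_kwargs dict before forwarding it to
    dask.array.apply_gufunc, rejecting unknown keywords early."""
    kwargs = dict(dask_gufunc_kwargs or {})
    unknown = set(kwargs) - _ALLOWED
    if unknown:
        raise TypeError(f"unsupported dask_gufunc_kwargs: {sorted(unknown)}")
    return kwargs

checked = check_dask_gufunc_kwargs({"allow_rechunk": True})
```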

@kmuehlbauer (Contributor Author)

@keewis @mathause Thanks for the review. I've added a checklist of open issues with this PR to the first post; some might be solved in a follow-up PR. It would be good to know which need to go in here, so I can add them.

@keewis (Collaborator) commented Aug 17, 2020

I think all of these can be done in a new PR, we just have to make sure to include them in the next release (which might need to be soon so we regain compatibility with the most recent pandas).

@kmuehlbauer (Contributor Author)

Great, then it looks like it's finally done. 😃

@kmuehlbauer (Contributor Author)

While doing a last review I found another small glitch. I'll come back in the next few days to see if anything needs to be done from the reviewers' side.

@kmuehlbauer (Contributor Author)

@mathause I've merged the latest master into this PR to hopefully get all tests green. The previous run had some problems with a conda error in the MinimumVersions job.

Please let me know if there is anything for me to do to get this merged.

@shoyer (Member) left a comment

Thanks for your patience here with the slow reviews. Looking this over, I have a suggestion for how to improve the warnings, but otherwise this looks good!

warnings.warn(
"``meta`` should be given in the ``dask_gufunc_kwargs`` parameter."
" It will be removed as direct parameter in a future version."
)
Member

Could you please set a class (DeprecationWarning) and stacklevel=2 on these warnings? That results in better messages for users.
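Applied to the warning quoted above, the suggestion looks roughly like this (a sketch, not the final PR code; the helper name is hypothetical):

```python
import warnings

def warn_meta_deprecated():
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this helper, which makes the message far more useful
    warnings.warn(
        "``meta`` should be given in the ``dask_gufunc_kwargs`` parameter."
        " It will be removed as direct parameter in a future version.",
        DeprecationWarning,
        stacklevel=2,
    )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_meta_deprecated()
```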

Collaborator

Sorry to nitpick - shouldn't that be a FutureWarning so that users actually get to see it?

Contributor Author

@mathause At least in the tests the warnings are issued.

What's the actual difference between DeprecationWarning and FutureWarning (update: I just found PendingDeprecationWarning)? And when should they be used? Just to know for future contributions.

Member

FutureWarning would be fine, too. We should probably try to come to consensus on a general policy for xarray.

The Python docs have some guidance but the overall recommendation is not really clear to me: https://docs.python.org/3/library/warnings.html#warning-categories

Collaborator

FutureWarning is for users and DeprecationWarning for library authors (https://docs.python.org/3/library/warnings.html#warning-categories). Which is why you see DeprecationWarning in the test but won't when you execute the code. Took me a while to figure this out when I wanted to deprecate some stuff in my package.

import warnings

def test():
    warnings.warn("DeprecationWarning", DeprecationWarning)
    warnings.warn("FutureWarning", FutureWarning)

If you try this in IPython, test() will raise both warnings. But if you save it to a file and try

from test_warnings import test
test()

only the FutureWarning will appear (I did not know this detail either: https://www.python.org/dev/peps/pep-0565/).
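Put together, the deprecation path that the thread converges on can be sketched like this (hypothetical helper, not the PR's actual code): a direct meta argument warns with FutureWarning, which per PEP 565 is shown to end users by default, and the value is folded into dask_gufunc_kwargs.

```python
import warnings

def fold_meta_into_gufunc_kwargs(meta=None, dask_gufunc_kwargs=None):
    """Warn on a direct ``meta`` argument and move it into
    dask_gufunc_kwargs, the dict forwarded to apply_gufunc."""
    dask_gufunc_kwargs = dict(dask_gufunc_kwargs or {})
    if meta is not None:
        warnings.warn(
            "``meta`` should be given in the ``dask_gufunc_kwargs``"
            " parameter. It will be removed as direct parameter in a"
            " future version.",
            FutureWarning,
            stacklevel=2,
        )
        dask_gufunc_kwargs.setdefault("meta", meta)
    return dask_gufunc_kwargs

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = fold_meta_into_gufunc_kwargs(meta="sentinel")
```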

Contributor Author

@mathause @shoyer I'll switch to FutureWarning since this seems to be the only user-visible warning, see https://www.python.org/dev/peps/pep-0565/#additional-use-case-for-futurewarning

Contributor Author

And thanks for the pointers and explanations.

@mathause (Collaborator)

Nice! Unless @dcherian has any additional comments, I'll merge in a few days.

@dcherian (Contributor) left a comment

Looks great! Thanks @kmuehlbauer this is a great improvement!

xarray/tests/test_sparse.py (review comment resolved)
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@mathause (Collaborator)

OK then, let's do this. Thanks a lot @kmuehlbauer!

@mathause mathause merged commit a7fb5a9 into pydata:master Aug 19, 2020
Automatic parallelization automation moved this from In progress to Done Aug 19, 2020
@kmuehlbauer (Contributor Author)

Thanks to all reviewers! Great job!

kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Aug 30, 2020
dcherian pushed a commit that referenced this pull request Aug 30, 2020 (#4391)

* move kwargs `output_sizes` and `meta` to `dask_gufunc_kwargs` for internal use of `apply_ufunc` (follow-up to #4060, fixes #4385)

* add pull request reference to `whats-new.rst`
@kmuehlbauer kmuehlbauer deleted the fix-1815 branch May 25, 2023 07:12