
Auto chunk #4064

Merged: 15 commits into pydata:master, May 25, 2020

Conversation

@AndrewILWilliams (Contributor) commented May 15, 2020

Adding chunks='auto' option to Dataset.chunk().
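For context, a minimal usage sketch of the new option (assuming dask is installed; the dataset here is illustrative, not from the PR's tests):

```python
import numpy as np
import xarray as xr

# build a small in-memory dataset
ds = xr.Dataset({"a": (("x", "y"), np.zeros((100, 100)))})

# let dask pick sensible chunk sizes instead of spelling them out per dimension
chunked = ds.chunk("auto")
print(chunked.chunks)  # per-dimension chunk sizes chosen by dask
```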

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

In my git clone, when I run the `black .` and `flake8` checks, I get the following output:

(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ black .
All done! ✨ 🍰 ✨
143 files left unchanged.
(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ flake8
./xarray/backends/memory.py:43:32: E741 ambiguous variable name 'l'
./xarray/backends/common.py:244:32: E741 ambiguous variable name 'l'
./xarray/backends/.ipynb_checkpoints/memory-checkpoint.py:43:32: E741 ambiguous variable name 'l'

I'm not sure why something has changed in these files (I haven't touched them), and I also can't work out what the `l` variable is meant to be doing there.

Could this somehow be associated with loads of the checks failing below? Thanks! :)

@kmuehlbauer (Contributor) commented:

> I'm not sure why something has changed in these files (I haven't touched them), and I also can't work out what the `l` variable is meant to be doing there. Any tips appreciated!

@AndrewWilliams3142 This is due to flake8 applying new checks from pycodestyle. The xarray devs have already dug into this; see #4057.

You might just need to rebase onto current master to make the error go away.

@keewis (Collaborator) commented May 15, 2020

No need to rebase; simply merge master. Also, you'd then have to remove the checkpoint file.

Edit: the failing map_blocks tests are probably related to your changes (I don't know the dask-related code too well)

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

Okay cheers both! I'll have a look at these now.

@keewis sorry, I'm still getting used to using this side of Git. Could you clarify what you mean by "merge master"? Do you mean merge with my local master?

@keewis (Collaborator) commented May 15, 2020

to merge master, what I usually do is:

$ git checkout master
$ git pull  # synchronize master with `origin/master`
$ git checkout <feature-branch>
$ git merge master

and if there are merge conflicts (i.e. the merge was interrupted), I follow the advice given by git status

@AndrewILWilliams (Contributor, Author) commented:

Okay, that makes sense. Though it seems that I forked the master branch before @kmuehlbauer's commit that fixed this flake8 issue? So do I need to make a new fork?

@keewis (Collaborator) commented May 15, 2020

no, the git pull or git pull origin master (git pull <remote> <branch> in general) should do the trick. "Forking" just means setting up your own repository on github as a mirror of the repository you're forking. After that, git can handle the synchronization between both repositories.

If you need more explanations on git, how to use it and also how it works under the hood, check out the git book.

Edit: the book even has a section on Github

kmuehlbauer and others added 6 commits May 15, 2020 12:08
* FIX: correct dask array handling in _calc_idxminmax

* FIX: remove unneeded import, reformat via black

* fix idxmax, idxmin with dask arrays

* FIX: use array[dim].data in `_calc_idxminmax` as per @keewis suggestion, attach dim name to result

* ADD: add dask tests to `idxmin`/`idxmax` dataarray tests

* FIX: add back fixture line removed by accident

* ADD: complete dask handling in `idxmin`/`idxmax` tests in test_dataarray, xfail dask tests for dtype datetime64 (M)

* ADD: add "support dask handling for idxmin/idxmax" in whats-new.rst

* MIN: reintroduce changes added by #3953

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* WIP: remove dask handling entirely for debugging purposes

* Test for dask computes

* WIP: re-add dask handling (map_blocks-approach), add `with raise_if_dask_computes()` context to idxmin-tests

* Use dask indexing instead of map_blocks.

* Better chunk choice.

* Return -1 for _nan_argminmax_object if all NaNs along dim

* Revert "Return -1 for _nan_argminmax_object if all NaNs along dim"

This reverts commit 58901b9.

* Raise error for object arrays

* No error for object arrays. Instead expect 1 compute in tests.

Co-authored-by: dcherian <deepak@cherian.net>
* rename `d` and `l` to `dim` and `length`
Added changes to whats-new.rst
@keewis (Collaborator) commented May 15, 2020

there's something wrong with the merge. Are you able to resolve that by yourself, or should I fix it for you?

@AndrewILWilliams (Contributor, Author) commented:

Do you mean the master merge? If that's wrong, would you be able to fix it for me? My bad; hopefully I'll be able to do it more cleanly in future.

@keewis (Collaborator) commented May 15, 2020

If you have any unpushed commits, could you push them now? While they wouldn't be lost, it's much easier to handle them now than after the force-push.

@AndrewILWilliams (Contributor, Author) commented:

No unpushed commits

@pep8speaks commented May 15, 2020

Hello @AndrewWilliams3142! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-21 17:16:39 UTC

@keewis (Collaborator) commented May 15, 2020

Then you'll have to run `git pull -f` once to get the new state.

For future reference, GitHub does not work terribly well with rebases (or merges without a merge commit), so it would be good to avoid those.

Edit: ignore the pep8speaks message for now; you didn't change anything in line 1.

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

Okay so I've traced the error back to the map_blocks() function. I don't fully understand the code for this function in xarray/core/parallel.py, but here's a quick report on the different behaviours.

Normally, when using the make_ds() and make_da() functions in test_dask.py, without any changes to ds.chunk() we have:

>>> def func(obj):
...         result = obj + obj.x + 5 * obj.y
...         return result
...
>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))

However, when I use the changes I've made to dataset.py (changing isinstance(chunks, Number) to is_scalar(chunks)), the behaviour becomes:

>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (10,), 'y': (20,), 'z': (4,)}))

Which means that it now fails the test_map_blocks() call in test_dask.py line 1077.

I've tried to follow the code through and see what is actually happening when this change is made, but I'm out of my depth here. My guess is that is_scalar(chunks) is giving the wrong behaviour when chunks=None?

Edit: I think that's the problem!

>>> isinstance(None, numbers.Number)
False
>>> is_scalar(None)
True

I'll add in something to catch Nones and see if it fixes the error...

@dcherian (Contributor) commented:

Ah right, this is now rechunking chunked objects to a single chunk when chunks=None. This was not happening previously.

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

@dcherian do you have any idea about this mypy type error? I can't find much (accessible) documentation on how the Union[] is working in this context.

xarray/core/dataset.py:1737: error: Argument 2 to "fromkeys" of "dict" has incompatible type "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]"; expected "Union[None, Number, Tuple[Number, ...]]"
xarray/core/dataset.py:1740: error: Item "Number" of "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]" has no attribute "keys"

Edit: thanks to everyone for your help so far!

@mathause (Collaborator) left a review comment:

You might also have to change

None, Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]

to

None, Number, str, Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]]
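In other words, the widened annotation lets a bare string like "auto" pass type checking. A standalone sketch under that assumption (the `normalize_chunks` helper below is illustrative, not xarray's actual implementation):

```python
from numbers import Number
from typing import Hashable, Mapping, Tuple, Union

# widened spec per the review comment: str ("auto") is now allowed
ChunkSpec = Union[
    None, Number, str, Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]]
]


def normalize_chunks(chunks: ChunkSpec, dims: Tuple[Hashable, ...]) -> dict:
    # expand a scalar/str spec such as "auto" to a per-dimension mapping
    if chunks is None:
        return {}
    if isinstance(chunks, (Number, str)):
        return dict.fromkeys(dims, chunks)
    return dict(chunks)
```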

(Review thread on xarray/core/dataset.py: resolved)
@AndrewILWilliams marked this pull request as ready for review May 16, 2020
@dcherian (Contributor) commented:

Thanks @AndrewWilliams3142. We should add a test for chunk("auto"). This could test that dataarray.chunk("auto").data is the same as dataarray.data.rechunk("auto") (or something like that).

@AndrewILWilliams (Contributor, Author) commented:

Cheers! I forgot about the tests; I'll hopefully add them this week or next.

@AndrewILWilliams (Contributor, Author) commented:

> This could test that dataarray.chunk("auto").data is the same as dataarray.data.rechunk("auto") (or something like that).

@dcherian Thanks for the tip :) Quick question: is there a reason why you're specifying the .data here? Also, I think I'm missing something, because I don't get what the difference between .chunk() and .rechunk() would be in this case.

@keewis (Collaborator) commented May 21, 2020

`chunk` is a method of `DataArray`, while `rechunk` is a method of `dask.array.Array`.

`dataarray.chunk("auto").data` calls `DataArray.chunk` and then fetches the underlying data structure (a dask array), while `dataarray.data.rechunk("auto")` fetches the underlying data structure and then calls `dask.array.Array.rechunk`.
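Putting that together, the suggested test might be sketched roughly as follows (assuming dask is installed; the test name and exact assertion are guesses at the intent, not the PR's actual test code):

```python
import numpy as np
import xarray as xr


def test_chunk_auto_matches_dask_rechunk():
    arr = xr.DataArray(np.zeros((1000, 1000)), dims=("x", "y"))
    # go through xarray: DataArray.chunk, then pull out the dask array
    via_xarray = arr.chunk("auto").data
    # go through dask directly: chunk first, then rechunk the dask array
    via_dask = arr.chunk({"x": 100, "y": 100}).data.rechunk("auto")
    assert via_xarray.chunks == via_dask.chunks
```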

@AndrewILWilliams (Contributor, Author) commented:

@keewis thanks for this! I've added what I think is a suitable test for DataArrays; do you think it's also a good idea to have a Dataset test?

(Review thread on xarray/tests/test_dask.py: resolved)
@dcherian (Contributor) commented:

Thanks @AndrewWilliams3142

@dcherian merged commit 1de38bc into pydata:master May 25, 2020
@AndrewILWilliams (Contributor, Author) commented:

No problem! Thanks everyone for helping me get up to speed :)

@AndrewILWilliams deleted the auto-chunk branch May 25, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request May 25, 2020
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
Closes issue: Automatic chunking of arrays?