
Auto chunk #4064

Merged: 15 commits into pydata:master, May 25, 2020

Conversation

@AndrewILWilliams (Contributor) commented May 15, 2020

Adding chunks='auto' option to Dataset.chunk().
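For context, a minimal usage sketch of the new option (assuming dask is installed; the dataset here is illustrative, not from the PR's tests):

```python
import numpy as np
import xarray as xr

# build a small in-memory dataset
ds = xr.Dataset({"a": (("x", "y"), np.zeros((100, 100)))})

# let dask pick sensible chunk sizes instead of spelling them out per dimension
chunked = ds.chunk("auto")
print(chunked.chunks)  # per-dimension chunk sizes chosen by dask
```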

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

In my git clone, when I run the `black .` and `flake8` checks, I get the following output:

(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ black .
All done! ✨ 🍰 ✨
143 files left unchanged.
(xarray-tests) Andrews-MacBook-Pro-2:xarray andrewwilliams$ flake8
./xarray/backends/memory.py:43:32: E741 ambiguous variable name 'l'
./xarray/backends/common.py:244:32: E741 ambiguous variable name 'l'
./xarray/backends/.ipynb_checkpoints/memory-checkpoint.py:43:32: E741 ambiguous variable name 'l'

I'm not sure why something has changed in these files (I haven't touched them), and I also can't work out what the `l` variable is meant to be doing there.

Could this somehow be associated with loads of the checks failing below? Thanks! :)

@kmuehlbauer (Contributor) commented:

> I'm not sure why something has changed in these files (I haven't touched them), and I also can't work out what the `l` variable is meant to be doing there. Any tips appreciated!

@AndrewWilliams3142 This is due to flake8 applying new checks from pycodestyle. The xarray devs have already dug into this; see #4057.

You might just need to rebase onto current master to make the error go away.

@keewis (Collaborator) commented May 15, 2020

No need to rebase; simply merge master. Also, you'd then have to remove the checkpoint file.

Edit: the failing map_blocks tests are probably related to your changes (I don't know the dask-related code too well)

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

Okay cheers both! I'll have a look at these now.

@keewis sorry, I'm still getting used to using this side of Git. Could you clarify what you mean by "merge master"? Do you mean merge with my local master?

@keewis (Collaborator) commented May 15, 2020

to merge master, what I usually do is:

$ git checkout master
$ git pull  # synchronize master with `origin/master`
$ git checkout <feature-branch>
$ git merge master

and if there are merge conflicts (i.e. the merge was interrupted), I follow the advice given by git status

@AndrewILWilliams (Contributor, Author) commented:

Okay, that makes sense. Though it seems that I forked the master branch before @kmuehlbauer's commit that fixed this flake8 issue? So do I need to make a new fork?

@keewis (Collaborator) commented May 15, 2020

no, the git pull or git pull origin master (git pull <remote> <branch> in general) should do the trick. "Forking" just means setting up your own repository on github as a mirror of the repository you're forking. After that, git can handle the synchronization between both repositories.

If you need more explanations on git, how to use it and also how it works under the hood, check out the git book.

Edit: the book even has a section on Github

kmuehlbauer and others added 6 commits May 15, 2020 12:08
* FIX: correct dask array handling in _calc_idxminmax

* FIX: remove unneeded import, reformat via black

* fix idxmax, idxmin with dask arrays

* FIX: use array[dim].data in `_calc_idxminmax` as per @keewis suggestion, attach dim name to result

* ADD: add dask tests to `idxmin`/`idxmax` dataarray tests

* FIX: add back fixture line removed by accident

* ADD: complete dask handling in `idxmin`/`idxmax` tests in test_dataarray, xfail dask tests for dtype datetime64 (M)

* ADD: add "support dask handling for idxmin/idxmax" in whats-new.rst

* MIN: reintroduce changes added by #3953

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* WIP: remove dask handling entirely for debugging purposes

* Test for dask computes

* WIP: re-add dask handling (map_blocks-approach), add `with raise_if_dask_computes()` context to idxmin-tests

* Use dask indexing instead of map_blocks.

* Better chunk choice.

* Return -1 for _nan_argminmax_object if all NaNs along dim

* Revert "Return -1 for _nan_argminmax_object if all NaNs along dim"

This reverts commit 58901b9.

* Raise error for object arrays

* No error for object arrays. Instead expect 1 compute in tests.

Co-authored-by: dcherian <deepak@cherian.net>
* rename `d` and `l` to `dim` and `length`
Added changes to whats-new.rst
@keewis (Collaborator) commented May 15, 2020

there's something wrong with the merge. Are you able to resolve that by yourself, or should I fix it for you?

@AndrewILWilliams (Contributor, Author) commented:

Do you mean the master merge? If that's wrong, would you be able to fix it for me? My bad; hopefully I'll be able to do it more cleanly in future.

@keewis (Collaborator) commented May 15, 2020

If you have any unpushed commits, could you push them now? While they wouldn't be lost, it's much easier to handle them now than after the force-push.

@AndrewILWilliams (Contributor, Author) commented:

No unpushed commits

@pep8speaks commented May 15, 2020

Hello @AndrewWilliams3142! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-21 17:16:39 UTC

@keewis (Collaborator) commented May 15, 2020

Then you'll have to run `git pull -f` once to get the new state.

For future reference, GitHub does not work terribly well with rebases (or merges without a merge commit), so it would be good to avoid those.

Edit: ignore the pep8speaks message for now; you didn't change anything in line 1.

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

Okay so I've traced the error back to the map_blocks() function. I don't fully understand the code for this function in xarray/core/parallel.py, but here's a quick report on the different behaviours.

Normally, when using the make_ds() and make_da() functions in test_dask.py, without any changes to ds.chunk() we have:

>>> def func(obj):
...         result = obj + obj.x + 5 * obj.y
...         return result
...
>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))

However, when I use the changes I've made to dataset.py (changing isinstance(chunks, Number) to is_scalar(chunks)), the behaviour becomes:

>>> xr.map_blocks(func, ds).unify_chunks().chunks
Frozen(SortedKeysDict({'x': (4, 4, 2), 'y': (5, 5, 5, 5), 'z': (4,)}))
>>> func(ds).chunk().unify_chunks().chunks
Frozen(SortedKeysDict({'x': (10,), 'y': (20,), 'z': (4,)}))

Which means that it now fails the test_map_blocks() call in test_dask.py line 1077.

I've tried to follow the code through and see what is actually happening when this change is made, but I'm out of my depth here. My guess is that is_scalar(chunks) is giving the wrong behaviour when chunks=None?

Edit: I think that's the problem!

>>> isinstance(None, numbers.Number)
False
>>> is_scalar(None)
True

I'll add in something to catch Nones and see if it fixes the error...

@dcherian (Contributor) commented:

Ah right, this is now rechunking chunked objects to a single chunk when chunks=None. This was not happening previously.

@AndrewILWilliams (Contributor, Author) commented May 15, 2020

@dcherian do you have any idea about this mypy type error? I can't find much (accessible) documentation on how the Union[] is working in this context.

xarray/core/dataset.py:1737: error: Argument 2 to "fromkeys" of "dict" has incompatible type "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]"; expected "Union[None, Number, Tuple[Number, ...]]"
xarray/core/dataset.py:1740: error: Item "Number" of "Union[Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]]" has no attribute "keys"

Edit: thanks to everyone for your help so far!

@mathause (Collaborator) left a review comment:

You might also have to change

None, Number, Mapping[Hashable, Union[None, Number, Tuple[Number, ...]]]

to

None, Number, str, Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]]
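In other words, the widened annotation lets a bare string like "auto" pass type checking. A standalone sketch under that assumption (the `normalize_chunks` helper below is illustrative, not xarray's actual implementation):

```python
from numbers import Number
from typing import Hashable, Mapping, Tuple, Union

# widened spec per the review comment: str ("auto") is now allowed
ChunkSpec = Union[
    None, Number, str, Mapping[Hashable, Union[None, Number, str, Tuple[Number, ...]]]
]


def normalize_chunks(chunks: ChunkSpec, dims: Tuple[Hashable, ...]) -> dict:
    # expand a scalar/str spec such as "auto" to a per-dimension mapping
    if chunks is None:
        return {}
    if isinstance(chunks, (Number, str)):
        return dict.fromkeys(dims, chunks)
    return dict(chunks)
```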

(Review thread on xarray/core/dataset.py: resolved)
@AndrewILWilliams marked this pull request as ready for review May 16, 2020
@dcherian (Contributor) commented:

Thanks @AndrewWilliams3142. We should add a test for chunk("auto"). This could test that dataarray.chunk("auto").data is the same as dataarray.data.rechunk("auto") (or something like that).

@AndrewILWilliams (Contributor, Author) commented:

Cheers! I forgot about the tests; I'll hopefully add them this week or next.

@AndrewILWilliams (Contributor, Author) commented:

> This could test that dataarray.chunk("auto").data is the same as dataarray.data.rechunk("auto") (or something like that).

@dcherian Thanks for the tip :) Quick question: is there a reason why you're specifying the .data here? Also, I think I'm missing something, because I don't get what the difference between .chunk() and .rechunk() would be in this case.

@keewis (Collaborator) commented May 21, 2020

`chunk` is a method of `DataArray`, while `rechunk` is a method of `dask.array.Array`.

`dataarray.chunk("auto").data` calls `DataArray.chunk` and then fetches the underlying data structure (a dask array), while `dataarray.data.rechunk("auto")` fetches the underlying data structure and then calls `dask.array.Array.rechunk`.
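Putting that together, the suggested test might be sketched roughly as follows (assuming dask is installed; the test name and exact assertion are guesses at the intent, not the PR's actual test code):

```python
import numpy as np
import xarray as xr


def test_chunk_auto_matches_dask_rechunk():
    arr = xr.DataArray(np.zeros((1000, 1000)), dims=("x", "y"))
    # go through xarray: DataArray.chunk, then pull out the dask array
    via_xarray = arr.chunk("auto").data
    # go through dask directly: chunk first, then rechunk the dask array
    via_dask = arr.chunk({"x": 100, "y": 100}).data.rechunk("auto")
    assert via_xarray.chunks == via_dask.chunks
```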

@AndrewILWilliams (Contributor, Author) commented:

@keewis thanks for this! I've added what I think is a suitable test for DataArrays; do you think it's also a good idea to have a Dataset test?

(Review thread on xarray/tests/test_dask.py: resolved)
@dcherian (Contributor) commented:

Thanks @AndrewWilliams3142

@dcherian merged commit 1de38bc into pydata:master May 25, 2020
@AndrewILWilliams (Contributor, Author) commented:

No problem! Thanks everyone for helping me get up to speed :)

@AndrewILWilliams deleted the auto-chunk branch May 25, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request May 25, 2020
* upstream/master:
  Improve interp performance (pydata#4069)
  Auto chunk (pydata#4064)
  xr.cov() and xr.corr() (pydata#4089)
  allow multiindex levels in plots (pydata#3938)
  Fix bool weights (pydata#4075)
  fix dangerous default arguments (pydata#4006)
Closes issue: Automatic chunking of arrays?