
Feature: N-dimensional auto_combine #2553

Merged · 40 commits into pydata:master on Dec 13, 2018

Conversation

@TomNicholas (Contributor) commented Nov 10, 2018

What I did

Generalised the auto_combine() function to be able to concatenate and merge datasets along any number of dimensions, instead of just one.

Provides one solution to #2159, and relevant for discussion in #2039.

Currently it cannot deduce the order in which datasets should be concatenated along any one dimension from the coordinates, so it just concatenates them in the order they are supplied. This means that for an N-D concatenation the datasets have to be supplied as a list of lists, nested as many times as there are dimensions to be concatenated along.

How it works

In _infer_concat_order_from_nested_list() the nested list of datasets is recursively traversed in order to create a dictionary of datasets, where the keys are the corresponding "tile IDs". These tile IDs are tuples serving as multidimensional indices for the position of each dataset within the hypercube of all datasets which are to be combined. For example, four datasets which are to be combined along two dimensions would be supplied as

datasets = [[ds0, ds1], [ds2, ds3]]

and assigned tile IDs, to be stored as

combined_ids = {(0, 0): ds0, (0, 1): ds1, (1, 0): ds2, (1, 1): ds3}

Using this unambiguous intermediate structure means that another method could be used to organise the datasets for concatenation (e.g. reading the values of their coordinates), with a new keyword argument infer_order_from_coords used to choose between the methods. The _combine_nd() function concatenates along one dimension at a time, reducing the length of the tile ID tuple by one each time _combine_along_first_dim() is called. After each concatenation the different variables are merged, so the new auto_combine() is essentially equivalent to calling the old one once for each dimension in concat_dims.
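
As a rough sketch of these two steps (the function names echo the ones above, but the bodies are simplified for illustration and are not the actual implementation):

    import xarray as xr

    def infer_tile_ids_from_nested_list(entry, current_pos=()):
        # Recursively walk the nested list; the recursion depth becomes
        # the length of each tile ID tuple.
        if isinstance(entry, list):
            for i, item in enumerate(entry):
                yield from infer_tile_ids_from_nested_list(item, current_pos + (i,))
        else:
            yield current_pos, entry

    def combine_nd(combined_ids, concat_dims):
        # Concatenate along one dimension at a time; each pass strips the
        # leading index off every tile ID, shortening it by one element.
        for dim in concat_dims:
            grouped = {}
            for tile_id, ds in combined_ids.items():
                grouped.setdefault(tile_id[1:], []).append((tile_id[0], ds))
            combined_ids = {
                key: xr.concat([ds for _, ds in sorted(parts, key=lambda p: p[0])],
                               dim=dim)
                for key, parts in grouped.items()
            }
        (combined,) = combined_ids.values()  # one dataset with an empty tile ID
        return combined

    # datasets = [[ds0, ds1], [ds2, ds3]]
    # combined_ids = dict(infer_tile_ids_from_nested_list(datasets))
    # result = combine_nd(combined_ids, concat_dims=['x', 'y'])

(In this simplified version only concatenation is shown; the real code also merges the different variables after each pass, as described above.)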

Still to do

I would like people's opinions on the method I've chosen to do this, and any feedback on the code quality would be appreciated. Assuming we're happy with the method I used here, then the remaining tasks include:

  • More tests of the final auto_combine() function

  • Add option to deduce concatenation order from coords (or this could be a separate PR)

  • Integrate this all the way up to open_mfdataset().

  • Unit tests for open_mfdataset()

  • More tests that the user has supplied a valid structure of datasets

  • Possibly parallelize the concatenation step?

  • A few other small TODOs which are in combine.py

  • Proper documentation showing how the input should be structured.

  • Fix failing unit tests on python 2.7 (though support for 2.7 is being dropped at the end of 2018?)

  • Fix failing unit tests on python 3.5

  • Update what's new

This PR was intended to solve the common use case of collecting output from a simulation which was parallelized in multiple dimensions. I would like to write a tutorial about how to use xarray to do this, including examples of how to preprocess the data and discard processor ghost cells.
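
As a taste of that workflow, a hedged sketch (the guard-cell width, variable layout and file pattern are invented for illustration; preprocess is the existing open_mfdataset hook for this kind of per-file cleanup):

    import xarray as xr

    def trim_ghost_cells(ds):
        # Drop the processor guard cells before combining; a width of 2
        # on each side of x is purely illustrative.
        return ds.isel(x=slice(2, -2))

    # Hypothetical per-process output files, one per MPI rank.
    ds = xr.open_mfdataset('run/output_rank*.nc', preprocess=trim_ghost_cells)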

@pep8speaks commented Nov 10, 2018

Hello @TomNicholas! Thanks for updating the PR.

Line 647:80: E501 line too long (81 > 79 characters)

Line 381:36: E226 missing whitespace around arithmetic operator
Line 536:75: E211 whitespace before '('
Line 537:57: E126 continuation line over-indented for hanging indent

Line 148:1: W391 blank line at end of file

Line 2165:36: E127 continuation line over-indented for visual indent
Line 2172:36: E126 continuation line over-indented for hanging indent

Line 495:65: E211 whitespace before '('
Line 523:52: E231 missing whitespace after ','
Line 523:54: E231 missing whitespace after ','
Line 523:61: E231 missing whitespace after ','
Line 524:57: E241 multiple spaces after ','
Line 525:55: E241 multiple spaces after ','
Line 526:55: E241 multiple spaces after ','
Line 527:57: E241 multiple spaces after ','
Line 597:52: E203 whitespace before ','
Line 646:80: E501 line too long (85 > 79 characters)
Line 647:80: E501 line too long (85 > 79 characters)
Line 658:80: E501 line too long (83 > 79 characters)
Line 659:80: E501 line too long (87 > 79 characters)

Comment last updated on December 12, 2018 at 00:54 UTC

@shoyer (Member) left a comment

a few high level comments

(Three review threads on xarray/core/combine.py; outdated, resolved)
@shoyer (Member) commented Nov 10, 2018

I should have suggested this before, but the internal implementation of numpy.block() is a nice example of this sort of logic: https://github.com/numpy/numpy/blob/e726c045c72833d5c826e8a13f889746ee0bfdf4/numpy/core/shape_base.py#L661

@TomNicholas (Contributor, Author) commented Nov 26, 2018

This is basically done now - I've implemented everything I wanted to, and included unit tests for the new functionality.
I'm also successfully using it in my personal code now.

@shoyer I haven't changed the way the combine function is applied repeatedly to match the implementation in numpy.block(), I think the way I've done it is relatively clear, but let me know if you think it should be refactored.

I don't understand why some of the CI tests are failing - all tests which run on my machine pass, and the errors in the log seem to come from dask arrays not being loaded, e.g.:

AssertionError: 
    <xarray.Dataset>
    Dimensions:  (x: 10, y: 8)
    Dimensions without coordinates: x, y
    Data variables:
        foo      (x, y) float64 -1.19 0.4476 0.6283 ... -0.09699 -0.2311 -0.6482
    <xarray.Dataset>
    Dimensions:  (x: 10, y: 8)
    Dimensions without coordinates: x, y
    Data variables:
        foo      (x, y) float64 dask.array<shape=(10, 8), chunksize=(5, 4)>

@TomNicholas changed the title from "[WIP] Feature: N-dimensional auto_combine" to "Feature: N-dimensional auto_combine" on Nov 26, 2018
@TomNicholas (Contributor, Author) commented Dec 6, 2018

Also if you specify dim='something' when that dimension doesn't already exist, you are explicitly telling it to add a new dimension.
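
A minimal illustration of that behaviour, with made-up datasets:

    import xarray as xr

    ds0 = xr.Dataset({'temp': ('x', [1.0, 2.0])})
    ds1 = xr.Dataset({'temp': ('x', [3.0, 4.0])})

    # 'run' exists in neither dataset, so concat creates it as a new
    # dimension of length 2 rather than extending an existing one.
    combined = xr.concat([ds0, ds1], dim='run')
    assert dict(combined.sizes) == {'run': 2, 'x': 2}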

@shoyer (Member) commented Dec 6, 2018

Specifying dim=None will specify (1) explicitly, but otherwise yes, that's exactly what it does.

Yes, this was a typo on my part :).

@TomNicholas (Contributor, Author) commented:

I appreciate that this is pretty complicated, perhaps it should have its own section in the docs (I can do another PR?)

@shoyer (Member) commented Dec 6, 2018

Do you think it would make sense to try to get rid of "both merging and concatenating" in favor of requiring another dimension in the grid? I guess we do sort of need this for open_mfdataset, but it feels more complex than desired.

@TomNicholas (Contributor, Author) commented Dec 6, 2018

I think that would mean there might be some situations that 1D auto_combine() could deal with but nested_concat_and_merge() could not. Situations where you can only merge the variables once they have been individually concatenated. (actually I'm not sure - I would need to think about that)

I think this is a question of whether you think that:

a) auto_combine() is a well-defined operation that is still well-defined in N-D

or

b) auto_combine()'s purpose is to do its best to combine whatever is thrown at it, and more surgical N-D operations should be the job of another function

I personally think a), but I expect users who haven't read the source code for auto_combine() might disagree.

@shoyer (Member) commented Dec 8, 2018

I think that would mean there might be some situations that 1D auto_combine() could deal with but nested_concat_and_merge() could not. Situations where you can only merge the variables once they have been individually concatenated. (actually I'm not sure - I would need to think about that)

I think multiple levels of nested_concat_and_merge() could always replace auto_combine(), as long as merging happens last, at the outermost level. That's basically what auto_combine() already does.

There are probably some edge cases where the existing auto_combine would let you be a little more sloppy about how you put together your lists of variables. But I don't think that's really a good idea for us to support in xarray, since it removes most of the freedom from the implementation.

I personally think a), but I expect users who haven't read the source code for auto_combine() might disagree.

Yes, this is my concern. The way this API handles nested lists makes sense from an implementation perspective, but not really from a user perspective. For users, I think it makes sense to have:

  1. The precisely defined nested_concat_and_merge() for cases where you already have your data sorted. This is pretty common (e.g., with hierarchical directory structures).
  2. A magical auto_combine() that inspects coordinates to figure out how to put everything together. We might even rename this to combine() (if we think it's comprehensive enough).

The original version of auto_combine() that I wrote was aiming towards this second use-case, but was obviously fragile and incomplete.

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

@TomNicholas (Contributor, Author) commented Dec 10, 2018

There are probably some edge cases where the existing auto_combine would let you be a little more sloppy about how you put together your lists of variables.

I think that's probably the case, but I also think that those edge cases will be so specific that maybe we don't have to explicitly support them. We could just say that anyone who has a combination of datasets that is that funky can just concatenate them themselves?

For users, I think it makes sense to have...

I agree, two separate functions is a lot more intuitive than having auto_combine() behave either auto-magically or totally manually depending on the value of an optional keyword argument.

The only problem with that idea is that both of these functions should be options for open_mfdataset(), so you still have multiple possible argument structures for that function. I think that's okay though - it would be a lot easier to explain if you say "if the keyword argument combine='auto' then the function auto_combine() is used, if combine='manual' then nested_concat_and_merge() is used, and the filepaths must be structured accordingly."
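
To make that concrete, a sketch of how the proposed keyword might look for a user (this follows the suggestion above and is not an API that exists in this PR; the keyword values and file names are illustrative):

    import xarray as xr

    # Magical: inspect coordinates to determine the order along each dimension.
    ds = xr.open_mfdataset('output_*.nc', combine='auto')

    # Manual: the nesting of the path list encodes the order, so the
    # concatenation dimensions must be given explicitly.
    paths = [['x0y0.nc', 'x0y1.nc'],
             ['x1y0.nc', 'x1y1.nc']]
    ds = xr.open_mfdataset(paths, combine='manual', concat_dim=['x', 'y'])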

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

That would be great. Then I could start using the master branch of xarray again in my code, while we redo the public API. If I set concat_dims back to concat_dim then it should be completely backwards-compatible.

@shoyer (Member) commented Dec 10, 2018

The only problem with that idea is that both of these functions should be options for open_mfdataset(), so you still have multiple possible argument structures for that function. I think that's okay though - it would be a lot easier to explain if you say "if the keyword argument combine='auto' then the function auto_combine() is used, if combine='manual' then nested_concat_and_merge() is used, and the filepaths must be structured accordingly."

This sounds pretty good to me.

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

That would be great. Then I could start using the master branch of xarray again in my code, while we redo the public API. If I set concat_dims back to concat_dim then it should be completely backwards-compatible.

OK, do you want to go ahead and revert the documentation/public API for now?

I would even be OK supporting nested lists temporarily in xarray via APIs like open_mfdataset and auto_concat without official documentation/support -- using this feature would be "at your own risk". As long as we don't break any existing code, that is totally fine. We just shouldn't advertise an interface until we're happy with it.

@TomNicholas (Contributor, Author) commented:

This sounds pretty good to me.

Okay good. So with this API then auto_combine() (and therefore the default behaviour of open_mfdataset()) would be to throw a warning if there isn't enough info in the dataset coords to order them?

OK, do you want to go ahead and revert the documentation/public API for now?

Yes, I'll do that.

I would even be OK supporting nested lists temporarily ... "at your own risk"

If I revert the documentation and you merge this PR then that's exactly what we will have, which would be useful for me until we do the public API.

(Also it seems that whatever was wrong with cftime has been fixed now, as the CI tests are passing.)

@shoyer (Member) commented Dec 10, 2018

Okay good. So with this API then auto_combine() (and therefore the default behaviour of open_mfdataset()) would be to throw a warning if there isn't enough info in the dataset coords to order them?

Yes, but throwing a warning should probably only be a temporary solution. Long term we should pick a default value (e.g., combine='auto') and require explicitly opting in to different behavior (e.g., combine='manual').

@TomNicholas (Contributor, Author) commented:

Okay, I've reverted the API, so it basically converts a concat_dim input to its multidimensional version before continuing (see the sketch below).

If we merge this then should I start a separate Pull Request for further discussion about the API?

One of the Travis CI builds failed but again I don't think that was me.
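
For reference, a minimal sketch of that concat_dim shim, assuming illustrative internal names (the merged code in combine.py may differ):

    def _ensure_concat_dims_list(concat_dim):
        # Promote the old scalar form of concat_dim to the one-element
        # list form that the N-D machinery expects.
        if concat_dim is None or isinstance(concat_dim, list):
            return concat_dim
        return [concat_dim]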

@shoyer (Member) left a comment

you can indeed ignore the test failure on dask-dev -- that looks like a dask issue (dask/dask#4291)

(Review threads on xarray/backends/api.py and xarray/core/combine.py; outdated, resolved)
@TomNicholas (Contributor, Author) commented:

Removed the unnecessary argument.

@shoyer shoyer merged commit 9e8707d into pydata:master Dec 13, 2018
@shoyer (Member) commented Dec 13, 2018

thanks @TomNicholas! I'm looking forward to the follow-ups :)

dcherian pushed a commit to yohai/xarray that referenced this pull request Dec 16, 2018
* upstream/master:
  Feature: N-dimensional auto_combine (pydata#2553)
  Support HighLevelGraphs (pydata#2603)
  Bump cftime version in doc environment (pydata#2604)
  use keep_attrs in binary operations II (pydata#2590)
  Temporarily mark dask-dev build as an allowed failure (pydata#2602)
  Fix wrong error message in interp() (pydata#2598)
  Add dayofyear and dayofweek accessors (pydata#2599)
  Fix h5netcdf saving scalars with filters or chunks (pydata#2591)
  Minor update to PR template (pydata#2596)
  Zarr consolidated (pydata#2559)
  fix examples (pydata#2581)
  Fix typo (pydata#2578)
  Concat docstring typo (pydata#2577)
  DOC: remove example using Dataset.T (pydata#2572)
  python setup.py test now works by default (pydata#2573)
  Return slices when possible from CFTimeIndex.get_loc() (pydata#2569)
  DOC: fix computation.rst (pydata#2567)
@TomNicholas mentioned this pull request on Dec 17, 2018
shoyer pushed a commit that referenced this pull request Jun 25, 2019
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting