
Feature: N-dimensional auto_combine #2553

Merged · 40 commits into pydata:master on Dec 13, 2018

Conversation

@TomNicholas (Contributor) commented Nov 10, 2018

What I did

Generalised the auto_combine() function to be able to concatenate and merge datasets along any number of dimensions, instead of just one.

Provides one solution to #2159, and relevant for discussion in #2039.

Currently it cannot deduce the order in which datasets should be concatenated along any one dimension from the coordinates, so it just concatenates them in the order they are supplied. This means that for an N-D concatenation the datasets have to be supplied as a list of lists, nested as many times as there are dimensions to be concatenated along.

How it works

In _infer_concat_order_from_nested_list() the nested list of datasets is recursively traversed in order to create a dictionary of datasets, where the keys are the corresponding "tile IDs". These tile IDs are tuples serving as multidimensional indices for the position of each dataset within the hypercube of all datasets which are to be combined. For example, four datasets which are to be combined along two dimensions would be supplied as

datasets = [[ds0, ds1], [ds2, ds3]]

and assigned tile IDs, to be stored as

combined_ids = {(0, 0): ds0, (0, 1): ds1, (1, 0): ds2, (1, 1): ds3}

Using this unambiguous intermediate structure means that another method could be used to organise the datasets for concatenation (e.g. reading the values of their coordinates), with a new keyword argument infer_order_from_coords used to choose between the methods. The _combine_nd() function concatenates along one dimension at a time, reducing the length of the tile ID tuple by one each time _combine_along_first_dim() is called. After each concatenation the different variables are merged, so the new auto_combine() is essentially equivalent to calling the old one once for each dimension in concat_dims.
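
As a rough sketch of these two steps (the function names echo the ones above, but the bodies are simplified for illustration and are not the actual implementation):

    import xarray as xr

    def infer_tile_ids_from_nested_list(entry, current_pos=()):
        # Recursively walk the nested list; the recursion depth becomes
        # the length of each tile ID tuple.
        if isinstance(entry, list):
            for i, item in enumerate(entry):
                yield from infer_tile_ids_from_nested_list(item, current_pos + (i,))
        else:
            yield current_pos, entry

    def combine_nd(combined_ids, concat_dims):
        # Concatenate along one dimension at a time; each pass strips the
        # leading index off every tile ID, shortening it by one element.
        for dim in concat_dims:
            grouped = {}
            for tile_id, ds in combined_ids.items():
                grouped.setdefault(tile_id[1:], []).append((tile_id[0], ds))
            combined_ids = {
                key: xr.concat([ds for _, ds in sorted(parts, key=lambda p: p[0])],
                               dim=dim)
                for key, parts in grouped.items()
            }
        (combined,) = combined_ids.values()  # one dataset with an empty tile ID
        return combined

    # datasets = [[ds0, ds1], [ds2, ds3]]
    # combined_ids = dict(infer_tile_ids_from_nested_list(datasets))
    # result = combine_nd(combined_ids, concat_dims=['x', 'y'])

(In this simplified version only concatenation is shown; the real code also merges the different variables after each pass, as described above.)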

Still to do

I would like people's opinions on the method I've chosen to do this, and any feedback on the code quality would be appreciated. Assuming we're happy with the method I used here, then the remaining tasks include:

  • More tests of the final auto_combine() function

  • Add option to deduce concatenation order from coords (or this could be a separate PR)

  • Integrate this all the way up to open_mfdataset().

  • Unit tests for open_mfdataset()

  • More tests that the user has supplied a valid structure of datasets

  • Possibly parallelize the concatenation step?

  • A few other small TODOs which are in combine.py

  • Proper documentation showing how the input should be structured.

  • Fix failing unit tests on python 2.7 (though support for 2.7 is being dropped at the end of 2018?)

  • Fix failing unit tests on python 3.5

  • Update what's new

This PR was intended to solve the common use case of collecting output from a simulation which was parallelized in multiple dimensions. I would like to write a tutorial about how to use xarray to do this, including examples of how to preprocess the data and discard processor ghost cells.
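
As a taste of that workflow, a hedged sketch (the guard-cell width, variable layout and file pattern are invented for illustration; preprocess is the existing open_mfdataset hook for this kind of per-file cleanup):

    import xarray as xr

    def trim_ghost_cells(ds):
        # Drop the processor guard cells before combining; a width of 2
        # on each side of x is purely illustrative.
        return ds.isel(x=slice(2, -2))

    # Hypothetical per-process output files, one per MPI rank.
    ds = xr.open_mfdataset('run/output_rank*.nc', preprocess=trim_ghost_cells)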

@pep8speaks commented Nov 10, 2018

Hello @TomNicholas! Thanks for updating the PR.

Line 647:80: E501 line too long (81 > 79 characters)

Line 381:36: E226 missing whitespace around arithmetic operator
Line 536:75: E211 whitespace before '('
Line 537:57: E126 continuation line over-indented for hanging indent

Line 148:1: W391 blank line at end of file

Line 2165:36: E127 continuation line over-indented for visual indent
Line 2172:36: E126 continuation line over-indented for hanging indent

Line 495:65: E211 whitespace before '('
Line 523:52: E231 missing whitespace after ','
Line 523:54: E231 missing whitespace after ','
Line 523:61: E231 missing whitespace after ','
Line 524:57: E241 multiple spaces after ','
Line 525:55: E241 multiple spaces after ','
Line 526:55: E241 multiple spaces after ','
Line 527:57: E241 multiple spaces after ','
Line 597:52: E203 whitespace before ','
Line 646:80: E501 line too long (85 > 79 characters)
Line 647:80: E501 line too long (85 > 79 characters)
Line 658:80: E501 line too long (83 > 79 characters)
Line 659:80: E501 line too long (87 > 79 characters)

Comment last updated on December 12, 2018 at 00:54 UTC

@shoyer (Member) left a comment

a few high level comments

(Three review threads on xarray/core/combine.py; outdated, resolved)
@shoyer (Member) commented Nov 10, 2018

I should have suggested this before, but the internal implementation of numpy.block() is a nice example of this sort of logic: https://github.com/numpy/numpy/blob/e726c045c72833d5c826e8a13f889746ee0bfdf4/numpy/core/shape_base.py#L661

@TomNicholas (Contributor, Author) commented Nov 26, 2018

This is basically done now - I've implemented everything I wanted to, and included unit tests for the new functionality.
I'm also successfully using it in my personal code now.

@shoyer I haven't changed the way the combine function is applied repeatedly to match the implementation in numpy.block(), I think the way I've done it is relatively clear, but let me know if you think it should be refactored.

I don't understand why some of the CI tests are failing - all tests which run on my machine pass, and the errors in the log seem to come from dask arrays not being loaded, e.g.:

AssertionError: 
    <xarray.Dataset>
    Dimensions:  (x: 10, y: 8)
    Dimensions without coordinates: x, y
    Data variables:
        foo      (x, y) float64 -1.19 0.4476 0.6283 ... -0.09699 -0.2311 -0.6482
    <xarray.Dataset>
    Dimensions:  (x: 10, y: 8)
    Dimensions without coordinates: x, y
    Data variables:
        foo      (x, y) float64 dask.array<shape=(10, 8), chunksize=(5, 4)>

@TomNicholas changed the title from "[WIP] Feature: N-dimensional auto_combine" to "Feature: N-dimensional auto_combine" on Nov 26, 2018
@TomNicholas (Contributor, Author) commented Dec 6, 2018

Also if you specify dim='something' when that dimension doesn't already exist, you are explicitly telling it to add a new dimension.
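
A minimal illustration of that behaviour, with made-up datasets:

    import xarray as xr

    ds0 = xr.Dataset({'temp': ('x', [1.0, 2.0])})
    ds1 = xr.Dataset({'temp': ('x', [3.0, 4.0])})

    # 'run' exists in neither dataset, so concat creates it as a new
    # dimension of length 2 rather than extending an existing one.
    combined = xr.concat([ds0, ds1], dim='run')
    assert dict(combined.sizes) == {'run': 2, 'x': 2}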

@shoyer (Member) commented Dec 6, 2018

Specifying dim=None will specify (1) explicitly, but otherwise yes, that's exactly what it does.

Yes, this was a typo on my part :).

@TomNicholas (Contributor, Author) commented:

I appreciate that this is pretty complicated, perhaps it should have its own section in the docs (I can do another PR?)

@shoyer (Member) commented Dec 6, 2018

Do you think it would make sense to try to get rid of "both merging and concatenating" in favor of requiring another dimension in the grid? I guess we do sort of need this for open_mfdataset, but it feels more complex than desired.

@TomNicholas (Contributor, Author) commented Dec 6, 2018

I think that would mean there might be some situations that 1D auto_combine() could deal with but nested_concat_and_merge() could not. Situations where you can only merge the variables once they have been individually concatenated. (actually I'm not sure - I would need to think about that)

I think this is a question of whether you think that:

a) auto_combine() is a well-defined operation that is still well-defined in N-D

or

b) auto_combine()'s purpose is to do its best to combine whatever is thrown at it, and more surgical N-D operations should be the job of another function

I personally think a), but I expect users who haven't read the source code for auto_combine() might disagree.

@shoyer (Member) commented Dec 8, 2018

I think that would mean there might be some situations that 1D auto_combine() could deal with but nested_concat_and_merge() could not. Situations where you can only merge the variables once they have been individually concatenated. (actually I'm not sure - I would need to think about that)

I think multiple levels of nested_concat_and_merge() could always replace auto_combine(), as long as merging happens last, at the outermost level. That's basically what auto_combine() already does.

There are probably some edge cases where the existing auto_combine would let you be a little more sloppy about how you put together your lists of variables. But I don't think that's really a good idea for us to support in xarray, since it removes most of the freedom from the implementation.

I personally think a), but I expect users who haven't read the source code for auto_combine() might disagree.

Yes, this is my concern. The way this API handles nested lists makes sense from an implementation perspective, but not really from a user perspective. For users, I think it makes sense to have:

  1. The precisely defined nested_concat_and_merge() for cases where you already have your data sorted. This is pretty common (e.g., with hierarchical directory structures).
  2. A magical auto_combine() that inspects coordinates to figure out how to put everything together. We might even rename this to combine() (if we think it's comprehensive enough).

The original version of auto_combine() that I wrote was aiming towards this second use-case, but was obviously fragile and incomplete.

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

@TomNicholas (Contributor, Author) commented Dec 10, 2018

There are probably some edge cases where the existing auto_combine would let you be a little more sloppy about how you put together your lists of variables.

I think that's probably the case, but I also think that those edge cases will be so specific that maybe we don't have to explicitly support them. We could just say that anyone who has a combination of datasets that is that funky can just concatenate them themselves?

For users, I think it makes sense to have...

I agree, two separate functions is a lot more intuitive than having auto_combine() behave either auto-magically or totally manually depending on the value of an optional keyword argument.

The only problem with that idea is that both of these functions should be options for open_mfdataset(), so you still have multiple possible argument structures for that function. I think that's okay though - it would be a lot easier to explain if you say "if the keyword argument combine='auto' then the function auto_combine() is used, if combine='manual' then nested_concat_and_merge() is used, and the filepaths must be structured accordingly."
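
To make that concrete, a sketch of how the proposed keyword might look for a user (this follows the suggestion above and is not an API that exists in this PR; the keyword values and file names are illustrative):

    import xarray as xr

    # Magical: inspect coordinates to determine the order along each dimension.
    ds = xr.open_mfdataset('output_*.nc', combine='auto')

    # Manual: the nesting of the path list encodes the order, so the
    # concatenation dimensions must be given explicitly.
    paths = [['x0y0.nc', 'x0y1.nc'],
             ['x1y0.nc', 'x1y1.nc']]
    ds = xr.open_mfdataset(paths, combine='manual', concat_dim=['x', 'y'])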

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

That would be great. Then I could start using the master branch of xarray again in my code, while we redo the public API. If I set concat_dims back to concat_dim then it should be completely backwards-compatible.

@shoyer (Member) commented Dec 10, 2018

The only problem with that idea is that both of these functions should be options for open_mfdataset(), so you still have multiple possible argument structures for that function. I think that's okay though - it would be a lot easier to explain if you say "if the keyword argument combine='auto' then the function auto_combine() is used, if combine='manual' then nested_concat_and_merge() is used, and the filepaths must be structured accordingly."

This sounds pretty good to me.

If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.

That would be great. Then I could start using the master branch of xarray again in my code, while we redo the public API. If I set concat_dims back to concat_dim then it should be completely backwards-compatible.

OK, do you want to go ahead and revert the documentation/public API for now?

I would even be OK supporting nested lists temporarily in xarray via APIs like open_mfdataset and auto_concat without official documentation/support -- using this feature would be "at your own risk". As long as we don't break any existing code, that is totally fine. We just shouldn't advertise an interface until we're happy with it.

@TomNicholas (Contributor, Author) commented:

This sounds pretty good to me.

Okay good. So with this API then auto_combine() (and therefore the default behaviour of open_mfdataset()) would be to throw a warning if there isn't enough info in the dataset coords to order them?

OK, do you want to go ahead and revert the documentation/public API for now?

Yes, I'll do that.

I would even be OK supporting nested lists temporarily ... "at your own risk"

If I revert the documentation and you merge this PR then that's exactly what we will have, which would be useful for me until we do the public API.

(Also it seems that whatever was wrong with cftime has been fixed now, as the CI tests are passing.)

@shoyer (Member) commented Dec 10, 2018

Okay good. So with this API then auto_combine() (and therefore the default behaviour of open_mfdataset()) would be to throw a warning if there isn't enough info in the dataset coords to order them?

Yes, but throwing a warning should probably only be a temporary solution. Long term we should pick a default value (e.g., combine='auto') and require explicitly opting in to different behavior (e.g., combine='manual').

@TomNicholas (Contributor, Author) commented:

Okay, I've reverted the API, so it basically converts a concat_dim input to its multidimensional version before continuing (see the sketch below).

If we merge this then should I start a separate Pull Request for further discussion about the API?

One of the Travis CI builds failed but again I don't think that was me.
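
For reference, a minimal sketch of that concat_dim shim, assuming illustrative internal names (the merged code in combine.py may differ):

    def _ensure_concat_dims_list(concat_dim):
        # Promote the old scalar form of concat_dim to the one-element
        # list form that the N-D machinery expects.
        if concat_dim is None or isinstance(concat_dim, list):
            return concat_dim
        return [concat_dim]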

@shoyer (Member) left a comment

you can indeed ignore the test failure on dask-dev -- that looks like a dask issue (dask/dask#4291)

(Review threads on xarray/backends/api.py and xarray/core/combine.py; outdated, resolved)
@TomNicholas (Contributor, Author) commented:

Removed the unnecessary argument.

@shoyer shoyer merged commit 9e8707d into pydata:master Dec 13, 2018
@shoyer (Member) commented Dec 13, 2018

thanks @TomNicholas! I'm looking forward to the follow-ups :)

dcherian pushed a commit to yohai/xarray that referenced this pull request Dec 16, 2018
* upstream/master:
  Feature: N-dimensional auto_combine (pydata#2553)
  Support HighLevelGraphs (pydata#2603)
  Bump cftime version in doc environment (pydata#2604)
  use keep_attrs in binary operations II (pydata#2590)
  Temporarily mark dask-dev build as an allowed failure (pydata#2602)
  Fix wrong error message in interp() (pydata#2598)
  Add dayofyear and dayofweek accessors (pydata#2599)
  Fix h5netcdf saving scalars with filters or chunks (pydata#2591)
  Minor update to PR template (pydata#2596)
  Zarr consolidated (pydata#2559)
  fix examples (pydata#2581)
  Fix typo (pydata#2578)
  Concat docstring typo (pydata#2577)
  DOC: remove example using Dataset.T (pydata#2572)
  python setup.py test now works by default (pydata#2573)
  Return slices when possible from CFTimeIndex.get_loc() (pydata#2569)
  DOC: fix computation.rst (pydata#2567)
@TomNicholas mentioned this pull request on Dec 17, 2018
shoyer pushed a commit that referenced this pull request Jun 25, 2019
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting