Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

interpolate_na: Add max_gap support. #3302

Merged
merged 35 commits into from Nov 15, 2019
Merged

Conversation

dcherian
Copy link
Contributor

@dnowacki-usgs : can you look this over and test it out if you have time? feel free to push any changes to this branch :)

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, looks good!

I had a cursory check through the algo; a couple more scenarios tested would be good, in particular gaps in the middle of the values

Any thoughts on maxgap? limit_gap? Neither clicks that well!

xarray/tests/test_missing.py Outdated Show resolved Hide resolved
xarray/tests/test_missing.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor Author

Thanks @max-sixty I've updated the tests.

Any thoughts on maxgap? limit_gap? Neither clicks that well!

I pulled maxgap from this stalled Pandas PR: pandas-dev/pandas#25141
OK it looks like that PR is alive again, so maybe it's good to keep the same kwarg?

@max-sixty
Copy link
Collaborator

OK it looks like that PR is alive again, so maybe it's good to keep the same kwarg?

Ah, agree we should align. I'm really not keen on that name but yes on balance; unless they're open to changing

@max-sixty
Copy link
Collaborator

As per the pandas issue, sounds like max_gap is consensus

@dcherian
Copy link
Contributor Author

👍

@dcherian dcherian changed the title interpolate_na: Add maxgap support. interpolate_na: Add max_gap support. Sep 13, 2019
@stefraynaud
Copy link
Contributor

Nice feature.
How about adding the support max gaps expressed in physical units, since coordinates may be irregular?

@max-sixty max-sixty self-requested a review September 13, 2019 15:28
@dcherian
Copy link
Contributor Author

Thanks @stefraynaud . I'm having trouble figuring out defining the length of a gap in the irregular coordinate case.

e.g.

da4 = xr.DataArray([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan], 
                   dims=["y"], coords={"y": [0, 2, 5, 6, 7, 8, 10, 12, 14]})
<xarray.DataArray (y: 9)>
array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
Coordinates:
  * y        (y) int64 0 2 5 6 7 8 10 12 14

What is the length of these three gaps given that xarray doesn't have any understanding of grids?

@dcherian dcherian changed the title interpolate_na: Add max_gap support. [WIP] interpolate_na: Add max_gap support. Sep 13, 2019
@max-sixty
Copy link
Collaborator

I think using locations rather than counts would be great, but harder and doesn't have to be part of this PR.

In the example above, it looks like 1 is aligned with 6 and 4 with 10, so the gap in locations along the y dimension would be 4?

@dcherian dcherian changed the title [WIP] interpolate_na: Add max_gap support. interpolate_na: Add max_gap support. Sep 13, 2019
@dcherian
Copy link
Contributor Author

OK added test and now raises error for irregularly spaced coordinates. I agree that this should be good for now.

[0, 2, 5, 6, 7, 8, 10, 12, 14],
[[6, 6, 6, 0, 2, 2, 0, 3, 3]],
marks=pytest.mark.xfail(
reason="max_gap with irregularly spaced coordinate."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me if I'm being slow—is the max_gap measuring the locations (i.e. on the index), or the number of values? The example below seems to be counting two values, rather than measuring the space between the locations.

If that's right, why does it matter than the coordinates are irregular (or even non-monotonic)?

@stefraynaud
Copy link
Contributor

Thanks @stefraynaud . I'm having trouble figuring out defining the length of a gap in the irregular coordinate case.

e.g.

da4 = xr.DataArray([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan], 
                   dims=["y"], coords={"y": [0, 2, 5, 6, 7, 8, 10, 12, 14]})
<xarray.DataArray (y: 9)>
array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
Coordinates:
  * y        (y) int64 0 2 5 6 7 8 10 12 14

What is the length of these three gaps given that xarray doesn't have any understanding of grids?

@dcherian In your example, as said @max-sixty, the middle gap has a length of 10-6=4. The length gaps at the edges cannot be computed but it doesn't matter, and the algo should work as when simply counting the nans.

I'll have a look the code, maybe for a new PR after this one.

xarray/core/missing.py Outdated Show resolved Hide resolved
xarray/core/missing.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor Author

The thing I find weird is that for

<xarray.DataArray (y: 9)>
array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8

the center gap's length = 7-4 = 3 which is the number of NaNs + 1. But maybe this is OK.

  1. We should check what that pandas PR does and align with that
  2. interp calls scipy.interpolate.interp which does do extrapolation, so we should figure out a sensible solution for the edges (extrapolating coordinates using the first and last spacing seems reasonable to me).

@stefraynaud I don't have time to work on this now. Please feel free to modify this and open a new PR. You could try to push to this branch but I'm not sure it will work.

@dcherian dcherian changed the title interpolate_na: Add max_gap support. [WIP] interpolate_na: Add max_gap support. Sep 16, 2019
@max-sixty
Copy link
Collaborator

IIUC, and please correct me if I'm wrong, the pandas version counts points rather than the distance between locations. Ideally we'd be able to do both, but even if we can only have one working correctly, that would be v good

@dcherian
Copy link
Contributor Author

yes, I think you are right. I was thinking that it would be nice to have the number-of-nan-points and gap-length metrics converge for uniformly spaced coordinates but I don't think that's possible in any sensible way.

@dcherian
Copy link
Contributor Author

dcherian commented Oct 22, 2019

Thanks @dnowacki-usgs that's a nice test. I think the right fix is to make index a Variable so that we get automatic broadcasting.

Things left to do (or at least add xfail tests + errors):

  • support for cftime indexes and offsets
  • what to do when use_coordinate=False
  • what to do with unlabeled dimensions
  • document convention for gap length
  • add examples to docs

* upstream/master:
  minor lint tweaks (pydata#3429)
  Hack around pydata#3440 (pydata#3442)
  Update Terminology page to account for multidimensional coordinates (pydata#3410)
  Use cftime master for upstream-dev build (pydata#3439)
  MAGA (Make Azure Green Again) (pydata#3436)
  Test that Dataset and DataArray resampling are identical (pydata#3412)
  Avoid multiplication DeprecationWarning in rasterio backend (pydata#3428)
  Sync with latest version of cftime (v1.0.4) (pydata#3430)
  Add cftime git tip to upstream-dev + temporarily pin cftime (pydata#3431)
@dcherian
Copy link
Contributor Author

Done for now. Ready for final review / testing.

@dcherian dcherian changed the title [WIP] interpolate_na: Add max_gap support. interpolate_na: Add max_gap support. Oct 25, 2019
dcherian and others added 4 commits October 25, 2019 09:20
* upstream/master:
  __dask_tokenize__ (pydata#3446)
  Type check sentinel values (pydata#3472)
  Fix typo in docstring (pydata#3474)
  fix test suite warnings re `drop` (pydata#3460)
  Fix integrate docs (pydata#3469)
  Fix leap year condition in monthly means example (pydata#3464)
  Hypothesis tests for roundtrip to & from pandas (pydata#3285)
  unpin cftime (pydata#3463)
  Cleanup whatsnew (pydata#3462)
  enable xr.ALL_DIMS in xr.dot (pydata#3424)
  Merge stable into master (pydata#3457)
  upgrade black verison to 19.10b0 (pydata#3456)
  Remove outdated code related to compatibility with netcdftime (pydata#3450)
  Remove deprecated behavior from dataset.drop docstring (pydata#3451)
  jupyterlab dark theme (pydata#3443)
  Drop groups associated with nans in group variable (pydata#3406)
  Allow ellipsis (...) in transpose (pydata#3421)
  Another groupby.reduce bugfix. (pydata#3403)
  add icomoon license (pydata#3448)
@dcherian
Copy link
Contributor Author

dcherian commented Nov 4, 2019

This could use another round of testing / review

(cc @dnowacki-usgs @stefraynaud @max-sixty )

@dnowacki-usgs
Copy link
Contributor

Thanks for all your work @dcherian! Did a quick test with some real-world timeseries data I've been wanting to use with max_gap and it looks good to me. I will definitely be using this in the future! 👍

@dcherian
Copy link
Contributor Author

I'm going to merge this. Happy to make any other changes.

@dcherian dcherian merged commit ee9da17 into pydata:master Nov 15, 2019
@dcherian dcherian deleted the interp-na-maxgap branch November 15, 2019 14:53
@max-sixty
Copy link
Collaborator

Great, thanks @dcherian !

dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master:
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 22, 2019
* master: (24 commits)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  tweak whats-new. (pydata#3540)
  small simplification of rename from pydata#3532 (pydata#3539)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improving interpolate_na()'s limit argument
4 participants