interpolate_na: Add max_gap support. #3302

dcherian · 2019-09-12T15:07:20Z

- Closes #2392 (Improving interpolate_na()'s limit argument)
- Tests added
- Passes black . && mypy . && flake8
- Fully documented, including whats-new.rst for all changes and api.rst for new API

@dnowacki-usgs : can you look this over and test it out if you have time? feel free to push any changes to this branch :)

max-sixty

Awesome, looks good!

I had a cursory check through the algo; a couple more scenarios tested would be good, in particular gaps in the middle of the values

Any thoughts on maxgap? limit_gap? Neither clicks that well!

dcherian · 2019-09-12T15:55:49Z

Thanks @max-sixty I've updated the tests.

> Any thoughts on maxgap? limit_gap? Neither clicks that well!

I pulled maxgap from this ~~stalled~~ Pandas PR: pandas-dev/pandas#25141
OK it looks like that PR is alive again, so maybe it's good to keep the same kwarg?

max-sixty · 2019-09-12T16:17:46Z

> OK it looks like that PR is alive again, so maybe it's good to keep the same kwarg?

Ah, agree we should align. I'm really not keen on that name but yes on balance; unless they're open to changing

max-sixty · 2019-09-12T22:52:24Z

As per the pandas issue, sounds like max_gap is consensus

dcherian · 2019-09-13T03:00:42Z

👍

stefraynaud · 2019-09-13T14:00:30Z

Nice feature.
How about adding support for max gaps expressed in physical units, since coordinates may be irregular?

dcherian · 2019-09-13T15:29:34Z

Thanks @stefraynaud. I'm having trouble figuring out how to define the length of a gap in the irregular coordinate case.

e.g.

da4 = xr.DataArray([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan], 
                   dims=["y"], coords={"y": [0, 2, 5, 6, 7, 8, 10, 12, 14]})

<xarray.DataArray (y: 9)>
array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
Coordinates:
  * y        (y) int64 0 2 5 6 7 8 10 12 14

What is the length of these three gaps given that xarray doesn't have any understanding of grids?

max-sixty · 2019-09-13T15:36:08Z

I think using locations rather than counts would be great, but harder and doesn't have to be part of this PR.

In the example above, it looks like 1 is aligned with 6 and 4 with 10, so the gap in locations along the y dimension would be 4?
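
To make the two candidate definitions concrete, here is a minimal plain-Python sketch (not the PR's implementation; `nan_runs` and `gap_lengths` are hypothetical helper names). It reports, for each run of NaNs, both the number of missing points and the coordinate distance between the valid points on either side. Runs that touch an edge have no valid point on one side, so their span is left undefined here.

```python
import math


def nan_runs(values):
    """Return (start, stop) index pairs, stop exclusive, for each run of NaNs."""
    runs, start = [], None
    for i, v in enumerate(values):
        if math.isnan(v):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(values)))
    return runs


def gap_lengths(values, coord):
    """For each NaN run, return (NaN count, coordinate span).

    The span is coord[first valid point after] - coord[last valid point before].
    It is None for runs touching either edge, where one neighbour is missing.
    """
    out = []
    for start, stop in nan_runs(values):
        count = stop - start
        interior = start > 0 and stop < len(values)
        span = coord[stop] - coord[start - 1] if interior else None
        out.append((count, span))
    return out


nan = math.nan
values = [nan, nan, nan, 1, nan, nan, 4, nan, nan]
y = [0, 2, 5, 6, 7, 8, 10, 12, 14]
print(gap_lengths(values, y))  # [(3, None), (2, 4), (2, None)]
```

The middle run contains two NaNs but spans 10 - 6 = 4 coordinate units, which is the distinction under discussion.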

dcherian · 2019-09-13T16:00:33Z

OK, added a test, and it now raises an error for irregularly spaced coordinates. I agree that this should be good for now.

max-sixty · 2019-09-13T22:15:13Z

xarray/tests/test_missing.py

+            [0, 2, 5, 6, 7, 8, 10, 12, 14],
+            [[6, 6, 6, 0, 2, 2, 0, 3, 3]],
+            marks=pytest.mark.xfail(
+                reason="max_gap with irregularly spaced coordinate."


Forgive me if I'm being slow—is the max_gap measuring the locations (i.e. on the index), or the number of values? The example below seems to be counting two values, rather than measuring the space between the locations.

If that's right, why does it matter that the coordinates are irregular (or even non-monotonic)?

stefraynaud · 2019-09-16T07:50:26Z

> Thanks @stefraynaud . I'm having trouble figuring out defining the length of a gap in the irregular coordinate case.
>
> e.g.
> da4 = xr.DataArray([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 4, np.nan, np.nan],
>                    dims=["y"], coords={"y": [0, 2, 5, 6, 7, 8, 10, 12, 14]})
> <xarray.DataArray (y: 9)>
> array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
> Coordinates:
>   * y        (y) int64 0 2 5 6 7 8 10 12 14
> What is the length of these three gaps given that xarray doesn't have any understanding of grids?

@dcherian In your example, as @max-sixty said, the middle gap has a length of 10-6=4. The lengths of the gaps at the edges cannot be computed, but that doesn't matter: there the algorithm should work the same way as when simply counting the NaNs.

I'll have a look at the code, maybe for a new PR after this one.

dcherian · 2019-09-16T14:39:41Z

The thing I find weird is that for

<xarray.DataArray (y: 9)>
array([nan, nan, nan,  1., nan, nan,  4., nan, nan])
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8

the center gap's length = 6-3 = 3 (the valid points sit at y=3 and y=6), which is the number of NaNs + 1. But maybe this is OK.
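
The "+ 1" follows directly from measuring between the two valid points that bracket the gap. As a short worked relation, assuming a uniform spacing $\Delta$ and $n$ consecutive NaNs between valid points at positions $i$ and $i+n+1$ (these symbols are introduced here for illustration only):

```latex
L = y_{i+n+1} - y_i = (n + 1)\,\Delta
```

In the example above, $n = 2$ and $\Delta = 1$, with valid points at $y = 3$ and $y = 6$, so $L = 3$.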

- We should check what that pandas PR does and align with that.
- interp calls scipy.interpolate.interp which does do extrapolation, so we should figure out a sensible solution for the edges (extrapolating coordinates using the first and last spacing seems reasonable to me).
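
One possible reading of the edge suggestion above, sketched in plain Python: pad the coordinate with one virtual point on each side, reusing the first and last spacing, and measure edge gaps against those virtual points. This only illustrates the idea as floated in this comment. `pad_coordinate` and `edge_gap_lengths` are hypothetical names, the coordinate is assumed to have at least two points, and nothing here is claimed to match the behaviour that was eventually merged.

```python
import math


def pad_coordinate(coord):
    """Add one virtual point per side, reusing the first and last spacing."""
    first = coord[0] - (coord[1] - coord[0])
    last = coord[-1] + (coord[-1] - coord[-2])
    return [first, *coord, last]


def edge_gap_lengths(values, coord):
    """Return (leading, trailing) gap lengths, or None where no edge gap exists.

    Each edge gap is measured from the nearest valid data point to the
    virtual coordinate just outside the array.
    """
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not valid:
        return None, None
    padded = pad_coordinate(coord)
    first_valid, last_valid = valid[0], valid[-1]
    leading = coord[first_valid] - padded[0] if first_valid > 0 else None
    trailing = (
        padded[-1] - coord[last_valid] if last_valid < len(values) - 1 else None
    )
    return leading, trailing


nan = math.nan
values = [nan, nan, nan, 1, nan, nan, 4, nan, nan]
y = [0, 2, 5, 6, 7, 8, 10, 12, 14]
print(pad_coordinate(y))            # [-2, 0, 2, 5, 6, 7, 8, 10, 12, 14, 16]
print(edge_gap_lengths(values, y))  # (8, 6)
```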

@stefraynaud I don't have time to work on this now. Please feel free to modify this and open a new PR. You could try to push to this branch but I'm not sure it will work.

max-sixty · 2019-09-16T16:02:57Z

IIUC, and please correct me if I'm wrong, the pandas version counts points rather than the distance between locations. Ideally we'd be able to do both, but even if we can only have one working correctly, that would be v good

dcherian · 2019-09-17T14:53:52Z

yes, I think you are right. I was thinking that it would be nice to have the number-of-nan-points and gap-length metrics converge for uniformly spaced coordinates but I don't think that's possible in any sensible way.

dcherian · 2019-10-22T15:10:19Z

Thanks @dnowacki-usgs that's a nice test. I think the right fix is to make index a Variable so that we get automatic broadcasting.

Things left to do (or at least add xfail tests + errors):

- support for cftime indexes and offsets
- what to do when use_coordinate=False
- what to do with unlabeled dimensions
- document convention for gap length
- add examples to docs
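
As an illustration of why the gap-length convention needs documenting, here is a minimal plain-Python sketch of the max_gap idea under one assumed convention: an interior gap's length is the coordinate distance between the valid points on either side, and a gap is filled only when that length is at most max_gap. `interpolate_na_1d` is a hypothetical helper, not xarray's implementation, and edge gaps are simply left as NaN.

```python
import math


def interpolate_na_1d(values, coord, max_gap=None):
    """Linearly fill interior NaN runs whose coordinate span is <= max_gap.

    With max_gap=None every interior run is filled. Leading and trailing
    runs are never filled in this sketch.
    """
    out = list(values)
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(valid, valid[1:]):
        if right - left == 1:
            continue  # adjacent valid points, nothing to fill
        span = coord[right] - coord[left]
        if max_gap is not None and span > max_gap:
            continue  # gap too long under the assumed convention
        for i in range(left + 1, right):
            weight = (coord[i] - coord[left]) / span
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


nan = math.nan
values = [nan, 1.0, nan, nan, 4.0, nan, nan, nan, 12.0]
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]

filled = interpolate_na_1d(values, x, max_gap=3)
# The gap between x=1 and x=4 spans 3 units, so it is filled.
# The gap between x=4 and x=8 spans 4 units, so it stays NaN.
print(filled)
```

Under a count-based convention the same max_gap=3 would also fill the second gap, since it contains exactly three NaNs, so the choice of convention changes the result.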

dcherian · 2019-10-25T15:14:51Z

Done for now. Ready for final review / testing.

dcherian · 2019-11-04T21:44:49Z

This could use another round of testing / review

(cc @dnowacki-usgs @stefraynaud @max-sixty )

dnowacki-usgs · 2019-11-04T23:34:12Z

Thanks for all your work @dcherian! Did a quick test with some real-world timeseries data I've been wanting to use with max_gap and it looks good to me. I will definitely be using this in the future! 👍

dcherian · 2019-11-15T14:53:12Z

I'm going to merge this. Happy to make any other changes.

max-sixty · 2019-11-15T19:49:12Z

Great, thanks @dcherian !

max-sixty reviewed Sep 12, 2019

max-sixty mentioned this pull request Sep 12, 2019

ENH: Added max_gap keyword for series.interpolate pandas-dev/pandas#25141

dcherian changed the title ~~interpolate_na: Add maxgap support.~~ interpolate_na: Add max_gap support. Sep 13, 2019

max-sixty self-requested a review September 13, 2019 15:28

max-sixty approved these changes Sep 13, 2019

dcherian changed the title ~~interpolate_na: Add max_gap support.~~ [WIP] interpolate_na: Add max_gap support. Sep 13, 2019

dcherian changed the title ~~[WIP] interpolate_na: Add max_gap support.~~ interpolate_na: Add max_gap support. Sep 13, 2019

max-sixty reviewed Sep 13, 2019

stefraynaud reviewed Sep 16, 2019

stefraynaud reviewed Sep 16, 2019

stefraynaud reviewed Sep 16, 2019

dcherian changed the title ~~interpolate_na: Add max_gap support.~~ [WIP] interpolate_na: Add max_gap support. Sep 16, 2019

dcherian and others added 6 commits October 15, 2019 22:18

interpolate_na: Add maxgap support.

ad6f35b

Add docs.

9275d89

Add requires_bottleneck to test.

47a7cf5

Review comments.

711b2a9

maxgap → max_gap

4cad630

Update xarray/core/dataarray.py

02b93c9

Co-Authored-By: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

dcherian added 4 commits October 21, 2019 17:04

fix whats-new

6e857f0

small fixes.

6f54616

fix dan's test.

db0c5f3

remove redundant test.

1127c61

dcherian added 8 commits October 23, 2019 09:16

nicer error message.

4e27c94

Add xfailed cftime tests

179eff1

better error checking and tests.

9de946f

typing.

a411cc2

Merge remote-tracking branch 'upstream/master' into interp-na-maxgap

fde2c14

* upstream/master: Escaping dtypes (pydata#3444) Html repr (pydata#3425)

update docstrings

4bda699

scipy intersphinx

4acdd3b

dcherian changed the title ~~[WIP] interpolate_na: Add max_gap support.~~ interpolate_na: Add max_gap support. Oct 25, 2019

dcherian and others added 4 commits October 25, 2019 09:20

Merge branch 'master' into interp-na-maxgap

cb3a3f1

fix tests

d9410b1

add bottleneck testing decorator.

d844ba7

dcherian requested a review from max-sixty November 9, 2019 22:37

Merge branch 'master' into interp-na-maxgap

2381a80

dcherian merged commit ee9da17 into pydata:master Nov 15, 2019

dcherian deleted the interp-na-maxgap branch November 15, 2019 14:53
