ENH: add max dist to NearestNDInterpolator #19483

harshilkamdar · 2023-11-07T02:44:39Z

Reference issue

This does not close a particular issue as far as I know, but it does speed up nearest neighbor interpolation by a significant amount in cases where a max distance arg is passed.

What does this implement/fix?

This allows for far quicker queries and nearest neighbor interpolation for large datasets where prior information about when/where to interpolate is known. In some 2D geospatial examples with tens of millions of points, this leads to a speedup of anywhere from 2-30x.
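A minimal sketch of the mechanism this relies on, calling cKDTree.query directly rather than the interpolator. The points, queries and the 0.5 bound are made up for illustration. The distance_upper_bound argument lets the tree prune its search, and a query with no neighbor inside the bound comes back with an infinite distance and an index equal to the number of data points.

```python
import numpy as np
from scipy.spatial import cKDTree

# Four illustrative data points on the unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tree = cKDTree(points)

# One query close to a data point, one far away from all of them.
queries = np.array([[0.1, 0.1], [5.0, 5.0]])

# Neighbors farther away than the bound are not reported.
dist, idx = tree.query(queries, distance_upper_bound=0.5)

# Misses are marked with an infinite distance and an index equal to
# the number of data points, so they must be masked before indexing.
found = np.isfinite(dist)
print(found)
print(idx)
```

Because the miss index is out of range for the values array, any code that indexes values with idx has to apply the found mask first.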

Additional information

n/a. This is my first PR here -- please let me know if I did anything wrong.

ev-br

Looks very reasonable!

My main comment is about naming: I think it would be better to

keep the name of the parameter consistent with KDTree, and
maybe think about exposing other KDTree.query parameters where it makes sense.

The second part here is optional of course.

ev-br · 2023-11-08T08:46:06Z

One other thing (sorry for the piecemeal review, only noticed this being next to **tree_options): you are adding these new arguments to the constructor and only use them at the call site. Is there a need to store workers and distance_upper_bound on the instance? Maybe better mirror KDTree and only supply them in __call__.

harshilkamdar · 2023-11-08T15:02:57Z

No worries at all. My rationale was that I'm looking at making interpolate.griddata more flexible in the future.

Right now, griddata does not support passing any options to the underlying interpolator (nearest, linear, or cubic). I'm hoping to change that in a new PR and figured it would be cleaner to only expose and use class-level options and keep __call__ clean for now. Linear doesn't have interesting options, but CloughTocher2DInterpolator does.

This is not a super strong reason, so I'm happy to go with whatever you prefer.

ev-br · 2023-11-09T19:00:59Z

Right. So delineating between __init__ and __call__ optional arguments would require two sets of dicts as inputs to griddata (unpacked or not), and those would be method-dependent. Not nice.
OTOH, your current approach is that everything is passed in a constructor and the __call__ signature only has data points and nothing else. Which is a reasonable invariant, so let's roll with it I'd say. WDYT @j-bowhay ?

I'd then ask for a small tweak:

please add a leading underscore to things you store on the instance (as in self._workers is private, while self.workers is a part of public API).
[EDITED IN] could you please stress that tree_options are passed to the underlying KDTree constructor.

In fact, it's not immediately clear from the docstring that this whole interpolator is for unstructured data and is KDTree-based. Would be great if you could update that in a follow-up PR.

j-bowhay · 2023-11-09T20:38:33Z

I appreciate the argument around future work on griddata, but it seems cumbersome that I would have to rebuild my interpolator just because I want to change the workers / max dist. Could we not add an options dict argument to griddata (as we do for minimize and friends), so that each method uses it where needed (whether that be __init__ or __call__)?

harshilkamdar · 2023-11-11T02:25:54Z

@j-bowhay, @ev-br: thank you both for your careful feedback. I have updated the PR based on both your comments in the latest commit - ff1d52f. It now takes in a query_options dict and exposes some more cKDTree functionality. I've also added a couple more tests.

In fact, it's not immediately clear from the docstring that this whole interpolator is for unstructured data and is KDTree-based. Would be great if you could update that in a follow-up PR.

I will do this in a separate documentation PR.

j-bowhay · 2023-11-11T08:23:11Z

Sorry, I wasn't quite clear. I was proposing that griddata take some kind of option dict to pass to the underlying interpolator, and a signature like NearestNDInterpolator.__call__(*args, distance_upper_bound=np.inf, etc). But perhaps let's see what @ev-br thinks, so we don't have to make lots of back and forth changes.
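To make the two options concrete, here is a rough sketch of the call-time variant. NearestSketch is a hypothetical stand-in written for this note, not SciPy's class, and it only handles 1-D real values. Tree construction options go to the constructor, query options go to __call__, and the stored attributes are private (leading underscore) as requested earlier in the thread.

```python
import numpy as np
from scipy.spatial import cKDTree


class NearestSketch:
    """Hypothetical stand-in, not SciPy's NearestNDInterpolator."""

    def __init__(self, x, y, tree_options=None):
        # tree_options are forwarded to the cKDTree constructor only.
        self._tree = cKDTree(np.asarray(x, dtype=float), **(tree_options or {}))
        self._values = np.asarray(y, dtype=float)

    def __call__(self, xi, **query_options):
        # query_options (e.g. distance_upper_bound, workers) are forwarded
        # to cKDTree.query only, so they can change from call to call.
        dist, idx = self._tree.query(np.asarray(xi, dtype=float), **query_options)
        out = np.full(dist.shape, np.nan)
        ok = np.isfinite(dist)          # False where no neighbor was found
        out[ok] = self._values[idx[ok]]
        return out


interp = NearestSketch([[0.0, 0.0], [1.0, 1.0]], [10.0, 20.0],
                       tree_options={"leafsize": 8})
xi = np.array([[0.1, 0.0], [3.0, 3.0]])

unbounded = interp(xi)                          # every query gets a value
bounded = interp(xi, distance_upper_bound=0.5)  # the far query becomes NaN
```

The same interp instance serves both calls, which is the point raised above: changing workers or the max distance does not require rebuilding the tree.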

ev-br · 2023-11-11T10:54:45Z

To be clear: I'm fine with either approach. I'm sold on the idea that the inconvenience of lumping together constructor and call arguments is rather minor and that a cleaner griddata signature outweighs it; I can also live with an optimize-like approach. As long as we are not adding public attributes to interpolators, I'm OK and defer to @j-bowhay for the final API specification.

harshilkamdar · 2023-11-12T01:38:51Z

Sounds good - sorry about the added confusion. @j-bowhay let me know which option you prefer and I can implement. Either is fine on my end and should likely be minimal effort

j-bowhay

How about an interface like this?

scipy/interpolate/_ndgriddata.py

harshilkamdar · 2023-11-15T02:30:08Z

Thanks both for your patience. @j-bowhay - I like it. I have committed and updated tests + docs and have re-requested review.

ev-br · 2023-11-15T15:04:44Z

CI failures look related, could you please take a look

harshilkamdar · 2023-11-16T15:14:54Z

Sorry, dumb bug. Fixed now.

ev-br · 2023-11-16T16:41:56Z

CI is still unhappy :-(.
Might be faster to iterate locally: the incantation is $ python dev.py test -s interpolate or -t path/to/test/file.

harshilkamdar · 2023-11-22T11:41:11Z

@ev-br - good call.

Through the griddata test failures, I've found that both NearestNDInterpolator and griddata silently support two things that I wasn't originally accounting for: the case where __call__ gets an n-D xi argument, and the case where the y-values for the interpolation are also multidimensional.

I've come up with what is hopefully an acceptable solution for these two cases and updated the docs and comments for NearestNDInterpolator to reflect this. I couldn't get scipy to build locally for some annoying reasons, but I have tried to make sure that all tests in test_ndgriddata.py pass by checking manually. If there end up being failures, I can fix them quickly.

np.full with int dtype and nan-setting leads to weirdness.
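The exact symptom behind this commit note is not shown in the thread. One likely ingredient is that integer arrays cannot represent NaN, so a NaN-filled holder for "no neighbor found" results needs a floating (or complex) dtype. A small sketch with made-up values:

```python
import numpy as np

int_values = np.array([1, 2, 3])

# An integer array has no NaN: assigning one raises instead of storing it.
holder = np.zeros(3, dtype=int_values.dtype)
try:
    holder[0] = np.nan
    raised = False
except ValueError:
    raised = True

# A float holder can carry both the integer data and the NaN marker.
safe = np.full(3, np.nan)      # default dtype is float64
safe[:2] = int_values[:2]      # last entry stays NaN, standing in for a miss
```

The trade-off is that integer input values come back as floats once NaN is used as the missing-value marker.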

ev-br

OK, this is getting close! I like the flatten-compute-restore logic, it makes a lot of sense.

I've left several comments about 1) minor simplifications of implementation details and 2) clarity of comments/documentation.

The story about stacks of arrays is always confusing, so let's try to clarify it here. So, there are three arrays: x, y and xi.

Let's see if I got it right:

x is always 2D
y is 1D or a stack of 1D, where the first dimension is the same as the second (or first?) dimension of x, and trailing dimensions represent the stack of 1D arrays.
xi is normally a 1D array of dimension ndim or a stack of 1D arrays, where the stacking is along the leading dimension (this is what the word "broadcastable" hints at in https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.NearestNDInterpolator.__call__.html#scipy.interpolate.NearestNDInterpolator.__call__ ).

I think it would be best to explain this (or a correct version of it) to clearly separate what is the number of dimensions, what is the number of data points and how stacked dimensions come into play.
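A quick shape check can help pin this down. The sketch below uses random data and arbitrary sizes chosen for illustration. As far as I can tell, x has shape (npoints, ndim), the first dimension of y matches the first dimension of x (npoints), and the result combines the leading dimensions of xi with the trailing dimensions of y.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

rng = np.random.default_rng(0)
npoints, ndim = 50, 2

x = rng.random((npoints, ndim))       # data sites: (npoints, ndim)
y_scalar = rng.random(npoints)        # one value per site: (npoints,)
y_stack = rng.random((npoints, 3))    # three values per site: (npoints, 3)

xi_flat = rng.random((7, ndim))       # 7 query points: (7, ndim)
xi_grid = rng.random((4, 5, ndim))    # a 4x5 batch of points: (4, 5, ndim)

out_a = NearestNDInterpolator(x, y_scalar)(xi_flat)   # shape (7,)
out_b = NearestNDInterpolator(x, y_scalar)(xi_grid)   # shape (4, 5)
out_c = NearestNDInterpolator(x, y_stack)(xi_grid)    # shape (4, 5, 3)
```

In every case the last axis of xi is the coordinate axis and is consumed by the lookup.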
Two general considerations to keep in mind:

the docs for NearestNDInterpolator, LinearNDInterpolator and CloughTocher2DInterpolator should look relatively consistent. If one of them supports trailing/leading dimensions and the other does not, OK, so be it. But going from nearest to linear interpolator in a simple case should be seamless, and so should be the docs.
Maybe it'd be useful to take a look at how RegularGridInterpolator documents essentially the same issue. Ideally, regular and scattered interpolants look similar.

Finally, since the documentation story is not strictly speaking related to your original changes, it would be perfectly fine to undo the docstring changes and keep the documentation update for a follow-up PR (which would be very welcome indeed).

ev-br · 2023-11-22T14:51:35Z

scipy/interpolate/_ndgriddata.py

+        # (1) the case where xi is of some dimension (n, m, ..., D), where D is the coordinate dimension, and
+        # (2) the case where y is multidimensional (npoints, k, l, ...).
+        # We will first flatten xi to deal with case (1) and build an intermediate return array with shape
+        # (n*m*.., k, l, ...) and then reshape back to (n, m, ..., k, l, ...).


n*m* is confusing: what does the trailing asterisk stand for?

Yeah, I'm not being careful here in my words. Have given the description another go.
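For readers following the shape discussion, a standalone sketch of the flatten, compute, restore idea follows. It is written for this note and is not the code in the PR: nearest_with_bound is a hypothetical helper, it rebuilds the tree on every call, and it handles real-valued y only.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_with_bound(x, y, xi, **query_options):
    """Sketch only: nearest-neighbor lookup, NaN where no neighbor is found."""
    x, y, xi = np.asarray(x, float), np.asarray(y, float), np.asarray(xi, float)
    tree = cKDTree(x)

    # Flatten: (..., ndim) -> (nqueries, ndim).
    xi_flat = xi.reshape(-1, xi.shape[-1])

    # Compute: misses come back with an infinite distance.
    dist, idx = tree.query(xi_flat, **query_options)
    found = np.isfinite(dist)

    # Holder of shape (nqueries, *trailing dims of y), pre-filled with NaN.
    out = np.full((xi_flat.shape[0],) + y.shape[1:], np.nan)
    out[found] = y[idx[found]]

    # Restore: (nqueries, ...) -> (leading dims of xi) + (trailing dims of y).
    return out.reshape(xi.shape[:-1] + y.shape[1:])


x = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # (npoints, 3)
xi = np.zeros((2, 2, 2))                           # (2, 2, ndim)
xi[1, 1] = [9.0, 9.0]                              # far from every data point

result = nearest_with_bound(x, y, xi, distance_upper_bound=0.5)
```

Here result has shape (2, 2, 3): the two leading dimensions come from xi, the trailing one from y, and the entry for the far query point is all NaN.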

ev-br · 2023-11-22T14:59:07Z

scipy/interpolate/_ndgriddata.py

-        dist, i = self.tree.query(xi)
-        return self.values[i]
+
+        # We need to handle two important cases for compatibility with a flexible griddata:


let's not bring griddata here. The rest of the comment is also confusing (at least to me, and this is not the first time I'm seeing the story about leading/trailing dimensions).

I'm not sure what your m, n, k and l refer to TBH. Maybe take a look at how RegularGridInterpolator documents essentially the same story, possible trailing dimensions in y and xi:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.RegularGridInterpolator.__call__.html#scipy.interpolate.RegularGridInterpolator.__call__

I'm not saying RGI does it perfectly (it does not); all I'm saying is that you can maybe come up with a better formulation for both (likely in a follow-up PR).

Yeah, this is confusing and griddata is not relevant - updated.

harshilkamdar · 2023-11-22T18:47:00Z

@ev-br - thanks for the thorough comments. I think you're right that I'm confusing nD/2D and trailing dimensions. The pointer to RegularGridInterpolator was very helpful.

I have considered most of your comments and hopefully addressed the bulk of them in the new commits. Hopefully this is a more accurate version of what's going on. I will leave any remaining doc fixes to a separate PR for now.

ev-br

All my comments have been addressed, I think it is a great addition, so +1 from me.

Unless there are further comments, am going to hit the green button later this week.

ev-br · 2023-12-03T10:54:45Z

Am going to merge now, based on two core dev approvals. Thanks Jake for the review, thank you @harshilkamdar for the enhancement and congrats on what I believe is your first SciPy commit. Keep them coming!

harshilkamdar added 2 commits November 6, 2023 21:35

ENH: query_max_dist option to speedup interpolate.NearestNDInterpolator

84fde3f

This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.

ENH: query_max_dist option to speedup interpolate.NearestNDInterpolator

ef4e6c5

This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.

harshilkamdar requested a review from ev-br as a code owner November 7, 2023 02:44

dschmitz89 added the enhancement and scipy.interpolate labels Nov 7, 2023

j-bowhay reviewed Nov 7, 2023

ev-br requested changes Nov 7, 2023

harshilkamdar and others added 3 commits November 7, 2023 20:01

MAINT: apply suggestions from code review

cf2e6b7

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>

ENH: distance_upper_bound & workers for NearestNDInterpolator

2fa01c9

This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.

BUG: add missing self for workers

b625a28

ENH: supply query_options to NearestNDInterpolator

ff1d52f

j-bowhay reviewed Nov 14, 2023

harshilkamdar and others added 2 commits November 14, 2023 21:20

MAINT: apply suggestions from code review

b867112

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>

MAINT: update tests and doc with new API

0041cac

harshilkamdar requested review from ev-br and j-bowhay November 15, 2023 02:30

BUG: check all dists instead of if.

2f9f2f2

harshilkamdar added 3 commits November 15, 2023 22:30

Merge branch 'scipy:main' into add-max-dist-nearestnd-interpolate

839f533

BUG: update old test.

43ea043

Merge remote-tracking branch 'origin/add-max-dist-nearestnd-interpolate' into add-max-dist-nearestnd-interpolate

8f7f8b2

harshilkamdar added 2 commits November 22, 2023 01:45

BUG: fix nD case

3d750bf

BUG: fix tests and multidimensional query & y points

2e497d3

BUG: fix test case for complex dtypes.

4d01d49

np.full with int dtype and nan-setting leads to weirdness.

ev-br reviewed Nov 22, 2023

j-bowhay reviewed Nov 22, 2023

DOC: fix documentation for dimensionality to be more accurate

7c3367a

harshilkamdar and others added 4 commits November 22, 2023 13:48

Update scipy/interpolate/_ndgriddata.py

35c9577

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>

DOC: fix documentation to be clearer for y values

b142521

Merge remote-tracking branch 'origin/add-max-dist-nearestnd-interpolate' into add-max-dist-nearestnd-interpolate

4189d20

DOC: forgot dim

ca3f52f

harshilkamdar requested a review from ev-br November 23, 2023 11:14

ev-br reviewed Nov 30, 2023

DOC: a trivial doc tweak

fd06795

ev-br approved these changes Nov 30, 2023

ev-br added this to the 1.12.0 milestone Dec 3, 2023

j-bowhay approved these changes Dec 3, 2023

ev-br merged commit 1e7726d into scipy:main Dec 3, 2023
26 of 28 checks passed

ev-br mentioned this pull request Apr 15, 2024

Add NearestNDInterpolator to cupyx.scipy.interpolate cupy/cupy#8220

Open
