Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add max dist to NearestNDInterpolator #19483

Merged
merged 21 commits into from Dec 3, 2023

Conversation

harshilkamdar
Copy link
Contributor

@harshilkamdar harshilkamdar commented Nov 7, 2023

Reference issue

This does not close a particular issue as far as I know, but does speed up nearest neighbor interpolation by a siginificant amount in cases where a max distance arg is passed.

What does this implement/fix?

This allows for far quicker queries and nearest neighbor interpolation for large datasets where prior information about when/where to interpolate is known. In some 2D geospatial examples with tens of millions of points, this leads to a speedup of anywhere from 2-30x.

Additional information

n/a. This is my first PR here -- please let me know if I did anything wrong.

This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.
This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.
@dschmitz89 dschmitz89 added enhancement A new feature or improvement scipy.interpolate labels Nov 7, 2023
Copy link
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very reasonable!

My main comment is about naming: I think it would be better to

  1. keep the name of the parameter consistent with KDTree, and
  2. maybe think about exposing other KDTree.query parameters where it makes sense.

The second part here is optional of course.

scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
harshilkamdar and others added 3 commits November 7, 2023 20:01
Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.
@ev-br
Copy link
Member

ev-br commented Nov 8, 2023

One other thing (sorry for the piecemeal review, only noticed this being next to **tree_options): you are adding these new arguments to the constructor and only use them at the call site. Is there a need to store workers and distance_upper_bound on the instance? Maybe better mirror KDTree and only supply them in __call__.

@harshilkamdar
Copy link
Contributor Author

harshilkamdar commented Nov 8, 2023

No worries at all. My rationale was looking to interpolate.griddata in the future and making it more flexible.

Right now, griddata does not support passing any options to the underlying interpolator (nearest, linear, or cubic). I'm hoping to change that in a new PR and figured it would be cleaner to only expose and use class-level options and keeping __call__ clean for now. Linear doesn't have interesting options, but CloughTocher2DInterpolator does.

This is not a super strong reason, so I'm happy to go with whatever you prefer.

@ev-br
Copy link
Member

ev-br commented Nov 9, 2023

Right. So delineating between __init__ and __call__ optional arguments would require two sets of dicts as inputs to griddata (unpacked or not), and those would be method-dependent. Not nice.
OTOH, you current approach is that everything is passed in a constructor and __call__ signature only has data points and nothing else. Which is a reasonable invariant, so let's roll with it I'd say. WDYT @j-bowhay ?

I'd then ask for a small tweak:

  • please add a leading underscore to things you store on the instance (as in self._workers is private, while self.workers is a part of public API).
  • [EDITED IN] could you please stress that tree_options are passed to the underlying KDTree constructor.

In fact, it's not immediately clear from the docstring that this whole interpolator is for unstructured data and is KDTree-based. Would be great if you could update that in a follow-up PR.

@j-bowhay
Copy link
Member

j-bowhay commented Nov 9, 2023

I appreciate the argument around future work on griddata but it seems cumbersome that I would have to rebuild my interpolator just because I want to change the workers / max dist. Could we not add an options dict argument to griddata (as we do for minimise and friends) then each method uses this where needed (whether that be __init__ or __call__)

@harshilkamdar
Copy link
Contributor Author

@j-bowhay, @ev-br: thank you both for your careful feedback. I have updated the PR based on both your comments in the latest commit - ff1d52f. It now takes in a query_options dict and exposes some more cKDTree functionality. I've also added a couple more tests.

In fact, it's not immediately clear from the docstring that this whole interpolator is for unstructured data and is KDTree-based. Would be great if you could update that in a follow-up PR.

I will do this in a separate documentation PR.

@j-bowhay
Copy link
Member

Sorry I wasn't quite clear, I was proposing the griddata take some kind of option dict to pass to the underlying interpolator and a signature like NearestNDInterpolator.__call__(*args, distance_upper_bound=np.inf, etc) but perhaps lets see what @ev-br thinks so we don't have to make lots of back and forth changes

@ev-br
Copy link
Member

ev-br commented Nov 11, 2023

To be clear: I'm fine with either approach. I'm sold on that the inconvenience of lumping together constructor and call arguments is rather minor and a cleaner griddata signature outweights it; I can also live with a optimize-like approach. As long as we are not adding public attributes to interpolators, I'm OK and defer to @j-bowhay for the final API specification.

@harshilkamdar
Copy link
Contributor Author

Sounds good - sorry about the added confusion. @j-bowhay let me know which option you prefer and I can implement. Either is fine on my end and should likely be minimal effort

Copy link
Member

@j-bowhay j-bowhay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about an interface like this?

scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
harshilkamdar and others added 2 commits November 14, 2023 21:20
Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
@harshilkamdar
Copy link
Contributor Author

Thanks both for your patience. @j-bowhay - I like it. I have committed and updated tests + docs and have re-requested review.

@ev-br
Copy link
Member

ev-br commented Nov 15, 2023

CI failures look related, could you please take a look

@harshilkamdar
Copy link
Contributor Author

Sorry, dumb bug. Fixed now.

@ev-br
Copy link
Member

ev-br commented Nov 16, 2023

CI is still unhappy :-(.
Might be faster to iterate locally: the incantation is $ python dev.py test -s interpolate or -t path/to/test/file.

@harshilkamdar
Copy link
Contributor Author

@ev-br - good call.

Through the griddata test failures, I've found that both NearestNDInterpolator and griddata are silently actually supporting two things that I wasn't originally accounting for - the case where __call__ gets an n-D xi argument and the case where the y-values for the interpolation are also multidimensional.

I've come up with hopefully an acceptable solution to account for these two cases and updated docs & comments for NearestNDInterpolator to reflect this. I couldn't get scipy to build locally for some annoying reasons but have tried to make sure that all tests in test_ndgriddata.py pass by checking manually. If there end up being failures, I can fix them quickly.

np.full with int dtype and nan-setting leads to weirdness.
Copy link
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this is getting close! I like the flatten-compute-restore logic, it makes a lot of sense.

I've left several comments about 1) minor simplifications of implementation details and 2) clarity of comments/documentation.

The story about stacks of arrays is always confusing, so let's try to clarify it here. So, there are three arrays: x, y and xi.

Let's see if I got it right:

  • x is always 2D
  • y is 1D or a stack of 1D, where the first dimension is the same as the second (or first?) dimension of x, and trailing dimensions represent the stack of 1D arrays.
  • xi is normally a 1D array of dimension ndim or a stack of 1D arrays, where the stacking is along the leading dimension (this is what broadcastable word hints at inhttps://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.NearestNDInterpolator.call.html#scipy.interpolate.NearestNDInterpolator.call ).

I think it would be best to explain this (or a correct version of it) to clearly separate what is the number of dimensions, what is the number of data points and how stacked dimensions come into play.
Two general considerations to keep in mind:

  • the docs for NearestNDInterpolator, LinearNDInterpolator and CloughtTocher2DInterpolator should look relatively consistent. If one of them supports trailing/leading dimensions and the other does not, OK, so be it. But going from nearest to linear interpolator in a simple case should be seamless, and so should be the docs.
  • Maybe it'd be useful to take a look at how RegularGridInterpolator documents essentially the same issue. Ideally, regular and scattered interpolants look similar.

Finally, since the documentation story is not strictly speaking related to your original changes, it would be perfectly fine to undo the docstring changes and keep the documentation update for a follow-up PR (which would be very welcome indeed).

scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
# (1) the case where xi is of some dimension (n, m, ..., D), where D is the coordinate dimension, and
# (2) the case where y is multidimensional (npoints, k, l, ...).
# We will first flatten xi to deal with case (1) and build an intermediate return array with shape
# (n*m*.., k, l, ...) and then reshape back to (n, m, ..., k, l, ...).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n*m* is confusing: what does the trailing asterisk stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not being careful here in my words. Have given the description another go.

dist, i = self.tree.query(xi)
return self.values[i]

# We need to handle two important cases for compatibility with a flexible griddata:
Copy link
Member

@ev-br ev-br Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not bring griddata here. The rest of the comment is also confusing (at least to me, and this is not the first time I'm seeing the story about leading/trailing dimensions).

I'm not sure what your m, n, k and l refer to TBH. Maybe take a look at how RegularGridInterpolator documents essentially the same story, possible trailing dimensions in y and xi:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.RegularGridInterpolator.__call__.html#scipy.interpolate.RegularGridInterpolator.__call__

I'm not saying RGI does it perfectly (it is not), all I'm saying you maybe can come up with a better formulation for both (likely in a follow-up PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is confusing and griddata is not relevant - updated.

scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
scipy/interpolate/_ndgriddata.py Outdated Show resolved Hide resolved
@harshilkamdar
Copy link
Contributor Author

harshilkamdar commented Nov 22, 2023

@ev-br - thanks for the thorough comments. I think you're right that I'm confusing nD/2D and trailing dimensions. The pointer to RegularGridInterpolator was very helpful.

I have considered most of your comments and hopefully addressed bulk of it in the new commits. Hopefully this is a more accurate version of what's going on. Any remaining doc fix, I will leave to a separate PR for now.

Copy link
Member

@ev-br ev-br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my comments have been addressed, I think it is a great addition, so +1 from me.

Unless there are further comments, am going to hit the green button later this week.

@ev-br ev-br added this to the 1.12.0 milestone Dec 3, 2023
@ev-br
Copy link
Member

ev-br commented Dec 3, 2023

Am going to merge now, based on two core dev approvals. Thanks Jake for the review, thank you @harshilkamdar for the enhancement and congrats with what I believe is your first SciPy commit. Keep them coming!

@ev-br ev-br merged commit 1e7726d into scipy:main Dec 3, 2023
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants