ENH: add max dist to NearestNDInterpolator #19483
This allows for far quicker queries for large datasets where prior information about when/where to interpolate is known.
Looks very reasonable!
My main comment is about naming: I think it would be better to
- keep the name of the parameter consistent with KDTree, and
- maybe think about exposing other KDTree.query parameters where it makes sense.
The second part here is optional of course.
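For context, a minimal sketch of the KDTree.query parameter this naming comment refers to, distance_upper_bound. The points and query locations below are made up for illustration; the sketch only shows how the tree itself reports a query with no neighbor inside the bound.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical scattered data sites and query locations (illustration only).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.1, 0.1], [5.0, 5.0]])

tree = KDTree(points)

# Only neighbors closer than distance_upper_bound are returned.
dist, idx = tree.query(queries, distance_upper_bound=0.5)

# A query with no neighbor inside the bound is reported with an
# infinite distance and an index equal to the number of points (tree.n).
has_neighbor = np.isfinite(dist)
```

Because the search can stop early once the bound is exceeded, a finite bound lets the tree prune more aggressively, which is where a speedup would come from.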
Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
One other thing (sorry for the piecemeal review, only noticed this being next to |
No worries at all. My rationale was looking to Right now, griddata does not support passing any options to the underlying interpolator (nearest, linear, or cubic). I'm hoping to change that in a new PR and figured it would be cleaner to only expose and use class-level options and keeping This is not a super strong reason, so I'm happy to go with whatever you prefer. |
Right. So delineating between I'd then ask for a small tweak:
In fact, it's not immediately clear from the docstring that this whole interpolator is for unstructured data and is KDTree-based. Would be great if you could update that in a follow-up PR.
I appreciate the argument around future work on |
@j-bowhay, @ev-br: thank you both for your careful feedback. I have updated the PR based on both your comments in the latest commit - ff1d52f. It now takes in a
I will do this in a separate documentation PR.
Sorry I wasn't quite clear, I was proposing the |
To be clear: I'm fine with either approach. I'm sold on that the inconvenience of lumping together constructor and call arguments is rather minor and a cleaner |
Sounds good - sorry about the added confusion. @j-bowhay let me know which option you prefer and I can implement it. Either is fine on my end and should likely be minimal effort.
How about an interface like this?
Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
Thanks both for your patience. @j-bowhay - I like it. I have committed and updated tests + docs and have re-requested review.
CI failures look related, could you please take a look?
…te' into add-max-dist-nearestnd-interpolate
Sorry, dumb bug. Fixed now.
CI is still unhappy :-(.
@ev-br - good call. Through the I've come up with hopefully an acceptable solution to account for these two cases and updated docs & comments for NearestNDInterpolator to reflect this. I couldn't get scipy to build locally for some annoying reasons but have tried to make sure that all tests in |
np.full with int dtype and nan-setting leads to weirdness.
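A small sketch of one way this can go wrong, assuming the issue is that NaN has no integer representation. The variable names are illustrative and not taken from the PR.

```python
import numpy as np

# An integer buffer cannot hold NaN: assigning it raises ValueError.
int_buf = np.full(3, 0, dtype=int)
try:
    int_buf[0] = np.nan
    nan_assignment_failed = False
except ValueError:
    nan_assignment_failed = True

# A float buffer pre-filled with NaN accepts both NaN and integer values.
float_buf = np.full(3, np.nan)
float_buf[0] = 7
```

Allocating the output as float sidesteps the problem, at the cost of returning floats even when the input values are integers.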
OK, this is getting close! I like the flatten-compute-restore logic, it makes a lot of sense.
I've left several comments about 1) minor simplifications of implementation details and 2) clarity of comments/documentation.
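The flatten-compute-restore idea can be sketched roughly as follows. This is not the code from the PR: nearest_with_max_dist and max_dist are hypothetical names, and the sketch builds its own KDTree rather than reusing the interpolator's.

```python
import numpy as np
from scipy.spatial import KDTree


def nearest_with_max_dist(points, values, xi, max_dist=np.inf):
    """Nearest-neighbor lookup; NaN where no point lies within max_dist."""
    points = np.asarray(points, dtype=float)   # (npoints, ndim)
    values = np.asarray(values, dtype=float)   # (npoints, ...)
    xi = np.asarray(xi, dtype=float)           # (..., ndim)

    # Flatten: collapse all leading dimensions of xi into one.
    xi_flat = xi.reshape(-1, xi.shape[-1])

    # Compute: query the tree; misses come back with infinite distance.
    tree = KDTree(points)
    dist, idx = tree.query(xi_flat, distance_upper_bound=max_dist)
    valid = np.isfinite(dist)

    out = np.full((xi_flat.shape[0],) + values.shape[1:], np.nan)
    out[valid] = values[idx[valid]]

    # Restore: leading dims of xi, followed by trailing dims of values.
    return out.reshape(xi.shape[:-1] + values.shape[1:])
```

The output buffer is float so that it can hold NaN for queries that have no neighbor within the bound.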
The story about stacks of arrays is always confusing, so let's try to clarify it here. So, there are three arrays: x, y and xi.
Let's see if I got it right:
- x is always 2D
- y is 1D or a stack of 1D, where the first dimension is the same as the second (or first?) dimension of x, and trailing dimensions represent the stack of 1D arrays.
- xi is normally a 1D array of dimension ndim or a stack of 1D arrays, where the stacking is along the leading dimension (this is what the word "broadcastable" hints at in https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.NearestNDInterpolator.call.html#scipy.interpolate.NearestNDInterpolator.call ).
I think it would be best to explain this (or a correct version of it) to clearly separate what is the number of dimensions, what is the number of data points and how stacked dimensions come into play.
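A quick way to check these shape conventions is to run a small example. The array sizes below are arbitrary, and the comments reflect my understanding of the behavior rather than a documented guarantee.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

rng = np.random.default_rng(0)
npoints, ndim = 10, 2

x = rng.random((npoints, ndim))   # data sites: (npoints, ndim)
y = rng.random((npoints, 3))      # values: (npoints, trailing dims...)
xi = rng.random((4, 5, ndim))     # queries: (leading dims..., ndim)

interp = NearestNDInterpolator(x, y)
out = interp(xi)

# Leading dims come from xi, trailing dims from y.
result_shape = out.shape
```

Here the number of data points is the first axis of both x and y, the coordinate dimension is the last axis of x and xi, and everything else is stacking.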
Two general considerations to keep in mind:
- the docs for NearestNDInterpolator, LinearNDInterpolator and CloughTocher2DInterpolator should look relatively consistent. If one of them supports trailing/leading dimensions and the other does not, OK, so be it. But going from nearest to linear interpolator in a simple case should be seamless, and so should be the docs.
- Maybe it'd be useful to take a look at how RegularGridInterpolator documents essentially the same issue. Ideally, regular and scattered interpolants look similar.
Finally, since the documentation story is not strictly speaking related to your original changes, it would be perfectly fine to undo the docstring changes and keep the documentation update for a follow-up PR (which would be very welcome indeed).
scipy/interpolate/_ndgriddata.py
Outdated
# (1) the case where xi is of some dimension (n, m, ..., D), where D is the coordinate dimension, and
# (2) the case where y is multidimensional (npoints, k, l, ...).
# We will first flatten xi to deal with case (1) and build an intermediate return array with shape
# (n*m*.., k, l, ...) and then reshape back to (n, m, ..., k, l, ...).
n*m* is confusing: what does the trailing asterisk stand for?
Yeah, I'm not being careful here in my words. Have given the description another go.
scipy/interpolate/_ndgriddata.py
Outdated
dist, i = self.tree.query(xi)
return self.values[i]

# We need to handle two important cases for compatibility with a flexible griddata:
Let's not bring griddata here. The rest of the comment is also confusing (at least to me, and this is not the first time I'm seeing the story about leading/trailing dimensions).
I'm not sure what your m, n, k and l refer to TBH. Maybe take a look at how RegularGridInterpolator documents essentially the same story, possible trailing dimensions in y and xi:
I'm not saying RGI does it perfectly (it does not); all I'm saying is that you can maybe come up with a better formulation for both (likely in a follow-up PR).
Yeah, this is confusing and griddata is not relevant - updated.
@ev-br - thanks for the thorough comments. I think you're right that I'm confusing nD/2D and trailing dimensions. The pointer to RegularGridInterpolator was very helpful. I have considered most of your comments and hopefully addressed the bulk of them in the new commits. Hopefully this is a more accurate version of what's going on. Any remaining doc fixes I will leave to a separate PR for now.
Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
…te' into add-max-dist-nearestnd-interpolate
All my comments have been addressed, I think it is a great addition, so +1 from me.
Unless there are further comments, am going to hit the green button later this week.
Am going to merge now, based on two core dev approvals. Thanks Jake for the review, thank you @harshilkamdar for the enhancement and congrats on what I believe is your first SciPy commit. Keep them coming!
Reference issue
This does not close a particular issue as far as I know, but does speed up nearest neighbor interpolation by a significant amount in cases where a max distance arg is passed.
What does this implement/fix?
This allows for far quicker queries and nearest neighbor interpolation for large datasets where prior information about when/where to interpolate is known. In some 2D geospatial examples with tens of millions of points, this leads to a speedup of anywhere from 2-30x.
Additional information
n/a. This is my first PR here -- please let me know if I did anything wrong.