-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: faster contains_xy/intersects_xy predicates special casing for point coordinates #1548
ENH: faster contains_xy/intersects_xy predicates special casing for point coordinates #1548
Conversation
0cf45de
to
2668688
Compare
src/ufuncs.c
Outdated
} | ||
static void* contains_xy_data[1] = {GEOSContainsXY}; | ||
|
||
static char GEOSContainsProperlyXY(void* context, void* g1, void* pg1, double x, double y) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those functions (for contains and intersects) are a bit repetitive. Another option could also be to move this into the actual Ydd_b_p_func
, and inject GEOSPreparedContainsXY_r
and GEOSPreparedContains_r
as data
(similarly how YY_b_p_func
gets the non-prepared and prepared version of the predicate).
(however, in case we would drop the on the fly preparation, and assume this is already done in advance, that would make those functions a lot shorter)
Pull Request Test Coverage Report for Build 3521081413
💛 - Coveralls |
Looks good @jorisvandenbossche ! A high-level comment: can we accept a tuple of coordinates in the existing
These could be vectorized as well as a |
Trying to reuse the existing funtions is certainly an option as well (if we can do that cleanly, that would limit a proliferation of The main reason I was thinking it doesn't really fit nicely is because I wanted to have separate x and y arguments, and so that would make the signature not very clean (something like Passing xy as a single (n, 2) array (or a tuple as the scalar (1D array) version of that) would make it indeed fit better in the existing function. We would then need to dispatch under the hood to the actual ufunc depending on the data type of the second argument, I suppose? (since the shape itself can't be relied upon, given that also for geometries you could have such a shape with the broadcasting abilities of numpy) |
There is indeed a difference when working with arrays. But that goes both ways; if you happen to start with an |
If you have a (n, 2) array, you can still easily slice and pass x and y as separate arguments (they have just a stride of x2), and performance wise that doesn't change much compared to separate x and y arrays. Using the variables from above:
While the other way around you need to concatenate (and thus copy) the data. Although, given the above timing of that step, this copying is also not that significant compared to the actual operation. So this is probably not a strong argument either way :) |
To follow-up on this: for the actual ufunc, we will certainly keep two separate implementations (since they have different data types and signature), although I might change the new ufunc to a generalized ufunc with reducing dimension for xy (as Casper mentioned above: #1548 (comment)). Option 1: expose those as separate functions (current state of the PR), meaning we have both a Option 2: expose the new functionality to pass points as xy coordinates in the existing def contains(geom1, geom2):
# ... coerce to arrays
if np.issubdtype(geom2.dtype, np.number):
# geom2 are float coordinates
return lib.contains_xy(geom1, geom2)
else:
return lib.contains(geom1, geom2) Some considerations:
cc @caspervdw @brendan-ward @sgillies @mwtoews preferences / thoughts? |
We discussed this a little bit during the GeoPandas dev meeting. While there is some benefit of clarity to having separate methods to keep the documentation around the _xy predicates very clear, there is already a precedent of functions in Shapely 2.0 that take Points or coordinates as input: the Thus it seems like the single combined option with interleaved xy array seems like the most straightforward way to add this. |
This is only for the |
OK, I pushed a change to see how it looks like to have a single function that can accept both (I didn't yet update the docs for it, and also didn't yet change the actual ufunc implementation, so I still need to split the xy argument in separate x and y arrys) Another thought on the option 1 vs 2 choice. The fact that we only do this for Another minor advantage of the separate functions is that we can give the user the flexibility to pass the points both as interleaved xy array or as separate x and y arrays (just like with |
Additional note: even if there is not yet consensus on a public API, I think we can certainly already merge just the ufunc addition, i.e. |
@jorisvandenbossche I'm in favor of separate functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this being 2 new functions, and would suggest also adding contains_properly_xy
.
It seems reasonable to prepare the geometry in python before calling the ufunc, since a common use case would be a single input geometry compared with many x,y values. That said, the overhead of checking that the input geometry is prepared in the ufunc is pretty low, and this could always be refactored later.
See my comment at the GEOS PR: libgeos/geos#677 (comment). For points, there is no difference between "contains" and "contains_properly" since in both cases the single point needs to be in the interior of the polygon to return True (and so there is no other point that can and cannot intersect with the boundary of the polygon) |
Would it be useful to note in the docstring for |
src/ufuncs.c
Outdated
if (pg1 == NULL) { | ||
prepared_geom_tmp = GEOSPrepare_r(context, g1); | ||
if (prepared_geom_tmp == NULL) { | ||
return 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to destroy geom
(if !GEOS_SINCE_3_12_0
) here
Maybe easiest to firstdo the prepare logic? Then you can also merge the conditionals if GEOS_SINCE_3_12_0
@caspervdw your sentence is not very explicit ;), but I am reading it as you assume the outcome is to have it combined into the existing functions? (option 2 from #1548 (comment)) |
@jorisvandenbossche I am a bit torn between the two options but rereading this thread I am slightly inclined to go for the second option (2 extra separate functions) without any added ones (like the Basically because it is the simplest one, just wrapping the GEOS function as-is. I think that is closest to the pygeos/shapely philosophy. The other option ( integrating) only really adds value if you do it for every binary predicate. But I am mildly worried that adding extra spatial logic in shapely might introduce edge cases that we will need to fix later. So, concluding, just 2 extra methods if you ask me. Call signature similar to |
@caspervdw Thanks, that also summarizes my thoughts at the moment, and so then I think we have agreement on the separate functions. Will update the PR to go back to that, and address the rest of the feedback |
This reverts commit 47032ce.
OK, I think this now implements the agreed API, and I have addressed the comments. So going to merge (in case I missed something, or you still see an issue, I am happy to do follow-up PRs) |
Love the more performant polygon.contains(point) But can't do polygon.contains_xy(point) Because you get:
So you have to do: contains_xy(polygon, point) Is this intended, and if so what's the motivation behind it? |
See libgeos/geos#677: GEOS added new XY versions of some predicates that avoid the overhead of point construction, and started experimenting a bit with those.
This would also close #1521
Some points:
_xy
top-level function (instead of for example integrating it in the existing predicate functions). Are we OK with such an addition to the API?shapely.vectorized
(after deprecating them). Also here I think functions with a different name are better than the currentshapely.vectorized.intersects(geom, x, y)
(same name as top-level one, but with a different signature)_xy
can give a speedup.If we want to add this, still need to add tests, expand docstrings, etc.
Some small benchmarks:
Scalar (I included the change from #1547 while running those, since that speeds up
Point(..)
):Scalar with polygon prepared in advance:
Vectorized:
(the big difference here above is that I made
contains_xy
to always prepare on the fly, whilecontains
does not do this, so this is not a fully fair comparison)Vectorized with polygon prepared in advance:
So it takes more time to create the points compared to just the contains check. To make this explicit: