Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: faster contains_xy/intersects_xy predicates special casing for point coordinates #1548

Merged
merged 18 commits into from
Nov 24, 2022

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 24, 2022

See libgeos/geos#677: GEOS added new XY versions of some predicates that avoid the overhead of point construction, and started experimenting a bit with those.
This would also close #1521

Some points:

  • For now I went with an additional _xy top-level function (instead of for example integrating it in the existing predicate functions). Are we OK with such an addition to the API?
    • This can also replace the equivalent functions in shapely.vectorized (after deprecating them). Also here I think functions with a different name are better than the current shapely.vectorized.intersects(geom, x, y) (same name as top-level one, but with a different signature)
  • If the geometry is not yet prepared, I followed how we do it elsewhere and prepare/destroy_prepared on the fly. However, given the difference in speed in case you are repeatedly testing against a single geometry (as in the benchmarks below, around 600ms vs 60ms), I am wondering if we shouldn't just document that those functions prepare the geometries (or alternatively, we could also call prepare/destroy_prepare in the python wrapper, avoiding that we prepare the geometry repeatedly in the C function).
  • Instead of only making this function available for GEOS 3.12+, I just implemented the "create point + prepared predicate" for older versions of GEOS. This still creates a GEOSGeometry Point, but already avoids creating the Python geometry objects, and based on the timings, this is most of the overhead. So also for older GEOS versions, this specialized _xy can give a speedup.

If we want to add this, still need to add tests, expand docstrings, etc.

Some small benchmarks:

import shapely
rng = np.random.default_rng(42)
# use more complex polygon than `polygon = shapely.box(0, 0, 1, 1)`
polygon = shapely.make_valid(shapely.unary_union(shapely.polygons(rng.random((100, 3, 2)))))
x = rng.standard_normal(1_000_000) 
y = rng.standard_normal(1_000_000) 

Scalar (I included the change from #1547 while running those, since that speeds up Point(..)):

In [2]: %%timeit
   ...: shapely.contains_xy(polygon, 0.5, 0.5)
8.87 µs ± 69.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %%timeit
   ...: p = shapely.Point(0.5, 0.5)
   ...: shapely.contains(polygon, p)
78.7 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Scalar with polygon prepared in advance:

In [4]: shapely.prepare(polygon)

In [5]: %%timeit
   ...: shapely.contains_xy(polygon, 0.5, 0.5)
3.88 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %%timeit
   ...: p = shapely.Point(0.5, 0.5)
   ...: shapely.contains(polygon, p)
9.34 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Vectorized:

In [6]: %%timeit 
   ...: shapely.contains_xy(polygon, x, y) 
650 ms ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <- GEOS 3.12

In [7]: %%timeit 
   ...: points = shapely.points(x, y) 
   ...: shapely.contains(polygon, points)
8.19 s ± 752 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(the big difference here above is that I made contains_xy to always prepare on the fly, while contains does not do this, so this is not a fully fair comparison)

Vectorized with polygon prepared in advance:

In [8]: shapely.prepare(polygon)                                                                                                                                                                                   

In [9]: %%timeit 
   ...: shapely.contains_xy(polygon, x, y)
68.6 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- GEOS 3.11
47.8 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- GEOS 3.12 dev

In [10]: %%timeit 
    ...: points = shapely.points(x, y) 
    ...: shapely.contains(polygon, points)
356 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it takes more time to create the points compared to just the contains check. To make this explicit:

In [7]: %timeit points = shapely.points(x, y)
299 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

points = shapely.points(x, y)
In [10]: %timeit shapely.contains(polygon, points)
49.1 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

src/ufuncs.c Outdated
}
static void* contains_xy_data[1] = {GEOSContainsXY};

static char GEOSContainsProperlyXY(void* context, void* g1, void* pg1, double x, double y) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those functions (for contains and intersects) are a bit repetitive. Another option could also be to move this into the actual Ydd_b_p_func, and inject GEOSPreparedContainsXY_r and GEOSPreparedContains_r as data (similarly how YY_b_p_func gets the non-prepared and prepared version of the predicate).

(however, in case we would drop the on the fly preparation, and assume this is already done in advance, that would make those functions a lot shorter)

@coveralls
Copy link

coveralls commented Oct 3, 2022

Pull Request Test Coverage Report for Build 3521081413

  • 17 of 17 (100.0%) changed or added relevant lines in 2 files are covered.
  • 12 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 84.754%

Files with Coverage Reduction New Missed Lines %
shapely/geometry/base.py 4 83.54%
shapely/vectorized/init.py 8 56.52%
Totals Coverage Status
Change from base Build 3477517468: -0.2%
Covered Lines: 2218
Relevant Lines: 2617

💛 - Coveralls

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review October 3, 2022 08:25
@jorisvandenbossche jorisvandenbossche requested review from brendan-ward and caspervdw and removed request for brendan-ward October 3, 2022 08:30
@caspervdw
Copy link
Collaborator

Looks good @jorisvandenbossche !

A high-level comment: can we accept a tuple of coordinates in the existing contains function instead of exposing another function? Like:

shapely.contains(geometry, (0.1, 0.1)) ?

These could be vectorized as well as a (i),(i,2)->(i) ufunc.

@jorisvandenbossche
Copy link
Member Author

Trying to reuse the existing funtions is certainly an option as well (if we can do that cleanly, that would limit a proliferation of _xy versions if GEOS adds more of those in the future).

The main reason I was thinking it doesn't really fit nicely is because I wanted to have separate x and y arguments, and so that would make the signature not very clean (something like contains(a, b, y=None) with b getting interpreted as x if y is also specified ...).

Passing xy as a single (n, 2) array (or a tuple as the scalar (1D array) version of that) would make it indeed fit better in the existing function. We would then need to dispatch under the hood to the actual ufunc depending on the data type of the second argument, I suppose? (since the shape itself can't be relied upon, given that also for geometries you could have such a shape with the broadcasting abilities of numpy)
Although I would certainly be OK with such an API, there is a significant difference of passing separate arrays or a single array. If you start with separate arrays for x and y (for example, if you represent points as x and y columns in a dataframe), then you first need to interleave those (horizontally stack) for being able to pass them as a (n, 2) array.

@caspervdw
Copy link
Collaborator

Although I would certainly be OK with such an API, there is a significant difference of passing separate arrays or a single array. If you start with separate arrays for x and y (for example, if you represent points as x and y columns in a dataframe), then you first need to interleave those (horizontally stack) for being able to pass them as a (n, 2) array.

There is indeed a difference when working with arrays. But that goes both ways; if you happen to start with an (n, 2) you would have to transpose and unpack them. Do you have a feeling of what users tend to start out with?
My intuitive preference would be (n, 2) because the two numbers (x and y) that form 1 conceptual thing (a Point) are together.

@jorisvandenbossche
Copy link
Member Author

If you have a (n, 2) array, you can still easily slice and pass x and y as separate arguments (they have just a stride of x2), and performance wise that doesn't change much compared to separate x and y arrays. Using the variables from above:

In [9]: xy = np.stack((x, y), axis=-1)

In [10]: %timeit np.stack((x, y), axis=-1)
3.81 ms ± 521 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: shapely.prepare(polygon)

In [12]: %timeit shapely.contains_xy(polygon, x, y)
69.1 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit shapely.contains_xy(polygon, xy[:, 0], xy[:, 1])
69.5 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

While the other way around you need to concatenate (and thus copy) the data. Although, given the above timing of that step, this copying is also not that significant compared to the actual operation. So this is probably not a strong argument either way :)

@jorisvandenbossche
Copy link
Member Author

To follow-up on this: for the actual ufunc, we will certainly keep two separate implementations (since they have different data types and signature), although I might change the new ufunc to a generalized ufunc with reducing dimension for xy (as Casper mentioned above: #1548 (comment)).
But so the question is what to do with the public API.

Option 1: expose those as separate functions (current state of the PR), meaning we have both a shapely.contains(geom1, geom2) and shapely.contains_xy(geom1, xy) (or shapely.contains_xy(geom1, x, y), see below). And same for intersects.

Option 2: expose the new functionality to pass points as xy coordinates in the existing contains function. We would then dispatch to the underlying ufunc based on the data type, roughly like:

def contains(geom1, geom2):
    # ... coerce to arrays
    if np.issubdtype(geom2.dtype, np.number):
        # geom2 are float coordinates
        return lib.contains_xy(geom1, geom2)
    else:
        return lib.contains(geom1, geom2)

Some considerations:

  • Pros/cons of both options:
    • Re-using the existing functions avoids creating new functions for essentially the same operation (only a different way to express the second argument). This could also be a more scalable pattern if we want to add this ability to more functions.
      I think this is the biggest plus for combining them in a single function.
    • For discoverability, I think one could argue both ways. Having an explicit contains_xy in the namespace (which you for example see when tab-completing) might make this more visible. On the other hand adding this to the existing contains and describing this option in that docstring, might make it easier to see for the many users of that function.
    • A slight disadvantage of combining both in a single function is that it makes the input / output shape logic depending on the data type ((.., n), (.., n) -> (.., n) for objects vs (.., n), (.., n, 2) -> (.., n) for floats)
  • In case of doing it as a separate function (option 1), we need to decide whether we want to pass the coordinates as a single xy interleaved array (contains_xy(geom1, xy)) or as two separate x, y arrays (contains_xy(geom1, x, y).
    • The current PR does the second (separate x, y) arrays, but that can easily be changed.
    • The existing shapely.vectorized.contains uses separate x, y arrays. On the other hand, our functions that work with coordinates (eg get_coordinates) typically work with interleaved xy arrays, so that would be more consistent with that.
    • If you have a single interleaved array, it is easy to slice and pass as separate arrays, while if you have two separate arrays, you need to stack (i.e. copy) the arrays to be able to pass it as a single array. Although the time to make this copy is also quite small compared to the actual operation, so this is not too important (xref #1548 (comment) above).
    • If we go for a single combined function (option 2), I think we want a single interleaved xy array anyway, since we want to keep the number of arguments the same in both cases (so the second argument is always either an object dtype array of geometries or a float array of xy coordinates).

cc @caspervdw @brendan-ward @sgillies @mwtoews preferences / thoughts?

@brendan-ward
Copy link
Collaborator

We discussed this a little bit during the GeoPandas dev meeting. While there is some benefit of clarity to having separate methods to keep the documentation around the _xy predicates very clear, there is already a precedent of functions in Shapely 2.0 that take Points or coordinates as input: the LineString constructor takes both.

Thus it seems like the single combined option with interleaved xy array seems like the most straightforward way to add this.

@jorisvandenbossche
Copy link
Member Author

there is already a precedent of functions in Shapely 2.0 that take Points or coordinates as input: the LineString constructor takes both.

This is only for the LineString() class constructor though (shapely already had this behaviour for a long time), and not for the shapely.linestrings(..) bulk construction function.

@jorisvandenbossche
Copy link
Member Author

OK, I pushed a change to see how it looks like to have a single function that can accept both (I didn't yet update the docs for it, and also didn't yet change the actual ufunc implementation, so I still need to split the xy argument in separate x and y arrys)

Another thought on the option 1 vs 2 choice. The fact that we only do this for contains and intersects might steer me to have it as separate functions. If it is included in a single function, then users might wonder "why does this work in contains, but not in crosses or .. ?". Of course you can check this in the docstring, but with separate functions it is more explicit in that only contains_xy and intersects_xy exist.

Another minor advantage of the separate functions is that we can give the user the flexibility to pass the points both as interleaved xy array or as separate x and y arrays (just like with shapely.points, where you can also do either shapely.points(np.random.randn(10, 2)) or shapely.points(np.random.randn(10), np.random.randn(10))).

@jorisvandenbossche
Copy link
Member Author

@mwtoews @sgillies any thoughts here?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 15, 2022

Additional note: even if there is not yet consensus on a public API, I think we can certainly already merge just the ufunc addition, i.e. shapely.lib.contains_xy/shapely.lib.intersects_xy, since on that level they will be separate from shapely.lib.contains/intersects anyway (limiting the PR here, or splitting that part to a separate PR). That way we can at least use this under the hood to keep the performance of the shapely.vectorized functions for 2.0.

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Nov 15, 2022
@sgillies
Copy link
Contributor

@jorisvandenbossche I'm in favor of separate functions.

Copy link
Collaborator

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this being 2 new functions, and would suggest also adding contains_properly_xy.

It seems reasonable to prepare the geometry in python before calling the ufunc, since a common use case would be a single input geometry compared with many x,y values. That said, the overhead of checking that the input geometry is prepared in the ufunc is pretty low, and this could always be refactored later.

shapely/tests/test_predicates.py Outdated Show resolved Hide resolved
shapely/tests/test_predicates.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

would suggest also adding contains_properly_xy.

See my comment at the GEOS PR: libgeos/geos#677 (comment). For points, there is no difference between "contains" and "contains_properly" since in both cases the single point needs to be in the interior of the polygon to return True (and so there is no other point that can and cannot intersect with the boundary of the polygon)

@brendan-ward
Copy link
Collaborator

Would it be useful to note in the docstring for contains_xy that it is equivalent to the contains properly predicate for points, just to help avoid others looking for it when comparing against the GEOS C API? (like I did, without thinking through the implications of contains vs contains properly for points)

src/ufuncs.c Outdated
if (pg1 == NULL) {
prepared_geom_tmp = GEOSPrepare_r(context, g1);
if (prepared_geom_tmp == NULL) {
return 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to destroy geom (if !GEOS_SINCE_3_12_0) here

Maybe easiest to firstdo the prepare logic? Then you can also merge the conditionals if GEOS_SINCE_3_12_0

shapely/predicates.py Show resolved Hide resolved
shapely/predicates.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

Which is (I think) indeed the outcome of the decision.

@caspervdw your sentence is not very explicit ;), but I am reading it as you assume the outcome is to have it combined into the existing functions? (option 2 from #1548 (comment))
However, I started to doubt a bit about this myself (with a slight (but undecided) preference for the separate functions, see also #1548 (comment)), and @sgillies mentioned he has a preference for separate functions.

@caspervdw
Copy link
Collaborator

@jorisvandenbossche I am a bit torn between the two options but rereading this thread I am slightly inclined to go for the second option (2 extra separate functions) without any added ones (like the contains_properly_xy proposed by @brendan-ward )

Basically because it is the simplest one, just wrapping the GEOS function as-is. I think that is closest to the pygeos/shapely philosophy.

The other option ( integrating) only really adds value if you do it for every binary predicate. But I am mildly worried that adding extra spatial logic in shapely might introduce edge cases that we will need to fix later.

So, concluding, just 2 extra methods if you ask me. Call signature similar to points is a good idea!

@jorisvandenbossche
Copy link
Member Author

@caspervdw Thanks, that also summarizes my thoughts at the moment, and so then I think we have agreement on the separate functions. Will update the PR to go back to that, and address the rest of the feedback

@jorisvandenbossche
Copy link
Member Author

OK, I think this now implements the agreed API, and I have addressed the comments. So going to merge (in case I missed something, or you still see an issue, I am happy to do follow-up PRs)

@EwoutH
Copy link
Contributor

EwoutH commented Nov 18, 2023

Love the more performant contains_xy function! However, I find it a bit weird that I can do:

polygon.contains(point)

But can't do

polygon.contains_xy(point)

Because you get:

AttributeError: 'Polygon' object has no attribute 'contains_xy'. Did you mean: 'contains'?

So you have to do:

contains_xy(polygon, point)

Is this intended, and if so what's the motivation behind it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

performance regression and future of shapely.vectorized?
6 participants