-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integration of PyGEOS and Shapely #782
Comments
Same thing as above, just organized by potential directions that could be taken in the integration. Probably missed some benefits/drawbacks for the different options, but I tried to get the general idea. Option 1: Have
|
@snowman2 thanks for that write-up! On Option 1: Have pygeos use shapely geometry object as its geometry object, some questions on how that would work in practice. For pygeos, the base geometry type needs to be a Python extension type, so written in C (or cython). So that's something that Shapely will still need to do (and thus have a hard dependency on a compiler for development / packaging). |
If this road were to be taken, I would probably lean towards replacing ctypes with Though off of that tangent: If |
So either you write it in cython for scalars, and then: no, you can't achieve the same speed-up (it's the fact that it works on the full array at once that is important). Or either you write it in cython to work on arrays, but then most practically you need numpy, and then: yes, you can achieve the exact same speed-up (this is basically what we were doing in the geopandas cython branch). I suppose it would be possible to write array-wise cython functionality without depending on numpy (using python's memoryviews etc), but I think that would be a lot less practical. Numpy is giving you a lot of convenience to work with arrays. Now, if a numpy dependency is not necessarily the problem, but rather the direct use of C is seen as a bottleneck: I think it is possible to do almost exactly the same in cython what pygeos currently does in C (you can write ufuncs in cython). |
A short response from my side. Thanks a lot for the writeup @jorisvandenbossche @snowman2 For me, one of the main issues is not wanting to split the Python GEOS community. For that it is necessary to either merge or make them dependent (options 3 and 4). Also, we shouldn’t put too much emphasis on the amount of work any transition would take. We should just decide on the “best way” to go forward. It will pay off. In my experience, that approach works best. More technically, I think that not only speed is an argument for using pygeos’ approach of wrapping GEOS. There is actually not much difference when looking only at scalar geometries. If I try to look objectively to the differences in shapely and pygeos (disclaimer: I wrote pygeos) I come to the following advantages of either using pygeos as dependency of shapely or merging the projects:
I did open the door to extending the pygeos.Geometry type with a pure Python class. That could be a way to actually make that object act as a shapely object. See pygeos/pygeos#72 . |
We've had a pair of video calls to discuss these issues. Thank you @jorisvandenbossche and @caspervdw for organizing and thank you @snowman2 @snorfalorpagus and @mwtoews for joining us. Our consensus is to go ahead with the 4th option: replacing the existing vectorized module in shapely with code from pygeos and expanding the scope for vectorized operations. It is yet to be decided if shapely's existing scalar operations will be a special case of more general vectorized operations. Also yet to be determined is a roadmap for developing and releasing. Next steps: let's spread the word about this and see if anyone else sees a flaw in the plan. Then let's work on the roadmap and try to make some progress on this after the holidays. |
Hi all, I started working (mainly thinking, some experimentation) on this again over the last weeks, and tried to put some thoughts together in a discussion document, in an effort to start getting this in motion. I think one of the important discussion points that we didn't flesh out on the call is how to integrate pygeos' code/concepts in Shapely: what API do we want to provide, and how much can this deviate from the current API? We obviously want to add the vectorized functions (somewhere in a module), but what to do with the Geometry classes? Since those geometry classes are at the core of Shapely, I think it's good to start discussing the options in that area. How do we want to discuss this? Is the text on hackmd fine (there is basic commenting functionality), or would it be better to do a PR with it, so we can do inline commenting on github? Or do we discuss it here, keeping the separate document as reference that can get updated? On another note: I think we will want to change some behaviour (eg I mentioned mutability in the linked document, but also just the
|
A gentle ping for my post above. Also, if there are any suggestions on how or where to best discuss this (here, a PR with a textual proposal, ...), that's very welcome. |
Hi @jorisvandenbossche, I'm currently drowning in shapely issues in the 1.7 (master) branch, we've got a burst of them, and haven't been able to get to this yet. I'll respond to that comment above later today or tomorrow. |
@jorisvandenbossche Thanks for keeping at this! Listing all the stuff that needs to be done makes the project a lot less daunting. One way to discuss this is is taking your document and place it it a shapely-rfc repo. We could separate issues and iteratively work on a plan. One thing we should consider is if we are going to work on the existing shapely code or start with a clean sheet, copying stuff from pygeos and shapely into a completely new repo (or branch). It seems to me that starting with a clean sheet might be less work altogether. One specific thing I’d like to add: I think we will hit a hard wall when trying to subclass pygeos.Geometry into Python. This because we should be able to type check the geometry inside the ufunc inner loop, and as Python is dynamic, you can’t without taking a detour through the interpreter. Cython in the core is something I’d like to avoid (as oppossed to custom algorithms). I’ll try to reply faster next time! |
Okay, I've left some comments in the HackMD page. Thanks for starting that @jorisvandenbossche ! Tomorrow I'll make a maint-1.7 branch and then the master will become 1.8dev. I think we could make a branch for version 2.0dev almost immediately and periodically merge in the changes from maint-1.7 and the master branch. @caspervdw let's find out whether that's going to be a blocker as soon as we can. |
@sgillies thanks for the comments! Since the comments in HackMD are not that visible, putting the most important one here:
On this last sentence: I didn't propose to eliminate Point's x and y attributes, so sorry if that wasn't clear (I think we should preserve all existing attributes and methods, with a few exceptions that we can deprecate first, like With the constructors, isinstance hooks, and all familiar methods/attributes in place, I don't think the option of a single class would break "every existing application". The main problem of multiple classes is that it will quite complicate the implementation on the C side (as far as I can imagine now), which is a clear downside of that option. We could have the subclasses in C and have them basically almost the same as the base Geometry class (just ensuring they have the correct geometry type). That might not be too much work (but IMO also doesn't add that much value compared to the single class). A difficulty would in this case also be to match the behaviour of the current constructors. To have the subclasses in Python, as @caspervdw points out, that complicates the ufuncs (and would probably impact performance as well) as those that create new geometries would need to call from C into python to construct the correct class. |
@jorisvandenbossche I thought removal of attributes like x, y from point objects was implied in the proposal for a single class, because why would we want x, y attributes on a polygon object? Would Capsules help with the ufuncs? https://docs.python.org/3/c-api/capsule.html#capsules. Specifically, could our C base class define a small C API that the ufuncs could use to get the GEOS geometry pointers? |
I touched on the topic of methods that are unique to one of the geometry types, but there are only a few, so mentioned that "those can also raise an error if they are not applicable for the geometry type in question", when providing all attributes on the single Geometry class. |
What problem do you have in mind for which this could help? |
I did look at capsules, but our implementation doesn't need it. It can't solve this issue for us. However, I do like the idea of pure python subclasses of
I had another look at the Python docs (notably: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_base) and it appears that there is a static attribute that keeps track of a type's baseclass. We can use this in the ufuncs. Then the first issue is resolved (note: I did not benchmark yet, but I think we're absolutely fine with this). See this PR: The second issue might be not that bad either: we are using the python interpreter anyway as we are creating the objects. We can't release the GIL. Selecting the right subtype might not be that much extra overhead. If necessary, we could populate a struct on the module initialization with pointers to the (python-defined) classes. |
Consuming the subclasses in the C ufuncs is no problem I think (I commented on the PR as well), it's mainly producing them that is more complex. It's true that we need python interaction anyway to create the python objects, so it might indeed not be that much slower. It will still make the C code more complex though .. |
Some small updates / other questions on this:
|
Feedback on the STRTree / prepared geometry questions from the comment above is still very welcome. I also wanted to raise another topic for discussion: how to expose those vectorized/ufunc functions? Copying part of the text in the hackmd and Sean's response to it:
Some reasons that I am not really "fond" of
That last point touches on how we want to integrate this with the other shapely modules ( As an example, the |
The question of whether/how to automatically prepare geometry is an interesting one. It depends on the cost of preparing a geometry versus how much is saved by using the prepared geometry multiple times. There's so many factors involved it would be very difficult to cost this exactly. If you know the number of evaluations a priori, then a simple heuristic is to prepare if # evaluations > N, where N is some small number (perhaps even 1). |
@dr-jts Thanks for the input! I will try to respond in a bit more detail at pygeos/pygeos#92 one of the coming days @sgillies would it help if I do a PR with an RFC-like document that summarizes the changes we would like to do for Shapely 2.0 and the workflow to get there? (similarly how you did one for Fiona 2.0). That might be easier to discuss compared to the external hackmd I made now (which I also need to update with the latest changes/discussions above) |
@jorisvandenbossche yes please with an RFC 🙏 |
Hi all, a more detailed proposal is now available at shapely/shapely-rfc#1. Feedback very welcome! |
I read through the RFC. A big paint point for me is that geometries of different sizes cannot be created in a vectorized way in PyGEOS. This is fine if you only have points - but for linestrings, I think it is an unreasonable assumption that all linestrings will contain the same number of coordinates. In my particular case I want something that creates linestrings if the number of coordinates is >= 2, and points if the number of coordinates is == 1. (And I want the looping-part of this not to happen in Python for performance reasons) I'd love to see some kind of solution for this considered in Shapely 2.0! See: pygeos/pygeos#138 |
In the meantime, PyGEOS has been integrated into Shapely, and this has been released as Shapely 2.0.0 (#962). So we can close this issue now. |
For context, see also #501 on vectorizing all Shapely functions. More specific context: PyGEOS (https://github.com/pygeos/pygeos/, https://caspervdw.github.io/Introducing-Pygeos/) is a new python package that exposes geospatial operations from GEOS, similarly to Shapely, but through fast, vectorized (numpy-based) functions.
It gives a significant speed-up (5x to 100x speed-up (depending on the kind of operation) for operations on arrays), opens up the possibility of parallelizing calls to pygeos functions, and will provide a needed performance boost to GeoPandas and others who work with arrays of geometries.
It has a different initial focus (performance) but clearly also has a lot of overlap with Shapely. We have been discussing offline with a few people on how a possible integration between both packages could look like, but thought to make an issue here as well to have a public recollection of it.
To quote Sean: The unanswered question is: how do we organize GeoPandas, Shapely, and PyGEOS to maximize the benefits for the projects, their developers, and their users while also distributing the costs in an equitable way?
Can we integrate the PyGEOS functionality fully into Shapely, avoiding the need of a separate library? Or would Shapely want to use PyGEOS as a dependency for its interaction with the GEOS library instead of its ctypes approach? Or if keeping separate libraries, how do ensure a best possible interoperability?
Below, I am trying to summarize some of the discussion points, partly from my (biased) viewpoint as GeoPandas developer.
Stumbling blocks / disadvantages for moving pygeos into Shapely:
Advantages of integrating pygeos into Shapely:
And a few aspects that are less clear advantage/disadvantage:
Or, if not directly integrating the full functionality of PyGEOS into Shapely, are there ways to integrate the Geometry type better and ensure good interoperability / exchangeability ?
It is also not fully clear to me if it is safe to share pointers between both packages if they use a GEOS that is potentially built with a different toolchain.
__geo_interface__
. However, not all packages will support this to recognize geometries (I suppose many packages expect Shapely objects, and actually Shapely itself also does not support it as input in the Geometry methods), and this also has a lot of overhead. I would argue that this is mainly useful for packages that explicitly do no want to depend on GEOS (through Shapely or PyGEOS).For me, my main concern is to have a clear story for the community of both end user and dependent packages.
Should they use / support shapely or pygeos geometries? What if I am using pygeos geometries but then another package I want to use only supports shapely geometries? For that reason, I am personally hoping we can find some way to integrate both projects.
And a similar viewpoint from the GeoPandas side: we are planning to use PyGEOS under the hood to store the geometries in the geometry column and have faster operations on it. However, we are not planning to "get rid of" Shapely. For now, Shapely will still be the primary user-facing access to elements of a GeoDataFrame: when accessing a single element, users get a Shapely object (the stored pygeos object gets converted) to benefit from the rich functionality of Shapely scalar objects.
This means that whathever inconsistency there is between PyGEOS and Shapely will be confusing for GeoPandas users. It means that there will be two ways of getting the geometries out of a dataframe: as an array of shapely objects or an array of pygeos objects. All that is not an ideal situation from a GeoPandas viewpoint.
cc @sgillies @snorfalorpagus @snowman2 @caspervdw
The text was updated successfully, but these errors were encountered: