Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port the spatial join algorithm from geopandas #52

Closed
jorisvandenbossche opened this issue Oct 6, 2019 · 5 comments
Closed

Port the spatial join algorithm from geopandas #52

jorisvandenbossche opened this issue Oct 6, 2019 · 5 comments

Comments

@jorisvandenbossche
Copy link
Member

In the cython branch in GeoPandas, we had a C implementation of a spatial join algorithm (using the STRTree from GEOS). This was added in geopandas/geopandas#475

Since it's written in C directly against the GEOS C API, it would be nice to include it in this package (primarily for use in geopandas, but it might also be useful in general).

One question is how it would fit in the current API of pygeos.
The version in GeoPandas is currently written as a C function that gets 2 C arrays of geometry objects:

size_vector sjoin(GEOSContextHandle_t handle,
                  GEOSPreparedPredicate predicate,
                  GEOSGeometry **left, size_t nleft,
                  GEOSGeometry **right, size_t nright);

which is then wrapped in cython to provide a python interface with a signature of

def sjoin(left: np.ndarray[1D], right: np.ndarray[1D], predicate_name: str) -> (left_out, right_out)

where left_out and right_out are arrays of indices into left and out, respectively, for all the matches according to the predicate.

I think it is difficult to generalize this to multiple dimensions, as currently all other functions in pygeos do.

@jorisvandenbossche
Copy link
Member Author

@brendan-ward what were your ideas regarding spatial joins? Since you mentioned "This should make it easier to implement spatial joins on top of pygeos." in #87

I think we should still have a spatial join algorithm in pygeos itself (as having the outer loop in c as well will still be beneficial).

It could reuse the STRtree_query function, and call that multiple times for all geometries However, the problem is that that is a "python function", so not ideal to reuse from C. So we might need to factor out the actual "query" part that doesn't need python interaction (eg keyword parsing).

I think the API we have had in the GeoPandas cython branch (see links above) returning two arrays of indices (indices into the first and second set of geometries, respectively) is probably the most general one.

@brendan-ward
Copy link
Contributor

@jorisvandenbossche agreed on all points.

The general thought was just what you said, we need a vectorized way to loop over queries to the tree. Spatial join is basically a bulk query, with a predicate applied after, and possibly some smarts about the internal direction of the join to best leverage prepared geometries.

A couple ways we could achieve this (without thinking it through all the way):

  1. create a function in C that implements the core of the query operation, and use this inside C functions exposed to Python for both singular and bulk queries to the tree (the latter within a loop in C).
  2. change the signature of STRtree_query to take an array of input geometries instead of a single one, and implement the loop within this. Then, in the calling python function, intercept any case where the user passed in a singular geometry, and wrap that in an array before calling STRtree_query and unwrap the response.

Above and beyond how the tree is queried, there is additional work that goes on within geopandas' sjoin: picking the internal direction of the join based on existing indices (likely not applicable here) or length of geometries, then inverting that if not in the direction requested by the user. I think our first step here should be to figure out how best to bulk query the tree.

@caspervdw
Copy link
Member

Is this solved with the bulk_query method that was implemented in #108 ?

@caspervdw
Copy link
Member

Duplicate of #135

@caspervdw caspervdw marked this as a duplicate of #135 Dec 4, 2021
@jorisvandenbossche
Copy link
Member Author

This is indeed essentially query_bulk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants