Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRtree with a reverse mapping and query_items/query_geoms methods #1112

Merged
merged 9 commits into from Jul 12, 2021

Conversation

sgillies
Copy link
Contributor

@sgillies sgillies commented Mar 30, 2021

This is very much like #1064. It stores only integer items. Exposes no callbacks. It's backwards compatible with the STRtree in shapely 1.7.0.

The differences from #1064:

  • This tree can be initialized from a sequence of geometries (as Store indices instead of geometries in STRtree + option to keep returning geometries #1064 and previous versions) but can also take a sequence of integer items to support applications that have their own integer identifiers. These sequences are separate, not zipped together.
  • This tree maintains a reverse mapping of integer items to initializing geometries, instead of using a list or array, to support the user-provided item scenario.
  • This tree has query_items/nearest_item and query_geoms/nearest_geom (aliased to geom and nearest) methods. The former return integers and the latter return geometries. I believe that this makes usage more clear and helps users who want to statically analyze their code using Python typing.
  • Type hints have been added for the methods listed above.

Also tests and docstrings are much improved over the code in 1.8a1.

@sgillies sgillies self-assigned this Mar 30, 2021
@coveralls
Copy link

coveralls commented Mar 30, 2021

Coverage Status

Coverage increased (+0.03%) to 85.474% when pulling 5c31b1f on int-strtree into bd64faa on master.

"""
if self._n_geoms == 0:
if self._tree is None or not self._rev:
return None
Copy link
Contributor Author

@sgillies sgillies Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising an exception here could be less ambiguous than returning None, which is easy to confuse with an empty sequence (a perfectly normal query result). But that's an API change to save for 2.0.

shapely/strtree.py Outdated Show resolved Hide resolved
@sgillies sgillies marked this pull request as ready for review March 30, 2021 22:30
@sgillies
Copy link
Contributor Author

@jorisvandenbossche @brendan-ward what do you think of this modified version of #1064? Does it have any big problems from the C implementation perspective? It took me a while to remember how important it is to stick to storing C types when it comes to releasing the GIL. Thanks for being patient with me about that.

I'd still like to work on the feature requested in #1106. I wonder if a list of integer items to exclude or skip would be the way to go. After all, these are what we're storing in the tree and they are immediately accessible to the callback without returning to Python land. Meaning we can exclude/skip while the GIL is released.

@sgillies sgillies mentioned this pull request Mar 30, 2021
@jorisvandenbossche
Copy link
Member

Thanks @sgillies for further looking at this!

Some quick remarks:

  • The way you documented now is to require the custom ids (items) to be integers (although not sure the actual implementation also requires it). But if we allow custom ids, why limit it to only integers?
  • You mention explicitly to use a mapping (to keep track of item/geom combo's) instead of a list or array. Is there a specific reason you prefer a mapping? For the current shapely, it doesn't matter much whether we use a mapping (like this PR) or a list (like Store indices instead of geometries in STRtree + option to keep returning geometries #1064). But for Shapely 2.0 I think we rather want to use an array when we can use numpy (we can't yet use an array for 1.8, but a list is then closer in implementation).
  • Implementation nitpick: currently it would be possible (I think) to have a mixture of (geom, id) tuples and plain geoms. I think we definitly want to disallow that.

Does it have any big problems from the C implementation perspective? It took me a while to remember how important it is to stick to storing C types when it comes to releasing the GIL.

Ah, is this the reason you limited it to only integer items?
This PR certainly does not impose any problems for a future version using the C implementation. Now, in practice, we would still not store the user-provided ids, but basically default 0,1,..n ids (using pointer arithmetic, this is how it's currently implemented in pygeos). But, it is easy to keep the user-provided ids on the Python tree object, and return those. So it would be easy to refactor the implementation in this PR on top of the pygeos C STRtree (although this basically would look like #1064).
The side-effect of not storing the user-provided ids in the C tree, is that those don't need to be limited to integers, but can be any kind of identifier / python object.

@sgillies
Copy link
Contributor Author

sgillies commented Jul 5, 2021

@jorisvandenbossche we came to the conclusion that since items of a uniform C type are good for performance and since all Python objects can be mapped to ints via id and hash, the least common denominator is storing ints in the tree and asking the application to handle the mapping of these integer items to other objects.

The only reason STRTree._rev is a mapping and not a list or array in this PR is to efficiently handle "sparse" trees where the integer items aren't necessarily in a compact serial sequence.

@sgillies sgillies changed the title STRtree with a reverse mapping and query_item/query_geom methods STRtree with a reverse mapping and query_items/query_geoms methods Jul 5, 2021
@sgillies
Copy link
Contributor Author

sgillies commented Jul 6, 2021

Yay, builds are passing again. I think we had some cache related problems a few commits back.

"""
return self.query_geoms(geom)

def nearest_item(self, geom: BaseGeometry) -> Union[int, None]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return type of this function (and nearest_geom) are the only things that we'll change before 2.0. We may raise an error instead of returning None.

@jorisvandenbossche
Copy link
Member

we came to the conclusion that since items of a uniform C type are good for performance and since all Python objects can be mapped to ints via id and hash, the least common denominator is storing ints in the tree and asking the application to handle the mapping of these integer items to other objects.

OK, just to be clear: the only additional "feature" this PR gives (compared to default indices 0...n mapping to the position in the sequence of geometries) is that you can use custom integer values instead of 0...n ? (and no longer any custom python object)

I don't necessarily want to argue for allowing any python object as id, but note that when storing indices in the tree (as we will do in shapely 2.0, and as #1064 does), there is not really a performance difference in using custom integers vs custom python objects.

@jorisvandenbossche
Copy link
Member

@sgillies and thanks for the updates!

For a better comparison, I also updated #1064 to follow the same API as you did in this PR (but didn't yet copy over all other changes, like better docstrings etc, just updated the actual implemenation). I think both achieve the same, but the version in #1064 is closer to what we would do in the shapely-2.0 branch on top of the C-based STRtree.

@sgillies
Copy link
Contributor Author

sgillies commented Jul 7, 2021

@songololo @jorisvandenbossche I've relaxed the integer constraint on the constructor's items argument. In the future implementation of this class we will be converting items to a numpy array and numpy handles this very well. A sequence of strings is converted to an array of dtype something like "U" and a sequence of mixed objects is converted to an array of dtype "object". In the future if you pass in a numpy array (like geopandas will do, yes?) it will be used as is, without any copying.

@sgillies sgillies merged commit 1fc769f into master Jul 12, 2021
def test_query_enumeration_idx(geoms, query_geom, expected):
"""Store enumeration idx"""
with pytest.warns(ShapelyDeprecationWarning):
tree = STRtree((g, i) for i, g in enumerate(geoms))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgillies based on your last comments (and the updated top post which mentions "These sequences are separate, not zipped together" and your comment at #1064 (comment)) that the idea was the only support geometries and items as separate sequences (where the items is optional), and not in the combined (zipped) form.
But this test above is testing that constructing the tree from a sequence of tuples actually works?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #1172 to potentially address this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my, you're right that this test shouldn't pass! 🤦

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants