
Seeking geofence st_contains query alternative #91

Closed
harryprince opened this issue Jul 10, 2018 · 4 comments

Comments

@harryprince

harryprince commented Jul 10, 2018

Testing whether an event falls inside a geofence polygon is too complicated and slow in Hive when calling the ESRI st_contains function directly, because the query cannot use an index.

Because of that, I am looking for a better solution, for example using the H3 maxPolyfillSize function.

However, I cannot find any example of using the maxPolyfillSize function as an alternative to st_contains, and I need help.

By default, maxPolyfillSize returns h3index values at multiple resolutions, which means the SQL needs one join per resolution level. Does a better solution exist?

@nrabinowitz
Collaborator

(As this is a question, not a feature request or bug report, please direct to StackOverflow in the future.)

  • I think there's some confusion here: you can use the polyfill function to return the hexagons within a polygon, but they are all at the same resolution (supplied by the caller). You would have to call compact on the resulting hexagons in order to get hexagons at multiple resolutions.

I am not a Hive expert, but there are a few options for using H3 to spatially index data points:

  • The simplest is to use polyfill to fill a given polygon at a resolution that fits your desired precision, and then create a table using the hexagons as a reverse index with rows like h3index, polygon_id. Data points with lat/lon can then be mapped to a H3 index using geoToH3 (ideally at index time), and you can join this field with the polygon table to find the polygon id for a given data point. This is generally very fast, but the reverse index can get very large depending on the size and precision of the polygons you need to index.

  • A slower but more space-efficient option uses compact to index the polygon at multiple resolutions. Assigning a data point to a polygon would then require performing n joins or queries, where n is the number of different resolutions in the compacted set. This might be a better solution if you needed to cover a significant geographic area with high precision (making a standard reverse index very large) but generally only had to handle a single data point at a time (e.g. in a geocoding API).

Obviously if your polygons are stable, you'd be better off doing this at index time, storing the polygon id with the data point, rather than at query time.
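A minimal sketch of the first (reverse-index) option, assuming a fixed polyfill resolution. The cell IDs and polygon names here are hypothetical stand-ins; in real code the cells would come from H3's polyfill (for polygons) and geoToH3 (for event points), and the dict lookup would be the Hive join on the h3index column:

```python
# Step 1 (index time): polyfill each polygon at one fixed resolution and
# store the resulting cells as a reverse index of rows (h3index, polygon_id).
# These cell IDs are made-up placeholders, not real H3 output.
polygon_cells = {
    "downtown": ["8928308280fffff", "8928308280bffff", "89283082807ffff"],
    "airport":  ["8928308281fffff", "89283082817ffff"],
}
reverse_index = {}
for polygon_id, cells in polygon_cells.items():
    for cell in cells:
        reverse_index[cell] = polygon_id

# Step 2 (query time): map an event's lat/lng to its H3 cell at the same
# resolution (geoToH3 in real code), then do a single keyed lookup instead
# of a point-in-polygon test.
def polygon_for(event_cell):
    # None means the event is outside every geofence.
    return reverse_index.get(event_cell)

print(polygon_for("8928308280bffff"))  # -> downtown
print(polygon_for("8928308299fffff"))  # -> None
```

The speed comes from replacing geometric containment with an exact-match join, at the cost of materializing every covering cell for every polygon.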

@dfellis
Collaborator

dfellis commented Jul 10, 2018

@nrabinowitz I don't think the second one would be that much slower. If you take your given lat, lng coordinates and compute the H3 index at all of the possible resolutions (say 6, 7, 8, 9 for your compacted set) and simply query for any of those 4 matching, it would just be four integer compares until it either succeeds or fails, and that would probably actually be faster than a single comparison across the entire uncompacted set.

If it was a normal non-Hive database, you could index the H3 integers with a hash index and it would literally be just 4 hash lookups and you'd get the answer in O(1) time instead of O(n), but that's the trade-off you make with the Hadoop ecosystem versus classic DBs (more space, but slower queries).
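A toy sketch of that multi-resolution lookup against a compacted set. H3 indexes are modeled here as "res/path" strings where the parent at a coarser resolution is a prefix truncation; this is a deliberate simplification of h3ToParent, used only so the example runs without the H3 library, but the control flow (one membership check per resolution present in the compacted set) matches what @dfellis describes:

```python
# Toy stand-in for h3ToParent: a cell "9/abcdefxyz" has the ancestor
# "6/abcdef" at resolution 6. Real H3 parents are not string prefixes.
def parent_at(cell, res):
    _, path = cell.split("/")
    return f"{res}/{path[:res]}"

# Compacted cover of a geofence: a mix of coarse and fine cells.
compacted = {"6/abcdef", "7/abcdeeg", "9/abcdefghi"}

# Only the resolutions actually present in the compacted set need checking.
resolutions = sorted({int(c.split("/")[0]) for c in compacted})

def in_fence(point_cell):
    # Compute the point's ancestor at each candidate resolution; a hit at
    # any level means some compacted cell covers the point. With a hash
    # index this is a handful of O(1) lookups rather than a scan.
    return any(parent_at(point_cell, r) in compacted for r in resolutions)

print(in_fence("9/abcdefxyz"))  # -> True  (covered by "6/abcdef")
print(in_fence("9/zzzzzzzzz"))  # -> False
```

In Hive this corresponds to the n joins (or an IN over the n precomputed parent indexes) mentioned above, one per resolution level in the compacted set.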

@nrabinowitz
Collaborator

As noted, I don't actually know much about Hive :). This tradeoff was something we were considering with Cassandra queries at one point.

@harryprince
Author

harryprince commented Jul 18, 2018

@dfellis Joining multiple times seems clumsy on the Hive SQL grammar side; however, so far it seems to be the best solution.


3 participants