Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReQL proposal: Geospatial support #2571

Closed
danielmewes opened this issue Jun 19, 2014 · 145 comments
Closed

ReQL proposal: Geospatial support #2571

danielmewes opened this issue Jun 19, 2014 · 145 comments

Comments

@danielmewes
Copy link
Member

Edit: Note the updated proposal below #2571 (comment)

In contrast to #1158, I would like to use this issue solely to track the ReQL API side of things.

This proposal is limited to two-dimensional geodesic geometry (geometry on the earth's surface). We can add support for Euclidean geometry later if necessary.

Geospatial data representation

  • Support at least Points, LineStrings and Polygons compatible to GeoJSON (Wikipedia, full specs ). I think MultiPoint, MultiLineString and MultiPolygon are less important. Once we add them we could support operations such as addition (union) and subtraction on geometry, but I'm going to skip them for now.
  • Define a new ReQL geometry pseudo-type
  • Provide two functions to convert from/to GeoJSON:
    • r.geoJSON(object) : object -> geometry converts from the GeoJSON object object to the geometry pseudo type
    • r.geoJSON(string) : string -> geometry (optional): equivalent to r.geoJSON(r.json(string))
    • geometry.toGeoJSON() : geometry -> object does the opposite of r.geoJSON(object)
  • Provide constructors:
    • r.geoPoint(x, y) : float, float -> geometry
    • r.geoLine(p1, p2, ...) : geometry, geometry, ... -> geometry where the input arguments must be points
    • r.geoLine([x1, y1], [x2, y2], ...) : [float, float], [float, float], ... -> geometry
    • r.geoPolygon(p1, p2, ...) : geometry, geometry, ... -> geometry where the input arguments must be points
    • r.geoPolygon([x1, y1], [x2, y2], ...) : [float, float], [float, float], ... -> geometry
    • polygon1.sub(polygon2) : geometry, geometry -> geometry subtracts polygon2 from polygon1. For now, we should make the following requirement: polygon2 must be completely inside of polygon1. This allows to construct polygons with holes in them.
    • For convenience: r.geoCircle(center, radius) : geometry, float -> geometry, r.geoRectangle(bottomLeft, upperRight) : geometry, geometry -> geometry create a line describing the corresponding shape. Can be combined with fill() (see Misc) to get spheres / filled rectangles.

Creating a geospatial index

  • Any secondary index can store geospatial data. There is no distinction between geospatial and other indexes. If we index field a of a table, some documents in the table can have geometry in a and others can have strings, numbers or whatever.
  • We can later add special opt args to indexCreate() when we want to support different types of geometry.
  • (Internally geometry objects will probably be indexed similar to arrays in a multi index, where e.g. a polygon is "expanded" to a set of grid cells that it intersects with. We will probably want to limit the number of how many geometry objects can be stored in a composite index. Otherwise if for example we expand each polygon to ~10 grid cells, insert a document into a composite index that has three polygons as the composite index keys, we would end up with 10^3 index entries. Will have to determine later how we fail in such cases.)

Misc

  • p1.distance(p2) : geometry, geometry -> float computes the minimal geodesic distance between points p1 and p2. Let's ignore distances to/between polygons and lines for now. Other geometries can be supported through opt args later.
  • l.fill() : geometry -> geometry Takes a line, makes it the outline of a polygon. The line has to be closed (and possibly must not intersect with itself, not sure about that yet).
  • geometry.isContained(polygon) : geometry, geometry -> bool tests whether geometry is completely contained in polygon
  • geometry1.intersects(geometry2) : geometry, geometry -> bool tests whether geometry1 and geometry2 intersect
  • Bonus: set1.isContained(set2) : array, array -> bool isContained() for two arrays. Tests whether all elements of set1 are found in set2.
  • Bonus: set1.intersects(set2) : array, array -> bool intersects() for two arrays.

Querying

  • Ideally, we would provide a predicate function to table.getAll() such as table.getAll(function (x) { return x('position').isContained(polygon); } ) or table.getAll(function (x) { return x('position').distance(center).le(5.0); } ) and have an optimizer automatically make use of an index. However to avoid having to analyze the function, I propose introducing simplified predicates for getAll that can use secondary indexes (comparable to r.desc() and r.asc() for orderBy()):
    • table.getAll(geometry, {index: ...})
    • table.getAll(r.intersects(polygon), {index: ...})
    • table.getAll(r.isContained(polygon), {index: ...})
    • table.getAll(r.withinDistance(center, radius), {index: ...}) (sugar for r.intersects(r.geoCircle(center, radius).fill()), except that we might want to restrict it to points)
    • Bonus: table.getAll(r.isBetween(left, right)) as an alternative to r.between()
  • The predicates can also be combined for querying composite indexes: table.getAll([pred1, pred2, ...]). Note that such a query wouldn't always be efficient, and might have to rely heavily on post-filtering (or alternatively trigger a lot of smaller index lookups). Consider the example of table.getAll([r.isBetween(-inf, +inf), "foo"]) to see why this is the case. Not sure if we want to support this for the first version.
  • I would prefer restricting the previous composite query to having a non-equality predicate only as the final entry in the array. E.g. table.getAll(["foo", r.isBetween(-inf, +inf)]) would be legal, while the previously mentioned table.getAll([r.isBetween(-inf, +inf), "foo"]) would not be allowed. Such queries would always be efficient, and much easier to implement.
  • r.orderBy(r.distance(p), {index: ...}) (note that this is a single-argument p.distance() variant)

Open questions:

  • Can we support querying by whether an entry in the index contains a given query geometry, rather than the other way around? Something like table.getAll(r.contains(geometry))?
  • Should we prefix all geo operations by geo? E.g. should we call intersects() geoIntersects() instead? This has the advantage that we can provide a reversed variant of isContained() called geoContains() without clashing with the existing object.contains(fieldNames) term.
  • For the first implementation, we might want to skip support for composite geospatial indexes altogether. I think it would be immensely useful though, and we should support it soon.
@danielmewes danielmewes added this to the 1.14 milestone Jun 19, 2014
@AtnNn
Copy link
Member

AtnNn commented Jun 19, 2014

The meaning of a line between two points, of a circle and of the shortest distance is ambiguous. What projection is used? Should we pretend that the earth is a sphere? Can lines and polygons cross the 180° longitude line? How do other API solve these problems?

@danielmewes
Copy link
Member Author

@AtnNn Everything here assumes geometry on the surface of a sphere (of maybe ellipsoid, I think S2 supports correcting for the shape of the earth based on the WGS84 system). We would implement this using S2 https://code.google.com/p/s2-geometry-library/ (MongoDB does the same, but also supports Euclidean geometry which we can add later and switch on via opt args).

The shortest distance is not ambiguous (apparently the definition of distances we would use is actually called great-circle distance http://en.wikipedia.org/wiki/Great-circle_distance). Lines between two points are ambiguous if the points are on opposite sides of the sphere. We could avoid the problem though by defining one line as the canonical one (add an implicit third point that the line must go through).

Crossing the 180 degree line should be possible.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

  • table.getAll(r.intersects(polygon), {index: ...})
  • table.getAll(r.isContained(polygon), {index: ...})

I really dislike this syntax. It seems very un-ReQLish -- you're introducing two new terms which can only be used inside another term. We do this in some places (r.literal, r.asc/r.desc), but it makes me cringe every time.

I would prefer one of:

  • New table-level functions
    • table.intersects(polygon)
    • table.contained_in(polygon)
  • Overloading get_all on geoJSON, use an optarg to distinguish intersection and contains.
    • table.getAll(polygon)
    • table.getAll(polygon, include_intersecting: false)

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

I think we should do the simplest thing possible for the first release, so leaving off compound geo-indexing sounds fine to me.

@danielmewes
Copy link
Member Author

r.geoCircle(center, radius) should also accept an opt arg "num_vertices", with some reasonable default value.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

It's sort of unfortunate that this proposal is only for geo data. Since we're doing all the work anyway, how hard would it be to have generic 2-dimensional indexes? They're useful for a lot of other things, too. (e.g. "Get me the experimental subjects in this age range and this weight range.")

@danielmewes
Copy link
Member Author

I would prefer one of:
New table-level functions
table.intersects(polygon)
table.contained_in(polygon)
Overloading get_all on geoJSON, use an optarg to distinguish intersection and contains.
table.getAll(polygon)
table.getAll(polygon, include_intersecting: false)

The problem with these is that they cannot be combined for composite indexes.

If we ignore that, I would strongly prefer the first variant with new functions defined on tables. The meaning of table.getAll(polygon) is not clear at all. It should probably mean testing for equality, but I think you actually meant it as testing for contained_in?

@AtnNn
Copy link
Member

AtnNn commented Jun 19, 2014

I am wary of S2, it seems unmaintained. The last update is from 2011.

@danielmewes
Copy link
Member Author

@mlucy
Implementing geodesic geometry is made easier because S2 already provides a very comprehensive library for performing the various tests (intersection, distance, containment) and has functions that help with indexing polygons.

We would have to look for a different library or implement a lot of things ourselves if we want to support other geometry. I'm not sure how much overlap there would actually be in the implementation, though we could certainly use the same interface for such data.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

The problem with these is that they cannot be combined for composite indexes.

The second one can be: table.get_all(["foo", polgon])

(I'm having some trouble reasoning about this because I'm not 100% clear how composite indexes where one element is a polygon will work -- how will that be implemented? Will it always be efficient, even for an index like [x, polygon, y]?)

If we ignore that, I would strongly prefer the first variant with new functions defined on tables. The meaning of table.getAll(polygon) is not clear at all.

It may be because I thought of the syntax, but this reads pretty clearly to me:

# Get everything in the circle
table.get_all(r.geoCircle([x, y], radius), index:'location')
# Get everything in the circle, ignoring polygons that only intersect it
table.get_all(r.geoCircle([x, y], radius), index:'location', include_intersecting:'false')

@danielmewes
Copy link
Member Author

@AtnNn It's very solid though. It's used by a lot of projects including MongoDB, and it provides extremely convenient features for building indexes over geospatial data. I have also used it before for processing geodesic coordinates. We should still look for some alternatives before deciding on it.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

(Also, can you have a compound index with two polygons? How will that work internally? What does it even mean, and what sort of queries can you do on it?)

@danielmewes
Copy link
Member Author

@mlucy: [x, polygon, y] wouldn't typically be efficient, which is why I think we should limit geometry to the last entry of such a query (i.e. [x, y, polygon]).

How would you do "distance less than ... from ..." with that get_all() syntax?

It also wouldn't work if you had multiple geometric predicates in a composite index, though I guess we would disallow that anyway.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

How is "distance less than ... from ..." different from "contained within this circle"?

@danielmewes
Copy link
Member Author

Generally querying by a compound index [p1, p2, p3] with predicates p1, p2, p3 has the semantics of filtering for documents that match all predicates at the corresponding entries of the compound index. We could do it, but I agree that it wouldn't be a good idea.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

Also, what does a between on a compound index including a GEO coordinate look like? What does it actually mean?

@danielmewes
Copy link
Member Author

How is "distance less than ... from ..." different from "contained within this circle"?

Sorry, I wasn't thinking properly there.

Generally my assumption was that there are a couple of very different criteria that you can apply to geometric data which can be optimized by an index, and that they should be combinable for composite indexes. Introducing such predicates seemed like the reasonable way to represent this.
If we drop the second design goal (composite indexes with multiple geometry entries), we are left with the question of whether the possible criteria on geometry that can be optimized through an index are really that different.
So far we have intersection, inclusion and equality. You propose assuming intersection by default, and switching to inclusion through a flag. What I don't like about that is that it gives special treatment to intersection. Why should intersection be the default?
In fact as mentioned before, I would intuitively assume equality to be the default.
If I do getAll("abcde") I don't search for strings that contain the string "abcde". I search for strings equal to "abcde". I would assume that the same holds for geometry. Everything else would be very confusing in my opinion.

The "r.table().intersects()" etc. syntax doesn't have that problem and is nicely explicit, which is why I prefer it.

@danielmewes
Copy link
Member Author

Also, what does a between on a compound index including a GEO coordinate look like? What does it actually mean?

We shouldn't allow that. The r.between() in the proposal was meant for non-geometric data strictly as an alternative to the current table.between().

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

No, I meant, if you have a compound index compound where one of the elements is GEO data, what does r.table('test').between(X, Y, index:'compound') mean/do? Or would we not support between on compound indexes that include GEO data?

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

If I do getAll("abcde") I don't search for strings that contain the string "abcde". I search for strings equal to "abcde". I would assume that the same holds for geometry. Everything else would be very confusing in my opinion.

I can see that. I guess in my mind "abcde" is a single key in a 1-D index, while a geometric shape is a set of keys in a 2-D index. If we had a data type representing a set of keys in a 1-D index (like r.range(0, 5), which we keep talk about adding), it would make sense to me for get_all to treat that as a set of keys as well.

@danielmewes
Copy link
Member Author

@mlucy: We would not support X or Y to contain geometry objects in r.table('test').between(X, Y, index:'compound'). A geometry would never lie between two non-geometry values.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

So far we have intersection, inclusion and equality. You propose assuming intersection by default, and switching to inclusion through a flag. What I don't like about that is that it gives special treatment to intersection. Why should intersection be the default?

If you think of the shape as a set of keys, "intersection" is just "any entry which matches one of the keys in the set", which seems intuitive to me. But I could see how it would be non-intuitive too.

@danielmewes
Copy link
Member Author

Yeah, r.range() is what I meant with r.between() in combination with getAll() above.

Interpreting polygons as ranges all the time works well as long as you can only store points in a table. But you can also store polygons in a table, and then you might want to filter by equality.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

I'm sort of getting the sense that compound indexes + geo data don't work well together. Compound indexes are very general and can be used for either point gets or range scans, whereas it sounds like compound indexes with geo data inside of them would basically only be used to let you do a geo-indexed query and only retain elements matching a particular set of tags. It might be better to support that some other way, rather than adding a bunch of very specific rules to compound indexes to produce that behavior.

@danielmewes
Copy link
Member Author

Well, a typical query someone might want to do is "find all gas stations within a certain distance". So you would build a compound index [POI_category, position] and query on that.
Requiring geospatial criteria to be the last criteria when filtering on a compound index doesn't sound unreasonable. That would be the only rule we would need I think.

@danielmewes
Copy link
Member Author

We could also represent such compount queries through the table.intersects(polygon, {index: ...}) / table.is_contained(polygon, {index: ...}) syntax as follows, though it doesn't seem super elegant:
table.intersects(X, Y, polygon, {index: 'compound'}) would assume that all but the last parameters are equality-constraints on the value of the compound index.

Edit: Removed the word non-geospatial. X and Y would be interpreted as equality constraint, no matter what their type is.

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

What would table.between(...) do on that index? What about table.order_by(...)?

It's still not too clear to me what it even means to have a compound index where one of the elements is 1-dimensional and another is 2-dimensional. Would we use a projection of the geoJSON as a 1-dimensional stand-in for things like between and order_by?

@mlucy
Copy link
Member

mlucy commented Jun 19, 2014

(Sorry to be dense on this, but I think I'm still not getting it.)

@danielmewes
Copy link
Member Author

@coffeemug What about the alternative of having intersects(g) work on streams as a rewrite rule for filter(r.row.intersects(g))? I'd prefer that over non-indexed getIntersecting.

The latter is also a bit confusing because if you specify a geo index, the intersection test is typically performed over a certain field (the field the index was built over). If you don't specify the index opt arg, it would apply the intersection test to the whole document.

@danielmewes
Copy link
Member Author

Regarding naming: If we call getIntersecting() something like geoIntersecting() (geo_intersecting in non-JS), I think we should prefix most of the geo operations with geo_. Except probably for the constructors.

@coffeemug
Copy link
Contributor

I agree that the command should be polymorphic on streams, however we don't have precedent for commands that run on objects, streams, and accept an index optarg only when run on streams (that would be pretty weird). I'm not quite sure what to do about that, other than having two functions getIntersecting as specified above, and a second intersection function that works on geometry objects and streams. (That's also a bit weird, so I guess we have to decide on the lesser of the two evils)

@danielmewes
Copy link
Member Author

@coffeemug Wait, unless I'm confused about which proposal we agreed on, we already have g1.intersects(g2) -> bool as a term that applies to geometry objects, and table.getIntersecting(g, {index: ...}) -> stream for index access.
I would definitely want to keep the table.getIntersecting(g, {index: ...}). The index opt arg would be mandatory and it would be exactly for filtering through geo indexes, just like getAll(v, {index: ...}) is like filter(r.row(...).eq(v)) but using an index.

My preferred solution would be to extend the predicate term intersects() to be applicable to streams in addition to single objects, and act like a filter. I believe we already do that for a couple of other predicates in ReQL (such as hasFields() I think?), don't we?

I would also like to hear @mlucy 's opinion on this.

@coffeemug
Copy link
Contributor

Ah, sorry, I misunderstood. Yes, I think it's a good idea to extend intersects to work on streams of geometry objects, and we already do that in a bunch of other places for predicate commands, so this wouldn't be new.

@danielmewes
Copy link
Member Author

Another thing I'd like to revisit is the r.rectangle() term.

The way it works now is that it creates a rectangle on the Mercator projection (http://en.wikipedia.org/wiki/Mercator_projection) of the geometry.

It can be useful because many maps are drawn that way, and you can use it for example to conveniently query for items visible in a given rectangular piece of the map (like when visualizing a map for a website).
However its function might be more specific than its name suggests, since it assumes a certain projection.
The term is already implemented and super simple, so that's not a problem.

I'm undecided whether we should keep it in as it is, remove it, or rename it (to mercator_rectangle()? Though that's an ugly name.). Opinions?

@AtnNn
Copy link
Member

AtnNn commented Jul 21, 2014

@danielmewes Do lines and polygons also follow the mercator projection?

@danielmewes
Copy link
Member Author

@AtnNn I'm not sure. If you connect two points that differ only either in their longitude or latitude but not both you will get a straight line in the Mercator projection. That's what r.rectangle() currently does: it constructs a rectangle like the ones in this picture http://en.wikipedia.org/wiki/File:Mercator_projection_SW.jpg given two opposing corners of the rectangle.
If you consider the shortest path between two points that differ in both latitude and longitude, I'm not sure if the Mercator projection still projects that to a straight lines.

The semantics of all polygons and lines (including those constructed by r.rectangle()) with respect to intersection/inclusion in ReQL is that we connect vertices by the shortest path along the earth's surface. A perfectly spherical earth is assumed for this. In contrast we use an ellipsoid model for distance calculations, to improve accuracy.

One problem with r.rectangle() is that it stops working when you reach the poles or try to cross them. Also the "rectangles" it constructs are only rectangles in the Mercator projection, as their angles are not actually 90 degrees otherwise.

@danielmewes
Copy link
Member Author

Also to clarify: r.rectangle(r.point(lat1, lon1), r.point(lat2, lon2)) is just a shortcut for
r.polygon([lat1, lon1], [lat2, lon1], [lat2, lon2], [lat1, lon2]).

@AtnNn
Copy link
Member

AtnNn commented Jul 21, 2014

@danielmewes You say that "we connect vertices by the shortest path along the earth's surface" and that, for r.rectangle, it is a "straight line in the Mercator projection". Which one is it?

@danielmewes
Copy link
Member Author

@AtnNn Both. Those are the same lines when you connect two points that differ in only either latitude or longitude (and don't wrap over the poles).

@AtnNn
Copy link
Member

AtnNn commented Jul 21, 2014

@danielmewes The only horizontal lines in the Mercator projection that represent the shortest distance between their end points on a sphere are those along the equator.

@danielmewes
Copy link
Member Author

Oh wait you are right. What I said is only true for latitudal lines. Yeah, so r.rectangle() is not even exactly the Mercator projection. I guess we should take it out.

@danielmewes
Copy link
Member Author

I removed r.rectangle() for now. If there's any protest I can still revert the commit (it's largely orthogonal to other remaining tasks).

@mlucy
Copy link
Member

mlucy commented Jul 22, 2014

My preferred solution would be to extend the predicate term intersects() to be applicable to streams in addition to single objects, and act like a filter. I believe we already do that for a couple of other predicates in ReQL (such as hasFields() I think?), don't we?

I would also like to hear @mlucy 's opinion on this.

That sounds good to me.

@danielmewes
Copy link
Member Author

The first part of the implementation is up in CR 1812 by @mlucy.
Missing features in that CR:

  • get_intersecting() and get_nearest() are eager and return arrays
  • No support for compound indices
  • No variant of intersects() for streams

@danielmewes
Copy link
Member Author

Geospatial support with the limitations mentioned in the previous post has been merged into next and will ship with 1.15.

@danielmewes
Copy link
Member Author

Opened #2854, #2855 and #2866 to track the missing parts. Closing this.

@barkerja
Copy link

Just a heads up.. since this has been resolved the secondary indexes documentation should be updated:

http://cl.ly/YcKV
http://rethinkdb.com/docs/secondary-indexes/ruby/

@danielmewes
Copy link
Member Author

Thanks for the heads up @barkerja . We had fixed this in rethinkdb/docs#568 . Trying to find out why it's not online yet...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants