-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature] Ignore missing attributes #183
Comments
We had many, many discussions on this and the prevailing idea was to offer as much convenience as possible without sacrificing safety. We thought that treating missing attributes as nulls could get the user into very inconvenient situations, so we decided to be explicit instead. The current proposed improvement is in #27, which will add more operators that will make dealing with missing attributes more convenient. However, I do think the current behavior is very annoying and the improvements in #27 may not sufficiently offset that annoyance. We've essentially built a very flexible document model, but designed the query language in a way that makes taking advantage of this model inconvenient. This part of the product seems rather unpleasant as a result. I'm moving this to post-protobuf-improvements so we can revisit how to make this better. |
Dup of #27 |
Reopening for the time being because I think we might need a deeper discussion here than just adding convenience commands. |
When using a filter, for any attribute that is read, I would like an implicit |
That's in the eye of the beholder :) We'll consider various options to make this more comfortable as part of this issue. |
Would certainly be nice to have a shortcut for this in ReQL! |
Could we please consolidate this in to one issue with #27 by adding whatever this has that #27 doesn't to #27 and closing this one. Having 2 copies of basically the same issue is just adding mental overhead. I don't quite understand what this one has #27 doesn't @coffeemug could you clarify and I'll migrate it over? |
#27 proposes specific commands to make dealing with this behavior more comfortable. I think we should reconsider the behavior of the evaluator, consider evaluating missing keys to null, and leaving enforcing to constraints that we might implement later. That's distinctly different from #27, though it does make a |
So treating missing attributes as
would return documents in which "foo" is either a number less than So here's how I understand the desired behavior: Thus far I've only seen people want default attributes in This sounds a lot like an exception which Another option would be something akin to haskell's bottom. Basically bottom is a special type that indicates "this computation errors." If bottom is passed in to a function then it will return bottom so it gets propagated up and then |
Why not default to 'undefined' instead? In JS, null !== undefined (I'm not sure how that maps to other languages), and any expression where either side equals undefined will evaluate to false (except perhaps for undefined===undefined?). That means that x("foo").gt(5) will evaluate to false when the attribute 'foo' is missing. For numbers, undefined is similar (but not equal to!) NaN. |
Bottom is alot like undefined except that it propagates itself. I think undefined is basically the correct idea except that it still has a few confusing behaviors. Most notably under negation for example suppose someone wanted to check if an attribute were within a range they might do:
This returns false if "foo" is undefined which is what we want. However now suppose they want to filter to values outside of the range. They'll likely just invert the original predicate:
And now they're in a bit of trouble because this returns true if Undefined also does indeed compare equal to itself both under
would return documents for which both "foo" and "bar" are undefined which strikes me as unexpected. |
A simple solution would be to return false for any comparisons as soon as something is undefined. Drawbacks are
|
@neumino, your first point is not really an issue if the user needs to enable the 'silent errors' explicitly. As for the second point, I agree with @jdoliner; it's best if some error value (e.g. 'bottom') would propagate through the whole expression, such that x("foo").gt(1).not() returns 'bottom' if x("foo") evaluates to 'bottom'. |
I think special-casing filter isn't ideal. Special casing something like this has a bad smell, also there are other functions where we know we want this behavior (e.g. FYI, SQL essentially defines 'bottom' logic the way @jdoliner proposed. They treat null as "unknown value" and any operation on an unknown value returns another unknown value. I think some parts of SQL treat unknown values differently because in those cases it basically destroys the entire result set (which isn't always desired), so they got some criticism for having semantics with special cases. In our case, here are some issues with
I think one way to get around the second problem is to implement bottom semantics, but then have some commands (such as group map reduce) check for bottom in the reduction at every step and throw an error (or, for these commands, we can let user specify what to do in case of bottom -- e.g. keep going, throw, ignore the row). (I know this is a special case, but somehow it seems less bad to me -- perhaps because in this case we implement good shared underlying semantics, and have the command rely on it in an advantageous way -- e.g. if we had macros, users could define commands this way too. Giving this option in case of group map reduce is also a good idea because people legitimately want different behavior in different cases with this command, where in case of filter they almost always want one behavior. |
I think the special casing basically comes from a desire to have different defaults for different operations if we want to avoid special casing then it comes at the cost of making people explicitly say when they want to drop error rows I'm down for this but as you say people basically always want one behavior with filter and I think it's basically either special casing or an undesired default for one of them. Bottom (haskell notation is to use the symbol ⊥ so I'm going to save myself some typing) just gets propagated as an error. Part of an instance of ⊥ is a textual description of why the computation will error if run. If functions are both lazy and pure. Which all of the functions we're talking about in these examples are then ⊥ is actually equivalent to an exception in that it short circuits operations, basically once you get ⊥ you stop reading of the stream so the rest of it doesn't get computed. However we should bear in mind that this short circuiting doesn't actually save us much computation due to sharding. We still compute the full thing for shards without errors and then throw that result out when we discover one has an error. |
I think there are two types of special casing. One is when we arbitrarily change internal semantics to get the behavior we want. An example of this is throwing an error everywhere except in pluck and filter. I think we're all in agreement that this type of special casing is really bad. The second type is when we introduce a primitive (such as bottom) that supports all possible use cases we might want to support, and then have library functions have different behavior built on top of this primitive. This seems perfectly reasonable to me because the behavior doesn't change arbitrarily -- the same behavior, for example, is available to the user (e.g. they could check for bottom in their expressions and do different things based on it, such as throw an error, while they can't catch errors and make decisions on top of them right now).
Ah, makes sense.
Ok, that makes sense too. So then, how do we represent bottom on the client? |
Per my comments in #197 rethinkdb shouldn't throw an error if an attribute is missing. Given that rethinkdb is schemaless and there are no restrictions on the structure of the data that is inserted then querying should assume that missing attributes are not done in error and skip them. With a schemaless db these type of integrity constraints are the responsibility of the application by definition otherwise the DB should require support of a schema to ensure integrity at the write not the read. |
I think in lieu of representing bottom on the client, we should just throw an error if we would otherwise have to return bottom to the client (with bottom's error message, of course). So if somebody wrote: tbl.filter{|row| row[:missing_attr]} they wouldn't get an error because filter would handle bottom. But if somebody wrote: tbl.map{|row| row[:missing_attr]} then they would get an error, because the map would return bottom and we can't send bottom to the user. But if they wrote: tbl.filter{|row| row[:list].map{|row2| row2[:missing_attr]}} then they wouldn't get an error, because the map would return bottom and then the filter would handle it. As an aside, since we want bottom to short-circuit evaluation for efficiency reasons, this really seems more like throwing and catching an exception in my head. Also, we should probably call it the error type or something instead of bottom, since we're only using it for errors and it's a clearer name for people who don't know Haskell. (Also, maybe it would make sense to change |
Throwing an error on the client was how I imagined representing bottom. Actually this is sort of what we're doing right now the only thing we don't have is a way to recover from bottom. Bottom really is the functional equivalent of an exception. I don't think we want to have true exceptions here's why: We have some imperative parts of our language where bottom won't work like exceptions. For example bulk inserts. Here if one the items to be inserted returns bottom it won't behave like an exception (or at least how one might reasonable expect an exception to behave) in that it won't prevent the later insertions from happening. Making it so it did would be a pain and cost us some performance. Basically I think that exceptions are a little bit too heavyweight of a feature to fit in to our language right now, and perhaps ever. Particularly because they're going to make us sacrifice performance in some places. |
It still seems like there are 2 points of view being expressed here that aren't really addressing each other directly. Here's what they are as far as I can see it: One point of view which I believe is held by @kareemk is:
I'd like a bit of clarification on what this means. In particular skipping as far as I understand it means skipping the row in question this has a concrete meaning with a filter. That if evaluating the predicate creates such an error this row is dropped. However what does it mean in other cases? In particular what does the following return?
Similarly what if a missing attribute is encountered over the course of a map reduce? Perhaps if it's encountered while mapping a single row it might make sense to drop that row (although I think that's dubious) but if it's encountered while reducing the only thing to do I think is error. Another point of view which I believe is held by @coffeemug is:
The question I have is what exactly this looks like in the API. First if I understand correctly under this scheme should a vanilla filter fail to find an attribute then it will throw an error (being a representation a clientside representation of bottom) is this correct? Second how might error recovery syntax look? Something like:
seems like the most obvious option to me although it's a bit clunky. We could also maybe add an on_error function. Then the filter sementics would be implemented as:
These are just my off the top of my head ideas for the primitives. Others probably have some better ideas. I apologize if I've snipped away too much context with these quotes and attached beliefs that aren't actually held. This was just my impression of the 2 sides. However I think we need to make both of these ideas more concrete and eventually pick on before we can actually proceed on this issue. |
|
Following up on my previous comment, if RethinkDB is a schemaless database then the responsibility lies with the application developer to ensure that a comparison against nil due to referencing a missing attribute should never happen and branch the code appropriately. |
I think we should pick the most desired behavior for every function and use it when it makes sense. For example, When I said I think we should introduce a primitive that supports all use cases, I was making a philosophical argument. I think it's inelegant from the PL perspective to have some functions (such as filter and pluck) behave one way, and other functions (such as map, etc.) behave in a different way, without having an underlying mechanism users can use to easily implement the same behavior. Bottom provides such a mechanism. Having full-blown exception handling provides it as well. (Personally, I like bottom much more for ReQL, but that's a separate discussion). On a different note, as far as throwing an error if bottom propagates to the client, what if bottom is nested inside a document (e.g.
I'm glossing over a few issues (groupBy doesn't support multiple reductions, and |
So this may be an unpopular view but I feel like argument against special casing has some usability value beyond its philosophical merits. Having different behaviors for different functions runs the risks that the "best" behavior might not seem like the best one to everyone so it can introduce an extra piece of information you need to know to effectively use the language. An alternative which I suspect people will find ugly is to have a version of each of the functions which will drop the offending rows. Specifically The sugar could also go the other way having the default of all of these functions (which are all basically In my mind the value of having each function have the same default outweighs the cost of having some of the functions (the minority I'd assume because we can pick which side gets the sugar) require an extra character to get the most commonly desired behavior. Having less to think about just seems more valuable to me. I consider
So dropping nonexistant attributes is the default behavior. I was thinking that nested bottoms would just turn the whole expression into bottom. Returning nested bottom I think brings up a whole host of issues because then you can't just throw you need to have a way to represent it in the clients which complicates things. Also it's going to make it a lot harder to short circuit because we need to know a lot about where the result winds up to know what to do. Basically it seems like it's going to make the implementation tougher and giving people good ways to drop bottom infected is less complicated way to let people see parts of results. |
I like the idea of using bottom, but not nesting it or returning it to clients, and having filter skip rows that return bottom. The documentation for filter can say "selects all rows where the predicate returns true, ignoring any errors". I think in general functions that do a bunch of things should be resilient against one of them failing -- batched updates and inserts already keep going when they encounter an error, for example. |
Ok, agreed that consistency in default behavior outweighs the confusion associated with different behaviors in different cases. Also agreed that dealing with passing nested bottom to the client is an unnecessary complication. However, I think that doubling the number of functions will unnecessarily complicate the API. I'd much rather put the database into a "strictness state" by passing a flag on the connection, and on Also, I agree with @kareemk, I think the default behavior should be to skip over bottom. What we have now is pulling the product in two different directions (i.e. a flexible schema that one can't comfortably take advantage of because of a rather inflexible default error handling policy). If the user actually wants this behavior in production, they can get it by creating connections this way (and they'll be able to get further safety once we get to implementing constraints). |
I may be misunderstanding this but I think having a strictness state for the database is a bad idea because we'd need to store it somewhere and distribute it to different machines which gets in to all sorts of corner cases. Plus it has the potential to seriously screw you over because someone using the database for an entirely different purpose could change this flag. Strictness makes a lot more sense to me on a connection basis (which might have been what you meant I'm not sure). We also should probably return a summary which mentions how many rows were dropped due to errors. Maybe even which steps they were dropped on. This would play in nicely with what we're doing in #194. |
Another thing worth talking about is what happens with
Which is messy, if get return
The elements that don't find a matching row get dropped automatically due to bottom semantics and it's all very nice. To do this we would need to have more granular strictness settings. Because if someone sets a connection to be strict they won't expect it to change the meaning of |
Right, I think for the first implementation we should pass it on the connection, and on
Agreed.
Could you explain this in more detail? I don't think the strictness setting should affect the behavior of I don't know how much of a burden a per-op setting on protocol-buffers is, but if it's a pain, I don't think it's worth bothering with it for now. |
Strictness setting wouldn't affect the behavior of
It's very simple. |
@coffeemug I really like your proposal of making this a connection level setting that makes perfect sense and your approach of keeping the API as simple as possible (from the users's perspective) until there is demand from users for other features/options. With respect to throwing an error (returning bottom) I actually think that if you attract a large swatch of the mongodb crowd, which I think you have a very good chance of doing, I have a strong hunch you won't see many requests for this. |
Ahh, I see. It would be pretty easy to desugar to a branch that tests for bottom, which IMO is "righter" in this case than introducing a per-op argument (it doesn't feel like the right abstraction in this case, but I don't have a strong argument to support this). |
Actually having thought about it more I think desugarring to branch is a better idea. A strong argument is that missing attrs can come from different places. In the case of eq_join you can get it if an element of the stream is missing the attribute. This is a case where with strictness turned on you'd expect it to error but if we desugar to an error dropping map then it won't. So it seems like it is a bad abstraction. |
I would prefer having defaults (#27) rather than having this new primitive. Also, I strongly suggest not calling it bottom. The real meaning of bottom in type theory is very different from what you are calling bottom. It is a type that has no value. It is not a value that can be returned, and in a strict language it can not be passed as an argument. Haskell itself, even if it is a non-strict language, does not have bottom. A nicer way to name this new primitive with concepts from functional programming is to call it Nothing. It could also just be called undefined and have different semantics than in javascript. |
There has been a similar discussion in this thread related to refutable destructuring assignment in EcmaScript 6 (thread starts here), and the name they've given a non-signaling empty value is Instead of thinking of what we're talking about as a "default value", I think it makes more sense to think of it as a hole or bubble that pops up in data sets as we perform operations on sets of incomplete, structured values, analogous to the role that NULL plays in SQL's OUTER JOIN semantics (not that SQL should be a model for anything Rethink does -- please don't introduce ternary logic into Rethink). I think it can be sane as long as a few caveats are heeded:
The point of all this rigamarole is to allow people to store objects with values that are |
This issue has been subsumed by #570. Closing. |
Consider a query with a filter such as:
If the users table has a document without the age attribute, the database yells:
I do not wish to use an additional filter
user.contains(:age)
everywhere.Can the database treat missing attribute as null ?
The text was updated successfully, but these errors were encountered: