
NULL/default/contains/etc. proposal #570

Closed
mlucy opened this issue Apr 2, 2013 · 26 comments
@mlucy
Member

mlucy commented Apr 2, 2013

Slava, Bill, Sam and I just talked about this. Here's what I think we should do about all of these:

  • Adopt default with exception semantics. You write either query.default(val) or query.default {|exc| func_of_exc}. I think we should do this because exception semantics are easy to explain and easy to implement (it's a small edit distance from what we have now to these semantics).
  • Rename contains to has_fields. Make it polymorphic, where calling seq.has_fields(:a, :b) is equivalent to seq.filter {|x| x.has_fields(:a, :b)}. Our new consistent rule for polymorphism is that terms which return an arbitrary object (like pluck) are polymorphic with respect to map, and predicates which only return a bool are polymorphic with respect to filter. (I don't think most users will ever explicitly learn this rule, but I would bet it has the right shape for their brains.)
  • has_fields will return true if the object has all of the keys passed to it AND the value of all those keys is non-null. It turns out that for real queries this is what you want most of the time.
  • Introduce a new primitive (has_attributes, has_names, has_keys, something like that) which returns true if the object has all of the keys (without checking whether the value of those keys is NULL). We don't have to emphasize this in the documentation, but people need some way to do this.

This gives two ways to solve the problem people have where they want to write e.g.:

tbl.filter {|user| user[:age] > 10} # some users don't have `age`!
tbl.has_fields(:age).filter {|user| user[:age] > 10}
tbl.filter {|user| user[:age].default(0) > 10}
tbl.filter {|user| (user[:age] > 10).default(false)}
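To make the proposed semantics concrete, here is a small Python simulation over plain dicts. This is a sketch of the behavior described above, not the ReQL API; `has_fields` and `default` here are stand-ins, and KeyError stands in for ReQL's missing-attribute error.

```python
# Sketch of the proposed semantics over plain Python dicts -- NOT the ReQL API.

def has_fields(row, *fields):
    # True iff every field is present AND its value is non-null.
    return all(f in row and row[f] is not None for f in fields)

def default(thunk, fallback):
    # `default` fires only when evaluating the expression raises,
    # matching the "exception semantics" adopted above.
    try:
        return thunk()
    except KeyError:  # stand-in for ReQL's missing-attribute error
        return fallback

users = [{"name": "a", "age": 15}, {"name": "b"}]

# tbl.has_fields(:age).filter {|user| user[:age] > 10}
print([u for u in users if has_fields(u, "age") and u["age"] > 10])

# tbl.filter {|user| user[:age].default(0) > 10}
print([u for u in users if default(lambda: u["age"], 0) > 10])
```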
@ghost ghost assigned mlucy Apr 2, 2013
@mlucy
Member Author

mlucy commented Apr 2, 2013

Update:

  • We should go back to the behavior for pluck where r({:a => 1, :b => 2}).pluck(:a, :c) yields {:a => 1}. This is good because the primitive seems more useful that way.
  • We should introduce a term with_fields where tbl.with_fields(:a, :b) is equivalent to tbl.has_fields(:a, :b).pluck(:a, :b). This is good because it's a very common operation. (What I said to Slava is "we may end up having lots of terms for operating on streams of objects for the same reason that common lisp has lots of functions for operating on lists of lists".)
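The two updates can be sketched in Python over plain dicts (a simulation of the proposed behavior, not the driver API):

```python
# Sketch (not the real driver): with_fields(a, b) == has_fields(a, b) + pluck(a, b),
# using the updated pluck behavior that silently skips missing keys.

def pluck(row, *fields):
    # Updated pluck: keep only the requested keys that are actually present.
    return {f: row[f] for f in fields if f in row}

def with_fields(rows, *fields):
    # Keep only rows that have every field present and non-null,
    # then project each row down to those fields.
    return [pluck(r, *fields)
            for r in rows
            if all(f in r and r[f] is not None for f in fields)]

# r({:a => 1, :b => 2}).pluck(:a, :c) yields {:a => 1}
print(pluck({"a": 1, "b": 2}, "a", "c"))

rows = [{"a": 1, "b": 2, "c": 3}, {"a": 1}, {"a": None, "b": 2}]
print(with_fields(rows, "a", "b"))
```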

@coffeemug
Contributor

I really like this proposal, for all of the following reasons:

  • It's incremental. Instead of spending months debating various evaluation models and then discovering that they break in some edge cases, don't fit reality, etc., this proposal takes our existing system and converts it into the one that's actually desirable via a small, incremental change.
  • It maintains safety semantics for people who want them. For example, if we make filter drop rows on error, users will get confused when they get nothing back because of a silly error somewhere. With this proposal, people can pick between the two models extremely easily, and get a safe model by default (which seems very sensible to me).
  • If we discover additional issues somewhere, we can tune the specific commands to fix them (whereas if we discover additional issues in a new evaluation model, they will quite possibly mean redoing everything).
  • Last but very much not least, this is easy to do. We can likely have this whole problem go away in just two days of development time.

I'd really like to hear what @al3xandru and @jdoliner think about it.

@neumino
Member

neumino commented Apr 2, 2013

I have a small question. When is default executed? Is it when an error is thrown? Or also when we return null?

@mlucy
Member Author

mlucy commented Apr 2, 2013

Just when an error is thrown. So the second example wouldn't work if you're using the "NULL means missing attribute" paradigm.

@al3xandru
Contributor

  1. pluck not throwing: a must

  2. the proposed semantic of default looks ok

  3. the proposed semantics of default (and with_fields) don't address the use case of extended variability in the row structure.

    For cases where rows show big variability in their structure, the query will be littered with defaults. The only options would be for users to write multiple queries and union them, or to wrap every other function in the query in a default.

  4. For addressing the above I suggest a change of with_fields semantics that would allow specifying default values for the fields:

    r.table('stats').with_fields({ :a => 20, :b => { :n => 32, :m => 70})...
  5. If NULL values do compare to all other types we support, then having separate functions for has_fields and has_something_else doesn't make sense. I'd suggest has_fields just do the check of presence, and has_fields(not_null=true) check for presence and not NULL.

    Keeping the number of functions we expose low will make the query language feel powerful and simple. The learning curve is important. The more API functions added, the more complicated things are perceived.
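The with_fields-with-defaults idea from point 4 above can be sketched in Python (hypothetical semantics over plain dicts, with a made-up name `with_fields_defaults`; this is not a shipped API):

```python
# Sketch of the suggested variant: with_fields taking a mapping of default
# values, so a row missing a field gets the default instead of being dropped.
# Nested dicts in the spec supply defaults for nested fields.

def with_fields_defaults(rows, defaults):
    def fill(row, spec):
        out = {}
        for key, dflt in spec.items():
            if isinstance(dflt, dict):
                # Recurse into nested specs, e.g. {:b => {:n => 32, :m => 70}}.
                out[key] = fill(row.get(key, {}), dflt)
            else:
                out[key] = row.get(key, dflt)
        return out
    return [fill(r, defaults) for r in rows]

rows = [{"a": 1}, {"b": {"n": 5}}]
# Mirrors r.table('stats').with_fields({ :a => 20, :b => { :n => 32, :m => 70}})
print(with_fields_defaults(rows, {"a": 20, "b": {"n": 32, "m": 70}}))
```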

@al3xandru
Contributor

Our new consistent rule for polymorphism is that terms which return an arbitrary object (like pluck) are polymorphic with respect to map, and predicates which only return a bool are polymorphic with respect to filter.

I'm not sure about users, but I'm not sure I'm getting it either.

@jdoliner
Contributor

jdoliner commented Apr 2, 2013

It feels like this really doesn't address the issue that got us here. The original issue as I recall was that people were calling filter with a predicate that failed on some rows due to missing fields. To do this they had to write:

tbl.filter {|user| user.has_attr(:age) && user[:age] > 10}

And this was annoying to them.

If that was annoying to them are these better?

tbl.has_fields(:age).filter {|user| user[:age] > 10}
tbl.filter {|user| user[:age].default(0) > 10}
tbl.filter {|user| (user[:age] > 10).default(false)}

Maybe marginally, but I think they're still going to be pretty annoying. With this solution, a project which has very unstructured data is going to wind up writing default(false) a lot, and that's going to add up quickly. I agree that it's bad for the default behavior of filter to drop these rows, but what if we added a version of filter that does?

Tentatively I'd like to call it filterU, meaning "unstructured": the unstructured versions of functions (map, concat_map, and maybe reduce could also have unstructured versions) implicitly drop rows which fail due to missing fields. The point of adding such a function is that people who are seriously using ReQL in an unstructured way have a canonical shorthand for this type of operation that isn't going to bloat their code base with ugly code.
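The "unstructured filter" idea can be sketched in Python (`filter_u` is a hypothetical name from this discussion, not a shipped API; KeyError stands in for ReQL's missing-attribute error):

```python
# Sketch of an "unstructured" filter: rows whose predicate fails due to a
# missing field are silently dropped instead of raising an error.

def filter_u(rows, pred):
    out = []
    for row in rows:
        try:
            if pred(row):
                out.append(row)
        except KeyError:  # stand-in for ReQL's missing-attribute error
            pass          # drop the row instead of propagating the error
    return out

users = [{"age": 15}, {"name": "no-age"}, {"age": 5}]
print(filter_u(users, lambda u: u["age"] > 10))
```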

@jdoliner
Contributor

jdoliner commented Apr 2, 2013

I like the idea of having has_fields be variadic. I think having it be polymorphic is potentially confusing. My first guess for what has_fields would do on a stream is check whether there's an element of the stream that matches the one I gave it. It's a pretty big jump for me to know that has_fields on a stream means "give me only the elements of the stream which have these fields". I think this makes sense if you talk about it for a while, but it's going to be confusing to people coming in cold. This might be okay though. I think it would be a lot better if I could say:

table.filter(r.has_fields("foo", "bar"))

@jdoliner
Contributor

jdoliner commented Apr 2, 2013

I detest the idea of having has_attributes, has_names, has_keys which is mostly the same as has_fields but with a subtly different meaning. I actually don't know what it is that leads programmers to think that using synonyms to express subtle differences in semantics is a good idea. It's terribly confusing.

@coffeemug
Contributor

Ok, this is workable. We'll have to work out a few naming issues and edge cases around functions, etc. outlined above, but overall this seems satisfactory. I'm working on scheduling 1.5-sprint-2 now. Once I'm done, I'll get all the relevant people together and we'll hammer this out.

@coffeemug
Contributor

Since we should decide whether we treat null the same as non-existence, we should probably be consistent about it. E.g. if default checks for both errors and null by default, we should probably also redefine comparisons to throw on encountering null.

@mlucy
Member Author

mlucy commented Apr 3, 2013

I think we should throw on any comparison with null and introduce a new is_null term.

@jdoliner
Contributor

jdoliner commented Apr 3, 2013

So comparisons throw. What about plain attribute access? For example:

table.map(row["non_existent_key"]) # throws
table.map(row["null_key"]) # doesn't throw?

table.filter(row["non_existent_key"]) # throws
table.filter(row["null_key"]) # throws?

Also adding an is_null term to get around the fact that all comparisons with null throw is insane. People are going to expect to be able to say: table.filter(row["foo"] != null) and have it not throw. This also is really weird when you have objects with null in them. You actually now can't write a function that compares two objects with null in them for equality.

I think the only sane way to compare null is this:

null == null -> true
null != null -> false
null == non_null -> false
null != non_null -> true
All other comparisons throw.
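These rules can be sketched in Python, with None standing in for ReQL null (this models the proposal under discussion, not RethinkDB's shipped behavior):

```python
import operator

# Sketch of the proposed null-comparison rules: equality and inequality are
# defined for null; every other comparison involving null throws.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def rql_compare(op, a, b):
    if a is None or b is None:
        if op == "==":
            # null == null -> true; null == non_null -> false
            return a is None and b is None
        if op == "!=":
            # null != null -> false; null != non_null -> true
            return not (a is None and b is None)
        raise TypeError("null only supports == and != under this proposal")
    return OPS[op](a, b)

print(rql_compare("==", None, None))  # null == null
print(rql_compare("!=", None, 3))     # null != non_null
```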

@mlucy
Member Author

mlucy commented Apr 3, 2013

That would be reasonable too.

@coffeemug
Contributor

👍 for @jdoliner's suggestion. I'm cool with is_null as sugar for == null, but that's obviously superfluous.

@coffeemug
Contributor

A quick status update: things are a little crazy for me with 1.5-sprint-1 being over, and sprint-2 planning, but I'd like to get together IRL with all relevant stakeholders this week (Friday at the latest), and hammer out the final proposal. We can then implement it next week.

@al3xandru
Contributor

If I'm reading this correctly now we'll have:

  1. null responding to equality and non-equality comparisons
  2. but null throwing in all other comparisons

So we end up with one special case (a non-existing attribute) and half a special case (null). Is there any particular reason why we want to treat null as a (semi) special case?

The above proposal puts us very close to SQL behavior for NULL, but with a major difference: instead of always returning false for any comparison with null, we throw.

Introducing is_null also reminds me of SQL IS NULL and HAVING.

I think in my ideal world:

  1. null would be just a normal value with a clear definition for comparisons (basically what @jdoliner defined above, plus null.cmp(not_null) -> false and not_null.cmp(null) -> false)
  2. the only special case would be a missing field, which would be dealt with by has_fields and with_fields(default_values).

@chrisabrams

What's the status on this? I just recently ran into the same issue with properties not existing.

@mlucy
Member Author

mlucy commented May 12, 2013

Most of this (default, has_fields, with_fields) is in internal code-review at the moment. Changing the behavior of NULLs so that they throw on comparison probably isn't going to happen in the near future because there are too many things that depend on all RQL values having a well-defined ordering.

The 1.5 release is coming up very soon, so these changes probably won't make it into that, but there's a solid chance they'll be in for 1.6.

@chrisabrams

So for filtering for properties that do not exist, is there anything I can do at the moment to prevent an error?

@mlucy
Member Author

mlucy commented May 12, 2013

I might be misunderstanding your question, but you can explicitly use contains to pre-filter the list:

table.filter {|row|
  row.contains(:a, :b, :c)
}.filter {|row|
  row[:a] + row[:b] < row[:c]
}

(contains will be renamed to something like has_fields when all of this gets merged in because so many people find the name confusing)

@chrisabrams

Ahhh. Yes I am one of those people :)

@ricardobeat

Since null needs to respond to comparisons so that orderBy can work, I don't see why it should throw in other operations. Won't the need for pre-filtering result in worse performance?

When dealing with unstructured data I want null/undefined errors to be silent 99% of the time. Mongo has special $type and $exists queries to cater for the special cases when you don't want that.

@jdoliner
Contributor

How much is this issue helping us right now?

@coffeemug
Contributor

Good bye, relic of history!

@coffeemug
Contributor

P.S. it feels really good to close these, so I feel a little guilty closing the ones you put in the work to find. Feel free to reopen and close again yourself :)


7 participants