Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: change the behavior of secondary indexes wrt NULLs #1001

Closed
coffeemug opened this issue Jun 13, 2013 · 24 comments
Closed

Proposal: change the behavior of secondary indexes wrt NULLs #1001

coffeemug opened this issue Jun 13, 2013 · 24 comments
Assignees
Milestone

Comments

@coffeemug
Copy link
Contributor

I was writing up an explanation of how to use secondary indexes to match regexes and I followed my intuition and wrote this:

r.table('users').indexCreate('test', r.row('first_name').match('S.*'))
r.table('users').between(null, null, {index: 'test'})

I expected the between call to return only the documents where first_name starts with S, but instead it returned everything because we include null into the secondary index. I had to rewrite the query as follows to actually work:

r.table('users').indexCreate('test', r.row('first_name').match('S.*').ne(null))
r.table('users').getAll(true, {index: 'test'})

I think we picked behavior of secondary indexes wrt NULLs and should change it to exclude NULLs. That seems a lot more natural to me.

@coffeemug
Copy link
Contributor Author

Note that if I just used filter, I'd write this: r.table('users').filter(r.row('first_name').match('S.*')). I'd naturally expect secondary indexes to be congruent with this behavior.

@mlucy
Copy link
Member

mlucy commented Jun 13, 2013

This is more of a between thing than a secondary index thing. If we change this we'll have to pick a new way to denote an empty endpoint for between.

Also, as an aside, making a secondary index with match like that won't really do what you want because each row that matches will have its own entry in the index (rather than being true it will be some object). So even with this change, you could easily get all the rows that don't match a regex, but you couldn't get all the rows that do.

What you really want here is r.row('first_name').match('S.*').coerce_to("bool"), although that's an ugly thing to ask people to type.

@coffeemug
Copy link
Contributor Author

Ah, noted on not being able to get all the rows that do :( We should probably close this proposal then.

However, I don't see what it has to do with between and why we'd have to change the denotation for empty endpoints. Could you clarify?

@wmrowan
Copy link
Contributor

wmrowan commented Jun 13, 2013

I thought we had a rule that index (only primary?) keys could not be null. This is what allows us to use null for empty endpoint in between. If secondary index keys can be null but between is supposed to work on secondary indexes then we already can't use nulls to denote empty endpoints. This is an inconsistency that we have to fix one way or the other.

Given that we decided to go the route that null generally means nonexistence then @coffeemug's original proposal makes total sense. Returning null in the secondary index function means "don't include in the index".

@mlucy why does the query have to be (...).coerce_to('boo')? Why does @coffeemug's (...).ne(null) not work?

@coffeemug
Copy link
Contributor Author

(...).ne(null) does work, FYI.

On Thu, Jun 13, 2013 at 3:11 AM, wmrowan notifications@github.com wrote:

I thought we had a rule that index (only primary?) keys could not be null.
This is what allows us to use null for empty endpoint in between. If
secondary index keys can be null but between is supposed to work on
secondary indexes then we already can't use nulls to denote empty
endpoints. This is an inconsistency that we have to fix one way or the
other.

Given that we decided to go the route that null generally means
nonexistence then @coffeemug https://github.com/coffeemug's original
proposal makes total sense. Returning null in the secondary index function
means "don't include in the index".

@mlucy https://github.com/mlucy why does the query have to be
(...).coerce_to('boo')? Why does @coffeemug https://github.com/coffeemug's
(...).ne(null) not work?


Reply to this email directly or view it on GitHubhttps://github.com//issues/1001#issuecomment-19382996
.

@mlucy
Copy link
Member

mlucy commented Jun 13, 2013

@wmrowan -- that works fine. coerce_to("bool") is effectively equivalent to do{|x| x.ne(null).and(x.ne(false)).

@coffeemug -- we currently disallow null as the value for an index and use it as a special value in between. If you write .between(null, 1) that means "anything <= 1", not "anything between null and 1".

@jdoliner
Copy link
Contributor

Using null as a special value in between is just generally a bad idea, we
should stop doing it and use a special value to represent unboundedness. My
objection is that if you're getting the arguments to between from another
expression such as accessing a field of another document its very easy for
a null which meant nonexistence to become a null which means unboundedness
and yield confusing results.
On Jun 13, 2013 4:58 PM, "Michael Lucy" notifications@github.com wrote:

@wmrowan https://github.com/wmrowan -- that works fine.
coerce_to("bool") is effectively equivalent to do{|x| x.ne(null).and(x.ne
(false)).

@coffeemug https://github.com/coffeemug -- we currently disallow nullas the value for an index and use it as a special value in
between. If you write .between(null, 1) that means "anything <= 1", not
"anything between null and 1".


Reply to this email directly or view it on GitHubhttps://github.com//issues/1001#issuecomment-19424578
.

@mlucy
Copy link
Member

mlucy commented Jun 14, 2013

I dunno, we seem to be moving away from using null as a user-value. If it represents non-existence, it actually makes a lot of sense to use it to represent a non-existent endpoint.

@neumino
Copy link
Member

neumino commented Jun 14, 2013

I'm strongly in favor of having null as a valid value and change the behavior of between.

Using null to represent unboudedness is just bad design.
It's also a pain for users. null in javascript is an object, not an undefined value.

@mlucy
Copy link
Member

mlucy commented Jun 14, 2013

@neumino -- what signature would you like instead?

@jdoliner
Copy link
Contributor

between(r.open, r.open) would work for me.
On Jun 14, 2013 6:24 PM, "Michael Lucy" notifications@github.com wrote:

@neumino https://github.com/neumino -- what signature would you like
instead?


Reply to this email directly or view it on GitHubhttps://github.com//issues/1001#issuecomment-19484895
.

@coffeemug
Copy link
Contributor Author

One moment ya'll -- I was just talking about changing the behavior of indexCreate to drop rows where the key of the secondary index is null (where as right now the row inserted with a null key).

The issue of between is somewhat related, but regardless of what we do with between I think we should change the behavior of index creation.

@neumino
Copy link
Member

neumino commented Jun 14, 2013

We shouldn't drop rows when the secondary index is null.

I expect

r.db("foo").table("bar").filter( r.row("field").eq( value ) )

to return exactly the same results as

r.db("foo").table("bar").getAll( value, { index: "field" })

Whatever the value of value is.

The syntax of @jdoliner would work for me.

@mlucy
Copy link
Member

mlucy commented Jun 15, 2013

@jdoliner -- I really don't like our new trend toward introducing terms that are only valid inside of one specific other term. It feels really hacky and ugly (although null does too). Can you think of any way of getting rid of null that doesn't do that?

@neumino -- the advantage of Slava's proposal would be that creating an index on an attribute would have exactly the same behavior if that attribute is null or if that attribute doesn't exist in the object. I also think that if we could push people away from using null as a regular value rather than a way to represent non-existence, that that might be a good thing.

@neumino
Copy link
Member

neumino commented Jun 15, 2013

We could have .after(start_key) and .before(end_key).

For the rest, I still don't see a valid reason not to let people use null. We can of had a bad start with null with our API, but it doesn't mean that we have to keep pushing it away.

@jdoliner
Copy link
Contributor

It seems like there's no getting around some added complexity here I find
special purpose values tasteful but you done so here's some other options.

  • we could have 3 versions of the function between lower_bound and
    upper_bound I seriously doubt @mlucy will like that and I really don't
    either but its an option.

  • we could introduce r.min_value and r.max_value. These would be valid in
    many other places so it addresses @mlucy's concern I think we should
    disallow serializing them as values to disk and maybe returning them as
    data to users (since they don't really convert to very usable native things
    in the language) this means they're probably more complicated to implement
    but maybe it's worth it.


Since we seem robe discussing 2 issues in one thread here I'll now address
the nulls in sindexes issue. I'm all for dropping null values but if I
recall correctly @coffemug you were actually the main proponent of not
dropping them originally. This isn't to say that you've hit your opinion
quota on this issue but normally when we rethink a decision we have some
main proponent reminding us why we originally made it. I forget what it was
but I remember it was pretty convincing and I'd kind of like to have it as
context. Can you remember what it was?

On Friday, June 14, 2013, Michel wrote:

We could have .after(start_key) and .before(end_key).

For the rest, I still don't see a valid reason not to let people use null.
We can of had a bad start with null with our API, but it doesn't mean
that we have to keep pushing it away.


Reply to this email directly or view it on GitHubhttps://github.com//issues/1001#issuecomment-19490568
.

@coffeemug
Copy link
Contributor Author

As far as between and dropping nulls goes, the obvious solution is to use key-based arguments instead of positional ones:

table.between() # returns everything
table.between(start = 'bar') # returns everything after 'bar'
table.between(end = 'foo') # returns everything before 'foo'
table.between(start = 'bar', end = 'foo') # returns everything between 'bar' and 'foo'

This seems ok to me. However, if we end up not allowing nulls in secondary indexes, I don't see what's wrong with jut passing null to between to indicate an infinite bound like we do now.


As for nulls in sindexes, I don't think either solution had a strong proponent. As far as I recall, we wobbled between available options and picked the one currently implemented (which I agree with at the time). I now think we picked the wrong one in light of treating null as non-existence.

@wmrowan
Copy link
Contributor

wmrowan commented Jun 17, 2013

@coffeemug What you just described for between is what we originally implemented. We switched it for 1.5 because we thought the optional argument approach was clunky: table.between({leftBound: 'a', rightBound: 'b'}).

@srh
Copy link
Contributor

srh commented Jun 17, 2013

Can you think of any way of getting rid of null that doesn't do that?

@mlucy What alternative do you think is best?

@mlucy
Copy link
Member

mlucy commented Jun 17, 2013

Alright, here's what I think we should do:

  • Make null not be in the secondary index because it's confusing.
  • Leave the current null syntax for 1.7. It's ugly, but if we don't allow NULLs in secondary indexes then it's non-ambiguous, and I think there are more important RQL problems that we need to address first. (We still have dual-inclusive slice, for god's sake.)
  • Open a new issue to fix between, since it's outside the scope of this issue (we originally got onto it because I misunderstood what Slava was saying).

As for what I think we should do about between, I actually sort of like r.minval and r.maxval. They would be useful in other terms as well. For example, they would fix #955 at the same time; you could get a common prefix by writing table.between([prefix, r.minval], [prefix, r.maxval], :index => ...).

@neumino
Copy link
Member

neumino commented Jun 17, 2013

@mlucy I didn't get why you think having null in the secondary index is confusing. What's confusing?

@mlucy
Copy link
Member

mlucy commented Jun 18, 2013

Mostly because it would be yet another case where null and non-existence differ. If someone creates a secondary index on a particular field (which I think is by far the most common use case, especially for novice users), then it's confusing to have different behavior when that field doesn't exist vs. when it's null. I could be convinced otherwise, though.

Actually, it turns out that we already exclude null from secondary indices; between is just incorrect because it has a code path from before we had secondary indexes where if you don't specify either endpoint it's just an identity op. We should just fix between to not have that behavior.

@ghost ghost assigned mlucy Jun 18, 2013
@mlucy
Copy link
Member

mlucy commented Jun 18, 2013

This is in review 642 by @Tryneus .

mlucy added a commit that referenced this issue Jun 18, 2013
@mlucy
Copy link
Member

mlucy commented Jun 18, 2013

This is in next. Closing because the original issue is satisfied (or rather, it turns out we were doing that all along).

@neumino -- if you still think we should allow null as an sindex value, could you open a separate RQL proposal for it?

@mlucy mlucy closed this as completed Jun 18, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants