Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a fast version of distinct that uses an index. #1864

Closed
mlucy opened this issue Jan 14, 2014 · 12 comments
Closed

Add a fast version of distinct that uses an index. #1864

mlucy opened this issue Jan 14, 2014 · 12 comments

Comments

@mlucy
Copy link
Member

mlucy commented Jan 14, 2014

A lot of people want to use distinct on large tables. We should do two things, I think:

  • Make distinct less inefficient. Use an internal map that we update rather than forcing everything into an array. This will help the case where there's a large input stream but only a few distinct elements.
  • Add a version of distinct that works on indexes, like r.table('test').distinct(index:'user_id') (we can do this efficiently because we can efficiently access a sorted representation). This will give people an option when they want to do this and their input and output streams are too big.
@coffeemug
Copy link
Contributor

FYI, there is already #1354 and #1135. I'll close them.

@mlucy
Copy link
Member Author

mlucy commented Jun 18, 2014

Everyone I've talked to seems to be on the same page on this. For posterity's sake, though: are there any objections?

(If not, I'm planning to mark this settled in a day or so and start working on it next.)

@coffeemug
Copy link
Contributor

I think the only reason it isn't done yet is that we just didn't have time to get to it.

@coffeemug
Copy link
Contributor

Moving to 1.14-polish. This isn't as important as major features, and I'd like to agree on those first. We can tackle this later.

@coffeemug coffeemug modified the milestones: 1.14-polish, 1.14 Jun 24, 2014
@brettgriffin
Copy link

@coffeemug - Is there anything I can to do to convince you to move this back to 1.14? I think I started the conversation that opened this issue in January[0] and I have eagerly been awaiting its completion. I thought it made it into 1.13 but I was wrong. This is actually the ONE issue that is forcing me to use MongoDB's aggregation framework over RethinkDB.

The lack of this feature severely limits RethinkDB from doing any real analytics aggregations right now (so often the count of distinct of an event is as important of count of event - e.g. unique users/visitors to an application).

Please reconsider!

[0] - https://groups.google.com/forum/#!searchin/rethinkdb/distinct/rethinkdb/vGqLvYtWtkE/OJXALpPLhvYJ

@neumino
Copy link
Member

neumino commented Jun 29, 2014

@brettgriffin -- for a work around now, you can do:

r.table("users").map(function(user) {
    return r.object(user("name"), true) // return { <name> : true}
}).reduce(function(left, right) {
    return left.merge(right)
})

And it will return an object where the keys are all the distinct names. You can chain with keys() if you want an array (but make sure that you don't have more than 100.000 distinct keys.

@bgriffinbl
Copy link

That's a nice workaround. It still requires returning the object and counting the keys in node but at least it does the issue. I suppose I can live with this until 1.14-polish. Thanks.

@mlucy
Copy link
Member Author

mlucy commented Jun 30, 2014

For the record, distinct with an index will be better than that -- the object solution still requires space proportional to the number of distinct elements (rather than the total number of elements in the stream), while indexed distinct doesn't.

(Also, this would be a good second-tier project for Graham when he shows up, because it isn't too hard but doing it right touches the aggregator code, the unsharding code, and the datum stream code.)

@mlucy
Copy link
Member Author

mlucy commented Jun 30, 2014

(Also, I'm sorry we haven't gotten this done yet @bgriffinbl !)

@coffeemug
Copy link
Contributor

@bgriffinbl -- I'm convinced, moving this back to 1.14. We have some more development resources now and I think we can make this happen in the next release. Sorry you had to wait for so long, and thank you for advocating so patiently for this issue!

@coffeemug coffeemug modified the milestones: 1.14, 1.14-polish Jul 2, 2014
@mlucy mlucy self-assigned this Jul 3, 2014
@mlucy
Copy link
Member Author

mlucy commented Jul 9, 2014

This is in next, CR 1729.

@larkost
Copy link
Collaborator

larkost commented Aug 27, 2014

@bgriffinbl: This has shipped with version 1.14.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants