Add a fast version of distinct that uses an index. #1864
Comments
Everyone I've talked to seems to be on the same page on this. For posterity's sake, though: are there any objections? (If not, I'm planning to mark this settled in a day or so and start working on it next.)
I think the only reason it isn't done yet is that we just didn't have time to get to it.
Moving to 1.14-polish. This isn't as important as major features, and I'd like to agree on those first. We can tackle this later.
@coffeemug - Is there anything I can do to convince you to move this back to 1.14? I think I started the conversation that opened this issue in January[0] and I have eagerly been awaiting its completion. I thought it made it into 1.13 but I was wrong. This is actually the ONE issue that is forcing me to use MongoDB's aggregation framework over RethinkDB. The lack of this feature severely limits RethinkDB from doing any real analytics aggregations right now (so often the distinct count of an event is as important as the count of the event, e.g. unique users/visitors to an application). Please reconsider! [0] - https://groups.google.com/forum/#!searchin/rethinkdb/distinct/rethinkdb/vGqLvYtWtkE/OJXALpPLhvYJ
@brettgriffin -- for a work around now, you can do:
And it will return an object where the keys are all the distinct names. You can chain with
That's a nice workaround. It still requires returning the object and counting the keys in Node, but at least it works around the issue. I suppose I can live with this until 1.14-polish. Thanks.
For the record, (Also, this would be a good second-tier project for Graham when he shows up, because it isn't too hard but doing it right touches the aggregator code, the unsharding code, and the datum stream code.)
(Also, I'm sorry we haven't gotten this done yet @bgriffinbl !)
@bgriffinbl -- I'm convinced, moving this back to 1.14. We have some more development resources now and I think we can make this happen in the next release. Sorry you had to wait for so long, and thank you for advocating so patiently for this issue!
This is in next, CR 1729.
@bgriffinbl: This has shipped with version 1.14.0
A lot of people want to use distinct on large tables. We should do two things, I think:
1. Make distinct less inefficient. Use an internal map that we update rather than forcing everything into an array. This will help the case where there's a large input stream but only a few distinct elements.
2. Add a version of distinct that works on indexes, like r.table('test').distinct(index:'user_id') (we can do this efficiently because we can efficiently access a sorted representation). This will give people an option when they want to do this and their input and output streams are too big.
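To make the two proposals concrete, here is a minimal sketch in plain Python. It is not RethinkDB's implementation, and the function names `distinct_hashed` and `distinct_sorted` are hypothetical. The first keeps a set of values already seen, so memory grows with the number of distinct elements rather than with the size of the input. The second assumes its input arrives in sorted order, as it would when read from an index, so it only has to remember the previous value.

```python
from typing import Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


def distinct_hashed(stream: Iterable[T]) -> Iterator[T]:
    """Proposal 1 (sketch): track seen values in a set instead of
    materializing the whole input as an array."""
    seen = set()
    for value in stream:
        if value not in seen:
            seen.add(value)
            yield value


def distinct_sorted(sorted_stream: Iterable[T]) -> Iterator[T]:
    """Proposal 2 (sketch): the input is assumed to be sorted already
    (e.g. read in index order), so equal values are adjacent and only
    the previous value needs to be kept."""
    sentinel = object()  # marks "no previous value yet"
    previous = sentinel
    for value in sorted_stream:
        if previous is sentinel or value != previous:
            yield value
            previous = value


if __name__ == "__main__":
    user_ids = [3, 1, 3, 2, 1, 3]
    print(list(distinct_hashed(user_ids)))          # [3, 1, 2]
    print(list(distinct_sorted(sorted(user_ids))))  # [1, 2, 3]
```

The sorted variant uses constant memory however many distinct values there are, which is why an index helps when both the input and the output are too large to hold in memory.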