Add a fast version of `distinct` that uses an index. #1864

mlucy · 2014-01-14T20:43:01Z

A lot of people want to use distinct on large tables. We should do two things, I think:

1. Make distinct less inefficient. Use an internal map that we update rather than forcing everything into an array. This will help the case where there's a large input stream but only a few distinct elements.
2. Add a version of distinct that works on indexes, like `r.table('test').distinct(index:'user_id')` (we can do this efficiently because we can efficiently access a sorted representation). This will give people an option when they want to do this and their input and output streams are too big.
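
To make the difference between the two proposals concrete, here is a minimal plain-JavaScript sketch. It is not RethinkDB server code, and the function names `distinctUnsorted` and `distinctSorted` are hypothetical, chosen only for illustration. The first keeps a set of values seen so far; the second assumes its input arrives already sorted, which is what reading through an index would provide.

```javascript
// Approach 1: one pass over unsorted input, remembering each value seen.
// Memory grows with the number of distinct values, not the input length.
function* distinctUnsorted(values) {
  const seen = new Set();
  for (const v of values) {
    if (!seen.has(v)) {
      seen.add(v);
      yield v;
    }
  }
}

// Approach 2: input is assumed to be sorted already (as an index traversal
// would be). Equal values are adjacent, so only the previous value is kept.
function* distinctSorted(sortedValues) {
  let first = true;
  let prev;
  for (const v of sortedValues) {
    if (first || v !== prev) {
      yield v;
      prev = v;
      first = false;
    }
  }
}

// Hypothetical sample data standing in for a stream of user_id values.
const userIds = [3, 1, 3, 2, 1, 3];
const sortedIds = [...userIds].sort((a, b) => a - b);

console.log([...distinctUnsorted(userIds)]); // [ 3, 1, 2 ]
console.log([...distinctSorted(sortedIds)]); // [ 1, 2, 3 ]
```

Because both are generators, results can be streamed out one at a time; the sorted variant never holds more than one previous value in memory.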


coffeemug · 2014-01-14T21:52:11Z

FYI, there are already #1354 and #1135. I'll close them.

mlucy · 2014-06-18T08:39:08Z

Everyone I've talked to seems to be on the same page on this. For posterity's sake, though: are there any objections?

(If not, I'm planning to mark this settled in a day or so and start working on it next.)

coffeemug · 2014-06-18T08:40:58Z

I think the only reason it isn't done yet is that we just didn't have time to get to it.

coffeemug · 2014-06-24T00:59:59Z

Moving to 1.14-polish. This isn't as important as major features, and I'd like to agree on those first. We can tackle this later.

brettgriffin · 2014-06-29T16:32:31Z

@coffeemug - Is there anything I can do to convince you to move this back to 1.14? I think I started the conversation that opened this issue in January[0] and I have eagerly been awaiting its completion. I thought it made it into 1.13 but I was wrong. This is actually the ONE issue that is forcing me to use MongoDB's aggregation framework over RethinkDB.

The lack of this feature severely limits RethinkDB from doing any real analytics aggregations right now (so often the count of distinct values of an event is as important as the count of the event itself - e.g. unique users/visitors to an application).

Please reconsider!

[0] - https://groups.google.com/forum/#!searchin/rethinkdb/distinct/rethinkdb/vGqLvYtWtkE/OJXALpPLhvYJ

neumino · 2014-06-29T19:38:50Z

@brettgriffin -- for a work around now, you can do:

r.table("users").map(function(user) {
    return r.object(user("name"), true) // return { <name> : true}
}).reduce(function(left, right) {
    return left.merge(right)
})

And it will return an object where the keys are all the distinct names. You can chain with keys() if you want an array (but make sure that you don't have more than 100,000 distinct keys).
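
For readers who want to see what the map/reduce steps compute without a running server, the following is a rough plain-JavaScript emulation. It does not use the RethinkDB driver; the sample rows are hypothetical, and object spread stands in for ReQL's `merge`, which behaves the same way for these flat single-key objects.

```javascript
// Hypothetical sample rows standing in for the "users" table.
const users = [
  { id: 1, name: "alice" },
  { id: 2, name: "bob" },
  { id: 3, name: "alice" },
];

// map step: each row becomes { <name>: true }
const mapped = users.map((user) => ({ [user.name]: true }));

// reduce step: merge the single-key objects pairwise; duplicate names
// collapse because they land on the same key
const merged = mapped.reduce((left, right) => ({ ...left, ...right }));

// keys of the merged object are the distinct names
const distinctNames = Object.keys(merged);

console.log(distinctNames);        // [ 'alice', 'bob' ]
console.log(distinctNames.length); // 2
```

Note that in this JavaScript version `reduce` is called without an initial value, so it throws on an empty array; pass `{}` as a second argument if the input may be empty.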

bgriffinbl · 2014-06-30T18:23:10Z

That's a nice workaround. It still requires returning the object and counting the keys in node, but at least it addresses the issue. I suppose I can live with this until 1.14-polish. Thanks.

mlucy · 2014-06-30T20:29:16Z

For the record, distinct with an index will be better than that -- the object solution still requires space proportional to the number of distinct elements (rather than the total number of elements in the stream), while indexed distinct doesn't.

(Also, this would be a good second-tier project for Graham when he shows up, because it isn't too hard but doing it right touches the aggregator code, the unsharding code, and the datum stream code.)

mlucy · 2014-06-30T20:30:56Z

(Also, I'm sorry we haven't gotten this done yet, @bgriffinbl!)

coffeemug · 2014-07-02T18:36:31Z

@bgriffinbl -- I'm convinced, moving this back to 1.14. We have some more development resources now and I think we can make this happen in the next release. Sorry you had to wait for so long, and thank you for advocating so patiently for this issue!

mlucy · 2014-07-09T02:46:09Z

This is in next, CR 1729.

larkost · 2014-08-27T18:29:24Z

@bgriffinbl: This has shipped with version 1.14.0

This was referenced Jan 14, 2014

Support distinct using an index #1354

Closed

Make distinct work well when you have large input and small output. #1135

Closed

coffeemug modified the milestone: subsequent Mar 26, 2014

coffeemug removed the tp:lts label Mar 26, 2014

coffeemug modified the milestones: 1.14, subsequent Jun 12, 2014

coffeemug assigned mlucy Jun 12, 2014

mlucy added the tp:API_settled label Jun 20, 2014

neumino mentioned this issue Jun 23, 2014

getAll without any arguments: RqlRuntimeError: Expected 2 or more argument(s) but found 1 #2586

Closed

coffeemug modified the milestones: 1.14-polish, 1.14 Jun 24, 2014

coffeemug modified the milestones: 1.14, 1.14-polish Jul 2, 2014

coffeemug unassigned mlucy Jul 2, 2014

mlucy self-assigned this Jul 3, 2014

mlucy mentioned this issue Jul 9, 2014

We send timestamps across the network in batchspec_ts. #2671

Closed

mlucy closed this as completed Jul 9, 2014

neumino mentioned this issue Aug 19, 2014

Update docs for distinct rethinkdb/docs#468

Closed
