Proposal for random sampling #861

Closed
jdoliner opened this issue May 21, 2013 · 42 comments

@jdoliner
Contributor

No description provided.

@jdoliner
Contributor Author

Whoops, hit enter by accident. Actual proposal coming momentarily.

@jdoliner
Contributor Author

I propose the following function be added to ReQL to allow people to do a bit of random sampling:

sample will be a function present on all sequences that takes a single number (n) as its argument. It returns a sequence of length n containing elements chosen from the input with a uniform random distribution. If the sequence passed in has fewer than n elements, we throw an error.

Example:

>>> r.expr([1,2,3]).sample(2).run()
[2,3] 

@jdoliner
Contributor Author

This is implemented in the branch random_term if you want to play around with it.

@neumino
Member

neumino commented May 21, 2013

What's the complexity behind .sample()?

@jdoliner
Contributor Author

O(n) where n is the number of elements in the sequence.

@mlucy
Member

mlucy commented May 22, 2013

I like this API.

@srh
Contributor

srh commented May 22, 2013

You mean it samples n elements without replacement from the sequence.

@coffeemug
Contributor

I think the API is great. A couple of questions/notes:

  1. What happens if I do r.expr([1, 2]).sample(3)?
  2. We should test .sample(-1) and such if we don't already (I assume this returns an error?).
  3. An O(k log N) complexity (where k is the number of shards) is immensely more useful than O(N). I know it requires adding a new ICL-level command and doing some B-tree magic, but I think it could be done in a couple of days, and given the huge increase in usefulness the engineering time is justifiable. (Especially since a log-time version of sample is clearly advertisable, whereas a linear-time version is a lot less clearly so.) @pixelspark, @namtih58, @spiffytech -- how important do you think it is for this command to be logarithmic with respect to document count, as opposed to linear?

@spiffytech

My own use cases wouldn't normally see a great difference in performance between the two, but given RethinkDB is deliberately attracting users with enormous datasets, I can easily believe it's worthwhile to make it more performant up-front.

@spiffytech

Regarding the case of r.expr([1, 2]).sample(3), my suggestion is to either add a second parameter to sample(), bool "upTo" or some such, to return as many records as possible instead of throwing an error, or to simply make that the default behavior, and let the user check the number of records returned in their program if they care.

I'm unsure which is preferable, but my gut feeling is the latter: I'm having trouble imagining cases where one needs exactly N random records for N > 1, but I can easily imagine, e.g., "show five random recommendations for the user, or as many as we have".

@neumino
Member

neumino commented May 22, 2013

If we have an efficient .sample() method, we could write awesome map reduce examples.

@jdoliner
Contributor Author

So @mlucy and I decided we like throwing if there weren't enough elements, mostly because that's what Python does. I'm not strongly attached to this; our logic was that people might write code which assumes the result of sample(n) always has n elements.

sample(-1) returns an error.

I'm unclear on where the estimate of a couple of days comes from for O(k log N); I seriously doubt that's possible. For one thing, there are many viable use cases where you just can't do it, such as sampling after a filter or a join. But more importantly, I actually doubt it's possible to do at all with our current architecture. The @srh trick of randomly choosing an offset for each level of the B-tree will have some really bad run times on our B-tree, and I don't know of an alternative algorithm.
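
For context, a rough Python sketch (toy node layout, not RethinkDB's actual B-tree code) of the random-offset-per-level descent being discussed: pick a uniformly random child at each level, then a uniformly random key at the leaf. Each probe costs O(depth), but a naive descent like this only gives a uniform distribution over keys if the tree is balanced with equal fanout and equally full leaves.

import random

class BTreeNode:
    # Hypothetical in-memory node: leaves hold keys, internal nodes hold children.
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []

def random_descent_sample(root):
    # Walk down, choosing a uniformly random child at each level, then return
    # a uniformly random key from the leaf we land in.
    node = root
    while node.children:
        node = random.choice(node.children)
    return random.choice(node.keys)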

@coffeemug
Contributor

After talking to @jdoliner about this in person it appears that a logarithmic solution is indeed very hard (because erase-range leaves empty leaf nodes, which makes a simple tree-walk algorithm impossible). Let's ship with O(n) and take it from there.

@spiffytech

If .sample() will error out on fewer than N elements, is there a way to sample "as many below N as you can"? All I can think of is first running a query to count how many elements the query would return, then sampling that number if it's less than N, but that's a pretty bad solution.

@jdoliner
Contributor Author

So if we have default then you'd write:

table.sample(n).default(table)

since if you have fewer than n elements you just want to sample all of them, so it's not that bad. That being said, we could also add a flag. I was pretty iffy on whether I preferred to error or return fewer elements, but for some reason I'm switching to returning fewer elements by default and having a flag for strictness, i.e.:

>>> r.expr([1,2,3]).sample(4).run()
[1,2,3]
>>> r.expr([1,2,3]).sample(4, strict=True).run()
Error: not enough elements to sample.

@mlucy
Member

mlucy commented May 22, 2013

I think either strict or non-strict by default is fine.

@srh
Contributor

srh commented May 22, 2013

What about sampling with replacement? This is something people want to do sometimes. It certainly makes the math easier (doing stats) in a lot of situations.

@jdoliner
Contributor Author

I'm totally on board with doing sampling with replacement; a reservoir sampling method with replacement is a bit tougher, though, from what I'm reading. Would you be opposed to shipping without it and then adding it later?

@spiffytech

I presume if I called .sample() with no parameters it would return all elements in the result set in random order?

@mlucy
Member

mlucy commented May 22, 2013

I wouldn't be opposed to adding an option to sample with replacement.

@coffeemug
Contributor

A few thoughts:

  1. We've got lots of other things to do for this sprint, so I'd leave replacement until later. Let's ship this as is, and deal with replacement when someone demands it.
  2. I think throwing is fine for now, let's not overcomplicate things by adding a strict argument (if we did, man would that be a can of worms). EDIT: I'd be cool with non-throwing one too. Just not a flag.
  3. .sample() with no parameters that just returns things in random order would be nice, if we can get that done easily (it seems like it would be easy since it's basically table.sample(table.count())).

@jdoliner
Contributor Author

@coffeemug, so unfortunately .sample() isn't very easy right now. We use reservoir sampling for random selections, which means there are no guarantees that the order of the elements is random. In practice the results approach a random ordering as the ratio n / table.count() approaches 0. However, in the case of .sample(), n == table.count() and the order won't be randomized at all. table.sample().run() will just be the same as table.run(), except that it will be prohibitively inefficient because it requires storing the contents in memory.

It actually might be worth mentioning in the documentation that this ordering can't be counted on to be random. In particular, it has the following property:

If the value at index i, where i < n, is selected by sample, then it will be located at index i in the result. Otherwise it will be in a uniformly random location.
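
For reference, a minimal Python sketch of reservoir sampling without replacement (textbook Algorithm R, not the actual server code) that exhibits this property: the first n elements land at their own indices, and later elements only overwrite uniformly random slots.

import random

def reservoir_sample(stream, n):
    # Keep a reservoir of n elements chosen uniformly at random from the stream.
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            reservoir.append(item)     # element i is stored at index i
        else:
            j = random.randint(0, i)   # inclusive; item survives with probability n / (i + 1)
            if j < n:
                reservoir[j] = item    # overwrite a uniformly random slot
    return reservoir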

@mlucy
Member

mlucy commented May 23, 2013

We can solve that pretty easily by just shuffling the result when we're done.

@coffeemug
Contributor

We can solve that pretty easily by just shuffling the result when we're done.

👍

@spiffytech

That would be sweet, since I don't think another plan for ordering results randomly is on the roadmap.

@mlucy
Member

mlucy commented May 27, 2013

Here's the final API that I think we settled on:

> r.random
0.234890128
> r.random(5)
2 # in [0, 5)
> r.random(20, 30)
24 # in [20, 30)

# Sampling without replacement, operates on any sequence.  Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled
> r([1, 2, 3]).sample # this is just a shuffle
[2, 1, 3] # shuffled

@jdoliner
Contributor Author

So I'm a bit confused about this final API. Unless I'm mistaken, there's no mention anywhere in this issue of adding a random term, and I was hoping to ignore that to have less to implement here.

Also, it seems like we were on the fence about throwing vs. returning fewer elements, but the last thing @coffeemug said was let's just throw.

Finally, I really think we should ignore sample() for now. As I said previously, it's going to very easily run you out of memory because it requires storing everything in memory. We can fix this by adding memory caps, but I think we can reasonably ignore that for now if we only allow capped sampling, which I'd like to do to keep things simple.

@mlucy
Member

mlucy commented May 28, 2013

I suppose I had associated this with #865 in my head. I think we need a random term at some point, but if we don't want it for 1.6 I'm cool with that.

Regarding throwing, here's what people said:

Slava:

I think throwing is fine for now, let's not overcomplicate things by adding a strict argument (if we did, man would that be a can of worms). EDIT: I'd be cool with non-throwing one too. Just not a flag.

@spiffytech seemed to want non-throwing.

After reading what @spiffytech said, I think that if we aren't providing a flag then non-throwing might be better. If we don't throw by default and you want to throw, it's very easy to branch on the size of the resulting array and throw. In contrast, if we throw by default and you don't want to throw, then you're in a lot of trouble if you're trying to sample a potentially-large stream. (You can't just count the stream and branch on that because you can only evaluate a stream once, so after you've counted it you can't decide to sample it.)
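
As a sketch of the non-throwing case (hypothetical usage; "table" and the error handling are just for illustration, not a final API), the strict behavior is a one-line check on the client:

rows = table.sample(5).run()   # may return fewer than 5 rows
if len(rows) < 5:
    raise ValueError("expected at least 5 rows to sample")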

I would be cool with not implementing .sample with no arguments. In that case we probably don't need to do shuffling either.

@mlucy
Member

mlucy commented May 28, 2013

I just talked to Slava, and he thinks shuffling is still a good idea for sample even if we don't allow a 0-arity version. I can imagine people assuming this behavior (which they will basically get for large streams) and then getting funky results on small streams, so I think we should do that.

A revised summary:

# Sampling without replacement, operates on any sequence.  Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled

@coffeemug
Contributor

👍 for @mlucy's proposal above. Also, I agree that zero-arity version of sample isn't necessary for now, and random isn't necessary for now.

@srh
Contributor

srh commented May 28, 2013

Why do you care about shuffling the order of the things being sampled? You're sampling without replacement. There's no reason for the order to be shuffled.

@coffeemug
Contributor

If I want a list of N rows in a random order, we have to shuffle (because as N approaches the size of the table, the current algorithm doesn't automatically give a random order).

@srh
Contributor

srh commented May 28, 2013

If you have an efficient count() then you don't have to shuffle. We are getting that someday, right? We shouldn't define the output to be in any particular (shuffled) order.

@spiffytech

It doesn't have to come from .sample, but I think RethinkDB should have some way to retrieve randomly-ordered results, and it doesn't look like there's an alternative on the roadmap.

@jdoliner
Contributor Author

@spiffytech it's definitely a feature we'd like to add. Unfortunately, it's actually very hard for us to implement without storing the data in memory, and if you have to store all of the data in memory then it's hardly useful, because it will either get you killed by the OOM killer or will have a prohibitively low cap. We can get around this; it's just a big undertaking, and while we do like the feature, right now the cost/benefit isn't such that we can put it anywhere on the roadmap but "future" -- at least I don't think we can. Sorry I can't give you a more satisfying answer here; we will implement this eventually, I promise.

@coffeemug
Contributor

I still think .sample(n) should automagically shuffle the output. It essentially gives a way to get N random rows and is very easy to do.

@mlucy
Member

mlucy commented May 29, 2013

I concur on the shuffling.

@namtih58

Would working with ids until the algorithm is done sampling/shuffling, and only then fetching the relevant data, help with the memory consumption issue?

@namtih58

The way I got a quick implementation working in Mongo was to pick a random range twice the size of the sample I wanted, pull only the ids, shuffle the ids, pick a random range of ids that was the size I wanted, and use those ids to query the documents I needed. I don't know the internals of RethinkDB, but that was a quick implementation of Document.sample(n).
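
Roughly what's described above, as a Python sketch (the database/collection names and the use of a current pymongo API are assumptions for illustration, not part of the original setup):

import random
from pymongo import MongoClient

def sample_documents(coll, n):
    # Pull a window of ids twice the sample size, randomly keep n of them,
    # then fetch only those documents.
    total = coll.count_documents({})
    offset = random.randint(0, max(0, total - 2 * n))
    ids = [doc["_id"] for doc in coll.find({}, {"_id": 1}).skip(offset).limit(2 * n)]
    chosen = random.sample(ids, min(n, len(ids)))
    return list(coll.find({"_id": {"$in": chosen}}))

docs = sample_documents(MongoClient().mydb.mycollection, 5)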

@jdoliner
Contributor Author

jdoliner commented Jun 3, 2013

CR 582 by @mlucy. In as of commit 8fd9113. Ships 1.6.

@jdoliner closed this as completed Jun 3, 2013
@timmaxw
Member

timmaxw commented Jun 16, 2013

Random sampling with replacement from a stream shouldn't be too hard to implement. Here's some example code, based on this paper: https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

// Reservoir sampling with replacement: each of the n_samples slots is an
// independent size-1 reservoir over the stream.
std::vector<Thing> take_sample_with_replacement(int n_samples, ThingStream *stream) {
    std::vector<Thing> results(n_samples);
    int counter = 0;  // number of elements read off the stream so far
    while (!stream->is_done()) {
        Thing current_thing = stream->next();
        counter++;
        for (int i = 0; i < n_samples; i++) {
            // Replace slot i with the current element with probability 1/counter.
            if (rand() % counter == 0) {
                results[i] = current_thing;
            }
        }
    }
    return results;
}

Basically, for each thing in the stream, for each slot in the reservoir, you replace whatever's currently in that slot with the new thing with probability 1/N, where N is the number of things read off the stream by that point, including the current thing.

Edit: Oops, I didn't read the thread very carefully. This isn't actually relevant right now.

@jdoliner
Contributor Author

Yeah, it's actually easy enough to do, but still a tad of work. We can schedule it if someone requests it.
