Proposal for random sampling #861

Closed
jdoliner opened this issue May 21, 2013 · 42 comments

@jdoliner
Contributor

No description provided.

@jdoliner
Contributor Author

Whoops, hit enter by accident. Actual proposal coming momentarily.

@jdoliner
Contributor Author

I propose the following function be added to ReQL to allow people to do a bit of random sampling:

sample will be a function present on all sequences that takes a single number (n) as its argument. It returns a sequence of length n containing elements chosen from the input with a uniform random distribution. If the sequence passed in has fewer than n elements, we throw an error.

Example:

>>> r.expr([1,2,3]).sample(2).run()
[2,3] 

@jdoliner
Contributor Author

This is implemented in the branch random_term if you want to play around with it.

@neumino
Member

neumino commented May 21, 2013

What's the complexity behind .sample()?

@jdoliner
Contributor Author

O(n) where n is the number of elements in the sequence.

@mlucy
Member

mlucy commented May 22, 2013

I like this API.

@srh
Contributor

srh commented May 22, 2013

You mean it samples n elements without replacement from the sequence.

@coffeemug
Contributor

I think the API is great. A couple of questions/notes:

  1. What happens if I do r.expr([1, 2]).sample(3)?
  2. We should test .sample(-1) and such if we don't already (I assume this returns an error?).
  3. An O(k log N) complexity (where k is the number of shards) is immensely more useful than O(N). I know it requires adding a new ICL-level command and doing some B-tree magic, but I think it could be done in a couple of days, and given the huge increase in usefulness the engineering time is justifiable. (Especially since a log-time version of sample is clearly advertisable, whereas a linear-time version is a lot less clearly so.) @pixelspark, @namtih58, @spiffytech -- how important do you think it is for this command to be logarithmic with respect to document count, as opposed to linear?

@spiffytech

My own use cases wouldn't normally see a great difference in performance between the two, but given RethinkDB is deliberately attracting users with enormous datasets, I can easily believe it's worthwhile to make it more performant up-front.

@spiffytech

Regarding the case of r.expr([1, 2]).sample(3), my suggestion is to either add a second parameter to sample(), bool "upTo" or some such, to return as many records as possible instead of throwing an error, or to simply make that the default behavior, and let the user check the number of records returned in their program if they care.

I'm unsure which is preferable, but my gut feeling is the latter: I'm having trouble imagining cases where one needs exactly N random records for N > 1, but I can easily imagine, e.g., "show five random recommendations for the user, or as many as we have".

@neumino
Member

neumino commented May 22, 2013

If we have an efficient .sample() method, we could write awesome map reduce examples.

@jdoliner
Contributor Author

So @mlucy and I decided we like throwing if there weren't enough elements, mostly because that's what Python does. I'm not strongly attached to this; our logic was that people might write code which assumes the result of sample(n) always has n elements.

sample(-1) returns an error.

I'm unclear on where the estimate of a couple of days comes from for O(k log N); I seriously doubt that's possible. For one thing, there are many viable use cases where you just can't do it, such as sampling after a filter or a join. But more importantly, I actually doubt it's possible to do at all with our current architecture. The @srh trick of randomly choosing an offset for each level of the B-tree will have some really bad run times on our B-tree, and I don't know of an alternative algorithm.
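
For context, a rough Python sketch (toy node layout, not RethinkDB's actual B-tree code) of the random-offset-per-level descent being discussed: pick a uniformly random child at each level, then a uniformly random key at the leaf. Each probe costs O(depth), but a naive descent like this only gives a uniform distribution over keys if the tree is balanced with equal fanout and equally full leaves.

import random

class BTreeNode:
    # Hypothetical in-memory node: leaves hold keys, internal nodes hold children.
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []

def random_descent_sample(root):
    # Walk down, choosing a uniformly random child at each level, then return
    # a uniformly random key from the leaf we land in.
    node = root
    while node.children:
        node = random.choice(node.children)
    return random.choice(node.keys)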

@coffeemug
Contributor

After talking to @jdoliner about this in person it appears that a logarithmic solution is indeed very hard (because erase-range leaves empty leaf nodes, which makes a simple tree-walk algorithm impossible). Let's ship with O(n) and take it from there.

@spiffytech

If .sample() will error out on fewer than N elements, is there a way to sample "as many below N as you can"? All I can think of is first running a query to count how many elements the query would return, then sampling that number if it's less than N, but that's a pretty bad solution.

@jdoliner
Contributor Author

So if we have default then you'd write:

table.sample(n).default(table)

since if you have fewer than n elements you just want to sample all of them, so it's not that bad. That being said, we could also add a flag. I was pretty iffy on whether I preferred to error or return fewer elements, but for some reason I'm switching to returning fewer elements by default and having a flag for strictness, i.e.:

>>> r.expr([1,2,3]).sample(4).run()
[1,2,3]
>>> r.expr([1,2,3]).sample(4, strict=True).run()
Error: not enough elements to sample.

@mlucy
Member

mlucy commented May 22, 2013

I think either strict or non-strict by default is fine.

@srh
Contributor

srh commented May 22, 2013

What about sampling with replacement? This is something people want to do sometimes. It certainly makes the math easier (doing stats) in a lot of situations.

@jdoliner
Contributor Author

I'm totally on board with doing sampling with replacement; a reservoir sampling method with replacement is a bit tougher, though, from what I'm reading. Would you be opposed to shipping without it and then adding it later?

@spiffytech

I presume if I called .sample() with no parameters it would return all elements in the result set in random order?

@mlucy
Member

mlucy commented May 22, 2013

I wouldn't be opposed to adding an option to sample with replacement.

@coffeemug
Contributor

A few thoughts:

  1. We've got lots of other things to do for this sprint, so I'd leave replacement until later. Let's ship this as is, and deal with replacement when someone demands it.
  2. I think throwing is fine for now, let's not overcomplicate things by adding a strict argument (if we did, man would that be a can of worms). EDIT: I'd be cool with non-throwing one too. Just not a flag.
  3. .sample() with no parameters that just returns things in random order would be nice, if we can get that done easily (it seems like it would be easy since it's basically table.sample(table.count())).

@jdoliner
Contributor Author

@coffeemug, so unfortunately .sample() isn't very easy right now. We use reservoir sampling for random selections, which means there are no guarantees that the order of the elements is random. In practice the results approach a random ordering as the ratio n / table.count() approaches 0. However, in the case of .sample(), n == table.count() and the order won't be randomized at all. table.sample().run() will just be the same as table.run(), except that it will be prohibitively inefficient because it requires storing the contents in memory.

It actually might be worth mentioning in the documentation that this ordering can't be counted on to be random. In particular, it has the following property:

If the value at index i, where i < n, is selected by sample, then it will be located at index i in the result. Otherwise it will be in a uniformly random location.
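
For reference, a minimal Python sketch of reservoir sampling without replacement (textbook Algorithm R, not the actual server code) that exhibits this property: the first n elements land at their own indices, and later elements only overwrite uniformly random slots.

import random

def reservoir_sample(stream, n):
    # Keep a reservoir of n elements chosen uniformly at random from the stream.
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            reservoir.append(item)     # element i is stored at index i
        else:
            j = random.randint(0, i)   # inclusive; item survives with probability n / (i + 1)
            if j < n:
                reservoir[j] = item    # overwrite a uniformly random slot
    return reservoir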

@mlucy
Member

mlucy commented May 23, 2013

We can solve that pretty easily by just shuffling the result when we're done.

@coffeemug
Contributor

We can solve that pretty easily by just shuffling the result when we're done.

👍

@spiffytech

That would be sweet, since I don't think another plan for ordering results randomly is on the roadmap.

@mlucy
Member

mlucy commented May 27, 2013

Here's the final API that I think we settled on:

> r.random
0.234890128
> r.random(5)
2 # in [0, 5)
> r.random(20, 30)
24 # in [20, 30)

# Sampling without replacement, operates on any sequence.  Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled
> r([1, 2, 3]).sample # this is just a shuffle
[2, 1, 3] # shuffled

@jdoliner
Contributor Author

So I'm a bit confused about this final API. Unless I'm mistaken, there's no mention anywhere in this issue of adding a random term, and I was hoping to ignore that to have less to implement here.

Also, it seems like we were on the fence about throwing vs. returning fewer elements, but the last thing @coffeemug said was let's just throw.

Finally, I really think we should ignore sample() for now. As I said previously, it's going to very easily run you out of memory because it requires storing everything in memory. We can fix this by adding memory caps, but I think we can reasonably ignore that for now if we only allow capped sampling, which I'd like to do to keep things simple.

@mlucy
Member

mlucy commented May 28, 2013

I suppose I had associated this with #865 in my head. I think we need a random term at some point, but if we don't want it for 1.6 I'm cool with that.

Regarding throwing, here's what people said:

Slava:

I think throwing is fine for now, let's not overcomplicate things by adding a strict argument (if we did, man would that be a can of worms). EDIT: I'd be cool with non-throwing one too. Just not a flag.

@spiffytech seemed to want non-throwing.

After reading what @spiffytech said, I think that if we aren't providing a flag then non-throwing might be better. If we don't throw by default and you want to throw, it's very easy to branch on the size of the resulting array and throw. In contrast, if we throw by default and you don't want to throw, then you're in a lot of trouble if you're trying to sample a potentially-large stream. (You can't just count the stream and branch on that because you can only evaluate a stream once, so after you've counted it you can't decide to sample it.)
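
As a sketch of the non-throwing case (hypothetical usage; "table" and the error handling are just for illustration, not a final API), the strict behavior is a one-line check on the client:

rows = table.sample(5).run()   # may return fewer than 5 rows
if len(rows) < 5:
    raise ValueError("expected at least 5 rows to sample")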

I would be cool with not implementing .sample with no arguments. In that case we probably don't need to do shuffling either.

@mlucy
Member

mlucy commented May 28, 2013

I just talked to Slava, and he thinks shuffling is still a good idea for sample even if we don't allow a 0-arity version. I can imagine people assuming this behavior (which they will basically get for large streams) and then getting funky results on small streams, so I think we should do that.

A revised summary:

# Sampling without replacement, operates on any sequence.  Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled

@coffeemug
Contributor

👍 for @mlucy's proposal above. Also, I agree that zero-arity version of sample isn't necessary for now, and random isn't necessary for now.

@srh
Contributor

srh commented May 28, 2013

Why do you care about shuffling the order of the things being sampled? You're sampling without replacement. There's no reason for the order to be shuffled.

@coffeemug
Contributor

If I want a list of N rows in a random order, we have to shuffle (because as N approaches the size of the table, the current algorithm doesn't automatically give a random order).

@srh
Contributor

srh commented May 28, 2013

If you have an efficient count() then you don't have to shuffle. We are getting that someday, right? We shouldn't define the output to be in any particular (shuffled) order.

@spiffytech

It doesn't have to come from .sample, but I think RethinkDB should have some way to retrieve randomly-ordered results, and it doesn't look like there's an alternative on the roadmap.

@jdoliner
Contributor Author

@spiffytech it's definitely a feature we'd like to add. Unfortunately, it's actually very hard for us to implement without storing the data in memory, and if you have to store all of the data in memory then it's hardly useful, because it will either get you killed by the OOM killer or will have a prohibitively low cap. We can get around this; it's just a big undertaking, and while we do like the feature, right now the cost/benefit isn't such that we can put it anywhere on the roadmap but "future" -- at least I don't think we can. Sorry I can't give you a more satisfying answer here; we will implement this eventually, I promise.

@coffeemug
Contributor

I still think .sample(n) should automagically shuffle the output. It essentially gives a way to get N random rows and is very easy to do.

@mlucy
Member

mlucy commented May 29, 2013

I concur on the shuffling.

@namtih58

Would working with ids until the algorithm is done sampling/shuffling, and only then fetching the relevant data, help with the memory consumption issue?

@namtih58

The way I got a quick implementation working in Mongo was to pick a random range twice the size of the sample I wanted, pull only the ids, shuffle the ids, pick a random range of ids that was the size I wanted, and use those ids to query the documents I needed. I don't know the internals of RethinkDB, but that was a quick implementation of Document.sample(n).
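
Roughly what's described above, as a Python sketch (the database/collection names and the use of a current pymongo API are assumptions for illustration, not part of the original setup):

import random
from pymongo import MongoClient

def sample_documents(coll, n):
    # Pull a window of ids twice the sample size, randomly keep n of them,
    # then fetch only those documents.
    total = coll.count_documents({})
    offset = random.randint(0, max(0, total - 2 * n))
    ids = [doc["_id"] for doc in coll.find({}, {"_id": 1}).skip(offset).limit(2 * n)]
    chosen = random.sample(ids, min(n, len(ids)))
    return list(coll.find({"_id": {"$in": chosen}}))

docs = sample_documents(MongoClient().mydb.mycollection, 5)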

@jdoliner
Contributor Author

jdoliner commented Jun 3, 2013

CR 582 by @mlucy. In as of commit 8fd9113. Ships 1.6.

@jdoliner closed this as completed Jun 3, 2013
@timmaxw
Member

timmaxw commented Jun 16, 2013

Random sampling with replacement from a stream shouldn't be too hard to implement. Here's some example code, based on this paper: https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

// Reservoir sampling with replacement: each of the n_samples slots is an
// independent size-1 reservoir over the stream.
std::vector<Thing> take_sample_with_replacement(int n_samples, ThingStream *stream) {
    std::vector<Thing> results(n_samples);
    int counter = 0;  // number of elements read off the stream so far
    while (!stream->is_done()) {
        Thing current_thing = stream->next();
        counter++;
        for (int i = 0; i < n_samples; i++) {
            // Replace slot i with the current element with probability 1/counter.
            if (rand() % counter == 0) {
                results[i] = current_thing;
            }
        }
    }
    return results;
}

Basically, for each thing in the stream, for each slot in the reservoir, you replace whatever's currently in that slot with the new thing with probability 1/N, where N is the number of things read off the stream by that point, including the current thing.

Edit: Oops, I didn't read the thread very carefully. This isn't actually relevant right now.

@jdoliner
Contributor Author

Yeah, it's actually easy enough to do, but still a tad of work. We can schedule it if someone requests it.
