Proposal for random sampling #861
Whoops hit enter by accident. Actual proposal coming momentarily. |
I propose the following function, sample, be added to ReQL to allow people to do a bit of random sampling:

Example:

>>> r.expr([1,2,3]).sample(2).run()
[2,3] |
This is implemented in the branch random_term if you want to play around with it. |
What's the complexity behind .sample()? |
I like this API. |
You mean it samples n elements without replacement from the sequence? |
I think the API is great. A couple of questions/notes:
|
My own use cases wouldn't normally see a great difference in performance between the two, but given RethinkDB is deliberately attracting users with enormous datasets, I can easily believe it's worthwhile to make it more performant up-front. |
Regarding the case of requesting more elements than the sequence contains, I'm unsure which is preferable, but my gut feeling is the latter, as I'm having trouble imagining cases where one needs exactly N random records, for N > 1, but I can easily imagine, e.g., "show five random recommendations for the user, or as many as we have". |
If we have an efficient |
So @mlucy and I decided we like throwing if there weren't enough elements, mostly because that's what Python does. I'm not strongly attached to this; our logic was that people might write code which assumes the result of .sample(n) always has n elements.
I'm unclear on where the estimate of a couple of days comes from for a logarithmic implementation. |
After talking to @jdoliner about this in person it appears that a logarithmic solution is indeed very hard (because erase-range leaves empty leaf nodes, which makes a simple tree-walk algorithm impossible). Let's ship with O(n) and take it from there. |
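For reference, here is a minimal Python sketch of the standard one-pass O(n) technique (Algorithm R, reservoir sampling without replacement). This is an assumption about the general approach, not necessarily what the random_term branch actually does:

import random

def reservoir_sample(stream, n):
    # One-pass O(n) sampling of up to n items without replacement (Algorithm R).
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            reservoir.append(item)      # fill the reservoir with the first n items
        else:
            j = random.randint(0, i)    # uniform in [0, i], i.e. i+1 possibilities
            if j < n:
                reservoir[j] = item     # keep item i with probability n/(i+1)
    return reservoir

Each item survives to the end with probability n/total, and only n items are ever held in memory, which is why a single O(n) pass suffices even over a large stream.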
If .sample() will error out on fewer than N elements, is there a way to sample "as many below N as you can"? All I can think of is first running a query to count how many elements the query would return, then sampling that number if it's less than N, but that's a pretty bad solution. |
So if we have a strict flag, the default can return as many elements as there are, since if you have fewer than n elements:

>>> r.expr([1,2,3]).sample(4).run()
[1,2,3]
>>> r.expr([1,2,3]).sample(4, strict=True).run()
Error: not enough elements to sample. |
I think either strict or non-strict by default is fine. |
What about sampling with replacement? This is something people want to do sometimes. It certainly makes the math easier (doing stats) in a lot of situations. |
I'm totally on board with doing sampling with replacement; a reservoir sampling method with replacement is a bit tougher, though, from what I'm reading. Would you be opposed to shipping without it and then adding it later? |
I presume if I called .sample() with no parameters it would return all elements in the result set in random order? |
I wouldn't be opposed to adding an option to sample with replacement. |
A few thoughts:
|
@coffeemug so unfortunately it isn't. It actually might be worth mentioning in the documentation that this ordering can't be counted on to be random. In particular, it has the following property: if a value of index |
We can solve that pretty easily by just shuffling the result when we're done. |
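In terms of the reservoir_sample sketch above, the fix might be as simple as the following (an illustration, not the actual server code):

import random

def shuffled_sample(stream, n):
    result = reservoir_sample(stream, n)  # sketch above; slot order is biased
    random.shuffle(result)                # a final Fisher-Yates pass removes the bias
    return result

Since the reservoir has at most n elements, the extra shuffle is O(n) and doesn't change the overall complexity.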
👍 |
That would be sweet, since I don't think another plan for ordering results randomly is on the roadmap. |
Here's the final API that I think we settled on:

> r.random
0.234890128
> r.random(5)
2 # in [0, 5)
> r.random(20, 30)
24 # in [20, 30)
# Sampling without replacement, operates on any sequence. Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled
> r([1, 2, 3]).sample # this is just a shuffle
[2, 1, 3] # shuffled |
So I'm a bit confused about this final API: unless I'm mistaken, there's no random term implemented. Also, it seems like we were on the fence about throwing vs. returning fewer elements. Finally, I really think we should ignore |
I suppose I had associated this with #865 in my head. I think we need a random term at some point, but if we don't want it for 1.6 I'm cool with that. Regarding throwing, here's what people said: Slava:
@spiffytech seemed to want non-throwing. After reading what @spiffytech said, I think that if we aren't providing a flag then non-throwing might be better. If we don't throw by default and you want to throw, it's very easy to branch on the size of the resulting array and throw. In contrast, if we throw by default and you don't want to throw, then you're in a lot of trouble if you're trying to sample a potentially-large stream. (You can't just count the stream and branch on that because you can only evaluate a stream once, so after you've counted it you can't decide to sample it.) I would be cool with not implementing |
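For what it's worth, branching client-side really is trivial. A sketch in the Python driver style used in the examples above (sample_strict is a hypothetical helper, and sample's non-throwing semantics are assumed):

def sample_strict(seq, n):
    result = r.expr(seq).sample(n).run()  # non-throwing: may return fewer than n
    if len(result) < n:
        raise ValueError("not enough elements to sample")
    return result

The reverse direction (recovering from a server-side throw while sampling a stream) has no equivalent one-liner, which is the asymmetry the comment above describes.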
I just talked to Slava, and he thinks shuffling is still a good idea for sample. A revised summary:

# Sampling without replacement, operates on any sequence. Returns an array if passed an array, a stream if passed a stream, and a selection if passed a selection.
> r([1,2,3]).sample(2)
[3, 1] # shuffled
> r([1, 2, 3]).sample(4)
[1, 3, 2] # shuffled |
👍 for @mlucy's proposal above. Also, I agree that the zero-arity version of |
Why do you care about shuffling the order of things being sampled? You're sampling without replacement. There's no reason for the order to be shuffled. |
If I want a list of N rows in a random order, we have to shuffle (because as N grows large, the current algorithm doesn't automatically offer good randomness). |
If you have an efficient count() then you don't have to shuffle. We are getting that someday, right? We shouldn't define the output to be in any particular (shuffled) order. |
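As a hedged illustration of the count()-based alternative (assuming an efficient count() and positional nth access; not a committed design):

import random

def sample_via_count(table, n):
    total = table.count().run()                           # assumed to be efficient
    offsets = random.sample(range(total), min(n, total))  # distinct offsets, already in random order
    return [table.nth(i).run() for i in offsets]

Because random.sample returns its picks in random selection order, no separate shuffle pass is needed, which is the point being made above.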
It doesn't have to come from .sample, but I think RethinkDB should have some way of returning results in random order. |
@spiffytech it's definitely a feature we'd like to add. Unfortunately, it's actually very hard for us to implement without storing the data in memory, and if you have to store all of the data in memory then it's hardly useful: it will either get you killed by the OOM killer or will have a prohibitively low cap. We can get around this, but it's a big undertaking, and while we do like the feature, right now the cost/benefit isn't such that we can put it anywhere on the roadmap but "future"; at least I don't think we can. Sorry I can't give you a more satisfying answer here; we will implement this eventually, I promise. |
I still think |
I concur on the shuffling. |
Would working with ids until the algorithm is done sampling/shuffling, and only then fetching the relevant data, help with the memory consumption issue? |
The way I got a quick impl working in Mongo was to pick a random range twice the size of the sample I wanted, pull only the ids, shuffle the ids, pick a random subset of the ids of the size I wanted, and use those ids to query the documents I needed. I don't know the internals of RethinkDB, but that was a quick implementation of Document.sample(n). |
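A simplified sketch of that id-first approach, assuming a ReQL-style table with pluck and get (for brevity it pulls all the ids rather than a 2x random range, which the description above avoids):

import random

def sample_via_ids(table, n):
    # Pull only the ids: a much smaller memory footprint than full documents.
    ids = [row["id"] for row in table.pluck("id").run()]
    chosen = random.sample(ids, min(n, len(ids)))   # distinct ids, in random order
    # Fetch only the chosen documents.
    return [table.get(doc_id).run() for doc_id in chosen]

The memory saving is real only while the id set itself fits comfortably in memory, which is presumably why the original version restricted itself to a bounded random range first.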
Random sampling with replacement from a stream shouldn't be too hard to implement. Here's some example code, based on this paper:

#include <cstdlib>
#include <vector>

std::vector<Thing> take_sample_with_replacement(int n_samples, ThingStream *stream) {
    std::vector<Thing> results(n_samples);
    int counter = 0;  // number of things read off the stream so far
    while (!stream->is_done()) {
        Thing current_thing = stream->next();
        counter++;
        // Each reservoir slot independently adopts the new thing with
        // probability 1/counter, so every slot ends up holding an
        // independent uniform sample of the stream.
        for (int i = 0; i < n_samples; i++) {
            if (rand() % counter == 0) {
                results[i] = current_thing;
            }
        }
    }
    return results;
}

Basically, for each thing in the stream, for each slot in the reservoir, you replace whatever's currently in that slot with the new thing with probability 1/N, where N is the number of things read off the stream by that point, including the current thing.

Edit: Oops, I didn't read the thread very carefully. This isn't actually relevant right now. |
Yeah, it's actually easy enough to do, but still a tad of work. We can |