Why is .sample() slow? #1520
Why does

$rethinkdb.table(:foobar).sample(1)[0]

run slow? Table :foobar only has 10,000 records/documents. Slow, as in a few milliseconds? Close to a second, actually. Normally, retrieving a single record from 10k should be instant. At least on other DB solutions.

Comments
I don't remember the details of the `.sample()` implementation. Sorry you're running into this -- we're exclusively working on performance now, and most issues should be ironed out in the next few months. Thank you so much for reporting the bugs and being patient!
There are a couple of possible levels of optimization for `.sample()`. The first optimization we can make is rewriting it in terms of map/reduce. This requires some anaphoric macro magic, but other than that it's not too hard. The next step would be optimizing how `.sample()` runs on tables. The last step would be a B-tree-aware optimization which could sample rows in logarithmic time. That one is quite challenging, so I don't think it's worth doing any time soon.
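For illustration, here is a minimal Python sketch of the map/reduce idea (this is not RethinkDB's actual implementation; the data shapes are assumptions): the map step wraps each document in a one-element reservoir, and the reduce step merges two reservoirs so that every underlying document remains equally likely to appear in the final sample.

```python
import random

def to_reservoir(doc):
    # Map step: each document becomes a trivial reservoir of size 1.
    return {"count": 1, "sample": [doc]}

def merge(a, b, k):
    # Reduce step: combine two reservoirs into a uniform sample of
    # size min(k, a.count + b.count) over the union of both populations.
    sa, sb = list(a["sample"]), list(b["sample"])
    ra, rb = a["count"], b["count"]  # documents not yet drawn on each side
    merged = []
    for _ in range(min(k, ra + rb)):
        # Draw from side a with probability proportional to how many of
        # its documents are still undrawn; this mimics sequential
        # sampling without replacement from the combined population.
        if random.random() < ra / (ra + rb):
            merged.append(sa.pop(random.randrange(len(sa))))
            ra -= 1
        else:
            merged.append(sb.pop(random.randrange(len(sb))))
            rb -= 1
    return {"count": a["count"] + b["count"], "sample": merged}
```

Since the merge produces the same distribution regardless of grouping order, the reduce can run hierarchically, which is what makes the map/reduce phrasing work.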
It looks like `.sample()` is especially slow on sharded tables. Here are some times reported by @ha1331: …
This was on a table with 4 shards. Here's a possible algorithm for pushing most of the work down to the shards: have each shard draw its own sample locally, and only merge the small per-shard samples on the coordinating server. As an additional optimization, we could avoid loading the data for rejected documents, like @jdoliner mentioned.
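One plausible shape for that push-down, as a Python sketch (the per-shard access is an assumption; the real work would happen in the server): each shard runs a single-pass reservoir scan and ships back only a count and at most k documents, and the coordinator combines those with the `merge` function from the sketch above.

```python
import random
from functools import reduce

def shard_reservoir(shard_docs, k):
    # Runs on each shard: classic single-pass reservoir sampling.
    # At most k documents are held at a time; in a real implementation,
    # rejected documents would ideally never be loaded from disk at all.
    sample, count = [], 0
    for doc in shard_docs:
        count += 1
        if len(sample) < k:
            sample.append(doc)
        else:
            j = random.randrange(count)
            if j < k:
                sample[j] = doc
    return {"count": count, "sample": sample}

def sample_table(shards, k):
    # Coordinator: the shards scan in parallel; we only merge their
    # small (count, sample) summaries using merge() from the earlier sketch.
    partials = [shard_reservoir(s, k) for s in shards]
    return reduce(lambda a, b: merge(a, b, k), partials)["sample"]
```

The coordinator then handles at most k documents per shard instead of streaming the entire table to one server, which is where the current implementation spends its time.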
@danielmewes FYI, the results are certainly less pronounced when running against a single shard:
Also note that …
Thanks for testing, @marshall007. That would fit with the theory that the cost comes from collecting all the data on one server first.
@danielmewes -- that is in fact how it's implemented, and as far as I can tell there's no reason not to use the algorithm you described instead. (The reason we implemented it the way we did is that we originally assumed users would mostly be sampling arrays rather than whole tables.)
As a bit more information, I'm running a single shard locally and …
Thanks for the data, @IanCal. That's really very slow. I doubt it's Python-related, since as far as the client is concerned, that query only returns a single document. Does your data fit into the cache?
@danielmewes Thanks, that would appear to be it. Setting the cache size to more than the size of the data on disk (so the interface now shows a cache usage of ~60%) brings the timings to: 0.6s for …, 1.8s for …. The query profiler shows extremely small timings for every sub-task apart from "Sampling elements.", which has a mean time of only 0.008ms but runs once per element, so it accounts for the majority of the time. If I get a chance I might see if I can compile RethinkDB myself, as the time is all being spent in a very small section of code that's pretty clear. Edit: on another machine (Ubuntu) I've built and run 2.1.1 with a few changes. Sampling a single item takes ~1.3s, which comes down to ~1.1s if I remove the profiling. Removing the core logic of swapping elements around only brings the time down by another 0.05s or so, which means the vast majority of the time is spent iterating over all the elements, with a substantial amount of time spent profiling each element.
Yeah, the profiler doesn't work too well for some queries. I wouldn't rely on the timings it gives you for this, to be honest. The remaining difference between …
I'm trying to run .sample(10) on a 2M document db and it takes 4min 34.88s server time! Why does it seem to take a sample for every document in the db???
Currently `.sample()` has to iterate over every document in the table, which is why it takes so long on a large database. Making this faster is on our list; see the discussion above for the planned improvements.
Cool, sounds good. For now, I will try to write my own sampler using an index. This will be a great core feature when it's efficient enough to use.
@danielmewes -- could we consider bumping this when possible since this is such a common thing people run into?
Yeah, we'll try to get this into 2.3.
There's actually another aspect to this that I hadn't thought of earlier. While we can implement the short-term improvement described in this issue to reduce the runtime by some factor, it will still be a linear-time operation. The short-term improvements are not too hard I think, but they still require a fair bit of work, which might be better spent on getting #3949 done more quickly, at which point we could implement a truly fast `.sample()`. I'll think about how to prioritize these...
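To make the connection concrete, here is a toy Python sketch of why subtree counts would enable logarithmic sampling (the `Node` class is a stand-in for the real B-tree; the assumption is that #3949 would maintain a per-node document count): walk from the root, choosing each branch with probability proportional to its count, so every document is equally likely.

```python
import random

class Node:
    def __init__(self, doc=None, left=None, right=None):
        self.doc, self.left, self.right = doc, left, right
        # Each node caches the number of documents in its subtree.
        self.size = ((1 if doc is not None else 0)
                     + (left.size if left else 0)
                     + (right.size if right else 0))

def sample_one(node):
    # Descend the tree, branching with probability proportional to the
    # document counts. Runtime is O(height), i.e. O(log n) for a
    # balanced tree, with no full scan.
    r = random.randrange(node.size)
    left_size = node.left.size if node.left else 0
    if r < left_size:
        return sample_one(node.left)
    if node.doc is not None and r == left_size:
        return node.doc
    return sample_one(node.right)

# Example: a uniform pick from four documents.
tree = Node(doc="b", left=Node(doc="a"),
            right=Node(doc="c", right=Node(doc="d")))
print(sample_one(tree))
```

The expensive part is keeping `size` correct under concurrent inserts and deletes, which is presumably why that work is tracked as its own issue.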
Yet another option would be to introduce an approximate distribution query first. We could use the approximate distribution query to reduce the key space that we need to traverse to a small fraction, in a way that's approximately uniform with respect to the key distribution. While this will be somewhat off from an actual uniform sampling, I think it's going to be enough for 75% or so of all use cases. I think we should even consider making the approximate `.sample()` the default.
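A rough Python sketch of that idea, with assumed shapes (`buckets` as `(start_key, end_key, approx_count)` triples from the approximate distribution, and `scan_range` as a hypothetical helper that reads a single key range):

```python
import bisect
import random

def approximate_sample(buckets, scan_range, n):
    # buckets: list of (start_key, end_key, approx_count) triples.
    # Build a cumulative count so that a bucket can be picked with
    # probability proportional to its (approximate) document count.
    cumulative, total = [], 0
    for _, _, count in buckets:
        total += count
        cumulative.append(total)
    picked = []
    for _ in range(n):
        i = bisect.bisect_left(cumulative, random.uniform(0, total))
        start, end, _ = buckets[i]
        # Only this one small key range is scanned, not the whole table.
        docs = scan_range(start, end)
        picked.append(random.choice(docs))
    return picked
```

Because the bucket counts are only approximate and the same document can be picked twice, the result is only approximately uniform, which is exactly the trade-off described above.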
I always thought that our adherence to strict uniform distribution in `.sample()` was unnecessary. I'm extremely strongly in favor of making approximate sampling the default!
👍 for simple random sampling by default.
I think you're already in agreement, but adding one more 👍. As a user, if I was concerned with obtaining precisely uniform sampling, I'd certainly check the documentation to understand the implementation before using it. An optarg seems like a good approach from my perspective.
Alright, marking this a ReQL_proposal as follows: …
I don't think we should do this part until we get subtree counts. For now it would be a vanity option, as it's basically unusable because of speed. I think people will try to use the option, discover that it's super-slow, and submit bug reports without ever ending up using it. It seems strictly better to avoid this option until we can actually make it fast.
@coffeemug I don't think that's going to be a problem. We'll document it as a "more precise but slow" algorithm or something like that. Since it wouldn't be the default, most users will need to explicitly look it up before using it. I don't think there will be bug reports for an option that's not enabled by default and that's explicitly marked as being slow. Also note that we need to keep the slow implementation around anyway, since we need to support `.sample()` on arbitrary streams. There is also the question of what to do on streams which are not table slices. Our current algorithm is a lot better there than doing a …. One thing we could do is fail if the precise option is passed for a stream that isn't a table slice.
Also, I think "For now it would be a vanity option, as it's basically unusable because of speed" is an exaggeration. It's strictly more usable than a non-indexed …. The main problem with the current implementation is that all the work is done on a single thread, rather than being spread across the shards. So it has limited parallelism, but its runtime complexity is still only linear, and it only uses constant memory no matter the data size.
+1. Slight preference for …
Just noticed we never marked this settled. Implement as described here: #1520 (comment), however with the argument called …. I doubt we'll be able to ship this on time for 2.3, considering some of the other things we are now working on, but I'll leave it in the 2.3 milestone for now.
I would love to see this feature included in an upcoming release. I have several large tables (>10M docs) that struggle with the current implementation.
Hi rethinkers, any news on this? I have huge historical tables which I need to query for frontend charts. Is .sample() the suggested way to go for this?
Unfortunately, I'm not aware of anything having changed in this regard. It's quite common for people to run another database, better suited to aggregation or search, such as Elasticsearch, alongside Rethink. The Rethink database continues to serve as the single source of truth and is used for normal operations. But for those operations that Rethink is not great at, you would maintain a sort of shadow database, and query that instead. You would simply copy data over from Rethink to the other database, either on a regular interval, or in response to events in Rethink.
Thanks @ChrisTalman! But for now, without a shadow DB, is using .sample() the best way to go for this kind of operation in RethinkDB?
I think so. It'll be slow with hundreds of thousands of documents, but it'll work.
That'd probably also be slow, as it'd have to iterate over a lot of documents. A more exotic solution might be to add one or more fields containing a random value to every document, refreshed on a regular interval such as every hour or every day. You could then use a secondary index on that field to select some documents. But the results would be the same within each interval, and if you have a lot of documents, you'd probably hammer your database with write operations, degrading its performance and making it slow for anyone else trying to access it. And if you're going to invest that much time in a solution, you could just as well spend that time setting up a shadow database like Elasticsearch instead, which would afford you more options and better performance in any case.
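For what it's worth, a rough sketch of that approach with the Python driver (table and field names are made up, and this inherits all the caveats above):

```python
import random
import rethinkdb as r

conn = r.connect("localhost", 28015)

# One-time setup: a secondary index on the random field.
r.table("docs").index_create("rand").run(conn)

# Periodically (e.g. hourly via cron) re-randomize every document.
# r.random() is non-deterministic, so the update must be non-atomic;
# on a big table this is exactly the write storm mentioned above.
r.table("docs").update({"rand": r.random()}, non_atomic=True).run(conn)

# To "sample" 10 documents: pick a random starting point and read the
# next 10 entries in index order. Within one interval, the same start
# point returns the same documents.
start = random.random()
docs = (r.table("docs")
         .between(start, r.maxval, index="rand")
         .order_by(index="rand")
         .limit(10)
         .run(conn))
```

If `start` lands near the top of the index, fewer than 10 documents come back; a real version would wrap around to the bottom of the range.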
Found a very interesting thing... Executing …. Executed multiple times, restarted the server, tested again, same result. What's the trick? Why does it happen?!