Why is .sample() slow? #1520

Open
krzkrzkrz opened this issue Oct 6, 2013 · 32 comments

Comments

@krzkrzkrz

Why does $rethinkdb.table(:foobar).sample(1)[0] run slowly? Table :foobar only has 10,000 records/documents.

Slow, as in more than a few milliseconds; close to a second, actually. Normally, retrieving a single record out of 10k should be near-instant, at least on other db solutions.

@coffeemug
Contributor

I don't remember the details of the sample implementation, so I might be wrong on this (@jdoliner should chime in because he knows it much more intimately). As far as I remember, sample was designed to return a uniform distribution, and doing this right is surprisingly tricky. We picked a simpler implementation that does a linear time shuffle, with the intention of doing a smarter implementation later.

Sorry you're running into this -- we're exclusively working on performance now, and most issues should be ironed out in the next few months. Thank you so much for reporting the bugs and being patient!

@jdoliner
Contributor

jdoliner commented Oct 7, 2013

There are a couple of possible levels of optimization for sample. In the general case, sample is a linear-time operation, and right now we only implement the general case.

The first optimization we can make is rewriting it in terms of map reduce. This requires some anaphoric macro magic but other than that it's not too hard.

The next step would be optimizing sample on tables so that it doesn't read data off disk unless it actually selects the row. This might prove to be a little tricky but shouldn't be impossible.
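To illustrate that second step, here's a rough Python sketch (names like fetch_row are made up; the actual server code is C++): do a single-pass reservoir sample over the primary keys alone, and only read the full rows that survive the selection.

```python
import random

def sample_keys_then_fetch(keys, fetch_row, n):
    # Single-pass reservoir sample (Algorithm R) over primary keys only.
    # `keys` is any iterable of primary keys; `fetch_row(key)` is a
    # stand-in for reading one document off disk.
    reservoir = []
    seen = 0
    for key in keys:
        seen += 1
        if len(reservoir) < n:
            reservoir.append(key)
        else:
            j = random.randrange(seen)
            if j < n:
                reservoir[j] = key
    # Only the n selected rows are ever read; rejected rows stay on disk.
    return [fetch_row(key) for key in reservoir]
```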

The last step would be to do a BTree-aware optimization which could sample rows in logarithmic time. This one is quite challenging, so I don't think it's worth doing any time soon.

@danielmewes
Member

It looks like sample is especially slow compared to other linear time operations on tables.

Here are some times reported by @ha1331:

r.table('t').sample(2) 2.51s server time
r.table('t').sum('someField') 82ms server time
r.table('t').count() (result is 70387 rows) 17ms server time

This was on a table with 4 shards.
I think sample currently pulls all data onto a single server/thread first? My suspicion is that that's the reason for the slowness.

Here's a possible algorithm for pushing most of the work down to the shards:
Assume a query table.sample(n). Most likely n is going to be small compared to the table size.
We can now first uniformly sample n elements on each shard. As a side effect we can also get the exact number of scanned documents on the shard for free. Finally on the parsing node, we receive the n elements from each shard, and then sample n out of them with a probability distribution adjusted to the total number of scanned documents from each shard.
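A minimal sketch of that combining step in Python (illustrative only, not the actual C++ implementation): each shard sends back up to n uniformly sampled documents together with its scanned count, and the parsing node decides how many of the final n results to take from each shard, weighted by those counts.

```python
import random

def combine_shard_samples(shard_results, n):
    # shard_results: list of (sample, scanned) pairs, where `sample` is a
    # uniform sample of up to n documents from one shard and `scanned` is
    # how many documents that shard scanned in total.
    remaining = [scanned for _, scanned in shard_results]
    n = min(n, sum(remaining))
    picks = [0] * len(shard_results)
    # Decide, draw by draw, which shard each of the n results comes from,
    # proportionally to how many unpicked documents each shard still has.
    for _ in range(n):
        i = random.choices(range(len(shard_results)), weights=remaining)[0]
        picks[i] += 1
        remaining[i] -= 1
    # Take the decided number of distinct documents from each shard's sample.
    result = []
    for (sample, _), k in zip(shard_results, picks):
        result.extend(random.sample(sample, k))
    return result
```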

As an additional optimization we could avoid loading the data for rejected documents like @jdoliner mentioned, similar to what count does.

@marshall007
Contributor

@danielmewes FYI, the results are certainly less pronounced when running against a single shard:

table.sample(1) 320ms server time
table.sum('field') 200ms server time
table.avg('field') 202ms server time
table.count() (38717 rows) 23ms server time

Also note that table.sample(n) takes roughly the same amount of time regardless of the value n (including the special case where n >= total number of rows).

@danielmewes
Member

Thanks for testing, @marshall007. That would fit with the theory that the cost comes from collecting all data on one server first.

@mlucy
Member

mlucy commented Jul 9, 2015

@danielmewes -- that is in fact how it's implemented, and as far as I can tell there's no reason not to use the algorithm you described instead. (The reason we implemented it the way we did is that we originally assumed users would mostly be sampling arrays rather than whole tables, and at the time shards.cc didn't exist yet so implementing it efficiently for the sharded case would have been about twice as much work.)

@IanCal

IanCal commented Aug 23, 2015

As a bit more information, I'm running a single shard locally and r.table('example').sample(1).run(conn) in the python library takes 60s to complete on a table of about 200k rows, so it might not be the cost of pulling data from one shard to another. The frontend shows this as ~6k/s reads, so I'll have to do a bit more investigation to see if this is rethinkdb or the python lib that's causing the issue.

@danielmewes
Member

Thanks for the data @IanCal . That's really very slow. I doubt it's Python related, since as far as the client is concerned, that query will only return a single document.

Does your data fit into the cache?
For comparison: how long does it take to run r.table('example').map(r.row).count().run(conn)?

@IanCal

IanCal commented Aug 26, 2015

@danielmewes Thanks, that would appear to be it, r.table('example').map(r.row).count().run(conn) was similarly slow.

Setting the cache size to > the disk storage (so the interface now shows a cache usage of ~60%) brings the timings to:

0.6s for r.table('example').map(r.row).count().run(conn)

1.8s for r.table('example').sample(1).run(conn)

The query profiler only shows extremely small timings for any sub-task, apart from "Sampling elements.", which has a mean time of 0.008ms (recorded once per element) and so accounts for the majority of the time. If I get a chance I might see if I can compile rethinkdb myself, as the time is all being spent in a very small section of code that's pretty clear.

Edit -

So on another machine (ubuntu) I've built and run 2.1.1 with a few changes. Sampling a single item takes ~1.3s, which comes down to ~1.1s if I remove the profiling. Removing the core logic of swapping elements around only brings down the time by another 0.05s or so, which means the vast majority of the time is spent iterating over all the elements, and quite a substantial amount of time is spent profiling each element.

@danielmewes
Member

Yeah the profiler doesn't work too well for some queries. I wouldn't rely on the timings it gives you for this to be honest.

The remaining difference between map.count and sample is likely because of our inefficient sample implementation that was mentioned before, which concentrates all documents on a single thread. map.count on the other hand performs the count on multiple threads in parallel, and then just adds up the subresults.

@coffenbacher

I'm trying to run .sample(10) on a 2M document db and it takes 4min 34.88s server time!

Why does it seem to take a sample for every document in the db???
{
"description": "Sampling elements." ,
"mean_duration(ms)": 0.130639 ,
"n_samples": 2074151
}

@coffeemug
Contributor

Currently sample loads all the documents into memory, shuffles them, and then returns the sample. It's a really bad implementation, we just need to put in the time to make it fast.

@coffenbacher

Cool, sounds good. For now, I will try to write my own sampler using an index. This will be a great core feature when it's efficient enough to use.

@coffeemug
Contributor

@danielmewes -- could we consider bumping this when possible since this is such a common thing people run into?

@danielmewes danielmewes modified the milestones: 2.3, subsequent Oct 12, 2015
@danielmewes
Member

Yeah we'll try to get this into 2.3.

@danielmewes
Member

There's actually another aspect to this that I hadn't thought of earlier.
Once we implement subtree counts (compare #3949 ), we can actually do table.sample(n) in O(n log N) time, where N is the size of the table.
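As a toy Python sketch of the idea (an in-memory tree, not our actual BTree code): if every internal node stores how many documents live under each child, we can jump straight to the document at a random rank while touching only O(log N) nodes.

```python
import random

class Node:
    # Either a leaf holding documents, or an internal node whose children
    # each carry a precomputed count of the documents below them.
    def __init__(self, children=None, docs=None):
        self.children = children
        self.docs = docs
        self.count = (len(docs) if docs is not None
                      else sum(c.count for c in children))

def nth_document(node, k):
    # Descend to the k-th document (0-based) using the subtree counts.
    while node.children is not None:
        for child in node.children:
            if k < child.count:
                node = child
                break
            k -= child.count
    return node.docs[k]

def sample(root, n):
    # n distinct random ranks, each resolved in O(log N): O(n log N) total.
    ranks = random.sample(range(root.count), min(n, root.count))
    return [nth_document(root, k) for k in ranks]
```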

While we can implement the short-term improvement described in this issue to reduce the runtime by some factor, it will still be a linear-time O(N) operation.

The short-term improvements are not too hard I think, but still require a fair bit of work which might be better spent on getting #3949 done more quickly, at which point we could implement a truly fast sample operation.

I'll think about how to prioritize these...

@danielmewes
Member

Yet another option would be to introduce an approximate sample operation.

We could use the approximate distribution query to reduce the key space we need to traverse to a small fraction of the table, in a way that's approximately uniform with respect to the key distribution.

While this will be somewhat off from an actual uniform sampling, I think it's going to be enough for 75% or so of all use cases.
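Very roughly, in Python-like terms (the distribution query plumbing is hand-waved here via `boundaries` and `read_range`, and duplicate results aren't handled): pick a random key range from the approximate distribution for each sample and only scan inside that range, so only a small fraction of the table is ever traversed.

```python
import random

def approximate_sample(read_range, boundaries, n):
    # `boundaries` is a sorted list of keys splitting the table into roughly
    # equal-sized ranges, as an approximate distribution query might report.
    # `read_range(lo, hi)` is a stand-in for streaming one key range,
    # e.g. r.table(...).between(lo, hi) in ReQL.
    samples = []
    for _ in range(n):
        i = random.randrange(len(boundaries) - 1)
        chosen, seen = None, 0
        # Reservoir of size 1 over just this key range.
        for doc in read_range(boundaries[i], boundaries[i + 1]):
            seen += 1
            if random.randrange(seen) == 0:
                chosen = doc
        if chosen is not None:
            samples.append(chosen)
    return samples
```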

I think we should even consider making the approximate sample operation the default, and requiring an optarg (e.g. distribution="uniform" or uniform=true) to get the current behavior for cases that rely on a proper uniform distribution.

@coffeemug
Contributor

Yet another option would be to introduce an approximate sample operation.

I always thought that our adherence to strict uniform distribution in sample is an example of excessive purity. At the moment sample is unusable in 90% of use cases due to performance issues, and making it return a random sample without precisely specifying the distribution would be relatively easy to do and would suffice for 90% of cases where people want to use the command.

I'm extremely strongly in favor of making sample approximate by default. My guess is that if we do and document it clearly, the question of having a strictly uniform sample function will not come up with actual users for a long, long time.

@marshall007
Contributor

👍 for simple random sampling by default.

@coffenbacher

I think you're already in agreement but adding one more 👍. As a user, if I was concerned with obtaining precisely uniform sampling, I'd certainly check the documentation to understand the implementation before using it. An optarg seems like a good approach from my perspective.

@danielmewes
Member

Alright, marking this a ReQL_proposal as follows:

  • Make r.sample only guarantee approximately uniform sampling
  • When used on a table or table_slice, implement it in terms of something akin to our distribution query
  • Add an optarg distribution="uniform" (or similar) to make r.sample switch over to its current implementation that guarantees a strictly uniform sampling

@coffeemug
Contributor

Add an optarg distribution="uniform" (or similar) to make r.sample switch over to its current implementation that guarantees a strictly uniform sampling

I don't think we should do this part until we get subtree counts. For now it would be a vanity option, as it's basically unusable because of speed. I think people will try to use the option, discover that it's super-slow, and submit bug reports without ever ending up using it. It seems strictly better to avoid this option, until we can actually make it be fast.

@danielmewes
Member

@coffeemug I don't think that's going to be a problem. We'll document it as a "more precise but slow" algorithm or something like that. Since it wouldn't be the default, most users will need to explicitly look it up before using it. I don't think there will be bug reports for an option that's not enabled by default and that's explicitly marked as being slow.

Also note that we need to keep the slow implementation around anyway, since we need to support sample on arrays. In that case I think it's nice to allow people to set the option so that they can get a guarantee that the sampling is uniform, rather than having sample be strictly uniform on arrays and only approximately uniform (with potentially significant deviations) on tables.

There is also the question about what to do on streams which are not table slices. Our current algorithm is a lot better than doing a coerceTo('array').sample(n), since it doesn't load all data into memory first. It actually properly streams the data and keeps no more than n results in memory at a time.

One thing we could do is fail if .sample is called with distribution="uniform" on an input that does support approximate sampling. But I don't think that would be very helpful, as it reduces the number of use cases (there is no equivalent non-hacky work-around that I know of for users who need uniform sampling in that case, other than implementing it in the client).

@danielmewes
Member

Also I think "For now it would be a vanity option, as it's basically unusable because of speed" is an exaggeration. It's strictly more usable than a non-indexed orderBy for example.

The main problem with the current implementation is that all the work is done on a single thread, rather than working on the shards. So it has limited parallelism, but its runtime complexity is still only linear, and it only uses constant memory no matter the data size.

@danielmewes danielmewes modified the milestones: 2.3, 2.3-polish Nov 19, 2015
@mlucy
Member

mlucy commented Dec 9, 2015

I think we should even consider making the approximate sample operation the default, and requiring an optarg (e.g. distribution="uniform" or uniform=true) to get the current behavior for cases that rely on a proper uniform distribution.

+1 . Slight preference for uniform=true.

@danielmewes
Member

Just noticed we never marked this settled.
Doing so now, with the following plan:

Implement as described here #1520 (comment) , however with the argument called uniform=true, and it being false by default.
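Just to illustrate the proposal in Python driver syntax (this optarg does not exist today, it's only the proposed interface):

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Default: fast, approximately uniform sampling.
r.table('events').sample(10).run(conn)

# Proposed optarg for the slower, strictly uniform algorithm.
r.table('events').sample(10, uniform=True).run(conn)
```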

I doubt we'll be able to ship this on time for 2.3 considering some of the other things we are now working on, but I'll leave it in 2.3-polish for now.

@danielmewes danielmewes modified the milestones: 2.3-polish, subsequent Apr 7, 2016
@iquizzle

iquizzle commented May 5, 2016

I would love to see this feature included in an upcoming release. I have several large tables (>10M docs) that struggle with the current implementation.

@zxpectre

Hi rethinkers, any news on this?

I have huge historical tables which I need to query for frontend charts. Is .sample() the suggested way to go with this?

@ChrisTalman

Unfortunately I'm not aware of anything having changed in this regard.

It's quite common for people to run another database, better suited to aggregation or search, such as Elasticsearch, alongside Rethink. The Rethink database continues to serve as the single source of truth and is used for normal operations. But for those operations that Rethink is not great at, you would maintain a sort of shadow database, and query that instead. You would simply copy data over from Rethink to the other database, either on a regular interval, or in response to events in Rethink.

@zxpectre

Thanks @ChrisTalman! But for now, without a shadow db, is using .sample() the best way to go for this kind of operation in RethinkDB?
Could filtering by array position with .filter() do the job better than sample? I'm one week into RethinkDB, so I don't know the best practices in this regard yet.

@ChrisTalman

is using .sample() the best way to go for this kind of operation in RethinkDB?

I think so. It'll be slow with hundreds of thousands of documents, but it'll work.

Could a .filter() by array position filtering do the work better than sample?

That'd probably also be slow, as it'd have to iterate over a lot of documents.

A more exotic solution might be to add one or more fields containing a random value to every document, and to regenerate those values on a regular interval, such as every hour or every day. You could then use a secondary index to select some documents. But the results would be the same within each interval, and if you have a lot of documents, you'd probably hammer your database with write operations, degrading its performance and making it slow for anyone else trying to access it. And if you're going to invest that much time into a solution, you could just as easily spend that time setting up a shadow database like Elasticsearch instead, which would afford you more options and better performance in any case.
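Just to sketch what that might look like with the Python driver (table and field names are made up, and exact chaining and driver-version details may vary):

```python
import random
import rethinkdb as r  # pre-2.4 driver import style

conn = r.connect('localhost', 28015)

# One-time setup: a secondary index on a field that will hold a random value.
r.table('docs').index_create('rand').run(conn)
r.table('docs').index_wait('rand').run(conn)

# Run this on a schedule (e.g. hourly). r.random() is non-deterministic,
# hence non_atomic=True. This rewrites every document, so it's heavy.
r.table('docs').update({'rand': r.random()}, non_atomic=True).run(conn)

def rough_sample(n):
    # Grab n documents starting at a random point in the 'rand' index.
    # Cheap thanks to the index, but only as "random" as the last refresh.
    start = random.random()
    docs = list(r.table('docs')
                 .between(start, r.maxval, index='rand')
                 .order_by(index='rand')
                 .limit(n)
                 .run(conn))
    if len(docs) < n:
        # Wrap around if the random starting point was too close to the top.
        docs += list(r.table('docs')
                      .between(r.minval, start, index='rand')
                      .order_by(index='rand')
                      .limit(n - len(docs))
                      .run(conn))
    return docs
```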

@yaneony

yaneony commented Jun 30, 2021

Found a very interesting thing...
Executing
r.db('mydb').table('mytable').sample(10)
on a 250k-item table gives me a response time of ~1.45 seconds.

Executing
r.db('mydb').table('mytable').sample(100000).sample(10000).sample(1000).sample(10)
on the same 250k-item table gives me an ~800ms response time.

Executed multiple times, restarted server, tested again, same result. What's the trick? Why does it happen?!
