Calling .getAll() with a large number of keys is not efficient #4218

Closed
marshall007 opened this issue May 12, 2015 · 11 comments

@marshall007
Contributor

Discussed with @AtnNn on IRC. I'm trying to delete ~8k documents using .getAll() on a secondary index. The table has ~19k documents in total. I've run the following query several times, letting it run for up to five minutes each time, but I've yet to see it run to completion:

var terms = r.http('https://example.com/data.json')('term');

r.db(...).table('terms')
  .getAll(r.args(terms), { index: 'term' })
  .delete()

Replacing .delete() with .count() finishes in ~7.5s. Based on our discussion and the things I've tried, I don't think this has anything to do with r.http; rather, .getAll() is simply less efficient with a large number of keys. Also, as suggested by @AtnNn, this works:

r.http('https://example.com/data.json')('term')
  .forEach(function (row) {
    return r.table('terms')
      .getAll(row, { index: 'term' })
      .delete()
  })
@thelinuxlich

The latter is faster? This is sad :(

@marshall007
Contributor Author

@thelinuxlich well, like I said, the first one never finishes executing (as far as I can tell). The forEach is slow but actually works. If I throw a .limit(1000) on terms then the getAll works and is much faster.
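
For reference, a minimal sketch of the same workaround applied in fixed-size chunks rather than with a single .limit(1000). This is not from the thread; the helper name, the chunking approach, and the use of the official rethinkdb JS driver are assumptions:

var r = require('rethinkdb');

// Hypothetical helper: split the ~8k keys into chunks and issue one getAll
// per chunk, since getAll with a bounded number of keys reportedly works fine.
function deleteInChunks(conn, keys, chunkSize) {
  var chunks = [];
  for (var i = 0; i < keys.length; i += chunkSize) {
    chunks.push(keys.slice(i, i + chunkSize));
  }
  // Run the chunks sequentially so only one large read is in flight at a time.
  return chunks.reduce(function (p, chunk) {
    return p.then(function () {
      return r.table('terms')
        .getAll(r.args(chunk), { index: 'term' })
        .delete()
        .run(conn);
    });
  }, Promise.resolve());
}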

@danielmewes
Member

Thanks for opening an issue about this @marshall007 .
We know that getAll with many arguments is inefficient for two reasons:

The fact that your query example is so much slower with a .delete() than it is with a .count() suggests that there is something else wrong though.
We have to investigate that.

@danielmewes danielmewes added this to the 2.1-polish milestone May 12, 2015
@danielmewes danielmewes self-assigned this May 12, 2015
@mlucy
Member

mlucy commented May 12, 2015

@danielmewes -- it's possible that get_all with ~8k keys is slow because of the recent union changes (i.e. it's trying to do 8k operations in parallel and then union them together, and the code is falling over for some reason). That's the only reason I can imagine that the forEach would be faster. This wouldn't surprise me; the new union infrastructure wasn't written with the 8k case in mind, so it might be doing something obviously stupid.

If that is the case and you start on a fix, be careful to think about the case where you union 8k changefeeds that never receive a document with 1 changefeed that does (i.e. in the changefeed case we can't just block on a subset of the streams indefinitely).
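
To make the parallel-union picture concrete, a rough ReQL/JavaScript sketch (the key values are placeholders) of how a multi-key getAll corresponds to a union of single-key index reads, which is why ~8k keys would mean ~8k reads spawned at once:

var r = require('rethinkdb');

var keys = ['a', 'b', 'c']; // stand-ins for the ~8k terms

// A single getAll over many keys ...
var combined = r.table('terms').getAll(r.args(keys), { index: 'term' });

// ... behaves conceptually like a union of one index read per key:
var unioned = keys
  .map(function (k) { return r.table('terms').getAll(k, { index: 'term' }); })
  .reduce(function (a, b) { return a.union(b); });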

@marshall007
Contributor Author

Another thing I meant to mention is that while this query is running, most other operations fail (i.e. I can't create new connections, and queries on existing connections time out). Also, the .delete() isn't necessary to reproduce the problem; I see the same issue with just the .getAll().

@mlucy
Member

mlucy commented May 12, 2015

That makes me pretty sure that the problem is we're spawning ~8k reads in parallel rather than throttling them. We should probably throttle calls to next_batch for everything except changefeed streams (although even for changefeed streams we should probably stagger the initial subscription reads).
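
As a client-side illustration of the throttling idea (the actual change discussed here is server-side; the helper name and concurrency limit below are made up), a fixed pool of workers that each pick up the next key as they finish:

var r = require('rethinkdb');

// Hypothetical analogue of throttling: keep at most `limit` single-key reads
// in flight instead of spawning all ~8k at once.
function throttledDelete(conn, keys, limit) {
  var next = 0;
  function worker() {
    if (next >= keys.length) return Promise.resolve();
    var key = keys[next++];
    return r.table('terms')
      .getAll(key, { index: 'term' })
      .delete()
      .run(conn)
      .then(worker); // pick up the next key when this one finishes
  }
  var workers = [];
  for (var i = 0; i < limit; i++) workers.push(worker());
  return Promise.all(workers);
}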

@danielmewes
Member

Oh wait, I forgot you changed union to process things in parallel. In that case ignore my first bullet point. Your theory makes sense. I'll try to reproduce it and see what I can find out.

@marshall007
Contributor Author

It sounds like you guys have it figured out, but I just confirmed this is a regression. I was able to run the .delete() query in ~1min on one of our production machines that's still on v1.15.x. The .getAll() by itself was taking <10s.

@danielmewes
Member

Oh that's pretty bad. Thanks for checking @marshall007 .
Moving to 2.0.x.

@danielmewes danielmewes modified the milestones: 2.0.x, 2.1-polish May 12, 2015
@danielmewes
Member

A fix like the one suggested by @mlucy is implemented in branch daniel_4218 and in code review 2879. This leaves the behavior for changefeeds unchanged, but should fix the problem for non-changefeed getAlls.

In a quick test this reduced the amount of memory used by a table.getAll(r.args(r.range(0, 8000).coerceTo('ARRAY'))).delete() from ~3 GB to less than 100 MB, and it made the query about twice as fast even on a server with sufficient memory available.
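
For reference, a runnable version of that benchmark query is sketched below; the connection details and the table name 'table' are assumptions, and the delete only removes documents if the table's primary keys actually fall in 0..7999:

var r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }).then(function (conn) {
  // The benchmark query quoted above: a getAll over 8000 generated keys.
  return r.table('table')
    .getAll(r.args(r.range(0, 8000).coerceTo('ARRAY')))
    .delete()
    .run(conn)
    .then(function (result) {
      console.log(result);
      return conn.close();
    });
});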

@danielmewes
Member

Fixed in v2.0.x as of commit 539a6b7
and in next by commit 263bc6c

Will ship with RethinkDB 2.0.2.

@danielmewes danielmewes modified the milestones: 2.0.x, 2.0.2 May 19, 2015