Calling .getAll() with a large number of keys is not efficient #4218

Closed
marshall007 opened this issue May 12, 2015 · 11 comments

@marshall007
Contributor

Discussed with @AtnNn on IRC. I'm trying to delete ~8k documents using .getAll() on a secondary index. The table has ~19k documents in total. I've run the following query several times, letting it run for up to five minutes each time, but I've yet to see it run to completion:

var terms = r.http('https://example.com/data.json')('term');

r.db(...).table('terms')
  .getAll(r.args(terms), { index: 'term' })
  .delete()

Replacing .delete() with .count() finishes in ~7.5s. Based on our discussion and the things I've tried, I don't think this has anything to do with r.http; rather, .getAll() is simply less efficient with a large number of keys. Also, as suggested by @AtnNn, this works:

r.http('https://example.com/data.json')('term')
  .forEach(function (row) {
    return r.table('terms')
      .getAll(row, { index: 'term' })
      .delete()
  })
@thelinuxlich

The latter is faster? This is sad :(

@marshall007
Contributor Author

@thelinuxlich well, like I said, the first one never finishes executing (as far as I can tell). The forEach is slow but actually works. If I throw a .limit(1000) on terms then the getAll works and is much faster.
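
For reference, a minimal sketch of the same workaround applied in fixed-size chunks rather than with a single .limit(1000). This is not from the thread; the helper name, the chunking approach, and the use of the official rethinkdb JS driver are assumptions:

var r = require('rethinkdb');

// Hypothetical helper: split the ~8k keys into chunks and issue one getAll
// per chunk, since getAll with a bounded number of keys reportedly works fine.
function deleteInChunks(conn, keys, chunkSize) {
  var chunks = [];
  for (var i = 0; i < keys.length; i += chunkSize) {
    chunks.push(keys.slice(i, i + chunkSize));
  }
  // Run the chunks sequentially so only one large read is in flight at a time.
  return chunks.reduce(function (p, chunk) {
    return p.then(function () {
      return r.table('terms')
        .getAll(r.args(chunk), { index: 'term' })
        .delete()
        .run(conn);
    });
  }, Promise.resolve());
}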

@danielmewes
Member

Thanks for opening an issue about this @marshall007 .
We know that getAll with many arguments is inefficient for two reasons:

The fact that your query example is so much slower with a .delete() than it is with a .count() suggests that there is something else wrong though.
We have to investigate that.

@danielmewes danielmewes added this to the 2.1-polish milestone May 12, 2015
@danielmewes danielmewes self-assigned this May 12, 2015
@mlucy
Member

mlucy commented May 12, 2015

@danielmewes -- it's possible that get_all with ~8k keys is slow because of the recent union changes (i.e. it's trying to do 8k operations in parallel and then union them together, and the code is falling over for some reason). That's the only reason I can imagine that the forEach would be faster. This wouldn't surprise me; the new union infrastructure wasn't written with the 8k case in mind, so it might be doing something obviously stupid.

If that is the case and you start on a fix, be careful to think about the case where you union 8k changefeeds that never receive a document with 1 changefeed that does (i.e. in the changefeed case we can't just block on a subset of the streams indefinitely).
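
To make the parallel-union picture concrete, a rough ReQL/JavaScript sketch (the key values are placeholders) of how a multi-key getAll corresponds to a union of single-key index reads, which is why ~8k keys would mean ~8k reads spawned at once:

var r = require('rethinkdb');

var keys = ['a', 'b', 'c']; // stand-ins for the ~8k terms

// A single getAll over many keys ...
var combined = r.table('terms').getAll(r.args(keys), { index: 'term' });

// ... behaves conceptually like a union of one index read per key:
var unioned = keys
  .map(function (k) { return r.table('terms').getAll(k, { index: 'term' }); })
  .reduce(function (a, b) { return a.union(b); });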

@marshall007
Contributor Author

Another thing I meant to mention is that while this query is running, most other operations fail (i.e. I can't create new connections, and queries on existing connections time out). Also, the .delete() isn't necessary to reproduce the problem; I see the same issue with just the .getAll().

@mlucy
Member

mlucy commented May 12, 2015

That makes me pretty sure that the problem is we're spawning ~8k reads in parallel rather than throttling them. We should probably throttle calls to next_batch for everything except changefeed streams (although even for changefeed streams we should probably stagger the initial subscription reads).
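
As a client-side illustration of the throttling idea (the actual change discussed here is server-side; the helper name and concurrency limit below are made up), a fixed pool of workers that each pick up the next key as they finish:

var r = require('rethinkdb');

// Hypothetical analogue of throttling: keep at most `limit` single-key reads
// in flight instead of spawning all ~8k at once.
function throttledDelete(conn, keys, limit) {
  var next = 0;
  function worker() {
    if (next >= keys.length) return Promise.resolve();
    var key = keys[next++];
    return r.table('terms')
      .getAll(key, { index: 'term' })
      .delete()
      .run(conn)
      .then(worker); // pick up the next key when this one finishes
  }
  var workers = [];
  for (var i = 0; i < limit; i++) workers.push(worker());
  return Promise.all(workers);
}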

@danielmewes
Member

Oh wait, I forgot you changed union to process things in parallel. In that case ignore my first bullet point. Your theory makes sense. I'll try to reproduce it and see what I can find out.

@marshall007
Contributor Author

It sounds like you guys have it figured out, but I just confirmed this is a regression. I was able to run the .delete() query in ~1min on one of our production machines that's still on v1.15.x. The .getAll() by itself was taking <10s.

@danielmewes
Member

Oh that's pretty bad. Thanks for checking @marshall007 .
Moving to 2.0.x.

@danielmewes danielmewes modified the milestones: 2.0.x, 2.1-polish May 12, 2015
@danielmewes
Member

A fix like the one suggested by @mlucy is implemented in branch daniel_4218 and in code review 2879. This leaves the behavior for changefeeds unchanged, but should fix the problem for non-changefeed getAlls.

In a quick test this reduced the amount of memory used by a table.getAll(r.args(r.range(0, 8000).coerceTo('ARRAY'))).delete() from ~3 GB to less than 100 MB, and it made the query about twice as fast even on a server with sufficient memory available.
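
For reference, a runnable version of that benchmark query is sketched below; the connection details and the table name 'table' are assumptions, and the delete only removes documents if the table's primary keys actually fall in 0..7999:

var r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }).then(function (conn) {
  // The benchmark query quoted above: a getAll over 8000 generated keys.
  return r.table('table')
    .getAll(r.args(r.range(0, 8000).coerceTo('ARRAY')))
    .delete()
    .run(conn)
    .then(function (result) {
      console.log(result);
      return conn.close();
    });
});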

@danielmewes
Member

Fixed in v2.0.x as of commit 539a6b7
and in next by commit 263bc6c

Will ship with RethinkDB 2.0.2.

@danielmewes danielmewes modified the milestones: 2.0.x, 2.0.2 May 19, 2015