
count using index for speed on large tables please #1733

Closed · underrun opened this issue Dec 4, 2013 · 16 comments

@underrun commented Dec 4, 2013

Doing something like:

r.db('test').table('data').getAll('20131130', {index: 'date'}).count()

is very slow for large tables. It would be nice if instead we could do something like:

r.db('test').table('data').count('20131130', {index: 'date'})

and get a count using the index. While I haven't looked at the code, it should be faster to skip fetching all the documents with a matching index entry (that is what getAll does, right?) and counting the results, if you could just look at the index itself and count what's there.
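
For reference, a range variant of the same idea is already expressible today: between() also accepts an index argument, so the count can be restricted to a slice of the index. A sketch using the table and 'date' index from above (the '20131201' upper bound is an assumed next-day key):

// Count every 'date' index entry for Nov 30 (the upper bound is exclusive):
r.db('test').table('data').between('20131130', '20131201', {index: 'date'}).count()

This still walks the matching index entries, so it is linear in the number of matches rather than constant time.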

@underrun (Author) commented Dec 4, 2013

As an aside: document databases are most applicable when data is not relational at all and indexes primarily serve to grab a document by some attribute other than the primary key. Even so, being able to operate on the indexes themselves (or on combinations of multiple indexes) to acquire metadata (especially counts) would really help RethinkDB "feel" faster to the data analysts and scientists using it. Inserting, updating, reading, or otherwise touching the actual data can take time, but making whatever metadata you can very easy and fast to generate or access would be huge.
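
For illustration: in later ReQL versions this kind of index-level metadata query can be written with group(), which accepts an index argument. A sketch, assuming the 'data' table and 'date' index from the example above:

// Per-date document counts computed by grouping on the 'date' index:
r.db('test').table('data').group({index: 'date'}).count()

The result maps each distinct 'date' value to the number of documents carrying it.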

@jdoliner (Contributor) commented Dec 4, 2013

The code actually already does what you're talking about. In your query, no documents should be read off disk, and only the parts of the index that pertain to the given date should be processed. Indexes only get you so far in optimizing count, though. To optimize this further we'd need to do #1118, which would allow for constant-time count.

I'm going to close this issue, because as far as I can tell all of the optimizations it suggests are already implemented. People should feel free to reopen if I'm wrong.

@jdoliner jdoliner closed this as completed Dec 4, 2013
@neumino (Member) commented Dec 4, 2013

With a table of about 130k documents, I get:

r.db('test').table('test').count()

RethinkDB 1.7 -- 728ms
RethinkDB 1.11 (based on next) -- 3.59s

r.db('test').table('test').getAll(1, {index: 'value'}).count()

RethinkDB 1.7 -- 1.12s
RethinkDB 1.11 (based on next) -- 3.59s

Reopening to find/fix the regression.
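
For reference, a minimal sketch of how timings like these can be reproduced with the JavaScript driver (the host, port, and table contents are assumptions):

var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015}, function(err, conn) {
  if (err) throw err;
  var start = Date.now();
  // Same query as above: count the documents whose 'value' index equals 1.
  r.db('test').table('test').getAll(1, {index: 'value'}).count()
    .run(conn, function(err, result) {
      if (err) throw err;
      console.log('count =', result, 'in', (Date.now() - start) + 'ms');
      conn.close();
    });
});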

@neumino neumino reopened this Dec 4, 2013
@jdoliner (Contributor) commented Dec 4, 2013

Okay something's clearly wrong here.

@mlucy (Member) commented Dec 4, 2013

Scheduling for 1.11.x; this looks like a pretty serious performance regression.

@coffeemug (Contributor)

Could someone take this issue? It's pretty easy to hit and there are reports of it on Twitter.

@mlucy (Member) commented Dec 12, 2013

I took a look at this. I can confirm that there was a major slowdown in count (~5-10x).

@srh -- Your changes to make btree reads happen in parallel (the concurrent_traversal stuff starting around Aug. 28) seem to be where the slowdown starts. You added a line in e69c7fe which seems to make even count queries fetch things off disk, with an RSI to fix it, but that RSI was then removed in 1278dbd without any noticeable fix being made. When you get a chance, do you think you could speed this code back up?

@mlucy (Member) commented Dec 12, 2013

Wait, belay that; I think that may not be where the slowdown starts. (I'm still curious about that RSI, though.)

@mlucy (Member) commented Dec 12, 2013

Yeah, it starts way after that. I just ran the wrong build when I was testing things.

@mlucy (Member) commented Dec 12, 2013

Wait, no, I was right the first time.

Alright, here's the breakdown:

  • In release mode, e69c7fe seems to be the start of the slowdown (8eaf37a, one of its parents, is way faster). I think the concurrent traversal code broke efficient count. Count is about 3x slower after this merge.
  • In debug mode, the biggest hit to count is Daniel's change to make backtraces span coroutine boundaries. This change results in about a 5x slowdown, but it doesn't matter because it's in debug mode. (This is why I was confused.)

@srh -- could you look into making the concurrent traversal code faster for count, when you get a chance?

@srh (Contributor) commented Jan 9, 2014

There is a hackish solution to this that I'm implementing for people interested in having this be a point release. It's implemented and being tested right now; it should be in code review soon. The hackish solution is to look at the terminal, the transforms, and the sindex_function in handle_pair in rdb_protocol/btree.cc, and to special-case the value loading when we do a count query.

This is bad because it duplicates the logic in one place, but the hackishness will go away once we change the code to actually process the datums concurrently. Right now we merely load them from disk concurrently.

@srh (Contributor) commented Jan 9, 2014

This is currently implemented in branch sam_1733 (branched off v1.11.x) and in review 1114.

@srh (Contributor) commented Jan 9, 2014

This is merged to v1.11.x.

@srh srh closed this as completed Jan 9, 2014
@AtnNn (Member) commented Jan 16, 2014

RethinkDB 1.11.3 has been released with a fix for this issue.

@lookingcloudy

I'm running 1.14.1 and just tried a count() on a large table. It's very, very slow: a 4.64s response time when it should be instantaneous. Here's the profile:

[
  {
    "description": "Evaluating count.",
    "duration(ms)": 4628.042196,
    "sub_tasks": [
      {
        "description": "Evaluating table.",
        "duration(ms)": 0.035773,
        "sub_tasks": [
          {
            "description": "Evaluating db.",
            "duration(ms)": 0.020335,
            "sub_tasks": [
              {"description": "Evaluating datum.", "duration(ms)": 0.0002, "sub_tasks": []}
            ]
          },
          {"description": "Evaluating datum.", "duration(ms)": 0.000118, "sub_tasks": []}
        ]
      },
      {
        "description": "Perform read.",
        "duration(ms)": 4627.982693,
        "sub_tasks": [
          {
            "parallel_tasks": [
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3600.315737,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3600.30516, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3605.220024,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3605.210496, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3647.771427,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3647.762547, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 4626.888251,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 4626.874681, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 4627.594332,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 4627.58817, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3650.904346,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3650.893839, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3634.760805,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3634.74648, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3624.56552,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3624.556311, "sub_tasks": []}
                  ]
                }
              ]
            ]
          }
        ]
      }
    ]
  }
]
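
For reference, a profile like the one above can be captured by passing the profile option to run. A sketch with the JavaScript driver, where 'bigtable' and conn are placeholders:

// With {profile: true}, the driver returns a wrapper object holding both
// the query result and the server-side profile.
r.table('bigtable').count().run(conn, {profile: true}, function(err, result) {
  if (err) throw err;
  console.log('value =', result.value);
  console.log(JSON.stringify(result.profile, null, 2));
});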

@neumino (Member) commented Sep 22, 2014

We will add a constant-time count -- see #152 to track progress. #1118 should also provide a workaround.

For now, if the data doesn't fit in cache and RethinkDB runs on a hard drive, things can feel a bit slow.
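
Until then, one common workaround is to maintain a counter document and update it atomically alongside writes. A sketch, where the 'counters' table and 'data_total' key are made-up names:

// Bump the running total whenever a document is inserted (hypothetical names):
r.table('counters').get('data_total')
  .update({n: r.row('n').add(1)})
  .run(conn, function(err, res) { if (err) throw err; });

// Reading the count is then a single-document fetch instead of an index scan:
r.table('counters').get('data_total')('n').run(conn, function(err, n) {
  if (err) throw err;
  console.log('count =', n);
});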
