
count using index for speed on large tables please #1733

Closed · underrun opened this issue Dec 4, 2013 · 16 comments

@underrun commented Dec 4, 2013

Doing something like:

r.db('test').table('data').getAll('20131130', {index: 'date'}).count()

is very slow for large tables. It would be nice if instead we could do something like:

r.db('test').table('data').count('20131130', {index: 'date'})

and get a count using the index. While I haven't looked at the code, it should be faster to skip fetching all the documents with a matching index entry (that is what getAll does, right?) and counting the results, if you could just look at the index itself and count what's there.
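
For reference, a range variant of the same idea is already expressible today: between() also accepts an index argument, so the count can be restricted to a slice of the index. A sketch using the table and 'date' index from above (the '20131201' upper bound is an assumed next-day key):

// Count every 'date' index entry for Nov 30 (the upper bound is exclusive):
r.db('test').table('data').between('20131130', '20131201', {index: 'date'}).count()

This still walks the matching index entries, so it is linear in the number of matches rather than constant time.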

@underrun (Author) commented Dec 4, 2013

As an aside: document databases are most applicable when data is not relational at all and indexes primarily serve to grab a document by some attribute other than the primary key. Even so, being able to operate on the indexes themselves (or on combinations of multiple indexes) to acquire metadata (especially counts) would really help RethinkDB "feel" faster to the data analysts and scientists using it. Inserting, updating, reading, or otherwise touching the actual data can take time, but making whatever metadata you can very easy and fast to generate or access would be huge.
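
For illustration: in later ReQL versions this kind of index-level metadata query can be written with group(), which accepts an index argument. A sketch, assuming the 'data' table and 'date' index from the example above:

// Per-date document counts computed by grouping on the 'date' index:
r.db('test').table('data').group({index: 'date'}).count()

The result maps each distinct 'date' value to the number of documents carrying it.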

@jdoliner (Contributor) commented Dec 4, 2013

The code actually already does what you're talking about. In your query, no documents should be read off disk, and only the parts of the index that pertain to the given date should be processed. Indexes only get you so far in optimizing count, though. To optimize this further we'd need to do #1118, which would allow for constant-time count.

I'm going to close this issue, because as far as I can tell all of the optimizations it suggests are already implemented. People should feel free to reopen if I'm wrong.

@jdoliner jdoliner closed this as completed Dec 4, 2013
@neumino (Member) commented Dec 4, 2013

With a table of about 130k documents, I get:

r.db('test').table('test').count()

RethinkDB 1.7 -- 728ms
RethinkDB 1.11 (based on next) -- 3.59s

r.db('test').table('test').getAll(1, {index: 'value'}).count()

RethinkDB 1.7 -- 1.12s
RethinkDB 1.11 (based on next) -- 3.59s

Reopening to find/fix the regression.
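
For reference, a minimal sketch of how timings like these can be reproduced with the JavaScript driver (the host, port, and table contents are assumptions):

var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015}, function(err, conn) {
  if (err) throw err;
  var start = Date.now();
  // Same query as above: count the documents whose 'value' index equals 1.
  r.db('test').table('test').getAll(1, {index: 'value'}).count()
    .run(conn, function(err, result) {
      if (err) throw err;
      console.log('count =', result, 'in', (Date.now() - start) + 'ms');
      conn.close();
    });
});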

@neumino neumino reopened this Dec 4, 2013
@jdoliner (Contributor) commented Dec 4, 2013

Okay something's clearly wrong here.

@mlucy (Member) commented Dec 4, 2013

Scheduling for 1.11.x; this looks like a pretty serious performance regression.

@coffeemug (Contributor)

Could someone take this issue? It's pretty easy to hit and there are reports of it on Twitter.

@mlucy (Member) commented Dec 12, 2013

I took a look at this. I can confirm that there was a major slowdown in count (~5-10x).

@srh -- Your changes to make btree reads happen in parallel (the concurrent_traversal stuff starting around Aug. 28) seem to be where the slowdown starts. You added a line in e69c7fe which seems to make even count queries fetch things off disk, with an RSI to fix it, but that RSI was then removed in 1278dbd without any noticeable fix being made. When you get a chance, do you think you could speed this code back up?

@mlucy (Member) commented Dec 12, 2013

Wait, belay that; I think that may not be where the slowdown starts. (I'm still curious about that RSI, though.)

@mlucy (Member) commented Dec 12, 2013

Yeah, it starts way after that. I just ran the wrong build when I was testing things.

@mlucy (Member) commented Dec 12, 2013

Wait, no, I was right the first time.

Alright, here's the breakdown:

  • In release mode, e69c7fe seems to be the start of the slowdown (8eaf37a, one of its parents, is way faster). I think the concurrent traversal code broke efficient count. Count is about 3x slower after this merge.
  • In debug mode, the biggest hit to count is Daniel's change to make backtraces span coroutine boundaries. This change results in about a 5x slowdown, but it doesn't matter because it's in debug mode. (This is why I was confused.)

@srh -- could you look into making the concurrent traversal code faster for count, when you get a chance?

@srh (Contributor) commented Jan 9, 2014

There is a hackish solution to this that I'm implementing for people interested in having this be a point release. It's implemented and being tested right now; it should be in code review soon. The hackish solution is to look at the terminal, the transforms, and the sindex_function in handle_pair in rdb_protocol/btree.cc, and to special-case the value loading when we do a count query.

This is bad because it duplicates the logic in one place, but the hackishness will go away once we change the code to actually process the datums concurrently. Right now we merely load them from disk concurrently.

@srh (Contributor) commented Jan 9, 2014

This is currently implemented in branch sam_1733 (branched off v1.11.x) and in review 1114.

@srh (Contributor) commented Jan 9, 2014

This is merged to v1.11.x.

@srh srh closed this as completed Jan 9, 2014
@AtnNn (Member) commented Jan 16, 2014

RethinkDB 1.11.3 has been released with a fix for this issue.

@lookingcloudy

I'm running 1.14.1 and just tried a count() on a large table. It's very, very slow: a 4.64s response time when it should be instantaneous. Here's the profile:

[
  {
    "description": "Evaluating count.",
    "duration(ms)": 4628.042196,
    "sub_tasks": [
      {
        "description": "Evaluating table.",
        "duration(ms)": 0.035773,
        "sub_tasks": [
          {
            "description": "Evaluating db.",
            "duration(ms)": 0.020335,
            "sub_tasks": [
              {"description": "Evaluating datum.", "duration(ms)": 0.0002, "sub_tasks": []}
            ]
          },
          {"description": "Evaluating datum.", "duration(ms)": 0.000118, "sub_tasks": []}
        ]
      },
      {
        "description": "Perform read.",
        "duration(ms)": 4627.982693,
        "sub_tasks": [
          {
            "parallel_tasks": [
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3600.315737,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3600.30516, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3605.220024,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3605.210496, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3647.771427,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3647.762547, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 4626.888251,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 4626.874681, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 4627.594332,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 4627.58817, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3650.904346,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3650.893839, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3634.760805,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3634.74648, "sub_tasks": []}
                  ]
                }
              ],
              [
                {
                  "description": "Perform read on shard.",
                  "duration(ms)": 3624.56552,
                  "sub_tasks": [
                    {"description": "Do range scan on primary index.", "duration(ms)": 3624.556311, "sub_tasks": []}
                  ]
                }
              ]
            ]
          }
        ]
      }
    ]
  }
]
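
For reference, a profile like the one above can be captured by passing the profile option to run. A sketch with the JavaScript driver, where 'bigtable' and conn are placeholders:

// With {profile: true}, the driver returns a wrapper object holding both
// the query result and the server-side profile.
r.table('bigtable').count().run(conn, {profile: true}, function(err, result) {
  if (err) throw err;
  console.log('value =', result.value);
  console.log(JSON.stringify(result.profile, null, 2));
});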

@neumino (Member) commented Sep 22, 2014

We will add a constant-time count -- see #152 to track progress. #1118 should also provide a workaround.

For now, if the data doesn't fit in cache and RethinkDB runs on a hard drive, things can feel a bit slow.
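
Until then, one common workaround is to maintain a counter document and update it atomically alongside writes. A sketch, where the 'counters' table and 'data_total' key are made-up names:

// Bump the running total whenever a document is inserted (hypothetical names):
r.table('counters').get('data_total')
  .update({n: r.row('n').add(1)})
  .run(conn, function(err, res) { if (err) throw err; });

// Reading the count is then a single-document fetch instead of an index scan:
r.table('counters').get('data_total')('n').run(conn, function(err, n) {
  if (err) throw err;
  console.log('count =', n);
});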
