count using index for speed on large tables please #1733
Comments
As an aside: while document DBs are most applicable in cases where data is not relational at all, and indexes are primarily for grabbing a document by some attribute other than the primary key, being able to operate on the indexes themselves (or on combinations of multiple indexes) to acquire metadata (especially counts) would really help RethinkDB "feel" faster to data analysts and scientists using it. Inserting, updating, reading, or otherwise touching actual data can take time, but making any metadata you can very easy and fast to generate or access would be huge.
The code actually already does what you're talking about. In your query no documents should be read off of disk, and only the parts of the index which pertain to the given date should be processed. Indexes only get you so far in optimizing count, though. To optimize this further we'd need to do #1118, which would allow for constant-time counts. I'm going to close this issue, though, because as far as I can tell all of the optimizations it suggests are already implemented. People should feel free to reopen if I'm wrong.
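To make the constant-time idea concrete, here is a minimal sketch (not RethinkDB's actual implementation) of keeping a per-key tally alongside a secondary index, updated on every write, so a count query never has to walk index entries at all. `CountedIndex` and its methods are hypothetical names for illustration:

```python
from collections import Counter

# Sketch of a secondary index that maintains counts as metadata.
# count(key) is O(1) regardless of table size, because the tally is
# updated incrementally on every insert/remove instead of being
# recomputed by scanning index entries.
class CountedIndex:
    def __init__(self):
        self.entries = {}          # doc_id -> index key
        self.tally = Counter()     # index key -> number of documents

    def insert(self, doc_id, key):
        self.remove(doc_id)        # handle re-indexing an existing doc
        self.entries[doc_id] = key
        self.tally[key] += 1

    def remove(self, doc_id):
        old = self.entries.pop(doc_id, None)
        if old is not None:
            self.tally[old] -= 1

    def count(self, key):
        return self.tally[key]     # constant time, no entries touched
```

The trade-off is extra bookkeeping on the write path in exchange for counts that cost nothing at read time.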
With a table with about 130k documents, I get:
RethinkDB 1.7 -- 728ms
RethinkDB 1.11 -- 1.12s
Reopening to find/fix the regression.
Okay, something's clearly wrong here.
Scheduling for 1.11.x; this looks like a pretty serious performance regression. |
Could someone take this issue? It's pretty easy to hit and there are reports of it on Twitter. |
I took a look at this. I can confirm that there was a major slowdown in the speed of count (~5-10x). @srh -- your changes to make btree reads happen in parallel (the
Wait, belay that; I think that may not be where the slowdown starts. (I'm still curious about that RSI, though.) |
Yeah, it starts way after that. I just ran the wrong build when I was testing things. |
Wait, no, I was right the first time. Alright, here's the breakdown:
@srh -- could you look into making the concurrent traversal code faster for
There is a hackish solution to this that I'm implementing for people interested in having this be a point release. It's implemented and currently being tested; it should be in code review soon. The hackish solution is to look at the terminal and transforms and sindex_function in This is bad because it duplicates the logic in one place, but the hackishness will go away once we change the code to allow actually processing the datums concurrently. Right now we merely load them from disk concurrently.
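The shape of that fast-path check can be sketched roughly as follows. This is a toy model, not RethinkDB's code; `run_read`, `load_document`, and the string terminal are invented names for illustration:

```python
# Toy model of the fast-path dispatch described above: if the read has a
# count terminal and no transforms (or sindex function) to apply, answer
# from the index entries alone instead of materializing documents.
def load_document(doc_id):
    # Stand-in for a disk read; the fast path never calls this.
    return {"id": doc_id}

def run_read(index_entries, transforms, terminal):
    if terminal == "count" and not transforms:
        return len(index_entries)  # fast path: index metadata only
    # Slow path: fetch every matching document, apply transforms, then
    # evaluate the terminal over the materialized results.
    docs = [load_document(doc_id) for _, doc_id in index_entries]
    for t in transforms:
        docs = [t(d) for d in docs]
    return len(docs) if terminal == "count" else docs
```

This also shows why the approach is "hackish": the terminal-handling logic now lives in two places, which is exactly the duplication the comment above is apologizing for.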
This is currently implemented in branch sam_1733 (branched off v1.11.x) and in review 1114. |
This is merged to v1.11.x. |
RethinkDB 1.11.3 has been released with a fix for this issue. |
I'm running 1.14.1 and just tried a count() on a large table... very, very slow. 4.64s response time. It should be instantaneous.
Doing something like:
is very slow for large tables. It would be nice if instead we could do something like:
and get a count using the index. While I haven't looked at the code, it should be faster to skip getting all docs with a matching index (that is what getAll does, right?) and counting the results, if you could instead just look at the index itself and count what's there.
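The difference between the two approaches in the request can be sketched with a toy in-memory model (illustrative names only; `documents`, `group`, and the helper functions are not RethinkDB APIs): one counts by fetching every matching document, the other answers from a sorted index with two binary searches and no document reads.

```python
import bisect

# Toy table and secondary index on the "group" field.
documents = {i: {"id": i, "group": i % 100} for i in range(50_000)}
index = sorted((doc["group"], doc["id"]) for doc in documents.values())
index_keys = [k for k, _ in index]

def count_via_get_all(key):
    # Analogous to getAll(key, index=...).count(): materialize every
    # matching document, then count the results.
    matches = [documents[doc_id] for k, doc_id in index if k == key]
    return len(matches)

def count_via_index(key):
    # Analogous to the requested index-only count: two binary searches
    # over the sorted index keys, touching no documents at all.
    return (bisect.bisect_right(index_keys, key)
            - bisect.bisect_left(index_keys, key))
```

Both return the same answer, but the second does O(log n) work on index metadata where the first does O(matches) document fetches, which is the gap this issue is about.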