indexList() extremely slow under load #2071

While re-testing #1389, I noticed that executing r.table().indexList() took about 30s when the cluster was under load (here: backfilling + creating indexes + some insert queries). indexStatus() was similarly slow, taking up to 2 minutes!

Comments
Ok wrong issue title. New title: Everything becomes extremely slow under backfilling + sindex creation
Very low CPU utilization, moderate I/O.
This is an issue that had gone away but was reintroduced after the merge of the new cache, correct? In that case it should be in 1.12, not in polish. Such a regression is really important to fix.
Yes, I don't know if it was reintroduced or had been there for longer. I originally thought that this just affected indexList(), but it affects other things as well.
Oh and two nodes just crashed:
That might just be a case where we don't handle disconnects (due to heartbeat timeouts) properly, and could be unrelated to the performance issues.
Opened a new issue for the crash: #2073
The reason for the slow queries appears to be transaction throttling in the cache. I'm not sure how that could result in a heartbeat timeout though (it might be that the node was actually crashing, as in #2073).
Also, secondary index post-construction does some things that could interact very badly with the cache's throttling. I will look into this issue once #2073 is closed, but I expect that it will be comparably easy to fix.
Yeah, ok. Some things just hang in the message queue for way too long. Not sure why yet.
Did a few more tests; part of the problem (and possibly the cause of the heartbeat timeouts in this very write-intensive scenario) is #265. I could measure ftruncate calls taking multiple seconds, in one case more than 14 seconds. @srh: This means that we have to do CR 1214 for 1.12. With that part fixed, simple queries still take an extraordinarily long time during backfilling + sindex post-construction. It doesn't seem to be due to overly high message queue latencies though. It could be a throttling / locking issue now...
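For reference, latencies like that are easy to confirm by timing the syscall directly. A minimal standalone sketch (not RethinkDB code; the file name and size are made up for the example):

```cpp
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Times a single ftruncate() call. Under heavy write load, extending a
// file can block for seconds while the filesystem is busy.
int main() {
    int fd = open("testfile", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    auto start = std::chrono::steady_clock::now();
    if (ftruncate(fd, 1024LL * 1024 * 1024) != 0) perror("ftruncate");
    auto stop = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        stop - start).count();
    std::printf("ftruncate took %lld ms\n", ms);
    close(fd);
    return 0;
}
```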
With #265 fixed, I'm not seeing heartbeat timeouts anymore.
A partial fix is in code review 1335 by @AtnNn. The impact on the latency of simple operations under this workload is still highly unsatisfactory though (~4 seconds with those patches, down from more than 1 minute). More work is needed here.
@danielmewes -- out of curiosity, why did you make @AtnNn the reviewer instead of @srh? (Not sure how @AtnNn can review code he isn't too familiar with.)
@coffeemug: @AtnNn mentioned he was interested in reading the secondary index code. These changes aren't in the cache. Apart from that, my decision was based on the fact that @srh has other open issues to work on for 1.12.
Ah thanks -- just wondering. |
...seems that @srh wants to do the review. Reassigned. |
The partial fix is in next: 5359930
Just a quick status update: this is still pretty bad. It was already bad in 1.11 (simple queries there took in the range of 1-2 seconds), but it's worse in next (2-15 seconds).
What if we increase the dirty block throttling limit to a much larger value?
It's not a thought-out value right now -- 200 is particularly low, considering how quickly the blocks could be dumped to disk.
@srh: I had previously experimented with that a little. Removing the throttling entirely solves the issue (until the process runs out of memory), but increasing the limit to, for example, 2000 doesn't change all that much. I think it just takes a little longer until the limit is hit, and from there on it's basically the same. I have a theory of why this is so slow. Let me test that first...
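A back-of-the-envelope illustration of why raising the limit only delays the wall (the rates here are invented for the example): if the workload dirties 1,000 blocks per second and the flusher retires 500 per second, dirty blocks accumulate at a net 500 per second, so a limit of 200 is reached after 0.4 s and a limit of 2,000 after 4 s. Either way, once the limit is hit, every writer runs at the flush rate.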
@srh: Basically my hypothesis is that we have to do exactly this (from btree_store.cc):
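A rough sketch of that pattern (hypothetical names, with std::mutex standing in for the coroutine-aware lock -- not the actual btree_store.cc code): every write takes the same store-wide mutex before it can touch the sindex block, so all writes serialize behind whichever one is stalled on a flush.

```cpp
#include <mutex>

// Hypothetical sketch, not the actual btree_store.cc code: one mutex
// guards the secondary index block for the whole store.
struct btree_store_t {
    std::mutex sindex_block_mutex;

    void write_with_sindex_update(/* ... */) {
        // Every write queues on the same lock. If the current holder
        // is blocked on a slow disk flush (or on cache throttling),
        // all other writes wait out the full stall.
        std::lock_guard<std::mutex> guard(sindex_block_mutex);
        // ... acquire and update the sindex block ...
    }
};
```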
Now replacing the mutex with a fifo_enforcer to test whether that improves things...
Or replace it with a new_mutex_t? |
Oh right! I forgot that we had that. That's actually easier. :-) |
That alone wasn't enough. A couple of different optimizations and parameter adjustments were required to get rid of the latency issues. The short version is that backfills and sindex post-constructions now get throttled if the cache can't flush their data fast enough. Implemented in branch daniel_2071_2; in code review 1350 by @srh.
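Conceptually, that kind of throttling behaves like a bounded budget of unflushed dirty data: bulk writers (backfill, sindex post-construction) must reserve capacity before writing and only get it back once the cache has flushed, while regular queries are unaffected. A minimal sketch of the idea (invented names and structure, not the actual daniel_2071_2 code):

```cpp
#include <condition_variable>
#include <mutex>

// Sketch: throttle bulk writers on the amount of outstanding dirty data.
class dirty_data_throttler_t {
public:
    explicit dirty_data_throttler_t(size_t limit) : limit_(limit), dirty_(0) {}

    // Called by backfill / sindex post-construction before each write.
    // Blocks until the cache has flushed enough dirty data.
    // (Assumes bytes <= limit_, or the caller would wait forever.)
    void begin_bulk_write(size_t bytes) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [&] { return dirty_ + bytes <= limit_; });
        dirty_ += bytes;
    }

    // Called by the cache once the corresponding blocks have hit disk.
    void on_flushed(size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        dirty_ -= bytes;
        cond_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    const size_t limit_;
    size_t dirty_;
};
```

With a scheme like this, bulk writers slow down to the flush rate as soon as they outpace the disk, instead of filling the cache with dirty blocks and stalling everyone else.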
indexList() is still occasionally slow, but other non-sindex-related queries are ok now (and much faster in this scenario than in 1.11).
Closed by a4e7edc |