Extremely slow gets during insert workload #1820
Comments
underrun added the profiler output of a slow get: https://gist.github.com/underrun/68edbab57bcd4cfb4bf2
Here are two more profiler outputs using the mas_profiling branch: slow outdated read: https://gist.github.com/underrun/ee270127ee06a3c3891a
Quick update: there are probably different things at play here. One of them is related to #1385. I've seen that problem (the flush lock blocking things) account for delays of close to a second, sometimes two seconds, in the scenario described here. That in itself is already too much, but it doesn't explain query times in the minutes.
This issue is happening even when I'm not inserting anything. The cluster is completely idle except for a r.db.table.get running on it, and it's still taking minutes.
Without inserting, it is consistently fast reading now... I didn't look around, but I suspect that it was slow during backfilling. It is definitely consistently fast now that it has been a while since the last insert or change to the layout of the cluster.
@underrun: Thank you for the additional information. I could also observe slow gets only while the cluster was either backfilling or running the inserts.
I finally found out where the stalls on the message queue come from. There are actually three different places that take on the order of multiple seconds to process without yielding, all in the serializer. Two of them become increasingly severe as the table grows. They are:
These are things that definitely have to be fixed. Whether they are the sole cause of this issue remains to be seen.
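The mechanism described here, a handler that runs for multiple seconds without yielding and thereby stalls everything else queued on the same cooperative scheduler, can be illustrated with a small generic Python asyncio sketch. This is not RethinkDB code; the task names and timings are made up for illustration.

```python
# Generic illustration of a cooperative-scheduling stall (not RethinkDB source code).
import asyncio
import time

async def long_task(yield_periodically: bool):
    # Simulates a multi-second piece of work, e.g. scanning a large on-disk structure.
    deadline = time.monotonic() + 3.0
    while time.monotonic() < deadline:
        sum(range(10_000))              # busy work that never blocks on I/O
        if yield_periodically:
            await asyncio.sleep(0)      # hand control back to the scheduler

async def main(yield_periodically: bool):
    start = time.monotonic()

    async def point_get():
        # Stand-in for a queued read waiting for its turn on the message queue.
        await asyncio.sleep(0)
        print(f"read served after {time.monotonic() - start:.3f} s")

    await asyncio.gather(long_task(yield_periodically), point_get())

asyncio.run(main(yield_periodically=False))  # read is served only after ~3 s
asyncio.run(main(yield_periodically=True))   # read is served almost immediately
```

Without the periodic yield, the waiting read is only served once the long task has finished, which mirrors how a point get can sit behind a multi-second serializer operation; with the yield it is served almost immediately.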
I have fixed 2 and 3 in my branch. There are still two problems left:
Found the problem with the throttling system. Now looking into the remaining slow message.
It seems that the remaining slow message is part of the cache's writeback process. That will be gone when the new cache is merged in.
The LBA and throttling changes have been merged into next (54bab7f) and will be part of RethinkDB 1.12. With these changes, I didn't get any slow gets anymore.
@danielmewes, what about the …
@srh: I think it is still a problem in theory. However, looking at the alt cache, I see that …
As far as I know …
There exists a function …
It's used at least when a node used to be a replica for a shard and then loses that role. The delete was also just an example. A r.table().update() does it too.
@underrun reports simple point gets taking up to 6 minutes. However, this is not consistent; apparently the time for a get varies between 16 ms and 6 minutes.
He has an insert workload running in the background.
I'll try to reproduce it. The only idea I have off the top of my head is that it is some kind of locking or starvation issue.
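A minimal reproduction sketch of the reported workload might look like the following. The table name, connection details, and document shape are assumptions, not taken from the report: point gets are timed while a batch insert stands in for the background insert workload.

```python
# Hypothetical repro sketch: time point gets while inserts run in the background.
import time
import rethinkdb as r  # older Python driver API; newer releases use `from rethinkdb import RethinkDB`

conn = r.connect(host="localhost", port=28015, db="test")

def insert_batch(n=1000):
    # Stand-in for the background insert workload; in the report this runs continuously.
    docs = [{"payload": "x" * 512} for _ in range(n)]
    return r.table("events").insert(docs).run(conn)

def timed_get(key):
    # The point get whose latency reportedly varies between ~16 ms and several minutes.
    start = time.time()
    doc = r.table("events").get(key).run(conn)
    print("get took %.3f s" % (time.time() - start))
    return doc
```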
Here is the IRC log for details: