Heartbeat timeout when resharding #1708
Comments
One of the problems appears to be that we are sending out a lot of unnecessary directory updates. This does solve the timeout issue, but I would like to investigate a bit further to figure out
The real problem here is that the message queue takes too long to be processed. Among other things, timers don't get executed, which includes the heartbeat timer. I think I can fix this once and for all. Our message_hub already has a notion of processing granularity. However, it does not process OS events until the whole queue has been processed. That can be changed, though. In a second step, I'm going to adjust coroutine priorities a little, so that heartbeat-related things receive preferred treatment.
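To make the granularity idea concrete, here is a toy sketch, not the actual message_hub code. The `Hub` class, its method names, and `poll_os_events` are all hypothetical stand-ins; the only point is the difference between polling OS events once after the whole queue is drained and polling after every batch.

```python
from collections import deque


class Hub:
    """Toy message hub. All names are hypothetical, not the real message_hub API."""

    def __init__(self, granularity):
        self.granularity = granularity  # messages handled between OS polls
        self.queue = deque()
        self.log = []                   # records the order of work, for inspection

    def poll_os_events(self):
        # Stand-in for running timers (e.g. the heartbeat timer) and socket events.
        self.log.append("poll")

    def drain_whole_queue(self):
        # Behaviour described above: OS events run only once the queue is empty.
        while self.queue:
            self.log.append(self.queue.popleft())
        self.poll_os_events()

    def drain_in_batches(self):
        # Proposed behaviour: poll OS events after every `granularity` messages.
        while self.queue:
            for _ in range(min(self.granularity, len(self.queue))):
                self.log.append(self.queue.popleft())
            self.poll_os_events()


def run(method_name, n_messages=6, granularity=2):
    hub = Hub(granularity)
    hub.queue.extend(f"m{i}" for i in range(n_messages))
    getattr(hub, method_name)()
    return hub.log
```

With a long queue, `run("drain_whole_queue")` reaches the timers only once at the very end, while `run("drain_in_batches")` gives them a chance to fire after every batch, which is what keeps a heartbeat timer from starving.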
Also, heartbeats are sent out only half as often as they probably should be. This might be by design, I'm not sure. The reason is that sending the heartbeat itself notifies the writes tracker. So on the next call to
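A small simulation of how this could halve the heartbeat rate. This rests on an assumption that is not spelled out above: that each timer tick skips the heartbeat when the writes tracker reports a write since the previous tick. The `simulate` function and its flag are hypothetical and only model an otherwise idle connection.

```python
def simulate(ticks, heartbeat_counts_as_write):
    """Return the tick numbers at which a heartbeat is sent on an idle connection.

    Assumption (hypothetical model): a tick skips the heartbeat if the writes
    tracker saw a write since the previous tick, then resets the tracker.
    """
    wrote_since_last_tick = False
    sent = []
    for tick in range(ticks):
        if wrote_since_last_tick:
            # The connection looked active, so no heartbeat is sent this tick.
            wrote_since_last_tick = False
            continue
        sent.append(tick)
        if heartbeat_counts_as_write:
            # Sending the heartbeat notifies the tracker, as described above.
            wrote_since_last_tick = True
    return sent
```

Under this model, `simulate(6, True)` sends on every other tick, while `simulate(6, False)` sends on every tick.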
This is implemented. I also had to add a yield in a critical place during directory updates. We might still want to consider introducing a delay before sending out a directory update (speculating that we are going to send out an even newer version very soon anyway). That made things a lot more efficient in general, but is not really part of this issue.
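For reference, a rough sketch of what such a delay could look like, as a plain simulation rather than the real directory code. The `coalesce` function, its arguments, and the policy (the first change arms a timer, and whatever version is newest when the timer expires gets sent) are assumptions chosen for illustration.

```python
def coalesce(changes, delay):
    """Simulate delayed directory updates.

    changes: list of (time, version) pairs, sorted by time.
    Returns the list of (send_time, version) pairs actually sent.
    Hypothetical policy: the first change arms a timer for `delay`; when it
    expires, only the newest version seen so far is sent.
    """
    sent = []
    deadline = None   # time at which the pending update will go out
    latest = None     # newest version seen since the timer was armed
    for t, version in changes:
        if deadline is not None and t >= deadline:
            # The timer expired before this change arrived: flush.
            sent.append((deadline, latest))
            deadline = None
        if deadline is None:
            deadline = t + delay
        latest = version
    if deadline is not None:
        sent.append((deadline, latest))
    return sent
```

For a burst of changes at times 0, 1 and 2 followed by one at time 10, a delay of 5 sends two updates instead of four, at the cost of each update going out up to 5 time units late.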
Merged into next 6dd8eee
The problem of heartbeat timeouts seemed to be gone, but apparently not on large clusters.
If I reshard a table from 1 to 32 shards on a 64-node cluster (on EC2), heartbeat timeouts occur between some of the nodes.
This happened with branch daniel_directory_efficiency.