Heartbeat timeout when resharding #1708

Closed
danielmewes opened this issue Nov 26, 2013 · 5 comments

@danielmewes

The problem of heartbeat timeouts seemed to be gone, but apparently not on large clusters.

If I reshard a table from 1 to 32 shards on a 64 node cluster (on EC2), heartbeat timeouts occur between some of the nodes.

This happened with branch daniel_directory_efficiency.

@ghost ghost assigned danielmewes Nov 26, 2013
@danielmewes

One of the problems appears to be that we are sending out a lot of unnecessary directory updates.
If I add a short delay (e.g. 100ms) before actually sending an updated version of the directory to the other nodes and then send only the most recent version, that reduces the number of directory messages by a factor of roughly 20.
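
The delay-and-coalesce idea can be sketched roughly as follows. This is a minimal model and not the actual directory code: `coalescing_sender_t`, `on_update`, and `on_tick` are hypothetical names, time is passed in as a plain millisecond counter, and "sending" just appends to a vector.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical coalescing sender: after the first unsent change it waits
// `delay_ms`, then sends only the most recent directory version.
class coalescing_sender_t {
public:
    explicit coalescing_sender_t(int64_t delay_ms) : delay_ms_(delay_ms) {
        assert(delay_ms_ >= 0);
    }

    // Record a new directory version at logical time `now_ms`.
    void on_update(int64_t now_ms, std::string version) {
        if (!has_pending_) {
            // First change since the last send: start the delay window.
            deadline_ms_ = now_ms + delay_ms_;
            has_pending_ = true;
        }
        // A newer version simply replaces the one that was waiting.
        pending_ = std::move(version);
    }

    // Called periodically; sends the pending version once the window is over.
    void on_tick(int64_t now_ms) {
        if (has_pending_ && now_ms >= deadline_ms_) {
            sent_.push_back(pending_);
            pending_.clear();
            has_pending_ = false;
        }
    }

    const std::vector<std::string> &sent() const { return sent_; }

private:
    int64_t delay_ms_;
    int64_t deadline_ms_ = 0;
    bool has_pending_ = false;
    std::string pending_;
    std::vector<std::string> sent_;
};
```

With a 100ms window, any number of updates that arrive inside the window result in a single message carrying the latest version, which is where the reduction in directory traffic would come from.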

This does solve the timeout issue, but I would like to investigate a bit further to figure out
a) why the large number of directory updates leads to heartbeat timeouts, and
b) where exactly the unnecessary (or rather partial) updates originate.

@danielmewes

The real problem here is that the message queue takes too long to be processed. Among other things, timers don't get executed in the meantime, and that includes the heartbeat timer.

I think I can fix this once and for all. Our message_hub already has a notion of processing granularity. However, it does not process OS events until the whole queue has been processed. That can be changed, though.
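
As a rough illustration of the intended change, here is a sketch of a queue that gives OS events a chance to run after every batch of messages instead of only once the queue is empty. It does not reproduce the real message_hub: `batched_queue_t`, `drain`, and the `poll_os_events` callback are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical message queue that processes at most `granularity` messages
// per batch and then lets OS events (timers, sockets) run.
class batched_queue_t {
public:
    batched_queue_t(std::size_t granularity, std::function<void()> poll_os_events)
        : granularity_(granularity), poll_os_events_(std::move(poll_os_events)) {
        assert(granularity_ > 0);
    }

    void push(std::function<void()> msg) { queue_.push_back(std::move(msg)); }

    // Drain the queue in batches, polling OS events after each batch rather
    // than only after the whole queue has been processed.
    void drain() {
        while (!queue_.empty()) {
            for (std::size_t i = 0; i < granularity_ && !queue_.empty(); ++i) {
                std::function<void()> msg = std::move(queue_.front());
                queue_.pop_front();
                msg();
            }
            poll_os_events_();  // timers such as the heartbeat timer can fire here
        }
    }

private:
    std::size_t granularity_;
    std::function<void()> poll_os_events_;
    std::deque<std::function<void()>> queue_;
};
```

With a granularity of 8 and 20 queued messages, the poll callback runs three times during the drain, so a long queue no longer delays timers by the full processing time.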

In a second step, I'm going to adjust coroutine priorities a little so that heartbeat-related work receives preferential treatment.

@danielmewes

Also, heartbeats are sent out only half as often as they probably should be. This might be by design; I'm not sure.

The reason is that sending the heartbeat itself notifies the writes tracker. So on the next call to heartbeat_manager_t::on_timer(), the heartbeat manager sees that some write (in this case the heartbeat message) has happened since the last time it checked, and does not initiate another write. Effectively this means that a heartbeat is only sent during every second call to on_timer().
As a consequence, heartbeats are sent out every 4 seconds instead of every 2 seconds.
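
The every-second-tick behaviour can be reproduced with a toy model. This is not the real heartbeat_manager_t: `heartbeat_model_t`, `count_heartbeats`, and the `heartbeat_counts_as_write` flag are hypothetical and only mimic the interaction with the writes tracker described above.

```cpp
#include <cassert>

// Toy model of a heartbeat manager that skips the heartbeat whenever some
// write happened since the previous timer tick.
class heartbeat_model_t {
public:
    explicit heartbeat_model_t(bool heartbeat_counts_as_write)
        : heartbeat_counts_as_write_(heartbeat_counts_as_write) {}

    // Returns true if a heartbeat was sent on this tick.
    bool on_timer() {
        if (write_since_last_check_) {
            // Something was written recently, so no heartbeat is needed.
            write_since_last_check_ = false;
            return false;
        }
        // The connection was idle: send a heartbeat. If the heartbeat itself
        // notifies the writes tracker, the next tick will see a "write".
        if (heartbeat_counts_as_write_) {
            write_since_last_check_ = true;
        }
        return true;
    }

private:
    bool heartbeat_counts_as_write_;
    bool write_since_last_check_ = false;
};

// Count how many heartbeats are sent over `ticks` timer calls on an
// otherwise idle connection.
int count_heartbeats(int ticks, bool heartbeat_counts_as_write) {
    assert(ticks >= 0);
    heartbeat_model_t model(heartbeat_counts_as_write);
    int sent = 0;
    for (int i = 0; i < ticks; ++i) {
        if (model.on_timer()) {
            ++sent;
        }
    }
    return sent;
}
```

Over 10 ticks on an idle connection, the model sends 5 heartbeats when the heartbeat counts as a write and 10 when it does not, matching the halved rate.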

@danielmewes

This is implemented. I also had to add a yield in a critical place during directory updates.
It is in code review 1081.

We might still want to consider introducing the delay before sending out a directory update (speculating that we are going to send out an even newer version very soon anyway). That made things a lot more efficient in general, but is not really part of this issue.

@danielmewes

Merged into next as 6dd8eee.
