Heartbeat timeout when resharding #1708

danielmewes · 2013-11-26T01:36:41Z

The problem of heartbeat timeouts seemed to be gone, but apparently not on large clusters.

If I reshard a table from 1 to 32 shards on a 64 node cluster (on EC2), heartbeat timeouts occur between some of the nodes.

This happened with branch daniel_directory_efficiency.


danielmewes · 2013-11-27T01:34:29Z

One of the problems appears to be that we are sending out a lot of unnecessary directory updates.
If I add a short delay (e.g. 100ms) before actually sending an updated version of the directory to the other nodes and then send only the most recent version, that reduces the number of directory messages by a factor of roughly 20.
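The coalescing idea described above can be sketched roughly as follows. This is a minimal model, not the RethinkDB implementation: `CoalescingSender`, `simulate`, and the 5 ms update spacing are hypothetical names and numbers chosen for illustration, and time is passed in explicitly so the example is deterministic.

```python
class CoalescingSender:
    """Buffers directory versions and sends only the newest one per delay window.

    Hypothetical illustration; not the actual RethinkDB directory code.
    """

    def __init__(self, send, delay_ms=100):
        self.send = send          # callable that actually transmits a version
        self.delay_ms = delay_ms
        self.pending = None       # most recent version that has not been sent
        self.deadline = None      # time at which the pending version is flushed

    def update(self, now_ms, version):
        # Overwrite any unsent version; the delay window starts only once.
        self.pending = version
        if self.deadline is None:
            self.deadline = now_ms + self.delay_ms

    def tick(self, now_ms):
        # Flush the latest version once the delay window has elapsed.
        if self.deadline is not None and now_ms >= self.deadline:
            self.send(self.pending)
            self.pending = None
            self.deadline = None


def simulate(num_updates=200, spacing_ms=5, delay_ms=100):
    """Feed evenly spaced updates through the sender and return what was sent."""
    sent = []
    sender = CoalescingSender(sent.append, delay_ms)
    for i in range(num_updates):
        now = i * spacing_ms
        sender.tick(now)
        sender.update(now, i)
    # Final tick far enough in the future to flush the last pending version.
    sender.tick(num_updates * spacing_ms + delay_ms)
    return sent
```

With these assumed numbers (one update every 5 ms, a 100 ms delay), 200 updates collapse into 10 messages, and the last message always carries the newest version. The actual reduction depends on how bursty the real updates are.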

This does solve the timeout issue, but I would like to investigate a bit further to figure out
a) why the large number of directory updates would lead to heartbeat timeouts and
b) where exactly the unnecessary (or rather partial) updates originate

danielmewes · 2013-11-27T23:49:58Z

The real problem here is that the message queue takes too long to be processed. Among other things, timers don't get executed while it is being processed, and that includes the heartbeat timer.

I think I can fix this once and for all. Our message_hub already has a notion of processing granularity. However, it does not process OS events before the whole queue has been processed. That can be changed though.
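As a rough sketch of that change, assuming a plain FIFO queue: the loop below handles messages in batches and gives OS events (including timers) a chance to run after each batch. `drain`, `handle`, `poll_os_events`, and the default granularity of 8 are hypothetical and do not reflect the real message_hub interface.

```python
from collections import deque


def drain(queue, handle, poll_os_events, granularity=8):
    """Process all queued messages, polling OS events after every batch.

    `granularity` is the number of messages handled between two polls.
    """
    since_poll = 0
    while queue:
        handle(queue.popleft())
        since_poll += 1
        if since_poll == granularity:
            poll_os_events()   # timers such as the heartbeat timer can fire here
            since_poll = 0
    if since_poll:
        poll_os_events()       # poll once more after a trailing partial batch


def count_polls(num_messages, granularity):
    """Helper for experimenting: returns (handled messages, number of polls)."""
    handled = []
    polls = []
    drain(deque(range(num_messages)), handled.append,
          lambda: polls.append(len(handled)), granularity)
    return handled, len(polls)
```

Setting `granularity` larger than the queue length models the behavior described above, where OS events are only processed after the whole queue has been drained.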

In a second step, I'm going to adjust coroutine priorities a little, so heartbeat-related things receive preferred treatment.

danielmewes · 2013-11-28T00:46:13Z

Also, heartbeats are sent out only half as often as they probably should be. This might be by design, I'm not sure.

The reason is that sending the heartbeat itself notifies the writes tracker. So on the next call to heartbeat_manager_t::on_timer(), the heartbeat manager sees that some write (in this case the heartbeat message) has happened since the last time it checked, and does not initiate another write. Effectively this means that a heartbeat is only sent during every second call to on_timer().
As a consequence, heartbeats are sent out every 4 seconds instead of every 2 seconds.
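The effect can be reproduced with a small model. This is a simplified sketch under stated assumptions, not the actual code: `WritesTracker`, `HeartbeatManager`, `run`, and the `count_own_heartbeats` flag are hypothetical stand-ins, and only the name `on_timer` comes from the description above.

```python
class WritesTracker:
    """Remembers whether any write happened since the last check."""

    def __init__(self):
        self.wrote_since_check = False

    def notify_write(self):
        self.wrote_since_check = True

    def check_and_reset(self):
        wrote = self.wrote_since_check
        self.wrote_since_check = False
        return wrote


class HeartbeatManager:
    """Sends a heartbeat on a timer tick only if the connection looked idle."""

    def __init__(self, tracker, count_own_heartbeats=True):
        self.tracker = tracker
        self.count_own_heartbeats = count_own_heartbeats
        self.sent_at = []

    def on_timer(self, now_s):
        if self.tracker.check_and_reset():
            return  # some write happened since the last tick, so skip
        self.sent_at.append(now_s)
        if self.count_own_heartbeats:
            # The heartbeat itself is a write and marks the connection as active.
            self.tracker.notify_write()


def run(ticks, interval_s=2, count_own_heartbeats=True):
    """Fire the timer `ticks` times on an otherwise idle connection."""
    manager = HeartbeatManager(WritesTracker(), count_own_heartbeats)
    for i in range(ticks):
        manager.on_timer(i * interval_s)
    return manager.sent_at
```

On an idle connection with a 2 second timer, `run(6)` yields heartbeats at 0, 4 and 8 seconds, while `run(6, count_own_heartbeats=False)` yields one on every tick.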

danielmewes · 2013-12-07T03:03:56Z

This is implemented. I also had to add a yield in a critical place during directory updates.
It is in code review 1081.

We might still want to consider introducing the delay before sending out a directory update (speculating that we are going to send out an even newer version very soon anyway). That made things a lot more efficient in general, but is not really part of this issue.

danielmewes · 2013-12-11T19:43:28Z

Merged into next in 6dd8eee.

ghost assigned danielmewes Nov 26, 2013

danielmewes mentioned this issue Dec 7, 2013

Web interface and CLI when the server is under heavy load #1183

Closed

danielmewes closed this as completed Dec 11, 2013

This was referenced Nov 18, 2020

[Snyk] Security upgrade yargs from 3.32.0 to 16.0.0 DhavalW/rethinkdb#13

Open

[Snyk] Security upgrade yargs from 3.32.0 to 16.0.0 enterstudio/rethinkdb#15

Open
