SWIM: improve dissemination speed to log(N) #4253

Closed
Gerold103 opened this issue May 27, 2019 · 3 comments

Comments

@Gerold103
Collaborator

There is a paper with mathematical proofs of some gossip basics: "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols". There is also a video explaining another paper: https://www.youtube.com/watch?v=Gxf5glthqrk. According to this research, there is a method of spreading gossip with log(log(N)) speed versus the usual log(N). It requires push-pull dissemination.

SWIM works as push dissemination: a member never asks others to send it anything, it only sends packets to them. Pull dissemination is when a member asks to be sent info about certain members. It is stated that, combined, they give log(log(N)) dissemination speed.

We could turn SWIM into a push-pull protocol by piggybacking the dissemination component not only on round messages, but on ACK messages too. Then each round message, which contains a ping (they do contain one), would be answered with an ACK carrying another portion of dissemination.

A clear example of how this speeds dissemination up: assume node A learned that node C is dead and starts spreading a gossip about that. There is also a node B which hasn't learned the gossip yet, but it sends a ping to A, and A answers with an ACK. In traditional SWIM that ACK does not make B aware of C's death. In the proposed implementation the ACK will carry the news of C's death.

The log(log(N)) estimate still needs to be proven; it is only an assumption based on some slides. Regardless of the result of that investigation, the feature certainly speeds dissemination up. The only unknown is by how much.
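For illustration, here is a standalone model (not the SWIM module itself, just a sketch of the idea) that counts round steps until one gossip reaches everyone, with and without the "pull" part where the ACK carries the gossip back:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One round step: every member pings one random peer. In push mode only
 * the ping can carry the gossip; in push-pull mode the ACK carries it
 * back as well. Returns the number of steps until everyone knows. */
static int
simulate(int n, bool push_pull)
{
	bool *knows = calloc(n, sizeof(*knows));
	bool *next = malloc(n * sizeof(*next));
	knows[0] = true;
	int steps = 0, informed = 1;
	while (informed < n) {
		memcpy(next, knows, n * sizeof(*next));
		for (int i = 0; i < n; ++i) {
			int peer = rand() % n;
			if (knows[i])
				next[peer] = true;	/* push via ping */
			else if (push_pull && knows[peer])
				next[i] = true;		/* pull via ACK */
		}
		memcpy(knows, next, n * sizeof(*next));
		informed = 0;
		for (int i = 0; i < n; ++i)
			informed += knows[i];
		++steps;
	}
	free(next);
	free(knows);
	return steps;
}

int
main(void)
{
	srand(42);
	for (int n = 100; n <= 800; n *= 2) {
		printf("n = %d: push only %d steps, push-pull %d steps\n",
		       n, simulate(n, false), simulate(n, true));
	}
	return 0;
}
```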

@Gerold103 Gerold103 added feature A new functionality app labels May 27, 2019
@Gerold103 Gerold103 self-assigned this May 27, 2019
@Gerold103
Collaborator Author

The same paper, "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols", presents the Scuttlebutt protocol, which has impressive ideas on lowering network load when there are lots of updates. Probably it could somehow be adopted in addition to the push-pull ideas.

@kyukhin kyukhin added this to the wishlist milestone May 28, 2019
@Gerold103
Collaborator Author

In addition to events we can add the anti-entropy section too - to each message. According to Yaroslav's tests, a cluster of 100 nodes with such packing reaches full mesh in 8 round steps. Even in Lua.
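Roughly, the packing could look like this (a sketch with made-up sizes and names, not the actual swim.c encoder): events go first, and whatever space is left in the UDP packet is spent on anti-entropy members, so every message still carries a slice of the full member table.

```c
#include <stdio.h>

/* All sizes here are assumptions for illustration only. */
enum {
	UDP_PACKET_SIZE = 1472,	/* MTU-safe UDP payload */
	EVENT_SIZE = 56,	/* one encoded dissemination event */
	MEMBER_SIZE = 56,	/* one encoded anti-entropy member */
};

/* Returns how many anti-entropy members fit after the events. */
static int
anti_entropy_budget(int event_count)
{
	int events = event_count;
	if (events * EVENT_SIZE > UDP_PACKET_SIZE)
		events = UDP_PACKET_SIZE / EVENT_SIZE;
	int left = UDP_PACKET_SIZE - events * EVENT_SIZE;
	return left / MEMBER_SIZE;
}

int
main(void)
{
	/* Even during an event storm a few members still fit. */
	printf("0 events -> %d members, 20 events -> %d members\n",
	       anti_entropy_budget(0), anti_entropy_budget(20));
	return 0;
}
```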

Gerold103 added a commit that referenced this issue Jun 29, 2019
Before the patch, old events were disseminated before new ones
until their TTD reached 0. It was considered fair and not very
important. But after some experiments with a closed-source
version of SWIM it turned out that I could not have been more
wrong.

Firstly, the original SWIM paper explicitly says that newer
events should be sent first. Secondly, this really serves a
purpose. For example, when a big cluster has just started, there
is a huge number of events like 'a new member is added'. They
consume the whole UDP packet, and newer events like 'a member is
dead' starve. In fact, dissemination just doesn't work at the
start of a big cluster.

This patch makes SWIM prefer newer events for dissemination.

Part of #4253
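For illustration, the ordering change boils down to something like this (hypothetical structures, not the actual event queue in swim.c): sort the queue so that the freshest events are encoded into the UDP packet first.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical event record: 'seq' grows with every new event. */
struct event {
	int member_id;
	int seq;
};

/* Newest first: a larger seq sorts earlier and is packed earlier. */
static int
cmp_newest_first(const void *a, const void *b)
{
	const struct event *l = a;
	const struct event *r = b;
	return r->seq - l->seq;
}

int
main(void)
{
	struct event queue[] = {{1, 10}, {2, 42}, {3, 7}};
	qsort(queue, 3, sizeof(queue[0]), cmp_newest_first);
	/* Only as many leading slots as fit into the packet are sent,
	 * so old 'member added' events no longer starve fresh ones. */
	for (int i = 0; i < 3; ++i)
		printf("member %d, seq %d\n", queue[i].member_id, queue[i].seq);
	return 0;
}
```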
Gerold103 added a commit that referenced this issue Jun 29, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. If something important happens during such
an event storm, that event is likely to be lost and not
disseminated until the storm is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to the experiments, the
logarithm is really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Linear dissemination is the worst problem.
After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

Part of #4253
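The TTD change itself amounts to roughly this (a sketch, not the exact formula used in the module):

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Old behaviour: an event was disseminated for N round steps.
 * New behaviour: ceil(log2(N)) steps are enough for an epidemic
 * broadcast, so in a 100-node cluster TTD drops from 100 to 7. */
static int
event_ttd(int cluster_size, bool logarithmic)
{
	if (!logarithmic)
		return cluster_size;
	if (cluster_size < 2)
		cluster_size = 2;
	return (int)ceil(log2(cluster_size));
}

int
main(void)
{
	printf("linear TTD: %d, logarithmic TTD: %d\n",
	       event_ttd(100, false), event_ttd(100, true));
	return 0;
}
```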
Gerold103 added a commit that referenced this issue Jun 30, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. A packet fits up to 26 events. If
something important happens during such an event storm, that
event is likely to be lost and not disseminated until the storm
is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to experiments, the logarithm is
really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Basically, no dissemination at all.

After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

However, the patch can't be pushed as is. Several tests are
broken now, because due to the short TTD 'dead' members are
dropped from member tables too early. That creates a possibility
that another node resurrects them via anti-entropy.

Part of #4253
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. There should be a
way to keep dead and left members until the whole cluster knows
about their death.

This patch unbinds the TTL and TTD concepts. When a member is
learned to be dead or left, it is disseminated for log() steps,
but kept in the member table for C * log() steps. There the
members wait until everyone knows about their death with a very
high probability.

C is chosen as 10, because according to experiments it is the
minimal value covering all the cases when a death event is
disseminated for too long.

Unfortunately, even a TTL, even a big one, does not help when a
user sets the heartbeat rate to too small a value. For example,
if heartbeat_rate is 0.01 and the cluster is of size 100, no
member can ever be safely deleted automatically: dead and left
members will live no longer than 1 second - not enough to
disseminate their death to everyone. A possible solution is to
measure TTL in time instead of round steps. Then it won't depend
on the heartbeat rate.

Part of #4253
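A sketch of the split (hypothetical fields, not the real struct swim_member): TTD limits how long the death event keeps being packed into messages, while TTL limits how long the dead member stays in the member table.

```c
#include <math.h>
#include <stdio.h>

/* Assumed constant from the commit message: keep dead members
 * C times longer than their status is actively disseminated. */
enum { GC_FACTOR = 10 };

struct dead_member {
	int ttd;	/* round steps the death event is still packed */
	int ttl;	/* round steps until the member is dropped */
};

static void
member_mark_dead(struct dead_member *m, int cluster_size)
{
	int log_n = (int)ceil(log2(cluster_size > 1 ? cluster_size : 2));
	m->ttd = log_n;
	m->ttl = GC_FACTOR * log_n;
}

int
main(void)
{
	struct dead_member m;
	member_mark_dead(&m, 100);
	printf("disseminate for %d steps, keep in table for %d steps\n",
	       m.ttd, m.ttl);
	return 0;
}
```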
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. Cluster nodes need
to be suspicious when someone tells them to add a new, not dead
member.

This patch makes SWIM add a new member in only two cases:
manually, or if an ACK was received from it. A new member can't
be added indirectly via events and anti-entropy anymore.
Instead, a ping is sent to the members that are said to be new
and alive. If an ACK is received directly from them, then they
are added.

The patch does not affect updates. They are still indirect,
because if something has changed in an existing member, it is
definitely alive.

Part of #4253
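The rule can be sketched like this (stubbed helpers, not the swim.c API): unknown members reported alive are pinged instead of added, and only a direct ACK (or a manual add) inserts them into the member table.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stubs for illustration only. */
struct member { int id; };
static struct member *member_by_id(int id) { (void)id; return NULL; }
static void member_add(int id) { printf("add member %d\n", id); }
static void send_ping(int id) { printf("ping member %d\n", id); }

/* Called when a packet (event or anti-entropy) mentions member 'id'. */
static void
on_member_info(int id, bool alive, bool is_direct_ack)
{
	if (member_by_id(id) != NULL)
		return;	/* updates of already known members stay indirect */
	if (!alive)
		return;	/* never add someone who is reported dead */
	if (is_direct_ack)
		member_add(id);	/* the member proved it is reachable */
	else
		send_ping(id);	/* be suspicious: ask directly first */
}

int
main(void)
{
	on_member_info(5, true, false);	/* learned from gossip -> ping */
	on_member_info(5, true, true);	/* direct ACK arrived -> add */
	return 0;
}
```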
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. There should be a
way to keep dead and left members until the whole cluster knows
about their death.

This patch unbinds the TTL and TTD concepts. When a member is
learned to be dead or left, it is disseminated for log() steps,
but kept in the member table for C * log() steps. There the
members wait until everyone knows about their death with a very
high probability.

C is chosen as 10, because according to experiments it is the
minimal value covering all the cases when a death event is
disseminated for too long and heartbeat_rate is 1 second.

Unfortunately, even a TTL, even a big one, does not help when a
user sets the heartbeat rate to too small a value. For example,
if heartbeat_rate is 0.01 and the cluster is of size 100, no
member can ever be safely deleted automatically: dead and left
members will live no longer than 1 second - not enough to
disseminate their death to everyone. A possible solution is to
measure TTL in time instead of round steps. Then it won't depend
on the heartbeat rate. But it still does not eliminate the
resurrection probability.

A third solution could be to remove automatic GC - it would
simplify many things. Besides, it seems like no one is going to
use it.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. A packet fits up to 26 events. If
something important happens during such an event storm, that
event is likely to be lost and not disseminated until the storm
is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to experiments, the logarithm is
really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Basically, no dissemination at all.

After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
SWIM sends basically the same message during a round. There was
a micro-optimization to avoid reassembling the message on each
step. Now it is getting harder to support that island of
perfectionism, because

    * Soon all messages will carry all the sections, including
      indirect messages. Their body is smaller, so it is not
      possible to maintain one cached message without reducing
      its maximal size;

    * In big clusters the cached message would need to be
      rebuilt even without any changes, because the anti-entropy
      section won't help much unless it is changed frequently
      enough;

    * In big clusters changes happen often enough to invalidate
      the cached message constantly, unless SWIM maintained
      which members are included in the cache and which are not.
      Then a change of a member not included in the message
      would not affect the cache. But that would complicate the
      code too much.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
There were tests relying on certain content of SWIM messages.
After the next patches these conditions won't hold without
explicit intervention via error injections.

The patchset moves these tests to separate release-disabled
files.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
With the following patches some of the tests will work much
slower due to the significantly increased size of most packets.

This commit tries to smooth that out by

    * Turning off verbose logs in unit tests;
    * Using a much lighter version of the UUID comparator.

According to the profiler these changes speed the tests up by a
factor of a couple, and at the same time they are simple.

Needed for #4253
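The "lighter comparator" part can be shown in a few lines (a sketch; the real tt_uuid has named fields, but for ordering members a byte-wise comparison gives a perfectly valid total order and skips field-by-field parsing):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct uuid {
	uint8_t data[16];
};

/* One memcmp instead of comparing time_low, time_mid, etc. */
static inline int
uuid_compare_light(const struct uuid *a, const struct uuid *b)
{
	return memcmp(a->data, b->data, sizeof(a->data));
}

int
main(void)
{
	struct uuid a = {{0}};
	struct uuid b = {{0}};
	b.data[15] = 1;
	printf("%d\n", uuid_compare_light(&a, &b) < 0); /* 1: a < b */
	return 0;
}
```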
Gerold103 added a commit that referenced this issue Jul 3, 2019
Another place consuming most of the tests' start-up time is
useless dissemination of an empty payload, which can in fact be
skipped.

Consider a cluster of 300 nodes. All of them are interconnected
manually, and now a test wants to wait for stabilization, when
there are no events. On such a cluster it takes ~200 round steps
until there is not a single event left.

This is not about big packets or log() TTD. There may be a few
events or more, but when a test wants the cluster to be clean,
it needs to wait for all the events to be done.

This patch exploits the fact that empty payloads can be compared
for free, without a single memcmp. If both the new and the old
payload are empty, there is nothing to disseminate.

It could help in a real cluster too, if there are no payloads
initially.

Needed for #4253
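The shortcut itself is tiny (a sketch with a hypothetical signature): if both the old and the new payload are empty, they are reported equal without touching the bytes, so no dissemination is scheduled.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true when the payload did not change, i.e. there is
 * nothing to disseminate. Empty vs empty costs no memcmp at all. */
static bool
payload_is_same(const char *old_data, int old_size,
		const char *new_data, int new_size)
{
	if (old_size != new_size)
		return false;
	if (new_size == 0)
		return true;	/* both empty - comparison is free */
	return memcmp(old_data, new_data, new_size) == 0;
}

int
main(void)
{
	printf("%d\n", payload_is_same(NULL, 0, NULL, 0));	/* 1: skip */
	printf("%d\n", payload_is_same("a", 1, "b", 1));	/* 0: send */
	return 0;
}
```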
@Gerold103 Gerold103 changed the title from "SWIM: improve dissemination speed to log(log(N))" to "SWIM: improve dissemination speed to log(N)" Jul 4, 2019
@Gerold103
Collaborator Author

After all, I've found the original source of that loglog idea: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6750&rep=rep1&type=pdf. I mistakenly thought that loglog is about time, but it turned out to be about the number of messages. In short, the paper says there is a way to send n*log(log(n)) messages in log(n) round steps to disseminate one event.

Anyway, the issue is worth implementing, because without it SWIM works at least twice as slow as log() when the number of nodes is <= 800 (more was not tested).

Gerold103 added a commit that referenced this issue Jul 4, 2019
The original SWIM paper says that the dissemination time of an
event is O(log(N)), where N is the size of the cluster. That is
true when both ping and ack messages carry dissemination and
anti-entropy. Before this patch it wasn't so - only regular
pings carried anything.

After this patch the SWIM module has truly exponential
dissemination speed.

Closes #4253
@kostja kostja modified the milestones: wishlist, 2.2.2 Aug 6, 2019