SWIM: improve dissemination speed to log(N) #4253

Closed
Gerold103 opened this issue May 27, 2019 · 3 comments

Comments

@Gerold103
Collaborator

There is a paper with mathematical proofs of some gossip basics: "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols". There is also a video explaining another paper: https://www.youtube.com/watch?v=Gxf5glthqrk. According to this research, there is a method of spreading gossip with log(log(N)) speed versus the usual log(N). It requires push-pull dissemination.

SWIM works as push dissemination: a member never asks others to send it anything, it only sends packets to them. Pull dissemination is when a member asks to be sent info about certain members. It is stated that, combined, they give log(log(N)) dissemination speed.

We could turn SWIM into a push-pull protocol by piggybacking the dissemination component not only on round messages, but on ACK messages too. Then each round message, which contains a ping (they do contain one), would be answered with an ACK carrying another portion of dissemination.

A clear example of how this speeds dissemination up: assume node A learned that node C is dead and starts spreading a gossip about that. There is also a node B which hasn't learned the gossip yet, but it sends a ping to A, and A answers with an ACK. In traditional SWIM that ACK does not make B aware of C's death. In the proposed implementation the ACK will carry the news of C's death.

The log(log(N)) estimate still needs to be proven; it is only an assumption based on some slides. Regardless of the result of that investigation, the feature certainly speeds dissemination up. The only unknown is by how much.
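For illustration, here is a standalone model (not the SWIM module itself, just a sketch of the idea) that counts round steps until one gossip reaches everyone, with and without the "pull" part where the ACK carries the gossip back:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One round step: every member pings one random peer. In push mode only
 * the ping can carry the gossip; in push-pull mode the ACK carries it
 * back as well. Returns the number of steps until everyone knows. */
static int
simulate(int n, bool push_pull)
{
	bool *knows = calloc(n, sizeof(*knows));
	bool *next = malloc(n * sizeof(*next));
	knows[0] = true;
	int steps = 0, informed = 1;
	while (informed < n) {
		memcpy(next, knows, n * sizeof(*next));
		for (int i = 0; i < n; ++i) {
			int peer = rand() % n;
			if (knows[i])
				next[peer] = true;	/* push via ping */
			else if (push_pull && knows[peer])
				next[i] = true;		/* pull via ACK */
		}
		memcpy(knows, next, n * sizeof(*next));
		informed = 0;
		for (int i = 0; i < n; ++i)
			informed += knows[i];
		++steps;
	}
	free(next);
	free(knows);
	return steps;
}

int
main(void)
{
	srand(42);
	for (int n = 100; n <= 800; n *= 2) {
		printf("n = %d: push only %d steps, push-pull %d steps\n",
		       n, simulate(n, false), simulate(n, true));
	}
	return 0;
}
```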

@Gerold103 Gerold103 added feature A new functionality app labels May 27, 2019
@Gerold103 Gerold103 self-assigned this May 27, 2019
@Gerold103
Collaborator Author

The same paper, "Efficient Reconciliation and Flow Control for Anti-Entropy Protocols", presents the Scuttlebutt protocol, which has impressive ideas on lowering network load when there are lots of updates. Probably it could somehow be adopted in addition to the push-pull ideas.

@kyukhin kyukhin added this to the wishlist milestone May 28, 2019
@Gerold103
Collaborator Author

In addition to events we can add the anti-entropy section too - to each message. According to Yaroslav's tests, a cluster of 100 nodes with such packing reaches full mesh in 8 round steps. Even in Lua.
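Roughly, the packing could look like this (a sketch with made-up sizes and names, not the actual swim.c encoder): events go first, and whatever space is left in the UDP packet is spent on anti-entropy members, so every message still carries a slice of the full member table.

```c
#include <stdio.h>

/* All sizes here are assumptions for illustration only. */
enum {
	UDP_PACKET_SIZE = 1472,	/* MTU-safe UDP payload */
	EVENT_SIZE = 56,	/* one encoded dissemination event */
	MEMBER_SIZE = 56,	/* one encoded anti-entropy member */
};

/* Returns how many anti-entropy members fit after the events. */
static int
anti_entropy_budget(int event_count)
{
	int events = event_count;
	if (events * EVENT_SIZE > UDP_PACKET_SIZE)
		events = UDP_PACKET_SIZE / EVENT_SIZE;
	int left = UDP_PACKET_SIZE - events * EVENT_SIZE;
	return left / MEMBER_SIZE;
}

int
main(void)
{
	/* Even during an event storm a few members still fit. */
	printf("0 events -> %d members, 20 events -> %d members\n",
	       anti_entropy_budget(0), anti_entropy_budget(20));
	return 0;
}
```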

Gerold103 added a commit that referenced this issue Jun 29, 2019
Before the patch, old events were disseminated before new ones
until their TTD reached 0. It was considered fair and not very
important. But after some experiments with a closed-source
version of SWIM it turned out that I could not have been more
wrong.

Firstly, the original SWIM paper explicitly says that newer
events should be sent first. Secondly, this really serves a
purpose. For example, when a big cluster has just started, there
is a huge number of events like 'a new member is added'. They
consume the whole UDP packet, and newer events like 'a member is
dead' starve. In fact, dissemination just doesn't work at the
start of a big cluster.

This patch makes SWIM prefer newer events for dissemination.

Part of #4253
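For illustration, the ordering change boils down to something like this (hypothetical structures, not the actual event queue in swim.c): sort the queue so that the freshest events are encoded into the UDP packet first.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical event record: 'seq' grows with every new event. */
struct event {
	int member_id;
	int seq;
};

/* Newest first: a larger seq sorts earlier and is packed earlier. */
static int
cmp_newest_first(const void *a, const void *b)
{
	const struct event *l = a;
	const struct event *r = b;
	return r->seq - l->seq;
}

int
main(void)
{
	struct event queue[] = {{1, 10}, {2, 42}, {3, 7}};
	qsort(queue, 3, sizeof(queue[0]), cmp_newest_first);
	/* Only as many leading slots as fit into the packet are sent,
	 * so old 'member added' events no longer starve fresh ones. */
	for (int i = 0; i < 3; ++i)
		printf("member %d, seq %d\n", queue[i].member_id, queue[i].seq);
	return 0;
}
```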
Gerold103 added a commit that referenced this issue Jun 29, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. If something important happens during such
an event storm, that event is likely to be lost and not
disseminated until the storm is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to the experiments, the
logarithm is really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Linear dissemination is the worst problem.
After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

Part of #4253
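The TTD change itself amounts to roughly this (a sketch, not the exact formula used in the module):

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Old behaviour: an event was disseminated for N round steps.
 * New behaviour: ceil(log2(N)) steps are enough for an epidemic
 * broadcast, so in a 100-node cluster TTD drops from 100 to 7. */
static int
event_ttd(int cluster_size, bool logarithmic)
{
	if (!logarithmic)
		return cluster_size;
	if (cluster_size < 2)
		cluster_size = 2;
	return (int)ceil(log2(cluster_size));
}

int
main(void)
{
	printf("linear TTD: %d, logarithmic TTD: %d\n",
	       event_ttd(100, false), event_ttd(100, true));
	return 0;
}
```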
Gerold103 added a commit that referenced this issue Jun 30, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. A packet fits up to 26 events. If
something important happens during such an event storm, that
event is likely to be lost and not disseminated until the storm
is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to experiments, the logarithm is
really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Basically, no dissemination at all.

After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

However, the patch can't be pushed as is. Several tests are
broken now, because due to the short TTD 'dead' members are
dropped from member tables too early. That creates a possibility
that another node resurrects them via anti-entropy.

Part of #4253
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. There should be a
way to keep dead and left members until the whole cluster knows
about their death.

This patch unbinds the TTL and TTD concepts. When a member is
learned to be dead or left, it is disseminated for log() steps,
but kept in the member table for C * log() steps. There the
members wait until everyone knows about their death with a very
high probability.

C is chosen as 10, because according to experiments it is the
minimal value covering all the cases when a death event is
disseminated for too long.

Unfortunately, even a TTL, even a big one, does not help when a
user sets the heartbeat rate to too small a value. For example,
if heartbeat_rate is 0.01 and the cluster is of size 100, no
member can ever be safely deleted automatically: dead and left
members will live no longer than 1 second - not enough to
disseminate their death to everyone. A possible solution is to
measure TTL in time instead of round steps. Then it won't depend
on the heartbeat rate.

Part of #4253
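A sketch of the split (hypothetical fields, not the real struct swim_member): TTD limits how long the death event keeps being packed into messages, while TTL limits how long the dead member stays in the member table.

```c
#include <math.h>
#include <stdio.h>

/* Assumed constant from the commit message: keep dead members
 * C times longer than their status is actively disseminated. */
enum { GC_FACTOR = 10 };

struct dead_member {
	int ttd;	/* round steps the death event is still packed */
	int ttl;	/* round steps until the member is dropped */
};

static void
member_mark_dead(struct dead_member *m, int cluster_size)
{
	int log_n = (int)ceil(log2(cluster_size > 1 ? cluster_size : 2));
	m->ttd = log_n;
	m->ttl = GC_FACTOR * log_n;
}

int
main(void)
{
	struct dead_member m;
	member_mark_dead(&m, 100);
	printf("disseminate for %d steps, keep in table for %d steps\n",
	       m.ttd, m.ttl);
	return 0;
}
```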
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. Cluster nodes need
to be suspicious when someone tells them to add a new, not dead
member.

This patch makes SWIM add a new member in only two cases:
manually, or if an ACK was received from it. A new member can't
be added indirectly via events and anti-entropy anymore.
Instead, a ping is sent to the members that are said to be new
and alive. If an ACK is received directly from them, then they
are added.

The patch does not affect updates. They are still indirect,
because if something has changed in an existing member, it is
definitely alive.

Part of #4253
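The rule can be sketched like this (stubbed helpers, not the swim.c API): unknown members reported alive are pinged instead of added, and only a direct ACK (or a manual add) inserts them into the member table.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stubs for illustration only. */
struct member { int id; };
static struct member *member_by_id(int id) { (void)id; return NULL; }
static void member_add(int id) { printf("add member %d\n", id); }
static void send_ping(int id) { printf("ping member %d\n", id); }

/* Called when a packet (event or anti-entropy) mentions member 'id'. */
static void
on_member_info(int id, bool alive, bool is_direct_ack)
{
	if (member_by_id(id) != NULL)
		return;	/* updates of already known members stay indirect */
	if (!alive)
		return;	/* never add someone who is reported dead */
	if (is_direct_ack)
		member_add(id);	/* the member proved it is reachable */
	else
		send_ping(id);	/* be suspicious: ask directly first */
}

int
main(void)
{
	on_member_info(5, true, false);	/* learned from gossip -> ping */
	on_member_info(5, true, true);	/* direct ACK arrived -> add */
	return 0;
}
```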
Gerold103 added a commit that referenced this issue Jun 30, 2019
The previous commit solves one important problem of too long
event dissemination: events could occupy the whole UDP packet
for too long. Now they live log() time, but 'dead' and 'left'
members were bound to TTD. Such members were deleted once TTD
reached 0.

Now they are deleted too early. Cluster nodes forget about dead
ones too soon, and nodes not yet aware of their death can
accidentally resurrect them via anti-entropy. There should be a
way to keep dead and left members until the whole cluster knows
about their death.

This patch unbinds the TTL and TTD concepts. When a member is
learned to be dead or left, it is disseminated for log() steps,
but kept in the member table for C * log() steps. There the
members wait until everyone knows about their death with a very
high probability.

C is chosen as 10, because according to experiments it is the
minimal value covering all the cases when a death event is
disseminated for too long and heartbeat_rate is 1 second.

Unfortunately, even a TTL, even a big one, does not help when a
user sets the heartbeat rate to too small a value. For example,
if heartbeat_rate is 0.01 and the cluster is of size 100, no
member can ever be safely deleted automatically: dead and left
members will live no longer than 1 second - not enough to
disseminate their death to everyone. A possible solution is to
measure TTL in time instead of round steps. Then it won't depend
on the heartbeat rate. But it still does not eliminate the
resurrection probability.

A third solution could be to remove automatic GC - it would
simplify many things. Besides, it seems like no one is going to
use it.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
Before the patch there was a problem of event and anti-entropy
starvation: a cluster generates so many events that they consume
the whole UDP packet. A packet fits up to 26 events. If
something important happens during such an event storm, that
event is likely to be lost and not disseminated until the storm
is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that, the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to experiments, the logarithm is
really enough. Linear TTD was redundant overkill.

Making events live shorter does not by itself solve the
starvation problem - some of them can still be lost in case of a
storm. But it frees some space for anti-entropy, which can
finish the dissemination of lost events.

Experiments in a simulation of a 100-node cluster showed that a
failure was disseminated in ~110 steps when there is a storm.
Basically, no dissemination at all.

After the patch it takes ~20 steps, so it is logarithmic as it
should be, although with a bigger constant than without a storm.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
SWIM sends basically the same message during a round. There was
a micro-optimization to avoid reassembling the message on each
step. Now it is getting harder to support that island of
perfectionism, because

    * Soon all messages will carry all the sections, including
      indirect messages. Their body is smaller, so it is not
      possible to maintain one cached message without reducing
      its maximal size;

    * In big clusters the cached message would need to be
      rebuilt even without any changes, because the anti-entropy
      section won't help much unless it is changed frequently
      enough;

    * In big clusters changes happen often enough to invalidate
      the cached message constantly, unless SWIM maintained
      which members are included in the cache and which are not.
      Then a change of a member not included in the message
      would not affect the cache. But that would complicate the
      code too much.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
There were tests relying on certain content of SWIM messages.
After the next patches these conditions won't hold without
explicit intervention via error injections.

The patchset moves these tests to separate release-disabled
files.

Part of #4253
Gerold103 added a commit that referenced this issue Jul 3, 2019
With the following patches some of the tests will work much
slower due to the significantly increased size of most packets.

This commit tries to smooth that out by

    * Turning off verbose logs in unit tests;
    * Using a much lighter version of the UUID comparator.

According to the profiler these changes speed the tests up by a
factor of a couple, and at the same time they are simple.

Needed for #4253
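The "lighter comparator" part can be shown in a few lines (a sketch; the real tt_uuid has named fields, but for ordering members a byte-wise comparison gives a perfectly valid total order and skips field-by-field parsing):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct uuid {
	uint8_t data[16];
};

/* One memcmp instead of comparing time_low, time_mid, etc. */
static inline int
uuid_compare_light(const struct uuid *a, const struct uuid *b)
{
	return memcmp(a->data, b->data, sizeof(a->data));
}

int
main(void)
{
	struct uuid a = {{0}};
	struct uuid b = {{0}};
	b.data[15] = 1;
	printf("%d\n", uuid_compare_light(&a, &b) < 0); /* 1: a < b */
	return 0;
}
```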
Gerold103 added a commit that referenced this issue Jul 3, 2019
Another place consuming most of the tests' start-up time is
useless dissemination of an empty payload, which can in fact be
skipped.

Consider a cluster of 300 nodes. All of them are interconnected
manually, and now a test wants to wait for stabilization, when
there are no events. On such a cluster it takes ~200 round steps
until there is not a single event left.

This is not about big packets or log() TTD. There may be a few
events or more, but when a test wants the cluster to be clean,
it needs to wait for all the events to be done.

This patch exploits the fact that empty payloads can be compared
for free, without a single memcmp. If both the new and the old
payload are empty, there is nothing to disseminate.

It could help in a real cluster too, if there are no payloads
initially.

Needed for #4253
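The shortcut itself is tiny (a sketch with a hypothetical signature): if both the old and the new payload are empty, they are reported equal without touching the bytes, so no dissemination is scheduled.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true when the payload did not change, i.e. there is
 * nothing to disseminate. Empty vs empty costs no memcmp at all. */
static bool
payload_is_same(const char *old_data, int old_size,
		const char *new_data, int new_size)
{
	if (old_size != new_size)
		return false;
	if (new_size == 0)
		return true;	/* both empty - comparison is free */
	return memcmp(old_data, new_data, new_size) == 0;
}

int
main(void)
{
	printf("%d\n", payload_is_same(NULL, 0, NULL, 0));	/* 1: skip */
	printf("%d\n", payload_is_same("a", 1, "b", 1));	/* 0: send */
	return 0;
}
```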
@Gerold103 Gerold103 changed the title from "SWIM: improve dissemination speed to log(log(N))" to "SWIM: improve dissemination speed to log(N)" Jul 4, 2019
@Gerold103
Collaborator Author

After all, I've found the original source of that loglog idea: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6750&rep=rep1&type=pdf. I mistakenly thought that loglog is about time, but it turned out to be about the number of messages. In short, the paper says there is a way to send n*log(log(n)) messages in log(n) round steps to disseminate one event.

Anyway, the issue is worth implementing, because without it SWIM works at least twice as slow as log() when the number of nodes is <= 800 (more was not tested).

Gerold103 added a commit that referenced this issue Jul 4, 2019
The original SWIM paper says that the dissemination time of an
event is O(log(N)), where N is the size of the cluster. That is
true when both ping and ack messages carry dissemination and
anti-entropy. Before this patch it wasn't so - only regular
pings carried anything.

After this patch the SWIM module has truly exponential
dissemination speed.

Closes #4253
@kostja kostja modified the milestones: wishlist, 2.2.2 Aug 6, 2019