Possible write loss during cluster reconfiguration #5289

Closed
coffeemug opened this Issue Jan 14, 2016 · 8 comments

@coffeemug
Contributor

coffeemug commented Jan 14, 2016

We're working with Kyle Kingsbury on putting RethinkDB through the Jepsen tests. The first chunk of the analysis was published on January 4th (https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5) and RethinkDB passed the tests with flying colors. We're now working with Kyle to do even more sophisticated analysis -- testing read/write guarantees during cluster reconfiguration.

Kyle found a potential guarantee failure -- acknowledged writes that happened during rapid cluster reconfiguration may be lost in the reconfiguration process. We'll be publishing more details as we get more information, but here's what we know so far: if you do rapid reconfigurations of the RethinkDB cluster, in about one out of a few thousand reconfigurations some of the writes that occurred during reconfiguration may be lost. This is unlikely to happen in production, since the tests create thousands of random reconfigurations that potentially overlap with each other, but there are some steps you can take to eliminate the impact of this bug.

Edit: Please follow the updated workaround from #5289 (comment) instead
Workaround: don't reconfigure a live cluster; if you need to make a change to the cluster configuration, shut down all RethinkDB clients first and don't reconfigure during live writes.

Sorry for the inconvenience everyone, we're working around the clock to get more information and get this fixed.

@danielmewes

Member

danielmewes commented Jan 14, 2016

It's worth noting that this might only be an issue if network partitions happen in the middle of a reconfiguration.

I think in practice taking down the application for reconfiguration will often do more harm than the minor risk of running into this.

However, if you rely on linearizability and/or full durability guarantees, we advise following the mentioned workaround until we have more details on this.

@danielmewes

Member

danielmewes commented Jan 25, 2016

Because there has been some confusion about this: the issue still exists in the latest release (2.2.3), not just in 2.1.5.

@danielmewes

Member

danielmewes commented Jan 25, 2016

To give a quick status update:

  • We performed a thorough audit of our clustering code and strengthened its correctness checks.
  • We have so far found one race condition in our Raft implementation that could potentially lead to inconsistent configuration across the cluster, and possibly split-brain effects.
  • However, there is one remaining failure that we are still investigating.

@danielmewes

Member

danielmewes commented Jan 27, 2016

The short version

We have found the bug that's causing this! The fix is currently going through final review and testing.

We hope to ship RethinkDB 2.2.4 with the bug fix by the end of this week.

Workaround (updated): While the problem seems unlikely to occur in practice, until the fixed release ships we recommend not performing any cluster reconfiguration while either

  • any server in the cluster is down, or
  • any number of servers are unreachable because of a network partition.

A conservative pre-flight check is sketched below.
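
As a sketch, one such check: before reconfiguring, verify that every table reports all of its replicas as ready via the table_status system table. This only looks at replica health, so it will not notice a downed server that currently holds no replicas; treat it as a partial check. The snippet assumes the official JavaScript driver and an open connection conn obtained from r.connect.

// Sketch only: a conservative pre-flight check before reconfiguring.
// Returns true only if every table reports all_replicas_ready.
const r = require('rethinkdb');

async function safeToReconfigure(conn) {
  const statuses = await r.db('rethinkdb').table('table_status')
    .pluck('db', 'name', {status: ['all_replicas_ready']})
    .coerceTo('array')
    .run(conn);
  return statuses.every(t => t.status.all_replicas_ready);
}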

Below is a longer description of the bug, for those of you interested in the technical details.

The long version / Introduction

As part of @aphyr's continuing work on testing RethinkDB using his Jepsen test framework, we were made aware of an issue that caused RethinkDB to return incorrect results, and to drop already acknowledged writes.

RethinkDB - to the best of our knowledge - does provide full linearizable consistency when using the documented configuration and not performing any manual reconfigurations. These guarantees are upheld under failure of individual servers as well as arbitrary network partitions. A recent analysis by @aphyr supported RethinkDB's correctness under the tested scenarios.

This bug affects scenarios where a user performs reconfiguration of the cluster in the presence of network partitions. Reconfiguration in this context refers to changes to the r.db('rethinkdb').table('table_config') system table, or the use of the reconfigure or rebalance commands.
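
For concreteness, these are the kinds of operations "reconfiguration" refers to, shown with the JavaScript driver (database, table, and server names are placeholders; r and conn as in the sketch above):

// Any of the following counts as a reconfiguration in the sense used here.
// 'test', 'users', 'server_a', and 'server_b' are placeholder names.

// 1. The reconfigure command:
r.db('test').table('users').reconfigure({shards: 2, replicas: 3}).run(conn);

// 2. The rebalance command:
r.db('test').table('users').rebalance().run(conn);

// 3. A direct edit of the table_config system table:
r.db('rethinkdb').table('table_config')
  .filter({db: 'test', name: 'users'})
  .update({shards: [{primary_replica: 'server_a', replicas: ['server_a', 'server_b']}]})
  .run(conn);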

Under the right circumstances, RethinkDB 2.2.3 and earlier can violate the documented consistency and write persistence guarantees.

We are not aware of a single user who has been affected by this bug, and the issue requires a particular combination of factors in order to generate incorrect behavior.

The following provides an in-depth analysis of the bug.

We would like to thank @aphyr for his help in reproducing the issue and in tracking down potential causes.

Background

Here we provide a simplified explanation of RethinkDB's cluster architecture.

We try to provide enough information to understand the bug, but we will leave out a lot of detail and simplify certain processes to keep this from becoming even longer than it already is.

RethinkDB's cluster architecture

RethinkDB's cluster architecture has three major components.

  • A Raft-based system to manage the configuration state of a table
  • The "multi table manager" to make servers in the cluster aware of their duties, according to the current table configuration from the Raft cluster
  • A system to store and replicate the actual data and to serve queries

We'll take a closer look at the first two components here:

Raft

RethinkDB uses Raft to maintain a consistent configuration state for a given table. Typically, all the replicas you configure for a table will become members of a Raft cluster (sometimes called Raft ensemble) specific to that table. Most importantly, Raft is used in RethinkDB to ensure that the replicas of the table agree on which server should be the current primary. This makes sure that no two servers can successfully operate as a primary at the same time, and this is what allows RethinkDB to provide linearizable consistency.

Raft is structured around the concept of a quorum. If there are 5 replicas for a table, for example, at least 3 of them have to agree on a configuration change before it can take effect. This property ensures that no two configuration changes can happen without at least one replica knowing about both of them, even under arbitrary network partitions. If two configurations would lead to conflicting outcomes (e.g. each one designating a different replica as the primary for the table), that replica would "veto" the second one and thereby make sure that no illegal configuration can ever take effect. (In reality replicas don't actually veto a configuration, but instead vote to elect a Raft leader; the result is the same.)
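
As a quick illustration of the arithmetic (a standalone sketch, not RethinkDB code):

// Majority quorum for a Raft member set of size n.
const quorum = (n) => Math.floor(n / 2) + 1;

quorum(5); // => 3: any two majorities of a 5-member set share at least one member
quorum(3); // => 2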

Another important component of Raft is the concept of a persistent log. At different points during the Raft protocol, the replicas need to persist state to disk, and guarantee that it will still be there at a later time. Similar to the quorum concept, this guarantee is crucial for Raft to function properly.

The multi table manager and Raft membership

As long as you don't create or delete tables, or manually reconfigure an existing table, all configuration management pretty much happens in the Raft cluster.

But what happens if you, for example, use reconfigure to add a new replica to a table? Let's say a table currently has the replicas {A, B} and we now want to add a third replica C. The reconfigure command will contact one of the existing replicas and ask it to change the table configuration. This configuration change happens through the Raft protocol: {A, B} will communicate to agree on a new configuration that includes C, forming the new replica set {A, B, C}. However, C is not currently part of the Raft cluster for the table, so it has no way of learning about this change.

This is where the multi table manager comes into play. There is always one instance of the multi table manager running on each RethinkDB server. Once the multi table manager on A or B learns about the new replica C from the Raft cluster, it will reach out to the multi table manager on C to tell it about that change. C can now join the Raft cluster and start serving as a full replica for the table.

Similarly, if C is currently unreachable (e.g. because of a network failure), and you perform another reconfiguration to reduce the replica set back to {A, B}, the multi table manager on C will get notified by the ones running on A and B to stop being a replica for the table once C becomes reachable again.

Every member of the Raft cluster is identified by a unique "Raft member ID". When a new replica is made to join the cluster, the current members of the Raft cluster will generate a random ID for the new replica. This member ID is communicated through the multi table manager to the new replica, which then uses it to join the Raft cluster.

If a server joins the Raft cluster with a member ID that has been seen in the cluster before, the other replicas in the Raft cluster will assume that the server has participated in the cluster before and is now coming back with all of its persistent log intact. In contrast, a newly added replica will receive a fresh random ID, and hence the other members of the Raft cluster will know that it's a new node without any prior data in its persistent log.
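
A rough sketch of that distinction (illustrative only; the helper below is hypothetical and not RethinkDB's API):

// Sketch: member IDs distinguish new replicas from returning ones.
const crypto = require('crypto');

function addReplica(raftConfig, serverName) {
  // A brand-new replica is handed a freshly generated random member ID, so the
  // other members know it starts out with an empty persistent log.
  const memberId = crypto.randomUUID();
  raftConfig.members.set(memberId, serverName);
  return memberId;
}

// A server that rejoins with an ID already present in raftConfig.members is
// assumed to be a returning member whose persistent log is still intact.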

Where things went wrong

To understand what went wrong, we need to take a closer look at some details of the multi table manager.

The multi table manager relies on timestamps to determine which configuration for a table is the most recent. When it receives new information from another multi table manager (e.g. that the server is now a replica for a table, as in our example above), it compares the timestamp of that information with the timestamp of the table state it currently has stored. If the received information is older than the stored one, it is ignored. Only if it is newer does the multi table manager replace the locally stored information and take additional actions to become, or cease being, a replica for the table.
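
In simplified form, the rule looks roughly like this (a JavaScript sketch of the logic described above, not the actual C++ implementation; becomeReplica and ceaseBeingReplica are illustrative stand-ins):

// Sketch of the timestamp rule: only strictly newer information is applied.
function shouldApply(stored, incoming) {
  return incoming.timestamp > stored.timestamp;
}

function applyStatus(table, incoming) {
  if (!shouldApply(table.stored, incoming)) return;  // older information is ignored
  table.stored = incoming;
  if (incoming.status === 'ACTIVE') {
    becomeReplica(table, incoming);   // join the table's Raft cluster
  } else if (incoming.status === 'INACTIVE') {
    ceaseBeingReplica(table);         // stop acting as a replica for the table
  }
}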

However, there is one exception to this rule, as you can see in this part of its source code:

/* If we are inactive and are told to become active, we ignore the
timestamp. The `base_table_config_t` in our inactive state might be
a left-over from an emergency repair that none of the currently active
servers has seen. In that case we would have no chance to become active
again for this table until another emergency repair happened
(which might be impossible, if the table is otherwise still available). */
bool ignore_timestamp =
    table->status == table_t::status_t::INACTIVE
    && action_status == action_status_t::ACTIVE;

What this code says is: if the multi table manager currently believes that the server it's running on should not be a replica for a table (the INACTIVE status), and it then learns from another multi table manager that it should be a replica (the ACTIVE status), it accepts the new ACTIVE status even if that status carries an older timestamp than the INACTIVE status it previously knew about.

This special case was added in RethinkDB 2.1.1 as part of issue #4668 to work around a scenario in which tables could never finish any reconfiguration if they had previously been migrated to RethinkDB 2.1.0 from RethinkDB 2.0.x or earlier. This became necessary because the migration code sometimes generated INACTIVE entries with incorrect timestamps far in the future, and hence any server with such an entry in its multi table manager could never become ACTIVE again.

On its own, this isn't an issue. However, let's take a closer look at what the multi table manager does when it processes an INACTIVE status. As part of that process, the multi table manager writes the INACTIVE state to disk by calling the write_metadata_inactive function. You can find the full implementation of that function here, but note this line in particular:

table_raft_storage_interface_t::erase(&write_txn, table_id);

This line erases the Raft storage of the table, which includes the Raft persistent log among other data.

Putting things together

We now have all the ingredients to understand the basic mechanism of the bug.

Remember:

  • We rely on Raft to ensure consistency of table configurations and, indirectly, of the stored data
  • A Raft member is identified by its member ID. For a given member ID, the Raft member must ensure that no entry written to its persistent log gets lost.
  • If the multi table manager processes an ACTIVE status, it causes the server to join the Raft cluster for the table with the member ID provided in the status.
  • If the multi table manager processes an INACTIVE status, it stops the replica and erases the persistent Raft log.

After processing an INACTIVE status, the only way for a multi table manager to later process an ACTIVE status is if that ACTIVE status has a higher timestamp. The timestamps are generated by the current members of the Raft cluster for the table. The same code generates the Raft member ID that gets put into the ACTIVE status.

The code makes sure that it never generates a sequence of ACTIVE, INACTIVE, ACTIVE statuses where each one has a higher timestamp than the previous one and both ACTIVE statuses have the same Raft member ID. If you reconfigure a table first to remove a replica, and then reconfigure it again to add the same replica back, the second ACTIVE status will have a different Raft member ID. So things should be safe.

... but wait a minute. We saw that there was one exception where the multi table manager does process an ACTIVE status even though its timestamp is not higher than that of a previously received INACTIVE status.

And this is indeed where the bug lies. If for whatever reason (network delays, network partitions, etc.) a multi table manager receives an ACTIVE status first, then receives an INACTIVE status with a higher timestamp, and then receives the initial ACTIVE status a second time, it will process the second copy of the ACTIVE status. Both ACTIVE statuses have the same Raft member ID, but the INACTIVE status in between has wiped out the persistent log. And we know that Raft cannot function properly if a member comes back with the same member ID but a different (in this case empty) log.
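
To make the problematic ordering concrete, here is the pre-fix rule from the earlier sketch with the exception included, followed by the sequence of statuses that triggers the bug (again simplified and illustrative, not the real implementation):

// Pre-fix rule: an ACTIVE status is accepted regardless of its timestamp
// whenever the stored status is INACTIVE (the override shown in the C++ above).
function shouldApplyBuggy(stored, incoming) {
  const ignoreTimestamp =
    stored.status === 'INACTIVE' && incoming.status === 'ACTIVE';
  return ignoreTimestamp || incoming.timestamp > stored.timestamp;
}

// The ordering that triggers the bug on a server C:
//   1. ACTIVE   (timestamp 1, raftMemberId X) -> C joins the Raft cluster as member X
//   2. INACTIVE (timestamp 2)                 -> C stops being a replica and erases its Raft log
//   3. ACTIVE   (timestamp 1, raftMemberId X) -> accepted via the override: C rejoins
//      as member X, but now with an empty log, which Raft cannot tolerate.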

Example sequence of events

A couple of things have to come together for this to actually matter and cause split-brain behavior (two primaries accepting queries at the same time) and/or data loss.

So far we've only come up with scenarios that involve a combination of table reconfigurations and network partitions, though that doesn't mean that no other scenarios exist.

Here is a rough sketch of one such scenario:
Consider a cluster of five nodes {A, B, C, D, E}, and a table in that cluster.
We will denote the configuration of the table in terms of its replica set as stored in the Raft cluster, as well as the current connectivity groups of the network where applicable.

  1. Initial configuration. All servers are replicas. The network is fully connected.
    Replicas: {A, B, C, D, E}
    Network: {A, B, C, D, E}
  2. The network gets partitioned. {A, B} can no longer see {C, D, E} and vice versa.
    Replicas: {A, B, C, D, E}
    Network: {A, B}, {C, D, E}
  3. Reconfigure the table to a new replica set {D, E} with a client connected to the {C, D, E} side of the network partition. Note that {A, B} are not aware of this configuration change at this point, but {C, D, E} can apply the change because they have a quorum (i.e. at least 3 out of 5 replicas).
    Replicas: {A, B, C, D, E} as seen from {A, B}, {D, E} as seen from {C, D, E}
    Network: {A, B}, {C, D, E}
  4. As a consequence of the reconfiguration, C receives an INACTIVE status and wipes out its persistent Raft log.
    Network: {A, B}, {C, D, E}
  5. Repartition the network into different connected sets {A, B, C}, {D, E}
    Network: {A, B, C}, {D, E}
  6. Since A and B still believe that the replica set is {A, B, C, D, E}, their multi table managers send an ACTIVE status to C.
    Network: {A, B, C}, {D, E}
  7. Because of the bug, C accepts the ACTIVE status and rejoins the Raft cluster with {A, B} (it also would like to join with {D, E}, but since the network is still partitioned, it can't reach those servers).
    Network: {A, B, C}, {D, E}
  8. In step 3, C had approved the membership change that excluded {A, B} from the replica set. However, because its Raft log is now gone, there is no longer any record of that change in the connected set {A, B, C}.
    Network: {A, B, C}, {D, E}
  9. {A, B, C} (incorrectly) believe that they have a Raft quorum. {A, B} still assume the original replica set from step 1, i.e. {A, B, C, D, E} since they were cut off from the network when the change happened in step 3. Together with C, they can obtain a quorum since they have 3 out of 5 replicas.
  10. At the same time, {D, E} (legitimately) believe that they have a quorum as well. In their case their quorum is within the replica set {C, D, E}, i.e. they have 2 out of 3 replicas.
  11. Both {A, B, C} and {D, E} can now independently perform subsequent configuration changes. For example, they could both elect a primary of their own. Both primaries would independently accept write and read operations on both sides of the netsplit (see the quorum arithmetic spelled out below).
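
The quorum arithmetic from steps 9 and 10, spelled out with the quorum helper from the earlier sketch:

// {A, B, C} still believe the member set is {A, B, C, D, E}:
quorum(5); // => 3, and {A, B, C} are 3 connected members -> they (incorrectly) see a quorum

// {D, E} operate within the member set {C, D, E} described in step 10:
quorum(3); // => 2, and {D, E} are 2 connected members -> they (legitimately) see a quorum

// Both sides can therefore commit configuration changes independently: split brain.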

The fix

In this case the fix is rather straightforward: we simply remove the special override for the timestamp comparison in the multi table manager. The multi table manager will only process an ACTIVE status if it has a higher timestamp than any previously received status. Together with the way these statuses are generated, this ensures that any processed ACTIVE status will carry a new Raft member ID.
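
In terms of the earlier sketch, the fix boils down to dropping the override and always requiring a strictly newer timestamp (the real change is in the C++ multi table manager; this is only the simplified equivalent):

// Post-fix rule: no special case for INACTIVE -> ACTIVE transitions.
function shouldApplyFixed(stored, incoming) {
  return incoming.timestamp > stored.timestamp;
}

// With this rule, the stale ACTIVE in step 3 of the ordering above is rejected,
// so a server whose Raft log has been erased can only rejoin under a fresh
// Raft member ID carried by a genuinely newer ACTIVE status.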

You can find the new code here.

This introduces a regression for users who migrated to RethinkDB 2.1.0 at some point and either are still running RethinkDB 2.1.0, or haven't reconfigured their tables to utilize all servers in their cluster since the initial migration. We expect that the number of users affected by this will be extremely small.

If you observe replicas that never become ready after a reconfiguration, and you find messages of the form "Not adding a replica on this server because the active configuration conflicts with a more recent inactive configuration." in the server log, you can use the following command to allow the table to complete the reconfiguration:

r.db(<db>).table(<table>).reconfigure({emergencyRepair: "_debug_recommit"})

We strongly advise disconnecting any clients before running this command. As with all emergencyRepair commands, the _debug_recommit command does not guarantee linearizable consistency during the reconfiguration.
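
After issuing the repair, one way to block until the table reports ready again before reconnecting clients (a sketch; 'test' and 'users' are placeholders, r and conn as before):

// Sketch: run the debug recommit repair, then wait for the table to become ready.
r.db('test').table('users')
  .reconfigure({emergencyRepair: '_debug_recommit'})
  .run(conn)
  .then(() => r.db('test').table('users').wait().run(conn))
  .then(() => console.log('table is ready again'));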

Lessons learned

Distributed systems are highly complex, and designing them to be safe under every sort of edge case is a difficult undertaking. Incidentally, this is why we decided to base our cluster architecture on the proven (literally, in a mathematical sense) Raft protocol, rather than designing our own consensus protocols from scratch.

As we've seen, the bug occurred in an auxiliary component that interacted with the Raft subsystem in a way that we didn't anticipate when we made the change that introduced the bug.

Apart from an increased general caution whenever future changes to one of these systems are necessary, there are three things in particular that we learned while researching this bug. These measures will make a similar bug much less likely to occur again:

  1. More fuzz testing of the clustering architecture.
    Designing distributed systems is hard, but testing them isn't much easier. @aphyr's Jepsen series on the correctness of distributed databases has some great insights into what it means to perform sophisticated randomized testing of such systems.
    We have previously used an adapted version of the Jepsen test framework internally, in addition to our own fuzzing tests. Recently, @aphyr wrote about his own analysis of RethinkDB, and we are happy that RethinkDB turned out to be one of the very few systems tested so far that behaved correctly under the tested conditions.
    Kyle has now extended his test to include configuration changes during network partitions. This is a completely new aspect, and his preliminary results of this test were what made us aware of the existence of this bug. We're going to continue making use of the extended version of the Jepsen test for RethinkDB in the future.
  2. Making better use of runtime guarantees in the code. Our Raft implementation performs a number of consistency checks at runtime, and shuts down the server if it detects any issues. This is an effective safety mechanism to stop invalid state from leading to further consistency and data integrity issues, once it has been detected.
    As part of our work on this bug, we've added additional checks to detect issues like this even earlier in the future.
  3. Prioritization of issues found through the previously mentioned methods.
    We got a hint that something unexpected was happening in our cluster implementation in October 2015, when one of our own fuzzing tests failed a runtime guarantee check (see issue #4979). We investigated the issue and found a bug that seemed to be the cause of it. We fixed that bug a few days later. However, we observed the guarantee check failing again in November, indicating that our original fix didn't provide a full resolution. Unfortunately for our debugging efforts, the guarantee failure was extremely rare. We spent another few weeks validating the involved code paths, but couldn't find a conclusive explanation for the issue at the time. It now appears that the guarantee failures were a consequence of the same bug we describe here.
    The lesson learned is to give even more attention to even the slightest indication of unexpected behavior in the part of the clustering logic that deals with consistency.
@deontologician

Contributor

deontologician commented Jan 27, 2016

👏

danielmewes added a commit that referenced this issue Jan 28, 2016

Fixes bugs in the clustering layer.
This also adds some new comments and guarantees to the Raft code,
and makes a few things more obviously correct.

Fixes #5289.
Fixes #4979.
Fixes #4949.
Fixes #4866. (probably)

@danielmewes

Member

danielmewes commented Jan 28, 2016

The fix was merged into v2.2.x, next, and v2.1.x in commits 33b1d56, 1571493, and 568446e, respectively.

This will ship with RethinkDB 2.2.4 and RethinkDB 2.1.6 (for users who haven't migrated to the 2.2.x branch yet).

@bketelsen

bketelsen Feb 9, 2016

Amazing write-up. Kudos to your team and Kyle for the persistence -- and the wisdom to run it through Jepsen in the first place. It's really made me consider the platform.
