Possible write loss during cluster reconfiguration #5289
We're working with Kyle Kingsbury on putting RethinkDB through the Jepsen tests. The first chunk of the analysis was published on January 4th (https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5) and RethinkDB passed the tests with flying colors. We're now working with Kyle to do even more sophisticated analysis -- testing read/write guarantees during cluster reconfiguration.
Kyle found a potential guarantee failure -- acknowledged writes that happened during rapid cluster reconfiguration may be lost in the reconfiguration process. We'll be publishing more details as we get more information, but here's what we know so far: if you do rapid reconfigurations of the RethinkDB cluster, in about one out of a few thousand reconfigurations some of the writes that occurred during reconfiguration may be lost. This is unlikely to happen in production since the tests are creating thousands of random configurations that potentially overlap with each other, but there are some steps you can take to eliminate the impact of this bug.
Edit: Please follow the updated workaround from #5289 (comment) instead
Sorry for the inconvenience everyone, we're working around the clock to get more information and get this fixed.
It's worth noting that this might only be an issue if network partitions happen in the middle of reconfiguration.
I think in practice taking down the application for reconfiguration will often do more harm than the minor risk of running into this.
However, if you rely on linearizability and/or full durability guarantees, we advise following the workaround mentioned above until we have more details on this.
To give a quick status update:
The short version
We have found the bug that's causing this! The fix is currently going through final review and testing.
We hope to ship RethinkDB 2.2.4 with the bug fix by the end of this week.
Workaround (updated): While the problem seems to be unlikely to occur in practice, we recommend not performing any cluster reconfiguration while either the network connectivity between servers in the cluster is impaired or any server in the cluster is unreachable.
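As an illustration only (not part of the official workaround), one way to double-check cluster health from the Python driver before reconfiguring is to consult the `current_issues` system table and wait for all replicas of a table to be ready. The table name `foo` and the connection settings are placeholders:

```python
# Sketch: check cluster health before reconfiguring (hypothetical table name).
import rethinkdb as r

conn = r.connect(host="localhost", port=28015)

# Any entries here (e.g. disconnected servers) indicate that a
# reconfiguration should be postponed.
issues = list(r.db("rethinkdb").table("current_issues").run(conn))
if issues:
    raise RuntimeError("Cluster has open issues; not reconfiguring: %r" % issues)

# Wait until every replica of the table is fully caught up before
# issuing any further reconfigurations.
r.table("foo").wait(wait_for="all_replicas_ready").run(conn)
```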
Below you can find the long description of this bug, as we think that some of you might be interested in the technical details of this.
The long version / Introduction
As part of @aphyr's continuing work on testing RethinkDB using his Jepsen test framework, we were made aware of an issue that caused RethinkDB to return incorrect results, and to drop already acknowledged writes.
RethinkDB - to the best of our knowledge - does provide full linearizable consistency when using the documented configuration and not performing any manual reconfigurations. These guarantees are upheld under failure of individual servers as well as arbitrary network partitions. A recent analysis by @aphyr supported RethinkDB's correctness under the tested scenarios.
This bug affects scenarios where a user performs reconfiguration of the cluster in the presence of network partitions. Reconfiguration in this context refers to changes to the table configuration, such as adding or removing replicas or changing the number of shards (for example through the `reconfigure` command or the `table_config` system table).
Under the right circumstances, RethinkDB 2.2.3 and earlier can violate the documented consistency and write persistence guarantees.
We are not aware of a single user who has been affected by this bug, and the issue requires a particular combination of factors in order to generate incorrect behavior.
The following provides an in-depth analysis of the bug.
We would like to thank @aphyr for his help in reproducing the issue and in tracking down potential causes.
Here we provide a simplified explanation of RethinkDB's cluster architecture.
We try to provide enough information to understand the bug, but will leave out a lot of detail and simplify certain processes for the sake of not letting this become even longer than it already is.
RethinkDB's cluster architecture
RethinkDB's cluster architecture has three major components.
We'll take a closer look at the first two of them here: the per-table Raft clusters and the multi table manager.
RethinkDB uses Raft to maintain a consistent configuration state for a given table. Typically, all the replicas you configure for a table will become members of a Raft cluster (sometimes called Raft ensemble) specific to that table. Most importantly, Raft is used in RethinkDB to ensure that the replicas of the table agree on which server should be the current primary. This makes sure that no two servers can successfully operate as a primary at the same time, and this is what allows RethinkDB to provide linearizable consistency.
Raft is structured around the concept of a quorum. If there are 5 replicas for a table for example, at least 3 of them have to agree on a configuration change before it can take effect. This property ensures that no two configuration changes can happen without at least one replica knowing about both of them, even under arbitrary network partitions. If the two configurations would lead to conflicting outcomes (e.g. each one designating a different replica as the primary for the table), that replica would "veto" the second one and thereby make sure that no illegal configuration can ever take effect. (In reality replicas don't actually veto a configuration, but instead vote to elect a Raft leader. The result is the same).
Another important component of Raft is the concept of a persistent log. At different points during the Raft protocol, the replicas need to persist state to disk, and guarantee that it will still be there at a later time. Similar to the quorum concept, this guarantee is crucial for Raft to function properly.
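As a toy illustration of the quorum idea (not RethinkDB code): with 5 replicas a majority is 3, and any two majorities must share at least one member, so no two conflicting decisions can both be accepted without a common witness.

```python
# Toy illustration of Raft-style majority quorums (not RethinkDB code).
from itertools import combinations

replicas = {"a", "b", "c", "d", "e"}      # 5 replicas for a table
quorum_size = len(replicas) // 2 + 1      # majority: 3 out of 5

# Every pair of majorities overlaps in at least one replica, so no two
# conflicting decisions can both reach a quorum without a common witness.
for q1 in combinations(sorted(replicas), quorum_size):
    for q2 in combinations(sorted(replicas), quorum_size):
        assert set(q1) & set(q2), "two majorities always intersect"
print("any two quorums of size", quorum_size, "intersect")
```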
The multi table manager and Raft membership
As long as you don't create or delete tables, or manually reconfigure an existing table, all configuration management pretty much happens in the Raft cluster.
However, what happens if you, for example, use the `reconfigure` command to add a server as a new replica for a table that it isn't yet involved in?
This is where the multi table manager comes into play. There is always one instance of the multi table manager running on each RethinkDB server. Once the multi table manager on a server learns (from the multi table managers on other servers) that its server should now be a replica for a table, it stores that information and takes the necessary steps to join the table's Raft cluster.
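For example, a user-initiated reconfiguration like the following (table name and settings are hypothetical) is what causes the multi table managers to propagate a new configuration to the affected servers:

```python
# Sketch: a user-initiated reconfiguration (hypothetical table and settings).
import rethinkdb as r

conn = r.connect(host="localhost", port=28015)

# Change the table to 1 shard with 3 replicas. The multi table managers on the
# affected servers learn about the new configuration and join (or leave) the
# table's Raft cluster accordingly.
result = r.table("foo").reconfigure(shards=1, replicas=3).run(conn)
print(result["config_changes"])
```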
Every member of the Raft cluster is identified by a unique "Raft ID". When a new replica is made to join the cluster, the current members of the Raft cluster will generate a random ID for the new replica. This member ID is communicated through the multi table manager to the new replica, which then uses it to join the Raft cluster.
If a server joins the Raft cluster with a member ID that has been seen in the cluster before, but with different (or missing) persistent state, Raft's correctness guarantees no longer hold. Generating a fresh member ID for every newly joining replica avoids this.
Where things went wrong
To understand what went wrong, we need to take a closer look at some details of the multi table manager.
The multi table manager relies on timestamps to determine which configuration for a table is the most recent. When it receives some new information from another multi table manager (e.g. that the server is now a replica for a table, as in our example above), it compares the timestamp of that new piece of information with the timestamp of the table state that it currently has stored. If the received information is older than the currently stored one, it is ignored. Only if it is newer does the multi table manager replace the locally stored information and take additional actions to become a replica for the table.
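Here is a highly simplified illustration of that timestamp rule (in Python, not the actual C++ implementation):

```python
# Simplified illustration of the timestamp rule (not the actual C++ code).
def should_accept(stored_timestamp, received_timestamp):
    # Only strictly newer information replaces the locally stored state.
    return received_timestamp > stored_timestamp

assert should_accept(stored_timestamp=5, received_timestamp=7)      # newer: accept
assert not should_accept(stored_timestamp=5, received_timestamp=3)  # older: ignore
```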
However, there is one exception to this rule, as you can see in this part of its source code:
```cpp
/* If we are inactive and are told to become active, we ignore the timestamp.
The `base_table_config_t` in our inactive state might be a left-over from an
emergency repair that none of the currently active servers has seen. In that
case we would have no chance to become active again for this table until
another emergency repair happened (which might be impossible, if the table is
otherwise still available). */
bool ignore_timestamp =
    table->status == table_t::status_t::INACTIVE
    && action_status == action_status_t::ACTIVE;
```
What that code is saying is that if the multi table manager currently believes that the server it's running on should not be a replica for a table (the INACTIVE status), and then it learns from another multi table manager that it should be a replica (the ACTIVE status), it accepts that new ACTIVE status even if the new status has an older timestamp than the INACTIVE status that it previously knew about.
This special case was added in RethinkDB 2.1.1 as part of issue #4668 to work around a scenario in which tables could never finish any reconfiguration if they had previously been migrated to RethinkDB 2.1.0 from RethinkDB 2.0.x or earlier. This became necessary because the migration code sometimes generated INACTIVE entries with wrong timestamps that were far in the future, so any server with such an entry in its multi table manager could never become ACTIVE again.
This so far isn't an issue. Let's however take a closer look at what the multi table manager does when it processes an INACTIVE status. As one part of that process, the multi table manager writes the INACTIVE state to disk and, in doing so, erases the Raft storage of the table, which includes the Raft persistent log among other data.
Putting things together
We now have all the ingredients to understand the basic mechanism of the bug.
After processing an INACTIVE status, the only way for a multi table manager to later process an ACTIVE status is if that ACTIVE status has a higher timestamp. The timestamps are generated by the current members of the Raft cluster for the table. The same code generates the Raft member ID that gets put into the ACTIVE status.
The code makes sure that it never generates a sequence of ACTIVE, INACTIVE, ACTIVE statuses where each one has a higher timestamp than the previous one and both ACTIVE statuses carry the same Raft member ID. If you reconfigure a table first to remove a replica, and then reconfigure it again to add the same replica back, the second ACTIVE status will have a different Raft member ID. So things should be safe.
... but wait a minute. We saw that there was one exception where the multi table manager does process an ACTIVE status even though its timestamp is not higher than that of a previously received INACTIVE status.
And this is indeed where the bug lies. If for whatever reason (network delays, network partitions, etc.) a multi table manager receives an ACTIVE status first, then receives an INACTIVE status with a higher timestamp, and then receives the initial ACTIVE status a second time, it will process the second copy of the ACTIVE status. Both ACTIVE statuses have the same Raft member ID, but the INACTIVE status in between has wiped out the persistent log. And we know that Raft cannot function properly if a member comes back with the same member ID but a different (in this case empty) log.
Example sequence of events
A couple of things have to come together for this to actually matter and cause split-brain behavior (two primaries accepting queries at the same time) and/or data loss.
So far we've only come up with scenarios that involve a combination of table reconfigurations and network partitions, though that doesn't mean that no other scenarios exist.
Here is a rough sketch of one such scenario:
1. A table is reconfigured and server X is added as a replica. X receives an ACTIVE status containing a freshly generated Raft member ID and joins the table's Raft cluster.
2. The table is reconfigured again to remove X. X receives an INACTIVE status with a higher timestamp and erases its Raft storage, including the persistent log.
3. Because of a network partition or delayed messages, another server re-sends the earlier ACTIVE status to X. Due to the timestamp exception described above, X accepts it and rejoins the Raft cluster with its old member ID, but with an empty log.
4. With its log gone, X can now acknowledge configurations that conflict with what it had previously promised. This undermines Raft's quorum guarantees and can lead to two primaries accepting queries at the same time, or to acknowledged writes being lost.
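To make this concrete, here is a small, self-contained simulation (in Python, illustration only, not RethinkDB code) of the simplified multi table manager model sketched earlier, including the `ignore_timestamp` exception and the Raft-storage wipe:

```python
# Simulation of the problematic status sequence (illustration only).

class MultiTableManager:
    def __init__(self):
        self.status = "INACTIVE"      # initially not a replica for the table
        self.timestamp = 0
        self.raft_member_id = None
        self.raft_log = []            # stands in for the Raft persistent log

    def handle_message(self, status, timestamp, raft_member_id=None):
        # The buggy exception: an ACTIVE message is accepted even if it is
        # older than the INACTIVE state we currently have stored.
        ignore_timestamp = self.status == "INACTIVE" and status == "ACTIVE"
        if timestamp <= self.timestamp and not ignore_timestamp:
            return
        self.status, self.timestamp = status, timestamp
        if status == "ACTIVE":
            self.raft_member_id = raft_member_id
            self.raft_log.append("joined raft cluster as %s" % raft_member_id)
        elif status == "INACTIVE":
            self.raft_member_id = None
            self.raft_log = []        # Raft storage (incl. the log) is erased

mgr = MultiTableManager()
mgr.handle_message("ACTIVE", timestamp=1, raft_member_id="id-42")
mgr.handle_message("INACTIVE", timestamp=2)                         # log wiped
mgr.handle_message("ACTIVE", timestamp=1, raft_member_id="id-42")   # re-delivered

# The server rejoins Raft with the *same* member ID but an empty log,
# which is exactly the situation Raft cannot tolerate.
print(mgr.raft_member_id, mgr.raft_log)
```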
The fix
In this case the fix is rather straightforward. We simply remove the special override for the timestamp comparison in the multi table manager. The multi table manager is only going to process an ACTIVE status if it has a higher timestamp than any previously received status. Together with the way these statuses are generated, this ensures that any processed ACTIVE status will have a new Raft member ID.
You can find the new code here.
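In terms of the simplified model above, the change boils down to dropping the `ignore_timestamp` exception, so that only strictly newer statuses are ever processed (an illustration of the idea, not the actual C++ change):

```python
# Simplified illustration of the fixed rule (not the actual C++ change).
def should_accept(stored_status, stored_timestamp, received_status, received_timestamp):
    # No more ACTIVE-overrides-INACTIVE exception: a received status is only
    # processed if it is strictly newer than anything stored before.
    return received_timestamp > stored_timestamp

# The re-delivered ACTIVE status from the scenario above is now ignored:
assert not should_accept("INACTIVE", 2, "ACTIVE", 1)
```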
This introduces a regression for users who migrated to RethinkDB 2.1.0 at some point and either are still running RethinkDB 2.1.0, or haven't reconfigured their tables to utilize all servers in their cluster since the initial migration. We expect that the number of users affected by this will be extremely small.
If you observe replicas that never become ready after a reconfiguration, and you find corresponding messages in the server log, the table can be repaired by running an emergency repair.
We highly advise disconnecting any clients before running this command. As with all emergency repair operations, it should be used with great care.
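Assuming the command in question is an emergency repair via `reconfigure` (the exact command from the original comment is not reproduced here), it would look roughly like this in the Python driver. The table name is hypothetical, and `dry_run=True` lets you inspect the proposed changes before applying anything:

```python
# Sketch of an emergency repair (hypothetical table name; use with caution).
import rethinkdb as r

conn = r.connect(host="localhost", port=28015)

# First do a dry run to see what the repair would change.
plan = r.table("foo").reconfigure(
    emergency_repair="unsafe_rollback", dry_run=True).run(conn)
print(plan)

# Only after reviewing the plan (and disconnecting clients) apply it:
# r.table("foo").reconfigure(emergency_repair="unsafe_rollback").run(conn)
```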
Distributed systems are highly complex, and designing them to be safe under any sort of edge case is a difficult undertaking. Incidentally, this is the reason why we decided to base our cluster architecture around the proven (literally, in a mathematical sense) Raft protocol, rather than designing our own consensus protocols from scratch.
As we've seen, the bug occurred in an auxiliary component that interacted with the Raft subsystem in a way that we didn't anticipate when we made the change that introduced the bug.
Apart from an increased general caution whenever future changes to one of these systems are necessary, there are three things in particular that we learned while researching this bug. These measures will make a similar bug much less likely to occur again:
The Jepsen post about this is out: https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration