Separate document about consistency and persistence guarantees #770

Closed
danielmewes opened this issue May 19, 2015 · 14 comments

@danielmewes (Member)

For RethinkDB 2.1, we should write a separate document that explains all the consistency and persistence guarantees and the different settings and their effects.

For example: If I perform a write with settings X, is the write guaranteed to be persistent if...
a) one or multiple replicas are temporarily shut down (e.g. maintenance, connectivity issue)
b) one or multiple replicas fail temporarily without an orderly shutdown (e.g. power failure, software crash)
c) one or multiple replicas fail permanently (e.g. storage failure)

Relevant settings are: soft durability vs. hard durability, majority vs. single acks

We should also specify how many replicas are allowed to fail in a given configuration.

Also: If I perform a read with settings X, am I guaranteed to see all previously acknowledged writes? Can I read a value that might get lost later in case a replica fails (e.g. a write that hasn't been acknowledged yet, or that hasn't been written to disk yet)?

Relevant settings are: use_outdated, the upcoming "consistent read" parameter (rethinkdb/rethinkdb#3895). It also matters which settings were used for the relevant previous writes.

@danielmewes added this to the 2.1 milestone May 19, 2015
@danielmewes (Member Author)

This page will supersede (or enhance) the FAQ entry written for #725

@danielmewes (Member Author)

It would also be good to have a brief paragraph on changefeed consistency guarantees (currently they always behave like up-to-date reads without the "consistent read" flag).

@timmaxw (Member) commented Jun 9, 2015

This document pertains to RethinkDB 2.1.

Consistency and durability settings

There are three settings that control consistency and durability: write_acks, durability, and read_mode.

Write acks

write_acks can be set per-table via rethinkdb.table_config.

The possible values are:

  • "majority": Writes will return once have been written to a majority of replicas (not counting non-voting replicas). This is the default.
  • "single": Writes will return once they have been written to at least one replica.
Durability

durability can be set per-table via rethinkdb.table_config or per-query via an optarg to .update(), .insert(), etc. The per-table setting is only used if no per-query value is specified.

The possible values are:

  • "hard": Writes are not considered "written to a replica" until they are safely on the replica's disk, secure against power failure. This is the default.
  • "soft": Writes are considered "written to a replica" once they have been recorded in memory on the replica.
Read mode

read_mode can be set per-query via an optarg to r.table(). This replaces the old use_outdated optarg.

The possible values are:

  • "majority": The read will only observe values that are safely committed on disk on a majority of replicas. This mode requires sending a message to every replica for each read.
  • "single": The read will observe values that are in memory on the primary replica. This is the default.
  • "outdated": The read will observe values that are in memory on an arbitrary replica.

Changefeeds ignore the read_mode flag; their behavior is always equivalent to "single" mode.
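
For example, a read that must reflect every acknowledged write can request "majority" mode explicitly. This is a sketch (the table name is a placeholder, and some drivers spell the optarg readMode rather than read_mode):

r.table("users", {read_mode: "majority"}).get(user_id)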

Guarantees

Linearizability and atomicity

If write_acks is set to "majority", durability is set to "hard", and read_mode is set to "majority", then RethinkDB guarantees linearizability of individual atomic operations on each individual document. This means that every read will see every previous successful write, and no read will ever see a definitively failed write. (See note about definitively failed vs. indeterminate writes below.)

Warning: The above linearizability guarantee is for atomic operations, not for queries. A single RethinkDB query will not necessarily execute as a single atomic operation. For example, r.table("foo").get("bar").eq(r.table("foo").get("bar")) might return false! Each individual r.table("foo").get("bar") operation is atomic, but the query as a whole is not atomic.

If you need to read and then modify a document as a single atomic operation, use the .update() or .replace() commands. For example, to atomically increment the hits field of a document, you could write:

table.get(counter_id).update({hits: r.row("hits").add(1)})

This can also be used to implement a check-and-set register. For example, the following query will atomically check whether the foo field is equal to old_value and change it to new_value if so:

table.get(register_id).update({
    foo: r.branch(r.row("foo").eq(old_value), new_value, r.row("foo"))
})
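
To tell whether the check-and-set actually took effect, inspect the counters in the write result returned by .update(); a sketch of the two possible outcomes:

// {replaced: 1, unchanged: 0, ...}  -> foo equaled old_value and was swapped to new_value
// {replaced: 0, unchanged: 1, ...}  -> foo did not equal old_value; nothing was changed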

RethinkDB operations are never atomic across multiple keys. For this reason, RethinkDB is not considered an ACID database.

Currently, .filter(), .get_all() and other such operations will execute as separate atomic operations from .update() and other mutation operations. For example, the following is not a correct implementation of a check-and-set register:

table.filter({id: register_id, foo: old_val}).update({foo: new_val})

However, there has been some discussion of changing this behavior. See rethinkdb/rethinkdb#3992.

Availability

Except for brief periods, a table will remain fully available as long as more than half of the replicas for each shard are available, not counting non-voting replicas. If half or more of the voting replicas for a shard are lost, then read or write operations on that shard will fail.

If the primary replica is lost, but more than half of the voting replicas are still available, an arbitrary voting replica will be elected as primary. The new primary will show up in table_status, but the primary_replica field of table_config will not change. If the old primary ever becomes available again, the system will switch back. When the primary changes there will be a brief period of unavailability.

Reconfiguring a table (changing the number of shards, shard boundaries, etc.) will cause brief losses of availability at various points during the reconfiguration.

If half or more of the voting replicas of a shard are lost, the only way to recover availability is to run the emergency repair command. Running the emergency repair command invalidates the linearizability guarantees in this document.

However, see rethinkdb/rethinkdb#4357 for a known bug that is an exception to these availability guarantees.

Reads run in "single" mode may succeed even if the table is not available, but this is not guaranteed. Reads run in "outdated" mode will succeed as long as at least one replica for each of the relevant shards is available.

Trading off safety for performance

RethinkDB offers a sliding scale of safety versus performance guarantees.

The default settings always choose safety over performance except in one case: read_mode defaults to "single" rather than "majority". This is because "majority" read mode requires sending a query to all of the replicas and waiting for a majority of them to reply, which significantly degrades performance.

In normal operation, "single" read mode produces the same results as "majority" read mode, but in the event of a network failure or crash, it might return outdated results. It's also possible that a read run in "single" mode can return results from an incomplete write that is later rolled back.

The same is true for "single" write mode and "soft" durability mode: in normal operation these produce the same results as "majority" and "hard", but in the event of a network or server failure, recent write operations that were run using these modes might be lost.

Note that write_acks and durability don't actually affect how the write is performed; they only affect when the acknowledgement is sent back to the client.

Reads run in "outdated" mode will return outdated data even during normal operation, but the data will typically be less than a second out of date. In the event of a network or server failure, the data may be much more out of date. The advantage of running reads in "outdated" mode is that the latency and throughput are often better than in "single" mode, in addition to the availability differences described in the previous section.

Other notes

If you run the emergency repair command on a table, these guarantees will be invalidated.

There are two ways a write operation can fail. Sometimes it will fail definitively; other times it will fail indeterminately. You can examine the error message to see which type of failure happened. (This is a work in progress; ask @mlucy for details.) If a write fails definitively, no read will ever see it, even in the weaker read modes. If it fails indeterminately, it's in a sort of a limbo state; reads run in "single" or "outdated" modes might see it, but when the network failure or crash that caused the problem is resolved, the write might or might not be rolled back. In general, writes will fail indeterminately if they were running at the exact moment when the network or server issue first happened.
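
Because an indeterminate failure leaves the client unsure whether the write was applied, a common defensive pattern is to re-read the document in "majority" mode once the cluster has recovered and decide how to proceed. A sketch, with placeholder names:

// After an indeterminate write error, check what actually ended up in the table:
r.table("accounts", {read_mode: "majority"}).get(account_id)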

@VeXocide @danielmewes -- please read over this and let me know if you spot any errors or things that are incompletely documented.

@danielmewes (Member Author)

Currently, .filter(), .get_all() and other such operations will execute as separate atomic operations from .update() and other mutation operations.

Minor correction: The operation in filter isn't necessarily atomic itself.

@danielmewes (Member Author)

However, there has been some discussion of changing this behavior. See rethinkdb/rethinkdb#3992.

Note that this will either not happen for 2.1, or it will happen in a way which we wouldn't want to document. So as far as the 2.1 documentation is concerned, those operations are definitely not guaranteed to be atomic as a whole (and it's unclear if that will change).

@danielmewes (Member Author)

If it fails indeterminately, it's in a sort of a limbo state; reads run in "single" or "outdated" modes might see it, but when the network failure or crash that caused the problem is resolved, the write might or might not be rolled back.

Note that even "majority" reads can see the write (at least) if the network failure is non-transitive.

@danielmewes (Member Author)

Also: Very nice writeup 👍

@timmaxw (Member) commented Jun 9, 2015

Note that even "majority" reads can see the write (at least) if the network failure is non-transitive.

If a "majority" read ever sees a write, then the write will never be rolled back. I don't see how network failure is related to this.

@timmaxw (Member) commented Jun 9, 2015

*Non-transitive network failure

@danielmewes (Member Author)

If a "majority" read ever sees a write, then the write will never be rolled back. I don't see how network failure is related to this.

That's true, but such a write might still have returned an indeterminate failure result. Hence a write that fails indeterminately might not only be seen by "single" and "outdated", but also by "majority" reads.
Non-transitive network failure was maybe not the best example. Just consider the case where the parsing node that handles a majority write loses connectivity to the rest of the cluster before it can receive sufficient acks. Then the write will fail indeterminately, but a majority read over another node might still very well reach a majority and see the write.

@timmaxw (Member) commented Jun 9, 2015

OK, I think we're on the same page here.

@chipotle self-assigned this Jun 9, 2015
chipotle pushed a commit that referenced this issue Jun 11, 2015
@danielmewes (Member Author)

One more thing that confused me while playing with our Raft version, and that I think is worth mentioning:
When describing "majority" and "single" write acks, we should note that using "single" write acks doesn't improve availability of tables. We will still reject "single" ack writes if we cannot reach a majority of replicas eventually. The difference to "majority" is that this isn't guaranteed to happen immediately (i.e. a handful of writes can still pass before all servers realize that something is wrong), and as already mentioned the latency of writes can be reduced because we don't have to wait for as many replicas to acknowledge the writes.
However if 2 out of 3 replicas go down you cannot write to the table regardless of the "write_acks" setting. This is different from RethinkDB 2.0.

@chipotle (Contributor)

Closed in 3d59aa6

@danielmewes (Member Author)

Nice document. Just a few small things I'd like to change:

(These operations are usually atomic, although not all filter operations are, depending on the predicate.)

I think this is confusing, because it's not clear what being "atomic" means with respect to these operations. I think we should remove that remark.

Except for brief periods, a table will remain fully available as long as more than half of the voting replicas for each shard are available.

This is not entirely true. In addition to more than half of the voting replicas for each shard, we also need to have more than half of the voting replicas of the table overall available.

(changing the number of shards, shard boundaries, etc.)

Since we don't expose shard boundaries directly and don't allow adjusting them explicitly, I think this should say: "(changing the number of shards, rebalancing, etc.)"

@danielmewes reopened this Jun 16, 2015
@chipotle closed this as completed Jul 6, 2015
chipotle pushed a commit that referenced this issue Jul 14, 2015