diff --git a/_posts/2015-08-22-paxos-register.markdown b/_posts/2015-08-22-paxos-register.markdown
index d0fc04ac..c355a74e 100644
--- a/_posts/2015-08-22-paxos-register.markdown
+++ b/_posts/2015-08-22-paxos-register.markdown
@@ -1,25 +1,25 @@
 ---
 layout: post
-title: rystsov::The Paxos Register
-name: The Paxos Register
-tags: ["misc"]
+title: rystsov::Write-once distributed register
+name: Write-once distributed register
+tags: ["distr"]
 desc: "Design of a write-once distributed fault-tolerance register"
 has_comments: true
 ---
-

The Paxos Register

+

Write-once distributed register

-Imagine a service that provides an API to read and to write a write-once variable with the following properties:
+Imagine a service that provides an API to read and to write a write-once register (a register that can be assigned a value only once) with the following properties:

 1. If a write request succeeds then any consequent write should fail.
 2. If a value is written then any consequent read must return the value or fail.
 3. Any read request must return empty value until a value is written

-The implementation of the service is straightforward when we plan to run it on a single node. But any single node solution has a reliability/availability problem: when the node goes off-line (due to crash or maintenance procedures) our service is unavailable. The obvious solution is to run a service on several nodes to achieve high availability. At that point we enter the realm of distributed systems and everything (including our service) gets harder. Try to think how to satisfy the desired properties in a distributed environment (where nodes may temporary go offline and where network may lose messages) and you will see that consistency is a tough problem.
+The implementation of the service is straightforward when we plan to run it on a single node. But any single node solution has a reliability/availability problem: when the node goes off-line (due to a crash or maintenance procedures) our service is unavailable. The obvious solution is to run the service on several nodes to achieve high availability. At that point we enter the realm of distributed systems and everything (including our service) gets more complicated. Try to think about how to satisfy the desired properties listed above in a distributed environment (where nodes may temporarily go offline and where the network may lose messages) and you will see that consistency is a tough problem.

-Hopefully this task is already solved by the paxos consensus algorithm. People usually use paxos to build way more complex systems than the described service (register) but the algorithm is also applicable here. Moreover the fact that the register is one of the simplest distributed system that can be built based on paxos makes it is a good entry point for the engineers who what to understand paxos on an example.
+Fortunately this task has already been solved by the Paxos consensus algorithm. People usually use Paxos to build far more complex systems than the described service, but the algorithm is also applicable here. Moreover, the fact that the register is one of the simplest distributed systems that can be built on top of Paxos makes it a good entry point for engineers who want to understand Paxos by example.

-The description of the register is written in a python inspired pseudocode. I assume that every write to an instance variable is intercepted, redirected to a persistent storage and fsync-ed. Of course every read from an instance variable reads from the storage. The pseudocode follows the algorithm described in the [Paxos Made Simple](http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) paper by Leslie Lamport, please read it to understand the idea behind the code and see the proof.
+The description of the register is written in a Python-inspired pseudocode. I assume that every write to an instance variable is intercepted, redirected to persistent storage and fsync-ed. Of course every read from an instance variable reads from the storage. The pseudocode follows the algorithm described in the [Paxos Made Simple](http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) paper by Leslie Lamport; please read it to understand the idea behind the code and to see the proof.

 Acceptors:
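To make the scheme concrete, here is a rough, self-contained sketch of a single-decree acceptor and a proposer-side write in the spirit of Paxos Made Simple. The names (`Acceptor`, `prepare`, `accept`, `write_once`) and the in-memory state are my own illustration rather than the post's actual gist, which keeps its state in fsync-ed storage.

```python
# A minimal sketch of a write-once register on top of single-decree Paxos.
# State lives in memory here; the post assumes every instance-variable write
# is persisted and fsync-ed before the method returns.

class Acceptor:
    def __init__(self):
        self.promise = 0          # highest ballot this acceptor promised
        self.accepted_ballot = 0  # ballot of the accepted value, if any
        self.accepted_value = None

    def prepare(self, ballot):
        # Promise not to accept anything below `ballot`; report what was accepted so far.
        if ballot <= self.promise:
            return ("fail", self.promise)
        self.promise = ballot
        return ("ok", self.accepted_ballot, self.accepted_value)

    def accept(self, ballot, value):
        # Accept the value unless a higher ballot has been promised meanwhile.
        if ballot < self.promise:
            return ("fail", self.promise)
        self.promise = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return ("ok",)


def write_once(acceptors, ballot, value):
    """Try to write `value`; return the value that actually got chosen."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: collect promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    oks = [p for p in promises if p[0] == "ok"]
    if len(oks) < quorum:
        raise Exception("prepare phase failed, retry with a higher ballot")
    # If some acceptor already accepted a value, we must propose that value;
    # this is exactly what makes the register write-once.
    accepted = [(b, v) for _, b, v in oks if v is not None]
    if accepted:
        value = max(accepted, key=lambda bv: bv[0])[1]
    # Phase 2: ask a majority to accept the value.
    acks = [a.accept(ballot, value) for a in acceptors]
    if sum(1 for ack in acks if ack[0] == "ok") < quorum:
        raise Exception("accept phase failed, retry with a higher ballot")
    return value


if __name__ == "__main__":
    nodes = [Acceptor() for _ in range(3)]
    print(write_once(nodes, ballot=1, value="hello"))    # hello
    print(write_once(nodes, ballot=2, value="ignored"))  # still hello
```

A real write API would additionally report a failure when the value returned by `write_once` differs from the one the caller tried to write; that is how the first property (any consequent write fails) surfaces to clients.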
diff --git a/_posts/2015-08-23-paxos-variable.markdown b/_posts/2015-08-23-paxos-variable.markdown
index 01ad40ae..dd11fec3 100644
--- a/_posts/2015-08-23-paxos-variable.markdown
+++ b/_posts/2015-08-23-paxos-variable.markdown
@@ -1,25 +1,25 @@
 ---
 layout: post
-title: rystsov::The Paxos Variable
-name: The Paxos Variable
-tags: ["misc"]
-desc: "How to update the registor into a distributed fault-tolerance variable with compare-and-set (CAS) concurrency control mechanism"
+title: rystsov::Distributed register
+name: Distributed register
+tags: ["distr"]
+desc: "How to update the write-once register into a distributed fault-tolerant mutable register with the compare-and-set (CAS) concurrency control mechanism"
 has_comments: true
 ---
-

The Paxos Variable

+

Distributed register

-In the previous [article]({% post_url 2015-08-22-paxos-register %}) I showed how to create a write-once distributed register. But in real life we usually want to mutate state. So Let's fix it and design a distributed paxos based variable.
+In the previous [article]({% post_url 2015-08-22-paxos-register %}) I showed how to create a write-once distributed register. But in real life we usually want to mutate state. So let's fix it and design a mutable register.

-Distributed variable is a little bit more complex system than the register but it uses the same idea so if you understand the register then it should be easy to you to understand the variable's design.
+The distributed mutable register is a slightly more complex system than the write-once register, but it uses the same idea, so if you understand the write-once register it should be easy for you to understand the current design.

-The basic idea is to use a sequence of ordered registers to emulate a variable. For example to write a value we write it to the register with the lowest id among empty registers. To read a value we read it from the register with the highest id among filled registers.
+The basic idea is to use a sequence of ordered registers to emulate a variable. For example, to write a value we write it to the register with the lowest ID among the empty registers. To read a value we read it from the register with the highest ID among the filled registers.

-Distributed system is by definition is a concurrent system too so its very important to provide concurrency control mechanisms like a compare-and-set (CAS) primitive. To support CAS primitive we must prevent modifications of the registers if a register with a higher is has already chosen a value. We can achieve it if maintain serializability between any successful register's modifications. Hopefully we can do it if make the acceptors to use the same promise. Let's take a look on the new acceptor's methods.
+A distributed system is by definition a concurrent system too, so it's very important to provide concurrency control mechanisms like a compare-and-set (CAS) primitive. To support the CAS primitive we must prevent modification of the past, i.e. edits of a register when a register with a higher ID has already chosen a value. We can achieve this by maintaining serializability between successful modifications of the registers. Fortunately we can do it if we make the acceptors use the same promise. Let's take a look at the new acceptor's methods.

 {% gist rystsov/a45793daa1662a921479 %}

-You may be surprised since you don't see a sequence of the registers. Well, we always are interested only in the filled register with the largest id so we don't store the whole sequence.
+You may be surprised since you don't see a sequence of the registers. Well, we are always interested only in the filled register with the largest ID, so we don't store the whole sequence.

 The API consist of two functions

diff --git a/_posts/2015-08-24-dpp-variable.markdown b/_posts/2015-08-24-dpp-variable.markdown
index 44e28526..b63b838c 100644
--- a/_posts/2015-08-24-dpp-variable.markdown
+++ b/_posts/2015-08-24-dpp-variable.markdown
@@ -1,21 +1,21 @@
 ---
 layout: post
-title: rystsov::dynamic distributed register
-name: Dynamic Distributed Register
-tags: ["misc"]
+title: rystsov::distributed register for the dynamic environment
+name: Distributed register for the dynamic environment
+tags: ["distr"]
 desc: "Updates distributed CAS register to work in the dynamic environment"
 has_comments: true
 ---
-
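Before moving on to the dynamic-environment post below, here is a rough, self-contained sketch of the shared-promise idea from the hunk above: the acceptor keeps a single promise and only the latest filled register, and a client-side `cas` fills the next register only if the current value matches the expected one. The names and message shapes are my own illustration, not the post's actual gist.

```python
# A toy sketch of the mutable register: a sequence of registers that share one
# promise, of which only the latest filled one is stored. In-memory only.

class Acceptor:
    def __init__(self):
        self.promise = 0    # one promise shared by all registers
        self.reg_id = 0     # ID of the latest filled register
        self.value = None   # its value

    def prepare(self, reg_id):
        if reg_id <= self.promise:
            return ("fail", self.promise)
        self.promise = reg_id
        return ("ok", self.reg_id, self.value)

    def accept(self, reg_id, value):
        # Refuse if a newer register was promised or filled in the meantime:
        # this is what forbids "modification of the past".
        if reg_id < self.promise or reg_id <= self.reg_id:
            return ("fail",)
        self.reg_id = reg_id
        self.value = value
        return ("ok",)


def cas(acceptors, reg_id, expected, new_value):
    """Fill register `reg_id` with `new_value` iff the current value == expected."""
    quorum = len(acceptors) // 2 + 1
    promises = [a.prepare(reg_id) for a in acceptors]
    oks = [p for p in promises if p[0] == "ok"]
    if len(oks) < quorum:
        raise Exception("a newer register exists, retry with a larger ID")
    # Current value = value of the filled register with the largest ID.
    current = max(oks, key=lambda p: p[1])[2]
    if current != expected:
        raise Exception("CAS conflict: current value is %r" % (current,))
    acks = [a.accept(reg_id, new_value) for a in acceptors]
    if sum(1 for ack in acks if ack[0] == "ok") < quorum:
        raise Exception("accept phase failed, retry")
    return new_value


if __name__ == "__main__":
    nodes = [Acceptor() for _ in range(3)]
    cas(nodes, reg_id=1, expected=None, new_value="v1")
    cas(nodes, reg_id=2, expected="v1", new_value="v2")
```

The `reg_id <= self.reg_id` check in `accept` is the serializability guard: once a register with a higher ID is filled, the older registers can no longer be written.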

Dynamic Distributed Register

+

Distributed register for the dynamic environment

-Distributed register that we designed in the [previous post]({% post_url 2015-08-23-paxos-variable %}) has an obvious weakness: it works only with the static set of nodes. We know that hardware malfunction are unavoidable and eventually all the nodes will be broken. We need a mechanism to add (remove) nodes to the cluster to let the register to outlive the nodes it runs on. The [Dynamic Plain Paxos]({% post_url 2015-07-19-dynamic-plain-paxos %}) article explains how we can do it with plain paxos, we need to adapt it to use with the register.
+The distributed register that we designed in the [previous post]({% post_url 2015-08-23-paxos-variable %}) has an obvious weakness: it works only with a static set of nodes. We know that hardware malfunctions are unavoidable and eventually all the nodes will be broken. So we need a mechanism to add (or remove) nodes to the cluster to let the register outlive the nodes it runs on. The [Dynamic Plain Paxos]({% post_url 2015-07-19-dynamic-plain-paxos %}) article explains how we can do it with plain Paxos. We need to adapt it for use with the register.

-Lets take a look on the coordinator which manages the process of changing the membership. Compared to the the article we don't use filters. Filters were added to make the proof understandable but if you think about them then you will notice that a system with filters behaves the same way as a system without them so we can omit them during the implementation.
+Let's take a look at the coordinator which manages the process of changing the membership. Compared to the article mentioned above we don't use filters. Filters were added to make the proof understandable, but if you think about them you will notice that a system with filters behaves the same way as a system without them, so we can omit filters during the implementation.

 {% gist rystsov/0644f6cf7b45f5a72e81 %}

-As you can see to change membership we execute several simple steps like increasing the quorum size or altering the set of nodes. But we can't think that those simple changes happen instantly since the old value may be used in the running _read_write_read operation. It makes sense to wait until all running operations with old value are finished. This is implemented via the era variable and the _epic_change method.
+As you can see, to change the membership we execute several simple steps like increasing the quorum size or altering the set of nodes. But we can't assume that those simple changes happen instantly, since the old value may still be used by a running '_read_write_read' operation. It makes sense to wait until all running operations that use the old value are finished. This is implemented via the 'era' variable and the '_epic_change' method.

 The acceptors methods are left intact.

diff --git a/_posts/2015-08-26-sharded-paxos.markdown b/_posts/2015-08-26-sharded-paxos.markdown
index 1b24f343..78755459 100644
--- a/_posts/2015-08-26-sharded-paxos.markdown
+++ b/_posts/2015-08-26-sharded-paxos.markdown
@@ -2,19 +2,43 @@
 layout: post
 title: rystsov::Paxos-based sharded ordered key value store with CAS
 name: Paxos-based sharded ordered key value store with CAS
-tags: ["misc"]
-desc: "How to run an instance of paxos-register per per key to get a key value storage and shard it on the fly without loosing consistency"
+tags: ["distr"]
+desc: "How to run an instance of the Paxos register per key to build a key value storage and how to shard it on the fly without losing consistency"
 has_comments: true
 ---
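As a side note before the sharded key/value post below: the era gate described above is only sketched in prose, so here is one way such a gate might look. The names `era` and `epic_change` are borrowed from the post, but the body is my own guess at the mechanism, not the gist's actual code.

```python
# A toy, single-process illustration of "wait until operations started under
# the old configuration finish". A real version talks to remote acceptors and
# persists its state; this one only shows the era bookkeeping.

import threading

class Coordinator:
    def __init__(self, config):
        self.config = config          # current set of nodes / quorum size
        self.era = 0                  # bumped on every membership change step
        self.in_flight = {0: 0}       # era -> number of running operations
        self.lock = threading.Condition()

    def start_operation(self):
        # Every read/write captures the era (and config) it started under.
        with self.lock:
            era = self.era
            self.in_flight[era] += 1
            return era

    def finish_operation(self, era):
        with self.lock:
            self.in_flight[era] -= 1
            self.lock.notify_all()

    def epic_change(self, new_config):
        # Switch to the new configuration, then block until every operation
        # that started under the old era has finished.
        with self.lock:
            old_era = self.era
            self.era += 1
            self.in_flight[self.era] = 0
            self.config = new_config
            while self.in_flight[old_era] > 0:
                self.lock.wait()


if __name__ == "__main__":
    c = Coordinator(config={"nodes": ["a", "b", "c"], "quorum": 2})
    era = c.start_operation()          # an operation using the old config
    c.finish_operation(era)
    c.epic_change({"nodes": ["a", "b", "c", "d"], "quorum": 3})
    print(c.era, c.config)
```

Every operation records the era it started under; `epic_change` switches the configuration and then drains the old era's counter, so no operation ever mixes the old and the new configuration.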

Paxos-based sharded ordered key value store with CAS

-
+To build a key value storage, each node runs a new instance of the Paxos register each time it receives a get or put request for an unseen key and redirects all further requests related to that key to the register. Of course, different instances on the same node should share the quorum size and the set of nodes and use a unified membership change mechanism which generalises the Paxos variable's version to work with a dynamic set of Paxos registers. For example, its 2n+1 to 2n+2 transition is (sketched in code below):
+1. Increase the quorum size.
+2. Generate an event on each node.
+3. Fetch all keys from the nodes up to the event, union them and sync all of them. Of course this operation may be optimized by batching and parallel processing.
+4. Add a new node to the set of nodes.
-
+Up to this point we have designed a replicated key/value storage with strong consistency and a per-key compare-and-set concurrency control primitive. Since the whole dataset fits on one node, nothing prevents us from maintaining an order on the keys and supporting range queries.
-
+The next step is to support sharding.
-
+
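Here is the toy sketch of the 2n+1 to 2n+2 transition referred to above. `Cluster`, `keys_up_to_event` and `sync_key` are in-memory stand-ins of my own invention for the per-key Paxos registers, the per-node event and the RPCs; they are not the posts' actual code.

```python
# A toy walk-through of the 2n+1 -> 2n+2 membership transition, step by step.

class Cluster:
    def __init__(self, nodes, quorum):
        self.nodes = list(nodes)
        self.storage = {n: {} for n in nodes}  # node -> {key: register value}
        self.quorum = quorum

    def keys_up_to_event(self, node):
        # Stand-in for step 2: the keys a node has seen up to the event.
        return set(self.storage[node].keys())

    def sync_key(self, key):
        # Stand-in for re-running the key's register with the new quorum so
        # enough nodes of the enlarged cluster store its value.
        values = [s[key] for s in self.storage.values() if key in s]
        for s in self.storage.values():
            s[key] = values[0]


def add_node(cluster, new_node):
    # 1. Increase the quorum size.
    cluster.quorum += 1
    # 2-3. Collect the keys every node has seen up to the event, union them,
    #      and sync each one (batching and parallelism omitted).
    keys = set()
    for node in cluster.nodes:
        keys |= cluster.keys_up_to_event(node)
    for key in keys:
        cluster.sync_key(key)
    # 4. Only now add the new node to the set of nodes.
    cluster.nodes.append(new_node)
    cluster.storage[new_node] = {}


if __name__ == "__main__":
    c = Cluster(["a", "b", "c"], quorum=2)   # a 2n+1 cluster with n = 1
    c.storage["a"]["x"] = "1"
    c.storage["b"]["x"] = "1"
    add_node(c, "d")                         # now a 2n+2 cluster
    print(c.quorum, c.nodes)
```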

Sharding

-
+Imagine a key value storage that lives on three nodes.
+
+
+
+Some day, because of storage usage or high load, we will decide to split the storage. So we pick a key from the key space and split the key/value storage into two logical groups. The first group (A) contains the keys less than or equal to the chosen key and the second group (B) contains the rest.
+
+
+
+Then we add a node to the B group.
+
+
+
+And remove a node from it.
+
+
+
+And repeat the process until we get the A and B clusters working on different sets of nodes.
+
+
+
+As a result we split the storage without stopping the cluster or losing consistency.
diff --git a/css/common.css b/css/common.css
index 4e2b6d50..d929e3dd 100644
--- a/css/common.css
+++ b/css/common.css
@@ -9,6 +9,10 @@ a {
   text-decoration: none;
 }
 
+.anchor {
+  background: none;
+}
+
 h1,h2,h3,p {
   margin: 0;
 }
@@ -24,6 +28,11 @@ h1,h2,h3,p {
   font-size: 90%;
 }
 
+.key_value_design {
+  margin-top: 1em;
+  margin-bottom: 1em;
+}
+
 .abstract-center {
   max-width: 600px;
   margin-right: auto;
diff --git a/css/post.css b/css/post.css
index 37eda1cb..3a4e7cb1 100644
--- a/css/post.css
+++ b/css/post.css
@@ -30,3 +30,9 @@ h3 {
 #low_internet {
   margin-top: 2em;
 }
+
+.sharded-paxos-pic {
+  margin-left: auto;
+  margin-right: auto;
+  display: block;
+}
diff --git a/images/sharded-paxos-1.png b/images/sharded-paxos-1.png
new file mode 100644
index 00000000..5434b7ed
Binary files /dev/null and b/images/sharded-paxos-1.png differ
diff --git a/images/sharded-paxos-2.png b/images/sharded-paxos-2.png
new file mode 100644
index 00000000..fe218281
Binary files /dev/null and b/images/sharded-paxos-2.png differ
diff --git a/images/sharded-paxos-3.png b/images/sharded-paxos-3.png
new file mode 100644
index 00000000..5606e128
Binary files /dev/null and b/images/sharded-paxos-3.png differ
diff --git a/images/sharded-paxos-4.png b/images/sharded-paxos-4.png
new file mode 100644
index 00000000..a440010c
Binary files /dev/null and b/images/sharded-paxos-4.png differ
diff --git a/images/sharded-paxos-5.png b/images/sharded-paxos-5.png
new file mode 100644
index 00000000..b9c29720
Binary files /dev/null and b/images/sharded-paxos-5.png differ
diff --git a/index.html b/index.html
index ea816dbb..3aebdc7f 100644
--- a/index.html
+++ b/index.html
@@ -11,6 +11,20 @@

Blog

+
+

Designing a distributed key/value storage

+
A series of posts describing how to design a distributed key/value storage
+ +
+

Misc