36 commits
928e79a
adds sharded paxos pics
Aug 27, 2015
ac10e58
Merge branch 'master' of github.com:rystsov/rystsov.github.io
Aug 27, 2015
134bb97
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
fddaf99
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
fe88132
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
c6dda55
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
f355b51
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
d53b9e5
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
bb973c7
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
b7fecc0
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
bf0cdf6
Update post.css
rystsov Aug 27, 2015
63463d2
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
9b6486f
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
d8c2a05
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
8599b86
Update index.html
rystsov Aug 27, 2015
ccfb1c6
Update index.html
rystsov Aug 27, 2015
21aa63a
Update 2015-08-24-dpp-variable.markdown
rystsov Aug 27, 2015
6433dba
Update 2015-08-23-paxos-variable.markdown
rystsov Aug 27, 2015
db694dd
Update 2015-08-22-paxos-register.markdown
rystsov Aug 27, 2015
fbc49b5
Update index.html
rystsov Aug 27, 2015
988a785
Update 2015-08-22-paxos-register.markdown
rystsov Aug 27, 2015
e5f7ca5
Update 2015-08-22-paxos-register.markdown
rystsov Aug 27, 2015
77dbefc
Update 2015-08-23-paxos-variable.markdown
rystsov Aug 27, 2015
c10644f
Update 2015-08-23-paxos-variable.markdown
rystsov Aug 27, 2015
c84daac
Update 2015-08-23-paxos-variable.markdown
rystsov Aug 27, 2015
1cb3b55
Update 2015-08-24-dpp-variable.markdown
rystsov Aug 27, 2015
0e42147
Update 2015-08-26-sharded-paxos.markdown
rystsov Aug 27, 2015
6a4c5d9
Update index.html
rystsov Aug 27, 2015
9017ea8
Update common.css
rystsov Aug 27, 2015
103da70
Update index.html
rystsov Aug 27, 2015
bc43c58
Update index.html
rystsov Aug 27, 2015
d9e9ae0
Update common.css
rystsov Aug 27, 2015
ca31478
Update 2015-08-22-paxos-register.markdown
kosiachenko Aug 28, 2015
a437a6e
Update 2015-08-23-paxos-variable.markdown
kosiachenko Aug 28, 2015
f21e9e5
Update 2015-08-24-dpp-variable.markdown
kosiachenko Aug 28, 2015
8ac1a75
Update 2015-08-26-sharded-paxos.markdown
kosiachenko Aug 28, 2015
16 changes: 8 additions & 8 deletions _posts/2015-08-22-paxos-register.markdown
@@ -1,25 +1,25 @@
---
layout: post
title: rystsov::The Paxos Register
name: The Paxos Register
tags: ["misc"]
title: rystsov::Write-once distributed register
name: Write-once distributed register
tags: ["distr"]
desc: "Design of a write-once distributed fault-tolerance register"
has_comments: true
---

<h1>The Paxos Register</h1>
<h1>Write-once distributed register</h1>

Imagine a service that provides an API to read and to write a write-once variable with the following properties:
Imagine a service that provides an API to read and to write a write-once register (a register that can be set only once) with the following properties (a single-node sketch follows the list):

1. If a write request succeeds then any subsequent write should fail.
2. If a value is written then any subsequent read must return the value or fail.
3. Any read request must return an empty value until a value is written.
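
A purely illustrative single-node sketch (the class and method names are mine, not the post's) that satisfies the three properties:

```python
class WriteOnceRegister:
    """Single-node write-once register illustrating the three properties."""
    def __init__(self):
        self.value = None
        self.written = False

    def write(self, value):
        # Property 1: once a write succeeds, every later write fails.
        if self.written:
            raise Exception("already written")
        self.value = value
        self.written = True

    def read(self):
        # Property 3: empty (None) until written; property 2: afterwards
        # every read returns the written value.
        return self.value


r = WriteOnceRegister()
assert r.read() is None   # empty before any write
r.write(42)
assert r.read() == 42     # the written value is returned
try:
    r.write(43)           # a second write must fail
except Exception:
    pass
```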

The implementation of the service is straightforward when we plan to run it on a single node. But any single node solution has a reliability/availability problem: when the node goes off-line (due to crash or maintenance procedures) our service is unavailable. The obvious solution is to run a service on several nodes to achieve high availability. At that point we enter the realm of distributed systems and everything (including our service) gets harder. Try to think how to satisfy the desired properties in a distributed environment (where nodes may temporary go offline and where network may lose messages) and you will see that consistency is a tough problem.
The implementation of the service is straightforward when we plan to run it on a single node. But any single-node solution has a reliability/availability problem: when the node goes offline (due to a crash or maintenance procedures) our service is unavailable. The obvious solution is to run the service on several nodes to achieve high availability. At that point we enter the realm of distributed systems and everything (including our service) gets more complicated. Try to think about how to satisfy the desired properties listed above in a distributed environment (where nodes may temporarily go offline and where the network may lose messages) and you will see that consistency is a tough problem.

Hopefully this task is already solved by the paxos consensus algorithm. People usually use paxos to build way more complex systems than the described service (register) but the algorithm is also applicable here. Moreover the fact that the register is one of the simplest distributed system that can be built based on paxos makes it is a good entry point for the engineers who what to understand paxos on an example.
Fortunately, this task is already solved by the Paxos consensus algorithm. People usually use Paxos to build way more complex systems than the described service, but the algorithm is also applicable here. Moreover, the fact that the register is one of the simplest distributed systems that can be built on Paxos makes it a good entry point for engineers who want to understand Paxos through an example.

The description of the register is written in a python inspired pseudocode. I assume that every write to an instance variable is intercepted, redirected to a persistent storage and fsync-ed. Of course every read from an instance variable reads from the storage. The pseudocode follows the algorithm described in the [Paxos Made Simple](http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) paper by Leslie Lamport, please read it to understand the idea behind the code and see the proof.
The description of the register is written in Python-inspired pseudocode. I assume that every write to an instance variable is intercepted, redirected to persistent storage and fsync-ed. Of course every read from an instance variable reads from the storage. The pseudocode follows the algorithm described in the [Paxos Made Simple](http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) paper by Leslie Lamport; please read it to understand the idea behind the code and see the proof.

Acceptors:
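
The acceptor pseudocode itself is below the fold of this diff. Purely as a hedged illustration of the single-decree algorithm from the Paxos Made Simple paper — not the post's actual code — an acceptor could be sketched like this:

```python
class Acceptor:
    """Illustrative single-decree Paxos acceptor. As the post assumes, every
    write to these fields would be persisted and fsync-ed before replying."""
    def __init__(self):
        self.promise = 0            # highest ballot number promised so far
        self.accepted_ballot = 0    # ballot of the last accepted value
        self.accepted_value = None  # last accepted value (None = empty)

    def prepare(self, ballot):
        # Phase 1: refuse old ballots, otherwise promise not to accept
        # anything below `ballot` and report what was accepted so far.
        if ballot <= self.promise:
            return ("fail", self.promise)
        self.promise = ballot
        return ("ok", self.accepted_ballot, self.accepted_value)

    def accept(self, ballot, value):
        # Phase 2: accept the value unless a higher ballot was promised
        # after the corresponding prepare.
        if ballot < self.promise:
            return ("fail", self.promise)
        self.promise = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return ("ok",)
```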

20 changes: 10 additions & 10 deletions _posts/2015-08-23-paxos-variable.markdown
@@ -1,25 +1,25 @@
---
layout: post
title: rystsov::The Paxos Variable
name: The Paxos Variable
tags: ["misc"]
desc: "How to update the registor into a distributed fault-tolerance variable with compare-and-set (CAS) concurrency control mechanism"
title: rystsov::Distributed register
name: Distributed register
tags: ["distr"]
desc: "How to update the write-once registor into a distributed fault-tolerance mutable register with the compare-and-set (CAS) concurrency control mechanism"
has_comments: true
---

<h1>The Paxos Variable</h1>
<h1>Distributed register</h1>

In the previous [article]({% post_url 2015-08-22-paxos-register %}) I showed how to create a write-once distributed register. But in real life we usually want to mutate state. So Let's fix it and design a distributed paxos based variable.
In the previous [article]({% post_url 2015-08-22-paxos-register %}) I showed how to create a write-once distributed register. But in real life we usually want to mutate state, so let's fix that and design a mutable register.

Distributed variable is a little bit more complex system than the register but it uses the same idea so if you understand the register then it should be easy to you to understand the variable's design.
A distributed mutable register is a slightly more complex system than the write-once register, but it uses the same idea, so if you understand the write-once register it should be easy for you to understand the current design.

The basic idea is to use a sequence of ordered registers to emulate a variable. For example to write a value we write it to the register with the lowest id among empty registers. To read a value we read it from the register with the highest id among filled registers.
The basic idea is to use a sequence of ordered registers to emulate a variable. For example, to write a value we write it to the register with the lowest ID among the empty registers. To read a value we read it from the register with the highest ID among the filled registers.
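
Purely as an illustration of this idea (the helper names are assumptions, not the post's code), using the read/write interface of a write-once register:

```python
def write(registers, value):
    # Write to the register with the lowest ID among the empty ones;
    # `registers` is assumed to be ordered by ID.
    for reg in registers:
        if reg.read() is None:
            reg.write(value)
            return
    raise Exception("no empty register left")

def read(registers):
    # Read from the register with the highest ID among the filled ones.
    for reg in reversed(registers):
        value = reg.read()
        if value is not None:
            return value
    return None  # nothing has been written yet
```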

Distributed system is by definition is a concurrent system too so its very important to provide concurrency control mechanisms like a compare-and-set (CAS) primitive. To support CAS primitive we must prevent modifications of the registers if a register with a higher is has already chosen a value. We can achieve it if maintain serializability between any successful register's modifications. Hopefully we can do it if make the acceptors to use the same promise. Let's take a look on the new acceptor's methods.
A distributed system is by definition also a concurrent system, so it's very important to provide concurrency control mechanisms like a compare-and-set (CAS) primitive. To support the CAS primitive we must prevent modification of the past, i.e. editing a register when a register with a higher ID has already chosen a value. We can achieve this by maintaining serializability between all successful register modifications. Fortunately, we can do it if we make the acceptors use the same promise. Let's take a look at the new acceptor's methods.

{% gist rystsov/a45793daa1662a921479 %}
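
The gist isn't rendered in this diff view. As a rough, hypothetical sketch of the idea described above — one promise shared by the whole register sequence, with only the highest-ID filled register stored — the acceptor's methods might look like this (not the gist's actual code):

```python
class Acceptor:
    """Hypothetical sketch: a single promise shared across all registers;
    only the filled register with the largest ID is kept."""
    def __init__(self):
        self.promise = 0           # one promise for the whole register sequence
        self.accepted_id = 0       # ID of the highest filled register seen
        self.accepted_value = None

    def prepare(self, ballot):
        # The shared promise serializes all successful modifications,
        # which is what makes a compare-and-set on top of it possible.
        if ballot <= self.promise:
            return ("fail", self.promise)
        self.promise = ballot
        return ("ok", self.accepted_id, self.accepted_value)

    def accept(self, ballot, reg_id, value):
        # Refuse stale ballots and refuse edits of the past: a register
        # with a higher ID may already hold a chosen value.
        if ballot < self.promise or reg_id <= self.accepted_id:
            return ("fail", self.promise)
        self.promise = ballot
        self.accepted_id = reg_id
        self.accepted_value = value
        return ("ok",)
```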

You may be surprised since you don't see a sequence of the registers. Well, we always are interested only in the filled register with the largest id so we don't store the whole sequence.
You may be surprised that you don't see a sequence of registers. Well, we are always interested only in the filled register with the largest ID, so we don't store the whole sequence.

The API consists of two functions

14 changes: 7 additions & 7 deletions _posts/2015-08-24-dpp-variable.markdown
@@ -1,21 +1,21 @@
---
layout: post
title: rystsov::dynamic distributed register
name: Dynamic Distributed Register
tags: ["misc"]
title: rystsov::distributed register for the dynamic environment
name: Distributed register for the dynamic environment
tags: ["distr"]
desc: "Updates distributed CAS register to work in the dynamic environment"
has_comments: true
---

<h1>Dynamic Distributed Register</h1>
<h1>Distributed register for the dynamic environment</h1>

Distributed register that we designed in the [previous post]({% post_url 2015-08-23-paxos-variable %}) has an obvious weakness: it works only with the static set of nodes. We know that hardware malfunction are unavoidable and eventually all the nodes will be broken. We need a mechanism to add (remove) nodes to the cluster to let the register to outlive the nodes it runs on. The [Dynamic Plain Paxos]({% post_url 2015-07-19-dynamic-plain-paxos %}) article explains how we can do it with plain paxos, we need to adapt it to use with the register.
The distributed register that we designed in the [previous post]({% post_url 2015-08-23-paxos-variable %}) has an obvious weakness: it works only with a static set of nodes. We know that hardware malfunctions are unavoidable and eventually all the nodes will break. So we need a mechanism to add (or remove) nodes to the cluster to let the register outlive the nodes it runs on. The [Dynamic Plain Paxos]({% post_url 2015-07-19-dynamic-plain-paxos %}) article explains how we can do it with plain Paxos; we need to adapt it for use with the register.

Lets take a look on the coordinator which manages the process of changing the membership. Compared to the the article we don't use filters. Filters were added to make the proof understandable but if you think about them then you will notice that a system with filters behaves the same way as a system without them so we can omit them during the implementation.
Let's take a look at the coordinator, which manages the process of changing the membership. Compared to the article mentioned above, we don't use filters. Filters were added to make the proof understandable, but if you think about them you will notice that a system with filters behaves the same way as a system without them, so we can omit filters in the implementation.

{% gist rystsov/0644f6cf7b45f5a72e81 %}

As you can see to change membership we execute several simple steps like increasing the quorum size or altering the set of nodes. But we can't think that those simple changes happen instantly since the old value may be used in the running _read_write_read operation. It makes sense to wait until all running operations with old value are finished. This is implemented via the era variable and the _epic_change method.
As you can see, to change membership we execute several simple steps like increasing the quorum size or altering the set of nodes. But we can't assume that those simple changes happen instantly, since the old value may still be used by a running '_read_write_read' operation. It makes sense to wait until all operations running with the old value are finished. This is implemented via the 'era' variable and the '_epic_change' method.
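
Since the gist isn't shown here, the following is only a rough sketch of the era idea (everything except the names 'era' and '_epic_change' is an assumption, not the post's code): a change is applied, the era is bumped, and the change counts as done only once every operation started in the old era has finished.

```python
import threading
import time

class Coordinator:
    """Hypothetical sketch of the era mechanism, not the post's gist."""
    def __init__(self, config):
        self.lock = threading.Lock()
        self.config = config          # e.g. {"nodes": [...], "quorum": 2}
        self.era = 0
        self.in_flight = {0: 0}       # era -> number of running operations

    def begin_operation(self):
        # Each operation runs against a snapshot of the configuration
        # and remembers the era it started in.
        with self.lock:
            era = self.era
            self.in_flight[era] += 1
            return era, dict(self.config)

    def end_operation(self, era):
        with self.lock:
            self.in_flight[era] -= 1

    def _epic_change(self, change):
        # Apply a simple step (e.g. bump the quorum size), open a new era,
        # then wait until every operation from the old era has finished.
        with self.lock:
            change(self.config)
            old_era, self.era = self.era, self.era + 1
            self.in_flight[self.era] = 0
        while True:
            with self.lock:
                if self.in_flight[old_era] == 0:
                    del self.in_flight[old_era]
                    return
            time.sleep(0.01)
```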

The acceptor's methods are left intact.

38 changes: 31 additions & 7 deletions _posts/2015-08-26-sharded-paxos.markdown
@@ -2,19 +2,43 @@
layout: post
title: rystsov::Paxos-based sharded ordered key value store with CAS
name: Paxos-based sharded ordered key value store with CAS
tags: ["misc"]
desc: "How to run an instance of paxos-register per per key to get a key value storage and shard it on the fly without loosing consistency"
tags: ["distr"]
desc: "How to run an instance of paxos register per per key to build a key value storage and how to shard it on the fly without loosing consistency"
has_comments: true
---

<h1>Paxos-based sharded ordered key value store with CAS</h1>

<img src="{{ site.url }}/images/sharded-paxos-1.png" width="250" align="left" />
To build a key value storage, each node runs a new instance of the Paxos register every time it receives a get or put request for an unseen key, and redirects all further requests related to that key to the register. Of course, different instances on the same node should share the quorum size and the set of nodes and use a unified membership change mechanism, which generalises the Paxos variable's version to work with a dynamic set of Paxos registers. For example, the <code>2n+1 to 2n+2</code> transition is (sketched after the list):
1. Increase the quorum size.
2. Generate an event on each node.
3. Fetch all keys from the nodes up to the event, union them and sync all of them. Of course this operation may be optimized by batching and parallel processing.
4. Add a new node to the set of nodes.
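
As an illustration only — every class and helper name below is an assumption, not the post's code — the four steps could be sketched like this, with the per-key sync left as a placeholder:

```python
class Config:
    """Configuration shared by every per-key register on a node (sketch)."""
    def __init__(self, nodes, quorum_size):
        self.nodes = nodes
        self.quorum_size = quorum_size

class Node:
    """Just enough node state to express the transition (hypothetical)."""
    def __init__(self):
        self.keys = set()    # keys this node has started a register for
        self.events = []

    def mark_event(self):
        # Step 2: remember a point in time; keys seen before it get synced.
        self.events.append(set(self.keys))
        return len(self.events) - 1

    def keys_up_to(self, event):
        return self.events[event]

def sync_key(key, nodes):
    # Placeholder: re-read (and re-write) the key's register so its value
    # is known to the new, larger quorum.
    pass

def grow_cluster(config, nodes, new_node):
    """2n+1 -> 2n+2 transition following the four steps listed above."""
    config.quorum_size += 1                       # 1. increase the quorum size
    events = [n.mark_event() for n in nodes]      # 2. generate an event on each node
    keys = set()
    for node, event in zip(nodes, events):        # 3. union the keys seen so far
        keys |= node.keys_up_to(event)
    for key in keys:                              #    and sync each of them
        sync_key(key, nodes)                      #    (may be batched / parallel)
    config.nodes.append(new_node)                 # 4. add the new node to the set
```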

<img src="{{ site.url }}/images/sharded-paxos-2.png" width="250" align="left" />
Up to this point we have designed a replicated key/value storage with strong consistency and a per-key compare-and-set concurrency control primitive. Since the whole dataset fits on one node, nothing prevents us from maintaining an order on the keys and supporting range queries.

<img src="{{ site.url }}/images/sharded-paxos-3.png" width="250" align="left" />
The next step is to support sharding.

<img src="{{ site.url }}/images/sharded-paxos-4.png" width="250" align="left" />
<h2>Sharding</h2>

<img src="{{ site.url }}/images/sharded-paxos-5.png" width="250" align="left" />
Imagine a key value storage that lives on three nodes.

<img src="{{ site.url }}/images/sharded-paxos-1.png" width="450" class="sharded-paxos-pic"/>

Some day, because of storage usage or high load, we will decide to split the storage. So we pick a key from the key space and split the key/value storage into two logical groups: the first group (A) contains the keys less than or equal to the chosen key, and the second group (B) contains the rest.

<img src="{{ site.url }}/images/sharded-paxos-2.png" width="450" class="sharded-paxos-pic"/>

Then we add a node to the B group.

<img src="{{ site.url }}/images/sharded-paxos-3.png" width="500" class="sharded-paxos-pic"/>

And remove a node from it.

<img src="{{ site.url }}/images/sharded-paxos-4.png" width="500" class="sharded-paxos-pic"/>

And repeat the process until we get the A and B clusters working on different sets of nodes.

<img src="{{ site.url }}/images/sharded-paxos-5.png" width="545" class="sharded-paxos-pic"/>

As a result we split the storage without stopping the cluster or losing consistency.
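
To summarize the walkthrough, a purely hypothetical sketch of the node bookkeeping; each append/remove below stands for one dynamic membership change of the kind described in the previous posts, and none of the names come from the post:

```python
def reshard(nodes, split_key, spare_nodes):
    """Split one cluster into groups A and B and migrate B to fresh nodes."""
    group_a = {"keys": lambda k: k <= split_key, "nodes": list(nodes)}
    group_b = {"keys": lambda k: k > split_key,  "nodes": list(nodes)}

    spare = list(spare_nodes)
    # Repeat "add a node to B, remove a shared node from B" until the
    # two groups run on disjoint sets of nodes.
    while set(group_b["nodes"]) & set(group_a["nodes"]):
        group_b["nodes"].append(spare.pop(0))      # grow B by one fresh node
        shared = next(n for n in group_b["nodes"] if n in group_a["nodes"])
        group_b["nodes"].remove(shared)            # then shrink B by a shared node
    return group_a, group_b

# three original nodes, three spare nodes
a, b = reshard(["n1", "n2", "n3"], "m", ["n4", "n5", "n6"])
assert not set(a["nodes"]) & set(b["nodes"])
```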
9 changes: 9 additions & 0 deletions css/common.css
@@ -9,6 +9,10 @@ a {
text-decoration: none;
}

.anchor {
background: none;
}

h1,h2,h3,p {
margin: 0;
}
@@ -24,6 +28,11 @@
font-size: 90%;
}

.key_value_design {
margin-top: 1em;
margin-bottom: 1em;
}

.abstract-center {
max-width: 600px;
margin-right: auto;
6 changes: 6 additions & 0 deletions css/post.css
@@ -30,3 +30,9 @@ h3 {
#low_internet {
margin-top: 2em;
}

.sharded-paxos-pic {
margin-left: auto;
margin-right: auto;
display: block;
}
Binary file added images/sharded-paxos-1.png
Binary file added images/sharded-paxos-2.png
Binary file added images/sharded-paxos-3.png
Binary file added images/sharded-paxos-4.png
Binary file added images/sharded-paxos-5.png
14 changes: 14 additions & 0 deletions index.html
@@ -11,6 +11,20 @@

<div class="section">
<h1>Blog</h1>
<div class="key_value_design">
<h3><a id="distr_key_value" class="anchor">Designing a distributed key/value storage</a></h3>
<div class="abstract">A series of posts describing how to design a distributed key/value storage</div>
<div class="postlinks">
{% for post in site.tags.distr reversed %}
<div class="postlink">
<div class="date">{{ post.date | | date: "%Y-%m-%d" }}</div>
<a href="{{ post.url }}">{{ post.name }}</a>
<div class="desc post_desc">{{ post.desc }}</div>
</div>
{% endfor %}
</div>
</div>
<h3>Misc</h3>
<div class="postlinks">
{% for post in site.tags.misc %}
<div class="postlink">