The Cask cluster decides where to place file replicas via a distributed hashtable (DHT), which is referred to as the Ring in the code (inspiration from Riak).

The best way to explain how it works, I think, is to run through an example of what happens when a file is uploaded to a cluster.

Let's create an imaginary cluster of ten nodes, which we'll call "node0" through "node9". Each node gets a random UUID:

node0: 8e80d8df-2907-4c8e-ad9f-7de423843516
node1: 8a246d7b-cd7f-4b7a-8544-9bc50f4ac8ac
node2: 93b78209-585a-4279-ae98-e679403d9efd
node3: 077bdb1b-c1d4-42d7-af44-c641b0462048
node4: e0613e17-afec-44e6-8020-b07fc5f821d9
node5: 3adb9ceb-c43d-4676-a638-cc524665e295
node6: e0046037-0e76-4132-8ba5-9c4ac7ae74a0
node7: 0a00716a-3908-4948-b010-d43ba872c099
node8: 9cba6a9c-618e-4981-9899-7ef9eed456af
node9: f8b0aa21-bf95-4300-9e84-4bd5848dcc9f

For our example, we'll assume that the cluster is fully connected (all the nodes know about each other, none are failing, all are writable, etc). The cluster is configured to replicate data to three nodes.

Now, a client uploads a simple text file with the content hello\n to node0. At a high level, node0 is going to compute a SHA1 hash of the contents of the file, pick three different nodes, and tell those nodes to store the file, then give the client back the SHA1 hash so it can retrieve it later.

So, the first thing that node0 does is compute the SHA1 hash of hello\n, which is "f572d396fae9206628714fb2ce00f72e94f2258f". For the rest of this document, just remember that it starts with "f5".
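
Computing that hash is nothing exotic. A minimal Go sketch of the hashing step (not Cask's actual upload handler):

package main

import (
	"crypto/sha1"
	"fmt"
)

func main() {
	// SHA1 of the file contents "hello\n"
	sum := sha1.Sum([]byte("hello\n"))
	fmt.Printf("%x\n", sum) // f572d396fae9206628714fb2ce00f72e94f2258f
}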

Next, it has to pick three nodes to send that file to. This is where the Ring comes in and is the focus of this example.

Before going into the algorithm, let's remember what the goal is. A Cask cluster is completely distributed and masterless; the goal is to achieve high availability without a single point of failure. A file could be uploaded to one node in the cluster and then another node could get a request asking for that file back, so all the nodes in the cluster need to be able to figure out where any given file's copies are going to be located. This needs to work reasonably quickly with potentially millions and millions of files.

A more traditional system might have some kind of central database that stores the mapping of which files are stored on which nodes (and would probably also be responsible for making those assignments in the first place). That can be very efficient, but then that central database becomes a single point of failure. If it breaks, can't be reached, or gets overloaded with requests, the whole cluster is useless. Avoiding that single point of failure leads into complex territory, with attempts to copy the database between nodes and keep it up to date. Frankly, doing that correctly is a harder problem than anything Cask attempts.

The approach of a distributed hashtable like the one Cask uses is to not keep any database of locations at all. Instead, there's an algorithm that, given the key you want to locate or store and some basic information about the cluster, returns a list of candidate nodes. Any node that has roughly the same basic information about the cluster and uses the same algorithm will always return the same results. No other data needs to be shared or synchronized.

Of course, it's also important that the algorithm distributes files fairly evenly between the nodes. You could use the algorithm "always put stuff on node0" and that would be easy to implement, would certainly return the same result no matter which node you're running it on, and wouldn't have any tricky synchronization problems, but your cluster would be quite unbalanced.

The Ring exists as a data structure inside Cask, which you can think of as a list of (<SHA1>, <node>) tuples. That is, each tuple has a SHA1 hash and a node that is associated with it. The tuples are sorted in ascending order by the SHA1 hashes. To pick nodes for a given hash to map to, Cask goes through the list until it hits the first tuple with a SHA1 hash that is greater than or equal to the target hash. The node from that tuple and the next two tuples are the three that it will use. Of course, it wraps back around to the beginning if necessary, which is why we think of it as a "ring".

The actual implementation has a few more important details. First, rather than one tuple per node, there are 16 for each node. The SHA1 hash for each tuple is generated by appending 0-16 to the node's UUID and taking the SHA1 of that string. Then, of course, rather than taking the node from the first three tuples >= the target hash, it takes the first three unique nodes after the target hash.
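
To make that concrete, here's a minimal, self-contained sketch of how such a ring could be built. This is illustrative rather than Cask's actual code; the ringEntry field names, the entriesPerNode constant, and the exact suffix range are assumptions:

package main

import (
	"crypto/sha1"
	"fmt"
	"sort"
	"strconv"
)

// Illustrative ring entry: a hex-encoded SHA1 plus the node it maps to.
type ringEntry struct {
	hash string
	node string
}

const entriesPerNode = 16 // 16 entries per node, as described above

// entriesFor generates a node's ring entries by hashing its UUID with
// an integer suffix appended (suffix range assumed here to be 0..15).
func entriesFor(name, uuid string) []ringEntry {
	entries := make([]ringEntry, 0, entriesPerNode)
	for i := 0; i < entriesPerNode; i++ {
		sum := sha1.Sum([]byte(uuid + strconv.Itoa(i)))
		entries = append(entries, ringEntry{hash: fmt.Sprintf("%x", sum), node: name})
	}
	return entries
}

// buildRing collects every node's entries and sorts them by hash.
func buildRing(nodes map[string]string) []ringEntry {
	var ring []ringEntry
	for name, uuid := range nodes {
		ring = append(ring, entriesFor(name, uuid)...)
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })
	return ring
}

func main() {
	nodes := map[string]string{
		"node0": "8e80d8df-2907-4c8e-ad9f-7de423843516",
		// ...the other nine nodes would be listed here
	}
	for _, e := range buildRing(nodes) {
		fmt.Println(e.hash, e.node)
	}
}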

OK, back to the example. 16 tuples per node means that there are 160 tuples in our Ring for a 10-node cluster. I won't show them all, but let's look at enough to get a feel for it.

For example, node0, with UUID "8e80d8df-2907-4c8e-ad9f-7de423843516", will get the following:

"8e80d8df-2907-4c8e-ad9f-7de423843516" + "0" -> 0e44a6e82d396479b5da1ecdfc36f5c77a884273
"8e80d8df-2907-4c8e-ad9f-7de423843516" + "1" -> ab8d6c9aff42523573fe9b7f62a1db48531feb15
"8e80d8df-2907-4c8e-ad9f-7de423843516" + "2" -> 08ce49815e87894b58c65af81cf6a3bdcb7aba7d
"8e80d8df-2907-4c8e-ad9f-7de423843516" + "3" -> 03635af2bc377536b4edb2da0f6497865c61ad5f
...
"8e80d8df-2907-4c8e-ad9f-7de423843516" + "16" -> 84350b82446f0fdb45e9d06d1b1c940f134017ac

Each of the other nodes gets a similar list of 16 hashes, calculated in the same manner. They all go into one big list of tuples like:

[(0e44a6e82d396479b5da1ecdfc36f5c77a884273, <node0>),
 (ab8d6c9aff42523573fe9b7f62a1db48531feb15, <node0>),
 (08ce49815e87894b58c65af81cf6a3bdcb7aba7d, <node0>),
 (03635af2bc377536b4edb2da0f6497865c61ad5f, <node0>),
 ...
 (84350b82446f0fdb45e9d06d1b1c940f134017ac, <node0>),
 (c2a9cb36f0de63aba8db15fd09f60b8021d5a278, <node1>),
 ...
 (e8620a99726758083c046ba63840357b5170b283, <node9>)]

Then it's sorted by the hash component of each tuple:

[(01dde49993a19d42920e7b011d7b6916a92ad47d, <node8>),
 (01e5114d0ee8ea574caecd8a09cbc7a5de7abbb7, <node5>),
 (03635af2bc377536b4edb2da0f6497865c61ad5f, <node0>),
 ...
 (ef838d000c2948f0310e2c55ac741627379eb049, <node2>),
 (fa191b7f6ceff37a8460b1ececd969a29980fa02, <node4>),
 (fbed55f34426071c68c740febb214657159ffa04, <node8>),
 (fc72a18a73fda385c808caa3e3166e452809e333, <node9>),
 (fc7c5193853fb0c93f96fb93f99fd4e9437b65fa, <node2>),
 (fca9a9c05b2b18177a960a926a0fe292cf4a9ed4, <node7>),
 (fe56ac541d7844d05f9dfbe1d9fa8f021da67a6d, <node6>)]

With our original target hash of "f572d396fae9206628714fb2ce00f72e94f2258f", we scan through that list until we hit an entry with a hash greater than or equal to it and take the next three unique nodes from that point. So:

[(fa191b7f6ceff37a8460b1ececd969a29980fa02, <node4>),
 (fbed55f34426071c68c740febb214657159ffa04, <node8>),
 (fc72a18a73fda385c808caa3e3166e452809e333, <node9>)]

(It happens in this example that there were no duplicate nodes right in that stretch, but that's certainly not guaranteed. If there had been, some intermediate entries would've just been ignored).

So copies of our uploaded file will get sent off to node4, node8, and node9.

Later, when a client asks node0 to retrieve the file associated with the hash "f572d396fae9206628714fb2ce00f72e94f2258f", it will perform the same search and come up with the same three nodes, and it will request the file from each of them in turn until one hands it back.
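
Both the write path and the read path depend on that same selection step. Building on the illustrative ringEntry type from the sketch above, picking the nodes for a hash might look roughly like this (again an assumption-laden sketch rather than Cask's actual code; it uses sort.Search from the standard library):

// nodesFor returns the first n distinct nodes at or after the target
// hash, wrapping around the end of the (sorted) ring if necessary.
func nodesFor(ring []ringEntry, target string, n int) []string {
	// index of the first entry whose hash is >= the target hash
	start := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= target })
	seen := make(map[string]bool)
	var picked []string
	for i := 0; i < len(ring) && len(picked) < n; i++ {
		e := ring[(start+i)%len(ring)] // wraps back to index 0 past the end
		if !seen[e.node] {
			seen[e.node] = true
			picked = append(picked, e.node)
		}
	}
	return picked
}

Against the example ring above, nodesFor(ring, "f572d396fae9206628714fb2ce00f72e94f2258f", 3) would come back with node4, node8, and node9.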

Why all this complexity? Surely there must be a simpler way to take an uploaded file and pick three nodes out of ten to store it to.

In a world where the cluster never changes, that is certainly true. You could just use a simple modulo and calculate

index = target_hash % number_of_nodes

and then use node[index], node[index+1], and node[index+2].
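
As a sketch, that naive scheme could be written like this, treating the hex-encoded hash as one big integer with math/big (purely for illustration; nothing like this exists in Cask):

// naiveIndex maps a hex-encoded hash to a node index with a simple modulo.
func naiveIndex(hexHash string, numNodes int) int {
	h, _ := new(big.Int).SetString(hexHash, 16) // parse the SHA1 as a 160-bit integer
	return int(new(big.Int).Mod(h, big.NewInt(int64(numNodes))).Int64())
}
// replicas would then live on node[index], node[(index+1)%numNodes],
// and node[(index+2)%numNodes]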

The downside to that approach is that if a node is added to or removed from the cluster, the index you calculate for a given hash changes completely. In other words, the ideal distribution of replicas shifts dramatically.

For example, with our target hash and ten nodes, you get (% 10) index = 1, so you would store it to nodes 1, 2, and 3. But then, if node5 caught fire and dropped out of the cluster, leaving only nine nodes, a later retrieve request for that hash would calculate (% 9) index = 5 and the cluster would ask nodes 5, 6, and 7 for it. None of them would have it, and a lot of time would be wasted talking to nodes that don't have the file. Similarly, if, instead of node5 going away, another node were added to the cluster in the meantime, making eleven nodes, you now get (% 11) index = 9 and the cluster tries to retrieve from nodes 9, 0, and 1. Not as bad, but still two wasted requests, and other keys will fare much worse.

Using the ring approach, those situations don't turn out nearly as badly. Node5 disappearing from our cluster in the above example would mean that our Ring would now have 160 - 16 = 144 entries. But it would be exactly the same list, just without any of the node5 tuples. Our search through it for the target hash would turn up the exact same results. Even if the node that disappeared were one of the ones in our shortlist, it would be handled as well as could possibly be expected (i.e., node8 or node9 disappearing wouldn't matter because we'd be asking node4 first and getting the file back; if node4 disappeared, we'd ask node8, and not have to fall back to node9 or node2).

The worst that could happen is that another node is added to the cluster and has just the right tuple associated with it to insert itself right before node4. In that case, we would ask that new node for the file first, get a negative response from it, and fall back to node4 and node8. One wasted request. When a new node is added to the cluster, that will happen for some percentage of subsequent requests. But on a dynamic cluster, that's basically the best you can do without actually maintaining a list of exactly which nodes have stored which files and keeping that list updated and synchronized across every node of the cluster (or stored in a central spot that becomes a single point of failure for the entire cluster).

As long as cluster "churn" (the rate at which nodes join or leave) is kept relatively low, Cask's Active Anti-Entropy and Read Repair systems will further minimize the inefficiencies.

OK, there's actually still a little more to understand about Cask's implementation. A Cask cluster is generally going to be relatively small. Even with a 100-node cluster, that's only 1600 entries in the Ring, which isn't much memory. So Cask can afford to be a little wasteful. Instead of getting just the first N entries from the Ring for a given hash, Cask actually returns a list of every known node, sorted in decreasing likelihood of having a copy of the data we're looking for. That means that when a node is handling an upload, if one of the first N nodes fails, it can just keep going down that list instead of having to go back to the Ring and find the N+1th node, and so on.

That behavior is what ties the whole Ring implementation together. What really happens is that the full Ring of tuples is constructed; then, when we search for a given hash, we find the index in the Ring where the hashes cross from less than the target to greater than or equal to it. The list is split at that index and a new list is made with the segment at and after that index at the front and the segment that came before it appended after it. That list is then de-duplicated by node and becomes the sorted search list.

Let's go back to our example. In our case, it would be index = 156, where the keys go from "ef838d00..." to "fa191b7...", which are on either side of our target. So ring[156:] (in Python-ish slice syntax) becomes the front of the new list and ring[:156] gets put after it. Go's slices make this an efficient operation. The code that does it is essentially:

// rotate: entries at/after partitionIndex first, then the rest
reordered := make([]ringEntry, 0, len(ring))
reordered = append(reordered, ring[partitionIndex:]...)
reordered = append(reordered, ring[:partitionIndex]...)
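
The remaining step, collapsing that rotated list down to one entry per node, could look something like this sketch (using the same illustrative ringEntry fields as before, not the actual Cask code):

// uniqueNodes keeps only the first entry seen for each node, preserving
// order, so the result is every known node sorted by decreasing
// likelihood of holding the data.
func uniqueNodes(reordered []ringEntry) []string {
	seen := make(map[string]bool)
	var nodes []string
	for _, e := range reordered {
		if !seen[e.node] {
			seen[e.node] = true
			nodes = append(nodes, e.node)
		}
	}
	return nodes
}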