
What Can Possibly Go Wrong?


When it comes down to it, what makes distributed systems much more difficult (and interesting, IMO) than monolithic systems is all the new, exciting ways that things can break.

The core of any given problem is usually one of the items in The Fallacies of Distributed Computing:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Let's look at which of these problems Cask attempts to deal with, which ones it doesn't, and what else could go wrong with a Cask cluster.

First, "The network is reliable" and "Topology doesn't change" are the two big ones. Any time you take a monolithic system or design and attempt to distribute it, those are usually the first two issues you run into. Eg, see the long, sad history of CORBA and various RPC and RMI attempts to make remote calls over the network look like local calls. Everything works great in testing until a router drops a few packets, then all of your assumptions go out the window and everything comes crashing down. If you're not thinking about what happens during various network failures, you're hardly doing distributed systems.

Cask directly approaches those two problems. The entire cluster design is centered around how to deal with full and partial network and disk failures and with nodes occasionally coming and going. Files are stored and retrieved via their cryptographic hash, so nodes can always verify data integrity and clients can be assured that they are getting the file they requested. The replication, distributed hashtable, and active anti-entropy mean that the cluster can lose a percentage of nodes (or at least lose the ability to communicate with them) and keep going.

Cask treats all data as immutable. This means that there's no such thing as updating (or even deleting) a file in Cask. Storing an updated copy of a file really means storing an entirely new file, independent of the previous version. Immutable data means that all operations are inherently idempotent, which eliminates a large class of potential problems. If there's a one-way network partition and you ask a node to store a file and never hear a response, it's perfectly safe to repeat the request. The worst that will happen is that it's a no-op.
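To make that concrete, here's a minimal Go sketch of content-addressed storage with an integrity check on read. The `Store` type and its methods are hypothetical illustrations, not Cask's actual API; the point is just that the key is derived from the content itself, so a retried write is a harmless no-op and a read can always be verified against its key.

```go
// Hypothetical sketch of content-addressed storage; not Cask's real code.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Store maps the hex SHA-256 of a blob's content to the blob itself.
type Store struct {
	blobs map[string][]byte
}

func NewStore() *Store {
	return &Store{blobs: make(map[string][]byte)}
}

// Put is naturally idempotent: the same bytes always hash to the same key,
// so retrying a write after a lost response just overwrites identical data.
func (s *Store) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	s.blobs[key] = append([]byte(nil), data...)
	return key
}

// Get re-hashes the stored bytes, so a corrupted blob can never be returned
// as if it were the file the client asked for.
func (s *Store) Get(key string) ([]byte, error) {
	data, ok := s.blobs[key]
	if !ok {
		return nil, errors.New("not found")
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != key {
		return nil, errors.New("integrity check failed")
	}
	return data, nil
}

func main() {
	s := NewStore()
	key := s.Put([]byte("hello, cask"))
	_ = s.Put([]byte("hello, cask")) // safe retry: same key, no-op
	data, err := s.Get(key)
	fmt.Println(string(data), err)
}
```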

Of course, this is only true up to a point. If you have a 100 node cluster and 99 of them all catch fire at the same time, you're going to have a bad time. Cask's guarantees of data integrity are only probabilistic wrt the rates that nodes fail and are replaced. When a node fails, it takes Cask some time to replace the copies that it was storing. If more nodes fail in that time, the chances of losing data go up. This is inevitable; any practical distributed system will have a similar limitation. Ideally, the nodes in your cluster will be independent. Then failure rates work in your favor. Eg, if the hard drives that you are backing each node with have a 10% chance of failing in a given time period, replicating across two of them results in an overall 1% chance of both failing in the same time period. Each independent replica improves that by an order of magnitude. Cask lets you configure the replication rate and how aggressively it will repair replication after a node fails, so you can tune it to get a risk profile that makes sense for the data that you are storing in it. If you had storage and bandwidth to spare, you could make that 100 node cluster maintain a 100X replication rate (so every node is effectively a full mirror) and you could sustain the loss of 99 of them.
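As a back-of-the-envelope illustration of that independence argument (using the 10% figure from above, not measurements from any real cluster), losing a file requires every one of its independent replicas to fail in the same window:

```go
// Illustrative arithmetic only; the numbers are assumptions, not Cask data.
package main

import (
	"fmt"
	"math"
)

// lossProbability is the chance that all replicas of a file fail in the
// same time window, assuming independent nodes that each fail with
// probability perNodeFailure in that window.
func lossProbability(perNodeFailure float64, replicas int) float64 {
	return math.Pow(perNodeFailure, float64(replicas))
}

func main() {
	p := 0.10 // assumed 10% chance a given node fails in the window
	for n := 1; n <= 4; n++ {
		fmt.Printf("replicas=%d  chance of losing the file: %.4f%%\n",
			n, 100*lossProbability(p, n))
	}
	// Each extra independent replica buys roughly an order of magnitude,
	// matching the 10% -> 1% example in the text.
}
```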

"Bandwidth is infinite" and "Transport cost is zero" are related to the above in Cask. The higher the replication rate and the more aggressive the repair rate, the more bandwidth and transport costs will be incurred. Again, this isn't something that a distributed system "solves" for you; all it can really do is give you the control to balance the costs and risks intelligently. The goal, when configuring a cluster, should be to ensure that the time to repair a failure is significantly less than the MTBF for each of your nodes, but not too much more than that, unless you've got bandwidth and CPU to waste.

"There is one administrator" is another hard one. Cask tries to keep configuration as simple as possible, with basic 12-factor app style environment variables and relies on the gossip and heartbeat systems to converge on a consistent state across the nodes rather than expecting administrator intervention. Nevertheless, if administration of your Cask cluster isn't handled carefully, you could run into problems. Eg, if different nodes in the cluster are configured for different replication levels, you'll get some weird behavior and the one with the lowest replication setting will "win", and you'll have a lot of excess churn as well as nodes swap replicas back and forth. In the future, Cask might gain the ability to talk to a service discovery/configuration management system like Zookeeper, etcd, confd, consul, etc. to make sure that the cluster agrees on replication rates. Similarly, and also going back to "The network is homogeneous", Cask won't deal all that well with very unbalanced performance across the cluster. Ie, if one node is running on a very old, slow machine with a particularly slow network connection, it will tend to drag the rest of the cluster down with it. When you upload a file to a Cask node, it won't give the client an affirmative response until it's verified that the full N nodes have stored it. If the slow node is in the list of nodes that it's trying to store to, the whole request will be waiting on that slow node. Currently, Cask doesn't have any internal tooling for detecting and quarantining exceptionally poorly performing or unreliable nodes. (tracking and exposing performance stats is definitely on the list of desirable future features).

more later...
