Here's a list of things I like to review periodically to keep my understanding of key concepts fresh. I try to review everything on this list once a year or so. Gotta keep those swords sharp!
- Relevant chapters in "The Unix and Linux Administration Handbook":
- Chapter 14 on TCP/IP Networking
- Chapter 21 on Network Debugging
- CIDR notation (cidr.md)
Getting Things Done
- Stellman & Greene, "Learning Agile"
- Work is Work by Coda Hale
- Contention cost can grow faster than work capacity: "If contention on those [shared] resources is unmanaged, organizational growth can result in catastrophic increases in wait time. At some point, adding new members can cause the organization’s overall productivity to decrease instead of increase, as the increase in wait time due to contention is greater than the increase in work capacity."
- Invest in reducing resource contention: "If the organization’s intent is to increase value delivery by hiring more people, work efforts must be as independent as possible. Leaders should develop practices and processes to ensure that the work efforts which their strategies consider parallel are actually parallel."
- Kubernetes and Helm (kubernetes.md)
- Burns et al., "Kubernetes Up and Running"
- Design patterns outlined in distributed-systems
- Best explanation of the CAP theorem
- Of the CAP theorem's three properties (consistency, availability, and partition tolerance), partition tolerance is mandatory in distributed systems because network partitions cannot be ruled out. The real choice is between consistency and availability while a partition is happening.
- A more helpful heuristic may be the tradeoff between yield (the percentage of requests answered successfully) and harvest (the percentage of the most up-to-date data included in the responses).
- The yield/harvest tradeoff is outlined in Fox & Brewer, "Harvest, yield, and scalable tolerant systems" (1999).
- The overview of Consul's architecture is a great real-world illustration of how distributed systems are built in practice: Consul Architecture
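
To make the yield/harvest heuristic above concrete, here is a minimal sketch of how the two numbers could be computed from a batch of responses. The `Response` record and its shard counts are hypothetical, invented for illustration; they are not taken from Fox & Brewer or from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical record of one request against a sharded data store.
    ok: bool              # did the request get an answer at all?
    shards_answered: int  # shards whose data made it into the answer
    shards_total: int     # shards that hold relevant data


def yield_fraction(responses):
    """Fraction of requests that were answered successfully."""
    return sum(r.ok for r in responses) / len(responses)


def harvest_fraction(responses):
    """Average fraction of the data reflected in the successful answers."""
    answered = [r for r in responses if r.ok]
    return sum(r.shards_answered / r.shards_total for r in answered) / len(answered)


responses = [
    Response(ok=True, shards_answered=4, shards_total=4),
    Response(ok=True, shards_answered=3, shards_total=4),   # one shard unreachable
    Response(ok=False, shards_answered=0, shards_total=4),  # request failed outright
    Response(ok=True, shards_answered=4, shards_total=4),
]

print(yield_fraction(responses))              # 0.75
print(round(harvest_fraction(responses), 2))  # 0.92
```

Answering the second request with partial data trades harvest for yield: failing it instead would raise harvest to 1.0 and lower yield to 0.5.
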
Replication and Broadcast
- Broadcast ordering (a good summary):
- total order: messages are delivered on all nodes in the same order. Often accomplished with a leader/follower pattern, where a single leader determines total order (see consensus below).
- causal: messages are delivered on all nodes in causal order. Concurrent messages may be delivered in any order, and that order may vary from node to node.
- reliable: non-faulty nodes deliver every message, retrying dropped messages.
- best-effort: messages may be dropped.
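
As an illustration of the causal ordering described above, here is a minimal in-memory sketch that uses per-sender delivery counters (a vector-clock style dependency list) to hold back a message until everything it causally depends on has been delivered. The `CausalNode` class and its method names are assumptions made for this sketch; it ignores networking, retries, and duplicate messages.

```python
class CausalNode:
    """Toy node that delivers broadcast messages in causal order."""

    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.delivered = [0] * n_nodes  # messages delivered so far, per sender
        self.sent = 0                   # messages this node has broadcast
        self.buffer = []                # received but not yet deliverable
        self.log = []                   # payloads in delivery order

    def broadcast(self, payload):
        # Dependencies: everything delivered here so far, plus our own
        # earlier broadcasts.
        deps = list(self.delivered)
        deps[self.node_id] = self.sent
        self.sent += 1
        msg = (self.node_id, deps, payload)
        self.receive(msg)  # a node also delivers its own message
        return msg

    def receive(self, msg):
        self.buffer.append(msg)
        self._deliver_ready()

    def _deliver_ready(self):
        # Keep scanning until no buffered message becomes deliverable.
        progressed = True
        while progressed:
            progressed = False
            for msg in list(self.buffer):
                sender, deps, payload = msg
                if all(d <= have for d, have in zip(deps, self.delivered)):
                    self.buffer.remove(msg)
                    self.log.append(payload)
                    self.delivered[sender] += 1
                    progressed = True


a, b, c = CausalNode(0, 3), CausalNode(1, 3), CausalNode(2, 3)
m1 = a.broadcast("hello")
b.receive(m1)
m2 = b.broadcast("reply")  # causally depends on m1

c.receive(m2)  # arrives first, but is held back in the buffer
c.receive(m1)  # unblocks m2
print(c.log)   # ['hello', 'reply']
```

Node `c` receives the reply before the message it answers, yet still delivers them in causal order. Two messages with no dependency between them would be delivered in arrival order, which is why concurrent messages can appear in different orders on different nodes.
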
- Consensus protocols allow follower nodes to elect a leader and allow the leader to help followers reach consensus on shared state. Consensus is formally equivalent to total order broadcast.
- Excellent explanation of the Raft consensus protocol
- Transactions can either be committed or aborted (rolled back). Distributed transactions do this across multiple nodes.
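- Atomic commit sounds similar to consensus, but it makes stricter guarantees: every participating node must vote to commit, and a single abort vote aborts the transaction for everyone.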
- Two-phase commit (2PC) is a common protocol for atomic commit in a distributed system.
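
The two-phase commit flow mentioned above can be sketched in a few lines. This is a simplified, single-process illustration with hypothetical `Participant` and `two_phase_commit` names; it deliberately leaves out what makes real implementations hard, such as durable logs, timeouts, and recovery after a coordinator crash.

```python
class Participant:
    """Toy participant; `can_commit` simulates its vote in phase 1."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: promise to commit if asked, or vote to abort.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2 (commit or abort): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok = two_phase_commit([Participant("db"), Participant("queue")])
failed_group = [Participant("db"), Participant("queue", can_commit=False)]
failed = two_phase_commit(failed_group)
print(ok, failed)  # committed aborted
```

A single "no" vote aborts the transaction on every participant, including those that had already voted to commit.
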
- How do you define "consistency" when you have multiple nodes reading/writing replicated data?
- Linearizability: This is the strongest definition of consistency. All operations behave as if they were executed against a single copy of the data. If a write starts and finishes before a read starts, the read should always return the latest value. When a read and a write are concurrent, the read may return either the old or the new value, and that is still consistent with linearizability.
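
To make the linearizability definition above testable, here is a brute-force sketch for a single register. It searches for an ordering of the operations that respects real-time order and behaves like one copy of the data. The `Op` record and function names are assumptions for illustration, and trying every permutation is only practical for tiny histories.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    start: float  # time the operation was invoked
    end: float    # time the operation completed


def respects_real_time(order):
    # If b finished before a started, b must not be ordered after a.
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b.end < a.start:
                return False
    return True


def is_valid_register(order, initial):
    # Replay the operations against a single copy of the data.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=0):
    return any(
        respects_real_time(order) and is_valid_register(order, initial)
        for order in permutations(history)
    )


# The write finishes before the read starts, so reading 0 is a stale read.
stale = [Op("write", 1, 0, 1), Op("read", 0, 2, 3)]
# The read overlaps the write, so either the old or the new value is fine.
concurrent_old = [Op("write", 1, 0, 3), Op("read", 0, 1, 2)]
concurrent_new = [Op("write", 1, 0, 3), Op("read", 1, 1, 2)]

print(is_linearizable(stale))           # False
print(is_linearizable(concurrent_old))  # True
print(is_linearizable(concurrent_new))  # True
```
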