
Client-side failure detector implementations


In order for Voldemort to implement a system with fail-over, consistent hashing, balancing, etc. it’s important for the client to maintain up-to-date status of each storage node’s availability. The Voldemort client needs to know which nodes are available in order to correctly route requests. This is the job of the failure detector.

Background

Prior to this work, failure detection was implemented across two classes: RoutedStore and NodeStatus. The logic was very simple: when an attempt to perform an operation (get, put, etc.) on a remote node failed, an UnreachableStoreException would be thrown. The RoutedStore implementation would catch this exception and mark the node as offline for N seconds (the default was 30 seconds). During that bannage period no attempts were made by the client routing code to access the offline node. After the bannage period had elapsed, the node would once again become a candidate for remote operations. However, there was no guarantee that the node had actually become available; the assumption was that a node failure was more often caused by the node being temporarily overloaded rather than by it actually being offline.

There are a number of downsides with this approach:

  1. A temporary hiccup in a node operation forces the entire node to be marked as unavailable for a relatively long period of time
  2. If the server representing a node is literally down (i.e. the system is powered down, rebooting, etc.), attempts are made every N seconds to reconnect, causing the Voldemort client thread to block for minutes at a time (see Issue 137: Periodic Delays with Voldemort Queries when Nodes are “offline”)
  3. The lack of visibility into the current internal state of the failure detection logic made it a “black box” for operations teams.
  4. There was basically only one configuration option: the value of N for the node bannage period.
  5. From a code structure standpoint, the failure detection logic was tightly coupled with the RoutedStore, making it difficult to research alternative implementations.
  6. As a result of the code structure, there were only a couple of unit tests to verify correct failure detection.

It was decided that a better approach to failure detection needed to address these fundamental issues by providing:

  1. Tolerance for intermittent and/or temporary failures; for example, certain types of failures, or a certain number of failures within a window of time, are considered acceptable.
  2. Asynchronous re-establishment of node availability. When a node remains down for a long period of time (on the order of minutes), say while it is being reconfigured, rebooted, or replaced, recovery checks should be handled in the background to minimize the impact on clients communicating with the available nodes.
  3. JMX instrumentation for increased visibility into the internal monitoring, validation, and so forth of the failure detection. These attributes provide up-to-the-minute statistics which can optionally be captured on a periodic basis for trend analysis by external tools.
  4. A reasonable set of configuration parameters (without going overboard) for tuning site-specific failure detection and recovery.
  5. An API with interface/implementation separation. This allows multiple implementations to coexist, one perhaps being better suited than others for a particular environment or use case. The main reason, however, is to allow a preferred implementation to be developed, tweaked, tested, etc. in isolation.
  6. A code structure that facilitates in-depth unit testing. Additionally, integration and performance tests can be performed using the new EC2 Testing Infrastructure.

Introducing the FailureDetector API

As noted above, one of the first areas of work was to separate the failure detection logic out into its own API. This is now represented by the voldemort.cluster.failuredetector.FailureDetector interface. Here is a subset of the salient methods of the failure detector:

public interface FailureDetector {

    // Is this node currently considered available for servicing requests?
    public boolean isAvailable(Node node);

    // Callers report a successful operation against the node and how long the request took
    public void recordSuccess(Node node, long requestTime);

    // Callers report a failed operation against the node, how long the request took, and the exception raised
    public void recordException(Node node, long requestTime, UnreachableStoreException e);

}

The FailureDetector API is used to determine a cluster’s node availability. Machines and servers can go down at any time, and the request routing code uses this API in an attempt to avoid unavailable servers.

A FailureDetector is specific to a given cluster and as such there should only be one instance per cluster per JVM.

Implementations can differ dramatically in how they approach the problem of determining node availability. Some implementations may rely heavily on invocations of recordException and recordSuccess to determine availability. Such a FailureDetector implementation performs little logic other than bookkeeping, implicitly trusting users of the API. Other implementations may be more selective about how they use external callers’ invocations of recordException and recordSuccess, treating these error/success calls as “hints” or ignoring them outright.

To contrast the two approaches:

  1. Externally-based implementations use algorithms that rely heavily on users for correctness. For example, let’s say a user attempts to contact a node and that attempt fails. A responsible caller should invoke the recordException API to inform the FailureDetector that an error has taken place for that node. The FailureDetector itself hasn’t really determined availability on its own, so if the caller is incorrect or buggy, the FailureDetector’s accuracy is compromised.
  2. Internally-based implementations rely on their own determination of node availability. For example, a heartbeat style implementation may pay only a modicum of attention when its recordException and/or recordSuccess methods are invoked by outside callers.

Naturally there is a spectrum of implementations, and external calls to recordException and recordSuccess should (though need not) provide some input to the internal algorithm.
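
For illustration, here is a minimal sketch of the externally-based caller pattern described above. The helper method storeFor, the failureDetector field, and the return type are placeholders for this sketch, not the actual Voldemort routing code:

// Illustrative only: how routing code might consult a FailureDetector.
// failureDetector is assumed to be a FailureDetector field; storeFor(node) is a
// hypothetical helper returning a per-node store client.
private byte[] getWithFailureDetection(byte[] key, List<Node> preferenceList) {
    for (Node node : preferenceList) {
        if (!failureDetector.isAvailable(node))
            continue;                                   // skip nodes currently marked unavailable
        long start = System.currentTimeMillis();
        try {
            byte[] value = storeFor(node).get(key);     // attempt the remote operation
            failureDetector.recordSuccess(node, System.currentTimeMillis() - start);
            return value;
        } catch (UnreachableStoreException e) {
            failureDetector.recordException(node, System.currentTimeMillis() - start, e);
        }
    }
    // Voldemort itself would throw a more specific exception here
    throw new IllegalStateException("No available node could serve the request");
}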

Current Implementations

Despite the fact that there is an API with a pluggable implementation (as we’ll see), the goal is not to have users writing their own implementations. The ultimate goal is to have the One True Implementation that most, if not all, Voldemort users use by default. (Presently that is voldemort.cluster.failuredetector.ThresholdFailureDetector.)

Configuration

Configuration for the failure detector is handled via properties in the client configuration. Each implementation may have its own configuration, but the common configuration property is:

  • failuredetector_implementation: The fully-qualified class name of the failure detector implementation

Implementation-specific properties are listed in their respective sub-sections below.
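
For example, a client configuration that explicitly selects the ThresholdFailureDetector (the current default) would include a line such as:

failuredetector_implementation = voldemort.cluster.failuredetector.ThresholdFailureDetector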

JMX

All current implementations support the following JMX attributes for the failure detector:

  • availableNodes: A comma-separated list of the node IDs that the failure detector considers available
  • unavailableNodes: A comma-separated list of the node IDs that the failure detector considers unavailable
  • availableNodeCount: The count of nodes in the cluster that the failure detector considers available
  • nodeCount: The total number of nodes in the cluster

Implementation-specific JMX attributes are listed in their respective sub-sections below.
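
These attributes can be read over a standard JMX connection. The sketch below uses only the standard javax.management API; the service URL and ObjectName are assumptions for this example, so check a JMX console such as jconsole for the exact name registered by your client:

// Illustrative only: reading the common failure detector attributes over JMX.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FailureDetectorJmxProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port of a JVM running the Voldemort client with remote JMX enabled
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // The ObjectName is an assumption; look it up in jconsole for your deployment
            ObjectName name = new ObjectName("voldemort.cluster.failuredetector:type=ThresholdFailureDetector");
            System.out.println("availableNodeCount = " + connection.getAttribute(name, "availableNodeCount"));
            System.out.println("unavailableNodes   = " + connection.getAttribute(name, "unavailableNodes"));
        } finally {
            connector.close();
        }
    }
}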

BannagePeriodFailureDetector

BannagePeriodFailureDetector relies on external callers to notify it of failed attempts to access a node’s store via recordException. When recordException is invoked, the node is marked offline for a period of time as defined by the client or server configuration. Once that period has passed, the node is considered available again. However, BannagePeriodFailureDetector uses “available” in a fairly loose sense: rather than being considered available for access, the node is available for attempting access. In actuality the node may still be down. The intent is simply to mark it down for N seconds, try again, and repeat. If the node is truly available for access, the caller will then invoke recordSuccess and the node will be marked available in the truest sense of the word.
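
As an illustration of the idea (not the actual implementation), the availability check reduces to comparing the current time against a per-node expiration recorded when the failure was reported; the bannedUntil map below is a hypothetical field:

// Illustrative only: a node is "available" once its bannage period has expired.
// bannedUntil maps a node to the time (ms) at which its bannage expires; entries would be
// written by recordException (now + failuredetector_bannage_period) and cleared by recordSuccess.
public boolean isAvailable(Node node) {
    Long expiration = bannedUntil.get(node);
    return expiration == null || System.currentTimeMillis() >= expiration;
}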

Configuration

In addition to the common configuration, the BannagePeriodFailureDetector supports the following optional property:

  • failuredetector_bannage_period: The number of milliseconds a failed node is considered “banned”

JMX

In addition to the common JMX attributes, the BannagePeriodFailureDetector adds the following JMX attribute:

  • unavailableNodesBannageExpiration: A comma-separated list of unavailable nodes and their respective bannage expiration

AsyncRecoveryFailureDetector

AsyncRecoveryFailureDetector detects failures and then attempts to contact the failing node’s Store to determine availability.

When a node does go down, attempts to access the remote Store for that node may take several seconds. Rather than cause the thread to block, we perform this check in a background thread.
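
A simplified sketch of that background loop follows; the helpers unavailableNodes, probeStore, and markAvailable are hypothetical names for this sketch, not the actual class’s methods:

// Illustrative only: asynchronously probe unavailable nodes and restore them on success.
private void recoveryLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
        for (Node node : unavailableNodes()) {             // nodes currently marked down
            try {
                probeStore(node);                          // hypothetical lightweight request to the node's Store
                markAvailable(node);                       // only now does the node rejoin routing
            } catch (UnreachableStoreException e) {
                // still unreachable; try again on the next scan
            }
        }
        Thread.sleep(asyncScanIntervalMs);                 // failuredetector_asyncscan_interval
    }
}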

Configuration

In addition to the common configuration, the AsyncRecoveryFailureDetector supports the following optional property:

  • failuredetector_asyncscan_interval: The number of milliseconds to wait between scans of previously unavailable nodes to check for recovery

ThresholdFailureDetector

ThresholdFailureDetector builds upon the AsyncRecoveryFailureDetector and provides a more lenient policy for marking nodes as unavailable. Fundamentally, for each node, the ThresholdFailureDetector keeps track of a “success ratio”, which is the ratio of successful operations to total operations, and requires that ratio to meet or exceed a threshold. Every call to recordException or recordSuccess increments the total count, while only calls to recordSuccess increment the success count. Calls to recordSuccess therefore increase the success ratio, while calls to recordException decrease it.

As long as the success ratio meets or exceeds the threshold, the node is considered available. Once the success ratio dips below the threshold, the node is marked as unavailable. Because this class extends the AsyncRecoveryFailureDetector, an unavailable node is only marked as available again once a background thread has been able to contact it asynchronously.

There is also a minimum number of requests that must occur before the success ratio is checked against the threshold. This prevents occurrences like a single failure out of a single attempt yielding a success ratio of 0%. There is also a threshold interval, which means that the success ratio for a given node is only “valid” for a certain period of time, after which it is reset. This prevents scenarios like 100,000,000 successful requests (and thus a near-100% success ratio) masking a subsequent stream of 10,000,000 failures, which would be only roughly 10% of the total and therefore leave the ratio above a typical threshold.
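
Putting those pieces together, here is a simplified sketch of the per-node check. The NodeStats holder and the field names are illustrative, not the actual implementation:

// Illustrative only: success-ratio check with a count minimum and a reset interval.
private boolean isAvailable(NodeStats stats) {
    long now = System.currentTimeMillis();
    if (now - stats.intervalStart >= thresholdIntervalMs) { // failuredetector_threshold_interval
        stats.success = 0;                                  // reset the window so old traffic
        stats.total = 0;                                    // cannot mask recent failures
        stats.intervalStart = now;
    }
    if (stats.total < thresholdCountMinimum)                // failuredetector_threshold_countminimum
        return true;                                        // too few samples to judge yet
    return (stats.success * 100L) / stats.total >= thresholdPercent; // failuredetector_threshold
}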

Additionally, there are so-called “catastrophic” errors for which the ThresholdFailureDetector will immediately mark the node as unavailable for the current period. These are exceptions such as java.net.ConnectException, java.net.UnknownHostException, and java.net.NoRouteToHostException. The set of exception types can be extended or changed via the appropriate configuration property (failuredetector_catastrophic_error_types).
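
A sketch of how such a check might look; the catastrophicErrorTypes field is an illustrative name for the set of class names configured via failuredetector_catastrophic_error_types, not the actual code:

// Illustrative only: short-circuit the success-ratio check for "catastrophic" failures.
private boolean isCatastrophic(UnreachableStoreException e) {
    for (Throwable cause = e.getCause(); cause != null; cause = cause.getCause()) {
        if (catastrophicErrorTypes.contains(cause.getClass().getName()))
            return true;   // e.g. java.net.ConnectException, java.net.NoRouteToHostException
    }
    return false;
}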

Configuration

In addition to the common configuration, the ThresholdFailureDetector supports the following optional properties:

  • failuredetector_threshold: The integer percentage representation of the threshold that must be met or exceeded.
  • failuredetector_threshold_interval: Millisecond interval for which the threshold is valid; it is “reset” after this period is exceeded
  • failuredetector_threshold_countminimum: Minimum number of requests that must occur before the success ratio is checked against the threshold
  • failuredetector_asyncscan_interval: The number of milliseconds to wait between scans of previously unavailable nodes to check for recovery
  • failuredetector_catastrophic_error_types: List of fully-qualified class names of exception types to be considered as catastrophic
  • failuredetector_request_length_threshold: Maximum length of time a request (get, put, delete, etc.) can take before it is counted as an “error”; such requests will not necessarily mark the node as unavailable immediately, but they contribute as errors to the “success ratio”

Configuration used in production at LinkedIn for the ThresholdFailureDetector

failuredetector_threshold_countminimum = 30
failuredetector_threshold_interval = 300000
failuredetector_threshold = 95

JMX

In addition to the common JMX attributes, the ThresholdFailureDetector adds the following JMX attribute:

  • nodeThresholdStats: Each node is listed with its status (available/unavailable) and success percentage