---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Cluster and Node States"
---
<p>
Clients route requests to distributors using the distribution algorithm,
which takes node states as input.
Distributors use the same algorithm to keep bucket copies on the correct
storage nodes. To do this, they need to know which storage nodes are
available to hold copies and what their capacities are.
Service layer nodes may split requests between different partitions, in which
case the distributors also need to know which partitions are available.
To achieve this, state is tracked for all the major components of the content layer:
</p>
<dl class="dl-horizontal">
<dt>Up</dt>
<dd>The node is up and available to keep buckets and serve requests.</dd>
<dt>Down</dt>
<dd>The node is unavailable and cannot be used. The distribution algorithm
will not consider this node for bucket placement.</dd>
<dt>Initializing</dt>
<dd>The node is starting up. It knows what buckets it stores data for, and may
serve requests, but it does not know the metadata for all its buckets yet,
such as checksums and document counts. The distribution algorithm will
consider this node as available when calculating bucket placement.
</dd>
<dt>Stopping</dt>
<dd>This node is stopping and is expected to be down soon. This state is
typically only exposed to the cluster controller to tell why the node
stopped. The cluster controller will expose the node as down or in
maintenance mode for the rest of the cluster. This state is thus not seen
by the distribution algorithm.
</dd>
<dt>Maintenance</dt>
<dd>This node is temporarily unavailable. The distribution algorithm will
still count this node as available for bucket placement, causing fewer
copies than the configured redundancy to be available. This mode is
typically used to mask a down state during controlled node restarts, or by
an administrator who needs to do short maintenance work, like upgrading
software or restarting the node. In this mode, new copies of the documents
stored on this node are not created, allowing the node to be down with less
of a performance impact on the rest of the cluster.
</dd>
<dt>Retired</dt>
<dd>A retired node is available and serves requests.
This state is used to remove nodes while keeping redundancy.
The distribution algorithm will not consider retired nodes when calculating bucket placement.
Buckets are moved to other nodes (with low priority), until the node is empty.
Special considerations apply when using
<a href="data-placement.html">hierarchical distribution</a> as buckets are not necessarily removed.
</dd>
</dl>
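<p>
To summarize, the states differ in whether the distribution algorithm still counts
the node when placing bucket copies, and whether the node serves requests. The
following Python sketch is illustrative only - it is not Vespa code, and the helper
names are made up for this example:
</p>
<pre>
# Illustrative sketch, not Vespa code: how each node state is treated,
# per the definitions above.
#   counted_for_placement: the distribution algorithm still considers the node
#                          when computing bucket placement, so no extra copies
#                          are created elsewhere.
#   may_serve_requests:    the node can receive and serve requests.
NODE_STATE_BEHAVIOR = {
    # state           (counted_for_placement, may_serve_requests)
    "up":             (True,  True),
    "initializing":   (True,  True),   # serves requests, but bucket metadata
                                       # (checksums, document counts) may be
                                       # incomplete
    "maintenance":    (True,  False),  # masked outage: no new copies created
    "retired":        (False, True),   # buckets drain away, node still serves
    "down":           (False, False),
    # "stopping" is only seen by the cluster controller, which exposes the
    # node as down or in maintenance to the rest of the cluster.
}

def counted_for_placement(state):
    """Whether the distribution algorithm places bucket copies on the node."""
    return NODE_STATE_BEHAVIOR[state][0]
</pre>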
<h2 id="distributor-state">Distributor state</h2>
<p>
The distributors use node states similar to those of the service layer nodes.
However, maintenance currently makes little sense, as only one distributor is
able to handle requests for a given bucket. Setting a distributor in
maintenance mode makes the subset of buckets owned by that distributor
unavailable, and is thus not recommended for distributors. The cluster
controller will not, by default, mask restarting distributors as being in
maintenance mode. This will change when distributor redundancy is implemented.
</p><p>
Retired mode also makes little sense for distributors, as ownership transfer
happens quickly at any distributor state change anyhow. A retired distributor
will not be used by any clients.
</p>
<h2 id="partition-state">Partition state</h2>
<p>
The content layer allows service layer nodes to be split into multiple
partitions, allowing parts of a service layer node to be unavailable. If the
backend uses a JBOD disk setup and a disk becomes unusable, the node can mark
that disk unavailable without losing the capacity of the entire node.
</p><p>
The partitions can currently only be in up or down states, and changing state
entails restarting the service layer node.
</p>
<h2 id="cluster-state">Cluster state</h2>
<p>
The cluster controller generates a cluster state by combining the node and
partition states of all the nodes. Each time the state is altered in a way that
affects the cluster, the cluster state version is incremented, and the
new cluster state is broadcast to all the distributor and service layer nodes,
provided a minimum time has passed since the last state was created.
</p><p>
The cluster controller has settings for what fraction of distributor
and service layer nodes must be available for the cluster to be available.
If too many nodes are unavailable, sending load to the remaining nodes would
overload them and completely fill them with data. This gives a poor user
experience, and the cluster would need considerable time to recover
afterwards. Thus, if a cluster is missing too many nodes to perform
decently, the entire cluster is considered down. While this sounds drastic, it
allows building a failover solution, or at the
very least, lets the cluster get back to a usable state faster once enough
nodes are available again.
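</p>
<p>
As a rough illustration of this decision, the check can be thought of as comparing
the fraction of available nodes of each type against configured minimums. The
Python sketch below is not Vespa code, and the ratio parameters are placeholders
rather than actual configuration names:
</p>
<pre>
# Rough sketch, not Vespa code: the cluster availability decision.
# The minimum ratios are placeholders for the cluster controller's configured
# limits, not actual Vespa configuration parameters.
def cluster_available(distributors_up, distributors_total,
                      storage_up, storage_total,
                      min_distributor_ratio=0.5, min_storage_ratio=0.5):
    """Return True if enough nodes are available for the cluster to stay up."""
    distributor_ratio = distributors_up / distributors_total
    storage_ratio = storage_up / storage_total
    return (distributor_ratio >= min_distributor_ratio and
            storage_ratio >= min_storage_ratio)

# Example: with only 4 of 10 storage nodes up, the whole cluster is taken down.
print(cluster_available(10, 10, 4, 10))   # False
</pre>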
<p>
Note that if a cluster has so many nodes unavailable that it is considered
down, the states of the individual nodes are irrelevant to the cluster, and
new cluster states will not be created and broadcast until enough nodes are
back for the cluster to come back up. A cluster state indicating that the entire
cluster is down may thus contain outdated data at the node level.
</p><p>
State is viewed in different contexts: <em>unit</em>, <em>user</em>
and <em>generated</em> states:
</p>
<h3 id="unit-state">Unit state</h3>
<p>
The cluster controller fetches node states from all nodes. It attempts to always
have a pending node state request to every node, such that any node state change
can be reported back to the cluster controller immediately.
The states reported by the nodes themselves are either <em>initializing</em>, <em>up</em> or
<em>stopping</em>. If a node cannot be reached, a <em>down</em> state is assumed.
These states are called <em>unit states</em>.
</p>
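<p>
A minimal sketch of how a unit state could be derived from the latest reply to
such a pending request - illustrative Python, not Vespa code:
</p>
<pre>
# Illustrative sketch, not Vespa code: nodes report initializing, up or
# stopping; an unreachable node is assumed to be down.
REPORTED_STATES = {"initializing", "up", "stopping"}

def unit_state(reported_state):
    """reported_state is None when the node could not be reached."""
    if reported_state in REPORTED_STATES:
        return reported_state
    return "down"
</pre>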
<h3 id="user-state">User state</h3>
<p>
By default, it is assumed that all the nodes in the cluster are preferred to be
up and available. However, several other cases exist:
<ul>
<li>Retire a node from a cluster -
use the retired mode to get buckets moved elsewhere</li>
<li>Short-lived maintenance work on the node -
set it in maintenance mode to avoid the rest of the cluster spending
resources on creating more copies of the buckets.</li>
<li>A node with bad hardware or corrupt files may cause havoc in the cluster.
In such cases, the cluster is better off not using the node at all. The
cluster controller might detect this, or administrators investigating
problems may want to manually mark the node as not to be used.</li>
</ul>
Administrators or cluster controller logic may alter the preferred node state
for a node. Preferred states must be one of <em>up</em>, <em>down</em>,
<em>maintenance</em> or <em>retired</em>.
These are called <em>user states</em>.
</p><p>
User states are stored in a ZooKeeper cluster run by the configuration
servers.
</p>
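<p>
For illustration, a user state could be set over HTTP through the
<a href="api-state-rest-api.html">State Rest API</a>. The sketch below is not a
definitive example: the host, port, cluster name and node index are placeholders,
and the URL and request body are intended to match the layout described in the
State Rest API documentation - check that page for the authoritative format:
</p>
<pre>
# Sketch: setting a preferred (user) state for a storage node through the
# State Rest API. Host, port, cluster name and node index are placeholders;
# see the State Rest API documentation for the exact endpoint layout.
import json
import urllib.request

def set_user_state(host, port, cluster, node_index, state, reason):
    url = "http://%s:%d/cluster/v2/%s/storage/%d" % (host, port, cluster, node_index)
    body = json.dumps({
        "state": {"user": {"state": state, "reason": reason}}
    }).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# Example: take storage node 3 out for short maintenance work.
# set_user_state("localhost", port, "music", 3, "maintenance", "Replacing a disk")
</pre>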
<h3 id="generated-state">Generated state</h3>
<p>
When viewing node states through a cluster state, one sees a state calculated from
the reported states, the preferred states and the historical knowledge the
cluster controller keeps about the node.
</p><p>
The cluster controller will typically mask a down reported state with a
maintenance state for a short time if it has seen a stopping state indicating
a controlled restart. Assuming the node will soon be initializing again,
masking it as maintenance will stop distributors from creating new copies of
buckets during that time window. If the node fails to come back quickly
though, the state will become down.
</p><p>
Nodes that come back up, reporting as <em>initializing</em>, may be masked as
<em>down</em> by the cluster controller. Based on historic events, the cluster
controller might suspect the node of stalling or crashing during initialization,
and may keep it unused for the time being to avoid disrupting the cluster again.
</p><p>
Node states seen through the currently active cluster state
are called <em>generated states</em>.
</p>
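<p>
The masking described above can be sketched as a function of the unit state, the
user state and the controller's history of the node. This is illustrative Python,
not Vespa code, and the <em>history</em> object with its methods is hypothetical:
</p>
<pre>
# Illustrative sketch, not Vespa code: deriving a generated state.
# The history object and its methods are hypothetical placeholders for the
# knowledge the cluster controller keeps about a node.
def generated_state(unit_state, user_state, history):
    if unit_state == "down":
        # A controlled restart (a stopping state was just seen) is masked as
        # maintenance for a short time so distributors do not make new copies.
        if history.recently_saw_stopping():
            return "maintenance"
        return "down"
    if unit_state == "initializing" and history.suspected_bad_initialization():
        # A node that stalled or crashed during earlier initializations may be
        # kept unused for a while to avoid disrupting the cluster again.
        return "down"
    if user_state != "up":
        # An administrator's preferred state, such as maintenance or retired,
        # overrides an otherwise available node.
        return user_state
    return unit_state
</pre>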
<h2 id="inspecting-and-modifying-states">Inspecting and modifying states</h2>
<p>
State information is available through the
<a href="api-state-rest-api.html">State Rest API</a>. This interface can
also be used to set user states for nodes.
For convenience, use the
command line tools. See the <em>--help</em> pages for
<em>vespa-get-cluster-state</em>, <em>vespa-get-node-state</em>
and <em>vespa-set-node-state</em>.
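</p>
<p>
As a sketch, the states of a node in all three contexts (unit, user and generated)
can be fetched over HTTP. Host, port, cluster name and node index below are
placeholders, and the response layout is defined by the
<a href="api-state-rest-api.html">State Rest API</a> itself:
</p>
<pre>
# Sketch: reading the state of a storage node through the State Rest API.
# Host, port, cluster name and node index are placeholders.
import json
import urllib.request

def get_node_state(host, port, cluster, node_index):
    url = "http://%s:%d/cluster/v2/%s/storage/%d" % (host, port, cluster, node_index)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (placeholders): the returned document is expected to contain the
# unit, user and generated states for the node.
# print(get_node_state("localhost", port, "music", 0))
</pre>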
<p>
More detailed information is available on the cluster controller status pages.
These pages may change at any time, and making sense of their output may
require good knowledge of the content layer design. Find the status pages
on the STATE port used by the cluster controller component at
<em>/clustercontroller-status/v1/</em>.
</p>