Fetching contributors…
Cannot retrieve contributors at this time
472 lines (440 sloc) 23.4 KB
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Elastic Vespa"
<p>Vespa allows application instances to grow (and shrink) while serving queries and accepting writes.
To maintain a uniform data distribution, documents are automatically redistributed in the background,
using minimal data movement.
Change the configured <a href="/documentation/reference/services-content.html#nodes">nodes</a>
and redeploy the application - no restarts needed.</p>
<p>The elasticity mechanism is also used to recover from a node loss -
new replicas of documents are auto-created to maintain redundancy.
Failed nodes is hence not a problem that requires immediate attention -
as long as the instance has sufficient capacity to handle the data and load,
it self-heals from node failures.</p>
<p>This article covers:
<li><a href="#storage-and-retrieval">storage and retrieval</a></li>
<li><a href="#search">search</a></li>
<li><a href="#resizing">resizing the instance</a></li>
<h2 id="storage-and-retrieval">Storage and retrieval</h2>
<p>The document distribution system in Vespa consists of two main components:
<a href="distributor.html">distributors</a> and content nodes.
Documents are accessed through a distributor.
To handle a large number of documents, Vespa groups them in <a href="content/buckets.html">buckets</a>,
using hashing or hints in the <a href="documents.html">document id</a>.
The content nodes provide a <em>Service Provider Interface</em> (SPI)
that abstracts how documents are stored in the elastic system.
The SPI is the link between the elastic bucket management system and the storage.
The SPI is implemented by <a href="proton.html">proton</a> -
<em>proton</em> and <em>content node</em> is used interchangeably in this documentation.
A document Put or Update is sent to all replicas of the bucket keeping the document.
If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket.
A bucket contains <a href="#data-retention-vs-size">tombstones</a> of recently removed documents.
Read more about document expiry and batch removes in
<a href="search-definitions.html#document-expiry">document expiry</a>.
Buckets are split when they grow too large, and joined when they shrink.
<em>This is a key feature for high performance in very small to very large instances,
and eliminates need for downtime or manual operations when scaling.</em>
<h3 id="writing-documents">Writing documents</h3>
<p>Documents enter Vespa using the <a href="api.html">Document API</a>,
using <a href="reference/vespa-feeder.html">vespa-feeder</a>, <a href="document-api.html">Document API</a>,
the <a href="vespa-http-client.html">Vespa Feeding Client API</a> or <a href="document-api-guide.html">Java Document API</a>
(all binaries use the Java Document API).
If no <a href="routing.html">route</a> is set, clients feed to the <em>default</em> route.
Next is <a href="document-processing-overview.html">document processing</a> (round-robin)
where documents are prepared for indexing, before entering the <em>distributor</em>.
The distributor maps the document to bucket, and sends it to <em>content nodes</em>:
</p><img src="content/img/elastic_feed.png"/><p>
<table class="table">
<th id="document-processing">Document processing</th>
The <a href="document-processing-overview.html">document processing chain</a>
is a chain of processors that manipulate documents before they are stored.
Document processors can be user defined.
When using indexed search, the final step in the chain prepares documents for indexing.
The Document API forwards requests to distributors.
It calculates the correct distributor using
the distribution algorithm and the <em>cluster state</em>.
With no known cluster state, the client library will send requests to a random node,
which replies with the updated cluster state if the node was incorrect.
Cluster states are versioned, such that clients hitting outdated distributors do not override
updated states with old states.
<th id="distributor">Distributor</th>
The <a href="distributor.html">distributor</a> keeps track of
which content nodes that stores replicas of each bucket,
based on the <em>redundancy</em> and information from the <em>cluster controller</em>.
A bucket maps to one distributor only.
A distributor keeps a bucket database with bucket metadata.
The metadata holds which content nodes store replicas of the buckets,
the checksum of the bucket content and the number of documents and meta entries within the bucket.
Each document is algorithmically mapped to a bucket and forwarded to the correct content nodes.
<a href="distributor.html">Read more</a>.
<th id="cluster-controller">Cluster controller</th>
The <a href="clustercontroller.html">cluster controller</a>
manages the state of the distributor and content nodes.
This <em>cluster state</em> is used by the document processing chains
to know which distributor to send documents to,
as well as by the distributor to know which content nodes should have which bucket.
<a href="clustercontroller.html">Read more</a>.
<th id="content-node">Content node</th>
The content node has a <em>bucket management system</em>,
which sends requests to a set of <em>document databases</em>,
which each consists of three <em>sub-databases</em>.
In short, this node activates and deactivates buckets for search.
Refer to <a href="proton.html">proton</a> for details.
Other aspects of feeding:
<table class="table">
<th id="redundancy">Redundancy</th>
<a href="reference/services-content.html#redundancy">Redundancy</a>
configures how many content nodes stores replicas of the same bucket.
No node may store more than one replica of a bucket.
The distributors detect whether there are enough bucket replicas on the
content nodes and add/remove as needed.
Write operations wait for replies from every replica
and fail if less than redundancy are persisted within timeout.
Replica placement is found in
<a href="content/data-placement.html">document distribution</a>.
<a href="reference/services-content.html#searchable-copies">Searchable-copies</a>
configures how many replicas to keep in the
<a href="/documentation/proton.html#sub-databases">Ready sub-database</a>.
<th id="consistency">Consistency</th>
Consistency is maintained at bucket level.
Content nodes calculate checksums based on the bucket contents,
and the distributors compare checksums among the bucket replicas.
A <em>bucket merge</em> is issued to resolve inconsistency, when detected.
While there are inconsistent bucket replicas, the distributors route operations to the "best" replica.
As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels.
A node may have been down while its buckets have been split or joined.
This is called <em>inconsistent bucket splitting</em>.
Bucket checksums can not be compared across buckets with different split levels.
Consequently, distributors do not know whether all documents exist in enough replicas in this state.
Due to this, inconsistent splitting is one of the highest maintenance priorities.
After all buckets are split or joined back to the same level,
the distributors can verify that all the replicas are consistent and fix any detected issues with a merge.
<a href="content/consistency.html">Read more</a>.
Detailed process overview:
<img src="img/reference/flow/feed1.png"/>
Content node expanded:
<img src="img/reference/flow/feed2.png" />
<h3 id="retrieving-documents">Retrieving documents</h3>
Retrieving documents is done by specifying an id to <em>get</em>,
or use a <a href="reference/document-select-language.html">selection expression</a>
to <em>visit</em> a range of documents - refer to the <a href="api.html">Document API</a>.
</p><img src="content/img/elastic_visit_get.png"/>
<table class="table">
<th id="get">Get</th>
When the content node receives a get request,
it scans through all the document databases,
and for each one it checks all three sub-databases.
Once the document is found, the scan is stopped and the document returned.
If the document is found in a Ready sub-database,
the document retriever will apply any changes that is stored in the
<a href="attributes.html">attributes</a> before returning the document.
<th id="visit">Visit</th>
A visit request creates an iterator over each candidate bucket.
This iterator will retrieve matching documents from all sub-databases of all document databases.
As for get, attributes values are applied to document fields in the Ready sub-database.
<h2 id="search">Search</h2>
Indexed search has a separate pathway through the system.
It does not use the distributor, nor has it anything to do with the SPI.
It is orthogonal to the elasticity set up by the storage and retrieval described above.
How queries move through the system:
</p><img src="content/img/elastic_query.png"/><p>
A query enters the system through the <em>QR-server (query rewrite server)</em>
in the <a href="container-intro.html">Vespa Container</a>.
The QR-server issues one query per document type to the <em>top level dispatcher</em>,
which in turn keeps track of all search nodes and relays queries to them:
<table class="table">
<th id="container">Container</th>
The Container knows all the document types and rewrites queries as a
collection of queries, one for each type.
Queries may have a <a href="reference/search-api-reference.html#model.restrict">restrict</a> parameter,
in which case the container will send the query only to the specified document types.
<th id="top-level-dispatcher">Top level dispatcher</th>
The top level dispatcher is a small process by default running on each container node,
which is responsible for sending the query to each content node and collecting partial results.
It pings all content nodes every second to know whether they are alive,
and keeps open TCP connections to each one.
If a node goes down, the elastic system will make the documents available on other nodes.
<th id="content-node-matching">Content node matching</th>
The <em>match engine</em> <!-- link here to design doc / move it -->
receives queries and routes them to the right document database based on the document type.
The query is passed to the <em>Ready</em> sub-database, where the searchable documents are.
Based on information stored in the document meta store,
the query is augmented with a blacklist that ensures only <em>active</em> documents are matched.
<h2 id="resizing">Resizing</h2>
Resizing is orchestrated by the distributor, and happens gradually in the background.
It uses a variation of the <em>RUSH algorithms</em> to distribute documents.
Under normal circumstances, it makes a minimal number of documents move when nodes are added or removed.
Modify the configuration to add/remove nodes,
then <a href="cloudconfig/application-packages.html#deploy">deploy</a>.
Add an elastic content cluster to the application by adding a
<a href="reference/services-content.html">content</a> element
in <a href="reference/services.html">services.xml</a>.
<h3 id="bucket-distribution-algorithm">Bucket distribution algorithm</h3>
Central to the <a href="content/idealstate.html">ideal state distribution algorithm</a>
is the assignment of a node sequence to each bucket.
The illustration shows how each bucket uses a pseudo-random sequence
of numbers to derive the node sequence.
The pseudo-random generator is seeded with the bucket id, so each bucket will always get the same sequence:
</p><img src="content/img/BucketNodeSequence.png"/><p>
Each node is assigned a <em>distribution-key</em>, which is an index in the number sequence.
The set of number/index pairs is sorted on descending numbers, and this new sequence of indexes determines
which nodes the replicas are placed on, with the first node being host to the primary replica, i.e. the <em>active</em> replica.
This specification of where to place a bucket is called the bucket's <em>ideal state</em>.
<h3 id="adding-nodes">Adding nodes</h3>
When adding a new node, new ideal states are calculated for all buckets.
The buckets that should have a replica on the new node are moved, while the superfluous are removed.
Redistribution example - add a new node to the system, with redundancy 2:
</p><img src="content/img/AddNodeMoveBuckets.png"/><p>
The distribution algorithm generates a random node sequence for each bucket.
In this example with redundancy of two, the two nodes sorted first will store replicas of the data.
The figure shows how random placement onto two nodes changes as a third node is introduced.
The new node introduced takes over as primary for the buckets where it got sorted first in order,
and as secondary for the buckets where it got sorted second.
This ensures minimal data movement when nodes come or go, and allows capacity to be changed easily.
No buckets are moved between the existing nodes when a new node is added.
Based on the pseudo-random sequences, some buckets change from primary to secondary, or are removed.
Many nodes can be added in the same deployment. Adding more than one minimizes the total
bucket redistribution, but increases the time to get to the <em>ideal state</em>. Procedure:
<strong>Node setup:</strong> Prepare nodes by installing software,
set up the file systems / directories and set configuration server(s).
<a href="vespa-quick-start.html">Details</a>. Start the node.
<strong>Modify configuration:</strong>
Add a <a href="reference/services-content.html#node">node</a>-element in <em>services.xml</em>
and <a href="reference/hosts.html">hosts.xml</a>.
Refer to <a href="/documentation/vespa-quick-start-multinode-aws.html">multinode install</a>.
It is key that the node's <em>distribution-key</em> is higher than the highest existing index.
<strong>Deploy</strong>: <a href="operations/admin-procedures.html#monitor-distance-to-ideal-state">
Observe metrics</a> to track progress as the cluster redistributes documents.
Use the <a href="#cluster-controller">cluster controller</a> to monitor the state of the cluster.
<strong>Restart dispatch nodes?:</strong> ToDo check this.
<strong>Tune performance:</strong> Use
<a href="">
maxpendingidealstateoperations</a> to tune concurrency of bucket merge operations from distributor nodes.
Likewise, tune <a href="reference/services-content.html#merges">merges</a> -
concurrent merge operations per content node. The tradeoff is speed of bucket replication vs
use of resources, which impacts the applications' regular load.
<strong>ToDo:</strong> add something on bucket-oriented summary store as well.
<strong>Finish:</strong> The cluster is done redistributing when <em>idealstate.merge_bucket.pending</em>
is zero on all distributors.
<h3 id="removing-nodes">Removing nodes</h3>
Whether a node fails or is deliberately removed, the same redistribution happens.
If the remove is planned, the node remains up until it has been drained.
Example of redistribution after node failure, redundancy 2:
</p><img src="content/img/LoseNodeMoveBuckets.png"/><p>
In the figure, node 2 fails. This node held the active replicas of bucket 2 and 6.
Once the node fails the secondary replicas are set active.
If they were already in a <em>ready</em> state, they can start serving queries immediately.
Otherwise they will have to index their data.
All buckets that no longer have secondary replicas
are merged to the remaining nodes according to their pseudo-random sequence.
It is important that the nodes retain their original index;
otherwise the buckets would all have to move to regain their ideal states.
Do not remove more than <em>redundancy</em>-1 nodes at a time.
Observe <em>idealstate.merge_bucket.pending</em> to know bucket replica status,
when zero on all distributor nodes, it is safe to remove more nodes.
If <a href="content/data-placement.html">hierarchical distribution</a>
is used to control bucket replicas, remove all nodes in a group
if the redundancy settings ensure replicas in each group.
To increase bucket redundancy level before taking nodes out,
<a href="content/admin-states.html">retire</a> nodes.
Again, track <em>idealstate.merge_bucket.pending</em> to know when done.
Use the <a href="content/api-state-rest-api.html">State API</a> or
<em>vespa-set-node-state</em><!-- ToDo: this tool should be documented somewhere -->
to set a node to <em>retired</em>.
The <a href="#cluster-controller">cluster controller's</a> status page lists node states.
<h3 id="merge-clusters">Merge clusters</h3>
To merge two content clusters, add nodes to the cluster like <em>add node</em> above, considering:
<a href="reference/services-content.html#node">distribution-keys</a> must be unique.
Modify paths like <em>$VESPA_HOME/var/db/vespa/search/mycluster/n3</em> before adding the node.
Set <a href="setting-vespa-variables.html">VESPA_CONFIGSERVERS</a>, then start the node.
<h3 id="migrate-clusters">Migrate clusters</h3>
An alternative to increasing cluster size is building a new cluster, then migrate documents to it.
This is supported using <a href="content/visiting.html#data-export-import">visiting</a>.
<h2 id="limitations-and-tradeoffs">Limitations and tradeoffs</h2>
<table class="table">
<tr><th id="availability-vs-resources">Availability vs resources</th>
Keeping index structures costs resources. Not all replicas of buckets are
necessarily searchable, unless configured using
<a href="reference/services-content.html#searchable-copies">searchable-copies</a>.
As Vespa indexes buckets on-demand, the most cost-efficient setting is 1,
if one can tolerate temporary coverage loss during node failures.
Note that <em>searchable-copies</em> does not apply to <a href="streaming-search.html">streaming search</a>
as this does not use index structures.
Distributing buckets using a <a href="content/data-placement.html">hierarchy</a>
helps implement policies for availability and
<a href="qps-scaling-content-cluster.html#hierarchical-distribution">query performance</a>.
<tr><th id="data-retention-vs-size">Data retention vs size</th>
When a document is removed, the document data is not immediately purged.
Instead, proton keeps <em>remove-entries</em> (tombstones of removed documents)
for a configurable amount of time. The default is two weeks, refer to
<a href="">
This ensures that removed documents stay removed in a distributed system where nodes change state.
Entries are removed periodically after expiry.
Hence, if a node comes back up after being down for more than two weeks,
removed documents are available again, unless the data on the node is wiped first.
A larger <em>pruneremoveddocumentsage</em> will hence grow the storage size as this keeps
document and tombstones longer.
<strong>Note:</strong> The backend does not store remove-entries for nonexistent documents.
This to prevent clients sending wrong document identifiers from
filling a cluster with invalid remove-entries.
A side-effect is that if a problem has caused all replicas of a bucket to be unavailable,
documents in this bucket cannot be marked removed until at least one replica is available again.
Documents are written in new bucket replicas while the others are down -
if these are removed, then older versions of these will not re-emerge,
as the most recent change wins.
<tr><th id="transition-time" style="white-space:nowrap;">Transition time</th>
See <a href="reference/services-content.html#transition-time">transition-time</a>
for tradeoffs for how quickly nodes are set down vs. system stability.
<tr><th id="removing-unstable-nodes">Removing unstable nodes</th>
One can configure how many times a node is allowed to crash before it will automatically be removed.
The crash count is reset if the node has been up or down continuously for more than the
<a href="reference/services-content.html#stable-state-period">stable state period</a>.
If the crash count exceeds
<a href="reference/services-content.html#max-premature-crashes">max premature crashes</a>, the node will be disabled.
Refer to <a href="operations/admin-procedures.html#troubleshooting">troubleshooting</a>.
<tr><th id="minimal-amount-of-nodes-required-to-be-available">Minimal amount of nodes required to be available</th>
A cluster is typically sized to handle a given load.
A given percentage of the cluster resources are required for normal operations,
and the remainder is the available resources that can be used if some of the nodes are no longer usable.
If the cluster loses enough nodes, it will be overloaded:
<li>Remaining nodes may create disk full situation.
This will likely fail a lot of write operations, and if disk is shared with OS,
it may also stop the node from functioning.</li>
<li>Partition queues will grow to maximum size.
As queues are processed in FIFO order, operations are likely to get long latencies.</li>
<li>Many operations may time out while being processed,
causing the operation to be resent, adding more load to the cluster.</li>
<li>When new nodes are added, they cannot serve requests
before data is moved to the new nodes from the already overloaded nodes.
Moving data puts even more load on the existing nodes,
and as moving data is typically not high priority this may never actually happen.</li>
To configure what the minimal cluster size is, use
<a href="reference/services-content.html#min-distributor-up-ratio">min-distributor-up-ratio</a> and
<a href="reference/services-content.html#min-storage-up-ratio">min-storage-up-ratio</a>.