Skip to content
Go to file
Cannot retrieve contributors at this time
550 lines (509 sloc) 27.2 KB
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Elastic Vespa"
<p>Vespa allows application instances to grow (and shrink) while serving queries and accepting writes.
To maintain a uniform data distribution, documents are automatically redistributed in the background,
using minimal data movement.
Change the configured <a href="/documentation/reference/services-content.html#nodes">nodes</a>
and redeploy the application - no restarts needed.</p>
<p>The elasticity mechanism is also used to recover from a node loss -
new replicas of documents are auto-created to maintain redundancy.
Failed nodes is hence not a problem that requires immediate attention -
as long as the instance has sufficient capacity to handle the data and load,
it self-heals from node failures.</p>
<p>This article covers:
<li><a href="#storage-and-retrieval">storage and retrieval</a></li>
<li><a href="#search">search</a></li>
<li><a href="#resizing">resizing the instance</a></li>
<h2 id="storage-and-retrieval">Storage and retrieval</h2>
<p>The document distribution system in Vespa consists of two main components:
<a href="content/content-nodes.html#distributor">distributors</a> and content nodes.
Documents are accessed through a distributor.
To handle a large number of documents, Vespa groups them in <a href="content/buckets.html">buckets</a>,
using hashing or hints in the <a href="documents.html">document id</a>.
The content nodes provide a <em>Service Provider Interface</em> (SPI)
that abstracts how documents are stored in the elastic system.
The SPI is the link between the elastic bucket management system and the storage.
The SPI is implemented by <a href="proton.html">proton</a> -
<em>proton</em> and <em>content node</em> is used interchangeably in this documentation.
A document Put or Update is sent to all replicas of the bucket keeping the document.
If bucket replicas are out of sync, a bucket merge operation is run to re-sync the bucket.
A bucket contains <a href="#data-retention-vs-size">tombstones</a> of recently removed documents.
Read more about document expiry and batch removes in
<a href="documents.html#document-expiry">document expiry</a>.
Buckets are split when they grow too large, and joined when they shrink.
<em>This is a key feature for high performance in very small to very large instances,
and eliminates need for downtime or manual operations when scaling.</em>
<h3 id="offline-index">Offline index?</h3>
Building the index offline used to be attractive when CPU was the bottleneck for indexing.
Bandwith is since more often the bottleneck,
it is hence more important to limit data transfer to nodes.
Memory bandwith is also an important factor, blowing cpu caches is a big deal -
switching an entire index while serving has severe performance impact.
Serving systems want to smooth changes as much as possible
and offline index build / deployment is the opposite extreme.
Building indexes offline requires the partition layout to be known in the offline system,
which is in conflict with elasticity and auto-recovery
(where nodes can come and go without service impact).
It conflicts with per colos/data center optimization.
And - it is completely at odds with realtime writes.
For these reasons, it is not recommended, and not supported.
To control the indexing/serving resource tradeoff,
CPU fraction for indexing per Vespa node is <a href="reference/services-content.html#feeding">configurable</a>.
<h3 id="consistency">Consistency</h3>
Consistency is maintained at bucket level.
Content nodes calculate checksums based on the bucket contents,
and the distributors compare checksums among the bucket replicas.
A <em>bucket merge</em> is issued to resolve inconsistency, when detected.
While there are inconsistent bucket replicas, the distributors route operations to the "best" replica.
As buckets are split and joined, it is possible for replicas of a bucket to be split at different levels.
A node may have been down while its buckets have been split or joined.
This is called <em>inconsistent bucket splitting</em>.
Bucket checksums can not be compared across buckets with different split levels.
Consequently, distributors do not know whether all documents exist in enough replicas in this state.
Due to this, inconsistent splitting is one of the highest maintenance priorities.
After all buckets are split or joined back to the same level,
the distributors can verify that all the replicas are consistent and fix any detected issues with a merge.
<a href="content/consistency.html">Read more</a>.
<h2 id="writing-documents">Writing documents</h2>
Applications use the <a href="vespa-http-client.html">vespa-http-client</a>
or <a href="reference/document-v1-api-reference.html">/document/v1/</a> API to write to Vespa.
Alternatively, use <a href="reference/vespa-cmdline-tools.html#vespa-feeder">vespa-feeder</a> to feed files
or the Java <a href="api.html">Document API</a> directly - see illustration below and
<a href="writing-to-vespa.html">writing to Vespa</a>.
Next is <a href="indexing.html">indexing</a>
and/or <a href="document-processing.html">document processing</a>
where documents are prepared for indexing (and optionally processed using custom code),
before sent to a <em>distributor</em>.
The distributor maps the document to bucket, and sends it to <em>content nodes</em>:
<img src="content/img/elastic-feed-container.svg" height="410" width="340"/>
<img src="content/img/elastic-feed-vespafeeder.svg" height="410" width="340"/>
<table class="table">
<tr id="document-processing">
<th style="white-space: nowrap">Document processing</th>
The <a href="document-processing.html">document processing chain</a>
is a chain of processors that manipulate documents before they are stored.
Document processors can be user defined.
When using indexed search, the final step in the chain prepares documents for indexing.
The Document API forwards requests to distributors.
It calculates the correct distributor using
the distribution algorithm and the <a href="content/content-nodes.html#cluster-state">cluster state</a>.
With no known cluster state, the client library will send requests to a random node,
which replies with the updated cluster state if the node was incorrect.
Cluster states are versioned, such that clients hitting outdated distributors do not override
updated states with old states.
</tr><tr id="distributor">
The <a href="content/content-nodes.html#distributor">distributor</a> keeps track of
which content nodes that stores replicas of each bucket (maximum one replica each),
based on the <a href="reference/services-content.html#redundancy">Redundancy</a>
and information from the <em>cluster controller</em>.
A bucket maps to one distributor only.
A distributor keeps a bucket database with bucket metadata.
The metadata holds which content nodes store replicas of the buckets,
the checksum of the bucket content and the number of documents and meta entries within the bucket.
Each document is algorithmically mapped to a bucket and forwarded to the correct content nodes.
The distributors detect whether there are enough bucket replicas on the
content nodes and add/remove as needed.
Write operations wait for replies from every replica
and fail if less than redundancy are persisted within timeout.
</tr><tr id="cluster-controller">
<th>Cluster controller</th>
The <a href="content/content-nodes.html#cluster-controller">cluster controller</a>
manages the state of the distributor and content nodes.
This <em>cluster state</em> is used by the document processing chains
to know which distributor to send documents to,
as well as by the distributor to know which content nodes should have which bucket.
</tr><tr id="content-node">
<th>Content node</th>
The <a href="content/content-nodes.html">content node</a> has a <em>bucket management system</em>,
which sends requests to a set of <em>document databases</em>,
which each consists of three <em>sub-databases</em>.
In short, this node activates and deactivates buckets for search.
Refer to <a href="proton.html">proton</a> for details.
<h3 id="detailed-process-overview">Detailed process overview</h3>
<img src="img/reference/flow/feed1.png"/>
Content node expanded:
<img src="img/reference/flow/feed2.png" />
<h2 id="retrieving-documents">Retrieving documents</h2>
Retrieving documents is done by specifying an id to <em>get</em>,
or use a <a href="reference/document-select-language.html">selection expression</a>
to <em>visit</em> a range of documents - refer to the <a href="api.html">Document API</a>.
</p><img src="content/img/elastic-visit-get.svg" height="550" width="500"/>
<table class="table">
<tr id="get">
When the content node receives a get request,
it scans through all the document databases,
and for each one it checks all three sub-databases.
Once the document is found, the scan is stopped and the document returned.
If the document is found in a Ready sub-database,
the document retriever will apply any changes that is stored in the
<a href="attributes.html">attributes</a> before returning the document.
</tr><tr id="visit">
A visit request creates an iterator over each candidate bucket.
This iterator will retrieve matching documents from all sub-databases of all document databases.
As for get, attributes values are applied to document fields in the Ready sub-database.
<h2 id="search">Search</h2>
Indexed search has a separate pathway through the system.
It does not use the distributor, nor has it anything to do with the SPI.
It is orthogonal to the elasticity set up by the storage and retrieval described above.
How queries move through the system:
</p><img src="content/img/elastic-query.svg" height="320" width="270"/><p>
A query enters the system through the <em>QR-server (query rewrite server)</em>
in the <a href="jdisc/">Vespa Container</a>.
The QR-server issues one query per document type to the search nodes:
<table class="table">
<tr id="container">
The Container knows all the document types and rewrites queries as a
collection of queries, one for each type.
Queries may have a <a href="reference/query-api-reference.html#model.restrict">restrict</a> parameter,
in which case the container will send the query only to the specified document types.
It sends the query to content nodes and collect partial results.
It pings all content nodes every second to know whether they are alive,
and keeps open TCP connections to each one.
If a node goes down, the elastic system will make the documents available on other nodes.
</tr><tr id="content-node-matching">
<th>Content node matching</th>
The <em>match engine</em> <!-- link here to design doc / move it -->
receives queries and routes them to the right document database based on the document type.
The query is passed to the <em>Ready</em> sub-database, where the searchable documents are.
Based on information stored in the document meta store,
the query is augmented with a blocklist that ensures only <em>active</em> documents are matched.
<h2 id="resizing">Resizing</h2>
Resizing is orchestrated by the distributor, and happens gradually in the background.
It uses a variation of the <em>RUSH algorithms</em> to distribute documents.
Under normal circumstances, it makes a minimal number of documents move when nodes are added or removed.
Modify the configuration to add/remove nodes,
then <a href="cloudconfig/application-packages.html#deploy">deploy</a>.
Add an elastic content cluster to the application by adding a
<a href="reference/services-content.html">content</a> element
in <a href="reference/services.html">services.xml</a>.
<h3 id="bucket-distribution-algorithm">Bucket distribution algorithm</h3>
Central to the <a href="content/idealstate.html">ideal state distribution algorithm</a>
is the assignment of a node sequence to each bucket.
The illustration shows how each bucket uses a pseudo-random sequence
of numbers to derive the node sequence.
The pseudo-random generator is seeded with the bucket id, so each bucket will always get the same sequence:
</p><img src="content/img/BucketNodeSequence.png"/><p>
Each node is assigned a <em>distribution-key</em>, which is an index in the number sequence.
The set of number/index pairs is sorted on descending numbers, and this new sequence of indexes determines
which nodes the replicas are placed on, with the first node being host to the primary replica, i.e. the <em>active</em> replica.
This specification of where to place a bucket is called the bucket's <em>ideal state</em>.
<h3 id="adding-nodes">Adding nodes</h3>
When adding a new node, new ideal states are calculated for all buckets.
The buckets that should have a replica on the new node are moved, while the superfluous are removed.
Redistribution example - add a new node to the system, with redundancy 2:
</p><img src="content/img/AddNodeMoveBuckets.png"/><p>
The distribution algorithm generates a random node sequence for each bucket.
In this example with redundancy of two, the two nodes sorted first will store replicas of the data.
The figure shows how random placement onto two nodes changes as a third node is introduced.
The new node introduced takes over as primary for the buckets where it got sorted first in order,
and as secondary for the buckets where it got sorted second.
This ensures minimal data movement when nodes come or go, and allows capacity to be changed easily.
No buckets are moved between the existing nodes when a new node is added.
Based on the pseudo-random sequences, some buckets change from primary to secondary, or are removed.
Many nodes can be added in the same deployment. Adding more than one minimizes the total
bucket redistribution, but increases the time to get to the <em>ideal state</em>. Procedure:
<strong>Node setup:</strong> Prepare nodes by installing software,
set up the file systems / directories and set configuration server(s).
<a href="vespa-quick-start.html">Details</a>. Start the node.
<strong>Modify configuration:</strong>
Add a <a href="reference/services-content.html#node">node</a>-element in <em>services.xml</em>
and <a href="reference/hosts.html">hosts.xml</a>.
Refer to <a href="/documentation/vespa-quick-start-multinode-aws.html">multinode install</a>.
It is key that the node's <em>distribution-key</em> is higher than the highest existing index.
<strong>Deploy</strong>: <a href="operations/admin-procedures.html#monitor-distance-to-ideal-state">
Observe metrics</a> to track progress as the cluster redistributes documents.
Use the <a href="#cluster-controller">cluster controller</a> to monitor the state of the cluster.
<strong>Tune performance:</strong> Use
<a href="">
maxpendingidealstateoperations</a> to tune concurrency of bucket merge operations from distributor nodes.
Likewise, tune <a href="reference/services-content.html#merges">merges</a> -
concurrent merge operations per content node. The tradeoff is speed of bucket replication vs
use of resources, which impacts the applications' regular load.
<strong>ToDo:</strong> add something on bucket-oriented summary store as well.
<strong>Finish:</strong> The cluster is done redistributing when <em>idealstate.merge_bucket.pending</em>
is zero on all distributors.
<h3 id="removing-nodes">Removing nodes</h3>
Whether a node fails or is deliberately removed, the same redistribution happens.
If the remove is planned, the node remains up until it has been drained.
Example of redistribution after node failure, redundancy 2:
</p><img src="content/img/LoseNodeMoveBuckets.png"/><p>
In the figure, node 2 fails. This node held the active replicas of bucket 2 and 6.
Once the node fails the secondary replicas are set active.
If they were already in a <em>ready</em> state, they can start serving queries immediately.
Otherwise they will have to index their data.
All buckets that no longer have secondary replicas
are merged to the remaining nodes according to their pseudo-random sequence.
It is important that the nodes retain their original index;
otherwise the buckets would all have to move to regain their ideal states.
Do not remove more than <em>redundancy</em>-1 nodes at a time.
Observe <em>idealstate.merge_bucket.pending</em> to know bucket replica status,
when zero on all distributor nodes, it is safe to remove more nodes.
If <a href="#grouped-distribution">grouped distribution</a>
is used to control bucket replicas, remove all nodes in a group
if the redundancy settings ensure replicas in each group.
To increase bucket redundancy level before taking nodes out,
<a href="content/content-nodes.html">retire</a> nodes.
Again, track <em>idealstate.merge_bucket.pending</em> to know when done.
Use the <a href="content/api-state-rest-api.html">State API</a> or
<em>vespa-set-node-state</em><!-- ToDo: this tool should be documented somewhere -->
to set a node to <em>retired</em>.
The <a href="#cluster-controller">cluster controller's</a> status page lists node states.
<h3 id="merge-clusters">Merge clusters</h3>
To merge two content clusters, add nodes to the cluster like <em>add node</em> above, considering:
<a href="reference/services-content.html#node">distribution-keys</a> must be unique.
Modify paths like <em>$VESPA_HOME/var/db/vespa/search/mycluster/n3</em> before adding the node.
Set <a href="setting-vespa-variables.html">VESPA_CONFIGSERVERS</a>, then start the node.
<h3 id="migrate-clusters">Migrate clusters</h3>
An alternative to increasing cluster size is building a new cluster, then migrate documents to it.
This is supported using <a href="content/visiting.html#data-export-import">visiting</a>.
<h2 id="grouped-distribution">Grouped distribution</h2>
Documents can be uniformly distributed or in groups:
<img src="img/qps-scaling-content-cluster.svg" height="420" width="740" />
The <a href="reference/services-content.html#group">group</a> element is used to distribute
<a href="content/buckets.html">buckets</a> of documents - use cases:
<table class="table">
<th style="white-space: nowrap">Cluster upgrade</th>
Use two (or more) groups - then shut down one group in the cluster for upgrade.
This is quicker than doing it one by one node -
the tradeoff is that 50% of the nodes must sustain the full load (in case of two groups),
and one must always grow the cluster by two (or more) nodes, one in each group.
Latency per search node goes up with a larger index, such grouping will double the index size.
<th style="white-space: nowrap">Query throughput</th>
Applications with high query load on a small index can benefit of keeping the full index on all nodes -
i.e. configuring the same number of groups as nodes.
This confines queries to one node only,
so less <em>total</em> work is done per query in the cluster.
Hence the throughput increase.
This increases latency compared with splitting the index on many nodes.
It also limits options when growing the index size -
at some point, the query latency might grow too high,
and the only option then is to add nodes to the group.
<th style="white-space: nowrap">Cautious data placement</th>
A way to reduce the risk of data unavailability,
is to spread the data copies across different locations,
like placing all the copies in different racks.
Furthermore, this data placement enables fast upgrade procedures without service interruption,
as entire groups can be upgraded at a time.
By default, Vespa distributes <a href="content/buckets.html">buckets</a>
uniformly across all distributor and storage nodes.
Use grouped distribution to control where bucket copies are stored in relation to each other.
For instance, if your cluster consists of 10 racks,
you can define that for any given bucket,
it should store 2 copies in one rack and 2 copies in another rack.
That means that you can take down a whole rack,
and you are still sure that all your data is available,
because you know all data on that rack should have 2 copies on another rack.
Also, improved network bandwidth can be utilized.
If a node has restarted and it needs to fetch some data that came in while restarting to get up to date,
it can now choose to fetch that data from a node on the same rack,
which has better connectivity as it is in the same rack.
(Here you must of course define one group per rack,
and add the nodes in the rack in the respective groups)
When configuring the groups, specify how data is to be distributed among the group children.
Refer to the <a href="reference/services-content.html#group">group</a>
documentation for details about how to configure storage groups.
Which groups are selected as primary, secondary and so on, groups for a
given bucket is randomly determined using the same ideal state algorithm we
use to pick nodes, described in more detail in the
<a href="content/idealstate.html">ideal state</a> reference document.
Each group is assigned an index, to be used in this algorithm.
This will however likely create a bit worse skew globally compared to not using groups.
See <a href="performance/sizing-search.html">sizing search</a> for grouped distribution examples.
<h2 id="limitations-and-tradeoffs">Limitations and tradeoffs</h2>
<table class="table">
<tr id="availability-vs-resources"><th>Availability vs resources</th>
Keeping index structures costs resources. Not all replicas of buckets are
necessarily searchable, unless configured using
<a href="reference/services-content.html#searchable-copies">searchable-copies</a>.
As Vespa indexes buckets on-demand, the most cost-efficient setting is 1,
if one can tolerate temporary coverage loss during node failures.
Note that <em>searchable-copies</em> does not apply to <a href="streaming-search.html">streaming search</a>
as this does not use index structures.
Distributing buckets using <a href="#grouped-distribution">groups</a>
helps implement policies for availability and
<a href="performance/sizing-search.html">query performance</a>.
<tr id="data-retention-vs-size"><th>Data retention vs size</th>
When a document is removed, the document data is not immediately purged.
Instead, proton keeps <em>remove-entries</em> (tombstones of removed documents)
for a configurable amount of time. The default is two weeks, refer to
<a href="">
This ensures that removed documents stay removed in a distributed system where nodes change state.
Entries are removed periodically after expiry.
Hence, if a node comes back up after being down for more than two weeks,
removed documents are available again, unless the data on the node is wiped first.
A larger <em>pruneremoveddocumentsage</em> will hence grow the storage size as this keeps
document and tombstones longer.
<strong>Note:</strong> The backend does not store remove-entries for nonexistent documents.
This to prevent clients sending wrong document identifiers from
filling a cluster with invalid remove-entries.
A side-effect is that if a problem has caused all replicas of a bucket to be unavailable,
documents in this bucket cannot be marked removed until at least one replica is available again.
Documents are written in new bucket replicas while the others are down -
if these are removed, then older versions of these will not re-emerge,
as the most recent change wins.
<tr id="transition-time"><th style="white-space:nowrap;">Transition time</th>
See <a href="reference/services-content.html#transition-time">transition-time</a>
for tradeoffs for how quickly nodes are set down vs. system stability.
<tr id="removing-unstable-nodes"><th>Removing unstable nodes</th>
One can configure how many times a node is allowed to crash before it will automatically be removed.
The crash count is reset if the node has been up or down continuously for more than the
<a href="reference/services-content.html#stable-state-period">stable state period</a>.
If the crash count exceeds
<a href="reference/services-content.html#max-premature-crashes">max premature crashes</a>, the node will be disabled.
Refer to <a href="operations/admin-procedures.html#troubleshooting">troubleshooting</a>.
<tr id="minimal-amount-of-nodes-required-to-be-available"><th>Minimal amount of nodes required to be available</th>
A cluster is typically sized to handle a given load.
A given percentage of the cluster resources are required for normal operations,
and the remainder is the available resources that can be used if some of the nodes are no longer usable.
If the cluster loses enough nodes, it will be overloaded:
<li>Remaining nodes may create disk full situation.
This will likely fail a lot of write operations, and if disk is shared with OS,
it may also stop the node from functioning.</li>
<li>Partition queues will grow to maximum size.
As queues are processed in FIFO order, operations are likely to get long latencies.</li>
<li>Many operations may time out while being processed,
causing the operation to be resent, adding more load to the cluster.</li>
<li>When new nodes are added, they cannot serve requests
before data is moved to the new nodes from the already overloaded nodes.
Moving data puts even more load on the existing nodes,
and as moving data is typically not high priority this may never actually happen.</li>
To configure what the minimal cluster size is, use
<a href="reference/services-content.html#min-distributor-up-ratio">min-distributor-up-ratio</a> and
<a href="reference/services-content.html#min-storage-up-ratio">min-storage-up-ratio</a>.
You can’t perform that action at this time.