---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Proton"
---
<p>
Proton is Vespa's search core. Proton maintains disk and memory structures for documents.
As the data is dynamic, these structures are periodically optimized.
These jobs temporarily consume resources - make sure the host running Vespa
has sufficient buffer.
</p><p>
Proton stores its index in <em>$VESPA_HOME/var/db/vespa/search/</em>.
</p>
<h2 id="internal-structure">Internal structure</h2>
<p>
The internal structure of a <em>proton node</em>:
</p><img src="content/img/proton_feed.png"/><p>
The search node consists of a <em>bucket management system</em> which
sends requests to a set of <em>document databases</em>, each of which
consists of three <em>sub-databases</em>.
</p>
<h3 id="bucket-management">Bucket management</h3>
<p>
When the node starts up, it first needs to get an overview of what
partitions it has, and what buckets each partition currently stores.
Once it knows this, it is in initializing mode, able to handle load,
but distributors do not yet know bucket metadata for all the buckets,
and thus can't know whether buckets are consistent with copies on other nodes.
Once metadata for all buckets is known, the content node transitions from
initializing to up state.
As the distributors want quick access to bucket metadata,
the node keeps an in-memory bucket database to efficiently serve these requests.
</p><p>
It implements elasticity support in terms of the SPI.
Operations are ordered according to priority,
and only one operation per bucket can be in-flight at a time.
Below bucket management is the persistence engine,
which implements the SPI in terms of Vespa search.
The persistence engine reads the document type from the document id,
and dispatches requests to the correct document database.
</p>
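<p>
The ordering rule above (priority order, at most one in-flight operation per bucket)
can be illustrated with a small sketch. This is not proton's implementation;
the class name, the method names and the convention that a lower number means
higher priority are all assumptions made for the example.
</p>

```python
import heapq
from itertools import count


class BucketOperationQueue:
    """Toy scheduler: best priority first, one in-flight operation per bucket.

    Assumption for this sketch: a lower number means a higher priority.
    """

    def __init__(self):
        self._heap = []          # entries: (priority, sequence, bucket, operation)
        self._seq = count()      # tie-breaker, keeps FIFO order within a priority
        self._in_flight = set()  # buckets that currently have a running operation

    def submit(self, priority, bucket, operation):
        heapq.heappush(self._heap, (priority, next(self._seq), bucket, operation))

    def next_operation(self):
        """Return (bucket, operation) for the best runnable entry, or None."""
        deferred = []
        result = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            bucket = entry[2]
            if bucket in self._in_flight:
                deferred.append(entry)  # bucket is busy, retry later
                continue
            self._in_flight.add(bucket)
            result = (bucket, entry[3])
            break
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return result

    def complete(self, bucket):
        """Mark the in-flight operation on this bucket as finished."""
        self._in_flight.discard(bucket)
```

<p>
An operation on a busy bucket is deferred rather than dropped,
so it becomes runnable again once <em>complete()</em> is called for that bucket.
</p>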
<h3 id="document-database">Document database</h3>
<p>
Each document database is responsible for a single document type.
It has a component called FeedHandler which takes care of incoming documents, updates, and remove requests.
All requests are first written to a transaction log to make sure they are not lost if the node restarts.
They are then handed to the appropriate sub-database, based on the request type.
</p>
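<p>
As a rough illustration of the flow described above, the sketch below reads the
document type from a document id of the form
<em>id:&lt;namespace&gt;:&lt;document-type&gt;:&lt;key/value-pairs&gt;:&lt;user-specified&gt;</em>,
dispatches to one database per type, and logs each request before applying it.
The class and method names are hypothetical and the transaction log is a plain
in-memory list, so this shows only the ordering of steps, not the real components.
</p>

```python
def document_type(document_id):
    """Extract the document type from an id like 'id:ns:music::some-key'."""
    parts = document_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError("not a valid document id: %r" % document_id)
    return parts[2]


class DocumentDatabase:
    """Hypothetical stand-in for a document database handling one document type."""

    def __init__(self, doc_type):
        self.doc_type = doc_type
        self.transaction_log = []  # stand-in for the durable transaction log
        self.applied = []          # stand-in for the sub-databases

    def handle(self, operation, document_id):
        # Log first, so the request can be replayed after a restart ...
        self.transaction_log.append((operation, document_id))
        # ... then hand it on for processing.
        self.applied.append((operation, document_id))


class PersistenceEngine:
    """Hypothetical dispatcher: one document database per document type."""

    def __init__(self, doc_types):
        self.databases = {t: DocumentDatabase(t) for t in doc_types}

    def dispatch(self, operation, document_id):
        db = self.databases[document_type(document_id)]
        db.handle(operation, document_id)
        return db
```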
<h3 id="sub-databases">Sub-databases</h3>
<p>
There are three types of sub-databases, each with its own
<a href="attributes.html#document-meta-store">document meta store</a> and
<a href="document-summaries.html#document-store">document store</a>.
The document meta store holds a map from the document id to a local
id. This local id is used to address the document in the document
store. The document meta store also maintains information on the
state of the buckets that are present in the sub-database.
</p><p>
The sub-databases are maintained by the <em>index maintainer</em>.
The document distribution changes as the system is resized.
When the number of nodes in the system changes,
the index maintainer will move documents between the Ready and
Not Ready sub-databases to reflect the new distribution.
When an entry in the Removed sub-database gets old it is purged.
The sub-databases are:
<table class="table">
<thead></thead><tbody>
<tr>
<th>Not Ready</th>
<td>
Holds the redundant documents that are not searchable,
i.e. the <em>not ready</em> documents.
Documents that are not ready are only stored, not indexed.
It takes some processing to move from this state to the ready state.
</td>
</tr><tr>
<th>Ready</th>
<td>
Maintains an index of all <em>ready</em> documents and keeps them searchable.
One of the ready copies is <em>active</em> while the rest are <em>not active:</em>
<ul>
<li><strong>Active: </strong>
There should always be exactly one active copy of each
document in the system, though intermittently there may be more.
These documents produce results when queries are evaluated.
</li>
<li><strong>Not Active: </strong>
The ready copies that are not active are indexed but will not produce results.
By being indexed, they are ready to take over immediately
if the node holding the active copy becomes unavailable.
</li>
</ul>
</td>
</tr><tr>
<th>Removed</th>
<td>Keeps track of documents that have been removed.
The id and timestamp for each document are kept.
This information is used when buckets from two nodes are merged.
If the removed document exists on another node but with a different timestamp,
the most recent entry prevails.
</td>
</tr>
</tbody>
</table>
Hence, only <em>active</em> documents in <em>Ready</em> are searchable:
<table style="border: 1px solid black; padding: 15px;">
<tr><th></th><th><em>Not Ready</em></th><th><em>Ready</em></th><th><em>Removed</em></th></tr>
<tr><th>Not searchable</th><td rowspan="2">not indexed</td><td>indexed - not active</td><td rowspan="2">not indexed</td></tr>
<tr><th>Searchable</th><td style="background-color: lightgreen;">indexed - active</td></tr>
</table>
</p>
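<p>
The matrix above, and the merge rule for the Removed sub-database, can be
expressed compactly. The names below are made up for the example, and an entry
is modelled as a simple <em>(timestamp, state)</em> tuple, which is an assumption
and not proton's data model.
</p>

```python
from enum import Enum


class SubDatabase(Enum):
    NOT_READY = "notready"
    READY = "ready"
    REMOVED = "removed"


def is_indexed(sub_db):
    """Only documents in the Ready sub-database are indexed."""
    return sub_db is SubDatabase.READY


def is_searchable(sub_db, active):
    """A document produces query results only if it is ready AND active."""
    return sub_db is SubDatabase.READY and active


def merge_entry(local, remote):
    """Merge two (timestamp, state) entries: the most recent one prevails."""
    return remote if remote[0] > local[0] else local
```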
<h2 id="proton-maintenance-jobs">Proton maintenance jobs</h2>
<p>
Tune the jobs in <a href="content/setup-proton-tuning.html">Proton tuning</a>.
<a href="performance/sizing-search.html">Sizing search</a> describes the <em>static</em> proton sizing -
this article details the <em>temporary</em> resource usage for the proton jobs.
</p><p>
There is only one instance of each job at a time - e.g. attributes are flushed in sequence.
When a job is running, its metric is set to 1 - otherwise 0.
Use this to correlate observed performance with job runs - see <em>Run metric</em>.
</p><p>
Refer to the implementation of
<a href="https://github.com/vespa-engine/vespa/blob/master/config-model/src/main/java/com/yahoo/vespa/model/admin/monitoring/VespaMetricSet.java">
performance metrics</a>, see <em>getSearchNodeMetrics()</em>.
Metrics are available at the <a href="reference/metrics-health-format.html">Metrics API</a>.
</p>
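<p>
One way to use a run metric is to split latency samples by whether the job was
running when the sample was taken. The sketch below assumes the samples have
already been collected as <em>(run_metric, latency_ms)</em> pairs. How they are
collected, and the pair format itself, are assumptions for the example and
not part of the Metrics API.
</p>

```python
from statistics import mean


def latency_by_job_state(samples):
    """Average latency while a job runs versus while it is idle.

    samples: iterable of (run_metric, latency_ms) pairs, where run_metric is
    the job's run metric for that sample window (0 when idle, above 0 when
    the job was running during the window).
    """
    running = [latency for flag, latency in samples if flag > 0]
    idle = [latency for flag, latency in samples if flag <= 0]
    return {
        "running": mean(running) if running else None,
        "idle": mean(idle) if idle else None,
    }
```

<p>
A clearly higher average under <em>running</em> than under <em>idle</em> suggests
that the job contributes to the observed latency and is a candidate for tuning.
</p>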
<table class="table">
<thead></thead><tbody>
<tr>
<th id="attribute-flush">attribute flush</th>
<td>
Flush an <a href="attributes.html">attribute</a> vector from memory to disk,
based on configuration in the
<a href="content/setup-proton-tuning.html#flushstrategy">flushstrategy</a>.
This controls memory usage and query performance.
This also makes proton start quicker - see
<a href="reference/services-content.html#flush-on-shutdown">flush on shutdown</a>.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little - one thread flushes to disk
</td></tr>
<tr><th>Memory</th><td>
Little - some temporary use
</td></tr>
<tr><th>Disk</th><td>
A new file is written, hence temporarily 2x the size of an attribute on disk.
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.attribute_flush</em>
</td></tr>
<tr><th>Metric prefix</th><td>
<em>content.proton.documentdb.[ready|notready].attribute.memory_usage.</em>
</td></tr>
<tr><th>Metrics</th><td>
<em>
allocated_bytes.average<br/>
used_bytes.average<br/>
dead_bytes.average<br/>
onhold_bytes.average<br/>
</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="memory-index-flush">memory index flush</th>
<td>
Flush a <em>memory index</em> to disk, then trigger <a href="#disk-index-fusion">disk index fusion</a>.
The goal is to shrink memory usage by adding to the disk-backed indices.
Performance characteristics for this flush are similar to indexing.
Note: A high feed rate can cause multiple smaller flushed indices, like
<em>$VESPA_HOME/var/db/vespa/search/cluster.name/n1/documents/doc/0.ready/index/index.flush.102</em>
- see the high index number.
Multiple smaller indices are a symptom of memory indices that are too small for the feed rate -
to fix, increase <a href="content/setup-proton-tuning.html#flushstrategy">
flushstrategy &gt; native &gt; component &gt; maxmemorygain</a>.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little - one thread indexes to disk
</td></tr>
<tr><th>Memory</th><td>
Little
</td></tr>
<tr><th>Disk</th><td>
Creates a new disk index, size of the memory index.
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.memory_index_flush</em>
</td></tr>
<tr><th>Metric prefix</th><td>
<em>content.proton.documentdb.index.memory_usage.</em>
</td></tr>
<tr><th>Metrics</th><td>
<em>
allocated_bytes<br/>
used_bytes<br/>
dead_bytes<br/>
onhold_bytes<br/>
</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="disk-index-fusion">disk index fusion</th>
<td>
Merge the primary disk index with smaller indices generated by
<a href="#memory-index-flush">memory index flush</a> -
triggered by the memory index flush.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little - one thread merges indices
</td></tr>
<tr><th>Memory</th><td>
Little
</td></tr>
<tr><th>Disk</th><td>
Creates a new index while serving from the current -
hence 2x temporary disk usage for the given index.
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.disk_index_fusion</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="document-store-flush">document store flush</th>
<td>
Flushes the <a href="document-summaries.html#document-store-memory-usage">document store</a>.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little
</td></tr>
<tr><th>Memory</th><td>
Little
</td></tr>
<tr><th>Disk</th><td>
Little
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.document_store_flush</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="document-store-compaction">document store compaction</th>
<td>
Defragment and sort <a href="document-summaries.html#document-store">
document store</a> files as documents are updated and deleted,
in order to reduce disk space and speed up
<a href="streaming-search.html">streaming search</a>.
The file is sorted in bucket order on output. Triggered by
<a href="content/setup-proton-tuning.html#flushstrategy-native-total-diskbloatfactor">
diskbloatfactor</a>.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little - one thread reads one file, sorts and writes a new file
</td></tr>
<tr><th>Memory</th><td>
Holds a document summary store file in memory plus memory for sorting the file.
<strong>Note: This is important on hosts with little memory!</strong>
Reduce <a href="content/setup-proton-tuning.html#summary-store-logstore-maxfilesize">
maxfilesize</a> to increase number of files and use less temporary memory for compaction.
</td></tr>
<tr><th>Disk</th><td>
A new file is written while the current is serving, max temporary usage is 2x.
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.document_store_compact</em>
</td></tr>
<tr><th>Metric prefix</th><td>
<em>content.proton.documentdb.[ready|notready|removed].document_store.</em>
</td></tr>
<tr><th>Metrics</th><td>
<em>
disk_usage.average<br/>
disk_bloat.average<br/>
max_bucket_spread.average<br/>
memory_usage.allocated_bytes.average<br/>
memory_usage.used_bytes.average<br/>
memory_usage.dead_bytes.average<br/>
memory_usage.onhold_bytes.average<br/>
</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="bucket-move">bucket move</th>
<td>
Triggered by nodes going up/down, refer to <a href="elastic-vespa.html">
elastic Vespa</a> and
<a href="reference/services-content.html#searchable-copies">searchable-copies</a>.
Causes documents to be indexed or de-indexed, similar to feeding.
This moves documents in or out of ready/active sub-databases.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
CPU similar to feeding.
Consumes capacity from the index write thread - hence has feeding impact
</td></tr>
<tr><th>Memory</th><td>
As feeding - the memory index will grow
</td></tr>
<tr><th>Disk</th><td>
As feeding
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.bucket_move</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="lid-space-compaction">lid-space compaction</th>
<td>
As <a href="#bucket-move">bucket move</a>, however moves documents <em>within</em> a sub-database.
This is often triggered when a <a href="elastic-vespa.html">cluster grows</a> with more nodes:
documents are redistributed to new nodes and each node has fewer documents -
a lid-space compaction is hence triggered. This defragments, in place, the
<a href="attributes.html#document-meta-store">document meta store</a>.
Resources are freed on a subsequent <a href="#attribute-flush">attribute flush</a>.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Like feeding - documents are added and deleted
</td></tr>
<tr><th>Memory</th><td>
Little
</td></tr>
<tr><th>Disk</th><td>
0
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.lid_space_compact</em>
</td></tr>
<tr><th>Metric prefix</th><td>
<em>content.proton.documentdb.[ready|notready|removed].lid_space.</em>
</td></tr>
<tr><th>Metrics</th><td>
<em>
lid_bloat_factor.average<br/>
lid_fragmentation_factor.average<br/>
</em>
</td></tr>
</tbody>
</table>
</td>
</tr><tr>
<th id="removed-documents-pruning" style="white-space:nowrap;">removed documents pruning</th>
<td>
Prunes the <em>Removed</em> sub-database, which keeps IDs for deleted documents.
By default, it runs once per hour.
<table class="table">
<thead>
</thead><tbody>
<tr><th>CPU</th><td>
Little
</td></tr>
<tr><th>Memory</th><td>
Little
</td></tr>
<tr><th>Disk</th><td>
Little
</td></tr>
<tr><th>Run metric</th><td>
<em>content.proton.documentdb.job.removed_documents_prune</em>
</td></tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
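<p>
The disk figures in the table can be combined into a rough headroom estimate.
The sketch below assumes that jobs of different types may overlap in time, while
only one instance of each job runs at once, so the worst case is the largest
input of each job type added together. The function name and its parameters are
made up for the example, and the sizes are inputs the operator must supply.
</p>

```python
def temporary_disk_headroom(attribute_sizes, disk_index_size,
                            document_store_file_sizes, memory_index_size=0):
    """Estimate the extra disk needed while maintenance jobs run.

    All sizes use the same unit (e.g. bytes or GB).
    - attribute flush: a second copy of the attribute being flushed
    - memory index flush: a new disk index the size of the memory index
    - disk index fusion: a second copy of the index being fused
    - document store compaction: a second copy of the file being compacted
    """
    return (
        max(attribute_sizes, default=0)
        + memory_index_size
        + disk_index_size
        + max(document_store_file_sizes, default=0)
    )
```

<p>
For example, with attributes of 2, 5 and 3 GB, a 40 GB disk index, document store
files of 10 and 8 GB and a 1 GB memory index, the estimate is 5 + 1 + 40 + 10 = 56 GB.
</p>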