Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
371 lines (324 sloc) 13.8 KB
---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Document Summaries"
---
<p>
Use document summaries to configure which fields to include in results.
The <em>default</em> summary contains <em>all</em> fields that
are possible to include in summaries; all other summaries will contain
a subset of the fields included in the <em>default</em> summary.
</p><p>
Vespa keeps <a href="attributes.html">attribute</a>
type fields in memory and fetches those fields from
memory when requested as part of document summaries.
This means summaries are memory-only operations if all fields are attributes.
The other document fields are stored as blobs/records in the
<a href="#document-store">document store</a>.
This record is used when processing summary requests
that include fields in this record, and as needed during visiting or
re-distribution of content to handle elasticity.
</p><p>
The default summary class will always access the document store because it
includes the <a href="documents.html#document-ids">document ID</a> which is stored here.
To include the <a href="documents.html#document-ids">document ID</a> in a custom summary
class, add a field for the id and include it in the summary class.
</p><p>
When using additional summary classes to increase performance,
only the network data size is changed - the data read from storage is unchanged.
Having "debug" fields with summary enabled will hence also affect the
amount of information that needs to be read from disk.
</p><p>
Use <a href="/documentation/reference/search-definitions-reference.html#summary">dynamic</a>
to generate dynamic abstracts of fields, based on search keywords.
</p>
<h2 id="summary-classes-in-the-search-definition">Summary classes in the search definition</h2>
<p>
Define additional summary classes as described in
the <a href="reference/search-definitions-reference.html#summary">searchdefinition reference</a>.
This can be done by an implicit definition or by an explicit definition.
</p><p>
In an <em>implicit</em> definition, name one or more summary classes
that should contain the field on the field definition itself:
</p>
<pre>
field [name] type [type] {
&hellip;
summary-to: [summary name], [summary name] # The names of the doc summaries which will include this field
}
</pre>
<p>
Example: the <em>title</em> and <em>year</em> fields are
included in a the <em>titleyear</em> summary.
Also note that both <em>year</em> and <em>title</em> will be a part of the
<em>default</em> summary, even if not mentioned in the summary-to statement.
</p>
<pre class="brush: sd">
# A basic search definition - called music, should be saved to music.sd
search music {
# It contains one document type only - called music as well
document music {
field title type string {
indexing: summary | index # How this field should be indexed
<span style="background-color: yellow;">summary-to: titleyear</span>
}
field artist type string {
indexing: summary | attribute | index
}
field year type int {
indexing: summary | attribute
<span style="background-color: yellow;">summary-to: titleyear</span>
}
field popularity type int {
indexing: summary | attribute
}
field url type uri {
indexing: summary | index
}
}
fieldset default {
fields: title, artist
}
rank-profile default inherits default {
first-phase {
expression: nativeRank(title,artist) + attribute(popularity)
}
}
rank-profile textmatch inherits default {
first-phase {
expression: nativeRank(title,artist)
}
}
}
</pre>
<p>
In an <em>explicit</em> definition, name a summary class with the list of fields to include:
</p>
<pre>
search [name] {
document [name] {
&hellip;
}
document-summary [name] {
summary [field name] type [type] {
source: [source field name]
}
summary [field name] type [type] {
source: [source field name], [source field name]
}
}
}
</pre>
<p>
Example: Equivalent to the above example
</p>
<pre>
# A basic search definition - called music, should be saved to music.sd
search music {
# It contains one document type only - called music as well
document music {
field title type string {
indexing: summary | index # How this field should be indexed
}
field artist type string {
indexing: summary | attribute | index
}
field year type int {
indexing: summary | attribute
}
field popularity type int {
indexing: summary | attribute
}
field url type uri {
indexing: summary | index
}
}
fieldset default {
fields: title, artist
}
rank-profile default inherits default {
first-phase {
expression: nativeRank(title,artist) + attribute(popularity)
}
}
rank-profile textmatch inherits default {
first-phase {
expression: nativeRank(title,artist)
}
}
<span style="background-color: yellow;"> document-summary titleyear {
summary title type string {
source: title
}
summary year type int {
source: year
}
}</span>
}
</pre>
<p>
It is also possible to combine implicit and explicit definition of
summary classes. For more details on summary properties,
see <a href="reference/search-definitions-reference.html#summary">Search Definitions: Summary</a>.
</p>
<h2 id="summary-classes-in-queries">Summary classes in queries</h2>
<p>
Use <em>summary=[summary name]</em> in the query to choose the
summary class to use (the default one is called <em>default</em>).
See <a href="reference/search-api-reference.html#presentation.summary">Search API</a>. Example:
<pre>
/search/?yql=select+*+from+sources+*+where+default+contains+"best"%3B&amp;summary=titleyear
</pre>
The <em>select</em> statement in YQL lists a set of fields to return. Vespa
in general makes a best effort to return those fields, and only those fields,
unless a wildcard ("*") is given as argument. The wildcard implies returning
the full set of fields included in the given summary class.</p>
<p>In conjunction with YQL statements, the <em>summary</em> argument operates like a
definition of the set which YQL <em>select</em> then chooses a subset of fields
from.</p>
<p>In other words, if the YQL expression is "select * …", and the summary
argument is <em>titleyear</em>, all the fields in the summary class <em>titleyear</em> will be
returned. If the select statement lists one or more fields (and summary is
<em>titleyear</em>), the summary class <em>titleyear</em> is fetched, and the fields
<em>not</em> listed in the select statement will be stripped away.
</p>
<h2 id="document-store">Document store</h2>
<p>
Documents are stored in the <em>document store</em> in <em>proton</em>.
Put, update and remove operations are persisted in the <em>transaction log server (TLS)</em>
before updating the document in the document store. The operation is ack'ed to the client
and the result of the operation is immediately seen in search results.
</p><p>
<strong>Note:</strong> If <a href="reference/services-content.html#visibility-delay">visibility-delay</a>
is set to non-zero, writes are batched (for better write performance) and delayed in search results.
</p><p>
Files in the document store are written sequentially, and occur in pairs - example:
<pre>
-rw-r--r-- 1 owner users 4133380096 Aug 10 13:36 1467957947689211000.dat
-rw-r--r-- 1 owner users 71192112 Aug 10 13:36 1467957947689211000.idx
</pre>
The <a href="content/setup-proton-tuning.html#summary-store-logstore-maxfilesize">maximum size</a>:
(in bytes) per .dat file on disk can be set using the following:
<pre>
&lt;content id="mycluster" version="1.0"&gt;
&lt;engine&gt;
&lt;proton&gt;
&lt;tuning&gt;
&lt;searchnode&gt;
&lt;summary&gt;
&lt;store&gt;
&lt;logstore&gt;
<span style="background-color: yellow;">&lt;maxfilesize&gt;8000000000&lt;/maxfilesize&gt;</span>
</pre>
Notes:
<ul>
<li>The files are written in sequence. <em>proton</em> starts with one pair
and grows it until <em>maxfilesize</em>. Once full, a new pair is started.</li>
<li>This means, the pair is immutable, except the last pair, which is written to.</li>
<li>This also implies that documents for a given group or user (i.e. bucket)
will be distributed to multiple such pairs, depending on insertion order.</li>
<li>This also means that a <a href="streaming-search.html">streaming search</a> potentially hits multiple
files per bucket searched. This impacts search latency if disk access time is high.</li>
<li>Documents exist in multiple versions in multiple files.
Older versions are compacted away when a pair is scheduled for being the new active pair -
obsolete versions are removed, leaving only the active document version left in a new file pair
- which is the new active pair.</li>
<li>Read more on implications of setting <em>maxfilesize</em> in
<a href="proton.html#document-store-compaction">proton maintenance jobs</a>.</li>
<li>Files are written in <a href="content/setup-proton-tuning.html#summary-store-logstore-chunk">chunks</a>,
using comression settings.
</ul>
<h3 id="visiting-the-document-store">Visiting the document store</h3>
<p>
A streaming search is a <a href="content/visiting.html">visit</a>,
by buckets, in sequence or (semi)parallel. Access to the document store is by local ID - LID.
In <em>proton</em>, a bucket is just a property on the <a href="documents.html">document's ID</a>.
As documents are added to the file pairs in insertion order, a scan of all documents
in a bucket is hence a set of random file accesses, unless some kind of bucket localization is done.
The set of document IDs for a given bucket is easily generated from memory structures and assumed
to take little resources.
</p>
<h3 id="defragmentation-within-files">Defragmentation within files</h3>
<p>
<a href="proton.html#document-store-compaction">Document store compaction</a>,
(defragmentation), does two things:
<ol>
<li>Removes stale versions of documents (i.e. old version of updated documents).
Triggered when the disk bloat of the document store is larger than the total disk usage of the document store times
<a href="content/setup-proton-tuning.html#flushstrategy-native-total-diskbloatfactor"><em>diskbloatfactor</em></a>.</li>
</ol>
Refer to <a href="content/setup-proton-tuning.html#summary">summary tuning</a> for details.
As streaming accesses are organized per bucket, latencies are cut by co-locating documents by bucket.
<p></p>
Defragmentation status is hence best observed by tracking <em>max_bucket_spread</em> over time,
a sawtooth pattern is normal for corpused that change over time.
The <em>document_store_compact</em> metric tracks when proton is running compaction jobs.
Compaction settings can be set too tight, in that case, the metric is always, or close to, 1.
</p><p>
When benchmarking, it is hence important to set the correct compaction settings,
and also make sure that proton has compacted files since (can take hours),
and is not actively compacting (<em>document_store_compact</em> should be 0 most of the time).
<!-- The effect of compaction on query latency is <strong>ToDo</strong>. -->
</p>
<h3 id="defragmentation-across-files">Defragmentation across files</h3>
<p>
There is no bucket-compaction across files - documents will not move between files.
</p>
<h3 id="optimized-reads-using-chunks">Optimized reads using chunks</h3>
<p>
As documents are clustered within the .dat file, proton optimizes reads by reading larger chunks
when accessing documents. When visiting (as in streaming search), documents are read in <em>bucket</em> order.
This is the same order as the defragmentation jobs uses.
</p><p>
The first document read in a visit operation for a bucket will read a chunk
from the .dat file into memory. Subsequent document accesses are served be a memory lookup only.
The chunk size is configured by
<a href="content/setup-proton-tuning.html#summary-store-logstore-chunk-maxsize">maxsize</a>:
<pre>
&lt;engine&gt;
&lt;proton&gt;
&lt;tuning&gt;
&lt;searchnode&gt;
&lt;summary&gt;
&lt;store&gt;
&lt;logstore&gt;
&lt;chunk&gt;
<span style="background-color: yellow;">&lt;maxsize&gt;16384&lt;/maxsize&gt;</span>
&lt;/chunk&gt;
&lt;/logstore&gt;
</pre>
There can be 2^22=4M chunks. This sets a minimum chunk size based on <em>maxfilesize</em> -
e.g. an 8G file can have minimum 2k chunk size.
Finally, bucket size is configured by setting
<a href="reference/services-content.html#bucket-splitting">bucket-splitting</a>:
<pre>
&lt;content id="imagepersonal" version="1.0"&gt;
&lt;tuning&gt;
<span style="background-color: yellow;">&lt;bucket-splitting max-documents="1024"/&gt;</span>
</pre>
</p>
<p>
The following are hence the relevant sizing units:
<ul>
<li>.dat file size - <em>maxfilesize</em>.
Larger files give less files and hence better locality,
but compaction requires more memory and more time to complete.</li>
<li>chunk size - <em>maxsize</a></em>.
Smaller chunks give less wasted IO bytes but more IO operations.
<li>bucket size - <em>bucket-splitting</em>.
Larger buckets give less buckets and hence better locality to nodes and files.
Larger buckets means higher streaming search latency per bucket.</li>
</ul>
</p>
<h2 id="document-store-memory-usage">Document store memory usage</h2>
<p>
The <em>document store</em> has a mapping in memory from local ID (LID)
to position in a document store file (.dat).
Part of this mapping is persisted in the .idx-file paired to the .dat file.
The memory used by the document store is linear with number of documents
and updates to these.
</p><p>
The metric <em>content.proton.documentdb.ready.document_store.memory_usage.allocated_bytes</em>
gives the size in memory - use the
<a href="reference/metrics-health-format.html#metrics-api">metric API</a> to find it.
A rule of thumb is 12 bytes per document.
</p>