---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Grouping Reference"
---
<p>
Refer to the <a href="../grouping.html">grouping guide</a> for an introduction
to Vespa's grouping feature.
</p>
<ul>
<li>
Group query results using a custom expression (using the <code>group</code> clause):
<ul>
<li>A numerical constant.</li>
<li>A document attribute.</li>
<li>
A function over another expression (<code>xorbit</code>, <code>md5</code>, <code>cat</code>,
<code>xor</code>, <code>and</code>, <code>or</code>, <code>add</code>, <code>sub</code>,
<code>mul</code>, <code>div</code>, <code>mod</code>) or any other <a href="#expression">expression</a>.
</li>
<li>
The data type of an expression is resolved on a best-effort basis, much as common
programming languages resolve arithmetic on operands of different data types.
</li>
<li>
The results of any expression are either scalar or single dimension arrays.
<ul>
<li><code>add(&lt;array&gt;)</code> adds all elements together to produce a scalar.</li>
<li>
<code>add(&lt;arrayA&gt;, &lt;arrayB&gt;)</code> adds each element together producing a new
array whose size is <code>max(|&lt;arrayA&gt;|, |&lt;arrayB&gt;|)</code>.
</li>
</ul>
</li>
</ul>
</li>
<li>
Groups can contain subgroups (by using <code>each</code> and <code>group</code> operations),
and may be nested to any level.
</li>
<li>
Multiple sub-groupings or outputs can be created under the same group level, using multiple parallel <code>each</code>
or <code>all</code> clauses, and each one may be labelled using the <code><a href="#labels">as(mylabel)</a></code> construct.
</li>
<li>
Each level of grouping specifies a set of <a href="#aggregation">aggregates</a> to collect for all documents that belong to that
group (using the <code>output</code> operation):
<ul>
<li>The documents in a group, retrieved using a specified summary class.</li>
<li>The count of documents in a group.</li>
<li>The sum, average, min, max, xor or standard deviation of an expression.</li>
</ul>
</li>
<li>
Each level of grouping may specify how to order its groups (using the <code>order</code> operation):
<ul>
<li>Ordering can be done using any of the available aggregates.</li>
<li>Multi-level grouping allows strict ordering where primary aggregates may be equal.</li>
<li>Ordering is either ascending or descending, specified per level of ordering.</li>
</ul>
</li>
<li>
You may limit the number of groups returned for each level (using the <code>max</code> operation),
returning only the first <em>n</em> groups as specified by the <code>order</code> operation.
</li>
<li>
You may count the <a href="#counting-unique-groups">number of unique groups</a> for a level using the
<code>count</code> aggregator. Note that <code>count</code> operates independently of the
<code>max</code> clause.
</li>
<li>
You may paginate through group- and hit-lists using the "<a href="#continue">continuations</a>"
query parameter.
</li>
<li>
You may group on <a href="#multivalue-grouping">multivalued attributes</a>.
Most grouping functions will just handle the elements of
multivalued attributes separately, as if they were all individual values in separate documents.
</li>
<li>
The <a href="#interpolatedlookup"><code>interpolatedlookup</code></a> function
will count elements in a sorted array that are less than an expression,
with linear interpolation if the expression is between element values.
</li>
</ul>
<h2 id="grammar">Select parameter language grammar</h2>
<pre>
request ::= group [ "where" "(" ( "true" | "$query" ) ")" ]
group ::= ( "all" | "each") "(" operations ")" [ "as" "(" identifier ")" ]
operations ::= [ "group" "(" expression ")" ]
( ( "alias" "(" identifier "," expression ")" ) |
( "max" "(" number ")" ) |
( "order" "(" expList | aggrList ")" ) |
( "output" "(" aggrList ")" ) |
( "precision" "(" number ")" ) )*
group*
aggrList ::= aggr ( "," aggr )*
aggr ::= ( ( "count" "(" ")" ) |
( "sum" "(" exp ")" ) |
( "avg" "(" exp ")" ) |
( "max" "(" exp ")" ) |
( "min" "(" exp ")" ) |
( "xor" "(" exp ")" ) |
( "stddev" "(" exp ")" ) |
( "summary" "(" [ identifier ] ")" ) )
[ "as" "(" identifier ")" ]
expList ::= exp ( "," exp )*
exp ::= ( "+" | "-") ( "$" identifier [ "=" math ] ) | ( math ) | ( aggr )
math ::= value [ ( "+" | "-" | "*" | "/" | "%" ) value ]
value ::= ( "(" exp ")" ) |
( "add" "(" expList ")" ) |
( "and" "(" expList ")" ) |
( "cat" "(" expList ")" ) |
( "div" "(" expList ")" ) |
( "docidnsspecific" "(" ")" ) |
( "fixedwidth" "(" exp "," number ")" ) |
( "interpolatedlookup" "(" attributeName "," exp ")") |
( "math" "." (
(
"exp" | "log" | "log1p" | "log10" | "sqrt" | "cbrt" |
"sin" | "cos" | "tan" | "asin" | "acos" | "atan" |
"sinh" | "cosh" | "tanh" | "asinh" | "acosh" | "atanh"
) "(" exp ")" |
( "pow" | "hypot" ) "(" exp "," exp ")"
)) |
( "max" "(" expList ")" ) |
( "md5" "(" exp "," number "," number ")" ) |
( "min" "(" expList ")" ) |
( "mod" "(" expList ")" ) |
( "mul" "(" expList ")" ) |
( "or" "(" expList ")" ) |
( "predefined" "(" exp "," "(" bucket ( "," bucket )* ")" ")" ) |
( "reverse" "(" exp ")" ) |
( "relevance" "(" ")" ) |
( "sort" "(" exp ")" ) |
( "strcat" "(" expList ")" ) |
( "strlen" "(" exp ")" ) |
( "size" "(" exp")" ) |
( "sub" "(" expList ")" ) |
( "time" "." ( "year" | "monthofyear" | "dayofmonth" | "dayofyear" | "dayofweek" |
"hourofday" | "minuteofhour" | "secondofminute" ) "(" exp ")" ) |
( "todouble" "(" exp ")" ) |
( "tolong" "(" exp ")" ) |
( "tostring" "(" exp ")" ) |
( "toraw" "(" exp ")" ) |
( "uca" "(" exp "," string [ "," string ] ")" ) |
( "xor" "(" expList ")" ) |
( "xorbit" "(" exp "," number ")" ) |
( "ymum" "(" ")" ) |
( "zcurve" "." ( "x" | "y" ) "(" exp ")" ) |
( attributeName "." "at" "(" number ")") |
( attributeName )
bucket ::= "bucket" ( "(" | "[" | "&lt;" )
( "-inf" | rawvalue | number | string )
[ "," ( "inf" | rawvalue | number | string ) ]
( ")" | "]" | "&gt;" )
rawvalue ::= "{" ( ( string | number ) "," )* "}"
</pre>
<h2 id="output-format">Output format</h2>
<p>
When grouping results, <strong>groups</strong> that contain <strong>outputs</strong>, <strong>group lists</strong>,
and <strong>hit lists</strong> are generated.
Group lists contain sub-groups, and hit lists contain hits that are part of the owning group.
</p><p>
The identity of a group is held by its <em>id</em>.
Scalar identities, such as long, double and string, are directly
available from the <em>id</em>, whereas range identities used for bucket aggregation are
separated into the sub-nodes <em>from</em> and <em>to</em>.
Refer to the <a href="default-result-format.html">result format reference</a>.
</p>
<h2 id="continue">Continue parameter</h2>
<p>
Pagination of grouping results is managed by "continuations". These are
opaque objects that can be combined and re-submitted using the
"continuations" annotation on the grouping step of the query to move to
the previous or next page in a result list.
</p><p>
All root groups contain a single "this" continuation. That continuation
represents the current view, and if submitted as the sole continuation it
will reproduce the exact same result as the one that contained it. Other
named continuations are available in the result, and these can be appended
to the "this" continuation to perform the corresponding pagination
operation. E.g. the "next" continuation of a group list can be used to
move to the next page of groups in that list.
</p><p>
Any number of continuations can be combined in a query, but the first must
always be the "this" continuation. E.g. you may simultaneously move both
to the next page of one list, and the previous page of another.
</p><p class="alert alert-success">
If more than one continuation object is provided for the same group- or
hit-list, the one given last is the one that takes effect. This is because
continuations are processed in the order given, and they replace whatever
continuations they collide with.
</p><p>
If working programmatically with grouping, you will find the
<code><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/Continuation.html">Continuation</a></code>
objects within
<code><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/result/RootGroup.html">RootGroup</a></code>,
<code><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/result/GroupList.html">GroupList</a></code> and
<code><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/result/HitList.html">HitList</a></code>
result objects. These can then be added back into the continuation list of the
<code><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/GroupingRequest.html">GroupingRequest</a></code>
to paginate.
</p><p>
Here is an example of a query that provides a continuation to the grouping statement:
</p>
<pre>/search/?yql=select (&hellip;) | [{ 'continuations':['BGAAABEBCA'] }]all(&hellip;);</pre>
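<p>
The rule that the last continuation given for a list takes effect can be modelled as follows.
This is a minimal sketch, not Vespa's implementation: real continuations are opaque strings,
so the <code>(target, page)</code> tuples, the target names and the function
<code>resolve_continuations</code> are hypothetical, and the leading "this" continuation is left out.
</p>
<pre>
def resolve_continuations(continuations):
    """Apply continuations in the order given; later ones replace earlier ones.

    Each continuation is modelled as a (target, page) tuple, where target
    names the group- or hit-list it addresses. This shape is hypothetical.
    """
    effective = {}
    for target, page in continuations:
        # A continuation for an already-seen list replaces the earlier one.
        effective[target] = page
    return effective


requested = [
    ("groups:a", "next"),
    ("hits:a/1", "next"),
    ("groups:a", "prev"),  # collides with the first entry and wins
]
print(resolve_continuations(requested))
</pre>
<p>
In this model the group list ends up on its previous page, while the hit list still moves to its next page.
</p>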
<h2 id="labels">Labels</h2>
<p>
Lists created using the <code>each</code> keyword can be assigned a label using the construct <code>each(...) as(mylabel)</code>.
The outputs created by that <code>each</code> clause will be identified by this label.
</p>
<h2 id="aggregation">Aggregators</h2>
<table class="table">
<tr><td colspan="4"><h3>Group list aggregators</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>count</td><td>Counts the number of unique groups (as produced by the <code>group</code> clause).</td><td>None</td><td>Long</td></tr>
<tr><td colspan="4"><h3>Group aggregators</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>count</td><td>Increments a long counter every time it is invoked.</td><td>None</td><td>Long</td></tr>
<tr><td>sum</td><td>Sums the argument over all selected documents.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td>avg</td><td>Computes the average over all selected documents.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td>min</td><td>Keeps the minimum value of selected documents.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td>max</td><td>Keeps the maximum value of selected documents.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td>xor</td><td>XOR the values (their least significant 64 bits) of all selected documents.</td><td>Any</td><td>Long</td></tr>
<tr><td>stddev</td><td>Computes the population standard deviation over all selected documents.</td><td>Numeric</td><td>Double</td></tr>
<tr><td colspan="4"><h3>Hit aggregators</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>summary</td><td>Produces a summary of the requested summary class.</td><td>Name of summary class</td><td>Summary</td></tr>
</table>
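<p>
The <code>stddev</code> aggregator is described above as the <em>population</em> standard deviation.
The sketch below only illustrates that definition on made-up numbers; it uses Python's standard library
and says nothing about how Vespa computes the value internally.
</p>
<pre>
import math
import statistics

# Hypothetical values of one expression over the documents in a group.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)
# The population variance divides by N, not by N - 1.
population_variance = sum((v - mean) ** 2 for v in values) / len(values)
population_stddev = math.sqrt(population_variance)

print(population_stddev)          # population standard deviation
print(statistics.pstdev(values))  # same definition, from the standard library
print(statistics.stdev(values))   # sample standard deviation, a different value
</pre>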
<p class="alert alert-success">
When all arguments are numeric, the result type is resolved by looking at the argument types. If all
arguments are longs, the result is a long; if at least one argument is a double, the result is a
double.
</p>
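<p>
The numeric type rule in the note above can be written out as a small function. This is a sketch of the
stated rule only; the function name and the use of the strings <code>"long"</code> and
<code>"double"</code> as type names are assumptions made for illustration.
</p>
<pre>
def resolve_numeric_type(*argument_types):
    """Return "double" if any argument is a double, otherwise "long"."""
    for name in argument_types:
        if name not in ("long", "double"):
            raise ValueError("not a numeric type: " + name)
    return "double" if "double" in argument_types else "long"


print(resolve_numeric_type("long", "long"))
print(resolve_numeric_type("long", "double", "long"))
</pre>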
<p class="alert alert-success">
When using order(), aggregators can also be used in expressions, in
order to get increased control over group sorting. This does not work with
expressions that take attributes as an argument, unless the expression is enclosed
within an aggregator.
</p>
<h2 id="expression">Expressions</h2>
<table class="table">
<tr><td colspan="4"><h3>Arithmetic expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>add</td><td>Add the arguments together.</td><td>Numeric+</td><td>Numeric</td></tr>
<tr><td>+</td><td>Add left and right argument.</td><td>Numeric, Numeric</td><td>Numeric</td></tr>
<tr><td>mul</td><td>Multiply the arguments together.</td><td>Numeric+</td><td>Numeric</td></tr>
<tr><td>*</td><td>Multiply left and right argument.</td><td>Numeric, Numeric</td><td>Numeric</td></tr>
<tr><td>sub</td><td>Subtract second argument from first, third from result, etc.</td><td>Numeric+</td><td>Numeric</td></tr>
<tr><td>-</td><td>Subtract right argument from left.</td><td>Numeric, Numeric</td><td>Numeric</td></tr>
<tr><td>div</td><td>Divide first argument by second, result by third, etc.</td><td>Numeric+</td><td>Numeric</td></tr>
<tr><td>/</td><td>Divide left argument by right.</td><td>Numeric, Numeric</td><td>Numeric</td></tr>
<tr><td>mod</td><td>Modulo first argument by second, result by third, etc.</td><td>Numeric+</td><td>Numeric</td></tr>
<tr><td>%</td><td>Modulo left argument by right.</td><td>Numeric, Numeric</td><td>Numeric</td></tr>
<tr><td>neg</td><td>Negate argument.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td>-</td><td>Negate right argument.</td><td>Numeric</td><td>Numeric</td></tr>
<tr><td colspan="4"><h3>Bitwise expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>and</td><td>AND the arguments in order.</td><td>Long+</td><td>Long</td></tr>
<tr><td>or</td><td>OR the arguments in order.</td><td>Long+</td><td>Long</td></tr>
<tr><td>xor</td><td>XOR the arguments in order.</td><td>Long+</td><td>Long</td></tr>
<tr><td colspan="4"><h3>String expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>strlen</td><td>Count the number of bytes in argument.</td><td>String</td><td>Long</td></tr>
<tr><td>strcat</td><td>Concatenate arguments in order.</td><td>String+</td><td>String</td></tr>
<tr><td colspan="4"><h3>Type conversion expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>todouble</td><td>Convert argument to double.</td><td>Any</td><td>Double</td></tr>
<tr><td>tolong</td><td>Convert argument to long.</td><td>Any</td><td>Long</td></tr>
<tr><td>tostring</td><td>Convert argument to string.</td><td>Any</td><td>String</td></tr>
<tr><td>toraw</td><td>Convert argument to raw.</td><td>Any</td><td>Raw</td></tr>
<tr><td colspan="4"><h3>Raw data expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>cat</td><td>Cat the binary representation of the arguments together.</td><td>Any+</td><td>Raw</td></tr>
<tr><td>md5</td><td>Does an md5 over the binary representation of the argument, and keeps the lowest 'width' bits.</td><td>Any, Numeric(width)</td><td>Raw</td></tr>
<tr><td>xorbit</td><td>Does an xor of 'width' bits over the binary representation of the argument. Width is rounded up to a multiple of 8.</td><td>Any, Numeric(width)</td><td>Raw</td></tr>
<tr><td colspan="4"><h3>Accessor expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>relevance</td><td>Return the computed rank of a document.</td><td>None</td><td>Double</td></tr>
<tr><td>docidnsspecific</td><td>Return the docid without namespace.</td><td>None</td><td>String</td></tr>
<tr><td>&nbsp;</td><td colspan="3">Applies only to streaming search.</td></tr>
<tr><td>ymum</td><td>Return the ymum part of docid.</td><td>None</td><td>Long</td></tr>
<tr><td>&nbsp;</td><td colspan="3">Applies only to streaming search.</td></tr>
<tr><td>&lt;attribute-name&gt;</td><td>Return the value of the named attribute.</td><td>None</td><td>Any</td></tr>
<tr><td colspan="4"><h3>Bucket expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>fixedwidth</td><td>Maps the value of the first argument into consecutive buckets whose width equals the second argument.</td><td>Any, Numeric</td><td>NumericBucketList</td></tr>
<tr><td>predefined</td><td>Maps the value of the first argument into the given buckets.</td><td>Any, Bucket+</td><td>BucketList</td></tr>
<tr><td colspan="4"><h3>Time expressions</h3>
Use the query parameter "timezone" to set the timezone to use when running these
expressions. E.g. <code>&amp;timezone=GMT-1</code>. See Sun's documentation
on <a href="http://java.sun.com/javase/6/docs/api/java/util/TimeZone.html">TimeZone</a> for format
reference.
</td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>time.dayofmonth</td><td>Returns the day of month (1-31) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.dayofweek</td><td>Returns the day of week (0-6) for the given timestamp, Monday being 0.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.dayofyear</td><td>Returns the day of year (0-365) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.hourofday</td><td>Returns the hour of day (0-23) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.minuteofhour</td><td>Returns the minute of hour (0-59) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.monthofyear</td><td>Returns the month of year (1-12) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.secondofminute</td><td>Returns the second of minute (0-59) for the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td>time.year</td><td>Returns the full year (e.g. 2009) of the given timestamp.</td><td>Long</td><td>Long</td></tr>
<tr><td colspan="4"><h3>List expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>size</td><td>Return the number of elements in the argument if it is a list. Otherwise, return 1.</td><td>Any</td><td>Long</td></tr>
<tr><td>sort</td><td>Sort the elements of the argument in ascending order if it is a list. Otherwise, this is a no-op.</td><td>Any</td><td>Any</td></tr>
<tr><td>reverse</td><td>Reverse the elements of the argument if it is a list. Otherwise, this is a no-op.</td><td>Any</td><td>Any</td></tr>
<tr><td colspan="4"><h3>Other expressions</h3></td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr>
<td>zcurve.x</td><td>
Returns the X component of the given zcurve encoded 2d point.
All fields of type "position" have an accompanying "&lt;fieldName&gt;_zcurve" attribute that can be decoded using this expression, e.g. <code>zcurve.x(foo_zcurve)</code>.
</td><td>Long</td><td>Long</td>
</tr>
<tr><td>zcurve.y</td><td>Returns the Y component of the given zcurve encoded 2d point.</td><td>Long</td><td>Long</td></tr>
<tr><td>uca</td><td>Converts the attribute string using unicode collation algorithm, useful for sorting.</td><td>Any, Locale(String), Strength(String)</td><td>Raw</td></tr>
<tr><td colspan="4"><h3>Single argument standard mathematical expressions</h3>
These are the standard mathematical functions as found in the Java
<a href="https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html">Math</a>
class.
</td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr><td>math.exp</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.log</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.log1p</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.log10</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.sqrt</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.cbrt</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.sin</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.cos</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.tan</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.asin</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.acos</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.atan</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.sinh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.cosh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.tanh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.asinh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.acosh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td>math.atanh</td><td>&nbsp;</td><td>Double</td><td>Double</td></tr>
<tr><td colspan="4"><h3>Dual argument standard mathematical expressions</h3>
We also implement a few other convenient expressions.
<code>math.hypot</code> is particularly convenient for geometrical distance calculations.
</td></tr>
<tr><th>Name</th><th>Description</th><th>Arguments</th><th>Result</th></tr>
<tr>
<td>math.pow</td>
<td>Return X^Y.</td>
<td>Double, Double</td>
<td>Double</td>
</tr>
<tr>
<td>math.hypot</td>
<td>Return the length of the hypotenuse given X and Y: sqrt(X^2 + Y^2).</td>
<td>Double, Double</td>
<td>Double</td>
</tr>
</table>
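<p>
The multi-argument forms of <code>sub</code>, <code>div</code> and <code>mod</code> in the table above
are described as applying the operation to the first two arguments, then to that result and the third argument,
and so on. The sketch below illustrates that left-to-right evaluation. The helper
<code>fold_left</code> is hypothetical, and Python's own operators stand in for Vespa's, so type handling
and edge cases such as negative operands may differ.
</p>
<pre>
from functools import reduce
import operator


def fold_left(op, *args):
    """Apply a binary operator left to right: op(op(a, b), c), and so on."""
    return reduce(op, args)


print(fold_left(operator.sub, 10, 3, 2))             # (10 - 3) - 2
print(fold_left(operator.truediv, 100.0, 5.0, 2.0))  # (100 / 5) / 2
print(fold_left(operator.mod, 17, 5, 2))             # (17 % 5) % 2
</pre>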
<h2 id="examples">Examples</h2>
<h3 id="fullgrouping">TopN / Full corpus</h3>
<p>Simple grouping where you count the number of documents in each group:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a) each(output(count())));</pre>
<p>Two parallel groupings:</p>
<pre>/search/?yql=select (&hellip;) | all(all(group(a) each(output(count())))
all(group(b) each(output(count()))));</pre>
<p>Only the 1000 best hits will be grouped at each backend node. Lower accuracy, but higher speed:</p>
<pre>/search/?yql=select (&hellip;) | all(max(1000) all(group(a) each(output(count()))));</pre>
<p>In streaming search you may also group all searched documents by adding a <code>where(true)</code> clause:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a) each(output(count()))) where(true);</pre>
<h3 id="selection">Selecting groups</h3>
<p>Perform a modulo 5 operation before selecting the group you want:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a % 5) each(output(count())));</pre>
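<p>
As a rough mental model, <code>group(a % 5) each(output(count()))</code> evaluates the expression for every
hit, uses the result as the group id and counts the hits per id. The sketch below shows this on made-up
attribute values; it is an illustration only and ignores ordering, limits and distribution across nodes.
</p>
<pre>
from collections import Counter

# Hypothetical values of attribute "a", one per matching document.
a_values = [3, 8, 10, 12, 15, 21]

# group(a % 5): the group id is the value of the expression.
# each(output(count())): count the documents in each group.
counts = Counter(a % 5 for a in a_values)
print(dict(counts))
</pre>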
<p>Perform <code>a + b * c</code> before selecting the group you want:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a + b * c) each(output(count())));</pre>
<h3 id="map">Grouping on maps</h3>
<p>The following syntax can be used when grouping on <a href="search-definitions-reference.html#type:map">map</a> attribute fields.
It creates a group for values whose keys match the specified key.
The key can either be specified directly or indirectly via a key source attribute.</p>
<p>Direct key on a primitive map:</p>
<pre>/search/?yql=select (&hellip;) | all(group(my_map{"my_key"}) each(output(count())));</pre>
<p>Direct key on a map of struct:</p>
<pre>/search/?yql=select (&hellip;) | all(group(my_map{"my_key"}.my_field) each(output(count())));</pre>
<p>Indirect key via a key source attribute:</p>
<pre>/search/?yql=select (&hellip;) | all(group(my_map{attribute(my_key_source)}) each(output(count())));</pre>
<p>The key is retrieved from the key source attribute for each document.
Note that the key source attribute must be single value and have the same data type as the key type of the map.
Using a key source attribute is not supported for streaming search.</p>
<h3 id="uca">Locale aware sorting</h3>
<p>Groups are sorted using locale aware sorting, with the default and primary strength values, respectively:</p>
<pre>/search/?yql=select (&hellip;) | all(group(s) order(max(uca(s, "sv")))
each(output(count())));</pre>
<pre>/search/?yql=select (&hellip;) | all(group(s) order(max(uca(s, "sv", "PRIMARY")))
each(output(count())));</pre>
<h3 id="ordering">Ordering groups</h3>
<p>Perform a modulo 5 operation before selecting the group you want. The groups are then ordered by
their aggregated sum of attribute "b":</p>
<pre>/search/?yql=select (&hellip;) | all(group(a % 5) order(sum(b))
each(output(count())));</pre>
<p>Perform <code>a + b * c</code> before selecting the group you want. Ordering is given by the
maximum value of attribute "d" in each group:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a + b * c) order(max(d))
each(output(count())));</pre>
<p>Take the average relevance of the hits in each group and multiply it by
the number of hits in the group to get the cumulative relevance of the group:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a) order(avg(relevance()) * count())
each(output(count())));</pre>
<p>You cannot, however, directly reference an attribute in the order
clause, like this:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a) order(attr * count())
each(output(count())));</pre>
<p>But, you can do this:</p>
<pre>/search/?yql=select (&hellip;) | all(group(a) order(max(attr) * count())
each(output(count())));</pre>
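<p>
One way to see why the attribute must be wrapped in an aggregator: a group usually holds several documents,
so a bare attribute has no single value per group, while <code>max(attr)</code> and <code>count()</code>
each yield exactly one number. The sketch below computes such an ordering value per group on made-up data.
The names are hypothetical, and the sort direction is deliberately left out.
</p>
<pre>
from collections import defaultdict

# Hypothetical hits: (a, attr) pairs for the matching documents.
hits = [("x", 4), ("x", 1), ("y", 7), ("z", 2), ("z", 3), ("z", 1)]

groups = defaultdict(list)
for a, attr in hits:
    groups[a].append(attr)


def order_key(attr_values):
    # max(attr) * count(): both operands are aggregates, so each group
    # yields exactly one number to order on.
    return max(attr_values) * len(attr_values)


keys = {group_id: order_key(values) for group_id, values in groups.items()}
print(keys)
</pre>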
<h3 id="Collecting">Collecting aggregates</h3>
<p>Simple grouping where you count the number of documents in each group and return the best hit in each
group:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) each(max(1) output(count()) each(output(summary()))));</pre>
<p>Also return the sum of attribute "b":</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) each(max(1) output(count(), sum(b))
each(output(summary()))));</pre>
<p>Also return an xor of the 64 most significant bits of an md5 over the concatenation of
attributes "a", "b" and "c":</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) each(max(1) output(count(), sum(b), xor(md5(cat(a, b, c), 64)))
each(output(summary()))));</pre>
<h3 id="predefined">Predefined buckets</h3>
<p>Group on predefined buckets for a raw attribute, and use infinity to make
sure the buckets cover the whole possible range:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(predefined(r, bucket(-inf, {0, 'a', 3}), bucket({1, 'u', 4}, inf)))
each(output(count())));</pre>
<p>Standard mathematical start and end specifiers may be used to define the
width of a bucket. The "(" and ")" evaluate to "[" and "&gt;" by default.
Here, create a bucket that matches exactly one value, and use different
width specifiers for the other buckets:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(predefined(r, bucket[-inf, "bar"&gt;, bucket["bar"], bucket&lt;"bar", inf]))
each(output(count())));</pre>
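<p>
The effect of the start and end specifiers can be sketched as a membership test. This is an illustration
under stated assumptions, not Vespa's implementation: "[" and "]" are taken to include the limit,
"&lt;" and "&gt;" to exclude it, a single-value bucket is modelled as equal limits that are both included,
and numbers are used instead of strings to keep the example self-contained. The function
<code>in_bucket</code> is hypothetical.
</p>
<pre>
import math
import operator


def in_bucket(value, low, high, low_inclusive=True, high_inclusive=False):
    """Return True if value falls inside the bucket.

    The defaults model bucket(low, high): lower limit included, upper
    limit excluded.
    """
    above_low = operator.ge if low_inclusive else operator.gt
    below_high = operator.le if high_inclusive else operator.lt
    return above_low(value, low) and below_high(value, high)


# Three buckets around the value 10, mirroring the example above.
below = (-math.inf, 10, True, False)  # lower included, upper excluded
exact = (10, 10, True, True)          # exactly one value
above = (10, math.inf, False, True)   # lower excluded, upper included

for v in (9, 10, 11):
    print(v, [in_bucket(v, *b) for b in (below, exact, above)])
</pre>
<p>
With these specifiers every value falls into exactly one of the three buckets.
</p>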
<h3 id="grouping">Grouping</h3>
<p>Single level grouping on "a" attribute, returning at most 5 groups with full hit count as well as the
69 best hits.</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(max(69) output(count())
each(output(summary()))));</pre>
<p>Two level grouping on "a" and "b" attribute:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(group(b) max(5) each(max(69) output(count())
each(output(summary()))))));</pre>
<p>Three level grouping on "a", "b" and "c" attribute:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(group(c) max(5) each(max(69) output(count())
each(output(summary())))))))));</pre>
<p>As the example above, but also collect the best hit in level 2:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary())))))))));</pre>
<p>As the example above, but also collect the best hit in level 1:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary())))))))));</pre>
<p>As the example above, but using different document summaries on each level:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(max(1) each(output(summary(complexsummary))))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary(simplesummary))))
all(group(c) max(5) each(max(69) output(count())
each(output(summary(fastsummary)))))))));</pre>
<p>Group on fixed width buckets for numeric attribute, then on "a" attribute, count hits in leaf
nodes:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(fixedwidth(n, 3)) each(group(a) max(2) each(output(count()))));</pre>
<p>As the example above, but limiting groups in level 1, and returning hits from level 2:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(fixedwidth(n, 3)) max(5) each(group(a) max(2) each(output(count())
each(output(summary())))));</pre>
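<p>
The mapping performed by <code>fixedwidth(n, 3)</code> in the two examples above can be sketched as follows.
The sketch assumes that buckets start at multiples of the width and include their lower limit but not their
upper limit; both points are assumptions of this illustration, and the function name is hypothetical.
</p>
<pre>
def fixedwidth_bucket(value, width):
    """Return (start, end) of the width-sized bucket containing value.

    Assumes buckets start at multiples of width, with the lower limit
    included and the upper limit excluded.
    """
    start = (value // width) * width
    return (start, start + width)


for n in [0, 2, 3, 7, 8]:
    print(n, fixedwidth_bucket(n, 3))
</pre>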
<p>Deep grouping with counting and hit collection on all levels:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) each(output(count())
all(max(1) each(output(summary())))
all(group(c) each(output(count())
all(max(1) each(output(summary())))))))));</pre>
<h3 id="time">Time aware grouping</h3>
<p>Group by year:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(time.year(a)) each(output(count())));</pre>
<p>Group by year, then by month:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(time.year(a)) each(output(count())
all(group(time.monthofyear(a)) each(output(count())))));</pre>
<p>Group by year, then by month, then day, then by hour:</p>
<pre>/search/?yql=select (&hellip;) |
all(group(time.year(a)) each(output(count())
all(group(time.monthofyear(a)) each(output(count())
all(group(time.dayofmonth(a)) each(output(count())
all(group(time.hourofday(a)) each(output(count())))))))));</pre>
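<p>
The value ranges of the <code>time.*</code> expressions listed in the expression table can be reproduced
with a short sketch. It assumes that the timestamp is given in seconds since the Unix epoch and models the
<code>timezone</code> query parameter as a fixed UTC offset; the function <code>time_parts</code> and its
parameters are hypothetical.
</p>
<pre>
from datetime import datetime, timedelta, timezone


def time_parts(epoch_seconds, utc_offset_hours=0):
    """Split a timestamp into the fields the time.* expressions return.

    Assumes seconds since the Unix epoch and a fixed UTC offset.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    t = datetime.fromtimestamp(epoch_seconds, tz)
    return {
        "year": t.year,
        "monthofyear": t.month,                  # 1-12
        "dayofmonth": t.day,                     # 1-31
        "dayofyear": t.timetuple().tm_yday - 1,  # 0-365
        "dayofweek": t.weekday(),                # 0-6, Monday is 0
        "hourofday": t.hour,                     # 0-23
        "minuteofhour": t.minute,                # 0-59
        "secondofminute": t.second,              # 0-59
    }


print(time_parts(0))
print(time_parts(0, utc_offset_hours=-1))
</pre>
<p>
The second call shows how a timezone offset can move a timestamp into a different hour, day and year,
and therefore into a different group.
</p>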
<p>
Group <em>today</em>, <em>yesterday</em>, <em>lastweek</em>, and <em>lastmonth</em>
using the <code>predefined</code> expression, and group each day within each of these separately:
</p>
<pre>/search/?yql=select (&hellip;) |
all(group(predefined((now() - a) / (60 * 60 * 24),
bucket(0,1), bucket(1,2),
bucket(3,7), bucket(8,31))) each(output(count())
all(max(2) each(output(summary())))
all(group((now() - a) / (60 * 60 * 24)) each(output(count())
all(max(2) each(output(summary())))))));</pre>
<h3 id="counting-unique-groups">Counting unique groups</h3>
<p>
The <code>count</code> aggregator can be applied to a list of groups to determine the number of
unique groups without having to explicitly retrieve all groups. Another use case for this aggregator is
counting the number of unique instances matching a given expression.
</p>
<p>The following query outputs the number of groups, which is equivalent to the number of unique values for attribute "a".</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) output(count()))</pre>
<p>The following query outputs the number of unique string lengths for the attribute "name".</p>
<pre>/search/?yql=select (&hellip;) |
all(group(strlen(name)) output(count()))</pre>
<p>The following query outputs the sum of the "b" attribute for each group in addition to the overall group count.</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) output(count()) each(output(sum(b))))</pre>
<p>The <code>max</code> clause is used to restrict the number of groups returned.
The query outputs the sum for the 3 best groups. The <code>count</code> aggregator
outputs the actual number of groups (potentially more than 3).</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(3) output(count()) each(output(sum(b))))</pre>
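<p>
The independence of <code>count</code> and <code>max</code> can be sketched on made-up data: the count
covers every group, while only a limited number of groups is returned. Which groups count as "best"
depends on the ordering, which this illustration leaves out by simply taking the first three.
</p>
<pre>
from collections import defaultdict

# Hypothetical hits: (a, b) pairs for the matching documents.
hits = [("p", 1), ("q", 2), ("p", 3), ("r", 4), ("s", 5), ("t", 6)]

sums = defaultdict(int)
for a, b in hits:
    sums[a] += b  # each(output(sum(b)))

group_count = len(sums)            # output(count()) on the group list
returned = list(sums.items())[:3]  # max(3): only three groups come back

print(group_count)
print(returned)
</pre>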
<p>The following query outputs the number of top level groups, and for the 10 best groups,
outputs the number of unique values for attribute "b".</p>
<pre>/search/?yql=select (&hellip;) |
all(group(a) max(10) output(count()) each(group(b) output(count())))</pre>
<h2 id="sessionCache">Using the grouping session cache</h2>
<p>
With multi-level grouping expressions, the search query is normally
re-run for each level. The drawback of this is that if you have an expensive
ranking function, the query will take more time than strictly necessary.
</p><p>
To avoid this, you can set the <a
href="search-api-reference.html#groupingSessionCache"><code>groupingSessionCache</code></a>
query flag. This causes the query and grouping expression to be performed
only once.
</p><p>
However, the flag is <strong>only useful if</strong> the grouping expression does
not have an <code>order()</code> clause.
The <strong>drawback</strong> of using this flag is that when <code>max()</code> is
specified in the grouping expression, it might cause inaccuracies in
aggregated values such as <code>count()</code>. We therefore recommend that
you test whether or not this is an issue for your queries, and (if it is an
issue) adjust the <code>precision</code> parameter to still get correct
counts.
</p>
<h2 id="multivalue-grouping">Grouping of multivalue attributes</h2>
<p>
Some grouping operators may be used with multivalue attributes.
Note that using a multivalued attribute (such as an array of doubles)
in a grouping expression is likely to have a large, adverse impact
on performance, particularly if the set of hits to be processed
is large, since it means a large amount of data is streamed through the CPU.
Such operations are therefore likely to hit a bottleneck on memory bandwidth.
</p>
<h3 id="multivalue-caveats">Caveats</h3>
<p>For streaming search, multi-value fields such as maps, arrays etc. can be
used for grouping. However, using aggregation functions such as sum() on
such fields can give misleading results. Assume a map from strings to
integers, where the strings are some sort of key you wish to use for
grouping. The following expression will provide the sum of the values for
all keys:</p>
<pre>/search/?yql=select (&hellip;) | all(group(mymap.key) each(output(sum(mymap.value))));</pre>
<p>and not the sum of the values within each key, as one would expect.
It is still, however, possible
to run the following expression to get the sum of values within a specific
key:</p>
<pre>/search/?yql=select (&hellip;) | all(group(mymap{"foo"}) each(output(sum(mymap.value))));</pre>
<p class="alert alert-success">This syntax is map-specific; it does NOT apply to weighted sets.</p>
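<p>
The caveat above can be illustrated with a small Python model. This is only a
sketch of the behavior as described in this section, not of how Vespa evaluates
the expression: it assumes that a document is placed in one group per map key
and that <code>sum(mymap.value)</code> then adds up every value in that document.
The documents and both function names are hypothetical.
</p>

```python
from collections import defaultdict

# Hypothetical documents, each with a map<string, int> field named "mymap".
docs = [
    {"mymap": {"foo": 1, "bar": 10}},
    {"mymap": {"foo": 2, "baz": 100}},
]


def sum_values_grouped_by_key(documents):
    """Model of group(mymap.key) each(output(sum(mymap.value)))."""
    totals = defaultdict(int)
    for doc in documents:
        # Assumption: the aggregator sees ALL values of the document ...
        doc_total = sum(doc["mymap"].values())
        # ... and the document is placed in one group per key it contains.
        for key in doc["mymap"]:
            totals[key] += doc_total
    return dict(totals)


def sum_values_per_key(documents):
    """What one might expect instead: each key sums only its own values."""
    totals = defaultdict(int)
    for doc in documents:
        for key, value in doc["mymap"].items():
            totals[key] += value
    return dict(totals)


observed = sum_values_grouped_by_key(docs)
expected = sum_values_per_key(docs)
print("modelled grouping result:", observed)
print("per-key sums:", expected)
```

<p>
In this model the group for <code>foo</code> also accumulates the values stored
under <code>bar</code> and <code>baz</code>, which is why the per-group sums
differ from the per-key sums.
</p>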
<h3 id="aggregators-multivalue">Using sum, max, etc on a multivalued attribute</h3>
<p>
Doing an operation such as <code>output(sum(myarray))</code> will
run the sum over each element value in each document. The result
is the sum of sums of values. Similarly <code>max(myarray)</code>
will yield the maximal element over all elements in all documents,
and so on.
</p>
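<p>
As a rough illustration, the aggregation described above can be modelled in
Python as follows. The documents and the helper names <code>grouped_sum</code>
and <code>grouped_max</code> are hypothetical; the sketch only mirrors the
"sum of sums" and "max over all elements" semantics stated in the text.
</p>

```python
# Hypothetical documents in a single group, each with an array attribute.
docs = [
    {"myarray": [1.0, 2.0, 3.0]},
    {"myarray": [10.0]},
    {"myarray": []},
]


def grouped_sum(documents, field):
    """Sum each document's elements, then sum those per-document sums."""
    return sum(sum(doc[field]) for doc in documents)


def grouped_max(documents, field):
    """Largest element over all elements in all documents (None if no elements)."""
    values = [value for doc in documents for value in doc[field]]
    return max(values) if values else None


total = grouped_sum(docs, "myarray")
largest = grouped_max(docs, "myarray")
print(total, largest)
```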
<h3 id="arrayat">Array at: element access</h3>
<p>
The expression <code>array.at(myarray, idx)</code> will yield one
value per document by evaluating the <code>idx</code> expression
and using it as an index into the given array. The index is
capped to the range <code>[0, size(myarray)-1]</code>:
if it is greater than or equal to the array size you always get
the last element, while if it is less than zero you always get
the first element. This expression can then be used to build
bigger expressions such as <code>output(sum(array.at(myarray, 0)))</code>
which will sum the first element in the array of each document.
</p>
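<p>
The index capping can be sketched in Python as shown below. The function name
<code>array_at</code> and the sample documents are hypothetical, and the sketch
assumes an integer index and a non-empty array, since the behavior for empty
arrays is not described here.
</p>

```python
def array_at(values, idx):
    """Return values[idx] with idx capped to the range [0, len(values) - 1]."""
    if not values:
        # Assumption: empty arrays are out of scope for this sketch.
        raise ValueError("array_at sketch requires a non-empty array")
    capped = min(max(idx, 0), len(values) - 1)
    return values[capped]


# Hypothetical documents; model of output(sum(array.at(myarray, 0))).
docs = [{"myarray": [3, 4, 5]}, {"myarray": [7]}]
first_element_sum = sum(array_at(doc["myarray"], 0) for doc in docs)
print(first_element_sum)
```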
<h3 id="interpolatedlookup">Interpolated lookup (BETA)</h3>
<p>
The operation <code>interpolatedlookup(myarray, expr)</code> is
intended for generic graph/function lookup. The data
in <code>myarray</code> should be numerical values sorted in
ascending order. The operation will then scan from the start
of the array to find the position where the element values
become equal to (or greater than) the value of
the <code>expr</code> lookup argument, and return the index
of that position.
When the lookup argument's value is between two consecutive array
element values, the returned position will be a linear
interpolation between their respective indexes. The return
value is always in the range <code>[0, size(myarray)-1]</code>
of the legal index values for an array.
</p>
<p>
Consider an example where <code>myarray</code> is a sorted array of
type <code>array&lt;double&gt;</code> in each document.
The expression <code>interpolatedlookup(myarray, 4.2)</code>
is then a per-document expression that first evaluates the lookup
argument, here the constant expression 4.2, and then looks at
the contents of <code>myarray</code> in the document.
The scan starts at the first element and proceeds until it hits
an element value greater than or equal to 4.2 in the array. This means that:
<ul>
<li>
If the first element in the array is greater than 4.2, the
expression returns 0.
</li>
<li>
If the first element in the array is exactly 4.2, the
expression still returns 0.
</li>
<li>
If the first element in the array is 1.7 while the <strong>second</strong>
element value is exactly 4.2, the expression returns 1.0
&ndash; the index of the second element.
</li>
<li>
If <strong>all</strong> the elements in the array are less than 4.2,
the last valid array index <code>size(myarray)-1</code> is returned.
</li>
<li>
If the first five elements in the array have values smaller than
the lookup argument, and the lookup argument is halfway between the fifth and sixth
element, a value of 4.5 is returned &ndash; halfway between the
array indexes of the fifth and sixth elements.
</li>
<li>
Similarly, if the elements in the array are <code>{0, 1, 2, 4, 8}</code>
then passing a lookup argument of "5" would return 3.25 (linear interpolation
between <code>indexOf(4)==3</code> and <code>indexOf(8)==4</code>).
</li>
</ul>
</p>
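<p>
The cases listed above can be reproduced with a minimal Python sketch of the
described scan. This is not Vespa's implementation; the function name
<code>interpolated_lookup</code> is hypothetical, and the sketch assumes a
non-empty array sorted in ascending order.
</p>

```python
def interpolated_lookup(values, lookup):
    """Return the (possibly fractional) index where `lookup` falls in `values`."""
    if not values:
        # Assumption: empty arrays are out of scope for this sketch.
        raise ValueError("interpolated_lookup sketch requires a non-empty array")
    for i, value in enumerate(values):
        if value >= lookup:
            if i == 0:
                return 0.0  # lookup is at or before the first element
            prev = values[i - 1]  # prev < lookup <= value, so value - prev > 0
            # Linear interpolation between index i - 1 and index i.
            return (i - 1) + (lookup - prev) / (value - prev)
    # Every element is smaller than the lookup value: return the last index.
    return float(len(values) - 1)


print(interpolated_lookup([0, 1, 2, 4, 8], 5))
```

<p>
The final call uses the array and lookup argument from the last list item above.
</p>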
<h4 id="usecase-interpolatedlookup">Use case: Impression counting</h4>
<p>
If you have the impression logs for a specific user, you can
make a function that maps from rankscore to the number of impressions
an advertisement would get. So you would have a table like this:
<pre>
Score Integer (# impressions for this user)
0.200 0
0.210 1
0.220 2
0.240 3
0.320 4
0.420 5
0.560 6
0.700 7
0.800 8
0.880 9
0.920 10
0.940 11
0.950 12
</pre>
Storing just the first column (the rank scores, including a
rank score for 0 impressions) in an array attribute named
"impressions", we could then use the grouping operation
<code>interpolatedlookup(impressions, relevance())</code>
to figure out how many times a given advertisement would have
been shown to this particular user.
So if the rank score is 0.420 for a specific user/ad/bid
combination, then <code>interpolatedlookup(impressions,relevance())</code>
would return 5.0. If the bid is
increased so that the rank score reaches 0.490, the return value
would be 5.5 instead.
A count of 5.5 is not meaningful as a historical figure for
a single user, but it provides finer-grained information that can be used as
a forecast. Summing this across many different users may then
be used to forecast the total of future impressions for the
advertisement.
</p>
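<p>
The numbers in this use case can be checked with the following Python sketch,
which uses a binary search in place of the linear scan described earlier.
The function name <code>interpolated_lookup</code>, the second user's array and
the rank scores in <code>users</code> are hypothetical illustration data; only
the first array comes from the table above.
</p>

```python
from bisect import bisect_left

# Rank scores from the table above; the array index is the impression count.
impressions = [0.200, 0.210, 0.220, 0.240, 0.320, 0.420, 0.560,
               0.700, 0.800, 0.880, 0.920, 0.940, 0.950]


def interpolated_lookup(values, lookup):
    """Fractional index of `lookup` in the ascending list `values`."""
    i = bisect_left(values, lookup)  # first index with values[i] >= lookup
    if i == 0:
        return 0.0
    if i == len(values):
        return float(len(values) - 1)
    low, high = values[i - 1], values[i]
    return (i - 1) + (lookup - low) / (high - low)


at_current_bid = interpolated_lookup(impressions, 0.420)
at_higher_bid = interpolated_lookup(impressions, 0.490)

# Hypothetical (array, rank score) pairs, one per user, for a single ad.
users = [
    (impressions, 0.490),
    ([0.100, 0.300, 0.500], 0.400),
]
forecast = sum(interpolated_lookup(array, score) for array, score in users)

print(round(at_current_bid, 2), round(at_higher_bid, 2), round(forecast, 2))
```

<p>
Following the pattern of <code>output(sum(array.at(myarray, 0)))</code> shown
earlier, the summation over users would presumably be expressed as
<code>output(sum(interpolatedlookup(impressions, relevance())))</code>;
treat that expression as an assumption to verify rather than a tested query.
</p>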