Skip to content
Permalink
master
Go to file
 
 
Cannot retrieve contributors at this time
1598 lines (1549 sloc) 36.8 KB
---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Grouping Information in Results"
---
<p>
This document describes what grouping is and gives a few examples on how to use it.
Refer to the <a href="reference/query-api-reference.html#select">Query API reference</a>
for how to set the <em>select</em> parameter,
and the <a href="reference/grouping-syntax.html">Grouping reference</a> for details.
</p><p>
Vespa grouping has a flexible language in
which to describe how the query hits should be grouped, aggregated and presented in results.
The grouping parameter can be thought of as a high-level program.
The core operations are <code>all</code> (process a list of elements as a whole),
<code>each</code> (process each element in a list separately),
<code>group</code> (group the elements of a list into sub-lists) and
<code>output</code> (output result).
</p><p>
The operations can be nested.
</p><p>
Vespa distributes and executes the grouping program on content nodes, and merges results on container nodes -
in multiple phases, as needed.
As realizing such programs over a distributed data set requires more network roundtrips than a regular search query,
these queries are more expensive than regular queries.
However, executing such programs are more efficient
than achieving the same end result by issuing multiple independent queries to Vespa.
</p><p>
Grouping supports continuation objects that is used for <a href="#pagination">pagination</a>.
</p><p>
For the entirety of this document, assume an index of engine part purchases:
</p>
<table class="table">
<thead>
<tr>
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
</thead>
<tbody>
<tr>
<td>2006-09-06 09:00:00</td>
<td>$1 000</td>
<td>0.24</td>
<td>Intake valve</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-07 10:00:00</td>
<td>$1 000</td>
<td>0.12</td>
<td>Rocker arm</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-07 11:00:00</td>
<td>$2 000</td>
<td>0.24</td>
<td>Spring</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-08 12:00:00</td>
<td>$3 000</td>
<td>0.12</td>
<td>Valve cover</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-08 10:00:00</td>
<td>$5 000</td>
<td>0.24</td>
<td>Intake port</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-08 11:00:00</td>
<td>$8 000</td>
<td>0.12</td>
<td>Head</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-09 12:00:00</td>
<td>$1 300</td>
<td>0.24</td>
<td>Coolant</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-09 10:00:00</td>
<td>$2 100</td>
<td>0.12</td>
<td>Engine block</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-09 11:00:00</td>
<td>$3 400</td>
<td>0.24</td>
<td>Oil pan</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-09 12:00:00</td>
<td>$5 500</td>
<td>0.12</td>
<td>Oil sump</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-10 10:00:00</td>
<td>$8 900</td>
<td>0.24</td>
<td>Camshaft</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-10 11:00:00</td>
<td>$1 440</td>
<td>0.12</td>
<td>Exhaust valve</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-10 12:00:00</td>
<td>$2 330</td>
<td>0.24</td>
<td>Rocker arm</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-10 10:00:00</td>
<td>$3 770</td>
<td>0.12</td>
<td>Spring</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-10 11:00:00</td>
<td>$6 100</td>
<td>0.24</td>
<td>Spark plug</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-11 12:00:00</td>
<td>$9 870</td>
<td>0.12</td>
<td>Exhaust port</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-11 10:00:00</td>
<td>$1 597</td>
<td>0.24</td>
<td>Piston</td>
<td>Brown</td>
</tr>
<tr>
<td>2006-09-11 11:00:00</td>
<td>$2 584</td>
<td>0.12</td>
<td>Connection rod</td>
<td>Smith</td>
</tr>
<tr>
<td>2006-09-11 12:00:00</td>
<td>$4 181</td>
<td>0.24</td>
<td>Rod bearing</td>
<td>Jones</td>
</tr>
<tr>
<td>2006-09-11 13:00:00</td>
<td>$6 765</td>
<td>0.12</td>
<td>Crankshaft</td>
<td>Jones</td>
</tr>
</tbody>
</table>
<h2 id="basic-grouping">Basic Grouping</h2>
<p>
Example: <em>Return the total sum of purchases per customer</em> - steps:
<ol>
<li>
Select all documents, query by document type <em>purchase</em>:
<pre>/search/?yql=select * from sources * where sddocname contains "purchase"</pre>
</li><li>
Specify the grouping expression:
<pre>group(customer)</pre>
</li><li>
Specify output - return the sum of the purchase price for each customer.
<code>each</code> refers to each of the groups produced by <code>group(customer)</code>:
<pre>each(output(sum(price)))</pre>
</li><li>
Set <code>hits=0</code> to reduce latency and data in result set - the documents need not be returned - final query:
<pre>
/search/?hits=0&amp;yql=select * from sources * where sddocname contains "purchase" |
all( group(customer) each(output(sum(price))) );
</pre>
</li>
</ol>
</p><p>
URL encoded equivalent:<br/>
<pre>
/search/?hits=0&amp; yql=select%20%2A%20from%20sources%20%2A%20where%20sddocname%20contains%20%27purchase%27%20%7C%20
all(group(customer)%20each(output(sum(price))))%3B
</pre>
Result:
</p>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>Sum(price)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Brown</td>
<td>$20 537</td>
</tr>
<tr>
<td>Jones</td>
<td>$39 816</td>
</tr>
<tr>
<td>Smith</td>
<td>$19 484</td>
</tr>
</tbody>
</table>
<p>
Example: <em>Sum price of purchases <a href="#time-and-date">per date</a>:</em>
</p>
<pre>
select (&hellip;) | all(group(time.date(date)) each(output(sum(price))));
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>Sum(price)</th>
</tr>
</thead>
<tbody>
<tr>
<td>2006-09-06</td>
<td>$1 000</td>
</tr>
<tr>
<td>2006-09-07</td>
<td>$3 000</td>
</tr>
<tr>
<td>2006-09-08</td>
<td>$16 000</td>
</tr>
<tr>
<td>2006-09-09</td>
<td>$12 300</td>
</tr>
<tr>
<td>2006-09-10</td>
<td>$22 540</td>
</tr>
<tr>
<td>2006-09-11</td>
<td>$24 997</td>
</tr>
</tbody>
</table>
<p>
Note: in examples above, <em>all</em> documents are evaluated.
Modify the query to add filters (and hence cut query latency), like (remember to URL encode):
<pre>/search/?yql=select * from sources * where customer contains "smith";</pre>
</p>
<h2 id="ordering-and-limiting-groups">Ordering and Limiting Groups</h2>
<p>
In many scenarios, a large collection of groups is produced, possibly too large to display or process.
This is handled by ordering groups, then limiting, the number of groups to return.
</p><p>
The <code>order</code> clause accepts a list of one or more expressions.
Each of the arguments to <code>order</code> is prefixed by either a plus/minus for ascending/descending order.
</p><p>
Limit the number of groups using <code>max</code> and <code>precision</code> -
the latter is the number of groups returned per content node to be merged to the global result.
Larger document distribution skews hence require a higher <code>precision</code> for accurate results.
</p><p>
Example: To find the 2 globally best groups, make an educated guess on how
many samples are needed to fetch from each node in order to get the right groups.
This is the <code>precision</code>.
An initial factor of 3 has proven to be quite good in most usecases.
If however the data for customer 'Jones' was spread on 3 different content nodes.
'Jones' might be among the 2 best on only one node. But based on the distribution
of the data we have concluded by earlier tests that if we fetch 5.67 as many groups
as we need we will have a correct answer with at least 99.999% confidence. So then
we just use 6 times as much groups when doing the merge.
</p><p>
However there is one exception.
Without an <code>order</code> constraint, <code>precision</code> is not required.
Then local ordering will be the same as global ordering.
Ordering will not change after a merge operation.
</p><p>
Example: <em>The two customers with most purchases, returning the sum for each:</em>
</p>
<pre>
select (&hellip;) | all(group(customer) max(2) precision(12) order(-count())
each(output(sum(price))));
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>sum(price)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jones</td>
<td>$39 816</td>
</tr>
<tr>
<td>Smith</td>
<td>$19 484</td>
</tr>
</tbody>
</table>
<h2 id="hits-per-group">Hits per Group</h2>
<p>
Use <code>summary</code> to print the fields for a hit,
and <code>max</code> to limit the number of hits per group.
Example: Return the three most expensive parts per customer:
<pre>
/search/?yql=select * from sources * where sddocname contains "purchase" |
all(group(customer) each(max(3) each(output(summary()))));
&amp;ranking=pricerank
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th />
<th />
<th />
<th />
<th />
</tr>
</thead>
<tbody>
<tr>
<td>Brown</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td>2006-09-08 11:00</td>
<td>$8 000</td>
<td>0.12</td>
<td>Head</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td>2006-09-10 10:00</td>
<td>$3 770</td>
<td>0.12</td>
<td>Spring</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td>2006-09-09 11:00</td>
<td>$3 400</td>
<td>0.24</td>
<td>Oil pan</td>
<td>Brown</td>
</tr>
<tr>
<td>Jones</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td>2006-09-11 12:00</td>
<td>$9 870</td>
<td>0.12</td>
<td>Exhaust port</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td>2006-09-10 10:00</td>
<td>$8 900</td>
<td>0.24</td>
<td>Camshaft</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td>2006-09-11 13:00</td>
<td>$6 765</td>
<td>0.12</td>
<td>Crankshaft</td>
<td>Jones</td>
</tr>
<tr>
<td>Smith</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td>2006-09-10 11:00</td>
<td>$6 100</td>
<td>0.24</td>
<td>Spark plug</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-09 12:00</td>
<td>$5 500</td>
<td>0.12</td>
<td>Oil sump</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-11 11:00</td>
<td>$2 584</td>
<td>0.12</td>
<td>Connection rod</td>
<td>Smith</td>
</tr>
</tbody>
</table>
Notes on ordering in the example above:
<ul>
<li>
The <code>order</code> clause is a directive for <em>group</em> ordering, not <em>hit</em> ordering.
Here, there is no order clause on the groups, hence default ordering <code>max(relevance())</code> is used.
In this case, the query is "all documents", hence all groups are equally relevant and the group order is random.
</li><li>
To order hits inside groups, use ranking. Add <code>ranking=pricerank</code> to the query
to use the pricerank <a href="ranking.html">rank profile</a> to rank by price:
</p>
<pre>
rank-profile pricerank inherits default {
first-phase {
expression: attribute(price)
}
}</pre>
</li>
</ul>
</p>
<h2 id="nested-groups">Nested Groups</h2>
<p>
Groups can be nested. This offers great drilling capabilities,
as there are no limits to nesting depth or presented information on any level.
Example: How much each customer has spent per day by grouping on customer, then date:
</p>
<pre>
select (&hellip;) | all(group(customer) each(group(time.date(date)) each(output(sum(price)))));
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th />
<th />
</tr>
</thead>
<tbody>
<tr>
<td>Brown</td>
<td />
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
</tr>
<tr>
<td />
<td>2006-09-08</td>
<td>$8 000</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$3 400</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$7 540</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$1 597</td>
</tr>
<tr>
<td>Jones</td>
<td />
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
</tr>
<tr>
<td />
<td>2006-09-08</td>
<td>$8 000</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$2 100</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$8 900</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$20 816</td>
</tr>
<tr>
<td>Smith</td>
<td />
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
</tr>
<tr>
<td />
<td>2006-09-06</td>
<td>$1 000</td>
</tr>
<tr>
<td />
<td>2006-09-07</td>
<td>$3 000</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$6 800</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$6 100</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$2 584</td>
</tr>
</tbody>
</table>
<p>
Use this to query for all items on a per-customer basis, displaying the most expensive hit for each customer,
with subgroups of purchases on a per-date basis.
Use the <code><a href="#hits-per-group">summary</a></code> clause
to show hits inside any group at any nesting level.
Include the sum price for each customer, both as a grand total and broken down on a per-day basis:
<pre>
/search/?yql=select * from sources * where sddocname contains "purchase" |
all(group(customer)
each(max(1) output(sum(price)) each(output(summary())))
each(group(time.date(date))
each(max(10) output(sum(price)) each(output(summary())))));
&amp;ranking=pricerank
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>sum(price)</th>
<th />
<th />
<th />
<th />
<th />
</tr>
</thead>
<tbody>
<tr>
<td>Brown</td>
<td>$20 537</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
<th />
</tr>
<tr>
<td />
<td>2006-09-08 11:00</td>
<td>$8 000</td>
<td>0.12</td>
<td>Head</td>
<td>Brown</td>
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
<th />
<th />
<th />
<th />
</tr>
<tr>
<td />
<td>2006-09-08</td>
<td>$8 000</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-08 11:00</td>
<td>$8 000</td>
<td>0.12</td>
<td>Head</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$3 400</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-09 11:00</td>
<td>$3 400</td>
<td>0.12</td>
<td>Oil pan</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$7 540</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-10 10:00</td>
<td>$3 770</td>
<td>0.12</td>
<td>Spring</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-10 12:00</td>
<td>$2 330</td>
<td>0.24</td>
<td>Rocker arm</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-10 11:00</td>
<td>$1 440</td>
<td>0.12</td>
<td>Exhaust valve</td>
<td>Brown</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$1 597</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-11 10:00</td>
<td>$1 597</td>
<td>0.24</td>
<td>Piston</td>
<td>Brown</td>
</tr>
<tr>
<td>Jones</td>
<td>$39 816</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
<th />
</tr>
<tr>
<td />
<td>2006-09-11 12:00</td>
<td>$9 870</td>
<td>0.12</td>
<td>Exhaust port</td>
<td>Jones</td>
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
<th />
<th />
<th />
<th />
</tr>
<tr>
<td />
<td>2006-09-08</td>
<td>$8 000</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-08 10:00</td>
<td>$5 000</td>
<td>0.24</td>
<td>Intake port</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-08 12:00</td>
<td>$3 000</td>
<td>0.12</td>
<td>Valve cover</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$2 100</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-09 10:00</td>
<td>$2 100</td>
<td>0,12</td>
<td>Engine block</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$8 900</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-10 10:00</td>
<td>$8 900</td>
<td>0.24</td>
<td>Camshaft</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$20 816</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-11 12:00</td>
<td>$9 870</td>
<td>0.12</td>
<td>Exhaust port</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-11 13:00</td>
<td>$6 765</td>
<td>0.12</td>
<td>Crankshaft</td>
<td>Jones</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-11 12:00</td>
<td>$4 181</td>
<td>0.24</td>
<td>Rod bearing</td>
<td>Jones</td>
</tr>
<tr>
<td>Smith</td>
<td>$19 484</td>
<td />
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
<th />
</tr>
<tr>
<td />
<td>2006-09-10 11:00</td>
<td>$6 100</td>
<td>0.24</td>
<td>Spark plug</td>
<td>Smith</td>
<td />
</tr>
<tr>
<td />
<th>GroupId</th>
<th>Sum(price)</th>
<th />
<th />
<th />
<th />
</tr>
<tr>
<td />
<td>2006-09-06</td>
<td>$1 000</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-06 09:00</td>
<td>$1 000</td>
<td>0.24</td>
<td>Intake valve</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-07</td>
<td>$3 000</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-07 11:00</td>
<td>$2 000</td>
<td>0.24</td>
<td>Spring</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-07 10:00</td>
<td>$1 000</td>
<td>0.12</td>
<td>Rocker arm</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-09</td>
<td>$6 800</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-09 12:00</td>
<td>$5 500</td>
<td>0.12</td>
<td>Oil sump</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td />
<td>2006-09-09 12:00</td>
<td>$1 300</td>
<td>0.24</td>
<td>Coolant</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-10</td>
<td>$6 100</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-10 11:00</td>
<td>$6 100</td>
<td>0.24</td>
<td>Spark plug</td>
<td>Smith</td>
</tr>
<tr>
<td />
<td>2006-09-11</td>
<td>$2 584</td>
<td />
<td />
<td />
<td />
</tr>
<tr>
<td />
<td />
<th>Date</th>
<th>Price</th>
<th>Tax</th>
<th>Item</th>
<th>Customer</th>
</tr>
<tr>
<td />
<td />
<td>2006-09-11 11:00</td>
<td>$2 584</td>
<td>0.12</td>
<td>Connection rod</td>
<td>Smith</td>
</tr>
</tbody>
</table>
</p>
<h2 id="range-grouping">Range grouping</h2>
<p>
In examples above, results are grouped on distinct values, like customer or date.
To group on price:
<pre>
select (&hellip;) | all(group(price) each(each(output(summary()))));
</pre>
This gives one group per price. To group on price <em>ranges</em>, one could compress the price range.
This gives prices in $0 - $999 in bucket 0, $1 000 - $2 000 in bucket 1 and so on:
<pre>
select (&hellip;) | all(group(price/1000) each(each(output(summary()))));
</pre>
An alternative is using <a href="reference/grouping-syntax.html#bucket-expressions">bucket expressions</a> -
think of a bucket as the range per group.
Group on price, make groups have a width of 1000:
<pre>
select (&hellip;) | all(group(fixedwidth(price,1000)) each(each(output(summary()))));
</pre>
Use <code>predefined</code> to configure group sizes individually (the two below are equivalent):
<pre>
select (&hellip;) | all(group(predefined(price, bucket(0,1000), bucket(1000,2000), bucket(2000,5000), bucket(5000,inf))) each(each(output(summary()))));
select (&hellip;) | all(group(predefined(price, bucket[0,1000&gt;, bucket[1000,2000&gt;, bucket[2000,5000&gt;, bucket[5000,inf&gt;)) each(each(output(summary()))));
</pre>
This works with strings as well - put Jones and Smith in the second group:
<pre>
select (&hellip;) | all(group(predefined(customer, bucket(-inf,"Jones"), bucket("Jones", inf))) each(each(output(summary()))));
</pre>
... or have Jones in his own group:
<pre>
select (&hellip;) | all(group(predefined(customer, bucket&lt;-inf,"Jones"&gt;, bucket["Jones"], bucket&lt;"Jones", inf&gt;)) each(each(output(summary()))));
</pre>
</p>
<h2 id="pagination">Pagination</h2>
<p>
Grouping supports <a href="reference/grouping-syntax.html#continuations">continuation</a> objects
that are passed as annotations to the grouping statement.
The <code>continuations</code> annotation is a list of zero or more continuation strings, returned in the grouping result.
For example, given the result:
</p>
<pre>
{
"root": {
"children": [
{
"children": [
{
"children": [
{
"fields": {
"count()": 7
},
"value": "Jones",
"id": "group:string:Jones",
"relevance": 1.0
}
],
"continuation": {
"next": "BGAAABEBEBC",
"prev": "BGAAABEABC"
},
"id": "grouplist:customer",
"label": "customer",
"relevance": 1.0
}
],
"continuation": {
"this": "BGAAABEBCA"
},
"id": "group:root:0",
"relevance": 1.0
}
],
"fields": {
"totalCount": 20
},
"id": "toplevel",
"relevance": 1.0
}
}
</pre>
<p>
reproduce the same result by passing the <em>this</em> continuation along the orginal select:
<pre>
select (&hellip;) | [{ 'continuations':['BGAAABEBCA'] }]all(&hellip;);
</pre>
To display the next page of customers, pass the <em>this</em> continuation of the root
group, and the <em>next</em> continuation of the customer list:
<pre>
select (&hellip;) | [{ 'continuations':['BGAAABEBCA', 'BGAAABEBEBC'] }]all(&hellip;);
</pre>
To display the previous page of customers, pass the <em>this</em>
continuation of the root group, and the <em>prev</em> continuation of the customer list:
<pre>
select (&hellip;) | [{ 'continuations':['BGAAABEBCA', 'BGAAABEABC'] }]all(&hellip;);
</pre>
The <code>continuations</code> annotation is an ordered list of continuation strings.
These are combined by replacement,
so that a continuation given later will replace any shared state with a continuation given before.
Also, when using the <code>continuations</code> annotation, always pass the <em>this</em> continuation as its first element.
</p>
<p>
Note: Continuations work best when you have a stable ordering of hits - which can be achieved by using
<a href="ranking.html">ranking</a> or <a href="reference/grouping-syntax.html#order">ordering</a>.
You may also need to add an additional tie-breaker - like <a href="reference/rank-features.html#random">random.match</a>
or a random double value stored in each document - to keep the ordering stable in case of multiple documents which would
otherwise get the same rank score, or the same value used for ordering.
</p>
<h2 id="expressions">Expressions</h2>
<p>
Instead of just grouping on some attribute value,
the <code>group</code> clause may contain arbitrarily complex expressions -
see <code>group</code> in the
<a href="reference/grouping-syntax.html">grouping reference</a> for an exhaustive list.
Examples:
<ul>
<li>Select the minimum or maximum of sub-expressions</li>
<li>Addition, subtraction, multiplication, division, and even modulo of sub-expressions</li>
<li>Bitwise operations on sub-expressions</li>
<li>Concatenation of the results of sub-expressions</li>
</ul>
Sum the prices of purchases on per-hour-of-day basis:
</p>
<pre>
select (&hellip;) | all(group(mod(div(date,mul(60,60)),24)) each(output(sum(price))));
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>sum(price)</th>
</tr>
</thead>
<tbody>
<tr>
<td>09:00</td>
<td>$1 000</td>
</tr>
<tr>
<td>10:00</td>
<td>$22 367</td>
</tr>
<tr>
<td>11:00</td>
<td>$23 524</td>
</tr>
<tr>
<td>12:00</td>
<td>$26 181</td>
</tr>
<tr>
<td>13:00</td>
<td>$6 765</td>
</tr>
</tbody>
</table>
<p>
These types of expressions may also be used inside <code>output</code> operations,
so instead of simply calculating the sum price of the grouped purchases,
calculate the sum income after taxes per customer:
</p>
<pre>
select (&hellip;) | all(group(customer) each(output(sum(mul(price,sub(1,tax))))));
</pre>
<table class="table">
<thead>
<tr>
<th>GroupId</th>
<th>sum(mul(price,sub(1,tax)))</th>
</tr>
</thead>
<tbody>
<tr>
<td>Brown</td>
<td>$17 193</td>
</tr>
<tr>
<td>Jones</td>
<td>$32 868</td>
</tr>
<tr>
<td>Smith</td>
<td>$15 897</td>
</tr>
</tbody>
</table>
<p>
Note that the validity of an expression depends on the current nesting level.
E.g. while <code>sum(price)</code> would be a valid expression for a group of hits, <code>price</code> would not.
As a general rule, each operator within an expression either applies to a single hit or aggregates values across a group.
</p>
<h2 id="search-container-api">Search Container API</h2>
<p>
As an alternative to a textual representation,
one can use the programmatic API to execute grouping requests.
This allows multiple grouping requests to run in parallel,
and does not collide with the <code>yql</code> parameter - example:
<pre>
@Override
public Result search(Query query, Execution execution) {
// Create grouping request.
GroupingRequest request = GroupingRequest.newInstance(query);
request.setRootOperation(new AllOperation()
.setGroupBy(new AttributeValue(&quot;foo&quot;))
.addChild(new EachOperation()
.addOutput(new CountAggregator().setLabel(&quot;count&quot;))));
// Perform grouping request.
Result result = execution.search(query);
// Process grouping result.
Group root = request.getResultGroup(result);
GroupList foo = root.getGroupList(&quot;foo&quot;);
for (Hit hit : foo) {
Group group = (Group)hit;
Long count = (Long)group.getField(&quot;count&quot;);
// TODO: Process group and count.
}
// Pass results back to calling searcher.
return result;
}
</pre>
Refer to the
<a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/search/grouping/package-summary.html">
API documentation</a> for the complete reference.
</p>
<h2 id="more-examples">More examples</h2>
<h3 id="topn-full-corpus">TopN / Full corpus</h3>
<p>
Simple grouping, count the number of documents in each group:
<pre>all( group(a) each(output(count())) );</pre>
Two parallel groupings:
<pre>all( all(group(a) each(output(count())))
all(group(b) each(output(count()))) );</pre>
Only the 1000 best hits will be grouped at each content node. Lower accuracy, but higher speed:
<pre>all( max(1000) all(group(a) each(output(count()))) );</pre>
<!-- ToDo: check - only for streaming? what is this ... -->
In <a href="streaming-search.html#grouping">streaming search</a>
one can group all searched documents by adding a <code>where(true)</code> clause:
<pre>all( where(true) all(group(myfield) each(output(count()))) );</pre>
</p>
<h3 id="selecting-groups">Selecting groups</h3>
<p>
Do a modulo 5 operation before selecting the group:
<pre>all( group(a % 5) each(output(count())) );</pre>
Do <code>a + b * c</code> before selecting the group:
<pre>all( group(a + b * c) each(output(count())) );</pre>
</p>
<h3 id="ordering-groups">Ordering groups</h3>
<p>
Do a modulo 5 operation before selecting the group -
the groups are then ordered by their aggregated sum of attribute "b":
<pre>all( group(a % 5) order(sum(b)) each(output(count())) );</pre>
Do <code>a + b * c</code> before selecting the group.
Ordering is given by the maximum value of attribute "d" in each group:
<pre>all( group(a + b * c) order(max(d)) each(output(count())) );</pre>
Take the average relevance of the groups and multiply it with
the number of groups to get a cumulative count:
<pre>all( group(a) order(avg(relevance()) * count()) each(output(count())) );</pre>
One can not directly reference an attribute in the order clause, as this:
<pre>all(group(a) order(attr * count()) each(output(count())) );</pre>
However, one can do this:
<pre>all(group(a) order(max(attr) * count()) each(output(count())) );</pre>
</p>
<h3 id="collecting-aggregates">Collecting aggregates</h3>
<p>
Simple grouping to count number of documents in each group and return the best hit in each group:
<pre>all( group(a) each(max(1) each(output(summary()))) );</pre>
Also return the sum of attribute "b":
<pre>all( group(a) each(max(1) output(count(), sum(b)) each(output(summary()))) );</pre>
Also return an XOR of the 64 most significant bits of an MD5 over the concatenation of attributes "a", "b" and "c":
<pre>all(group(a) each(max(1) output(count(), sum(b), xor(md5(cat(a, b, c), 64)))
each(output(summary()))));</pre>
</p>
<h3 id="grouping">Grouping</h3>
<p>Single level grouping on "a" attribute, returning at most 5 groups with full hit count as well as the 69 best hits.
<pre>all( group(a) max(5) each(max(69) output(count()) each(output(summary()))) );</pre>
Two level grouping on "a" and "b" attribute:
<pre>all( group(a) max(5) each(output(count())
all(group(b) max(5) each(max(69) output(count())
each(output(summary()))))) );</pre>
Three level grouping on "a", "b" and "c" attribute:
<pre>all( group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );</pre>
As above, but also collect best hit in level 2:
<pre>all( group(a) max(5) each(output(count())
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );</pre>
As above, but also collect best hit in level 1:
<pre>all( group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(c) max(5) each(max(69) output(count())
each(output(summary()))))) );</pre>
As above, but using different document summaries on each level:
<pre>all( group(a) max(5) each(output(count())
all(max(1) each(output(summary(complexsummary))))
all(group(b) max(5) each(output(count())
all(max(1) each(output(summary(simplesummary))))
all(group(c) max(5) each(max(69) output(count())
each(output(summary(fastsummary)))))) );</pre>
Deep grouping with counting and hit collection on all levels:
<pre>all( group(a) max(5) each(output(count())
all(max(1) each(output(summary())))
all(group(b) each(output(count())
all(max(1) each(output(summary())))
all(group(c) each(output(count())
all(max(1) each(output(summary())))))))) );</pre>
</p>
<h3 id="time-and-date">Time and date</h3>
<p>
Group by year:
<pre>all( group(time.year(a)) each(output(count())) );</pre>
Group by year, then by month:
<pre>all( group(time.year(a)) each(output(count())
all(group(time.month(a)) each(output(count())))) );</pre>
Group by year, then by month, then day, then by hour:
<pre>all( group(time.year(a)) each(output(count())
all(group(time.monthofyear(a)) each(output(count())
all(group(time.dayofmonth(a)) each(output(count())
all(group(time.hourofday(a)) each(output(count())))))))) );</pre>
Groups <em>today</em>, <em>yesterday</em>, <em>lastweek</em>, and <em>lastmonth</em>
using <code>predefined</code> aggregator, and groups each day within each of these separately:
<pre>all( group(predefined((now() - a) / (60 * 60 * 24),
bucket(0,1), bucket(1,2), bucket(3,7), bucket(8,31)))
each(output(count())
all(max(2) each(output(summary())))
all(group((now() - a) / (60 * 60 * 24)) each(output(count())
all(max(2) each(output(summary())))))) );</pre>
</p>
<h3 id="counting-unique-groups">Counting unique groups</h3>
<p>
The <code>count</code> aggregator can be applied on list of groups to determine the number of unique groups
without having to explicitly retrieve all groups. Note that that this count is an estimate using HyperLogLog++ which is an algorithm for the count-distinct problem.
To get an accurate count one needs to explicitly retrieve all groups and count them in a custom component or in the middle tier calling out to Vespa.
This is network intensive and might not be feasible in cases with many unique groups.
Another use case for this aggregator is counting the number of unique instances matching a given expression.
Output an estimate of the number of groups, which is equivalent to the number of unique values for attribute "a":
<pre>all( group(a) output(count()) )</pre>
Output an estimate of the number of unique string lengths for the attribute "name":
<pre>all( group(strlen(name)) output(count()) )</pre>
Output the sum of the "b" attribute for each group in addition to the accurate count of the overall number of unique groups as the inner each causes all groups to be returned.
<pre>all( group(a) output(count()) each(output(sum(b))) )</pre>
The <code>max</code> clause is used to restrict the number of groups returned.
The query outputs the sum for the 3 best groups. The <code>count</code> clause
outputs the estimated number of groups (potentially &gt;3). The <code>count</code> becomes an estimate here as the number of groups is limited by max<br/> while in the above example it's not limited by max:
<pre>all( group(a) max(3) output(count()) each(output(sum(b))) )</pre>
Output the number of top level groups, and for the 10 best groups,
output the number of unique values for attribute "b":
<pre>all( group(a) max(10) output(count()) each(group(b) output(count())) )</pre>
</p>
<h3 id="impression-forecasting">Impression forecasting</h3>
<p>
Using impression logs for a given user,
one can make a function that maps from rank score to the number of impressions an advertisement would get - example:
<pre>
Score Integer (# impressions for this user)
0.200 0
0.210 1
0.220 2
0.240 3
0.320 4
0.420 5
0.560 6
0.700 7
0.800 8
0.880 9
0.920 10
0.940 11
0.950 12
</pre>
Storing just the first column (the rank scores, including a rank score for 0 impressions)
in an array attribute named <em>impressions</em>, the grouping operation
<code><a href="reference/grouping-syntax.html#interpolatedlookup">interpolatedlookup(impressions, relevance())</a></code>
can be used to figure out how many times a given advertisement would have been shown to this particular user.
</p><p>
So if the rank score is 0.420 for a specific user/ad/bid combination,
then <code>interpolatedlookup(impressions,relevance())</code> would return 5.0.
If the the bid is increased so the rankscore gets to 0.490 it would get 5.5 as the return value instead.
</p><p>
In this context a count of 5.5 isn't meaningful for the past of a single user,
but it gives more information that may be used as a forecast.
Summing this across many different users may then be used to forecast
the total of future impressions for the advertisement.
</p>
You can’t perform that action at this time.