-
Notifications
You must be signed in to change notification settings - Fork 94
/
advanced-search-operators.html
264 lines (244 loc) · 11.5 KB
/
advanced-search-operators.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Advanced Search Operators"
---
<p>
This document describes the characteristics and differences of Vespa's advanced search operators:
<ul>
<li><a href="#parallel-wand">Parallel Wand</a></li>
<li><a href="#dot-product">Dot Product</a></li>
<li><a href="#weighted-set">Weighted Set</a></li>
<li><a href="#vespa-wand">Vespa Wand</a></li>
</ul>
A basic description is given for each operator in addition to the following characteristics:
</p>
<ul>
<li><strong>Field type</strong>: What kind of field types is supported for this operator.</li>
<li><strong>Query model</strong>: What kind of data is passed down with the query when using this operator.</li>
<li><strong>Matching</strong>: What is the criteria for a document being a match.</li>
<li><strong>Ranking</strong>: What kind of ranking is this operator using internally (if any) and
what is exposed to the ranking framework.</li>
<li><strong>YQL operator</strong>: How is this operator used in YQL.</li>
<li><strong>Java Query Item</strong>: How is this operator used in the Java Query API.</li>
</ul>
<h2 id="parallel-wand">Parallel Wand</h2>
<p>
Parallel Wand is a search operator that can be used for efficient top-k retrieval.
It implements the "Weak AND"/"Weighted AND" algorithm as described by Broder et al in
"Efficient query evaluation using a two-level retrieval process", and is an operator that scales
adaptively from OR to AND.
</p><p>
Parallel Wand can be used to search for documents where weighted tokens in a field matches a subset of
weighted tokens in the query. At the same time, it internally calculates the dot product
between token weights in the query and the field. Parallel Wand is guaranteed to return the top-k
hits according to its internal dot product rank score.
It also optimizes the performance of using multiple threads per search in the backend.
Take a look at
<a href="../reference/wand-operator.html">Wand Search Operators</a>
reference for more information on how to use this operator.
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th>Field type</th>
<td>Weighted set attribute with fast-search. Note: Also supported for regular attribute or
index fields, but then with much weaker performance).</td>
</tr>
<tr>
<th>Query model</th>
<td>Weighted set with {token, weight} pairs.</td>
</tr>
<tr>
<th>Matching</th>
<td>Documents where the weighted set field contains at least one of the tokens in the query and
where the internal dot product score for this document, is larger than the worst among the
current top-k best hits. This means that more than top-k documents are matched and returned
for ranking. It also means that many documents are skipped even they match several tokens
in the query because the dot product score is too low. This skipping makes Parallel Wand faster
than Dot Product Operator in some cases.
</td>
</tr>
<tr>
<th>Ranking</th>
<td>Dot product score between the weights of the matched query tokens and field tokens.
This score is available using <code>rawScore</code> or <code>itemRawScore</code> rank features.
Note that the top-k best hits are only guaranteed to be returned when using this
internal score as the final ranking expression.
</td>
</tr>
<tr>
<th>YQL operator</th>
<td><a href="../query-language.html#wand">wand()</a></td>
</tr>
<tr>
<th>Java Query Item</th>
<td><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/prelude/query/WandItem.html">WandItem</a></td>
</tr>
</tbody>
</table>
<h3 id="parallel-wand-tuning-target-number-of-hits">Tuning target number of hits</h3>
<p>
When using Parallel Wand via YQL or a Java Searcher plugin, you can specify the target for minimum
number of hits the operator should produce. By default, set <code>targetNumHits</code>
equal to the number of hits you are going to display to the user.
If additional second phase ranking with rerank-count is used,
do not set <code>targetNumHits</code> less than the configured rank-profile's rerank-count.
Note that if you combine Parallel Wand with a ranking expression that does not use its raw score,
the tuning of <code>targetNumHits</code> should follow the guide lines given for Vespa Wand.
</p>
<h3 id="parallel-wand-score-tuning-score-threshold">Tuning Score Threshold</h3>
<p>
You can also specify a score threshold when using Parallel Wand. The internal dot product score for
a document must be larger than <code>scoreThreshold</code> in order to be considered a match.
Default value is 0.0.
</p>
<h2 id="dot-product">Dot Product</h2>
<p>
Dot Product Operator is the brute force equivalent to Parallel Wand.
They are both used to search for documents where weighted tokens in a field matches a subset of
weighted tokens in the query. They also produce the exact same dot product score.
In some simple cases Dot Product Operator is preferable to Parallel Wand.
Take a look at
<a href="../reference/dot-product-search-operator.html">Dot Product Search Operator</a>
reference for more information on these use cases.
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th>Field type</th>
<td>Weighted set attribute with fast-search. Note: Also supported for regular attribute or
index fields, but then with much weaker performance).</td>
</tr>
<tr>
<th>Query model</th>
<td>Weighted set with {token, weight} pairs</td>
</tr>
<tr>
<th>Matching</th>
<td>Documents where the weighted set field contains at least one of the tokens in the query.</td>
</tr>
<tr>
<th>Ranking</th>
<td>Dot product score between the weights of the matched query tokens and field tokens.
This score is available using <code>rawScore</code> or <code>itemRawScore</code> rank features.</td>
</tr>
<tr>
<th>YQL operator</th>
<td><a href="../query-language.html#dot-product">dotProduct()</a></td>
</tr>
<tr>
<th>Java Query Item</th>
<td><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/prelude/query/DotProductItem.html">DotProductItem</a></td>
</tr>
</tbody>
</table>
<h2 id="weighted-set">Weighted Set</h2>
<p>
Weighted Set Operator is used to search for documents where all tokens in the searched field will
be reverse matched against the tokens of the weighted set in the query. This means that using a Weighted Set
Operator to search a single-value attribute field will have similar semantics to using a normal
term to search a weighted set field. Take a look at
<a href="../reference/weighted-set-term.html">Weighted Set Search Operator</a>
reference for more information on how to use this operator.
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th>Field type</th>
<td>Single value or multi-value attribute or index field.
(Note: Most use cases operates on a single value field).</td>
</tr>
<tr>
<th>Query model</th>
<td>Weighted set with {token, weight} pairs.</td>
</tr>
<tr>
<th>Matching</th>
<td>Documents where the field contains at least one of the tokens in the query.</td>
</tr>
<tr>
<th>Ranking</th>
<td>The operator will act as a single term in the back-end. The query term weight is the weight
assigned to the operator itself and the match weight is the largest weight among matching
tokens from the weighted set. This operator does not produce a raw score. Due to better
ranking and performance we recommend using the Dot Product Operator instead.</td>
</tr>
<tr>
<th>YQL operator</th>
<td><a href="../query-language.html#weighted-set">weightedSet()</a></td>
</tr>
<tr>
<th>Java Query Item</th>
<td><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/prelude/query/WeightedSetItem.html">WeightedSetItem</a></td>
</tr>
</tbody>
</table>
<h2 id="vespa-wand">Vespa Wand</h2>
<p>
Vespa Wand is a search operator that also implements the "Weak AND"/"Weighted AND" algorithm.
Unlike Parallel Wand, Vespa Wand can be used to search across several fields of various types,
but it does NOT guarantee to return the top-k best number of hits.
It can however be combined with any ranking expression, but keep in mind that this expression should
correlate with its simple internal ranking score that uses query term weight and inverse document
frequency for matching terms.
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th>Field type</th>
<td>Multiple fields of all types (both attribute and index).</td>
</tr>
<tr>
<th>Query model</th>
<td>Arbitrary number of query items searching across different fields.</td>
</tr>
<tr>
<th>Matching</th>
<td>Documents that matches at least one of the tokens in the query and where the
internal operator score for this document is larger than the worst among the current top-k
best hits. As with Parallel Wand, this means that typically more than top-k documents are matched
and a lot of documents are skipped.
</td>
</tr>
<tr>
<th>Ranking</th>
<td>Internal ranking score based on query term weight and inverse document frequency for
matching terms to find the top-k hits. This score is currently not available to the ranking
framework. Matching terms are exposed to the ranking framework (same as when using AND or OR),
so an arbitrary ranking expression can be used in combination with this operator.
Note that the ranking expression used should correlate with this internal ranking score.
<code>nativeFieldMatch</code> and <code>nativeDotProduct</code> are good starting points.
</td>
</tr>
<tr>
<th>YQL operator</th>
<td><a href="../query-language.html#weak-and">weakAnd()</a></td>
</tr>
<tr>
<th>Java Query Item</th>
<td><a href="http://javadoc.io/page/com.yahoo.vespa/container-search/latest/com/yahoo/prelude/query/WeakAndItem.html">WeakAndItem</a></td>
</tr>
</tbody>
</table>
<h3 id="vespa-wand-tuning-target-number-of-hits">Tuning target number of hits</h3>
<p>
When using Vespa Wand via YQL or a Java Searcher plugin you can specify the target for minimum
number of hits the operator should produce. The effect of tuning <code>targetNumHits</code> may not
be very intuitive. To ensure that you get the best hits possible with a Vespa Wand
set the target number somewhat higher than the number of hits displayed to the user;
setting it 10 times higher should be more than enough. The reason for
increasing the target number is that Vespa Wand uses a very simple ranking function internally to
filter away bad hits. If your normal rank expression correlates poorly with this internal filtering
formula, you need to increase the target number. The easiest is probably to use the default number
(100) and see if that gives good results, only tuning if you see a problem. Also note that if your
rank correlates poorly with the filtering criteria, Vespa Wand may not be the appropriate operator.
</p>
<h3 id="vespa-wand-selecting-a-rank-function">Selecting a rank function</h3>
<p>
Anything similar to classic vector model ranking will work well with Vespa Wand. We suggest using
the <code>nativeFieldMatch</code> or <code>nativeDotProduct</code> feature as a starting point.
Note that because Vespa Wand relies on feedback identifying which hits are used for first phase
ranking to increase its threshold for what's considered a good hit, the special <code>unranked</code>
rank profile (which turns off ranking completely) may cause Vespa Wand queries to become slower than
a more normal ranking.
</p>