Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
87 lines (78 sloc) 3.65 KB
---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Stemming"
---
<p>
Stemming means <em>translate a word to its base form</em> (singular forms
for nouns, infinitive for verbs), using a
<a href="http://en.wikipedia.org/wiki/Stemming">stemmer</a>. Use of
stemming increases recall when searching because the searcher is
usually interested in documents containing query words regardless of
the word form used. Stemming in Vespa is symmetric, i.e. words are
converted to stems both when indexing and searching.
</p><p>
Examples of this is when text is indexed, the stemmer will convert the
noun <em>reports</em> (plural) to <em>report</em>, and the latter will
be stored in the index. Likewise, before searching, <em>reports</em>
will be stemmed to <em>report</em>. Another example is
that <em>am</em>, <em>are</em> and <em>was</em> all will be stemmed
to <em>be</em> both in queries and indexes.
</p><p>
When bolding is enabled, all forms of the query term will be bolded.
I.e. when searching for <em>reports</em>,
both <em>report</em>, <em>reported</em> and <em>reports</em> will be
bolded.
</p>
<h2 id="theory">Theory</h2>
<p>
From a matching point of view, stemming takes all possible token strings
and maps them into equivalence classes. So in the example above, the set
of tokens { "report", "reports", "reported" } are in an equivalence class.
To represent the class the linguistics library should pick the shortest
element in the class. At query time, the text typed by a user will be tokenized,
and then each token should be mapped to the most likely equivalence class,
again represented by the shortest element that belongs to the class.
</p><p>
While the theory sounds pretty simple, in practice it is not always possible
to figure out which equivalence class a token should belong to. A typical
example is the string "number". In most cases we would guess this to
mean a numerical entity of some kind, and the equivalence class would be
{ "number", "numbers" } - but it could also be a verb, with a different
equivalence class { "number", "numbered", "numbering" } for example. These are of course
closely related, and in practice they will be merged, so we'll have a slightly
larger equivalence class { "number", "numbers", "numbered", "numbering" } and
be happy with that.
However, in a sentence such as "my legs keep getting number every day"
the "number" token clearly does not have the semantics of a numerical entity, but
should be in the equivalence class { "numb", "number", "numbest", "numbness" } instead.
But blindly assigning "number" to the equivalence class "numb" is clearly
not right, since the "more numb" meaning is much less likely than the
"numerical entity" meaning.
</p><p>
The approach currently taken by the low-level linguistics library
will often lead to problems in the "number"-like cases as described above.
To give better recall, Vespa has implemented a "multiple" stemming option.
</p>
<h2 id="languages">Languages</h2>
<p>
Stemming is currently available for English, other languages needs user contribution.
</p>
<h2 id="configuration">Configuration</h2>
<p>
By default, all words are stemmed to their <em>shortest</em> form in Vespa.
Refer to the <a href="reference/search-definitions-reference.html#stemming">
stemming reference</a> for other stemming types. To change type, add:
</p>
<pre class="brush: text">
stemming: [stemming-type]
</pre>
<p>
Stemming can be set either for a field, a fieldset or as a default for all fields.
Example: Disable stemming for the field <em>title</em>:
<pre class="brush: sd">
field title type string {
indexing: summary | index
stemming: none
}
</pre>
</p>