---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Linguistics in Vespa"
---
<p>
Vespa uses a <em>linguistics</em> module to process text in queries and documents
during indexing and searching. The goal of linguistic processing is to increase
<em>recall</em> (how many documents are matched) without hurting <em>precision</em>
(the relevance of the documents matched) too much. It consists of operations such as
tokenizing text into chunks of known types (words, punctuation), normalizing accents,
and finding the base form of words (stemming or lemmatization).
These operations can be turned on or off per field in a search definition.
</p>
<p>The default linguistics implementation, SimpleLinguistics, supports
English stemming only. To support additional languages, use
<a href="https://opennlp.apache.org/">OpenNlp</a> linguistics instead by loading the
<code>com.yahoo.language.opennlp.OpenNlpLinguistics</code>
module, or provide your own linguistics module.
</p>
<h2 id="configuring-a-linguistics-implementation">Configuring a linguistics implementation</h2>
<p>The linguistics implementation must be configured as a component in container clusters doing linguistics processing.</p>
<p>By default, document processing for indexing is done by an autogenerated container cluster,
which cannot be configured. Specify a container cluster for indexing explicitly so that the component can be added to it.</p>
<p>This example configures OpenNlp linguistics, using the same cluster for both query and indexing processing. If you use different clusters, add the same linguistics component to all of them:</p>
<pre>
&lt;services&gt;
    &lt;container version="1.0" id="mycontainer"&gt;
        <span style="background-color: yellow;">&lt;component id="com.yahoo.language.opennlp.OpenNlpLinguistics"/&gt;</span>
        &lt;document-processing/&gt;
        &lt;search/&gt;
        &lt;nodes ...&gt;
    &lt;/container&gt;
    &lt;content version="1.0"&gt;
        &lt;redundancy&gt;1&lt;/redundancy&gt;
        &lt;documents&gt;
            &lt;document type="mydocument" mode="index"/&gt;
            <span style="background-color: yellow;">&lt;document-processing cluster="mycontainer"/&gt;</span>
        &lt;/documents&gt;
        &lt;nodes ...&gt;
    &lt;/content&gt;
&lt;/services&gt;
</pre>
<p>Note that if you change the linguistics component of a live system, recall may be reduced
until all documents are re-written, because documents remain stored with the tokens generated by the previous
linguistics module.</p>
<h2 id="creating-a-custom-linguistics-implementation">Creating a custom linguistics implementation</h2>
<p>A linguistics component is an implementation of
<a href="https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/Linguistics.java">
com.yahoo.language.Linguistics</a>. Refer to the
<a href="https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/simple/SimpleLinguistics.java">
com.yahoo.language.simple.SimpleLinguistics</a> implementation (which you can subclass for convenience).
</p>
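<p>As a rough sketch of how such a component might be wired in, assuming a hypothetical class
<code>com.example.MyLinguistics</code> packaged in a hypothetical bundle named <code>my-linguistics</code>
(neither name comes from Vespa itself), the component declaration takes the place of the OpenNlp one:</p>
<pre>
&lt;container version="1.0" id="mycontainer"&gt;
    &lt;!-- Hypothetical custom implementation; class and bundle names are placeholders --&gt;
    &lt;component id="com.example.MyLinguistics" bundle="my-linguistics"/&gt;
    &lt;document-processing/&gt;
    &lt;search/&gt;
    &lt;nodes ...&gt;
&lt;/container&gt;
</pre>
<p>As with the built-in implementations, the same component would need to be present in every
container cluster that processes queries or documents.</p>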
<h2 id="language-handling">Language handling</h2>
<p>
This section describes how language settings are applied in Vespa.
It covers both the <a href="reference/advanced-indexing-language.html#set_language">
set_language</a> indexing expression and the
<a href="reference/search-api-reference.html#model.language">language</a> query parameter.
</p><p>
The single most important thing to note about language handling in Vespa
is that Vespa does <em>not</em> know the language of a document. Instead,
1) the indexing processor is instructed on a per-field level what language to
use when calling the underlying linguistics library, and
2) the query processor is instructed on a per-query level what language to use.
If no language is explicitly set in a document or a query,
Vespa will run its configured language detector on the available text
(the full content of a document field, or the full <code>query=</code> parameter value).
</p><p>
A document that contains the exact same word as a query might not be recalled
if the language of the document field is detected differently from that of the query.
Unless the query explicitly declares a language, this is likely to occur.
</p>
<h3 id="indexing-with-language">Indexing with language</h3>
<p>
The indexing process run by Vespa is nothing more than the sequential execution
of the indexing script of every field in the input document.
At any point, the script may choose to set the language state of the processor
using <a href="reference/advanced-indexing-language.html#set_language">set_language</a>. Example:
<pre>
search book {
    document book {
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index
        }
    }
}
</pre>
This indicates that every document in the input is expected to specify its own language.
</p><p>
Because indexing scripts are executed in the order they are given in the search definition,
and because the language state is never reset during the processing of a single document,
all indexed string fields following the <code>language</code> field
will be processed under the rules of that language.
</p><p>
The only thing that changes due to language is the output from
<code>normalize</code> and <code>tokenize</code>.
Because <code>indexing: index</code> implies <code>tokenize</code> for string fields,
the field <code>title</code> is affected.
</p><p>
If either <code>normalize</code> or <code>tokenize</code> is invoked prior to <code>set_language</code>,
the language detector is run on the input string.
</p><p>
The net result of this is that by calling <code>set_language</code> inside a document,
you change the terms that end up in a tokenized index.
This means that at query-time, you need to apply the same language settings
before tokenizing the query terms to be able to match what was stored in the index.
This also means that a single index may simultaneously contain terms of multiple languages.
</p><p class="alert alert-success">
Even if a document contains a string field used as input for the
<code>set_language</code> indexing expression,
the language is not automatically stored in an index.
If you wish to filter by language at some point,
you must explicitly save this field as an attribute.
</p>
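<p>A minimal sketch of this, assuming that <code>attribute</code> and <code>set_language</code>
can be chained in a single indexing statement (worth verifying against the indexing language reference):</p>
<pre>
search book {
    document book {
        field language type string {
            # Assumption: store the value as an attribute, then use it to set the language state
            indexing: attribute | set_language
        }
        field title type string {
            indexing: index
        }
    }
}
</pre>
<p>The <code>language</code> field still precedes <code>title</code>,
so the title is tokenized under the declared language as before.</p>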
<h3 id="querying-with-language">Querying with language</h3>
<p>
Now that we understand that the content of an indexed string field is language-agnostic,
it should be clear that one must apply a symmetric tokenization to the query terms
in order to match the content of that field.
This is exactly what Vespa's query parser does for you.
</p><p>
The query parser subscribes to a configuration file that tells it which fields are indexed strings,
and every query term that targets such a field is run through the appropriate tokenization.
The <code>language</code> query parameter controls the language state of these calls.
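</p><p>
For illustration, a query against the <code>title</code> field from the indexing example could declare English explicitly.
The <code>/search/</code> path and the term <code>books</code> are assumptions made for this sketch:
</p>
<pre>
/search/?query=title:books&amp;language=en
</pre>
<p>
With the language declared, the query term is processed under the same rules as the title
of a document whose <code>language</code> field was set to <code>en</code>.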
</p><p>
Because an index may simultaneously contain terms in any number of languages,
you might have stemmed variants of one language match the stemmed variants of another.
If you need to work around this, you must store the language of a document in a separate attribute,
and apply a filter against that attribute at query-time.
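</p><p>
As a sketch, assuming a <code>language</code> field that is also stored as an attribute
and a default query type that requires all terms to match, such a filter might look like this:
</p>
<pre>
/search/?query=title:books+language:en&amp;language=en
</pre>
<p>
Here <code>language:en</code> is a query term matched against the attribute, while the separate
<code>language=en</code> parameter controls how <code>books</code> is tokenized.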
</p><p>
If no language parameter is given, the language detector is called to process the query string.
The detector is likely to be confused by field names and query syntax,
but it is a best-effort approach.
This matches the language resolution of the index pipeline.
</p><p class="alert alert-success">
By default, nothing records which languages
were used to generate the content of an index.
The language parameter only affects the transformation of query terms that hit tokenized indexes.
</p>