Advanced Fulltext Search Configuration

Tadas Tamošauskas edited this page Aug 11, 2014 · 4 revisions

Sunspot's default search schema uses a fairly conservative configuration for fulltext search. Text is split into tokens on whitespace and other delimiter characters by the StandardTokenizer, and each token is then lower-cased by the LowerCaseFilter so that fulltext search is case-insensitive; that's about it. However, Solr is extremely flexible in how it indexes and searches fulltext, and a lot of advanced functionality can be configured quite easily.

When you run the embedded Solr instance provided with Sunspot::Rails using rake sunspot:solr:start, it creates a solr directory in your project root containing, among other things, solr/conf/schema.xml. It is in this file that you can change your fulltext search configuration. We'll consider two use cases.

Adding functionality to your default fulltext behavior

In the first case, we simply want to add some extra functionality to all of the fulltext searches our application performs. For our purposes, let's add stemming (which allows, for instance, a search for "run" to match the word "running") and synonyms. In your schema.xml, you'll find the following section:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This is the field type that's used by Sunspot text fields. First, let's break out the analyzer into two: one for analyzing text at index-time, and one for analyzing text at query-time:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For now our index-time analyzer and query-time analyzer are doing the same thing, so we haven't changed Solr's behavior in any way. Next, let's add synonym support at index-time:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

You'll need to create your synonyms.txt file in the same solr/conf directory as schema.xml; check the Solr Wiki for details on the file format.
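
As a rough illustration of the file format, the comment in the sketch below shows what a small synonyms.txt might contain. The terms are made-up examples, not part of Sunspot's defaults. The filter element beneath the comment is the same one used in the index-time analyzer above.

```xml
<!--
  Hypothetical contents of solr/conf/synonyms.txt (one rule per line):

    # Lines starting with "#" are comments.
    # Comma-separated terms are treated as equivalent to one another.
    couch, sofa, divan
    # "=>" maps the terms on the left to the term(s) on the right.
    i-pod, i pod => ipod
-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.StandardTokenizerFactory"/>
```

With expand="true", indexing a document that contains any one of "couch", "sofa", or "divan" adds all three terms to the index, so a query for any of them will match that document.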

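Now let's add support for Porter stemming, which will allow Solr to match different words with the same root. The Porter stemmer should run at both index-time and query-time, so that indexed terms and query terms are reduced to the same root form. In the index-time analyzer it goes after the synonym filter, so that the entries in synonyms.txt are matched against whole words rather than stems. Let's add it to both: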

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.StandardTokenizerFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

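Note that you will need to restart Solr and re-index your data after making changes like this: Solr loads schema.xml when it starts, and documents already in the index were analyzed with the old index-time configuration. With Sunspot::Rails, that typically means running rake sunspot:solr:stop, then rake sunspot:solr:start, then rake sunspot:reindex.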

Synonyms and stemming are two of the most common modifications you might want to make to your fulltext configuration, but there are many other options built into Solr. Check out the Solr Wiki for the full range of possibilities.
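
Stop-word removal is one such option. The sketch below adds Solr's StopFilterFactory to both analyzers of the field type built up above. It assumes that a stopwords.txt file (one word per line) exists in the same conf directory; you may need to create it yourself. Treat the filter placement as one reasonable arrangement rather than the only correct one.

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: stopwords.txt lives next to schema.xml in solr/conf -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.StandardTokenizerFactory"/>
    <!-- Stop words are dropped after synonym expansion and before stemming -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The stop filter sits before the stemmer because the words in stopwords.txt are written in their unstemmed form. As with the earlier changes, Solr must be restarted and your data re-indexed before this takes effect.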