Matching substrings in fulltext search

levity edited this page Nov 9, 2010 · 12 revisions

By default, Solr breaks fulltext into tokens, each of which is a full word. At query time, it tokenizes the query input the same way and looks up each query token in an inverted index of the indexed tokens. This works well for standard fulltext applications, but because tokens must match exactly, it does not lend itself to prefix or substring matching.
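To make the limitation concrete, here is a minimal Ruby sketch of whole-word matching against an inverted index. It is a toy model, not Solr's implementation; the `tokenize`, `build_index`, and `search` helpers and the sample documents are hypothetical names invented for illustration.

```ruby
# Toy model of whole-word matching; this is not how Solr is implemented.
def tokenize(text)
  text.downcase.split(/\s+/)
end

# Build an inverted index: token => ids of the documents containing it.
def build_index(docs)
  index = Hash.new { |hash, token| hash[token] = [] }
  docs.each do |id, text|
    tokenize(text).uniq.each { |token| index[token] << id }
  end
  index
end

# A document matches when a query token equals one of its indexed tokens.
def search(index, query)
  tokenize(query).flat_map { |token| index.fetch(token, []) }.uniq
end

DOCS = { 1 => "Sunspot searches Solr", 2 => "Solr tokenizes text" }
INDEX = build_index(DOCS)

search(INDEX, "solr")  # => [1, 2]
search(INDEX, "sun")   # => [] because "sun" is not itself an indexed token
```

The second query finds nothing even though document 1 contains "Sunspot", which is the gap the n-gram filters are meant to close.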

The best way to perform prefix or substring matching is to use Solr's EdgeNGramFilterFactory (prefix) or NGramFilterFactory (substring) filters. Sunspot does not currently have explicit support for these filters, but with a few modifications to your schema.xml file you can make them available to Sunspot searches. First, add a new field type to your schema:

<fieldtype class="solr.TextField" name="text_pre" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
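    <!-- Both gram sizes default to 1 if omitted, which would index only the first character of each word; the values here are examples. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>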
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
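As a rough illustration of what the two analyzers in this field type do, the following Ruby sketch approximates the index-time and query-time token streams. It only mimics the filters' behavior; the helper names and the gram sizes (1 to 15) are assumptions chosen for the example.

```ruby
# Approximates edge n-gram expansion: every prefix of the token whose
# length lies between min_size and max_size (sizes are assumed values).
def edge_ngrams(token, min_size = 1, max_size = 15)
  longest = [token.length, max_size].min
  (min_size..longest).map { |size| token[0, size] }
end

# Index-time analysis: whitespace split, lowercase, then prefix expansion.
def index_tokens(text)
  text.downcase.split(/\s+/).flat_map { |token| edge_ngrams(token) }
end

# Query-time analysis: whitespace split and lowercase only.
def query_tokens(text)
  text.downcase.split(/\s+/)
end

indexed = index_tokens("Sunspot")
# => ["s", "su", "sun", "suns", "sunsp", "sunspo", "sunspot"]

query_tokens("Sun").all? { |token| indexed.include?(token) }   # => true
query_tokens("spot").all? { |token| indexed.include?(token) }  # => false, "spot" is not a prefix
```

The query analyzer deliberately omits the n-gram filter. If the query were expanded as well, a query such as "sunx" would match "sunspot" through the shared prefix "s".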

Next, add a new dynamic field definition to the schema, so that Sunspot can reference any field whose name ends in the dynamic field suffix _textp. Add this to your schema.xml as well:

<dynamicField name="*_textp" stored="false" type="text_pre" multiValued="true" indexed="true"/>

Finally, to configure a text field in Sunspot to index into your newly created dynamic field, use the :as option (new in Sunspot 1.2) to explicitly map the field to the appropriate name in Solr:

searchable do
  text :code, :as => :code_textp
  # etc.
end

Now when you perform fulltext searches, a user-entered keyword matches any indexed word that begins with it, rather than requiring the full word to be entered. To match arbitrary substrings instead, replace EdgeNGramFilterFactory with NGramFilterFactory. In either case, set minGramSize and maxGramSize to cover the query lengths you want to match.
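The difference between the two filters can be sketched in Ruby as follows. This is an approximation for illustration only: the `ngrams` helper is hypothetical, the gram sizes are example values, and the order in which Solr emits grams may differ.

```ruby
# Approximates n-gram expansion: every substring of the token whose
# length lies between min_size and max_size.
def ngrams(token, min_size, max_size)
  longest = [token.length, max_size].min
  (min_size..longest).flat_map do |size|
    (0..token.length - size).map { |start| token[start, size] }
  end
end

ngrams("spot", 2, 3)
# => ["sp", "po", "ot", "spo", "pot"]

grams = ngrams("sunspot", 2, 4)
grams.include?("spot")   # => true, a substring in the middle of the word now matches
grams.include?("sunsp")  # => false, the query is longer than the largest gram
grams.length             # => 15
```

Substring matching comes at a cost: with these sizes "sunspot" yields 15 indexed tokens, where edge n-grams of the same sizes would yield 3. Because the query side is not expanded, a query longer than the largest gram size finds nothing unless it is the whole word, so the maximum size needs to cover the longest query you expect.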

For more information, see the Solr wiki: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory