Matching substrings in fulltext search

alindeman edited this page Sep 5, 2012 · 9 revisions

The NGramFilterFactory and EdgeNGramFilterFactory didn't return results with the default maxGramSize=1. Changing it to a higher value did the trick. See the EdgeNGramFilterFactory description. --bvajda

Thanks, bvajda! I took your advice and replaced the suggested EdgeNGramFilterFactory line with

<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="front"/>

and I'm now getting the results that I expected. --bradcater

FYI this isn't working, for me or another user on the mailing list. Suggestions welcome! --Inspire22

For me the setup in Wildcard searching with ngrams works quite well for substring search even with NGramFilterFactory --medihack

By default, Solr breaks fulltext into tokens, each of which is a full word. Using an inverted index, fulltext search tokenizes the query input and matches query tokens against indexed tokens. This works well for standard fulltext applications, but it does not support prefix or substring matching, because a query token must equal an indexed token exactly.
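As a rough illustration of why whole-word tokens rule out prefix matches, here is a toy inverted index in plain Ruby. It is not how Solr is implemented; `tokenize`, `build_index`, and `search` are hypothetical helper names made up for this sketch.

```ruby
# Split text into lowercase whole-word tokens (a simplification of Solr's analysis).
def tokenize(text)
  text.downcase.split(/\s+/).reject(&:empty?)
end

# Map each token to the ids of the documents that contain it.
def build_index(docs)
  index = Hash.new { |hash, token| hash[token] = [] }
  docs.each do |id, text|
    tokenize(text).uniq.each { |token| index[token] << id }
  end
  index
end

# A document matches only when a query token equals an indexed token exactly.
def search(index, query)
  tokenize(query).flat_map { |token| index.fetch(token, []) }.uniq
end

index = build_index(1 => "Sunspot search library", 2 => "Solr schema notes")
search(index, "sunspot") # whole word: finds document 1
search(index, "sun")     # prefix only: finds nothing
```

Because "sun" was never stored as a token, the lookup comes back empty even though document 1 contains "Sunspot".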

The best way to perform prefix/substring matching is to use Solr's NGramFilterFactory (substring) or EdgeNGramFilterFactory (prefix) filters. While Sunspot does not currently have explicit support for these filters, with a few modifications to your schema.xml file, you can easily make them available to Sunspot search. First, add a new type to your schema:

<fieldType class="solr.TextField" name="text_pre" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
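To see why the gram sizes matter for this field type, the two analyzer chains can be approximated in plain Ruby. This is only a sketch of the idea, not Solr's code: `edge_ngrams`, `index_terms`, `query_terms`, and `match?` are hypothetical names, the default sizes of 1 mirror the default maxGramSize=1 reported in the comments at the top of this page (a minGramSize default of 1 is an assumption), and requiring every query term to match is a simplification.

```ruby
# Prefixes of `token` whose length lies between min_gram and max_gram,
# a rough stand-in for what an edge n-gram filter emits at index time.
def edge_ngrams(token, min_gram, max_gram)
  upper = [max_gram, token.length].min
  (min_gram..upper).map { |size| token[0, size] }
end

# Index-time chain: split on whitespace, lowercase, expand into edge n-grams.
def index_terms(text, min_gram: 1, max_gram: 1)
  text.split.map(&:downcase).flat_map { |t| edge_ngrams(t, min_gram, max_gram) }.uniq
end

# Query-time chain: split on whitespace and lowercase only, no n-grams.
def query_terms(text)
  text.split.map(&:downcase)
end

# Simplification: a query matches when every query term is an indexed term.
def match?(indexed_text, query, **gram_sizes)
  terms = index_terms(indexed_text, **gram_sizes)
  query_terms(query).all? { |term| terms.include?(term) }
end

index_terms("Sunspot")                               # => ["s"]
index_terms("Sunspot", min_gram: 2, max_gram: 10)    # => ["su", "sun", "suns", "sunsp", "sunspo", "sunspot"]
match?("Sunspot", "sun")                             # => false
match?("Sunspot", "sun", min_gram: 2, max_gram: 10)  # => true
```

With both sizes at 1, only the first character of each word is indexed, which would explain the empty results described in the comments. The query side is deliberately left without n-grams so that the user's input is compared as typed against the stored prefixes.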

Then, you'll need to add a new dynamic field definition to the schema, so that you can reference the field names using the dynamic field prefix from Sunspot. Add this to your schema.xml as well:

<dynamicField name="*_textp" stored="false" type="text_pre" multiValued="true" indexed="true"/>

Finally, to configure a text field in Sunspot to index into your newly created dynamic field, use the :as option (new in Sunspot 1.2) to explicitly map the field to the appropriate name in Solr:

searchable do
  text :code, :as => :code_textp
  # etc.
end

Now when you perform fulltext searches, a user-entered keyword matches any indexed word that begins with it, rather than requiring the full word to be entered, as long as the keyword's length falls within the filter's minGramSize and maxGramSize range. To match arbitrary substrings instead, replace EdgeNGramFilterFactory with NGramFilterFactory.
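The difference between the two filters can be sketched in plain Ruby. The helpers `ngrams` and `edge_ngrams` are hypothetical names that only approximate the idea; the order in which Solr emits grams may differ from what this sketch produces.

```ruby
# All substrings of `token` with length between min_gram and max_gram,
# a rough stand-in for what a (non-edge) n-gram filter emits.
def ngrams(token, min_gram, max_gram)
  upper = [max_gram, token.length].min
  (min_gram..upper).flat_map do |size|
    (0..token.length - size).map { |start| token[start, size] }
  end
end

# Prefixes only, for comparison with the edge variant.
def edge_ngrams(token, min_gram, max_gram)
  upper = [max_gram, token.length].min
  (min_gram..upper).map { |size| token[0, size] }
end

ngrams("solr", 2, 3)                     # => ["so", "ol", "lr", "sol", "olr"]
ngrams("solr", 2, 3).include?("ol")      # => true, an inner substring is indexed
edge_ngrams("solr", 2, 3).include?("ol") # => false, only prefixes are indexed
```

Full n-grams produce more terms per word than edge n-grams, so substring matching costs a larger index than prefix matching.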

For more information, see the Solr wiki: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory