Setting up classes for search and indexing

Nicholas VanHowe edited this page Aug 6, 2013 · 14 revisions

Before objects can be indexed and searched with Sunspot, their class must be configured for search. Configuration tells Sunspot which fields to index, how to get the data for those fields, and a few other class-specific options. To configure a class, use the Sunspot.setup method. Let’s start with an example, a Post model for a blog, followed by a discussion of what it contains:

Sunspot.setup(Post) do
  text :title, :boost => 2.0
  text :body, :stored => true
  text :author_names do
    authors.map { |author| author.full_name }
  end
  string :title, :stored => true
  integer :blog_id, :references => Blog
  integer :category_ids, :references => Category, :multiple => true
  float :average_rating
  time :published_at
  boolean :featured, :using => :featured?
  boost { featured? ? 2.0 : 1.0 }
end

If you’re using Sunspot::Rails in a Rails app, you can do the same inside a searchable block in the model:

class Post < ActiveRecord::Base
  searchable do
    text :title, :boost => 2.0
    # etc.
  end
end

Ways to populate field data

When Sunspot indexes a Ruby object, it extracts data from that object based on the setup and creates a Solr document. Sunspot has two methods of extracting field data, known as attribute extraction and block extraction.

Most of the fields configured in the above setup use attribute extraction, which is to say they simply call a method on the object and index the return value. By default, the method name used is the same as the name of the field; however, a different method name can be specified with the :using option. For example, the above setup indexes the return value of the #featured? method in a field called featured.

The :author_names field in the above setup is an example of block extraction: the given block is evaluated in the context of the object being indexed, and its return value is indexed as field data. Block extraction is useful when the data for the field is needed only for search; you can keep the logic in the search definition and avoid polluting your object’s method namespace.

Note that in block extraction, as with all of Sunspot’s DSL blocks, your block can take an argument, in which case the object being indexed is passed as the argument and the block is evaluated in the calling context. So the following is equivalent to the :author_names definition above:

text :author_names do |post|
  post.authors.map { |author| author.full_name }
end
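The difference between the extraction styles can be sketched in plain Ruby. This is not Sunspot’s own code: the extract helper and the Post and Author classes below are hypothetical stand-ins, and the sketch assumes the choice between the two block forms is made by checking the block’s arity.

```ruby
# Plain-Ruby sketch of attribute and block extraction; not Sunspot's code.
Author = Struct.new(:full_name)

class Post
  attr_reader :title, :authors

  def initialize(title, authors, featured)
    @title = title
    @authors = authors
    @featured = featured
  end

  def featured?
    @featured
  end
end

# Hypothetical helper: returns the value that would be indexed for one field.
def extract(object, field_name, using: nil, &block)
  if block.nil?
    # Attribute extraction: call the method named by :using, or by the field.
    object.public_send(using || field_name)
  elsif block.arity == 1
    # Block with an argument: runs in the calling context, object passed in.
    block.call(object)
  else
    # Block without arguments: evaluated in the context of the object.
    object.instance_eval(&block)
  end
end

post = Post.new('Hello', [Author.new('Ann Lee'), Author.new('Bo Chan')], true)

title    = extract(post, :title)
featured = extract(post, :featured, using: :featured?)
names_a  = extract(post, :author_names) { authors.map { |a| a.full_name } }
names_b  = extract(post, :author_names) { |p| p.authors.map { |a| a.full_name } }
```

Both names_a and names_b hold the same array of author names. The practical difference is that only the second form can see local variables and methods from the code surrounding the setup block.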

Text Fields

The first three fields defined above, :title, :body, and :author_names, are text fields. When text fields are indexed, they are broken up into their constituent words and then processed using a definable set of filters (with Sunspot’s default Solr installation, they’re just lower-cased). This process is known as tokenization, and it’s what allows text fields to be searched using fulltext matching. You can read more about tokenization and the available filter options on the Solr wiki.
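As a rough illustration of tokenization, here is a toy tokenizer in plain Ruby. It is only a sketch: Solr’s real analyzers are configurable and handle far more cases, and the tokenize and fulltext_match? helpers are hypothetical names, not part of Sunspot or Solr.

```ruby
# Toy tokenizer: split into alphanumeric words, then apply a lower-case filter.
def tokenize(text)
  text.scan(/[[:alnum:]]+/).map(&:downcase)
end

# A fulltext match looks at individual tokens rather than the whole value.
def fulltext_match?(text, word)
  tokenize(text).include?(word.downcase)
end

tokens = tokenize('Sunspot Makes Solr Easy')
# tokens is ["sunspot", "makes", "solr", "easy"]

found = fulltext_match?('Sunspot Makes Solr Easy', 'SOLR')
# found is true, because the token "solr" is present
```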

Boost

When text fields are searched, each document is assigned a relevance score based on where the searched words appear in the document, how many times they appear in the document, and how common they are in the index as a whole. You can shape the relevance score by assigning boost, which at the field level tells Solr to assign higher relevance to search terms found in the field. For example, finding a search term in the :title field above would give a result document a higher score than finding the same search term in the :body field.
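To make the effect of field boost concrete, here is a deliberately simplified scoring model in plain Ruby. It is a toy, not the formula Solr uses: it only counts term occurrences per field and multiplies by the field’s boost, ignoring how common the term is in the index. The toy_score helper is hypothetical, and the body boost of 1.0 is an assumed default.

```ruby
# Toy relevance model: term count per field, weighted by the field's boost.
# The title boost mirrors the example setup; the body boost is assumed.
FIELD_BOOSTS = { title: 2.0, body: 1.0 }.freeze

def toy_score(document, term)
  FIELD_BOOSTS.sum do |field, boost|
    words = document.fetch(field, '').downcase.scan(/[[:alnum:]]+/)
    boost * words.count(term.downcase)
  end
end

in_title = { title: 'Solr tips', body: 'Some notes' }
in_body  = { title: 'Some notes', body: 'Solr tips' }

title_score = toy_score(in_title, 'solr') # 2.0
body_score  = toy_score(in_body, 'solr')  # 1.0
```

The same term found once scores twice as high in the boosted :title field as in :body.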

As well as specifying field boost, you can also assign a document boost, which will make certain documents globally more relevant than others, regardless of search terms. Use the boost method in the DSL to assign document boost, as in the last line of the block in the example. As with field definitions, document boost can be extracted using attribute or block extraction; the above example uses block extraction. Document boost can also be specified statically, thus giving all objects of the class under setup the same boost:

Sunspot.setup(Comment) do
  boost 1.2
end

Attribute Fields

Attribute fields are the focus of most of the other components of search: scoping, faceting, ordering, etc. The fields :title, :blog_id, :category_ids, :average_rating, :published_at, and :featured are all attribute fields. Unlike text fields, attribute fields are not tokenized: they are indexed and searched verbatim, similarly to how columns are saved and queried in a relational database.

Attribute fields are also typed; the available types are string, integer, long, float, double, time, date, and boolean. As illustrated by the example above, the method called in the DSL is the type of the field being defined. There is also a special type, class, which is used to store the class name of each indexed object; it should not be used explicitly.

Attribute field definitions can take a number of options:

:multiple

Boolean: Whether the field should index multiple values (the method/block used to generate the data returns an Array). Multiple-value fields cannot be used for sorting, because a document with several values has no single value to sort by.

:references

Class: Indicates that the values in this field act as a primary key for the specified class. This allows Sunspot to populate facet rows for this class with the referenced instance. See Drilling down with facets for more information.

:stored

Boolean: If true, the values in this field will be stored as well as indexed. When results are retrieved, stored field values are available in Hit objects, and can be accessed without making a database round-trip to populate the actual result instance.

:trie

Boolean: Numeric and time fields only. Use a TrieField to store the data. See Trie Fields below for what that means.

Attribute Field Types

Attribute fields are typed data; Sunspot supports most of the standard types that Solr does. A field’s type is determined by the method name used to define it, and can also be modified by certain options passed to the field. Here’s an overview:

String Fields

String fields store string data. How is this different from text fields? A text field is tokenized, which is to say it’s broken up into its constituent words; that’s how fulltext search works. String fields, on the other hand, are just indexed as-is: the indexed data is exactly that string, from beginning to end.

Numeric Fields

The numeric field types are Integer, Long, Float, and Double. They’re pretty self-explanatory: they index numbers.

Time Fields

Time fields store date/time data; they’re the equivalent of Ruby’s Time class.

Date Fields

Date fields store a date, but not a time. They’re the equivalent of Ruby’s Date class. Note that Solr does not provide a built-in Date type, so internally Sunspot still indexes these as date/times.
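One way to picture a date indexed as a date/time is the plain-Ruby sketch below. It assumes the date is represented as midnight UTC and formatted as an ISO 8601 timestamp with a trailing Z; the date_to_solr helper is a hypothetical name, and Sunspot’s internal conversion may differ in detail.

```ruby
require 'date'

# Sketch: represent a Date as midnight UTC and format it as a
# date/time string (ISO 8601 with a trailing Z).
def date_to_solr(date)
  Time.utc(date.year, date.mon, date.mday).strftime('%Y-%m-%dT%H:%M:%SZ')
end

stamp = date_to_solr(Date.new(2013, 8, 6))
# stamp is "2013-08-06T00:00:00Z"
```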

Trie Fields

This is a new type of field introduced in Solr 1.4, and first supported in Sunspot 1.0. Trie fields don’t simply index numbers as-is: instead, they index the number at different levels of accuracy; you can think of it as indexing the ones place, tens place, hundreds place, etc., individually. This makes doing range searches much faster. Trie fields can index any numeric type, as well as times. To use a Trie field, pass :trie => true as an option to your field definition. Read more about TrieFields here: http://lucene.apache.org/solr/api/org/apache/solr/schema/TrieField.html
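The “ones place, tens place, hundreds place” analogy can be sketched in plain Ruby. This illustrates the analogy only: real trie fields work on binary prefixes governed by a precision step rather than on decimal digits, and the decimal_prefixes helper is a hypothetical name.

```ruby
# For each precision level, drop that many trailing decimal digits.
# Returns pairs of [digits_dropped, remaining_prefix].
def decimal_prefixes(number)
  unless number.is_a?(Integer) && number >= 0
    raise ArgumentError, 'non-negative integers only'
  end

  digits = number.to_s.length
  (0...digits).map { |dropped| [dropped, number / (10**dropped)] }
end

terms = decimal_prefixes(4273)
# terms is [[0, 4273], [1, 427], [2, 42], [3, 4]]
```

With terms like these in the index, a range query for 4000 to 4999 can match the single coarse term [3, 4] instead of checking a thousand exact values, which is why range searches get faster.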