What follows are some general recommendations on how to improve your search. Some tips offer performance benefits; others produce a better search index. You should evaluate these options for yourself and pick the ones that will work best for you. Not all situations are created equal, and many of these options could be considered mandatory in some cases and unnecessary premature optimizations in others. Your mileage may vary.
Most search engines work best when they're given corpuses with predominantly text (as opposed to other data like dates, numbers, etc.) in decent quantities (more than a couple words). This is in stark contrast to the databases most people are used to, which rely heavily on non-text data to create relationships and for ease of querying.
To this end, if search is important to you, you should take the time to carefully craft your SearchIndex subclasses to give the search engine the best information you can. This isn't necessarily hard, but it is worth the investment of time and thought. Assuming you've only ever used the BasicSearchIndex, there are some easy improvements you can make when creating custom SearchIndex classes that will make your search better:
- For your document=True field, use a well-constructed template.
- Add fields for data you might want to be able to filter by.
- If the model has related data, you can squash good content from those related models into the parent model's SearchIndex.
- Similarly, if you have heavily de-normalized models, they may be best represented by a single indexed model rather than many indexed models.
A relatively unique concept in Haystack is the use of templates associated with SearchIndex fields. These are data templates: they will never be seen by users and ideally contain no HTML. They are used to collect various data from the model and structure it as a document for the search engine to analyze and index.
Note
If you read nothing else, this is the single most important thing you can do to make search on your site better for your users. Good templates can make or break your search and providing the search engine with good content to index is critical.
Good templates structure the data well and incorporate as much pertinent text as possible. This may include additional fields such as titles, author information, metadata, tags/categories. Without being artificial, you want to construct as much context as you can. This doesn't mean you should necessarily include every field, but you should include fields that provide good content or include terms you think your users may frequently search on.
Unless you have very unique numbers or dates, neither of these types of data is a good fit within templates. They are usually better suited to other fields for filtering within a SearchQuerySet.
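As a rough illustration of what a document template is meant to produce, the sketch below assembles the text-rich fields of a record into one plain-text document and leaves the date out. It uses plain Python in place of Django's template engine, and the `Note` class, its fields, and `render_document` are hypothetical names chosen for the example, not part of Haystack.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """Hypothetical stand-in for a Django model instance."""
    title: str
    author: str
    body: str
    tags: list = field(default_factory=list)
    pub_date: str = ""  # kept off the document; better as a filter field


def render_document(note):
    """Join the text-rich fields, one per line, with no HTML."""
    parts = [note.title, note.author, " ".join(note.tags), note.body]
    # Skip empty fields so the document carries no blank lines.
    return "\n".join(part for part in parts if part)


note = Note(
    title="Haystack tips",
    author="jane",
    body="Good templates make search better.",
    tags=["search", "django"],
    pub_date="2009-11-01",
)
document = render_document(note)
```

The title, author, and tags add context around the body text, while the date stays out of the document entirely.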
Documents by themselves are good for generating indexes of content but are generally poor for filtering content, for instance, by date. All search engines supported by Haystack provide a means to associate extra data as attributes/fields on a record. The database analogy would be adding extra columns to the table for filtering.
Good candidates here are date fields, number fields, de-normalized data from related objects, etc. You can expose these things to users in the form of a calendar range to specify, an author to look up or only data from a certain series of numbers to return.
You will need to plan ahead and anticipate what you might need to filter on, though with each field you add, you increase storage space usage. It's generally NOT recommended to include every field from a model, just ones you are likely to use.
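To make the "extra columns" analogy concrete, here is a minimal sketch in plain Python. Each record carries its document text plus two extra fields, and a helper narrows the records by author and date range. The record layout and `filter_records` are assumptions made for illustration; in practice the search engine would perform this filtering.

```python
from datetime import date

# Hypothetical indexed records: document text plus extra filterable fields.
records = [
    {"text": "Intro to search", "author": "jane", "pub_date": date(2009, 3, 1)},
    {"text": "Advanced search", "author": "sam", "pub_date": date(2009, 9, 15)},
    {"text": "Search at scale", "author": "jane", "pub_date": date(2010, 1, 20)},
]


def filter_records(records, author=None, start=None, end=None):
    """Narrow records by author and/or an inclusive date range."""
    matches = []
    for record in records:
        if author is not None and record["author"] != author:
            continue
        if start is not None and record["pub_date"] < start:
            continue
        if end is not None and record["pub_date"] > end:
            continue
        matches.append(record)
    return matches


# Everything written by "jane" during 2009.
jane_2009 = filter_records(
    records, author="jane", start=date(2009, 1, 1), end=date(2009, 12, 31)
)
```

Only the fields you expect to filter on (here, author and publication date) are stored alongside the text, which keeps the extra storage cost small.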
Related data is somewhat problematic to deal with, as most search engines are better with documents than they are with relationships. One way to approach this is to de-normalize a related child object or objects into the parent's document template. The inclusion of a foreign key's relevant data or a simple Django {% for %} templatetag to iterate over the related objects can increase the salient data in your document. Be careful what you include and how you structure it, as this can have consequences on how well a result might rank in your search.
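The de-normalization described here can be sketched in plain Python, with an ordinary loop standing in for the template's loop over related objects. The author/book data and `render_author_document` are hypothetical and exist only for this example.

```python
def render_author_document(author):
    """Fold each related book's title and summary into the parent's text."""
    lines = [author["name"], author["bio"]]
    # Plays the role of a template loop over the related objects.
    for book in author["books"]:
        lines.append("{}: {}".format(book["title"], book["summary"]))
    return "\n".join(lines)


author = {
    "name": "Jane Example",
    "bio": "Writes about information retrieval.",
    "books": [
        {"title": "Indexing Basics", "summary": "How inverted indexes work."},
        {"title": "Ranking", "summary": "Scoring and relevance."},
    ],
}
document = render_author_document(author)
```

Every child that is folded in adds terms to the parent's document, so a parent with many children may match more queries than intended. That is the ranking side effect to watch for.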
A very easy but effective thing you can do to drastically reduce hits on the database is to pre-render your search results using stored fields and then disable the load_all aspect of your SearchView.
Warning
This technique may cause a substantial increase in the size of your index as you are basically using it as a storage mechanism.
To do this, you set up one or more stored fields (indexed=False) on your SearchIndex classes. You should specify a template for the field, filling it with the data you'd want to display on your search results pages. When the model attached to the SearchIndex is placed in the index, this template will get rendered and stored in the index alongside the record.
Note
The downside of this method is that the HTML for the result will be locked in once it is indexed. To make changes to the structure, you'd have to reindex all of your content. It also limits you to a single display of the content (though you could use multiple fields if that suits your needs).
The second aspect is customizing your SearchView and its templates. First, pass load_all=False to your SearchView, ideally in your URLconf. This prevents the SearchQuerySet from loading all model objects for the results ahead of time. Then, in your template, simply display the stored content from your SearchIndex as the HTML result.
Warning
To do this, you must absolutely avoid using {{ result.object }} or any further accesses beyond that. That call will hit the database, not only nullifying your work on lessening database hits, but actually making it worse, as there will now be at least one query for each result, up from a single query for each type of model with load_all=True.
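The trade-off can be seen in a small simulation, written in plain Python with a counter standing in for database queries. The `DB` dictionary, `load_from_db`, and `build_index` are hypothetical stand-ins, not Haystack APIs. The point is only that displaying stored, pre-rendered content costs no queries, while touching each result's object costs one query per result.

```python
import html

# Hypothetical "database" and a counter for how often it is queried.
DB = {
    1: {"title": "First <note>", "body": "Hello"},
    2: {"title": "Second note", "body": "World"},
}
db_hits = 0


def load_from_db(pk):
    """Simulate one database query for one row."""
    global db_hits
    db_hits += 1
    return DB[pk]


def build_index():
    """Index time: render each record's result HTML once and store it."""
    index = []
    for pk in DB:
        row = load_from_db(pk)
        rendered = "<h2>{}</h2><p>{}</p>".format(
            html.escape(row["title"]), html.escape(row["body"])
        )
        index.append({"pk": pk, "rendered": rendered})
    return index


index = build_index()
hits_after_indexing = db_hits

# Display time: use only the stored field, so no queries are issued.
page = "\n".join(entry["rendered"] for entry in index)
hits_after_display = db_hits

# Anti-pattern: touching the underlying object costs one query per result.
objects = [load_from_db(entry["pk"]) for entry in index]
hits_after_object_access = db_hits
```

In this simulation the hit counter does not move while the page is built from stored content, and it grows by one for every result once the underlying objects are accessed.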
If your site sees heavy search traffic and up-to-date information is very important, Haystack provides a way to constantly keep your index up to date. By using the RealTimeSearchIndex class instead of the SearchIndex class, Haystack will automatically update the index whenever a model is saved/deleted.
You can find more information within the :doc:`searchindex_api` documentation.
By default, content must be reindexed manually. When real-time updates are enabled, however, Haystack immediately tries to merge each change into the search index. If you have a write-heavy site, this could mean your search engine may spend most of its time churning on constant merges. If you can afford a small delay between when a model is saved and when it appears in the search results, queuing these merges is a good idea.
You gain a snappier interface for users as updates go into a queue (a fast operation) and then typical processing continues. You also get a lower churn rate, as most search engines deal with batches of updates better than many single updates. You can also use this to distribute load, as the queue consumer could live on a completely separate server from your webservers, allowing you to tune more efficiently.
Implementing this is relatively simple. There are two parts: creating a new QueuedSearchIndex class and creating a queue processing script to handle the actual updates.
For the QueuedSearchIndex, simply inherit from the SearchIndex provided by Haystack and override the _setup_save/_setup_delete methods. These methods usually attach themselves to their model's post_save/post_delete signals and call the backend to update or remove a record. You should override this behavior and place a message in your queue of choice. At a minimum, you'll want to include the model you're indexing and the id of the model within that message, so that you can retrieve the proper index from the SearchSite in your consumer. Then alter all of your SearchIndex classes to inherit from this new class. Now all saves/deletes will be handled by the queue and you should receive a speed boost.
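The producer side might look roughly like the following. This is a standard-library sketch and not Haystack code: `QueuedIndexSketch` stands in for the SearchIndex subclass, `queue.Queue` stands in for a real message queue, and `enqueue_save`/`enqueue_delete` are hypothetical handler names for what the overridden setup methods would connect to the signals.

```python
import queue

# Stand-in for a real message queue.
update_queue = queue.Queue()


class QueuedIndexSketch:
    """Stand-in for a SearchIndex subclass; not Haystack's real class."""

    def __init__(self, model_label):
        self.model_label = model_label

    # Hypothetical handlers: the overridden _setup_save/_setup_delete would
    # connect these to post_save/post_delete instead of the backend calls.
    def enqueue_save(self, instance, **kwargs):
        # The message carries the model and the id, nothing more.
        update_queue.put(("update", self.model_label, instance["id"]))

    def enqueue_delete(self, instance, **kwargs):
        update_queue.put(("delete", self.model_label, instance["id"]))


index = QueuedIndexSketch("notes.note")
index.enqueue_save({"id": 1})      # simulates a post_save signal
index.enqueue_delete({"id": 2})    # simulates a post_delete signal

messages = [update_queue.get_nowait() for _ in range(update_queue.qsize())]
```

Keeping the message down to an action, a model label, and an id makes the enqueue step cheap, which is where the faster save path comes from.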
For the consumer, this is much more specific to the queue used and your desired setup. At a minimum, you will need to periodically consume the queue, fetch the correct index from the SearchSite for your application, load the model from the message and pass that model to the update_object or remove_object methods on the SearchIndex. Proper grouping, batching and intelligent handling are all additional things that could be applied on top to further improve performance.
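A consumer along these lines could be sketched as below, again using only the standard library. `RecordingIndex`, the `INDEXES` lookup, and `load_instance` are hypothetical stand-ins for the real index, the index lookup on the SearchSite, and the model load. The message format (action, model label, id) is also an assumption.

```python
import queue


class RecordingIndex:
    """Stand-in for a SearchIndex; records calls instead of hitting a backend."""

    def __init__(self):
        self.calls = []

    def update_object(self, instance):
        self.calls.append(("update", instance["id"]))

    def remove_object(self, instance):
        self.calls.append(("remove", instance["id"]))


# Hypothetical lookup from model label to its index.
INDEXES = {"notes.note": RecordingIndex()}


def load_instance(label, pk):
    """Hypothetical model load; a real consumer would query the database."""
    return {"id": pk}


def consume(q):
    """Drain the queue once and return the number of messages handled."""
    handled = 0
    while True:
        try:
            action, label, pk = q.get_nowait()
        except queue.Empty:
            return handled
        index = INDEXES[label]
        instance = load_instance(label, pk)
        if action == "update":
            index.update_object(instance)
        else:
            index.remove_object(instance)
        handled += 1


q = queue.Queue()
q.put(("update", "notes.note", 1))
q.put(("delete", "notes.note", 2))
handled = consume(q)
calls = INDEXES["notes.note"].calls
```

One detail this sketch glosses over: a deleted row can no longer be loaded from the database, so a real consumer would likely need to handle removals differently, for example by passing a minimal object that carries only the identifier.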