
Sphinx Module

Maintainer Contact

  • Hamish Friedlander
  • Mark Stephens

Requirements

  • SilverStripe 2.4 or newer
  • sphinx binary installation 0.9.9-rc2 or greater with 64-bit document id

Alternative fulltext search modules

This module may in the future be replaced by the fulltext search module, currently in development at https://github.com/silverstripe-labs/silverstripe-fulltextsearch

In particular, for projects with very large amounts of data to be stored in the index, or projects that will be hosted in Windows environments, the fulltext search module with its Solr connector is likely to be a better solution.

The sphinx connector for the fulltext search module is not yet developed as of this writing, but once that has happened this module is likely to be deprecated.

Licensing

The file sphinxapi.php is the official PHP client for Sphinx, and is distributed under the GPL. We have email confirmation from the Sphinx developers that we may include sphinxapi.php in SilverStripe under a license similar to Sun's MySQL FOSS exception - that is, the file itself remains under GPL, but distribution within SilverStripe is OK without SilverStripe needing to be distributed under the GPL.

Thanks to Andrew Aksyonoff and everyone involved in Sphinx

Quick Usage Overview

  • Compile / Install the Sphinx searchengine

  • Add the SphinxSearchable decorator to any DataObjects you want to search (for example, SiteTree and File)

  • Use SphinxSearchForm instead of SearchForm (if you use it in the default configuration, both SiteTree and File must have been decorated with SphinxSearchable).

For more complex searching, use SphinxSearch::search.

Installation

Install Sphinx Binaries

You will need to install sphinx 0.9.9 or higher on your environment. Sphinx is not distributed with the SilverStripe sphinx module.

The sphinx binaries should be compiled with 64-bit IDs enabled. If the xmlpipes mode is going to be used, the requisite xml libraries are also required.

Ensure the searchd search daemon process is set up to start when the computer reboots, running as the same user as apache (e.g. www-data).

On OS X, you'll probably need to raise the open file limit for more complex sites (see http://serverfault.com/questions/15564/where-are-the-default-ulimits-specified-on-os-x-10-5)

Install and Configure the SilverStripe Sphinx Module

To install the module source, extract the module into the root directory of your project (or add to the svn externals of the project). If you want near-realtime delta-indexing outside the user transaction (recommended), also install the messagequeue module. Sphinx will automatically detect and use it by default.

To configure, you will need to apply the SphinxSearchable decorator to the classes that you want Sphinx to index, and set any additional options on those classes. See the section "Applying the Decorator" for more information on ways to configure indexing.

If you are using MAMP:

  • Make sure you are using a TCP socket with searchd, not a unix socket - i.e. put this in your _ss_environment: define('SS_SPHINX_TCP_PORT', 5000);
  • Make sure your database string in _ss_environment has the port number included i.e. define('SS_DATABASE_SERVER', 'localhost:8889');

If you are using Windows: Sphinx runs as a service on Windows. This service needs to be controllable by the user the webserver acts as. This module provides a Sphinx/install task which will install and adjust the permissions of a sphinx service as required by the most common configuration under Windows.

Refresh Configuration and Reindex

/dev/build the project. This will update the database structure, and also generate the sphinx configuration file from the decorated classes. This needs to be done any time there is a change to which classes are decorated, if the class hierarchy is changed in a way that affects what is indexed, if other changes are made to indexed classes, or if the sphinx configuration static properties are changed. You can use the command: sapphire/sake dev/build to do this.

Ensure the sphinx directory and its contents are owned by the same user as the sphinx process.

The command: sapphire/sake Sphinx/reindex can be used to force Sphinx to refresh its indexes. Note that the sphinx daemon may take a little time to rotate the set of indexes it uses. This happens automatically.

Set Up Periodic Reindexing

Set up a cron job to run as the apache user and issue the command: sapphire/sake Sphinx/reindex. The effect of this is to cause Sphinx to completely rebuild its primary indexes, and clear the delta indexes. If this is not set up, the delta indexes will tend to increase in size as indexed content is changed, and will increasingly degrade system performance. Depending on the nature and size of the content, the cron job is typically set up to run at an interval of anywhere from 15 minutes to 24 hours.

Windows

The install process for Windows is slightly different, as searchd is installed as a service.

  • Use 0.9.9 ID-64 binary installer for Windows from sphinxsearch.com. Install this in c:\sphinx
  • Install the module, but don't do a dev/build or Sphinx/configure yet.
  • Configure your project to use TCP/IP connections to searchd, rather than unix sockets (the default, not supported under Windows). To do this, just set Sphinx::$tcp_port to a TCP/IP port in mysite/_config.php
  • Run dev/build, which will also generate the sphinx configuration.
  • Install searchd as a service, using the following command:

cd c:\sphinx\bin
searchd --install --config tmpdir/sphinx/sphinx.conf --servicename SphinxFoo

  • You'll need to change tmpdir above to be the path to the temporary folder for the project. It is also recommended to change the service name to something appropriate for the project. It is possible to run more than one Sphinx searchd service on the same server for different projects, but they require different service names. Note that if the location of the temporary directory changes, you will need to manually stop and uninstall the searchd service, and re-install with the new config file.
  • Decorate classes as required.
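
The TCP configuration step above might look like the following in mysite/_config.php. This is a minimal sketch: the port number 9312 is only an example value, and any port that is free on the server can be used.

```php
<?php
// mysite/_config.php
// Connect to searchd over TCP/IP instead of a unix socket.
// 9312 is an example value, not a requirement of the module.
Sphinx::$tcp_port = 9312;
```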

If the searchd port changes, you'll need to stop the service, do Sphinx/configure, and restart it.

PHP binary should be on the path.

From a permissions perspective, the IIS user needs to have full access to the temp directory. Alternatively, create a silverstripe-cache directory in the top level of your project, and SilverStripe will use that instead.

Comments

  • start and stop do nothing.
  • status assumes the service is running.
  • reindex doesn't report indexer errors, as Windows doesn't capture its stdout. To debug, run the indexer command manually.

Applying the Decorator

The following example shows how the Sphinx decorator is applied to cause indexing of a class.

    class MyPage extends Page {
        static $db = array( ... );

        static $extensions = array(
            'SphinxSearchable'
        );

        static $sphinx = array(
            "search_fields" => array("a", "b", "c"),
            "filter_fields" => array("a", "c"),
            "index_filter" => '"ShowInSearch" = 1',
            "sort_fields" => array("Title"),
            "extra_fields" => array("_contenttype" => "Page::getComputedValue"),
            'filterable_many_many' => '*',
            'extra_many_many' => array(
                'documents' => 'select (' . SphinxSearch::unsignedcrc('SiteTree') . '<<32) | PageID AS id, DocumentID AS Documents FROM Page_Documents'),
            "mode" => "xmlpipe",
            "external_content" => array("field" => array("myclass", "somefunc"))
        );
    }

This will mark the class and all sub-classes for indexing.

Properties of the $sphinx Static

  • search_fields - an array of fields in this class to be indexed. If $excludeByDefault (see below) is false and search_fields is not supplied, all fields are indexed. If $excludeByDefault is true, this must be supplied.
  • filter_fields - an array of fields in this class that can be filtered. If $excludeByDefault is false and filter_fields is not supplied, all non-string fields will be made filterable (including from has_one relationships). If $excludeByDefault is true, this must be supplied to have fields made filterable.
  • index_filter - A SQL where clause to filter the index by; it will be impossible to search for anything that does not meet these criteria.
  • sort_fields - an array of fields that can be sorted on. Similar to filter_fields in that all fields listed here are added as filters, but this can include string fields. You can only sort on:
    • special attributes created by sphinx
    • non-string fields that are in filter_fields (or any non-string field if $excludeByDefault is false)
    • string fields that are explicitly defined in sort_fields. Because Sphinx cannot filter on string fields, special behaviour is implemented to create proxy int filter fields, which are then sorted more accurately once the result is returned from the sphinx process.
  • extra_fields - defines extra fields into the main SQL used for generating indexes, and includes them as attributes in the index. The value can be a SQL expression, but can also be of the form "class::method" which is called using call_user_func to get the value. The resulting value should be a SQL expression that returns an int.
  • filterable_many_many - an array of many-many relationship names, or '*' for all many-many relationships on this class. These are added as filters to the index.
  • extra_many_many - this allows injection of many-many attributes, bypassing sapphire's automatic generation of SQL for the relationship. This is specifically useful for working around an issue with sapphire 2.4, which generates ANSI-compliant SQL statements. These fail in sphinx indexing if the database server is not set to ANSI-compliant mode, because there is no way to control ANSI mode for the query that retrieves many-many data in sphinx (a bug in sphinx: http://www.sphinxsearch.com/bugs/view.php?id=394). This is not an issue if mode is 'xmlpipe'. The example shows a MySQL expression (the shift operator does not work with all database platforms).
  • mode - this determines the mode used by the sphinx indexer to retrieve data. One of the values:
    • 'sql' (default) - SQL statements are used. The statements are written to the sphinx.conf file. Indexer handles database connections itself.
    • 'xmlpipe' - the indexer runs a command that invokes the SphinxXMLPipe controller to get the data to index as an XML stream. Experimental at this stage (well, more experimental than SQL). This is likely to be slower than SQL indexing. Has the advantage that content outside the database can be included in the index.
  • external_content provides a hook to provide additional content to add to the search index. The value provided is passed as the function argument to call_user_func, so can be a function, array($instance, $functionName) for an instance method, or array($className, $functionName) for a static function. NOTE: This is only applicable when mode is 'xmlpipe'. It is ignored if mode is 'sql'. The function is called with a single parameter, the ID of the decorated instance.

Parameters of the Constructor

  • excludeByDefault (default false) When false, all properties on sub-classes are automatically indexed, and all non-string fields are made filterable. This gives maximum searchability at the cost of potentially increasing the number of indices, and increasing the memory footprint of searchd. If this is set to true, fields in subclasses are excluded from indexing unless the sub-class specifically defines search_fields and filter_fields. e.g.

    static $extensions = array('SphinxSearchable(true)'); // disable automatic inclusion of all subclass fields

Many-Many and MVA in Sphinx

Sphinx supports a feature called Multi-Value Attributes, in which a named attribute stored against an indexed document can have zero or more values, as compared to ordinary attributes that have one value. MVAs are good for representing relationships to other documents, as well as tags and categories where a document can have many.

In the Sphinx module, these are declared in one of two ways:

  • filterable_many_many is suitable when the IDs related to are stored in a many_many relationship in the ORM. In this case, the module will automatically define the MVAs for you.

  • extra_many_many is used when there are multiple values per document, but the data does not come from a many_many relationship.

extra_many_many is an array that maps attribute names to the sources of values for that attribute. The source can be one of two things: a SQL query or a callback.

Consider this example:

    static $sphinx = array(
        ...
        "extra_many_many" => array(
            "attr1" => 'select (' . SphinxSearch::unsignedcrc('SiteTree') . '<<32) | PageID AS id, DocumentID AS attr1 FROM Page_Documents',
            "attr2" => array("Page", "get_attr2_values")
        )
        ...
    );

This defines two MVA attributes, "attr1" which is defined by SQL, and "attr2" which is defined by callback.

SQL MVAs must be a SQL statement that returns two columns, 'id' and a column of the same name as the attribute. The 'id' column will be a 64-bit sphinx document ID that identifies the document the MVA is attached to. The other column defines one of the MVA values for that document. This query is executed once when indexing, and will return MVAs for all documents.

The callback variant works a little differently. For each document that is being indexed, the callback is called with the data object ID passed in. This function should return an array of int values which are the attribute values for that document, or null if the document has no values for that attribute.
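
A callback for an attribute such as "attr2" could be sketched as follows. The class name, the in-memory lookup table and its values are hypothetical stand-ins for whatever data source the project really uses; only the calling convention (ID in, array of ints or null out) comes from the description above.

```php
<?php
// Hypothetical class used only for illustration.
class ExamplePage
{
    // Hypothetical data: page ID => related category IDs.
    private static $categoriesByPage = array(
        1 => array(10, 11),
        2 => array(),
    );

    // Called once per indexed document with the data object ID.
    // Returns an array of ints, or null when there are no values.
    public static function get_attr2_values($id)
    {
        if (empty(self::$categoriesByPage[$id])) {
            return null;
        }
        return array_map('intval', self::$categoriesByPage[$id]);
    }
}
```

With this class, the attribute would be declared as "attr2" => array("ExamplePage", "get_attr2_values").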

Notes on How Indexes are Constructed

The Sphinx module automatically determines the indexes required for a given set of classes. It calculates a signature for each searchable class that incorporates the fields to be indexed, extra fields and attributes for filtering. Classes with the same signature are combined into a single index.

For example, consider a class A that is decorated with SphinxSearchable, with subclasses B and C. B does not have any additional indexable fields, but C does. Objects of class B will be put in the same index as objects of class A, but a separate index will be constructed for class C.
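
The grouping rule can be illustrated with a small standalone sketch. It is not the module's implementation: the function name, the shape of the input array and the use of md5 over the sorted field list are all assumptions made for the example.

```php
<?php
// Illustration only: group classes whose indexed fields are identical.
// $fieldsByClass maps a class name to the list of fields it would index.
function group_classes_by_signature(array $fieldsByClass)
{
    $indexes = array();
    foreach ($fieldsByClass as $class => $fields) {
        sort($fields);                           // field order must not matter
        $signature = md5(implode(',', $fields)); // assumed signature scheme
        $indexes[$signature][] = $class;
    }
    return array_values($indexes);               // one entry per index
}

// A and B index the same fields; C adds one of its own.
$groups = group_classes_by_signature(array(
    'A' => array('Title', 'Content'),
    'B' => array('Content', 'Title'),
    'C' => array('Title', 'Content', 'Summary'),
));
// $groups is array(array('A', 'B'), array('C'))
```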

Triggers for Reindexing

Sphinx maintains two sets of indexes, primary and deltas. A primary index will contain all indexed objects (for the classes for that index), whereas the delta index only contains recently changed objects. Each time an indexed object is changed, the SphinxSearchable decorator invalidates that object in the primary index, and re-indexes the delta index to pick up just those that have changed since the primary was last rebuilt.

As changes occur, the delta index will grow, and will progressively get slower to index, until the primary index is rebuilt and the delta index cleared.

The primary index is only rebuilt as a result of calling Sphinx::reindex() (Sphinx class is a controller, so accessing Sphinx/reindex will do this). This is typically set up as a cron job.

The reindexing of deltas is controlled by the static variable SphinxSearchable::$reindex_mode:

  • If "endrequest" (the default), reindexing is done once at the end of the PHP request, and only if a write() or delete() has been done (any operation which flags the record dirty). If the messagequeue module is installed and SphinxSearchable::$reindex_queue is specified, a message is sent to do the refresh, to keep it out of the user process. Otherwise it is done in this process but at the end of the PHP request (this will be noticeable to the user).
  • If "write" (old behaviour), reindexing is done on write or delete.
  • If "disabled", reindexing of the delta is disabled, which is useful when writing many SphinxSearchable items (such as during a migration import) where the burden of keeping the Sphinx index updated in realtime is both unnecessary and prohibitive.

Note that the default configuration of messagequeue will execute the delta reindexing in a separate process initiated as part of PHP shutdown. The effect is the delta is reindexed in near-realtime, but without the user experiencing the delay. If xmlpipes is used, messagequeue is highly recommended.
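
For a bulk import, the "disabled" mode described above could be applied as in this sketch. Only SphinxSearchable::$reindex_mode and its values come from the text; where the import runs and how the records are written are left as placeholders.

```php
<?php
// Remember the current mode so it can be restored afterwards.
$previousMode = SphinxSearchable::$reindex_mode;

// Stop delta reindexing while many records are written.
SphinxSearchable::$reindex_mode = "disabled";

// ... write the imported SphinxSearchable objects here ...

// Restore normal behaviour, then rebuild the indexes,
// for example with: sapphire/sake Sphinx/reindex
SphinxSearchable::$reindex_mode = $previousMode;
```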

Indexing Content of Files

The Sphinx module can be configured to index file contents. This optional feature relies on extractor classes that use external tools to get the text for a file. This module currently provides two file extractor classes:

  • PDFTextExtractor - uses pdftotext utility
  • HTML extractor - uses internal striptags method to crudely get content

Other extractors can be added by defining subclasses of FileTextExtractor.

Configuration

Add the extension in your mysite/_config.php:

DataObject::add_extension('File', 'FileTextExtractable');

Indexing File via another Class (e.g. a Document class)

On your document class (which is assumed to contain a has_one relationship to File), configure sphinx to index it. The key properties are setting the mode to "xmlpipe" and external_content to point to a function in the class that retrieves the content. In this case, the function simply calls extractFileAsText() (in the decorator) on the related file object.

    static $sphinx = array(
        "search_fields" => array("Title", "Description"),
        "mode" => "xmlpipe",
        "external_content" => array("file_content" => array("Document", "getDocumentContent"))
    );

    static function getDocumentContent($documentId) {
        $doc = DataObject::get_by_id("Document", $documentId);
        if (!$doc || !$doc->File()) return "";
        return $doc->File()->extractFileAsText();
    }

Generally File should not be directly indexed, as this provides no control over which files are indexed.

Doing a search

SphinxSearch::search() actually performs a search. e.g.:

    $res = SphinxSearch::search(
        array('Page', 'Document'),
        $queryString,
        array(
            'require' => $includeFilters,
            'exclude' => $excludeFilters,
            'page' => $page,
            'pagesize' => $pagesize,
            'sortmode' => $sortmode,
            'sortarg' => $sortarg,
            'field_weights' => $fieldWeights,
            'suggestions' => true
        )
    );

The parameters are:

  • An array of classes to search. Subclasses are also searched. The set of indexes is automatically determined.
  • The query string itself, which is just a string with space-separated words and other tokens to be interpreted by sphinx. See the sphinx documentation for the available options.
  • An options array

Available options are:

  • 'require' - an array of inclusion filters that are passed to Sphinx. Results will match these filters.
  • 'exclude' - an array of exclusion filters that are passed to Sphinx. Results will not match these filters.
  • 'page' - in a multi-page result, this is the page of results
  • 'pagesize' - the number of results to return in each page.
  • 'sortmode' - the sorting mode to use.
  • 'sortarg' - an argument to the sorting mode.
  • 'suggestions' - if true, alternative spelling suggestions are returned if there are fewer than 10 results. If false, suggestions are never returned.
  • 'field_weights' - if provided, a map of searchable text field names to integer weights. See the Sphinx documentation for how weighting works before setting values.

The result is an associative array with the following keys:

  • 'Matches' - a DataObjectSet with the search results.
  • 'Suggestions' - if suggestions are enabled and there are fewer than 10 results, this will be an array of possible values.
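
Consuming the result could look like the following sketch. The helper name is hypothetical, and it relies only on the two keys listed above. It treats 'Matches' as anything that can be iterated with foreach, so it can be tried with a plain array as well as a DataObjectSet.

```php
<?php
// Hypothetical helper: reduce a search result to a few display values.
function summarise_search_result(array $result)
{
    $count = 0;
    if (isset($result['Matches'])) {
        foreach ($result['Matches'] as $match) {
            $count++;
        }
    }
    // 'Suggestions' may be absent, so fall back to an empty list.
    $suggestions = isset($result['Suggestions'])
        ? (array) $result['Suggestions']
        : array();

    return array('count' => $count, 'suggestions' => $suggestions);
}
```

A template or controller could then show the suggestions only when the count is low, which mirrors the threshold the module itself applies.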

Managing Larger Configurations

Sphinx performance and resource usage are affected by a number of factors, including:

  • Number of indices
  • Whether or not deltas are used
  • The number of attributes in the indices. These are kept in searchd memory.

Ways to control these factors include:

  • Attach the decorator at a deeper level in the class tree. e.g. instead of decorating Page, decorate specific subclasses of Page.
  • Use the excludeByDefault option on the constructor, and explicitly control the search, filter and order fields on the class.
  • For classes that change very infrequently, or are small, consider disabling delta indices.
  • For versioned pages, if search is not required in the CMS, consider explicit control over the stages that are indexed. (e.g. only index Live if searching is only enabled at the front end.)

Best Practices

  • On larger sites, don't make SiteTree or Page searchable directly, but create sub-classes of Page and decorate those. This provides better control over what classes are indexed (e.g. a Page derivative whose content is to summarise other content or pages will probably not want to be indexed).
  • If files need to be indexed, consider sub-classing File rather than decorating it directly, as this will cause overhead in attempting to index non-indexable files such as images, or use the index_filter property to be selective on which files are indexed.

API

There are six parts to this module:

  1. Sphinx - manages the searchd and indexer binaries, generating configuration files for them, starting and stopping them as needed. Used as both a controller and a singleton.

  2. SphinxSearch - handles performing actual searches

  3. SphinxSearchable - decorator that handles introspection of DataObjects, and adds a couple of utility functions to DataObjects

  4. SphinxVariants - handles altering index building and searching to account for things like Versioned and Subsites adjusting the access semantics

  5. SphinxSearchForm - provides a SearchForm-a-like, using Sphinx instead of database specific full-text search

  6. Spell - interfaces with pspell for the 'Did you mean..' functionality

SilverStripe to Sphinx mapping

Sphinx, SilverStripe and Global IDs

Sphinx needs globally unique IDs for all documents. SilverStripe only provides unique IDs within a model inheritance chain - that is, classes that are direct descendants of DataObject have a unique set of IDs with their children, but not with each other.

In order to provide globally unique IDs to Sphinx, this module uses a 64 bit globally unique ID, the high dword being the 32 bit CRC of the ClassInfo::baseTableName (equivalent to the ClassName of the direct descendant of DataObject), and the low dword the 32 bit row ID. This has these caveats:

  • Sphinx must be compiled in 64 bit document ID mode.
  • There must be no collision in the set of CRC32(ClassName). This is checked on config building, and the chances of collision are very small.

Additionally, the original row ID, as well as the CRC of the baseTableName and the CRC of the ClassName are available in the attributes _id, _baseid and _classid respectively. These are useful to avoid having to do 64 bit math, and for class-specific filtering, such as is needed by SphinxVariants_Subsite to filter differently depending on the class.
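
The ID layout can be sketched in plain PHP. This assumes a 64-bit PHP build and uses PHP's built-in crc32(); whether that matches the module's own CRC helper exactly is an assumption, and the function names are hypothetical.

```php
<?php
// Compose a 64-bit document ID: high dword = CRC32 of the base table
// name, low dword = row ID. Requires 64-bit PHP integers.
function example_global_id($baseTableName, $rowId)
{
    $high = crc32($baseTableName) & 0xFFFFFFFF;
    $low  = $rowId & 0xFFFFFFFF;
    $id   = ($high << 32) | $low;
    // PHP integers are signed, so format the value as unsigned.
    return sprintf('%u', $id);
}

// Split a composed (signed) 64-bit value back into its two halves.
function example_split_id($id)
{
    return array(
        'baseid' => ($id >> 32) & 0xFFFFFFFF,
        'id'     => $id & 0xFFFFFFFF,
    );
}
```

Masking after the right shift matters because PHP's >> is an arithmetic shift, which would otherwise carry the sign bit into the high half.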

Advanced Configuration

By default, the sphinx module will generate sensible default values in sphinx.conf. You can, however, override these. It is recommended that you read the sphinx documentation and sphinx module code to ensure you understand the effects of overriding values.

Indexer Options

Indexer options can be overridden as follows in your mysite/_config.php:

    Sphinx::set_indexer_options(array(
        "mem_limit" => "512M",
        "write_buffer" => "4M"
    ));

searchd Options

searchd options can be overridden as follows in your mysite/_config.php:

    Sphinx::set_searchd_options(array(
        "max_children" => 10
    ));

Warning: the following searchd options are typically dynamically calculated by the sphinx module, but you can override with the above function. Only do this if you understand what you are doing, otherwise sphinx module may not function properly:

  • listen
  • pid_file
  • log
  • query_log

Index Options

The sphinx module automatically generates index (and source) sections in the configuration. There are two types of index sections, each of which has the same setting options but are set differently.

Base Index Options

The base index definition in sphinx.conf contains settings that are inherited by all other index definitions.

Base index definitions can be set as follows:

Sphinx::set_base_index_options(array(
    "enable_star" => 1,
    "min_prefix_len" => 3
));

Unit Test Behaviour and Features

Unit tests for sphinx are written to use a fake sphinx client, so that automated testing environments don't require a full set up of the sphinx module. When you are writing unit tests for your own code that makes calls to SphinxSearch::search, you can get the fake client to return objects from your test's fixture.

To enable this feature, you need to tell Sphinx to use the fake client, and tell it your test instance so it can retrieve objects from the fixture:

// Grab a singleton instance of the sphinx controller, and tell it to use the fake client with this
// test instance.
static $sphinx = null;
function setUpOnce() {
    self::$sphinx = new Sphinx();
    self::$sphinx->setClientClass("SphinxClientFaker", $this);
}

Within your test, you can call SphinxSearch::search with specially formatted query text. There are two formats you can use:

  • class:(id,id,id...) returns specifically identified objects from the fixture file.
  • class:cond returns objects using DataObject::get.

Example 1:

To get sphinx to return two page objects that are identified in the fixture as 'page1' and 'page2':

$results = SphinxSearch::search(array('Page'),
                            "Page:(page1,page2)",
                            array('suggestions' => false, 'page' => 0, 'pagesize' => 5));

Example 2:

To get sphinx to return all pages where the Title starts with 'Test':

$results = SphinxSearch::search(array('Page'),
                            "Page:\"Title\" like 'Test%'",
                            array('suggestions' => false, 'page' => 0, 'pagesize' => 5));

Note: It is recommended to set the suggestions option to false. Having suggestions enabled may cause errors.

Known Issues

Re-indexing Many-Many Relationships on Write

Currently many-many relationships are not re-indexed on write, as there is no way to reliably detect changes in the components if the decorated object doesn't change. So if changes are made in a M-M, these need to be re-indexed by calling $do->sphinxComponentsChanged() on the decorated instance. This will re-index the object in the delta. Otherwise the M-M changes will be picked up at the next primary re-index.

Slow Saving with XML Pipes

Saving pages that are indexed using XML pipes can be very slow in the CMS. This is due to the relatively high overhead of invoking the framework from the command line interactively in order to re-index the delta. This is compounded by versioning (which doubles the number of writes), cmsworkflow and any other decorator that introduces additional write() calls to the data object.

Troubleshooting

The first thing to do is issue the command: sapphire/sake Sphinx/diagnose on the command line, or point your browser at yoursite/Sphinx/diagnose. This will attempt to find a number of common conditions such as:

  • no classes are decorated with SphinxSearchable
  • the sphinx binaries aren't installed
  • the configuration file or indexes haven't been built
  • indexes are not in the sphinx configuration file that should be, indicating changes to the decorated classes without a dev/build.
  • delta indexes are populated but primaries not, indicating a reindex has never been done.

More tests will be added over time.

The second thing to try is running the command: sapphire/sake Sphinx/reindex verbose=1 or pointing your browser at yoursite/Sphinx/reindex?verbose=1. This will re-generate the indexes, and will output all raw messages from the indexing process. Often errors such as permission problems or incorrect configuration will appear in the output.

Is the Sphinx configuration file being Created?

  • Check permissions

Can the Indexer Build all the Indices?

  • Error is 'No local indices'
  • Check that the classes being searched actually have indices
  • Check the sphinx.conf to ensure config for that class is correct.

Is the Temp Directory Path Short Enough?

When using sockets to connect to the sphinx daemon, the temp path is used as part of the connection string. There is a limit of 128 characters on this string, so the temp folder path name can't be too long. This is typically OK on unix-based systems (e.g. Debian), but is sometimes an issue on OS X installs due to the long default temp path.

If you use _ss_environment.php to configure your environments, you can override SilverStripe's built-in TEMP_FOLDER constant to something shorter:

// Necessary for sphinx module, the default /var/folders/... path is too long
define('TEMP_FOLDER', '/var/tmp');

Errors

"failed to send client protocol version"

This error has been seen intermittently. It has been worked around with a change directly in thirdparty/sphinxapi.php.

Sorting Issues

  • If sorting on a text field, it must be declared as a sortable column

Further Enhancements

Performance of XML Pipes Write

XML pipes performance is not as high as with SQL, and carries additional overhead of invoking a sake instance to run the controller. If the messagequeue module is installed, sphinx will automatically use it and send reindex requests to a queue named by SphinxSearchable::$reindex_queue (default being sphinx_indexing). If the message queue interface that handles this queue is set to "send" => "processOnShutdown" => true (the default interface), then the reindexing requests will be performed on PHP shutdown in a separate process. In this configuration, reindexing of deltas is almost interactive without the user having the penalty. The queueing system can be used in other configurations, for example to offload indexing to another server that shares a database or message queue.

Other

  • Specify weighting per field
  • Allow fields to be hidden
  • Put hasMany attributes back in (was taken out as conflicted with Variant changes)
  • Allow using TCP to connect with searchd, rather than always using unix sockets