Evaluate rules on "startup" #15

Closed
magro opened this issue May 3, 2015 · 3 comments

Comments

@magro
Contributor

magro commented May 3, 2015

Rules should be evaluated when they're loaded/parsed to improve performance at query time.

@magro
Contributor Author

magro commented May 3, 2015

@renekrie Do you have a hint where to start? Then I'd give it a try.

@renekrie
Collaborator

renekrie commented May 4, 2015

The idea would be to partially evaluate queries that constitute the right-hand side of Common Rules, for example, 'personal computers' in

pc =>
    SYNONYM: personal computers

The following information could be loaded and cached on startup:

  • the Lucene query that is created for a given term of the rhs query (per field). This will probably be a simple TermQuery in most cases but it could also be a BooleanQuery in case the analysis chain emits more than one token for the input term.
  • whether the Lucene query has any results (this may be optional information, but at least for TermQueries it is easy to retrieve via the document frequency, DF).

Loading this info on startup would save query time especially if there are many query fields and if a Common Rule adds many query terms. Adding 10 synonyms with 10 query fields would result in 100 additional TermQueries. We would always have to go through the Lucene analysis in order to create these queries - regardless of Solr caching - and some of these TermQueries would never match any document. Doing the analysis on startup and caching the TermQueries (or BooleanQueries) together with the DF information should therefore reduce query execution time later.
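A minimal sketch of that caching idea, in plain Java without any Lucene dependency. Everything here is hypothetical and only for illustration: `TermQueryCacheSketch`, `CachedEntry` and the `docFreqLookup` callback are stand-ins for the real analysis and DF lookup, and skipping entries with DF == 0 is an assumption about how the cached DF would be used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Illustrative only: plain-Java stand-ins, not Querqy or Lucene classes. */
final class TermQueryCacheSketch {

    /** Hypothetical cached entry: a field/term pair plus its document frequency. */
    static final class CachedEntry {
        final String field;
        final String term;
        final int docFreq;

        CachedEntry(String field, String term, int docFreq) {
            this.field = field;
            this.term = term;
            this.docFreq = docFreq;
        }
    }

    private final Map<String, CachedEntry> cache = new HashMap<>();

    // Assumes field names contain no ':'; good enough for a sketch.
    private static String key(String field, String term) {
        return field + ":" + term;
    }

    /** Startup: do the expensive work once per (field, term) pair. */
    void preload(List<String> fields, List<String> terms,
                 ToIntBiFunction<String, String> docFreqLookup) {
        for (String field : fields) {
            for (String term : terms) {
                int df = docFreqLookup.applyAsInt(field, term);
                cache.put(key(field, term), new CachedEntry(field, term, df));
            }
        }
    }

    int size() {
        return cache.size();
    }

    /** Query time: read from the cache and skip entries that cannot match (DF == 0). */
    List<CachedEntry> clausesFor(List<String> fields, List<String> terms) {
        List<CachedEntry> clauses = new ArrayList<>();
        for (String field : fields) {
            for (String term : terms) {
                CachedEntry entry = cache.get(key(field, term));
                if (entry != null && entry.docFreq > 0) {
                    clauses.add(entry);
                }
            }
        }
        return clauses;
    }
}
```

With 10 fields and 10 synonym terms, `preload` fills 100 entries once; at query time only those with a non-zero DF would be turned into clauses.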

Where to start:

I've created a branch 'crpreload' for the development of this feature. In this branch, querqy-core/querqy.trie.TrieMap has already been made an Iterable over its values. TrieMap contains the mapping between an input and the resulting instructions, and it is filled on startup. You can thus iterate over its values (which gives you Instructions objects, which are just lists of Instruction objects) and inspect the instructions to get the rhs queries that you need for the preload (visitor pattern to deal with the different instruction types?).
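A rough sketch of what the visitor approach could look like. The types below (`Instruction`, `InstructionVisitor`, `SynonymInstruction`, `OtherInstruction`) are hypothetical stand-ins written for this example and are not Querqy's actual classes; the `Iterable<List<Instruction>>` parameter only mimics iterating over the TrieMap values.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: these types are hypothetical stand-ins, not Querqy's classes. */
final class InstructionVisitorSketch {

    interface InstructionVisitor {
        void visitSynonym(SynonymInstruction instruction);
        void visitOther(OtherInstruction instruction);
    }

    interface Instruction {
        void accept(InstructionVisitor visitor);
    }

    /** An instruction that carries a right-hand-side query, e.g. "personal computers". */
    static final class SynonymInstruction implements Instruction {
        final String rhsQuery;

        SynonymInstruction(String rhsQuery) {
            this.rhsQuery = rhsQuery;
        }

        @Override
        public void accept(InstructionVisitor visitor) {
            visitor.visitSynonym(this);
        }
    }

    /** Any instruction type that has nothing to preload. */
    static final class OtherInstruction implements Instruction {
        @Override
        public void accept(InstructionVisitor visitor) {
            visitor.visitOther(this);
        }
    }

    /** Walks all instruction lists and collects the rhs queries that need preloading. */
    static List<String> collectRhsQueries(Iterable<List<Instruction>> trieValues) {
        final List<String> rhsQueries = new ArrayList<>();
        InstructionVisitor collector = new InstructionVisitor() {
            @Override
            public void visitSynonym(SynonymInstruction instruction) {
                rhsQueries.add(instruction.rhsQuery);
            }

            @Override
            public void visitOther(OtherInstruction instruction) {
                // nothing to preload for this instruction type
            }
        };
        for (List<Instruction> instructions : trieValues) {
            for (Instruction instruction : instructions) {
                instruction.accept(collector);
            }
        }
        return rhsQueries;
    }
}
```

The visitor keeps the type dispatch in one place, so supporting another instruction type would mean adding one `visit` method rather than an `instanceof` chain.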

Note that instructions and the Querqy query object model are search-engine independent. Maybe you'd want to pass an abstract Preloader to querqy.rewrite.commonrules.SimpleCommonRulesRewriterFactory and provide the implementation on the querqy-lucene module.

To create the cached Lucene query, have a look at querqy-lucene/querqy.lucene.rewrite.LuceneQueryBuilder and at querqy-solr/querqy.solr.QuerqyDismaxQParser for passing in analyzers and params.

We'll have to clone the cached query when executing it later as boost factors might change per request. This will still be cheaper than running the full analysis chain.
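To illustrate the cloning step, here is a small sketch with a hypothetical `CachedTermQuery` class (a stand-in written for this example, not a Lucene class). It assumes the cached instance is treated as an immutable template and each request gets its own cheap copy with the request-specific boost.

```java
/** Illustrative only: a hypothetical stand-in, not a Lucene query class. */
final class BoostCloneSketch {

    static final class CachedTermQuery {
        final String field;
        final String term;
        final float boost;

        CachedTermQuery(String field, String term, float boost) {
            this.field = field;
            this.term = term;
            this.boost = boost;
        }

        /** Returns a copy with a per-request boost; the cached instance stays untouched. */
        CachedTermQuery withBoost(float requestBoost) {
            return new CachedTermQuery(field, term, requestBoost);
        }
    }
}
```

Copying three fields per clause is the whole per-request cost in this sketch; the analysis result itself is reused from the cache.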

@renekrie
Collaborator

This has been available for some time and is now described here: https://github.com/renekrie/querqy#advanced-configuration-caching .
