
Caching support for UnIndex #9

Closed · wants to merge 7 commits

Conversation

@andbag (Member) commented Mar 30, 2016

Hi,

I have adopted the caching mechanism of DateRangeIndex in order to improve the performance of catalog queries that repeatedly call indexes inheriting from UnIndex (e.g. FieldIndex, KeywordIndex, etc.). The performance enhancement becomes especially noticeable on big sites. In Plone, for example, the indexes review_state, portal_type, allowedRolesAndUsers, etc. are affected within a single request.

Kind regards
Andreas
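The mechanism described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual Products.PluginIndexes code: the names `make_cache_key` and `cached_query` are hypothetical, and including the catalog's change counter in the key is what makes stale entries unreachable after a write.

```python
# Hypothetical sketch of a request-scoped query cache: results are
# memoized on the request object under a key that includes the catalog's
# change counter, so any catalogued write changes the key and stale
# entries are simply never looked up again.

class RequestCache(dict):
    """A per-request dict mapping cache keys to result sets."""


def make_cache_key(catalog_id, counter, index_id, term):
    # The counter changes whenever the catalog is modified, which
    # implicitly invalidates every previously written key.
    return '%s_%s_%s_%s' % (catalog_id, counter, index_id, term)


def cached_query(request, catalog_id, counter, index_id, term, run_query):
    # Lazily attach the cache to the request, then memoize.
    cache = request.setdefault('_query_cache', RequestCache())
    key = make_cache_key(catalog_id, counter, index_id, term)
    if key not in cache:
        cache[key] = run_query(term)
    return cache[key]
```

With this shape, two identical queries in one request hit the underlying index only once; a catalog write between them (which bumps the counter) would force a fresh lookup.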

@@ -35,6 +35,7 @@

from Products.PluginIndexes.common import safe_callable
from Products.PluginIndexes.common.UnIndex import UnIndex
from Products.PluginIndexes.common.UnIndex import RequestCache
Review comment (Member):
Unused import.

@tseaver (Member) commented Mar 30, 2016

Thanks for the patch! I'm not sure I've been able to grasp entirely how the cache interacts with UnIndex._apply_index, which is already quite complex. :( I'd like another reviewer to look at it, maybe @hannosch, @mauritsvanrees, @mgedmin, or @davisagli?

@hannosch (Contributor) commented:

Just to refresh my memory, the way the date range index cache worked was:

  • On the request object, it sets a RequestCache value under the key catalog.getId(); if getCounter is magically found, it uses '%s_%s' % (catalog.getId(), catalog.getCounter()) instead.
  • The date range index stores times rounded to minutes and the search terms get rounded to minutes to match this.
  • The apply_index logic constructs a new cache id '_daterangeindex_%s_%s' % (index.id, term / 10 or 'None') and caches the resultset, or tries to cache the inverse/difference of the resultset compared to an earlier query resultset if that is smaller.
  • The getCounter and _increment_counter logic in Plone's CatalogTool only change when catalog_object or uncatalog_object are called on the ZCatalog. Not when methods are called on the Catalog or when methods are called directly on any index.
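The two key formats from the list above can be reconstructed as small helpers. The string formats are taken verbatim from the comment; the helper names are made up, and the `term / 10` is written here with floor division so the sketch behaves the same on Python 3 as the original integer division did on Python 2:

```python
# Illustrative reconstruction of the two cache-key schemes described
# above (helper names are hypothetical).

def request_cache_key(catalog_id, counter=None):
    # With a counter available, the per-request cache slot is versioned;
    # without one, it falls back to the bare catalog id.
    if counter is not None:
        return '%s_%s' % (catalog_id, counter)
    return catalog_id


def daterange_cache_key(index_id, term):
    # term is minutes since the epoch; dividing by 10 buckets nearby
    # query times into the same key (see the discussion below on why
    # this bucketing is questionable).
    return '_daterangeindex_%s_%s' % (index_id, term // 10 if term else 'None')
```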

So this cache already can be wrong in a number of cases, when data changes during the same request. First do a query, then change some items in the catalog, then do a query again for the same term and the cache might not have been invalidated.

The extra 'divide by 10' used to construct the cache key is also a bit weird. I think it was supposed to catch queries made while handling a request that takes a couple of seconds and starts at second 59, rolling over to second 01. But repeated queries for "now" that happen to straddle an exact minute boundary would still result in cache misses here. It would probably be better to use a relative comparison instead and accept a cache entry if the difference between the cached value and the search term is below some threshold, like 60 seconds.

In order to generalize this concept, it might be better if the getCounter logic were attached to each index, so it actually catches all changes to it, not just those done via the ZCatalog API. It might be best to implement such a change counter first in one PR and then let the cache generalization PR build on top of it.
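A per-index change counter along these lines might look like the following minimal sketch. The method names mirror the getCounter / _increment_counter naming mentioned above, but the class itself is hypothetical and predates what PR #12 actually shipped:

```python
# Hypothetical per-index change counter: every mutating method bumps the
# counter itself, so caches keyed on getCounter() see direct index
# writes, not only changes routed through the ZCatalog API.

class CountingIndex:
    def __init__(self):
        self._counter = 0
        self._data = {}

    def _increment_counter(self):
        self._counter += 1

    def getCounter(self):
        return self._counter

    def index_object(self, docid, value):
        self._data[docid] = value
        self._increment_counter()

    def unindex_object(self, docid):
        self._data.pop(docid, None)
        self._increment_counter()
```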

@mauritsvanrees (Member) commented:

I have trouble following what is happening in this code (not the pull request per se, but the catalog code in general), but the direction seems good. I agree with the detailed code quality remarks by Tres and the notes by Hanno.

@andbag (Member, Author) commented Mar 31, 2016

Thanks for the comments. I agree with Hanno's idea to implement the getCounter logic first in order to get rid of the dependency on Plone. If you agree, I'll create one PR for the getCounter logic and a separate PR for the generalized caching logic.

@@ -288,7 +275,7 @@ def _apply_index(self, request, resultset=None):
until = multiunion(self._until.values(term))

# Total result is bound by resultset
if REQUEST is None:
if cache is None:
until = intersection(resultset, until)
Review comment (Member):
This is not pull request related, but while reviewing this I stumbled over this:
We are in a block where it is guaranteed that resultset is None, so why do we call intersection(None, until)?

Reply (Contributor):
I noticed the same thing. That block can just be removed.
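For context on why the block is a no-op: in the BTrees API, intersection with a None argument treats None as "no restriction" and returns the other argument unchanged. A minimal stand-in for that behaviour, using Python sets in place of BTrees result sets:

```python
# Minimal stand-in mirroring BTrees-style intersection semantics:
# None acts as "everything", so intersection(None, x) is just x.
# This is why `until = intersection(resultset, until)` does nothing
# in a branch where resultset is guaranteed to be None.

def intersection(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a & b
```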

@andbag (Member, Author) commented Apr 18, 2016

PR #12 now implements the generalized caching logic based on the getCounter logic of the master branch (including the code quality corrections).

@andbag andbag closed this Apr 18, 2016
@hannosch hannosch deleted the andbag-unindex-caching branch July 17, 2016 17:01