Update SiteManager to cache sites identified by hostname/port [WIP] #4014

Closed
wants to merge 15 commits

Conversation


@ababic ababic commented Nov 13, 2017

With all of the recent talk about potential changes to Site, and addressing some of the longstanding bugbears with the current Site implementation (see #3834, #2840), I decided to work on a cached SiteMiddleware solution that would yield the same results in all of the current test scenarios, while also attempting to improve the efficiency of how sites are identified for individual requests.

Changes include:

  • Moving 'finding an appropriate Site for handling a request' logic to a 'get_for_request()' method on the SiteManager (Managers are good for this sort of thing)
  • Adding a 'get_default()' method to SiteManager to help easily identify the default site
  • Turning the 'find_for_request()' static method on the Site model into a class method that calls the SiteManager's 'get_for_request()' method
  • Adding caching to 'get_default()' and 'get_for_request()' for faster results after the first call
  • Using signal handlers to clear the cache when relevant actions occur
  • Using smaller, simpler queries when attempting to find a Site for a request, instead of a single complicated query that automatically results in a slower initial response in all scenarios (in real life, the new solution will result in at most 2 small queries per request, or 3 if the default site hasn't yet been cached)
  • Adjusting self.assertNumQueries() tests to account for the change in query counts, and extending existing tests to show the cache (and cache clearing) in action
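
In rough outline, the approach looks like this (a simplified sketch rather than the exact diff; the cache-key format and method bodies here are illustrative):

from django.db import models

SITE_CACHE = {}  # 'hostname:port' (or a default-site key) -> Site instance


class SiteManager(models.Manager):

    def get_default(self):
        """Return the site flagged as default, cached after the first query."""
        if '__default__' not in SITE_CACHE:
            SITE_CACHE['__default__'] = self.get(is_default_site=True)
        return SITE_CACHE['__default__']

    def get_for_request(self, request):
        """Identify the site for a request using small, indexed queries,
        caching the result under the request's hostname:port key."""
        hostname = request.get_host().split(':')[0]
        port = request.get_port()
        key = '%s:%s' % (hostname, port)
        if key not in SITE_CACHE:
            try:
                # 1. exact hostname + port match
                site = self.get(hostname=hostname, port=port)
            except self.model.DoesNotExist:
                try:
                    # 2. unique hostname-only match
                    site = self.get(hostname=hostname)
                except (self.model.DoesNotExist,
                        self.model.MultipleObjectsReturned):
                    # 3. fall back to the default site
                    site = self.get_default()
            SITE_CACHE[key] = site
        return SITE_CACHE[key]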

Notes:

Unlike wagtail.wagtailcore.sites.get_site_for_hostname(), none of these new methods use 'select_related' to prefetch the root_page and include it with the result. This was a deliberate decision, made for the following reasons:

  1. We don't know that the root page is needed every time
  2. The extra join makes every site lookup less efficient, even when the page data goes unused
  3. With efficiency of fetching sites improved massively in most circumstances, fetching the page data afresh when needed each time shouldn't be a problem
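
For illustration, the difference between the two approaches (standard Django queryset behaviour; the hostname value is just an example):

from wagtail.wagtailcore.models import Site

# Eager: joins the page table on every site lookup, whether or not
# root_page is ever used
site = Site.objects.select_related('root_page').get(hostname='example.com')

# Lazy: one small query now; site.root_page runs its own query only
# if and when the page is actually accessed
site = Site.objects.get(hostname='example.com')
homepage = site.root_page  # the extra query happens here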

…o a 'get_for_request()' method on the SiteManager

- Add a 'get_default()' method to SiteManager to help easily identify the default site
- Add a 'get_current()' method to SiteManager to help identify a site to use when a request might not be available
- Turn the 'find_for_request()' static method on the Site model into a class method that calls the SiteManager's 'get_current' method
- Cache results for 'get_default()' and 'get_for_request()' for faster access after the first fetch
- Use 'pre_delete' and 'pre_save' signals to clear any cached results when a Site is updated or deleted
- Use smaller, simpler queries tactically, instead of a single complicated query that automatically results in a slower initial response in all scenarios (in real life, this will result in at most 2 queries per request, or 3 if the default site hasn't yet been cached)
- Adjust self.assertNumQueries() tests to account for the change in query counts, and extend existing tests to show the cache (and cache clearing) in action

@ababic ababic commented Nov 13, 2017

Any thoughts or comments would be gratefully received :)

@ababic ababic commented Nov 13, 2017

All of the current test failures are coming from the wagtail.api app. I've tried for hours to figure out why these failures are happening, but have so far come up short. If anyone with insight into the wagtail.api app could help shed some light on this, that would be great.

Andy Babic added 3 commits November 14, 2017 08:00
…ch running instance of the project is defined using the `SITE_ID` setting for each version. Wagtail's default site is marked in the database, which will always be the same, and so cannot work in a similar way. For now, let's remove 'get_current' to avoid potential confusion.
…the existing 'post_save_site_signal_handler' and 'post_delete_site_signal_handler' handlers in 'wagtailcore.signal_handlers' to clear the site cache when sites are saved or deleted

- Added two new hooks for pages, to clear the site cache when any pages designated as site root pages are saved or deleted

@ababic ababic commented Nov 14, 2017

Okay, turns out the test failures were due to a wee cache invalidation issue, which is solved by the new post_save and post_delete signal handlers for Page. That saying about cache invalidation and naming things rings true :)

@gasman gasman added this to the 2.0 milestone Nov 14, 2017
Andy Babic added 2 commits November 14, 2017 10:10
… storing

- Ditch global cache_delete method
- Rename SiteManager.clear_cache() to Site.clear_request_site_cache()
@ababic ababic changed the title Adding caching to SiteManager [WIP] Update SiteManager to cache sites identified by host/port [WIP] Nov 21, 2017
@ababic ababic changed the title Update SiteManager to cache sites identified by host/port [WIP] Update SiteManager to cache sites identified by hostname/port [WIP] Nov 21, 2017
…ecessarily

- Rename REQUEST_SITE_CACHE to REQUEST_SITE_MATCH_CACHE and only store ids of matches to save on memory usage
- SiteManager.clear_request_site_cache() renamed to just clear_site_cache()

@BertrandBordage BertrandBordage left a comment

Thanks for that work @ababic!
Sorry, I had a few remarks on this tricky work.

Also, I have something to propose that will probably simplify things a lot. Instead of fetching uncached sites one by one with between 1 and 3 SQL queries, we could write something really simple and more efficient in this case: every time we can't find a site in the cache, we fetch all uncached sites and cache them. This may seem brutal, but I can tell you it is way faster than doing 3 SQL queries to fetch a single record. It only gets slow when the database library (like psycopg2) has to convert a lot of data, and here we'll fetch at most a hundred sites, which is faster than fetching a single row of long HTML.
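
A sketch of that idea as a manager method (a hypothetical helper, not code from this PR):

def _get_site_for_key(self, key):
    """On any cache miss, load every site in a single query and cache them all."""
    if key not in SITE_CACHE:
        for site in self.all():
            SITE_CACHE['%s:%s' % (site.hostname, site.port)] = site
            if site.is_default_site:
                SITE_CACHE['__default__'] = site
    # anything still unmatched falls back to the default site
    return SITE_CACHE.get(key, SITE_CACHE.get('__default__'))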

global SITE_CACHE
global REQUEST_SITE_MATCH_CACHE
SITE_CACHE = {}
REQUEST_SITE_MATCH_CACHE = {}

BertrandBordage (Member):
Use dict.clear() instead of reassigning the global variables.
This is very important, otherwise if we import the cache somewhere else, the old dictionary will still be used.

I’m aware it’s used this way for caching sites in Django…
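
In plain Python, the difference is:

cache = {'a': 1}
alias = cache        # e.g. another module did `from mymodule import cache`

cache = {}           # rebinds the name only; the imported reference is stale
assert alias == {'a': 1}

cache = alias        # back to the shared dict
cache.clear()        # mutates in place; every reference sees the empty dict
assert alias == {}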

ababic (Contributor Author):

Sure, I'll get this sorted soon :)

def _cache_and_return_match(self, site, match_cache_key):
    SITE_CACHE[site.pk] = site
    REQUEST_SITE_MATCH_CACHE[match_cache_key] = site.pk
    return SITE_CACHE[site.pk]

BertrandBordage (Member):

Why not return site directly?

@ababic ababic Nov 24, 2017

I guess it has the same address in memory, so I see no reason not to. I was really just taking my cues from Django's implementation. Again, happy to change

BertrandBordage (Member):

It points to the same address, indeed, but fetching an attribute and then a key from a dictionary has a speed cost. That takes around 150 ns, which seems like almost nothing, but caches usually return results in a few μs, so additional costs like this end up being significant ;)

(And yeah, sorry to break the myth a bit, but Django has very mediocre optimization ^^
I've always wanted to work on it, but I'm still scared by the amount of “paperwork” involved.)
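
This is easy to verify with timeit (an illustrative micro-benchmark; absolute numbers vary by machine):

import timeit

SITE_CACHE = {1: object()}
site = SITE_CACHE[1]

print(timeit.timeit(lambda: site, number=1000000))           # return the local directly
print(timeit.timeit(lambda: SITE_CACHE[1], number=1000000))  # adds a dict key lookup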

def _get_by_id(self, site_id):
    if site_id not in SITE_CACHE:
        site = self.get(pk=site_id)
        SITE_CACHE[site_id] = site

BertrandBordage (Member):

These two lines can be merged by removing the unnecessary site variable assignment.
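
i.e.:

def _get_by_id(self, site_id):
    if site_id not in SITE_CACHE:
        SITE_CACHE[site_id] = self.get(pk=site_id)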

ababic (Contributor Author):

I thought this might produce a slightly less scary stack-trace if the query failed for whatever reason. But also happy to make this change.

if key in REQUEST_SITE_MATCH_CACHE:
    return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

match = self.get(is_default_site=True)

BertrandBordage (Member):

What if multiple sites have is_default_site=True?
I suggest changing this to self.filter(is_default_site=True).first().

@ababic ababic Nov 24, 2017

This one I'm a little unsure about. I know site data can be altered in ways that bypass this, but it seems that the model assumes there should only be one default site:
https://github.com/wagtail/wagtail/blob/master/wagtail/wagtailcore/models.py#L145

I fully understand that doing as you propose would cater for every scenario, but does that mean that the code contradicts itself slightly?


BertrandBordage (Member):

Yeah, a bit tricky to decide. It’s either being tolerant or strict.
We can keep it strict as you wrote, but maybe it would be better to explain what’s happening, like catching the error and reraising it with a more explicit message:

MultipleObjectsReturned: Multiple default sites found, only one allowed.
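
For example, inside get_default() (a sketch of that suggestion):

try:
    site = self.get(is_default_site=True)
except self.model.MultipleObjectsReturned:
    raise self.model.MultipleObjectsReturned(
        'Multiple default sites found, only one allowed.'
    )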

set as the default. The result is cached."""
key = 'default'
if key in REQUEST_SITE_MATCH_CACHE:
    return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

BertrandBordage (Member):

Define a global variable for the default cache key name instead of defining a local variable and hard-coding it twice.
Also, use something other than 'default', since it is possible to define a host named 'default'. Something like '__default__' should be enough.
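
For example:

DEFAULT_SITE_KEY = '__default__'  # can't collide with a real hostname

def get_default(self):
    if DEFAULT_SITE_KEY in REQUEST_SITE_MATCH_CACHE:
        return self._get_by_id(REQUEST_SITE_MATCH_CACHE[DEFAULT_SITE_KEY])
    site = self.get(is_default_site=True)
    return self._cache_and_return_match(site, DEFAULT_SITE_KEY)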

ababic (Contributor Author):

Sure, yes. I'll definitely do this.


def _cache_and_return_match(self, site, match_cache_key):
    SITE_CACHE[site.pk] = site
    REQUEST_SITE_MATCH_CACHE[match_cache_key] = site.pk

BertrandBordage (Member):

I don’t understand the point of this second cache. From what you wrote in your last commit, it’s to save memory by only caching IDs, but… Sites are still stored in SITE_CACHE, so I don’t see how this saves memory.

Unless I totally missed your point, you can remove this cache — if anything, it uses more memory than before. Ditching this second cache also makes the code simpler, as we can directly fetch site objects from host+port strings.

ababic (Contributor Author):

I was (perhaps naively) attempting to guard against the possibility of the original SITE_CACHE getting filled with duplicates in the event of some attack where ports could be spoofed for a matching domain, resulting in a Site object existing in memory for each port/hostname combination. In a genuine attack situation, I suppose it wouldn't make much difference in the grand scheme of things.

BertrandBordage (Member):

The number of ports is finite: 65535. So a hypothetical attack based on that would at most populate the dictionary with 65535 site objects × the number of hosts. This may take at most a dozen megabytes, so it would be a pretty lame attack.

That being said, this will never happen unless you did all these:

  • configured your OS to accept connections on all ports
  • configured your HTTP server to redirect by default all hosts and ports to a Wagtail instance
  • configured Django with something like ALLOWED_HOSTS = ['*']

So this will not happen in any correctly configured server, and would actually need some extra work to allow any port through the HTTP server: by default they only listen to ports 80 & 443, as far as I know.

It’s possible though to have an infinite number of hosts if you set up wildcard subdomains and adjust ALLOWED_HOSTS accordingly. But in that event, you also need a huge number of Site instances in the database, so we can’t avoid caching each one of them. If a subdomain does not exist in the database, then we simply fetch the default one using the '__default__' key, so it doesn’t change memory usage.

ababic (Contributor Author):

Yes, thanks for the insight. Like I said, it was a little naive. I'll re-simplify.


# Fall back to using default site
match = self.get_default()
return self._cache_and_return_match(match, key)

BertrandBordage (Member):

Throughout this method, match is a Site instance. That's a bit misleading when compared with what match meant in get_site_for_hostname, which was a very different kind of match. Prefer site.

ababic (Contributor Author):

Yes, I can see your point there. Using site consistently would be less ambiguous.

    match = self.get(hostname=hostname)
    return self._cache_and_return_match(match, key)
except (Site.DoesNotExist, Site.MultipleObjectsReturned):
    pass

BertrandBordage (Member):

Isn’t it better to return the first site matching hostname when there are multiple sites for that hostname, instead of returning the default site? I would use self.filter(hostname=hostname).first() and remove MultipleObjectsReturned.

ababic (Contributor Author):

There are a couple of tests that are directly affected by this:

On the plus side: wagtail.wagtailcore.tests.test_page_model.TestSiteRouting.test_unrecognised_port_and_default_host_routes_to_default_site is made slightly more efficient (2 queries instead of 3)

But on the negative:
wagtail.wagtailcore.tests.test_page_model.TestSiteRouting.test_unrecognised_port_on_known_hostname_routes_to_default_site_if_ambiguity fails.

The existing pattern was the only one I tried that would pass the tests.

BertrandBordage (Member):

OK, if I’m not mistaken, the “fetch all once” solution I suggested would handle these correctly.

return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

match = self.get(is_default_site=True)
return self._cache_and_return_match(match, key)

BertrandBordage (Member):

When caching the default site, you could save it both under the '__default__' key and under its host+port key. This will avoid extra SQL queries the first time that site is fetched by its host.
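
Roughly (the hostname:port key format is illustrative):

site = self.get(is_default_site=True)
SITE_CACHE[site.pk] = site
REQUEST_SITE_MATCH_CACHE['__default__'] = site.pk
# also register the default site under its own hostname:port key, so the
# first request for that host is a cache hit rather than a query
REQUEST_SITE_MATCH_CACHE['%s:%s' % (site.hostname, site.port)] = site.pk
return site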

ababic (Contributor Author):

Great suggestion.

@ababic ababic Nov 24, 2017

Oh, wait. This is what happens already in the last line of get_for_request().

@ababic ababic commented Nov 24, 2017

Thank you for the review @BertrandBordage! Much appreciated! In relation to your idea proposed above:

Also, I have something to propose, that will probably simplify things a lot. Instead of fetching uncached sites one by one by doing between 1 and 3 SQL queries, we could write something really easy and more efficient in this case. Every time we can’t find a site in the cache, we fetch all uncached sites and cache them.

From what I understand, hostname-only or default-site matches can be returned for an indeterminate number of ports, so I'm not sure we can populate everything up-front. Unless there's some configuration for that somewhere that I've missed? If not, and we can only partially populate the cache, is that still useful enough to justify the additional code complexity?

… in SiteManager.get_default()

- Use dict.clear() in SiteManager.clear_site_cache() instead of reassigning global variables
- Rename '_cache_and_return_match' to '_cache_and_return_site'
- Use 'site' as a variable name instead of 'match' in SiteManager.get_for_request()

BertrandBordage (Member):

As mentioned in my previous comment, the attack on ports is almost impossible on a normal server, and it is very limited in case it happens anyway.

Additionally, for a request on an unknown port, we simply return the default site, so we don’t even have to do the 2 queries done with your solution.

Something I’m wondering now about all this caching system: what happens when running multiple workers? If I modify a site on a given process, the cache is correctly cleared in that process, but it’s not clearing the cache from other processes…

@ababic ababic commented Nov 25, 2017

what happens when running multiple workers? If I modify a site on a given process, the cache is correctly cleared in that process, but it’s not clearing the cache from other processes…

I can't claim to be knowledgeable about this particular subject, but wouldn't those workers also be responding to the same signals triggered by such changes, and also be clearing their version of the cache in memory?

Have we come to a bit of an impasse here? Is a change of tack needed? Would a switch to using the low-level caching API address these concerns?

@BertrandBordage BertrandBordage commented Nov 25, 2017

Unfortunately, signals are simply a list of Python functions that get called later, so it all happens within the same process. The only way to communicate with the other processes is either through the database or through a cache backend. And it can't work with the default cache backend, LocMem, since that's simply a Python dictionary.

What’s weird is that Django uses exactly the same approach as you took. Don't spend more time on this; I'll do some research and ask the Django team about it. Either I'm missing something, or Django really has an issue here. Since I remember having to restart servers after changing Site objects in both Django & Wagtail, I guess there's indeed a problem.

Anyway, the caching could be as simple as using a configured cache instead of a dictionary, but only if the cache is not LocMem.
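
For example, with Django's low-level cache API (a sketch; the key pattern and the _find_site() helper are hypothetical):

from django.core.cache import cache

def get_for_hostname_and_port(self, hostname, port):
    key = 'wagtail_site:%s:%s' % (hostname, port)  # hypothetical key pattern
    site = cache.get(key)
    if site is None:
        site = self._find_site(hostname, port)  # hypothetical DB lookup
        cache.set(key, site)
    return site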

Or we can ditch the idea of caching site objects, django-cachalot does exactly that kind of caching automatically ;)
(Just kidding, I know it doesn’t fit all use cases.)

BertrandBordage (Member):

Just created ticket #28844 in Django about the multi-process issue, with a minimal example.

@ababic ababic commented Nov 26, 2017

Looks like you were right @BertrandBordage. Apparently it's been a known issue in Django for years, but there's been no consensus on a solution yet :(

Seems like one way forward might be a switch to using the cache backend under certain circumstances. But then, we'd have to rethink entirely the way cache invalidation works, introducing a tonne more code that will require tests and maintenance...

And by the time we're done, we may as well have used django-cachalot, which feels like the 'already-evolved' version of what we're trying to achieve (with the advantage of it being generic, and already well tested). But then again, that would mean forcing additional dependencies and configuration onto the developers, which I would expect that nobody in the core team (including yourself) is going to want to do for the sake of speeding up this one small part of wagtail.

For now, I think the sensible decision is to abandon this PR - unless you feel differently?

Andy Babic added 3 commits November 26, 2017 15:29
…_CACHE_ENABLED setting (False/off by default)

- Allow developers to change port matching behaviour in caching mode using the WAGTAIL_SITE_ALLOW_HOSTNAME_ONLY_MATCHES_FOR_PORTS setting
- If caching is turned on via the setting, raise a warning if the project is configured to use the LocMemCache backend by default
- If caching is turned off, resort to previous behaviour of identifying matching sites using wagtailcore.sites.get_site_for_hostname()
- If caching is turned on, query all sites and prepopulate the cache by calling rebuild_site_cache()
- Where appropriate, also update signal/signal processors to rebuild the cache instead of just clearing it
- Updated tests
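
The gate described above might look roughly like this (setting name from the commit message; the check itself is an illustrative sketch):

import warnings

from django.conf import settings
from django.core.cache import caches
from django.core.cache.backends.locmem import LocMemCache


def site_caching_enabled():
    if not getattr(settings, 'WAGTAIL_SITE_CACHE_ENABLED', False):
        return False  # fall back to get_site_for_hostname() behaviour
    if isinstance(caches['default'], LocMemCache):
        warnings.warn(
            'Site caching is enabled, but the default cache backend is '
            'LocMemCache, which is per-process: site changes made by one '
            'worker will not invalidate the caches of other workers.'
        )
    return True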

@ababic ababic commented Nov 27, 2017

Hi @BertrandBordage, I got over my initial feelings of deflation and have decided to plough on with this in the direction you proposed. It has meant a LOT of changes though.

In particular, I think we might want to revise how the new settings are used, or how the warning is surfaced (maybe a warning isn't enough, and you'd rather raise ImproperlyConfigured). But anyway.

I have some lingering concerns about ensuring all existing data is cleared from the cache when clear_site_cache() is called. These will no doubt benefit from some unit tests to ensure that's working well enough.

Andy Babic added 3 commits November 27, 2017 10:08
- Remove unused 'sender' from signature for new signal handlers
…hose keys for cache clearing instead of querying the database again

- Remove the now redundant 'deleted_site' kwarg from clear_site_cache() and rebuild_site_cache()
- Remove unnecessary post_delete hook for pages
@ababic ababic commented Nov 30, 2017

@BertrandBordage The original intention has changed so much since starting this that it's probably best to close this pull request and create a fresh one based off the master branch. I'll resubmit soon with a more appropriate writeup.

@ababic ababic closed this Nov 30, 2017

BertrandBordage (Member):

Thank you @ababic, I'll make a new review then. From what I quickly saw of your recent changes, this seems way more robust.
