Update SiteManager to cache sites identified by hostname/port [WIP] #4014

Closed
wants to merge 15 commits

Conversation


@ababic ababic commented Nov 13, 2017

With all of the recent talk about potential changes to Site, and addressing some of the longstanding bugbears with the current Site implementation (see #3834, #2840), I decided to work on a cached SiteMiddleware solution that would yield the same results in all of the current test scenarios, while also attempting to improve the efficiency of how sites are identified for individual requests.

Changes include:

  • Moving 'finding an appropriate Site for handling a request' logic to a 'get_for_request()' method on the SiteManager (Managers are good for this sort of thing)
  • Adding a 'get_default()' method to SiteManager to help easily identify the default site
  • Turning the 'find_for_request()' static method on the Site model into a class method that calls the SiteManager's 'get_for_request()' method
  • Adding caching to 'get_default()' and 'get_for_request()' for faster results after the first call
  • Using signal handlers to clear the cache when relevant actions occur
  • Using smaller, simpler queries when attempting to find a Site for a request, instead of a single complicated query that automatically results in a slower initial response in all scenarios (in real life, the new solution will result in at most 2 small queries per request, or 3 if the default site hasn't yet been cached)
  • Adjusting self.assertNumQueries() tests to account for the change in query counts, and extending existing tests to show the cache (and cache clearing) in action
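
In rough outline, the approach looks like this (a simplified sketch rather than the exact diff; the cache-key format and method bodies here are illustrative):

from django.db import models

SITE_CACHE = {}  # 'hostname:port' (or a default-site key) -> Site instance


class SiteManager(models.Manager):

    def get_default(self):
        """Return the site flagged as default, cached after the first query."""
        if '__default__' not in SITE_CACHE:
            SITE_CACHE['__default__'] = self.get(is_default_site=True)
        return SITE_CACHE['__default__']

    def get_for_request(self, request):
        """Identify the site for a request using small, indexed queries,
        caching the result under the request's hostname:port key."""
        hostname = request.get_host().split(':')[0]
        port = request.get_port()
        key = '%s:%s' % (hostname, port)
        if key not in SITE_CACHE:
            try:
                # 1. exact hostname + port match
                site = self.get(hostname=hostname, port=port)
            except self.model.DoesNotExist:
                try:
                    # 2. unique hostname-only match
                    site = self.get(hostname=hostname)
                except (self.model.DoesNotExist,
                        self.model.MultipleObjectsReturned):
                    # 3. fall back to the default site
                    site = self.get_default()
            SITE_CACHE[key] = site
        return SITE_CACHE[key]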

Notes:

Unlike wagtail.wagtailcore.sites.get_site_for_hostname(), none of these new methods use 'select_related' to prefetch the root_page and include it with the result. This was a deliberate decision, made for the following reasons:

  1. We don't know that the root page is needed every time
  2. The extra join makes every site lookup less efficient, even when the page data goes unused
  3. With efficiency of fetching sites improved massively in most circumstances, fetching the page data afresh when needed each time shouldn't be a problem
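
For illustration, the difference between the two approaches (standard Django queryset behaviour; the hostname value is just an example):

from wagtail.wagtailcore.models import Site

# Eager: joins the page table on every site lookup, whether or not
# root_page is ever used
site = Site.objects.select_related('root_page').get(hostname='example.com')

# Lazy: one small query now; site.root_page runs its own query only
# if and when the page is actually accessed
site = Site.objects.get(hostname='example.com')
homepage = site.root_page  # the extra query happens here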

…o a 'get_for_request()' method on the SiteManager

- Add a 'get_default()' method to SiteManager to help easily identify the default site
- Add a 'get_current()' method to SiteManager to help identify a site to use when a request might not be available
- Turn the 'find_for_request()' static method on the Site model into a class method that calls the SiteManager's 'get_current' method
- Cache results for 'get_default()' and 'get_for_request()' for faster access after the first fetch
- Use 'pre_delete' and 'pre_save' signals to clear any cached results when a Site is updated or deleted
- Use smaller, simpler queries tactically, instead of a single complicated query that automatically results in a slower initial response in all scenarios (in real life, this will result in at most 2 queries per request, or 3 if the default site hasn't yet been cached)
- Adjust self.assertNumQueries() tests to account for the change in query counts, and extend existing tests to show the cache (and cache clearing) in action

@ababic ababic commented Nov 13, 2017

Any thoughts or comments would be gratefully received :)

@ababic ababic commented Nov 13, 2017

All of the current test failures are coming from the wagtail.api app. I've tried for hours to figure out why these failures are happening, but have so far come up short. If anyone with insight into the wagtail.api app could help shed some light on this, that would be great.

Andy Babic added 3 commits November 14, 2017 08:00
…ch running instance of the project is defined using the `SITE_ID` setting for each version. Wagtail's default site is marked in the database, which will always be the same, and so cannot work in a similar way. For now, let's remove 'get_current' to avoid potential confusion.
…the existing 'post_save_site_signal_handler' and 'post_delete_site_signal_handler' handlers in 'wagtailcore.signal_handlers' to clear the site cache when sites are saved or deleted

- Added two new hooks for pages, to clear the site cache when any pages designated as site root pages are saved or deleted

@ababic ababic commented Nov 14, 2017

Okay, turns out the test failures were due to a wee cache invalidation issue, which is solved by the new post_save and post_delete signal handlers for Page. That saying about cache invalidation and naming things rings true :)

@gasman gasman added this to the 2.0 milestone Nov 14, 2017
Andy Babic added 2 commits November 14, 2017 10:10
… storing

- Ditch global cache_delete method
- Rename SiteManager.clear_cache() to Site.clear_request_site_cache()
@ababic ababic changed the title Adding caching to SiteManager [WIP] Update SiteManager to cache sites identified by host/port [WIP] Nov 21, 2017
@ababic ababic changed the title Update SiteManager to cache sites identified by host/port [WIP] Update SiteManager to cache sites identified by hostname/port [WIP] Nov 21, 2017
…ecessarily

- Rename REQUEST_SITE_CACHE to REQUEST_SITE_MATCH_CACHE and only store ids of matches to save on memory usage
- SiteManager.clear_request_site_cache() renamed to just clear_site_cache()

@BertrandBordage BertrandBordage left a comment

Thanks for that work @ababic!
Sorry, I had a few remarks on this tricky work.

Also, I have something to propose that will probably simplify things a lot. Instead of fetching uncached sites one by one with between 1 and 3 SQL queries, we could write something really simple and more efficient in this case: every time we can't find a site in the cache, we fetch all uncached sites and cache them. This may seem brutal, but I can tell you it is way faster than doing 3 SQL queries to fetch a single record. It only gets slow when the database library (like psycopg2) has to convert a lot of data, and here we'll fetch at most a hundred sites, which is faster than fetching a single row of long HTML.
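
A sketch of that idea as a manager method (a hypothetical helper, not code from this PR):

def _get_site_for_key(self, key):
    """On any cache miss, load every site in a single query and cache them all."""
    if key not in SITE_CACHE:
        for site in self.all():
            SITE_CACHE['%s:%s' % (site.hostname, site.port)] = site
            if site.is_default_site:
                SITE_CACHE['__default__'] = site
    # anything still unmatched falls back to the default site
    return SITE_CACHE.get(key, SITE_CACHE.get('__default__'))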

global SITE_CACHE
global REQUEST_SITE_MATCH_CACHE
SITE_CACHE = {}
REQUEST_SITE_MATCH_CACHE = {}

BertrandBordage (Member):
Use dict.clear() instead of reassigning the global variables.
This is very important, otherwise if we import the cache somewhere else, the old dictionary will still be used.

I’m aware it’s used this way for caching sites in Django…
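
In plain Python, the difference is:

cache = {'a': 1}
alias = cache        # e.g. another module did `from mymodule import cache`

cache = {}           # rebinds the name only; the imported reference is stale
assert alias == {'a': 1}

cache = alias        # back to the shared dict
cache.clear()        # mutates in place; every reference sees the empty dict
assert alias == {}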

ababic (Contributor Author):

Sure, I'll get this sorted soon :)

def _cache_and_return_match(self, site, match_cache_key):
    SITE_CACHE[site.pk] = site
    REQUEST_SITE_MATCH_CACHE[match_cache_key] = site.pk
    return SITE_CACHE[site.pk]

BertrandBordage (Member):

Why not return site directly?

@ababic ababic Nov 24, 2017

I guess it has the same address in memory, so I see no reason not to. I was really just taking my cues from Django's implementation. Again, happy to change

BertrandBordage (Member):

It points to the same address, indeed, but fetching an attribute and then a key from a dictionary has a speed cost. That takes around 150 ns, which seems like almost nothing, but caches usually return results in a few μs, so additional costs like this end up being significant ;)

(And yeah, sorry to break the myth a bit, but Django has very mediocre optimization ^^
I've always wanted to work on it, but I'm still scared by the amount of “paperwork” involved.)
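
This is easy to verify with timeit (an illustrative micro-benchmark; absolute numbers vary by machine):

import timeit

SITE_CACHE = {1: object()}
site = SITE_CACHE[1]

print(timeit.timeit(lambda: site, number=1000000))           # return the local directly
print(timeit.timeit(lambda: SITE_CACHE[1], number=1000000))  # adds a dict key lookup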

def _get_by_id(self, site_id):
    if site_id not in SITE_CACHE:
        site = self.get(pk=site_id)
        SITE_CACHE[site_id] = site

BertrandBordage (Member):

These two lines can be merged by removing the unnecessary site variable assignment.
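
i.e.:

def _get_by_id(self, site_id):
    if site_id not in SITE_CACHE:
        SITE_CACHE[site_id] = self.get(pk=site_id)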

ababic (Contributor Author):

I thought this might produce a slightly less scary stack-trace if the query failed for whatever reason. But also happy to make this change.

if key in REQUEST_SITE_MATCH_CACHE:
    return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

match = self.get(is_default_site=True)

BertrandBordage (Member):

What if multiple sites have is_default_site=True?
I suggest changing this to self.filter(is_default_site=True).first().

@ababic ababic Nov 24, 2017

This one I'm a little unsure about. I know site data can be altered in ways that bypass this, but it seems that the model assumes there should only be one default site:
https://github.com/wagtail/wagtail/blob/master/wagtail/wagtailcore/models.py#L145

I fully understand that doing as you propose would cater for every scenario, but does that mean that the code contradicts itself slightly?


BertrandBordage (Member):

Yeah, a bit tricky to decide. It’s either being tolerant or strict.
We can keep it strict as you wrote, but maybe it would be better to explain what’s happening, like catching the error and reraising it with a more explicit message:

MultipleObjectsReturned: Multiple default sites found, only one allowed.
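
For example, inside get_default() (a sketch of that suggestion):

try:
    site = self.get(is_default_site=True)
except self.model.MultipleObjectsReturned:
    raise self.model.MultipleObjectsReturned(
        'Multiple default sites found, only one allowed.'
    )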

set as the default. The result is cached."""
key = 'default'
if key in REQUEST_SITE_MATCH_CACHE:
    return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

BertrandBordage (Member):

Define a global variable for the default cache key name instead of defining a local variable and hard-coding it twice.
Also, use something other than 'default', since it is possible to define a host named 'default'. Something like '__default__' should be enough.
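
For example:

DEFAULT_SITE_KEY = '__default__'  # can't collide with a real hostname

def get_default(self):
    if DEFAULT_SITE_KEY in REQUEST_SITE_MATCH_CACHE:
        return self._get_by_id(REQUEST_SITE_MATCH_CACHE[DEFAULT_SITE_KEY])
    site = self.get(is_default_site=True)
    return self._cache_and_return_match(site, DEFAULT_SITE_KEY)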

ababic (Contributor Author):

Sure, yes. I'll definitely do this.


def _cache_and_return_match(self, site, match_cache_key):
    SITE_CACHE[site.pk] = site
    REQUEST_SITE_MATCH_CACHE[match_cache_key] = site.pk

BertrandBordage (Member):

I don’t understand the point of this second cache. From what you wrote in your last commit, it’s to save memory by only caching IDs, but… Sites are still stored in SITE_CACHE, so I don’t see how this saves memory.

Unless I totally missed your point, you can remove this cache — if anything, it uses more memory than before. Ditching this second cache also makes the code simpler, as we can directly fetch site objects from host+port strings.

ababic (Contributor Author):

I was (perhaps naively) attempting to guard against the possibility of the original SITE_CACHE getting filled with duplicates in the event of some attack where ports could be spoofed for a matching domain, resulting in a Site object existing in memory for each port/hostname combination. In a genuine attack situation, I suppose it wouldn't make much difference in the grand scheme of things.

BertrandBordage (Member):

The number of ports is finite: 65535. So a hypothetical attack based on that would at most populate the dictionary with 65535 site objects × the number of hosts. This may take at most a dozen megabytes, so it would be a pretty lame attack.

That being said, this will never happen unless you did all these:

  • configured your OS to accept connections on all ports
  • configured your HTTP server to redirect by default all hosts and ports to a Wagtail instance
  • configured Django with something like ALLOWED_HOSTS = ['*']

So this will not happen in any correctly configured server, and would actually need some extra work to allow any port through the HTTP server: by default they only listen to ports 80 & 443, as far as I know.

It’s possible though to have an infinite number of hosts if you set up wildcard subdomains and adjust ALLOWED_HOSTS accordingly. But in that event, you also need a huge number of Site instances in the database, so we can’t avoid caching each one of them. If a subdomain does not exist in the database, then we simply fetch the default one using the '__default__' key, so it doesn’t change memory usage.

ababic (Contributor Author):

Yes, thanks for the insight. Like I said, it was a little naive. I'll re-simplify.


# Fall back to using default site
match = self.get_default()
return self._cache_and_return_match(match, key)

BertrandBordage (Member):

Throughout this method, match is a Site instance. That's a bit misleading when compared with what match meant in get_site_for_hostname, which was a very different kind of match. Prefer site.

ababic (Contributor Author):

Yes, I can see your point there. Using site consistently would be less ambiguous.

    match = self.get(hostname=hostname)
    return self._cache_and_return_match(match, key)
except (Site.DoesNotExist, Site.MultipleObjectsReturned):
    pass

BertrandBordage (Member):

Isn’t it better to return the first site matching hostname when there are multiple sites for that hostname, instead of returning the default site? I would use self.filter(hostname=hostname).first() and remove MultipleObjectsReturned.

ababic (Contributor Author):

There are a couple of tests that are directly affected by this:

On the plus side: wagtail.wagtailcore.tests.test_page_model.TestSiteRouting.test_unrecognised_port_and_default_host_routes_to_default_site is made slightly more efficient (2 queries instead of 3)

But on the negative:
wagtail.wagtailcore.tests.test_page_model.TestSiteRouting.test_unrecognised_port_on_known_hostname_routes_to_default_site_if_ambiguity fails.

The existing pattern was the only one I tried that would pass the tests.

BertrandBordage (Member):

OK, if I’m not mistaken, the “fetch all once” solution I suggested would handle these correctly.

return self._get_by_id(REQUEST_SITE_MATCH_CACHE['default'])

match = self.get(is_default_site=True)
return self._cache_and_return_match(match, key)

BertrandBordage (Member):

When caching the default site, you could save it both under the '__default__' key and under its host+port key. This will avoid extra SQL queries the first time that site is fetched by its host.
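
Roughly (the hostname:port key format is illustrative):

site = self.get(is_default_site=True)
SITE_CACHE[site.pk] = site
REQUEST_SITE_MATCH_CACHE['__default__'] = site.pk
# also register the default site under its own hostname:port key, so the
# first request for that host is a cache hit rather than a query
REQUEST_SITE_MATCH_CACHE['%s:%s' % (site.hostname, site.port)] = site.pk
return site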

ababic (Contributor Author):

Great suggestion.

@ababic ababic Nov 24, 2017

Oh, wait. This is what happens already in the last line of get_for_request().

@ababic ababic commented Nov 24, 2017

Thank you for the review @BertrandBordage! Much appreciated! In relation to your idea proposed above:

Also, I have something to propose, that will probably simplify things a lot. Instead of fetching uncached sites one by one by doing between 1 and 3 SQL queries, we could write something really easy and more efficient in this case. Every time we can’t find a site in the cache, we fetch all uncached sites and cache them.

From what I understand, hostname-only or default-site matches can be returned for an indeterminate number of ports, so I'm not sure we can populate everything up-front. Unless there's some configuration for that somewhere that I've missed? If not, and we can only partially populate the cache, is that still useful enough to justify the additional code complexity?

… in SiteManager.get_default()

- Use dict.clear() in SiteManager.clear_site_cache() instead of reassigning global variables
- Rename '_cache_and_return_match' to '_cache_and_return_site'
- Use 'site' as a variable name instead of 'match' in SiteManager.get_for_request()

BertrandBordage (Member):

As mentioned in my previous comment, the attack on ports is almost impossible on a normal server, and it is very limited in case it happens anyway.

Additionally, for a request on an unknown port, we simply return the default site, so we don’t even have to do the 2 queries done with your solution.

Something I’m wondering now about all this caching system: what happens when running multiple workers? If I modify a site on a given process, the cache is correctly cleared in that process, but it’s not clearing the cache from other processes…

@ababic ababic commented Nov 25, 2017

what happens when running multiple workers? If I modify a site on a given process, the cache is correctly cleared in that process, but it’s not clearing the cache from other processes…

I can't claim to be knowledgeable about this particular subject, but wouldn't those workers also be responding to the same signals triggered by such changes, and also be clearing their version of the cache in memory?

Have we come to a bit of an impasse here? Is a change of tack needed? Would a switch to using the low-level caching API address these concerns?

@BertrandBordage BertrandBordage commented Nov 25, 2017

Unfortunately, signals are simply a list of Python functions that get called later, so it all happens within the same process. The only way to communicate with the other processes is either through the database or through a cache backend. And it can't work with the default cache backend, LocMem, since that's simply a Python dictionary.

What’s weird is that Django uses exactly the same approach as you took. Don't spend more time on this; I'll do some research and ask the Django team about it. Either I'm missing something, or Django really has an issue here. Since I remember having to restart servers after changing Site objects in both Django & Wagtail, I guess there's indeed a problem.

Anyway, the caching could be as simple as using a configured cache instead of a dictionary, but only if the cache is not LocMem.
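
For example, with Django's low-level cache API (a sketch; the key pattern and the _find_site() helper are hypothetical):

from django.core.cache import cache

def get_for_hostname_and_port(self, hostname, port):
    key = 'wagtail_site:%s:%s' % (hostname, port)  # hypothetical key pattern
    site = cache.get(key)
    if site is None:
        site = self._find_site(hostname, port)  # hypothetical DB lookup
        cache.set(key, site)
    return site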

Or we can ditch the idea of caching site objects, django-cachalot does exactly that kind of caching automatically ;)
(Just kidding, I know it doesn’t fit all use cases.)

BertrandBordage (Member):

Just created ticket #28844 in Django about the multi-process issue, with a minimal example.

@ababic ababic commented Nov 26, 2017

Looks like you were right @BertrandBordage. Apparently it's been a known issue in Django for years, but there's been no consensus on a solution yet :(

Seems like one way forward might be a switch to using the cache backend under certain circumstances. But then, we'd have to rethink entirely the way cache invalidation works, introducing a tonne more code that will require tests and maintenance...

And by the time we're done, we may as well have used django-cachalot, which feels like the 'already-evolved' version of what we're trying to achieve (with the advantage of it being generic, and already well tested). But then again, that would mean forcing additional dependencies and configuration onto the developers, which I would expect that nobody in the core team (including yourself) is going to want to do for the sake of speeding up this one small part of wagtail.

For now, I think the sensible decision is to abandon this PR - unless you feel differently?

Andy Babic added 3 commits November 26, 2017 15:29
…_CACHE_ENABLED setting (False/off by default)

- Allow developers to change port matching behaviour in caching mode using the WAGTAIL_SITE_ALLOW_HOSTNAME_ONLY_MATCHES_FOR_PORTS setting
- If caching is turned on via the setting, raise a warning if the project is configured to use the LocMemCache backend by default
- If caching is turned off, resort to previous behaviour of identifying matching sites using wagtailcore.sites.get_site_for_hostname()
- If caching is turned on, query all sites and prepopulate the cache by calling rebuild_site_cache()
- Where appropriate, also update signal/signal processors to rebuild the cache instead of just clearing it
- Updated tests
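
The gate described above might look roughly like this (setting name from the commit message; the check itself is an illustrative sketch):

import warnings

from django.conf import settings
from django.core.cache import caches
from django.core.cache.backends.locmem import LocMemCache


def site_caching_enabled():
    if not getattr(settings, 'WAGTAIL_SITE_CACHE_ENABLED', False):
        return False  # fall back to get_site_for_hostname() behaviour
    if isinstance(caches['default'], LocMemCache):
        warnings.warn(
            'Site caching is enabled, but the default cache backend is '
            'LocMemCache, which is per-process: site changes made by one '
            'worker will not invalidate the caches of other workers.'
        )
    return True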

@ababic ababic commented Nov 27, 2017

Hi @BertrandBordage, I got over my initial feelings of deflation and have decided to plough on with this in the direction you proposed. It has meant a LOT of changes though.

In particular, I think we might want to revise how the new settings are used, or how the warning is surfaced (maybe a warning isn't enough, and you'd rather raise ImproperlyConfigured). But anyway.

I have some lingering concerns about ensuring all existing data is cleared from the cache when clear_site_cache() is called. These will no doubt benefit from some unit tests to ensure that's working well enough.

Andy Babic added 3 commits November 27, 2017 10:08
- Remove unused 'sender' from signature for new signal handlers
…hose keys for cache clearing instead of querying the database again

- Remove the now redundant 'deleted_site' kwarg from clear_site_cache() and rebuild_site_cache()
- Remove unnecessary post_delete hook for pages
@ababic ababic commented Nov 30, 2017

@BertrandBordage The original intention has changed so much since starting this that it's probably best to close this pull request and create a fresh one based off the master branch. I'll resubmit soon with a more appropriate writeup.

@ababic ababic closed this Nov 30, 2017

BertrandBordage (Member):

Thank you @ababic, I'll make a new review then. From what I quickly saw of your recent changes, this seems way more robust.
