
Make cache stampede protection optional, configurable, and more flexible #27729

Closed
pounard opened this Issue Jun 26, 2018 · 16 comments

@pounard
Contributor

pounard commented Jun 26, 2018

Following issue #27604, where @nicolas-grekas added the LockRegistry class to prevent cache stampedes, I feel the need to re-open the discussion about the current implementation. Here are the points I want to raise a red flag about:

Issue #27009 adds a new bit of documentation:

added CacheInterface, which provides stampede protection via probabilistic early expiration and should become the preferred way to use a cache

While I do agree with this statement (cache stampede protection is a good thing), it should nevertheless not be hardcoded into the current existing backends through GetTrait. For example, it would be very easy to implement a Redis adapter that does atomic operations or provides its own locking, in which case this stampede protection algorithm would be much less necessary.

Furthermore, #27604 introduces, within the same GetTrait::get() implementation, a lock-based protection (which means that cache stampede protection uses two different mechanisms; both are great, and using both is legitimate), but the developer should be able to decide whether or not each cache is critical. Other valid mechanisms are (as documented in https://en.wikipedia.org/wiki/Cache_stampede, most of which I have already used independently depending on the use case):

  • Wait until the value is recomputed
  • Return a "not-found" and have the client handle the absence of the value properly
  • Keep a stale item in the cache to be used while the new value is recomputed

All three are valid choices; it all depends on your business, and I think that by hardcoding any particular cache stampede implementation, Symfony loses flexibility and adds overhead in situations where it's not necessary. The people who configure the application (often developers themselves, but possibly sysadmins too) should be able to choose different strategies for different business components in order to prevent cache stampedes.

Right now, none of this is important, because nobody uses the CacheInterface yet - but as soon as people start using it, problems will start to arise.

My proposals for fixing this are (we can pick all of them, none of them, some of them, or just discuss):

  • make it configurable at a global level, using two new configuration options: one for toggling lock-based protection, and the other for toggling protection based on probabilistic early expiration,
  • make it configurable per pool, using the same two options, but on a cache pool or namespace basis,
  • extract the LockRegistry usage from the hardcoded GetTrait::get() and place it in a decorating implementation instead: LockStampedeProtectedCacheProxy, for example,
  • extract the probabilistic early expiration too and implement it via a proxy in the same way; a nice benefit would be that it could then easily be applied to other cache implementations as well (not only the CacheInterface).

The problem is that, as soon as everyone is using CacheInterface, everyone will implicitly use both the lock-based protection (which is prone to errors, or simply useless, depending on how the underlying file system is set up) and the probabilistic early expiration (which may add unnecessary overhead in many, many contexts).

@nicolas-grekas

Member

nicolas-grekas commented Jun 26, 2018

A big thank you for digging into this @pounard. I'm not going to answer on locking since there is #27730 to do so.

[probabilistic] cache stampede protection is a good thing, it should nevertheless not be hardcoded in current existing backends using the GetTrait

Probabilistic early expiration requires a $beta argument. In order to not force anyone to actually use the behavior, I made the $beta argument default to null, leaving the default behavior as a choice for any concrete pool. I also defined two limit conditions: using $beta = 0 should mean "don't do any early expiration", and $beta = INF should mean "do early expiration NOW". Considering this covers all the use cases + flexibility you mention, I made it into the interface. Thus having it as a core feature and not as an external middleware makes the most sense to me (because allowing the consumer to set the $beta also does.)
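To illustrate the three cases at a call site (assuming the get($key, $callback, $beta = null) signature discussed here; the key and the computeReport() callback are made up):

    // $beta = null: inherit whatever default behavior the pool was configured with
    $value = $pool->get('report', function ($item) { return computeReport(); });

    // $beta = 0: never trigger probabilistic early expiration for this call
    $value = $pool->get('report', function ($item) { return computeReport(); }, 0);

    // $beta = INF: force an immediate recomputation ("expire NOW")
    $value = $pool->get('report', function ($item) { return computeReport(); }, INF);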

If we want to provide the ability to force-disable early expiration, we can provide an implementation-specific way to set the default $beta to 0, using a constructor argument or a setter.

implement a Redis adapter that does atomic operations or provide locking by its own

locking is a totally different strategy that doesn't compete with early expiration. The big difference is that with locking, ALL clients wait, which means ALL users experience the delay. With early expiration, only one of them does. With this in mind, I see no reason to not have early expiration always enabled (thus the ability to force-disable mentioned above would really be a non-feature to me.)

Other valid mechanisms can be

sure, and they do not compete with each other. Having them all used at once is what provides the most effective cache strategy.

probabilistic early expiration (which may add unecessary overhead [...]).

I acknowledge this overhead at the payload size level. Do you have other issues or problems in mind that aren't answered here? This size overhead is, for now, the only downside I see in having probabilistic early expiration designed as it is now - ie always enabled and not optional.

I won't strongly object to making it optional by adding a way to set the default $beta, but I don't see the need for it (I do see the added complexity though ;))

About the payload size overhead, I think that people who will care about this do have a solution: the PSR-6 interface. And if they do care, they will be able to put it into practice and get their way out of the default recommended way.

Does this make sense?

@pounard

Contributor

pounard commented Jun 27, 2018

Probabilistic early expiration requires a $beta argument. In order to not force anyone to actually use the behavior
I also defined two limit conditions: using $beta = 0 should mean "don't do any early expiration", and $beta = INF should mean "do early expiration NOW". Considering this covers all the use cases + flexibility you mention, I made it into the interface.

Indeed, but this argument leaves the responsibility with the developer while closing the door to per-instance tuning by configuration (for example, when someone writes a bundle using this, but another, unrelated person runs it in a very different environment under very different load scenarios).

Thus having it as a core feature and not as an external middleware makes the most sense to me (because allowing the consumer to set the $beta also does.)

This makes sense. I think both approaches (being in the interface, and using a middleware) make sense. But since the CacheInterface does a heavy job, I'd rather see separate responsibilities, and the middleware approach allows arbitrary adapters to be used even if they don't implement this (and thus makes it easy to swap implementations depending on your bottlenecks or your environment). This way you leave the door open for developers to write their own complex/custom cache eviction or cache stampede protection strategies without being held hostage by your design. And you can still provide Symfony's implementation in the default configuration, which for most people will be a real benefit.
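To make the middleware idea concrete, here is a rough sketch of such a decorator around any PSR-6 pool; the class name, the get() helper, and the stored envelope format are hypothetical, not actual Symfony code:

    use Psr\Cache\CacheItemPoolInterface;

    // Hypothetical decorator: wraps any PSR-6 pool and adds probabilistic early
    // expiration around a get-or-compute helper.
    final class EarlyExpirationCacheProxy
    {
        private $pool;
        private $defaultBeta;

        public function __construct(CacheItemPoolInterface $pool, float $defaultBeta = 1.0)
        {
            $this->pool = $pool;
            $this->defaultBeta = $defaultBeta;
        }

        public function get(string $key, callable $callback, int $ttl = 3600, float $beta = null)
        {
            $beta = $beta ?? $this->defaultBeta;
            $item = $this->pool->getItem($key);

            if ($item->isHit()) {
                // The stored value is an envelope: [value, expiry timestamp, compute duration].
                list($value, $expiry, $delta) = $item->get();

                // Probabilistic early-expiration check: log($rand) <= 0, so a larger $beta
                // makes an early recomputation more likely as $expiry approaches.
                $rand = random_int(1, PHP_INT_MAX) / PHP_INT_MAX;
                if (INF !== $beta && microtime(true) - $delta * $beta * log($rand) < $expiry) {
                    return $value;
                }
            }

            // Miss or early expiration: recompute, measure how long it took, store again.
            $start = microtime(true);
            $value = $callback();
            $delta = microtime(true) - $start;

            $item->set(array($value, microtime(true) + $ttl, $delta));
            $item->expiresAfter($ttl);
            $this->pool->save($item);

            return $value;
        }
    }

The point is that this wraps any PSR-6 pool, so the strategy can be swapped, tuned per service, or removed entirely without touching the backend.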

I made the $beta argument default to null, leaving the default behavior as a choice for any concrete pool

Yes, I did understand that, and that is good; I don't remember it being documented so explicitly on the interface, though.

locking is a totally different strategy that doesn't compete with early expiration. The big difference is that with locking, ALL clients wait, which means ALL users experience the delay.

Sure, but sometimes that's exactly what you want. I think that Symfony's job as a framework is to provide the complete toolset and reliable, sensible, fast-enough defaults, and those stampede protections are very, very nice additions. On the other side, Symfony's responsibility is to let people easily tune, swap, change, and decide which strategies they want to put in place. In other words, I don't want to fight against the framework when I stumble upon edge cases in real life; I want it to help me!

About the payload size overhead, I think that people who will care about this do have a solution: the PSR-6 interface. And if they do care, they will be able to put it into practice and get their way out of the default recommended way.

Once again, I think that, as you documented yourself, CacheInterface is to become the preferred way to use the cache, and that's actually a very good idea. I don't like having many ways to skin a cat. But there's a lot of legacy code all around that will continue to use the PSRs and that could ideally benefit from the stampede protection strategies; that's another reason I like the idea of having flexible middlewares instead.

Does this make sense?

Yes! Thank you very much.

@nicolas-grekas

Member

nicolas-grekas commented Jun 27, 2018

this argument leaves the responsibility to the developer but closes the door to per-instance tuning by configuration

If the consumer is explicit, it wins (as it should, especially for INF). If it's not explicit (i.e. $beta = null), the pool config takes over with whatever default behavior it was configured with (e.g. no probabilistic expiry, as done for ArrayAdapter in #27734). Having $beta in the interface looks correct at the semantic level to me. I think we agree on that.
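For illustration, the resolution order described here amounts to something like this (illustrative code, not the actual implementation):

    // An explicit caller value wins; null falls back to the pool's configured default.
    $beta = $beta ?? $this->defaultBeta;

    if ($beta < 0) {
        throw new \InvalidArgumentException('$beta must be a non-negative float or INF.');
    }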

On the implementation side, we could turn it into a middleware stack, I agree. I have no big personal incentive to do it, especially since I think this can be done anytime later by someone who really needs it :) There would be added complexity, because the middleware stack would require some new interfaces to be made generic. Maybe fixing #27730 will provide those?

legacy code all around that will continue to use PSR

Yes, CacheInterface should IMHO become the way to go for most needs, yet I wouldn't go as far as calling PSR-based code "legacy". I think there are very legitimate (but specific) use cases where this "get()" won't be enough (e.g. when fetching several items at once is required).

from #27734 (comment)
[INF is] the only way to explicitly expire an item

Should we add a deleteItem() method on CacheInterface, for situations where you know a specific key isn't going to be needed anymore?

@pounard

Contributor

pounard commented Jun 27, 2018

Yes CacheInterface should IMHO become the way to go for most need, yet I wouldn't go as far as calling PSR-based code "legacy".

It'll probably indeed be very good for most needs. I said "legacy" about libraries or bundles, not about the PSR itself, but it's fair to put it that way: once you introduce a new API, the older ones become old, especially when they were meant to do the same thing.

Should we add a deleteItem() method on CacheInterface, for situations where you know a specific key isn't going to be needed anymore?

I don't think it's mandatory to add new methods to CacheInterface; it serves the purpose of exposing a very simple API for users. If it starts growing new methods, it'll never end and will probably end up as something very similar to the PSRs (and there you go, https://xkcd.com/927/, free of charge). It's probably best to keep it tight for cases where no manual invalidation is necessary.

All that said, CacheInterface seems to exist largely to expose the stampede protection itself, and that's what's bothering me somehow. I like it because exposing a callback to create the value reduces the cache API to the simplest possible need, and it's efficient and enough for most cases; in that way it places itself as a very high-level API for end users, nice! On the other hand, exposing stampede protection internals on something that presents itself as a high-level API sounds weird. But I might be wrong here.

@nicolas-grekas

Member

nicolas-grekas commented Jun 27, 2018

CacheInterface seems to be added to expose the stampede protection

Not only, but that's definitely part of it yes. Two reasons for this:

  • there are only 3 stampede protection strategies - having core support for them is legit and gives access to academically-backed features.
  • providing INF to force immediate expiration is really a MUST to me, otherwise, the range of use cases covered is MUCH smaller (like comparing a toy/helper to a feature IMHO.)

@pounard

Contributor

pounard commented Jun 27, 2018

there are only 3 stampede protection strategies - having core support for them is legit and gives access to academically-backed features.

There's a 4th one: none (and in some cases, it makes sense to have none). But yes I agree.

providing INF to force immediate expiration is really a MUST to me, or the range of use cases provided otherwise is MUCH smaller (like comparing a toy/helper to a feature IMHO.)

I blindly take your word on this one, but something bothers me: a parameter that can be NULL, INF (which is basically not really a number; I know that a var_dump() gives you double(INF)), or a number - it's a weird signature, isn't it?

@nicolas-grekas

Member

nicolas-grekas commented Jun 27, 2018

which is basically not really a number, it's a weird signature isn't it

cannot argue against "weird" since it's a feeling :)

Yet here is the reasoning: first of all, INF, -INF and NAN are really floats, as defined by the IEEE standards.

Now, the definition of $beta is: a float that controls the likelihood of triggering early expiration. 0 means "probability = 0", and the probability increases as $beta increases. Mathematically, providing INF effectively means "probability = 1". The docblock on the interface just makes this clear.
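For reference, the check described in the cache stampede literature linked earlier boils down to roughly this (a sketch, not Symfony's actual code):

    // $expiry: expiration timestamp of the item, $delta: how long the last
    // recomputation took, $beta: the float being discussed here.
    function shouldRecomputeEarly(float $expiry, float $delta, float $beta): bool
    {
        if (INF === $beta) {
            return true; // probability = 1: recompute right now
        }

        // log($rand) <= 0, so the subtraction pushes "now" toward the future: the
        // larger $beta and the closer to expiry, the more likely an early recompute.
        // With $beta = 0 this reduces to a plain expiry check (probability = 0).
        $rand = random_int(1, PHP_INT_MAX) / PHP_INT_MAX;

        return microtime(true) - $delta * $beta * log($rand) >= $expiry;
    }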

@pounard

Contributor

pounard commented Jun 27, 2018

"probability = 0", and the probability increases as $beta increases. Mathematically, providing INF effectively means "probability = 1"

Yes, I got that for sure, but are all users that good with probability? :)

Jokes aside, I've been asking myself since the beginning: I don't see how and when developers will play with the beta argument, since in most cases they will just call get('some_key', function () { return 'turlututu'; }); in a runtime hot path, and pray for Symfony to just do the right job by itself.

I strongly feel that in real life, people who need manual invalidation will probably use the PSR interface via events (for example, a Doctrine entity save event that invalidates the associated caches), but I'm not sure they will trigger the cache rebuild at that time. On the other hand, when loading it at some point, I guess most people will not attempt to play with the beta parameter either and will mostly focus on their business.

That is also why I like the middleware approach: when people do complex stuff to properly manage stampede protection or their invalidation strategy, they will probably want to do it somewhere other than in the controller, which is meant to carry business logic, and keep the hard cache-related stuff somewhere they can easily unplug it, or that an admin might understand and polish without having to touch the actual business code. That's just a hunch though, but still.

Probably, in the end, if they choose to explicitly rebuild the cache upon business data modification, a second method such as rebuild(), with the same signature as get(), would be more explicit and more comprehensible for users?
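For what it's worth, such a rebuild() could be a trivial wrapper over the current interface (a hypothetical helper, just to illustrate):

    // "Rebuilding" is just get() with $beta = INF, which forces immediate recomputation.
    function rebuild($pool, string $key, callable $callback)
    {
        return $pool->get($key, $callback, INF);
    }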

@pounard

Contributor

pounard commented Jun 27, 2018

I'm thinking that maybe we have very different cache usages in mind. I guess you are mostly optimizing for the system cache, in which case I have nothing to add to what you've done; it's just perfect to keep it as is. But I mostly think about cached business data, where the needs are very, very different and stampede protection would be a good thing to have too.

@nicolas-grekas

Member

nicolas-grekas commented Jun 27, 2018

Don't assume I'm optimizing for system caches, I don't: system caches usually don't need stampede protection to me, because they should always be append-only-and-never-expire-until-next-redeploy - and ideally, they should also be warmed up offline.

So yes, I think exposing INF to the users allows recomputing ahead of time while benefitting from any stampede protection provided around "get()". Doing this with only PSRs would defeat everything we're talking about :)

@nicolas-grekas

Member

nicolas-grekas commented Jun 28, 2018

(I feel like we could benefit from using Slack a bit, feel free to reach me on Symfony Devs'.)

@pounard

Contributor

pounard commented Jul 2, 2018

Don't assume I'm optimizing for system caches, I don't: system caches usually don't need stampede protection to me, because they should always be append-only-and-never-expire-until-next-redeploy

Yes, I totally agree with this. That's why I dislike systems where a user click may drop system caches (for example, Drupal).

@nicolas-grekas

Member

nicolas-grekas commented Sep 24, 2018

Should we add a deleteItem() method on CacheInterface

Answering this myself: that's not needed, because CacheInterface already allows deleting an item; here is how:
$pool->get('key', function ($item) { $item->expiresAfter(0); }, INF);

Yes, that's not the most convenient way, but at least it proves the CacheInterface is "complete" - and that the $beta argument is an integral part of this completeness.

I'm closing this because, to me, CacheInterface::get() is as it should be, both at the interface level and at the implementation level. There is no extra flexibility required.
Thanks for the brainstorm.

@pounard

Contributor

pounard commented Sep 24, 2018

I see that you closed it. For your information, before the summer I tried to find some time to reach you for a chat, but I didn't succeed, and summer being summer, we all have our own holidays and such, so I forgot about it. Here is a patch I made at that time: pounard@b1c2a40

It probably no longer applies cleanly, but basically it:

  • removes beta from the interface,
  • moves the (now configurable) implementation of probabilistic early expiration into a decorator/wrapper,
  • allows beta to be configured per namespace/service,
  • moves locking behind an interface, and provides a default implementation using your file-based code.

@nicolas-grekas

Member

nicolas-grekas commented Sep 24, 2018

Thanks, I think #28588 solves all the issues you raised (and doesn't change the things that, to me, shouldn't change, e.g. $beta).

@pounard

Contributor

pounard commented Sep 24, 2018

It does seem much cleaner than it was, I agree. I will take some time to re-read everything and have a look at it. Thanks.
