Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter #6028
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅ (…and 21 files with indirect coverage changes.)
This might change if we go with #6030, e.g. it could be implemented as a wrapper, and then it could collect accurate reserve->cancel metrics too. I'm not sure that's currently necessary: I believe all our reserve->cancel logic uses per-host limiters as the first level, rather than something intended to be global. But that does seem like something we will eventually change, so it'd be risky to leave without at least documenting.
// - this limit is legitimately unused and no data exists for it
// - ring re-sharding led to losing track of the limit, and it is now owned by a host with insufficient data
//
// To mitigate the impact of the second case, "insufficient data" responses from aggregating hosts
Perhaps it's worth adding some metrics to track when we fall back.
Yeah, I'm planning on metrics. That'll be outside these objects though, and it's why Collect() returns the "use fallback" bool.
If there is a transition from non-fallback to fallback or vice versa, the counter will be inaccurate.
If you mean the limiting behavior will be inaccurate, e.g. it'll allow / deny more requests than it should: yes, but we don't have any great way to resolve that. *rate.Limiter doesn't allow directly manipulating its tokens, and there isn't even a true "correct" behavior if this happens, because it implies we don't know the correct behavior: we haven't been able to sync with other instances.

It'll only be incorrect for one "time to fill the ratelimiter's burstable tokens" period though, which is currently 1 second (limit == burst in basically all of our uses). Generally speaking that'd mean ~2x overage at worst, assuming all hosts switched at the same time.

I could try to empty the tokens when switching, which would be safer for us because it wouldn't allow exceeding either limit, but we don't have an efficient API for that. Ideally it'd check the available tokens and e.g. binary-search Allow(n) to empty most or all of them, since it's racing with other calls. Right now it'd need around thousands of calls per limiter to drain, which is definitely excessive.

The #6030 wrapper could do this relatively precisely and efficiently, as it can lock everything while draining. If we think 1 second of overage is worth preventing, I suspect it'd be relatively easy to add a "drain the limiter" API for this kind of purpose. I'm... ~50% "yes, worth it"? Can definitely be convinced to build that.
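The binary-search drain described above can be sketched roughly like this. This is not code from the PR; `toyLimiter` is a deterministic stand-in for `*rate.Limiter`'s `AllowN`/`Burst` (no refill, no time argument), just to show that the search empties a bucket in roughly O(log burst) calls rather than one call per token:

```go
package main

import "fmt"

// toyLimiter is a stand-in for *rate.Limiter, exposing only the two
// methods the drain sketch needs. It does not refill, to keep the
// example deterministic; the real limiter refills while draining.
type toyLimiter struct {
	tokens int
	burst  int
}

func (l *toyLimiter) Burst() int { return l.burst }

// AllowN reports whether n tokens are available, consuming them if so.
func (l *toyLimiter) AllowN(n int) bool {
	if n <= l.tokens {
		l.tokens -= n
		return true
	}
	return false
}

// drain consumes the available tokens via binary search on AllowN:
// take n tokens when possible, halve n when not, until n reaches zero.
func drain(l *toyLimiter) int {
	drained := 0
	for n := l.Burst(); n > 0; {
		if l.AllowN(n) {
			drained += n // got n, try the same amount again
		} else {
			n /= 2 // too many requested: halve and retry
		}
	}
	return drained
}

func main() {
	l := &toyLimiter{tokens: 700, burst: 1000}
	fmt.Println(drain(l)) // 700
}
```

Against the real `*rate.Limiter` this is only approximate, since concurrent `Allow()` calls and refill race with the search, which is the imprecision the comment above is pointing at.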
tl;dr from an in-person discussion, clarifying some of the surroundings here:

A layer above this, which holds all these individual tracking-and-fallback-able limiters, will be periodically calling Collect(), and it'll emit a counter for "using fallback" so we can monitor how often fallbacks are used.

But "uses fallback" should only occur in practice:
- temporarily after startup / when a key is first used (no data, must use fallback)
  - this might be worth tracking separately; the failed-update counter can show when this is happening (negative value)
- when the global limiter is disabled and rejects all requests (falling back is desired here)
- when something cataclysmic is happening and global limit data is being lost rapidly, e.g. many host crashes -> ring changes rapidly -> does not return weights before crashing too (falling back is desirable here too, as the old values are old enough not to trust)

So the sustained existence of fallback-using limiters when we expect the global ratelimiter to be working is a concern, but individual "failed update" events or temporary fallback use are not.

The exact metrics emitted are going to be in a later diff; that all happens outside of this object.
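As a rough illustration of that layer-above behavior (the names `limiter`, `Collect`, and `countFallbacks` here are hypothetical, not the PR's actual API), the periodic metrics sweep might look like:

```go
package main

import "fmt"

// limiter is an illustrative stand-in for the tracked, fallback-capable
// limiter: usage counts since the last Collect, plus the fallback flag.
type limiter struct {
	accepted, rejected int
	usingFallback      bool
}

// Collect returns usage since the last call and resets the counters,
// along with the "use fallback" bool the metrics layer cares about.
func (l *limiter) Collect() (accepted, rejected int, usingFallback bool) {
	accepted, rejected = l.accepted, l.rejected
	l.accepted, l.rejected = 0, 0
	return accepted, rejected, l.usingFallback
}

// countFallbacks is what a periodic tick might do: sweep all limiters
// and count how many are currently using their fallback, to emit as a
// counter. Sustained nonzero values would be the concerning signal.
func countFallbacks(limiters map[string]*limiter) int {
	n := 0
	for _, l := range limiters {
		if _, _, fallback := l.Collect(); fallback {
			n++
		}
	}
	return n
}

func main() {
	limiters := map[string]*limiter{
		"key-a": {accepted: 10},
		"key-b": {usingFallback: true}, // e.g. no data yet after startup
	}
	fmt.Println(countFallbacks(limiters)) // 1
}
```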
…er (uber#6028)

This commit largely exists to break this out for separate review, as it's relatively simple and isolated.

In a later PR, this will be used by the "limiter"-side logic to track and enforce usage of a ratelimit. At a high level it does only a couple things:

1. Tracks calls to `Allow() bool` so we can find out how much a limit is actually used (so we can distribute the allowances fairly)
2. Updates ratelimit rates when requested
3. Automatically falls back to a different ratelimit if the system seems unhealthy

^ all of which is done in an atomic-like way so concurrent operations do not block each other, as true precision is not necessary. It only has to be "good enough", and performance is more important than precision.
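The three behaviors the commit message lists could be sketched along these lines. This is a hedged illustration, not the PR's implementation: `trackedLimiter`, `allower`, and the failed-update threshold are all invented here, and the real code wraps `*rate.Limiter` rather than the trivial `always` stand-in used to keep the example self-contained:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// allower is a stand-in for *rate.Limiter's Allow method.
type allower interface{ Allow() bool }

type always bool

func (a always) Allow() bool { return bool(a) }

// trackedLimiter sketches the commit's description: atomic usage
// counters so concurrent Allow() calls don't block each other, a
// primary (global) limit, and a fallback used when updates stop
// arriving and the system seems unhealthy.
type trackedLimiter struct {
	accepted, rejected atomic.Int64
	failedUpdates      atomic.Int32
	primary, fallback  allower
}

// useFallback reports whether enough updates have failed in a row to
// stop trusting the globally-computed limit (threshold is illustrative).
func (l *trackedLimiter) useFallback() bool { return l.failedUpdates.Load() >= 3 }

func (l *trackedLimiter) Allow() bool {
	lim := l.primary
	if l.useFallback() {
		lim = l.fallback
	}
	if lim.Allow() {
		l.accepted.Add(1)
		return true
	}
	l.rejected.Add(1)
	return false
}

// Collect swaps the counters to zero and returns them, along with the
// "use fallback" bool: atomic-like, so it can race with Allow() calls
// without blocking them ("good enough" beats perfect precision here).
func (l *trackedLimiter) Collect() (accepted, rejected int64, usingFallback bool) {
	return l.accepted.Swap(0), l.rejected.Swap(0), l.useFallback()
}

func main() {
	l := &trackedLimiter{primary: always(true), fallback: always(false)}
	l.Allow()
	l.Allow()
	a, r, fb := l.Collect()
	fmt.Println(a, r, fb) // 2 0 false
}
```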