Adding a jitter to the backoff when an aws request gets throttled. #6980
@philiiiiiipp do you mean that you initialize deploys of numerous stacks at exactly the same time, and that because retries across them happen at about the same time, there's a higher risk of a Rate exceeded error? That seems a bit unlikely to me, as there are many factors that I believe already de-sync the concurrent deployments.
Yes, that's what I mean. It's at least what I believe I see. The moment we are deploying more than 5 stacks at the same time they start getting those timeouts; some of them will then go through after some retries, but some will fail. I do not understand how it would de-sync concurrent deployments when they are all executed by different machines? I am not talking about one machine.
Looking further into it, I suppose that the error occurs when
Is each machine about to deploy a different service stack, or the same one? If it's the same one, then naturally such deploys should be consecutive and not happen at the same time (and adding jitter won't help at all).
Yes, I understand that it does not make sense to deploy the same stack from multiple machines. So let me make this clear:
If more than a couple deploy at the same time, you will see those rate-limit errors. In general, I think that adding jitter and exponential backoff when retrying an API request is a good idea to prevent congestion.
One stack deployment usually lasts tens of seconds (depending on service complexity it can even last minutes). In my understanding they'll be able to deploy their stack once some stack that's being deployed finishes its deployment. So while increasing retry times may help your specific case (e.g. making it 5s, 10s, 20s etc., instead of 5s, 5s, 5s), I'm not sure how adding a random factor helps (?) Another thing is that for many other cases it is desirable to retry every 5s and not increase the gap. I guess the best way to address your issue is to make that interval configurable. Will that work for you?
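The two retry schedules contrasted above (a fixed 5s interval versus a doubling one) can be sketched in a few lines; this is an illustration only, not code from the project:

```javascript
// Fixed schedule: every retry waits the same 5 seconds.
const fixedSchedule = (attempts) => Array.from({ length: attempts }, () => 5);

// Exponential schedule: each retry doubles the previous delay, starting at 5 s.
const exponentialSchedule = (attempts) =>
  Array.from({ length: attempts }, (_, i) => 5 * 2 ** i);

console.log(fixedSchedule(4));       // [5, 5, 5, 5]
console.log(exponentialSchedule(4)); // [5, 10, 20, 40]
```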
I am not sure I understand what you mean here, but you can deploy as many stacks as you want in parallel.
Ok, let's say you have an API which allows 10 requests a second and 1000 clients request at the same time. At first 10 go through and 990 are left; they now all wait 5 seconds and retry, so within each 5-second window you get ~10 clients through. If you add jitter, the load gets distributed, and in the best case you get 50 clients through. You can read about it here
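The arithmetic above can be checked with a quick back-of-envelope simulation (a simplified capacity model with illustrative numbers, not anything from the project):

```javascript
// Count how many retries succeed, assuming the API serves at most
// `capacityPerSecond` requests in any given one-second bucket.
function served(delays, capacityPerSecond) {
  const perSecond = {};
  let ok = 0;
  for (const d of delays) {
    const second = Math.floor(d);
    perSecond[second] = (perSecond[second] || 0) + 1;
    if (perSecond[second] <= capacityPerSecond) ok += 1;
  }
  return ok;
}

const clients = 990; // the 990 clients throttled on the first attempt
const fixed = Array.from({ length: clients }, () => 5);                    // all retry at t = 5 s
const jittered = Array.from({ length: clients }, () => Math.random() * 5); // retry at t in [0, 5)

console.log('fixed delay:', served(fixed, 10));    // 10 - everyone collides in one second
console.log('full jitter:', served(jittered, 10)); // typically 50 - spread over 5 seconds
```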
Can you give me an example of a case? My understanding is that you want to smear the requests out as much as possible to have the highest likelihood of success.
I suppose it's better than it is now, but you will still lose 4 seconds of potential API requests.
Thanks @philiiiiiipp, I think I have a better understanding now. Rate limits apply to SDK requests, and not to stack update process concurrency (which is what I was referring to in previous comments). Still, it's not perfectly clear to me how you happen to have SDK requests from multiple service deployments being issued at the very same time. That seems quite unlikely. Have you actually tested that adding a jitter indeed brings observable improvement?
To be honest, I am wondering if you are trolling me :-P.
The limits are on an account basis, so the more people deploy at the same time (different stages or different stacks) to the same account, the more they will hit those limits.
Like I said, adding exponential backoff and jitter makes it less likely to hit the limits because you prevent them all retrying at the same time, which makes stack deployment failures much less likely (the stack is still deploying fine; it's just serverless that stops with an error code). We are currently running on a forked version of serverless with the patch included, and deploys are no longer failing under our current load.
No, I want to be sure we come up with the right solution, and not just take in code that we merely assume may improve the situation.
Adding jitter, in my understanding, makes sense when AWS requests are issued at the very same time (with at least 0.1s accuracy) - and that, in my understanding, is hard to achieve even by issuing SLS deploys automatically at the very same time (there's a lot going on between the AWS requests issued in a process, so naturally AWS requests should go out of sync across deploys). Additionally, above you're stating that deploys potentially happen at the same time. Can you elaborate on "potentially"? Are they manually released? If that's the case, you already naturally run AWS requests in an unpredictable pattern, and that should make them out of sync. Is this then to address accidental, random cases (which may occur occasionally) where two concurrent deploys start to hit the AWS SDK at about the same time? As then, yes, in such scenarios, jitter can help to de-sync the situation.
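For reference, exponential backoff with "full jitter" (a common pattern for de-syncing retries) can be sketched roughly as below. This is an illustration under stated assumptions, not the actual patch; in particular, treating `err.code === 'Throttling'` as the retryable condition is an assumption for this sketch.

```javascript
// Delay grows as baseMs * 2^attempt, capped at capMs; the actual wait is
// drawn uniformly from [0, delay) so concurrent clients spread out.
function backoffDelayMs(attempt, baseMs = 5000, capMs = 60000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // full jitter
}

async function withRetries(fn, maxAttempts = 5, baseMs = 5000) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      // Only retry throttling errors; rethrow everything else.
      if (err.code !== 'Throttling' || attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) =>
        setTimeout(resolve, backoffDelayMs(attempt, baseMs))
      );
    }
  }
}
```

With full jitter, two clients that fail at the same instant almost never retry at the same instant, which is exactly the de-sync effect discussed above.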
Yes, don't get me wrong, I am happy for your thoroughness.
https://github.com/serverless/serverless/blob/master/lib/plugins/aws/lib/monitorStack.js#L137 I think this is the culprit. |
I'm aware of that fixed interval setting. Did you read my response? :)
Yes, but like I said, they are all potentially deploying at the same time, and every time multiple people deploy at the same time, they hit this case.
Yes. That is what I am trying to say.
That every time doesn't feel well explored to me, as it shouldn't be likely for AWS requests to get in sync across different deploys (even if the deploys are issued at the very same time). Anyway, given that it can help in those occasional cases, I think adding jitter makes sense. Let me review the PR.
Right, I can elaborate further there. Our CI/CD system automatically tests the deployment of the stacks after each push to a branch. The deployment of a stack, like you said, can take up to a couple of minutes (in the case of Elasticsearch), during which it constantly checks the deployment status. Some developers push quite frequently, which means a large number of deployments happen at the same time. Pretty much every time more than 5 builds are deploying, which in turn deploy multiple stacks, we see failures due to the rate limit. So you are right that it is not happening every time, just when a couple of stacks align by chance, which gets more likely the more stacks we deploy in parallel. I think, in addition to the jitter, maybe allowing the polling interval to be changed would help.
Yes, totally. I think we should be able to tweak it via e.g. an environment variable.
Ok cool, I will amend the PR over the weekend.
I added support for the environment variable to change the frequency. I also made the backoff calculation simpler and log a rounded number.
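A minimal sketch of what an env-var-configurable polling frequency could look like; the variable name `SLS_MONITORING_FREQUENCY` is hypothetical, not necessarily what the PR used:

```javascript
// Default stack-monitoring poll interval, per the thread: 5 seconds.
const DEFAULT_FREQUENCY_MS = 5000;

// Read an override from the environment, falling back to the default
// when the variable is unset or not a positive number.
function monitorFrequencyMs(env = process.env) {
  const parsed = Number(env.SLS_MONITORING_FREQUENCY); // assumed variable name
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_FREQUENCY_MS;
}
```

Passing `env` as a parameter (instead of reading `process.env` directly) keeps the function easy to test.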
Feature Proposal
Add exponential backoff instead of a fixed 5-second retry interval.
Description
We are deploying quite often to our test AWS account. In doing so, I noticed that we are hitting CloudFormation API throttling:
serverless/lib/plugins/aws/provider/awsProvider.js, line 255 (commit 801539a)
This is at least what I expect is happening. The issue is that all my stacks will now retry in exactly 5 seconds, eventually causing them to fail.
I think there should be some kind of jitter added in order to have a higher chance of success.