Adding a jitter to the backoff when an aws request gets throttled. #6980

Closed
philiiiiiipp opened this issue Nov 19, 2019 · 18 comments · Fixed by #6981

@philiiiiiipp
Contributor

Feature Proposal

Add exponential backoff instead of always 5 seconds.

Description

We are deploying quite often to our test AWS account. In doing so, I noticed that we are hitting CloudFormation API throttling:

Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Validating template...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
...Serverless: Recoverable error occurred (Rate exceeded), sleeping for 5 seconds. Try 1 of 4
Serverless: Recoverable error occurred (Rate exceeded), sleeping for 5 seconds. Try 1 of 4
Serverless: Recoverable error occurred (Rate exceeded), sleeping for 5 seconds. Try 2 of 4
Serverless: Recoverable error occurred (Rate exceeded), sleeping for 5 seconds. Try 3 of 4
Serverless: Recoverable error occurred (Rate exceeded), sleeping for 5 seconds. Try 4 of 4

setTimeout(doCall, 5000, numTry + 1);

This is at least what I expect is happening. The issue is that all my stacks will now retry in exactly 5 seconds, eventually causing them to fail.

I think there should be some kind of jitter added in order to have a higher chance of success.
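
For illustration, here is a rough sketch of what an exponential backoff with a jitter could look like in place of the fixed 5-second delay (doCall and numTry come from the snippet above; the constants and the full-jitter formula are only an example, not the actual patch):

// Sketch only: exponential backoff with full jitter instead of a fixed 5s delay.
const BASE_DELAY_MS = 5000;  // hypothetical base delay
const MAX_DELAY_MS = 60000;  // hypothetical cap on the backoff

function retryDelay(numTry) {
  // Exponential growth: 10s, 20s, 40s, ... capped at MAX_DELAY_MS.
  const exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** numTry);
  // Full jitter: pick a random delay in [0, exponential) so concurrent
  // deployments do not all retry at the same instant.
  return Math.floor(Math.random() * exponential);
}

setTimeout(doCall, retryDelay(numTry), numTry + 1);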

@medikoo
Contributor

medikoo commented Nov 21, 2019

@philiiiiiipp do you mean that you initialize deploys of numerous stacks at exactly the same time, and because the retries across them happen at about the same time, there's a higher risk of a Rate exceeded error?

That seems a bit unlikely to me, as there are many factors that I believe already de-sync the concurrent deployments.

@philiiiiiipp
Contributor Author

Yes, that's what I mean. It's at least what I believe I see. The moment we are deploying more than 5 stacks at the same time they start getting those timeouts; some of them will then go through after some retries, but some will fail.

I do not understand how it would de-sync concurrent deployments when they are all executed by different machines? I am not talking about one sls deploy from a machine, I am talking about many machines trying to set up a stack for integration tests within the same account.

@philiiiiiipp
Contributor Author

Looking further into it, I suppose that the error occurs when sls is checking for the stack status. Maybe it could also be an option to optionally decrease the number of checks and/or increase the interval between them.

@medikoo
Contributor

medikoo commented Nov 22, 2019

I am not talking about one sls deploy from a machine, I am talking about many machines trying to set up a stack for integration tests within the same account.

Is each machine about to deploy a different service stack, or the same one?

If it's the same one, then naturally such deploys should be consecutive and not happen at the same time (and adding a jitter won't help with anything).

@philiiiiiipp
Contributor Author

philiiiiiipp commented Nov 22, 2019

Yes, I understand that it does not make sense to deploy the same stack from multiple machines.

So let me make this clear:

  • There are X people working within the same account.
  • They all have their own stack.
  • They all potentially deploy their own stack at the same time ( at least I have no control over when they are deploying )

If more than a couple deploy at the same time, you will see Rate Exceeded errors, which is possibly due to the checking of the stack status. If they now all retry after 5 seconds, all will get the same error again and they will ultimately fail. That's why adding an exponential backoff with a jitter helps ( at least for us ) to prevent this from happening.

I mean in general I think that adding a jitter and exponential backoff when retrying an api request is a good idea to prevent congestion.

@medikoo
Contributor

medikoo commented Nov 22, 2019

If they now all retry after 5 seconds, all will get the same error again and they will ultimately fail.

One stack deployment usually lasts tens of seconds (depending on service complexity - it can even last minutes). In my understanding, they'll be able to deploy their stack once some stack that's currently being deployed finishes its deployment.

So while increasing retry times may help your specific case (e.g. making it 5s, 10s, 20s etc., instead of 5s, 5s, 5s), I'm not sure how adding a random factor helps (?)

The other thing is that for many other cases it is desirable to retry every 5s and not increase the gap.

I guess the best way to address your issue is to make MAX_TRIES configurable via a CLI option, so e.g. you could pass --max-request-retries=100 and it would retry 100 times (so for 500s).

Will that work for you?
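
For illustration, a rough sketch of how such an option might be wired in (the --max-request-retries name is only proposed above; the options object and the default of 4 attempts are assumptions, not existing code):

// Sketch only: read the proposed --max-request-retries option, falling back
// to the current default of 4 attempts ("options" stands for the parsed CLI options).
const maxTries = Number(options['max-request-retries']) || 4;

if (numTry < maxTries) {
  setTimeout(doCall, 5000, numTry + 1); // still a fixed 5s gap, just more attempts allowed
}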

@philiiiiiipp
Contributor Author

One stack deployment usually lasts tens of seconds (depending on service complexity - it can even last minutes). In my understanding, they'll be able to deploy their stack once some stack that's currently being deployed finishes its deployment.

I am not sure I understand what you mean here, but you can deploy as many stacks as you want in parallel.

So while increasing retry times may help your specific case (e.g. making it 5s, 10s, 20s etc., instead of 5s, 5s, 5s), I'm not sure how adding a random factor helps (?)

Ok, let's say you have an API which allows 10 requests a second and 1000 clients are requesting at the same time. At first 10 go through and 990 are left; they now all wait for 5 seconds and retry, so within 5 seconds you get ~10 clients to pass.

If you are adding a jitter, the load will be distributed, and you will in the best case get 50 clients through.
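
A toy simulation of that effect (purely illustrative; the 10 requests/second rate, the 60-second window and the 1-5s jitter are made-up numbers matching the example above, not anything from serverless):

// Sketch only: 1000 clients hitting an API that accepts 10 requests per second,
// comparing a fixed 5s retry against a random 1-5s jittered retry.
function simulate(useJitter) {
  const RATE_PER_SECOND = 10;
  const retryAt = new Array(1000).fill(0); // second at which each client (re)tries; null = done
  let served = 0;
  for (let second = 0; second < 60; second++) {
    let budget = RATE_PER_SECOND;
    for (let i = 0; i < retryAt.length; i++) {
      if (retryAt[i] === null || retryAt[i] > second) continue; // done, or not due yet
      if (budget > 0) {
        budget--;
        retryAt[i] = null; // request accepted
        served++;
      } else {
        // throttled: retry after a fixed 5s, or after a random 1-5s with jitter
        retryAt[i] = second + (useJitter ? 1 + Math.floor(Math.random() * 5) : 5);
      }
    }
  }
  return served; // clients served within 60 seconds
}

console.log('fixed 5s :', simulate(false)); // retries stay bunched together
console.log('jittered :', simulate(true));  // retries spread out over every second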

You can read about it here

The other thing is that for many other cases it is desirable to retry every 5s and not increase the gap.

Can you give me an example of such a case? My understanding is that you want to smear the requests out as much as possible to have the highest likelihood of success.

I guess the best way to address your issue is to make MAX_TRIES configurable via a CLI option, so e.g. you could pass --max-request-retries=100 and it would retry 100 times (so for 500s).

Will that work for you?

I suppose it's better than it is now, but you will still lose 4 seconds of potential API requests.

@medikoo
Contributor

medikoo commented Nov 25, 2019

Thanks @philiiiiiipp, I think I have a better understanding now. Rate limits apply to SDK requests, and not to stack update process concurrency (which is what I was referring to in previous comments).

Still, it's not perfectly clear to me how you happen to have SDK requests from multiple service deployments being issued at the very same time. That seems quite unlikely.
Is this about deploying the same kind of services (but e.g. different stages) at the very same time?

Have you actually tested that adding a jitter indeed brings observable improvement?

@philiiiiiipp
Contributor Author

philiiiiiipp commented Nov 26, 2019

To be honest, I am wondering if you are trolling me :-P.

Still, it's not perfectly clear to me how you happen to have SDK requests from multiple service deployments being issued at the very same time. That seems quite unlikely.
Is this about deploying the same kind of services (but e.g. different stages) at the very same time?

* There are X people working within the same account.

* They all have their own stack.

* They all potentially deploy their own stack at the same time ( at least I have no control over when they are deploying )

The limits are on an account basis, so the more people deploy at the same time ( different stages or different stacks ) to the same account, the more likely they are to hit those limits.

Have you actually tested that adding a jitter indeed brings observable improvement?

Like I said, adding exponential backoff and a jitter makes it less likely to hit the limits because you prevent them all from retrying at the same time, therefore making failures to deploy the stacks much less likely ( I mean the stack still deploys fine, it's just serverless that stops with an error code ). We are currently running on a forked version of serverless with the patch included and the deploys are not failing anymore with the load we are currently using.

@medikoo
Contributor

medikoo commented Nov 26, 2019

To be honest, I am wondering if you are trolling me :-P.

No, I want to be sure we come up with the right solution, and not just take in code that we merely assume may improve the situation.

They all potentially deploy their own stack at the same time ( at least I have no control over when they are deploying )

Adding a jitter, in my understanding, makes sense when we deal with a situation where AWS requests are issued at the very same time (with at least 0.1s accuracy) - and that, in my understanding, is hard to achieve even by issuing SLS deploys automatically at the very same time (there's a lot going on in between the AWS requests issued in a process, so naturally AWS requests should drift out of sync among deploys).

Additionally, above you're stating that deploys potentially happen at the same time. Can you elaborate on "potentially"? Are they manually released? If that's the case, you already naturally run AWS requests with an unpredictable pattern, and that should make them out of sync.

Is this then to address accidental, random cases (which may occur occasionally) where two concurrent deploys start to hit the AWS SDK at about the same time? As then, yes, in such scenarios, a jitter can help to de-sync the situation.

@philiiiiiipp
Contributor Author

Yes, don't get me wrong, I appreciate your thoroughness.

Looking further into it, I suppose that the error occurs when sls is checking for the stack status. Maybe it could also be an option to optionally decrease the number of checks and/or increase the interval between them.

https://github.com/serverless/serverless/blob/master/lib/plugins/aws/lib/monitorStack.js#L137

I think this is the culprit.
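
For context, the kind of fixed-interval polling being referred to looks roughly like this (a sketch only, not the actual monitorStack.js code; cloudformation, stackName and stillInProgress are placeholders):

// Sketch only: stack status is polled on a constant 5000ms interval,
// so every concurrent deployment hits CloudFormation on the same cadence.
const frequency = 5000;

const checkStatus = () => {
  cloudformation.describeStackEvents({ StackName: stackName }, (err, data) => {
    if (err) throw err;
    if (stillInProgress(data)) {
      setTimeout(checkStatus, frequency); // same fixed delay for everyone
    }
  });
};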

@medikoo
Contributor

medikoo commented Nov 26, 2019

I think this is the culprit.

I'm aware of that fixed interval setting. Did you read my response? :)

@philiiiiiipp
Contributor Author

Yes, but it's like I said, they are all potentially deploying at the same time, and if they do, every time multiple people deploy at the same time, they hit this case.

Is this then to address accidental, random cases (which may occur occasionally) where two concurrent deploys start to hit the AWS SDK at about the same time? As then, yes, in such scenarios, a jitter can help to de-sync the situation.

Yes. That is what I am trying to say.

@medikoo
Contributor

medikoo commented Nov 26, 2019

every time multiple people deploy at the same time, they hit this case.

That every time doesn't feel well explored to me, as it shouldn't be likely for AWS requests to get in sync across different deploys (even if the deploys are issued at the very same time).

Anyway, given that it can help in those occasional cases, I think adding a jitter makes sense. Let me review the PR.

@philiiiiiipp
Contributor Author

philiiiiiipp commented Nov 26, 2019

Right, I can elaborate further there. Our CI/CD system automatically tests the deployment of the stacks after each push to a branch. The deployment of a stack, like you said, can take up to a couple of minutes ( in case of Elasticsearch ), during which it constantly checks the deployment status.

Some developers are pushing quite frequently, which means a large number of deployments happen at the same time. Pretty much every time more than 5 builds are deploying, each of which in turn deploys multiple stacks, we see failures due to the rate limit.

So, you are right that it is not happening every time, just when a couple of stacks align by chance, which gets more likely the more stacks we deploy in parallel.

I think, in addition to the jitter, maybe allowing the frequency parameter to be changed here could also be a good idea.

@medikoo
Contributor

medikoo commented Nov 26, 2019

I think, in addition to the jitter, maybe allowing the frequency parameter to be changed here could also be a good idea.

Yes, totally. I think we should be able to tweak it via e.g. an SLS_AWS_MONITORING_INTERVAL env var (you may add support for that in the same PR).

@philiiiiiipp
Contributor Author

Ok cool, I will amend the PR over the weekend

@philiiiiiipp
Contributor Author

I added support for an environment variable to change the frequency: SLS_AWS_MONITORING_FREQUENCY. I was debating whether to call it Interval or Frequency, but since there is already a frequency options argument I went with the latter.

I also made the backoff calculation simpler and rounded the number of seconds shown in the log.
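
For illustration, the described behaviour could look roughly like this (a sketch only, not the actual PR code; the backoff formula, the 60s cap, and the log, message and MAX_TRIES names are assumptions):

// Sketch only: an env var overrides the stack monitoring frequency,
// and the jittered backoff is logged as a rounded number of seconds.
const frequency = Number(process.env.SLS_AWS_MONITORING_FREQUENCY) || 5000;

const backoffMs = Math.random() * Math.min(60000, 5000 * 2 ** numTry);
log(`Recoverable error occurred (${message}), sleeping for ${Math.round(backoffMs / 1000)} seconds. Try ${numTry} of ${MAX_TRIES}`);
setTimeout(doCall, backoffMs, numTry + 1);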
