
ThrottlingException: Rate exceeded #1344

Open
Arisfx opened this issue Feb 7, 2022 · 22 comments
Labels
kind/bug Something isn't working

Comments

@Arisfx

Arisfx commented Feb 7, 2022

Hi team, do you know how we can avoid the rate exceeded error?

Scanned states (7)      
ThrottlingException: Rate exceeded
        status code: 400, request id: 0474b16c-faee-402a-bf01-1e2a7c005714
@sundowndev
Contributor

sundowndev commented Feb 7, 2022

Hi @Arisfx, the rate limit error can occur when your cloud account has a huge number of resources, even ones not managed by Terraform. This is one of the known limitations of driftctl. Are you running in deep mode? If so, could you consider running in non-deep mode instead? Note that driftctl will then no longer be able to show drifts in attributes. If you've identified which resource(s) are causing this, you can try ignoring that particular resource type using the driftignore file or the filter flag. If none of those solutions fit your needs, could you give further details on your use case?

Thanks 🙏🏻
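
For illustration, the filter flag takes a JMESPath expression that restricts what gets scanned (a sketch based on the driftctl docs; swap in the resource type that is causing the throttling):

driftctl scan --filter "Type=='aws_s3_bucket'"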

@Arisfx
Author

Arisfx commented Feb 10, 2022

Hi @sundowndev, thank you for your response. It seems that if I run it several times, it eventually does the job. May I ask if it's possible to combine several filters? For example, how can I filter for aws_route53_zone and aws_route53_record at the same time? Unfortunately, more than one filter flag cannot be combined.
Thanks!

@sundowndev
Contributor

I think what you're looking for is a driftignore file. You will find proper examples in the docs: https://docs.driftctl.com/0.20.0/usage/filtering/driftignore

How can i filter for aws_route53_zone and aws_route53_record at the same time?

# Search for drifts except for aws_route53_zone and aws_route53_record
aws_route53_zone.*
aws_route53_record.*

# Ignore all drifts except for aws_route53_zone and aws_route53_record
*
!aws_route53_zone.*
!aws_route53_record.*

Does that help?

@Arisfx
Author

Arisfx commented Feb 10, 2022

Thank you for your response. Perhaps I didn't make myself clear: I meant that we need to scan for only these 2 resources, not exclude them :)

@eliecharra
Contributor

Thank you for your response. Perhaps I didn't make myself clear: I meant that we need to scan for only these 2 resources, not exclude them :)

You can do this with the second snippet that @sundowndev posted above:

# Ignore all drifts except for aws_route53_zone and aws_route53_record
*
!aws_route53_zone
!aws_route53_record

The first wildcard makes sure we switch to an ignore-everything mode, except for what is prefixed with !.

This will save you a lot of API calls and can definitely help with rate limit issues.

@brunzefb

brunzefb commented Apr 25, 2022

@sundowndev Just ran into the throttling issue as well with driftctl. I created a support case with AWS to increase the allowed API rate. They told me there would be too many rates to increase, and to ask the authors instead to implement exponential backoff when making AWS calls that hit the throttling exception. While this may hurt the performance of the tool, maybe that does not matter so much, especially if you are running it as a cron job once a day.

From AWS Support:

We would generally suggest that API calls should be made with a retry and exponential backoff in order to gracefully handle throttling when it occurs [2]. When narrowing down to calls from your IAM user around the reported times, I see a very aggressive call rate which suggests to me that this tool is not implementing such a backoff and retry strategy, or if it is, it is not retrying enough, or is not backing off enough. This strategy should work well with supported providers.
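
For reference, a minimal sketch of the kind of backoff AWS Support describes, using the aws-sdk-go v1 client.DefaultRetryer, which already implements exponential backoff with jitter. This is only an illustration of the idea under those assumptions, not how driftctl currently builds its AWS sessions:

package main

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

// newThrottleTolerantSession returns a session whose clients retry throttled
// calls with exponential backoff instead of failing on the first
// ThrottlingException.
func newThrottleTolerantSession() (*session.Session, error) {
	return session.NewSession(&aws.Config{
		Retryer: client.DefaultRetryer{
			NumMaxRetries:    10,               // retry each throttled call up to 10 times
			MinThrottleDelay: 1 * time.Second,  // wait at least 1s after a throttle
			MaxThrottleDelay: 30 * time.Second, // cap the exponential backoff
		},
	})
}

func main() {
	sess, err := newThrottleTolerantSession()
	if err != nil {
		panic(err)
	}
	// Any service client built from this session inherits the retryer.
	_ = route53.New(sess)
}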

@aroes

aroes commented May 17, 2022

This is an issue for me as well; I agree that a retry and backoff strategy should be implemented, as @brunzefb's AWS support contact suggests. Getting a full overview of a large account is almost impossible, since the tool exits as soon as it runs into the error.

@gmaghera

gmaghera commented Jun 6, 2022

Neither ignore nor retries seem to address this issue directly.

Would it be possible to break what driftctl does into batches? And perhaps give the user control over batch sizes and the pause before moving to the next batch?

@eliecharra
Contributor

Retry will address this issue. When we encounter a rate limit error, we'll enter an exponential backoff retry loop, so requests will be postponed and the scan will take longer but will no longer be interrupted. @moadibfr is working on that, but we are also currently splitting the enumeration out of driftctl into a separate Go module for better separation of concerns, so it'll take some time for the retry-on-rate-limit mechanism to be implemented.

Would it be possible to break what driftctl does into batches?

That sounds complicated, because the goal of driftctl is to enumerate resources, so you cannot batch a list you do not have yet. We could think of another batching logic, for example by resource type; you can achieve this manually with the driftignore file, see my answer above and the sketch below.

We are aware that this is a very important pain point for many of you, and this rate limit issue is definitely on our plate 🙏🏻
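
To make the manual batching concrete, here is a sketch (the file names and resource types are just an illustration, and it assumes the --driftignore flag from the docs):

# batch-route53.driftignore: first pass, scan only Route53
*
!aws_route53_zone.*
!aws_route53_record.*

# batch-iam.driftignore: second pass, scan only IAM roles and policies
*
!aws_iam_role.*
!aws_iam_policy.*

driftctl scan --driftignore batch-route53.driftignore
driftctl scan --driftignore batch-iam.driftignore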

@brunzefb

I think how long the program runs is less important, so backoff/retry is a good thing. We are running driftctl on an EKS cluster with a Python wrapper, as a pod launched by a cron job. So if it takes an hour to run, it does not matter if you run it every 12h. The wrapper compares the driftctl JSON output to the expected output and emails us if there are diffs. We then have a stern talk with those AWS console users who did not use Terraform to make the changes. I mostly care about IAM and security group changes, and if you limit driftctl to those, it is generally not API rate limited.

Best,
F.

@gmaghera

Do you have an idea where solving this is on your roadmap?

@moadibfr
Contributor

Hey @gmaghera, we identified how we could improve that, but it is tied to the extraction and update of the enumeration in driftctl.
Unfortunately, I don't think we have information about when that could happen.
Maybe @sjourdan has more insight and could clarify this.

@gmaghera

gmaghera commented Oct 17, 2022

Thank you for the update @moadibfr.

BTW, we moved over to using CloudQuery's drift measurement because of the throttling issue. But they decided to stop supporting drift measurement, for reasons not known to me.

You have a special tool on your hands -- HashiCorp only recently announced drift detection support. With throttling handled, driftctl would be a sweet, sweet enterprise-level tool (it is already, albeit with some limitations).

@eliecharra
Contributor

Very valuable feedback @gmaghera, thanks 🙏🏻

We are very sorry that we could not share any status update on that 😟
We are currently in a complicated situation regarding driftctl: the company behind it (CloudSkiff) was acquired a year ago, and our focus is currently not on actively improving driftctl.
Unfortunately, we also made some changes that put driftctl in a state where it's rather complicated for newcomers to work on, so handing that issue over to the community does not sound like a decent option.

We'll keep you updated as soon as we can 🙏🏻

@b0bu

b0bu commented Oct 27, 2022

Similar issue here, but for scanned resources and only from "within" AWS. When running from my laptop I don't get the issue, but when running from EC2 or as part of a CodeBuild project it consistently fails for a single region with a relatively empty account (100 resources or so, maybe less). Again, only from within AWS. Any ideas?

Scanned states(3)
Scanned resources
    ThrottlingException: Rate exceeded
    status code: 400, request id: 259b7f44-6e33-431f-9435-dac2a30e2db6

@johnalotoski

johnalotoski commented Feb 3, 2023

Using cpulimit may help to slow the rate of API calls down by throttling the CPU usage of the app as a whole.

Example, limit CPU usage to 25%:

cpulimit -l 25 driftctl scan --from tfstate://*.tfstate

Using cgroups would work better than this, I think, at the cost of being a bit more involved to apply; one option is sketched below.

Just playing around with cpulimit a bit: limiting to about 5% on my machine doubles the scan time and stops the throttle errors. Going any lower (like all the way to 1%) causes some AWS authentication errors to start being thrown, presumably because the app doesn't respond quickly enough for some of the handshakes or API flows.

So it seems like there is a sweet spot with something as simple as cpulimit to help with this -- at least for my machine anyway.
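
For the cgroups route, one possibility on a systemd host is a transient scope with a CPU quota (a sketch only; I have not verified this with driftctl, so adjust to your environment):

systemd-run --user --scope -p CPUQuota=25% -- driftctl scan --from tfstate://*.tfstate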

@bshramin

Although the main issue is definitely not solved, there are a couple of helpful flags here that you can use to limit the scope you want to monitor in order to avoid the rate limit exception.

https://docs.driftctl.com/next/usage/cmd/scan-usage/

@drem-darios

I was able to get past this error by implementing exponential backoff in the repository that was triggering the throttle exception. In my case, it was API Gateway limits I was hitting. You can see here: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html that API Gateway allows 5 requests every 2 seconds per account for GetResources, and I was hitting that limit pretty frequently. There is also a 10 requests per second limit across all API Gateway management operations.

To work around those limits, I added some code to the api_gateway_repository.go file that exponentially backs off requests when we receive a "TooManyRequestsException" error. I set the base delay at 2 seconds since that was the limit we were hitting. I also had to add this logic to every function making a request to API Gateway, since any of them could trigger the overall operation limit (e.g. GetRestApisPages reaches the limit, then a call to, say, GetAccount will trigger the throttle). Here is the logic for GetRestApisPages as an example.

  const MaxRetries = 5
  
  if err != nil {
	  retries := 0
	  retry := true
  
	  for retry && retries < MaxRetries {
		  sleepTime := time.Duration(math.Pow(2, float64(retries))) * 2 * time.Second
		  logrus.Warn("Error caught during GetRestApisPages! Attempt number ", retries+1, "/", MaxRetries, ". Retrying after sleeping for ", sleepTime, "...")
		  time.Sleep(sleepTime)
		  logrus.Debug("Awake! Attempting to make GetRestApisPages call again.")
		  err = r.client.GetRestApisPages(&input,
			  func(resp *apigateway.GetRestApisOutput, lastPage bool) bool {
				  restApis = append(restApis, resp.Items...)
				  return !lastPage
			  },
		  )
		  if err != nil && strings.Contains(err.Error(), "TooManyRequestsException") {
			  retry = true
		  } else {
			  retry = false
		  }
  
		  retries++
	  }
  }

To reduce duplicated code, I implemented a helper function:


func retryOnFailure(callback func() error, message string) error {
	retries := 0
	retry := true

	var err error
	for retry && retries < MaxRetries {
		sleepTime := time.Duration(math.Pow(2, float64(retries))) * 2 * time.Second
		logrus.Warn(message, "Attempt number ", retries+1, "/", MaxRetries, ". Retrying after sleeping for ", sleepTime, "...")
		time.Sleep(sleepTime)
		logrus.Debug("Awake! Attempting to make API call again.")

		err = callback()
		if err != nil && strings.Contains(err.Error(), "TooManyRequestsException") {
			retry = true
		} else {
			retry = false
		}

		retries++
	}
	return err
}

Now I can check for an error on the first call, then go into exponential backoff if there was one:

if err != nil {
		err = retryOnFailure(func() error {
			logrus.Debug("Making a call to get rest APIs not found in cache")
			err = r.client.GetRestApisPages(&input,
				func(resp *apigateway.GetRestApisOutput, lastPage bool) bool {
					restApis = append(restApis, resp.Items...)
					return !lastPage
				},
			)
			return err
		}, "Error caught during GetRestApisPages!")
	}

I'm happy to contribute this code to the project if everyone thinks it will be helpful. This logic should probably be implemented in other places/repositories too...

@herrsergio

Hi, I have a similar issue. I am executing driftctl in a subdirectory with its own Terraform backend state.

driftctl scan --only-managed

Using Terraform state tfstate+s3://XXXXX/XXXXX/XXXXX/terraform.tfstate found in terraform-backend.tf. Use the --from flag to specify another state file.
INFO[0001] Start reading IaC
Scanned states (1)
INFO[0003] Start scanning cloud provider


TooManyRequestsException: Too Many Requests
{
  RespMetadata: {
    StatusCode: 429,
    RequestID: "6ca8acef-5412-402d-8825-a72c10a15f77"
  },
  Message_: "Too Many Requests"
}

@nsballmann

This doesn't seem to be related to the number of resources within the Terraform state. Even with just 12 resources, I unfortunately run into this issue constantly in our CI, despite prefixing driftctl with cpulimit --limit=5 --include-children -- (on 16-core machines inside Alpine containers).

Furthermore, the Terraform state is not hosted on S3, so those connections don't even count towards the rate limit.

This happens so often that I already added allow_failure: true to the scan job, so the scan jobs don't block the GitLab MRs. 😬

@valdestron

Regarding this issue, could driftctl implement some caching mechanism and/or retry itself?
I think it would be nice if driftctl cached the scan state until it hits an error; then, if it is started again by an outer retry, it would pick up the cache and continue with the tfstate files that are left.

It would be very convenient when you run

driftctl scan --from tfstate+s3://fileone --from tfstate+s3://filetwo --from tfstate+s3://filethree ....

could have a flag

driftctl scan --cached --from tfstate+s3://fileone --from tfstate+s3://filetwo --from tfstate+s3://filethree ....

The cached flag would have defaults:

  • a 1-minute cache? Maybe something smarter, I don't know
  • a key hashing all the arguments combined

This way, when the scan fails on the --from tfstate+s3://filetwo resource scan, it could retry from where it left off, or, if there is no internal retry, the next run could pick up from where it left off.
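
A rough sketch of the proposed key (hypothetical: neither the --cached flag nor this helper exists in driftctl; it only shows hashing all the arguments combined):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// cacheKey hashes all CLI arguments together so that a re-run with the same
// --from flags would map to the same cache entry.
func cacheKey(args []string) string {
	sum := sha256.Sum256([]byte(strings.Join(args, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	args := []string{"--from", "tfstate+s3://fileone", "--from", "tfstate+s3://filetwo"}
	fmt.Println(cacheKey(args))
}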

@nsballmann

@valdestron Shortly after my post I discovered this heading: https://github.com/snyk/driftctl?tab=readme-ov-file#this-project-is-now-in-maintenance-mode-we-cannot-promise-to-review-contributions-please-feel-free-to-fork-the-project-to-apply-any-changes-you-might-want-to-make which makes me think that driftctl development has stopped and it's time to replace it wherever we use it. Unfortunately, I haven't found a suitable successor for my use cases yet.
