
Significant slowdowns running terraform for WAF resources on AWS provider v2.69.0 #14062

Closed
samtarplee opened this issue Jul 6, 2020 · 31 comments
Labels
bug: Addresses a defect in current functionality.
service/wafv2: Issues and PRs that pertain to the wafv2 service.
upstream-terraform: Addresses functionality related to the Terraform core binary.

Comments

@samtarplee

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

0.12.28

Affected Resource(s)

  • aws_wafv2_web_acl

Terraform Configuration Files

resource "aws_wafv2_web_acl" "demo-waf" {
  name        = "Demo-WAF"
  description = "Demo-WAF"
  scope       = "REGIONAL"

  default_action {
    block {}
  }

  rule {
    name     = "RateLimit"
    priority = 200

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 1000
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name                = "demo_RateLimit"
      sampled_requests_enabled   = false
    }
  }

  rule {
    name     = "AWSManagedRulesCommonRuleSet"
    priority = 998

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name                = "AzureAD-AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled   = false
    }
  }

  rule {
    name     = "AWSManagedRulesKnownBadInputsRuleSet"
    priority = 999

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesKnownBadInputsRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = false
      metric_name                = "AzureAD-AWSManagedRulesKnownBadInputsRuleSet"
      sampled_requests_enabled   = false
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "Demo-WAF"
    sampled_requests_enabled   = false
  }
}

Debug Output

Expected Behavior

Running terraform plan should produce a plan within a few seconds, and terraform validate should validate the configuration just as quickly.
At worst, a plan should take a couple of minutes.

Actual Behavior

The plan and validate take a very long time to run. They do complete eventually, but validate takes upwards of 3 minutes, and plan takes five minutes, usually closer to ten. This is for just the one resource.
Apply takes even longer, presumably because it runs a plan on top of doing everything else.

If I downgrade my provider version to v2.67.0, all of the actions complete within a few seconds, as expected.

Steps to Reproduce

  1. Set provider version to v2.69.0
  2. terraform plan
  3. Set provider version to v2.67.0
  4. terraform plan
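The version switch in the steps above can be expressed directly in configuration. A minimal sketch, assuming Terraform 0.12 where the version constraint lives in the provider block (the region is illustrative); `terraform init` must be re-run after each version change:

```hcl
provider "aws" {
  region  = "eu-west-1"   # any region; the example web ACL is REGIONAL-scoped
  version = "= 2.69.0"    # slow case; switch to "= 2.67.0" for the fast case
}
```

After editing the version, run `terraform init` again before timing `terraform plan`.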

Important Factoids

This only seems to affect WAF resources; I've tried the provider in other projects and haven't seen any issues.
I'm unsure whether it's limited to the web ACL resource specifically, but that's the only one I've been able to reproduce it with.

References

N/A

@ghost ghost added the service/wafv2 Issues and PRs that pertain to the wafv2 service. label Jul 6, 2020
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jul 6, 2020
@breathingdust breathingdust added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Jul 6, 2020
@burizz

burizz commented Jul 22, 2020

We are hitting the exact same problem. When the WAFv2 module is enabled, our plan/apply time increases dramatically.

@sbchisholm

I just noticed this yesterday. The following example takes about 11 seconds to run terraform plan with v2.68.0, but with v2.69.0 it takes about 90 seconds. With the rules I have defined added back in, it takes around 150 seconds.

provider "aws" {
  region  = "us-east-1"
  version = "~> 2.69.0"
}

resource "aws_wafv2_web_acl" "acl" {
  name        = "test_acl"
  description = "test_acl"
  scope       = "REGIONAL"

  default_action {
    allow {}
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "test_acl"
    sampled_requests_enabled   = true
  }
}

@mohsen0

mohsen0 commented Jul 28, 2020

it affects 2.70.0 as well, the issue is exacerbated if there are rules and other blocks.

@anGie44
Contributor

anGie44 commented Jul 28, 2020

Hi all 👋. From an initial review of this issue, I see it unfortunately stems from #13961, which addressed #13862 and was introduced in v2.69.0 of the provider. By adding a needed nested level to match the API in these 4 statement types (AND, OR, NOT, and RATE_BASED), we've run into this run-time issue. Given the community following of this issue and related WAFv2 resources, we plan to prioritize them after the upcoming major release of v3.0.0 of the provider.

@mohsen0

mohsen0 commented Jul 28, 2020

Thank you for your quick response. I'd ask you to reconsider releasing the fix you made in #14073 before the major release, since the issue is very painful with a CI/CD system and multiple instances of WAFv2 (https://github.com/umotif-public/terraform-aws-waf-webaclv2).
In our case, the terraform plan run time jumped from a 3-minute job to 55 minutes.
It would be a shame if users had to live with this, if the fix is not going to be released any time soon.

@tophercullen

@anGie44 with v3.0.0 being released, do you have a target version and/or timeline for this being addressed?

@mohsen0

mohsen0 commented Aug 11, 2020

I wonder how this change could be released in its current state; it speaks loudly about the project's quality controls. It basically ruins the whole experience, and it seems no one feels the need to release a fix sooner.
A side effect of long-running terraform applies/plans is that the session token expires and the build constantly fails. Extending the session time is not an option, since role chaining has a limit of 1 hour.
FYI @bflad

@philslab-ninja

To level things out, I just want to say a big thank you to all the awesome Terraform AWS provider contributors who donate their free time to make this project possible!

@mohsen0

mohsen0 commented Aug 11, 2020

To level things out, I just want to say a big thank you to all the awesome Terraform AWS provider contributors who donate their free time to make this project possible!

I reiterate this sentiment too.
I didn't want to sound harsh, though I did; I hope it gets sorted soon. 😅

@Brother-Andy

Provider 2.70.0
6 WAFv2 Web ACL resources
Plan and apply took 23-25 minutes each to process.
So without any changes, every pipeline run takes an extra 20+ minutes.

@anGie44
Contributor

anGie44 commented Aug 17, 2020

Hi all 👋 -- first off, apologies for the silence here! This has been prioritized and we are investigating with the Terraform Core team to further debug the behavior imposed by this rather large schema in the WebACL resource. I will update here accordingly with a more detailed response as to what our next steps will be once we can narrow down where the time is being spent during the terraform plan calls.

Please note, from the provider's perspective there isn't more we can do at the moment to lessen the slowness everyone is experiencing, short of directly reducing the number of supported statements, i.e. a breaking change reverting the #13862 support in order to restore the runtimes seen in v2.67.0 of the provider.

@anGie44
Contributor

anGie44 commented Aug 17, 2020

Following up here: this new upstream Terraform issue, hashicorp/terraform#25889, tracks the behavior we're seeing with the nested statements and the significant slowdowns.

Again, unfortunately there's not much we can do within the provider code at the moment except make the schema less nested in the WebACL resource. The hope is that upstream Terraform optimizations will land so the schema depth can stay as-is.

@dvishniakov

@anGie44, it seems there is no quick fix for the upstream issue at this moment. It's probably also not feasible to expect the WAFv2 API to change (or a WAFv3 to be released) any time soon to allow specifying a string (JSON, YAML) instead of structured data.
Does it make sense to introduce a temporary workaround resource for the WAFv2 ACL that supports only 2 nested levels of rules, so that simple rule sets are supported without compromising performance so much?

@dvishniakov

It would be great if everybody who upvoted this issue could also upvote the corresponding upstream issue (mentioned above by anGie44). The ratio is 5:1 at the moment :(.

@adamhathcock

Voted. The performance difference is quite huge with a modest set of rules. I just converted to WAFv2 and was surprised.

@dvishniakov

dvishniakov commented Sep 1, 2020

A few local (from home) performance test timings (IP sets as an example from the same service, but without a huge schema) with default parallelism settings:

Test                           Plan time   Apply time
Creating 99 empty IP sets      20s         3m59s
Deleting 99 empty IP sets      2m31s       3m13s
Creating 20 empty WAFv2 ACLs   14m41s      9m26s
Deleting 20 empty WAFv2 ACLs   9m43s       6m6s

I don't want to test managing the default limit of 100 WAFv2 ACLs using Terraform; it would probably take more than an hour.
For comparison, a simple shell script using the AWS CLI to sequentially create and remove 100 WAFv2 ACLs took 3m30s.
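The AWS CLI comparison might look roughly like the sketch below. This is hypothetical: the ACL names and count are illustrative, and the shorthand values passed to `--default-action` and `--visibility-config` should be checked against your CLI version. By default it only echoes the commands; unset DRY_RUN to actually call AWS:

```shell
#!/bin/sh
# Sequentially create N empty WAFv2 web ACLs with the AWS CLI (hypothetical benchmark sketch).
# DRY_RUN defaults to 1 so the script only prints the commands it would run.
DRY_RUN=${DRY_RUN-1}
N=${N:-100}

i=1
while [ "$i" -le "$N" ]; do
  cmd="aws wafv2 create-web-acl --name bench-acl-$i --scope REGIONAL \
    --default-action Allow={} \
    --visibility-config SampledRequestsEnabled=false,CloudWatchMetricsEnabled=false,MetricName=bench"
  if [ -n "$DRY_RUN" ]; then
    echo "$cmd"        # dry run: print instead of execute
  else
    $cmd || exit 1     # real run: stop on the first API error
  fi
  i=$((i + 1))
done
```

Deleting the ACLs afterwards additionally requires each ACL's Id and a lock token (e.g. from `aws wafv2 list-web-acls`), which this sketch omits.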

@Brother-Andy

A quick and dirty workaround for me was to pin the provider to version 2.67.0 for the WAFv2 Web ACL resource. Everything is fine, and the validate-plan-apply steps are as quick as usual.
However, since that version knows nothing about the aws_wafv2_web_acl_logging_configuration resource, the web ACL has to be moved to a separate module, because Terraform does not allow combining different versions of the same provider in one configuration.

Hope this helps!
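On the 0.12-era Terraform used in this thread, the per-module pin described above might look like this (a hedged sketch; the module layout and names are illustrative):

```hcl
# modules/waf/main.tf -- only the WAFv2 web ACL lives here, on the old, fast provider
provider "aws" {
  version = "= 2.67.0"
}

resource "aws_wafv2_web_acl" "this" {
  # ... web ACL definition as in the issue description ...
}
```

```hcl
# Root module -- everything else (including resources the old provider lacks,
# such as aws_wafv2_web_acl_logging_configuration) stays on a newer version.
provider "aws" {
  version = "~> 2.70"
}

module "waf" {
  source = "./modules/waf"
}
```

Note that provider blocks inside child modules worked on Terraform 0.12 but are discouraged in later versions, so this is strictly a stopgap.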

@dvishniakov

I've noticed a bug with versions prior to 3.0 when the 'create_before_destroy' lifecycle attribute is used, which leads to an error during terraform apply about duplicate statements or something similar. There was no notice that the bug was fixed, but after I upgraded the provider to v3.2 everything works. So if you're going to use the above workaround, test your scenarios.

@drmjo

drmjo commented Sep 20, 2020

I just moved 3 resources to a new module to see if WAFv2 was causing the slowness

aws_wafv2_ip_set.devs[0]
aws_wafv2_web_acl.main_alb
aws_wafv2_web_acl.main_cdn

This module is now very slow compared to before adding the resources, about 10x slower.

2020/09/20 19:51:35 [INFO] Terraform version: 0.12.29
2020/09/20 19:51:35 [INFO] Go runtime version: go1.12.13
2020/09/20 19:51:35 [DEBUG] found valid plugin: "aws", "3.0.0", "/home/terraform/stack/.terraform/plugins/linux_amd64

@rabidscorpio

I tried including wafv2 resources in one of my modules, disabled with count = 0, and even then planning and applying took 3 to 4 times as long as without them. I ended up commenting out all wafv2 resources for now; I was surprised that even disabled resources caused a slowdown, so I'll wait until these issues get smoothed out.
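The disabled-resource case above can be sketched like this (the variable and resource names are hypothetical). With count = 0 no instances are created, but Terraform still loads and decodes the resource's deeply nested schema during plan, which appears to be where the time went:

```hcl
variable "enable_waf" {
  type    = bool
  default = false
}

resource "aws_wafv2_web_acl" "optional" {
  # 0 instances are planned, yet the large wafv2 schema is still processed
  count = var.enable_waf ? 1 : 0

  name  = "optional-acl"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  visibility_config {
    cloudwatch_metrics_enabled = false
    metric_name                = "optional-acl"
    sampled_requests_enabled   = false
  }
}
```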

@immo-huneke-zuhlke

Also see #5822

@doryanmzr

Hi, the issue is still occurring with AWS provider 3.0. Is it going to be fixed soon?

@omriKaltura

Encountering the same issue with version 3.10.0 as well.

@anGie44
Contributor

anGie44 commented Oct 14, 2020

Good news @samtarplee and those following this issue: the upstream Terraform issue hashicorp/terraform#25889 has a PR to fix the slowdowns experienced here 🎉 I'll provide another update when it lands in the forthcoming release of Terraform v0.13.5 (reference: https://github.com/hashicorp/terraform/blob/v0.13/CHANGELOG.md).

@phplucas

Encountering the same issue with version 3.11.0 as well.

@pkolyvas

Terraform 0.13.5 will be released today and include the speedup for deeply nested resources mentioned by @anGie44 :) Thanks for your patience.

@anGie44
Contributor

anGie44 commented Oct 28, 2020

Confirming this issue has been resolved when using v0.13.5+ of Terraform.

Given the example in the description (Regional WebACL with 3 rules: 2 ManagedRuleGroups, 1 Rate-based) in us-west-1:

With provider[registry.terraform.io/hashicorp/aws] 3.11.0

Plan output time (5s):

terraform plan -out=plan.out
2020/10/27 22:30:41 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:30:46 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Apply output time (3s):

terraform apply plan.out
2020/10/27 22:32:45 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:32:48 [TRACE] eval: *terraform.evalCloseModule

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
...
2020/10/27 22:32:48 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Destroy output time (5s):

terraform destroy --force
2020/10/27 22:34:01 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:34:06 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
Destroy complete! Resources: 1 destroyed.
...
2020-10-27T22:34:06.400-0400 [DEBUG] plugin: plugin exited

With provider[registry.terraform.io/hashicorp/aws] 3.0.0

Plan output time (12s):

terraform plan -out=plan.out
2020/10/27 22:24:00 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:24:12 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Apply output time (14s):

terraform apply plan.out
2020/10/27 22:25:30 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:25:44 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
...
2020/10/27 22:25:44 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Destroy output time (27s):

 terraform destroy --force
2020/10/27 22:27:54 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:28:21 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info

Destroy complete! Resources: 1 destroyed.
...
2020-10-27T22:28:21.218-0400 [DEBUG] plugin: plugin exited

With provider[registry.terraform.io/hashicorp/aws] 2.69.0

Plan output time (15s):

terraform plan -out=plan.out
2020/10/27 22:15:15 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:15:30 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Apply output time (14s):

terraform apply plan.out
2020/10/27 22:19:09 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:19:23 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
...
2020/10/27 22:19:23 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock

Destroy output time (29s):

terraform destroy --force
2020/10/27 22:21:21 [INFO] Terraform version: 0.13.5
...
2020/10/27 22:21:50 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Destroy complete! Resources: 1 destroyed.
...
2020-10-27T22:21:50.722-0400 [DEBUG] plugin: plugin exited

@tophercullen

Can confirm as well. After upgrading to 0.13.5, plan/apply times in our CI/CD process dropped considerably, by 75% in some cases.

@benoit74

Thank you all for this wonderful optimization. In our case we went from about 245s to about 11s (for a lot of resources, not only WAFv2, so a normal time indeed). That's a tremendous improvement! Kudos to everyone who made it possible.

@anGie44
Contributor

anGie44 commented Oct 29, 2020

Closing this, since the behavior is resolved by the upstream Terraform upgrade (v0.13.5). Thank you all for your patience 👍 If you face any further issues related to wafv2 resources/data-sources, please feel free to create a new GitHub issue for tracking.

@anGie44 anGie44 closed this as completed Oct 29, 2020
@ghost

ghost commented Nov 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@hashicorp hashicorp locked as resolved and limited conversation to collaborators Nov 28, 2020