Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource/aws_ecs_task_definition: prevent spurious environment variable diffs #11463

Merged

Conversation

jbergknoff-rival
Copy link
Contributor

@jbergknoff-rival jbergknoff-rival commented Jan 3, 2020

An ECS container definition's environment variables are stored by ECS as a list (rather than a map), and the ECS service reorders them arbitrarily because there is no meaningful order to them. There's already code in place to sort the list of environment variable when computing the plan, so that, if that reordering is the only change, then the container definition is considered clean.

However, if anything else in the container definition changes, then we rightfully get a plan. The environment variables still show up in the plan, and, because of the reordering, there appears to be a big change. It's extremely difficult to inspect manually.

Here's an example:

provider "aws" {}

resource "aws_ecs_task_definition" "default" {
  family                = "reordering-demonstration"
  container_definitions = jsonencode(
    [
      {
        "name": "main",
        "image": "hello-world",
        "cpu": 10,
        "memory": 512,
        "essential": true,
        "environment": [
          {"name": "abc", "value": "123"},
          {"name": "def", "value": "456"},
          {"name": "ghi", "value": "789"},
          {"name": "jkl", "value": "321"},
          {"name": "mno", "value": "654"},
          {"name": "pqr", "value": "987"},
        ]
      }
    ]
  )
}

Apply, and then change the memory from 512 to 1024. The next plan is:

Terraform will perform the following actions:

  # aws_ecs_task_definition.default must be replaced
-/+ resource "aws_ecs_task_definition" "default" {
      ~ arn                      = "..." -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [ # forces replacement
              ~ {
                    cpu          = 10
                  ~ environment  = [
                      - {
                          - name  = "pqr"
                          - value = "987"
                        },
                      - {
                          - name  = "ghi"
                          - value = "789"
                        },
                      - {
                          - name  = "jkl"
                          - value = "321"
                        },
                        {
                            name  = "abc"
                            value = "123"
                        },
                        {
                            name  = "def"
                            value = "456"
                        },
                      + {
                          + name  = "ghi"
                          + value = "789"
                        },
                      + {
                          + name  = "jkl"
                          + value = "321"
                        },
                        {
                            name  = "mno"
                            value = "654"
                        },
                      + {
                          + name  = "pqr"
                          + value = "987"
                        },
                    ]
                    essential    = true
                    image        = "hello-world"
                  ~ memory       = 512 -> 1024
                  - mountPoints  = [] -> null
                    name         = "main"
                  - portMappings = [] -> null
                  - volumesFrom  = [] -> null
                } # forces replacement,
            ]
        )
        family                   = "reordering-demonstration"
      ~ id                       = "reordering-demonstration" -> (known after apply)
      + network_mode             = (known after apply)
      - requires_compatibilities = [] -> null
      ~ revision                 = 5 -> (known after apply)
      - tags                     = {} -> null
    }

Plan: 1 to add, 0 to change, 1 to destroy.

The correct plan would show memory = 512 -> 1024 but environment would be left alone.

This PR attempts to correct the problem by ordering the list of environment variables (by name) when we read the task definition from ECS, and when we serialize to state. In my testing (with the toy example above), it fixes the issue.

This approach also gives a clean plan if environment variables in the container definition are reordered in the Terraform source. That seems like reasonable behavior to me, because there is no meaningful order to the list.

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Release note for CHANGELOG:

BUG FIXES:
* resource/aws_ecs_task_definition: don't show a spurious environment variable diff when another property on the task definition is changing.

Output from acceptance testing:

The useful thing to test here is that, while a change to some other attribute is legitimately resulting in a plan, the unchanged environment variables don't show up as changed in the diff. Unfortunately, I didn't see a way to inspect the contents of a plan using the testing framework. If somebody can advise on where/how to write a test for this, I'll add one. It probably should be a unit test, though, rather than an acceptance test.

@jbergknoff-rival jbergknoff-rival requested a review from a team January 3, 2020 00:03
@ghost ghost added size/XS Managed by automation to categorize the size of a PR. needs-triage Waiting for first response or review from a maintainer. service/ecs Issues and PRs that pertain to the ecs service. labels Jan 3, 2020
Copy link

@saich saich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding unit tests could speed up the review process

@MOURIK
Copy link

MOURIK commented Feb 28, 2020

I confirm the issue... Any update regarding this merge request ?
thanks in advance.

@rafilkmp3
Copy link

I really can't believe this is not merged yet

@rafilkmp3
Copy link

@jbergknoff-rival can I build myself the provider with your code? this is today my big issue

@jbergknoff-rival
Copy link
Contributor Author

can I build myself the provider with your code?

Yes, absolutely. My team runs a custom build of this provider with a bunch of useful PRs and bugfixes included (including this one).

@rafilkmp3
Copy link

another week on paradise

@pkoch
Copy link

pkoch commented Mar 29, 2020

@rafilkmp3 Still working on this? Do you know enough go/tf to perhaps add some tests to this patch, like @saich mentioned?

@saich @jbergknoff-rival: what next steps can we take here to see this get to master faster?

@rafilkmp3
Copy link

@pkoch I'm not working on it, I just use the @ jbergknoff-rival version and I feel powerless because it is not merged. Trust me if I had been able to develop tests to make this happen, I certainly would have done it a long time ago.

@jbergknoff-rival
Copy link
Contributor Author

@pkoch it's in the hands of the maintainers, and I don't know of anything we can do to get this merged. As far as I know, none of the maintainers have seen this PR. I'd like to add tests if somebody more in-the-know would advise on my question (in the PR description) about how to write a test against the contents of a plan.

@nevins-ellevation
Copy link

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

@jbergknoff-rival
Copy link
Contributor Author

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

Thanks for the suggestion. I don't think this would test what this PR addresses, though, which is: when there is a good reason to show a plan, the environment variables should only show up as changed if they're actually changing.

The PR doesn't affect any interactions with ECS, doesn't change whether plans are clean or not, etc.

@nevins-ellevation
Copy link

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

Thanks for the suggestion. I don't think this would test what this PR addresses, though, which is: when there is a good reason to show a plan, the environment variables should only show up as changed if they're actually changing.

The PR doesn't affect any interactions with ECS, doesn't change whether plans are clean or not, etc.

My thought is that if you do two terraform runs, if changes in the order of the environmental variables doesn't cause the task revision to have changed between the runs you know that the changes in this PR are working correctly.

@jbergknoff-rival jbergknoff-rival force-pushed the jbergknoff/ecs-task-definition-env-ordering branch from 2e89b7b to e718a30 Compare April 2, 2020 14:57
@Jimmyscene
Copy link

Another bump, this bug is really annoying

@Jimmyscene
Copy link

@bflad any opinions on this?

@nathanielks
Copy link
Contributor

As far as I can tell, I don't believe there's any acceptance tests that can be added with the existing testing tool kit. The TestAwsEcsContainerDefinitionsAreEquivalent_arrays test already verifies that environment variables will be sorted when comparing local state to the AWS API. The challenge for this PR is it needs the ability to test the diff Terraform is generating, which I don't believe there's a testing API for currently.

I've done some digging and I believe it would be possible to add some sort of Plan diff testing functionality around here: https://github.com/terraform-providers/terraform-provider-aws/blob/2b074d0a45342f79c2cf763a26b5b4d157feadc7/vendor/github.com/hashicorp/terraform-plugin-sdk/helper/resource/testing_config.go#L164-L172

A DiffTestFunc property could be added to TestStep which could be run after the plan has been made and the diff itself could be tested using p.Changes. I'll report back if I'm able to get something working.

As an aside, I did learn about some testing functionality I wasn't aware of previously. When defining TestStep's, it's possible to specify PlanOnly and ExpectNonEmptyPlan for simply verifying a plan generates a change or not. Check out this test for an example of how to use those fields.

@nathanielks
Copy link
Contributor

One challenge I'm experiencing in attempting to test this out is making changes to testing_config.go doesn't appear to make any difference when running make testacc for some reason. I'm still a golang novice, so I am unsure of why it's not picking up the changes I'm making. For reference, I'm simply adding an additional logging statement to line 84 to verify changes to the file get picked up, and the log isn't getting output.

The change itself:

	// If this step is a PlanOnly step, skip over this first Plan and subsequent
	// Apply, and use the follow up Plan that checks for perpetual diffs
	if !step.PlanOnly {
		// Plan!
		if p, stepDiags := ctx.Plan(); stepDiags.HasErrors() {
			return state, newOperationError("plan", stepDiags)
		} else {
+			// Add an additional logging statement
+			log.Printf("[WARN] Test: please work?")
			log.Printf("[WARN] Test: Step plan: %s", legacyPlanComparisonString(newState, p.Changes))
		}

The command I'm running:

TF_LOG=WARN make testacc TEST=./aws TESTARGS='-run=TestAccAWSEcrRepository_image_scanning_configuration'

@rabidscorpio
Copy link

@bflad This is a clean PR that fixes a long-standing issue (see #3035), can we please get some traction on this?

@nwsparks
Copy link
Contributor

What is preventing this from moving forward?

@stuzlogle
Copy link

no idea if this is a blocker but the provider has a quarterly roadmap I found when looking for the progress on another issue - https://github.com/terraform-providers/terraform-provider-aws/blob/master/ROADMAP.md

@nwsparks
Copy link
Contributor

nwsparks commented Jun 19, 2020

If this was a feature addition my opinion would be different, but this is a major bug on a widely used service that that is incredibly frustrating to deal with as well as dangerous since it makes it difficult to tell what is actually changing in the environment of the application.

If you sort by reactions, this is the 3rd most upvoted open PR which per the faq is one of the main criteria for review. As you've mentioned @pkoch the PR appears fine and has multiple approvals, so people are just left to guess as to why it isn't being merged? Some guidance would surely be nice instead and perhaps more priority given to bug fixes in the future by the maintainers.

In any case, apologies for the distraction and appreciate the attempts to get it moving.

@pkoch
Copy link

pkoch commented Jun 19, 2020

Just learned that Hashicorp provides community office hours: https://www.hashicorp.com/community/office-hours/#technical-office-hours. Perhaps that's a good forum to present our case after we've made sure we've gone through all the expected due-diligence.

@nathanielks
Copy link
Contributor

Great idea, @pkoch 👍

@pkoch
Copy link

pkoch commented Jun 19, 2020

To those looking for a bit of relief from this problem, I can point out that ECS now supports using environmentFiles for EC2 launch type.

@pkoch
Copy link

pkoch commented Jun 19, 2020

I think @nathanielks is being a bit too bold in expanding the scope of this PR to make the diffing possible. I understand that the instructions do ask for an acceptance test coverage of new behavior but I'm not sure it's a reasonable ask if there's no testing infra that can help us right now. Adding that additional burden for that piece of infra to this PR seems undue.

One thing that bothers me here is that, even if we we managed to get that plan diffing feature in, since we have to assume AWS is going to be random and nondeterministic in their ordering, any acceptance test will inherently be flaky with false passes — AWS could happen to return just the right order. I'm not sure how allergic the maintainers of this repo are to that, but that'd be something I'd like an explicit ack on.

I think our best bet would be to add a well placed unit test. This is so internal and it's hard-to-control enough that I feel comfortable arguing in favor of that. I'm not sure which part of the code it should exercise, tho. I took a dumb approach at something by just fork-lifting StateFunc's meat, but I bet a person more familiar with the project's code organization and preferences would have a better idea than mine. You can check said unit test at https://gist.github.com/pkoch/33d5e70018e70b05b79daa2dc99ee867

Since we can't meet the established criteria for PR acceptance, we're stuck waiting for someone with more power than us.

Meanwhile, I took the liberty of booking an office hours slot with them. It's going to be at 12:30 - 13:00 Lisbon Time, on June 23, 2020. Wish me luck! 🤞

@nathanielks
Copy link
Contributor

Thanks for taking the time to book a spot with them, @pkoch!

@jboero
Copy link

jboero commented Jun 23, 2020

@paultyng do you or anyone on modules team have an idea when this can merge? @pkoch asking on office hours during Hashiconf EU. Thanks!

@pkoch
Copy link

pkoch commented Jun 23, 2020

Just left the call. Had a great nice chat with @jboero, where we tried getting the attention of some folks that might be able to provide some firmer guidance (as you can see in his comment), and we'll keep trying to move this forward once inch at a time, as people are able to squeeze this into their busy schedules.

Thank you so much for helping us out, @jboero! 🥰

@breathingdust
Copy link
Member

Hi all 👋 ,

Sorry for the delay in response. We recognize the value this has to the community and as such this item is already in our backlog as a priority to address. We are currently working through our ROADMAP items for this quarter, once that is complete we will be in position to offer feedback and hopefully merge this pull request.

We appreciate your patience!

@bflad bflad added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Jun 24, 2020
@bflad bflad self-assigned this Jun 24, 2020
Copy link
Member

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right to me and agreed that adding testing for this is not necessarily straightforward, so rather than holding this up further on that, I will approve this on a lack of regressions from our existing unit testing and acceptance testing. Thank you @jbergknoff-rival and everyone involved. 🚀

Output from acceptance testing:

--- PASS: TestAccAWSEcsTaskDefinition_arrays (18.99s)
--- PASS: TestAccAWSEcsTaskDefinition_basic (31.95s)
--- PASS: TestAccAWSEcsTaskDefinition_changeVolumesForcesNewResource (32.13s)
--- PASS: TestAccAWSEcsTaskDefinition_constraint (19.21s)
--- PASS: TestAccAWSEcsTaskDefinition_ExecutionRole (20.29s)
--- PASS: TestAccAWSEcsTaskDefinition_Fargate (25.76s)
--- PASS: TestAccAWSEcsTaskDefinition_Inactive (37.80s)
--- PASS: TestAccAWSEcsTaskDefinition_inferenceAccelerator (19.31s)
--- PASS: TestAccAWSEcsTaskDefinition_ProxyConfiguration (31.52s)
--- PASS: TestAccAWSEcsTaskDefinition_Tags (52.97s)
--- PASS: TestAccAWSEcsTaskDefinition_withDockerVolume (19.42s)
--- PASS: TestAccAWSEcsTaskDefinition_withDockerVolumeMinimalConfig (19.41s)
--- PASS: TestAccAWSEcsTaskDefinition_withEcsService (76.30s)
--- PASS: TestAccAWSEcsTaskDefinition_withEFSVolume (32.52s)
--- PASS: TestAccAWSEcsTaskDefinition_withEFSVolumeMinimal (32.26s)
--- PASS: TestAccAWSEcsTaskDefinition_withIPCMode (20.19s)
--- PASS: TestAccAWSEcsTaskDefinition_withNetworkMode (20.36s)
--- PASS: TestAccAWSEcsTaskDefinition_withPidMode (29.78s)
--- PASS: TestAccAWSEcsTaskDefinition_withScratchVolume (18.39s)
--- PASS: TestAccAWSEcsTaskDefinition_withTaskRoleArn (20.21s)
--- PASS: TestAccAWSEcsTaskDefinition_withTaskScopedDockerVolume (18.99s)

@bflad bflad added this to the v2.68.0 milestone Jun 24, 2020
@bflad bflad merged commit 0be34d4 into hashicorp:master Jun 24, 2020
bflad added a commit that referenced this pull request Jun 24, 2020
@jbergknoff-rival
Copy link
Contributor Author

@bflad thank you! While this is fresh in our minds, please also have a look at #13813 which solves the same problem for various resources with IAM-like policies.

@jbergknoff-rival jbergknoff-rival deleted the jbergknoff/ecs-task-definition-env-ordering branch June 25, 2020 16:26
@jbergknoff-rival jbergknoff-rival restored the jbergknoff/ecs-task-definition-env-ordering branch June 25, 2020 16:31
@ghost
Copy link

ghost commented Jun 26, 2020

This has been released in version 2.68.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@pkoch
Copy link

pkoch commented Jun 26, 2020

@rafilkmp3 The day has come. Welcome to paradise.

@jfirebaugh
Copy link

Is anyone else getting even worse diffing with this change? I now see the "after" state represented as stringified JSON with no whitespace (i.e. virtually unreadable):

  # <resource> must be replaced
-/+ resource "aws_ecs_task_definition" "main" {
      ...
      ~ container_definitions    = jsonencode(
            <nicely formatted JSON data>
        ) -> "<stringified JSON with no whitespace>" # forces replacement
      ...
    }

@pkoch
Copy link

pkoch commented Jun 30, 2020

I have tried this out and only seen the expected behavior. :/

Can you perhaps produce a minimal example and open a new issue?

@jbergknoff-rival jbergknoff-rival deleted the jbergknoff/ecs-task-definition-env-ordering branch June 30, 2020 17:32
@jfirebaugh
Copy link

I reduced it to this diff, as generated by the 0.67.0 provider:

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        ~ awslogs-region        = "us-west-2" -> null
          awslogs-group         = "group"
          awslogs-stream-prefix = "prefix"
      }
      secretOptions = []
    }

For whatever reason, the 0.68.0 provider is unable to generate a readable diff for this change. However, changing awslogs-region to null was an unintended change in my configuration, and after fixing it, I do see the expected behavior. So this is probably not something that will be commonly encountered.

@ghost
Copy link

ghost commented Jul 24, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@hashicorp hashicorp locked and limited conversation to collaborators Jul 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ecs Issues and PRs that pertain to the ecs service. size/XS Managed by automation to categorize the size of a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent order of environment variables in aws_ecs_task_definition