aws_volume_attachment workflow with skip_destroy #1017

Open · gtmtech opened this Issue Jun 30, 2017 · 3 comments

gtmtech commented Jun 30, 2017

Terraform 0.9.8

Firstly, I feel that aws_volume_attachment should have skip_destroy=true as the default. aws_volume_attachments are notoriously tricky in terraform, because they often block the destruction of the resources they connect.

For example, terraform up an aws_instance, an aws_ebs_volume, and an aws_volume_attachment that connects them (without skip_destroy), and then try and plan -destroy all 3 (and apply)
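To make that concrete, a minimal sketch of such a configuration might look like the following (AMI, size and device name are placeholders, not taken from the original report):

resource "aws_instance" "foo" {
  ami           = "ami-12345678"   # placeholder AMI
  instance_type = "t2.micro"
}

resource "aws_ebs_volume" "foo" {
  availability_zone = "${aws_instance.foo.availability_zone}"
  size              = 10
}

# No skip_destroy set, so terraform will wait for the volume to detach on destroy.
resource "aws_volume_attachment" "foo" {
  device_name = "/dev/sdh"
  instance_id = "${aws_instance.foo.id}"
  volume_id   = "${aws_ebs_volume.foo.id}"
}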

Terraform will simply do this:

Error applying plan:
1 error(s) occurred:
* aws_volume_attachment.foo (destroy): 1 error(s) occurred:
* aws_volume_attachment.foo: Error waiting for Volume (vol-0398c9b5a8017xxxx) to detach from Instance: i-023644c6c4c02xxxx

Worse still, the skip_destroy flag must be successfully APPLIED to the resource IN THE STATEFILE before you have any hope of terraform destroying it. If skip_destroy = "true" is merely set on the aws_volume_attachment resource in the .tf file and you try to destroy the resource, you still get the above timeout error. This means the docs are technically wrong.

Sometimes you destroy resources simply by deleting their declarations from the terraform code and running plan/apply. That does not work here, for the same reason (skip_destroy has not been applied to the statefile), so you end up in an impossible situation: you have to revert the codebase, add the flag, terraform apply, and only then terraform destroy.
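To spell out the only sequence that currently works (the resource body below is just an illustration of where the flag goes):

# Step 1: add the flag to the .tf file
resource "aws_volume_attachment" "foo" {
  device_name  = "/dev/sdh"
  instance_id  = "${aws_instance.foo.id}"
  volume_id    = "${aws_ebs_volume.foo.id}"
  skip_destroy = true
}

# Step 2: terraform apply    -> writes skip_destroy into the statefile
# Step 3: terraform destroy  -> the attachment is removed from state without
#                               terraform waiting for the volume to detach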

One of three things should happen (in my view):

  1. skip_destroy defaults to true.
  2. skip_destroy changes so that it skips the destroy only if there's a timeout, and works by its mere presence in the .tf file rather than requiring a prior apply to the statefile.
  3. when the aws_volume_attachment is destroyed in conjunction with its aws_instance or its aws_ebs_volume, the aws_instance or aws_ebs_volume is deleted first, as this would then allow the aws_volume_attachment to be destroyed without a force flag.

Unfortunately there is no lifecycle option to swap the order in which dependencies are destroyed, so (3) is not possible today.

Any comments appreciated; it's a bit of a thorny workflow at the moment.

njam commented Oct 28, 2017

For example, terraform up an aws_instance, an aws_ebs_volume, and an aws_volume_attachment that connects them (without skip_destroy), and then try and plan -destroy all 3 (and apply) [...] Terraform will simply do this: Error applying plan

I can't confirm this behaviour with terraform 0.10.8 and terraform-aws 1.1.0.
All 3 resources are successfully destroyed.
Was this maybe fixed?

Example code: https://gist.github.com/njam/cf572606f23625b941aa7ab61e2569b3

nemosupremo commented Nov 29, 2017

@njam

I'm currently trying to figure out how to deal with this. I think the problem with your example, and the detail the OP missed in his workflow, is that the timeout commonly occurs when you try to destroy an attachment while the volume is still mounted. If you unmount the volume first and then run terraform apply, the destroy succeeds, as you noticed. Likewise, you can probably reproduce the OP's issue by mounting the drive first.

I just started with Terraform and I "solved" the problem like so:

resource "aws_volume_attachment" "pritunl_att_data" {
  device_name = "/dev/sdd"
  instance_id = "${aws_instance.pritunl.id}"
  volume_id   = "${aws_ebs_volume.pritunl_data.id}"
  provisioner "remote-exec" {
      inline = [
        "if [ x`lsblk -ln -o FSTYPE /dev/xvdd` != 'xext4' ] ; then sudo mkfs.ext4 -L datanode /dev/xvdd ; fi",
        "sudo mount -a",
        "sudo mkdir -p /mnt/data/mongodb",
        "sudo chown -R mongodb:mongodb /mnt/data/mongodb",
        "sudo service mongod restart",
        "sudo sh -c \"echo 'yes' > /mnt/data/init\"",
      ]
      connection {
        user = "ubuntu"
        host = "${aws_eip.pritunl_ip.public_ip}"
        private_key = "${file("~/.ssh/master_rsa")}"
      }
    }
    provisioner "remote-exec" {
      when = "destroy"
      inline = [
        "sudo service mongod stop",
        "sudo umount /mnt/data"
      ]
      connection {
        user = "ubuntu"
        host = "${aws_eip.pritunl_ip.public_ip}"
        private_key = "${file("~/.ssh/master_rsa")}"
      }
    }
}

The problem I'm having, however, is that I think I'm running into hashicorp/terraform#16237, where the destroy provisioner causes a cycle.

I can solve the cycle by either 1. hard-coding the instance's IP, or 2. adding a data "aws_instances" lookup with a filter. Of course 1 isn't desirable (I might not know the IP), and 2 has its own set of problems when I apply the configuration from scratch: data "aws_instances" returns 0 instances, which causes an error.
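For reference, the data source variant of option 2 looks roughly like this (the tag filter and the element() lookup are illustrative, not my exact config):

# Look up the instance by tag instead of referencing aws_instance.pritunl directly,
# so the destroy provisioner's connection doesn't depend on the instance resource.
data "aws_instances" "pritunl" {
  filter {
    name   = "tag:Name"
    values = ["pritunl"]   # assumed tag value
  }
}

# Then, inside the destroy provisioner's connection block:
#   host = "${element(data.aws_instances.pritunl.public_ips, 0)}"
# Caveat (as above): on a fresh apply the data source matches 0 instances and errors out.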

I thought there might be prior work on this, but it seems everyone is using skip_destroy. The issue I have with skip_destroy is that if I'm moving an EBS volume, or changing the instance type, then when I try to attach the volume again I get a timeout (because AWS thinks the EBS volume is already attached).

robax commented May 23, 2018

Also running into this issue. In our case, we're trying to use the remote-exec provisioner to stop services and unmount an EBS volume "cleanly", but this failed due to cycle issues exactly as @nemosupremo described.

Our workaround is to do a dirty detachment using force_detach = true, but this required a manual edit of the TF state file as described by OP @gtmtech. If others can confirm this I'll open a new issue.
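For anyone wanting to reproduce the workaround, the dirty detachment is just the force_detach attribute on the attachment (resource names here are illustrative), and as noted above the flag still has to be applied into state before the destroy:

resource "aws_volume_attachment" "data" {
  device_name  = "/dev/sdh"
  instance_id  = "${aws_instance.app.id}"
  volume_id    = "${aws_ebs_volume.data.id}"
  force_detach = true   # detach even if the volume is still mounted ("dirty" detach)
}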
