
Massive amounts of "connection is shut down" errors which sometimes result in TERRAFORM CRASH with 1.11 #762

Closed
sofixa opened this issue May 10, 2019 · 6 comments

Comments

@sofixa

commented May 10, 2019

Terraform Version

Terraform v0.11.13

vSphere Provider Version

- Downloading plugin for provider "aws" (1.7.1)...
- Downloading plugin for provider "vsphere" (1.11.0)...
- Downloading plugin for provider "gitlab" (1.3.0)...
- Downloading plugin for provider "vault" (1.5.0)...

Affected Resource(s)

  • vsphere_virtual_machine
  • vsphere_resource_pool
  • vsphere_distributed_port_group
  • vsphere_vmfs_datastore
  • maybe others, probably core

Terraform Configuration Files

variable "front-pp_project"   { default = "v2" }
variable "front-pp_instance"  { default = "wwwpp-0" } # Counter appended as a number (1, 2, 3, 4, etc.)
variable "front-pp_num-cpus"  { default = 2 }
variable "front-pp_memory"    { default = 4 }
variable "front-pp_disk-size" { default = 50 }

variable "pa3-front-pp_count"         { default = 2 }
variable "pa3-front-pp_count-offset"  { default = 0 }

resource "vsphere_virtual_machine" "pa3-frontpp" {
  count            = "${var.pa3-front-pp_count}"
  provider         = "vsphere.pa3"
  name             = "${var.client_name}.${var.front-pp_project}.${var.front-pp_instance}${count.index + 1 + var.pa3-front-pp_count-offset}"
  guest_id         = "${data.vsphere_virtual_machine.pa3_template9.guest_id}"
  resource_pool_id = "${data.vsphere_resource_pool.pa3_pool.id}"
  datastore_id     = "${data.vsphere_datastore.pa3_datastorepp_01.id}"
  folder           = "${var.folder}"

  num_cpus = "${var.front-pp_num-cpus}"
  memory   = "${var.front-pp_memory * 1024}"

  network_interface {
    network_id   = "${data.vsphere_network.pa3_adm.id}"
    adapter_type = "${var.network_interface_type}"
  }

  network_interface {
    network_id   = "${data.vsphere_network.pa3_frontpp.id}"
    adapter_type = "${var.network_interface_type}"
  }

  network_interface {
    network_id   = "${data.vsphere_network.pa3_backpp.id}"
    adapter_type = "${var.network_interface_type}"
  }

  disk {
    label            = "disk0.vmdk"
    size             = "${var.front-pp_disk-size}"
    eagerly_scrub    = "${var.eagerly_scrub}"
    thin_provisioned = "${var.thin_provisioned}"
  }

  lifecycle {
    ignore_changes = [ "clone.0.template_uuid","disk.0.io_share_count","disk.0.key" ]
  }

  cpu_hot_add_enabled        = true
  memory_hot_add_enabled     = true
  wait_for_guest_net_timeout = 0

  clone {
    template_uuid = "${data.vsphere_virtual_machine.pa3_template9.id}"
  }

  extra_config {
    "guestinfo.vmname" = "${var.client}.${var.front-pp_project}.${var.front-pp_instance}${count.index + 1 + var.pa3-front-pp_count-offset}"
  }
}

Debug Output

https://gist.github.com/sofixa/9604fadc993b59ae40122769cad513e6

For starters; I'll continue doing tests to try to reproduce the crash.

Panic Output

The crash output is included in the Debug Output gist above.

Expected Behavior

Terraform doing a regular refresh/plan/apply.

Actual Behavior

In a case where there are a few thousand refreshes, usually around 300 fail with "connection is shut down". For smaller projects, the refresh/plan usually passes, but sometimes the apply fails with a CRASH.

Steps to Reproduce

  1. use vSphere provider 1.11.0
  2. run terraform refresh/plan/apply multiple times

Cheers,
Adrian

@sofixa sofixa changed the title Massive amounts of "connection is shut down" errors which sometimes result in TERRAFORM CRASH Massive amounts of "connection is shut down" errors which sometimes result in TERRAFORM CRASH with 1.11 May 10, 2019
@bill-rich

Contributor

commented May 10, 2019

Thanks for reporting this @sofixa! I have tried running through some repeated tests, but haven't been able to reproduce the issue yet. Are you consistently seeing the "connection is shut down" error? If possible, can you please also get a complete debug log from a run that includes the errors?

@sofixa

Author

commented May 14, 2019

@bill-rich I can consistently reproduce the issue on a big project we have, but the debug output is 800,000 lines long, which is a little complicated to sanitize properly xD But from what I can see, the refresh proceeds normally (during the plan), and then "connection is shut down":

2019/05/13 09:44:27 [TRACE] root.test-front-preprod: eval: *terraform.EvalSequence
2019/05/13 09:44:27 [TRACE] root.test-front-preprod: eval: *terraform.EvalInterpolate
2019/05/13 09:44:27 [TRACE] root.test-front-preprod: eval: *terraform.EvalCountCheckComputed
2019/05/13 09:44:27 [TRACE] root.test-front-preprod: eval: *terraform.EvalIf
2019/05/13 09:44:27 [TRACE] root.test-front-preprod: eval: *terraform.EvalCountFixZeroOneBoundary
2019/05/13 09:44:27 [ERROR] root.test-front-preprod: eval: *terraform.EvalRefresh, err: vsphere_distributed_port_group.pa3-vlan: connection is shut down
2019/05/13 09:44:27 [ERROR] root.test-front-preprod: eval: *terraform.EvalSequence, err: vsphere_distributed_port_group.pa3-vlan: connection is shut down

And the same thing en masse for hundreds of resources.

On smaller projects it seems to work fine in some cases, and crash in others (if it happens during the apply stage, we get the crash as in the gist in the issue: https://gist.github.com/sofixa/9604fadc993b59ae40122769cad513e6).

@hashibot hashibot bot removed the waiting-response label May 14, 2019
@bill-rich

Contributor

commented May 17, 2019

I've still been having a hard time reproducing the issue, but it's possible that's due to working at a smaller scale in the test environment.

Were you using v1.10.0 of the vSphere provider previously? The only changes between 1.10.0 and 1.11.0 were documentation and dependency updates, so I'm trying to narrow down exactly what introduced the new problem.

I have a branch that uses the earlier version of govmomi; would it be possible to build from that and let me know if it resolves the connection issues and panics?

@sofixa

Author

commented May 17, 2019

@bill-rich Yep, we were on 1.10 before 1.11 came out. I'll give the branch a try on Monday, and I'll try to reproduce the crashes we've been having with smaller projects on apply.
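Since the regression window is between 1.10.0 and 1.11.0, one interim option (a sketch only, not something confirmed in this thread) would be to pin the provider back to 1.10.0 using a Terraform 0.11-style version constraint; the `pa3` alias is taken from the configuration above, and the connection settings are assumed to stay as they already are:

```hcl
# Hypothetical workaround: pin the vSphere provider to 1.10.0 while the
# 1.11.0 regression is investigated (Terraform 0.11 syntax).
provider "vsphere" {
  alias   = "pa3"       # alias used by the resources in the config above
  version = "= 1.10.0"  # exact-version constraint

  # existing connection settings (vsphere_server, user, password, etc.)
  # remain unchanged
}
```

Running `terraform init` again after changing the constraint would download the pinned plugin version.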

Cheers,
Adrian

@hashibot hashibot bot removed the waiting-response label May 17, 2019
@bill-rich

Contributor

commented May 21, 2019

Sounds good! Please let me know what you find with the build with the earlier govmomi.

@bill-rich

Contributor

commented Jun 14, 2019

This issue has been closed because there has been no response to our request for more information. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
