
Implement retry logic to enforce timeouts #1033

Merged (5 commits) on Apr 6, 2023

Conversation

@stormqueen1990 (Contributor) commented Nov 16, 2022

Issue: rancher/rancher#37161

Problem

Currently, some resources in the Terraform Rancher2 provider have timeouts declared and described in the documentation, but do not consume these timeout values for their operations. While most of these operations are expected to be short and finish within a reasonable amount of time (often within seconds), they can end up hanging indefinitely should something go wrong in the backend.

Solution

Use the resource.Retry(time.Duration, RetryFunc) function to enclose the blocks that should have timeouts applied. As a bonus, retry functionality can easily be added in the future, if deemed necessary.
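
For illustration, here is a minimal sketch of this pattern using the terraform-plugin-sdk helper/resource package; the resource name and the createCertificate helper are hypothetical placeholders, not the provider's actual code:

    package rancher2

    import (
        "fmt"

        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
    )

    // createCertificate is a hypothetical placeholder for the real Rancher API call.
    func createCertificate(meta interface{}, d *schema.ResourceData) (string, error) {
        return "example-id", nil
    }

    // resourceExampleCreate sketches the pattern: the API call is enclosed in
    // resource.Retry, bounded by the create timeout declared on the schema, so the
    // operation cannot hang past the configured timeout.
    func resourceExampleCreate(d *schema.ResourceData, meta interface{}) error {
        err := resource.Retry(d.Timeout(schema.TimeoutCreate), func() *resource.RetryError {
            id, err := createCertificate(meta, d)
            if err != nil {
                // NonRetryableError aborts immediately; returning RetryableError
                // instead would make the SDK keep retrying until the timeout.
                return resource.NonRetryableError(err)
            }
            d.SetId(id)
            return nil
        })
        if err != nil {
            return fmt.Errorf("creating resource: %w", err)
        }
        return nil
    }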

As outlined by @eliyamlevy in rancher/rancher#37161, the following resources will have this change applied:

  • rancher2_certificate (create, update, delete)
  • rancher2_cluster_sync (update, delete): update and delete are no-ops for this resource, and create already has timeouts implemented
  • rancher2_cluster_template (create, update, delete)
  • rancher2_global_role (create, update, delete)
  • rancher2_registry (create, update, delete)
  • rancher2_role_template (create, update, delete)
  • rancher2_secret (create, update, delete)
  • rancher2_token (create, delete)

Testing

Engineering Testing

Manual Testing

  • Tested all changed resources manually with a newly-built binary for creation, update, and deletion to make sure no changes in behaviour happened.

Automated Testing

  • Acceptance test suite only.

QA Testing Considerations

Regression Considerations

@stormqueen1990 self-assigned this on Nov 16, 2022
@stormqueen1990 changed the title from "[WIP] Implement retry logic to enforce timeouts" to "Implement retry logic to enforce timeouts" on Nov 18, 2022
@stormqueen1990 marked this pull request as ready for review on November 18, 2022 at 22:37
@a-blender (Contributor) left a comment

Looks pretty good!

There are a few other spots in the rancher2 provider that I noticed don't use the schema.Timeout values:

  • resourceRancher2CloudCredentialRead - add a retry?
  • resourceRancher2ClusterRead
  • resourceRancher2ClusterSyncRead
  • ...

In general, I notice the timeout is used in the resource.WaitForState function (https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource#StateChangeConf.WaitForState) for resource Create and Delete functions, but not Read. Is there a reason why you only implemented the retries for the specific files in this PR? IMO, we may want to implement retries in the Read functions for those other resources for code parity and to reduce the chance of an apply getting stuck. Let me know what you think.

Review comments on: rancher2/resource_rancher2_certificate.go (outdated), rancher2/resource_rancher2_role_template.go
@a-blender removed the request for review from Josh-Diamond on January 4, 2023 16:14
@stormqueen1990 (Contributor, Author) commented:

Hi there, @annablender!

> In general, I notice the timeout is used in the resource.WaitForState function (https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource#StateChangeConf.WaitForState) for resource Create and Delete functions, but not Read. Is there a reason why you only implemented the retries for the specific files in this PR? IMO, we may want to implement retries in the Read functions for those other resources for code parity and to reduce the chance of an apply getting stuck. Let me know what you think.

The main reason was that those were the files listed in the rancher/rancher#37161 issue, but I don't see why we couldn't expand this to all resources, since the lack of timeouts might affect every resource in this provider. I can update the PR to add those other resources you mentioned.

@a-blender (Contributor) left a comment

@stormqueen1990 Thank you! A few more things:

  • Additional updates look good. If you could add additional commits instead of using push -f, that'd be great; it will make it easier to see what's been added.
  • Could you add a retry to the rancher2_cluster Update func, here:
    err = client.APIBaseClient.ByID(managementClient.ClusterType, d.Id(), cluster)
    I think it will fix an issue I have on my plate.
  • Please roll back the removal of the cis v1 logging code; that has already been removed in another PR. If you are having build errors, rebase and push.
  • Please smoke test with a rancher2_cluster RKE cluster.

I will re-review after this

Add retry logic to the following resources/operations:
* resource_rancher2_certificate (create, update, delete)
* resource_rancher2_cluster_template (create, update, delete)
* resource_rancher2_global_role (create, update, delete)
* resource_rancher2_registry (create, update, delete)
* resource_rancher2_role_template (create, update, delete)
* resource_rancher2_secret (create, update, delete)
* resource_rancher2_token (create, delete)

Also add retry logic for all reads in these resources.
@a-blender (Contributor) commented Mar 8, 2023

Update: discussed with @stormqueen1990 offline. I think a retry in rancher2_cluster Update is needed to fix #1040 on upgrade. From the TF docs, WaitForState does not retry; it just waits for the timeout or an error condition and refreshes the upstream resource at intervals to see if the desired state has been achieved.

I will put in a separate PR for the retry I requested, since this PR is already so large. Otherwise, as long as these changes have been dev tested, re-request my review when ready.
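
(For reference, a minimal sketch of the WaitForState behaviour described above, using the terraform-plugin-sdk helper/resource package; the pending/target states and the lookupState helper are illustrative placeholders, not the provider's actual code.)

    package rancher2

    import (
        "fmt"
        "time"

        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
    )

    // waitForActive shows how StateChangeConf.WaitForState polls a refresh function
    // until the target state is reached or the schema-declared timeout expires; it
    // refreshes at intervals but does not retry failed API calls.
    func waitForActive(d *schema.ResourceData, lookupState func(id string) (interface{}, string, error)) error {
        stateConf := &resource.StateChangeConf{
            Pending: []string{"provisioning"}, // illustrative pending state
            Target:  []string{"active"},       // illustrative target state
            Refresh: func() (interface{}, string, error) {
                return lookupState(d.Id())
            },
            Timeout:    d.Timeout(schema.TimeoutCreate),
            Delay:      1 * time.Second,
            MinTimeout: 3 * time.Second,
        }
        if _, err := stateConf.WaitForState(); err != nil {
            return fmt.Errorf("waiting for resource %s to become active: %w", d.Id(), err)
        }
        return nil
    }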

Mauren Berti added 3 commits March 15, 2023 15:51
Remove the timeout implementation from the rancher2_feature and rancher2_setting
resources as they do not accept the timeouts configuration in their schemas.
Remove the timeout implementation from the rancher2_app and rancher2_multi_cluster_app
implementations as they are not compatible with Rancher v2.7 and this pull request
is targeting the Rancher v2.7 line for terraform-provider-rancher2.
Remove the timeout handling from the rancher2_project_role_template_binding resource as it
does not seem to work and requires more investigation.
@stormqueen1990 (Contributor, Author) commented Mar 15, 2023

Smoke test for the changed resources

I smoke tested the changed resources marked in the checklist below using the following methodology:

  1. Created a Terraform configuration using the resource.
  2. Added some timeout values as I saw fit.
  3. If the resource failed to provision in the allotted time, tweaked the timeout for the resource until it passed provisioning.
  4. Changed some values to test the update timeout.
  5. Deleted the created resources.

Note: all creations and updates were run using terraform apply and all deletions were run using terraform destroy.

  • rancher2/resource_rancher2_app_v2.go

    resource "rancher2_app_v2" "app" {
      name          = var.app_name
      chart_name    = var.chart_name
      cluster_id    = data.rancher2_cluster.local.id
      namespace     = rancher2_namespace.namespace.id
      repo_name     = var.repository_name
      chart_version = var.chart_version
    
      timeouts {
        create = "60s"
        delete = "60s"
        update = "60s"
      }
    }
  • rancher2/resource_rancher2_catalog_v2.go

    resource "rancher2_catalog_v2" "catalog" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.catalog_name
      git_repo   = var.git_repository_address
      git_branch = var.git_branch_name
    
      annotations = var.catalog_annotations
    
      timeouts {
        create = "30s"
        delete = "30s"
        update = "30s"
      }
    }
  • rancher2/resource_rancher2_certificate.go

    resource "rancher2_certificate" "certificate" {
      certs      = base64encode(file(var.certificate_file_path))
      key        = base64encode(file(var.key_file_path))
      project_id = data.rancher2_project.project.id
      name       = var.certificate_name
    
      labels = var.certificate_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_cloud_credential.go

    resource "rancher2_cloud_credential" "cloud_credential" {
      name        = var.cloud_credential_name
      annotations = var.cloud_credential_annotations
    
      digitalocean_credential_config {
        access_token = var.cloud_credential_token
      }
    
      timeouts {
        create = "1s"
        delete = "1s"
        update = "1s"
      }
    }
  • rancher2/resource_rancher2_cluster.go, rancher2/resource_rancher2_cluster_sync.go, rancher2/resource_rancher2_node_template.go, rancher2/resource_rancher2_node_pool.go

    resource "rancher2_node_template" "node_template" {
      name                = var.node_template_name
      cloud_credential_id = data.rancher2_cloud_credential.cloud_credential.id
      engine_install_url  = var.engine_install_url
    
      digitalocean_config {
        image  = var.os_image
        region = var.do_region
        size   = var.machine_size
      }
    
      timeouts {
        create = "60s"
        update = "60s"
        delete = "60s"
      }
    }
    
    resource "rancher2_node_pool" "node_pool" {
      depends_on = [rancher2_node_template.node_template]
    
      name             = var.node_pool_name
      hostname_prefix  = var.pool_hostname_prefix
      cluster_id       = rancher2_cluster.cluster.id
      node_template_id = rancher2_node_template.node_template.id
      quantity         = 3
      control_plane    = true
      etcd             = true
      worker           = true
    
      timeouts {
        create = "5m"
        update = "5m"
        delete = "5m"
      }
    }
    
    resource "rancher2_cluster" "cluster" {
      depends_on = [rancher2_node_template.node_template]
      name       = var.cluster_name
    
      rke_config {
        network {
          plugin = "calico"
        }
        kubernetes_version = var.cluster_kubernetes_version
      }
    
      enable_network_policy = true
    
      timeouts {
        create = "5m"
        update = "5m"
        delete = "5m"
      }
    }
    
    resource "rancher2_cluster_sync" "cluster_sync" {
      cluster_id    = rancher2_cluster.cluster.id
      wait_catalogs = true
      node_pool_ids = [rancher2_node_pool.node_pool.id]
      state_confirm = 4
    
      timeouts {
        create = "10m"
        update = "10m"
        delete = "10m"
      }
    }
  • rancher2/resource_rancher2_cluster_alert_group.go, rancher2/resource_rancher2_cluster_alert_rule.go

    resource "rancher2_cluster_alert_group" "alert_group" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.alert_group_name
    
      group_interval_seconds = var.alert_group_interval_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
    
    resource "rancher2_cluster_alert_rule" "alert_rule" {
      cluster_id = data.rancher2_cluster.local.id
      group_id   = rancher2_cluster_alert_group.alert_group.id
      name       = var.alert_rule_name
    
      group_wait_seconds = var.alert_rule_group_wait_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_cluster_driver.go

    resource "rancher2_cluster_driver" "cluster_driver" {
      name    = var.cluster_driver_name
      active  = false
      builtin = false
      url     = var.cluster_driver_url
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_role_template.go, rancher2/resource_rancher2_cluster_role_template_binding.go

    resource "rancher2_role_template" "role_template" {
      name = var.role_template_name
    
      rules {
        api_groups = var.role_template_api_groups
        resources  = var.role_template_resources
        verbs      = var.role_template_verbs
      }
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_cluster_role_template_binding" "role_template_binding" {
      name             = var.role_template_binding_name
      cluster_id       = data.rancher2_cluster.local.id
      role_template_id = rancher2_role_template.role_template.id
    
      user_id = data.rancher2_user.admin.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_cluster_template.go

    resource "rancher2_cluster_template" "template" {
      name   = var.cluster_template_name
      labels = var.cluster_template_labels
    
      template_revisions {
        name = "v1"
        cluster_config {
          rke_config {
            network {
              plugin = "canal"
            }
          }
        }
    
        default = true
      }
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_config_map_v2.go

    resource "rancher2_config_map_v2" "config_map" {
      name       = var.config_map_name
      cluster_id = data.rancher2_cluster.local.id
      data       = var.config_map_data
    
      labels = var.config_map_labels
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_etcd_backup.go

    resource "rancher2_etcd_backup" "backup" {
      cluster_id  = data.rancher2_cluster.local.id
      annotations = var.etcd_backup_annotations
    
      timeouts {
        create = "5m"
        delete = "5m"
        update = "5m"
      }
    }
  • rancher2/resource_rancher2_global_dns.go, rancher2/resource_rancher2_global_dns_provider.go

    resource "rancher2_global_dns_provider" "global_dns_provider" {
      name        = var.global_dns_provider_name
      root_domain = var.global_dns_provider_root_domain
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_global_dns" "global_dns" {
      fqdn        = var.global_dns_fqdn
      provider_id = rancher2_global_dns_provider.global_dns_provider.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_global_role.go, rancher2/resource_rancher2_global_role_binding.go

    resource "rancher2_global_role" "role" {
      name = var.global_role_name
    
      rules {
        api_groups = var.global_role_api_groups
        resources  = var.global_role_resources
        verbs      = var.global_role_verbs
      }
      
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_global_role_binding" "role_binding" {
      global_role_id = rancher2_global_role.role.id
      user_id        = data.rancher2_user.admin.id
      annotations    = var.global_role_binding_annotations
      
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_namespace.go

    resource "rancher2_namespace" "namespace" {
      name       = var.namespace_name
      project_id = rancher2_project.project.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_node_driver.go

    resource "rancher2_node_driver" "node_driver" {
      active  = false
      builtin = false
      url     = var.node_driver_url
      name    = var.node_driver_name
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_notifier.go

    resource "rancher2_notifier" "notifier" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.notifier_name
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_pod_security_policy_template.go

    resource "rancher2_pod_security_policy_template" "psp-template" {
      name                       = var.psp_template_name
      allow_privilege_escalation = false
      host_pid                   = false
    
      se_linux {
        rule = "RunAsAny"
      }
    
      run_as_user {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      run_as_group {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      fs_group {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      supplemental_group {
        rule = "MustRunAs"
        range {
          max = 1001
          min = 1001
        }
      }
    
      timeouts {
        create = "5m"
        delete = "5m"
        update = "5m"
      }
    }
  • rancher2/resource_rancher2_project.go

    resource "rancher2_project" "project" {
      name        = var.project_name
      cluster_id  = data.rancher2_cluster.local.id
      description = var.project_description
    
      enable_project_monitoring = true
    
      labels = var.project_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_project_alert_group.go, rancher2/resource_rancher2_project_alert_rule.go

    resource "rancher2_project_alert_group" "alert_group" {
      name       = var.project_alert_group_name
      project_id = rancher2_project.project.id
    
      group_interval_seconds = var.project_alert_group_interval_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
    
    resource "rancher2_project_alert_rule" "alert_rule" {
      name       = var.project_alert_rule_name
      group_id   = rancher2_project_alert_group.alert_group.id
      project_id = rancher2_project.project.id
      
      group_wait_seconds = var.project_alert_rule_wait_seconds
      
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_registry.go

    resource "rancher2_registry" "registry" {
      name       = var.registry_name
      project_id = rancher2_project.test_project.id
    
      labels = var.registry_labels
    
      registries {
        address = var.registry_address
      }
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_secret.go

    resource "rancher2_secret" "secret" {
      data         = var.secret_data
      project_id   = rancher2_project.test_project.id
      namespace_id = rancher2_namespace.test_namespace.id
      name         = var.secret_name
    
      labels = var.secret_labels
    
      timeouts {
        create = "10s"
        delete = "10s"
        update = "10s"
      }
    }
  • rancher2/resource_rancher2_secret_v2.go

    resource "rancher2_secret_v2" "secret" {
      cluster_id = data.rancher2_cluster.local.id
      data       = var.secret_data
      name       = var.secret_name
    
      annotations = var.secret_annotations
    
      timeouts {
        create = "10s"
        delete = "10s"
        update = "10s"
      }
    }
  • rancher2/resource_rancher2_storage_class_v2.go

    resource "rancher2_storage_class_v2" "storage_class" {
      cluster_id      = data.rancher2_cluster.local.id
      k8s_provisioner = var.storage_class_provisioner_name
      name            = var.storage_class_name
      reclaim_policy  = var.storage_class_reclaim_policy
      labels          = var.storage_class_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_token.go

    resource "rancher2_token" "token" {
      cluster_id = data.rancher2_cluster.local.id
      ttl        = var.token_ttl
      labels     = var.token_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_user.go

    resource "rancher2_user" "user" {
      password = var.user_password
      username = var.user_username
      enabled  = var.user_enabled
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }

Additional considerations

  • Removed timeout implementation from rancher2/resource_rancher2_feature.go and rancher2/resource_rancher2_setting.go as those do not take timeout parameters in their schemas.
  • Removed timeout implementation from rancher2/resource_rancher2_app.go and rancher2/resource_rancher2_multi_cluster_app.go as this pull request is targeting Rancher v2.7+ and those resources are incompatible with Rancher v2.7.
  • Removed timeout implementation from rancher2/resource_rancher2_project_role_template_binding.go as it does not seem to be working with the latest released provider.
  • I have removed changes from rancher2_cluster_v2 and rancher2_machine_config_v2 as I have not been able to test those two resources. Changes to them will be added in a subsequent pull request.

@a-blender (Contributor) commented:

@stormqueen1990 Thank you for your work on this! Did you try testing provisioning using AWS instead of DO (I think DO was what you were using, if I'm correct) for rancher2_cluster_v2 and rancher2_machine_config_v2? Checking in on those last two, since a lot of TF issues I get pertain to those resources.

@stormqueen1990 (Contributor, Author) commented:

> @stormqueen1990 Thank you for your work on this! Did you try testing provisioning using AWS instead of DO (I think DO was what you were using, if I'm correct) for rancher2_cluster_v2 and rancher2_machine_config_v2? Checking in on those last two, since a lot of TF issues I get pertain to those resources.

Hi there, @a-blender! I unfortunately didn't get around to testing those two resources as I had a bunch of issues with AWS security configurations, and afterwards I had to pause my work on this. I could remove both of them from this pull request and create a separate one.

@a-blender (Contributor) commented Apr 4, 2023

@stormqueen1990 Yeah, if you could create a separate PR with those 2 resources, that'd be great. I'll approve this PR after you do that. I also mostly test TF rancher2 / rke with AWS as it's super easy to provision test nodes, so ping me internally; I want to help you get set up with that.

These resources will be added to a separate pull request as I was not able to check them at this time.
@stormqueen1990 (Contributor, Author) commented Apr 4, 2023

> @stormqueen1990 Yeah, if you could create a separate PR with those 2 resources, that'd be great. I'll approve this PR after you do that. I also mostly test TF rancher2 / rke with AWS as it's super easy to provision test nodes, so ping me internally; I want to help you get set up with that.

I've reverted both rancher2_cluster_v2 and rancher2_machine_config_v2 as discussed.

@a-blender added this to the v2.7.2 - Terraform milestone on Apr 6, 2023
@HarrisonWAffel (Contributor) left a comment

This looks good! Though we likely want to squash on merge.
