
Implement retry logic to enforce timeouts #1033

Merged (5 commits) on Apr 6, 2023

Conversation

@stormqueen1990 (Contributor) commented Nov 16, 2022

Issue: rancher/rancher#37161

Problem

Currently, some resources in the Terraform Rancher2 provider have timeouts declared and described in the documentation, but do not consume these timeout values for their operations. While most of these operations are expected to be short and finish within a reasonable amount of time (often within seconds), they can end up hanging indefinitely should something go wrong in the backend.

Solution

Use the resource.Retry(time.Duration, RetryFunc) function to enclose the blocks that should have timeouts applied. As a bonus, retry functionality can easily be added in the future, if deemed necessary.
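
For illustration, here is a minimal sketch of this pattern using the terraform-plugin-sdk helper/resource package; the resource name and the createCertificate helper are hypothetical placeholders, not the provider's actual code:

    package rancher2

    import (
        "fmt"

        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
    )

    // createCertificate is a hypothetical placeholder for the real Rancher API call.
    func createCertificate(meta interface{}, d *schema.ResourceData) (string, error) {
        return "example-id", nil
    }

    // resourceExampleCreate sketches the pattern: the API call is enclosed in
    // resource.Retry, bounded by the create timeout declared on the schema, so the
    // operation cannot hang past the configured timeout.
    func resourceExampleCreate(d *schema.ResourceData, meta interface{}) error {
        err := resource.Retry(d.Timeout(schema.TimeoutCreate), func() *resource.RetryError {
            id, err := createCertificate(meta, d)
            if err != nil {
                // NonRetryableError aborts immediately; returning RetryableError
                // instead would make the SDK keep retrying until the timeout.
                return resource.NonRetryableError(err)
            }
            d.SetId(id)
            return nil
        })
        if err != nil {
            return fmt.Errorf("creating resource: %w", err)
        }
        return nil
    }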

As outlined by @eliyamlevy in rancher/rancher#37161, the following resources will have this change applied:

  • rancher2_certificate (create, update, delete)
  • rancher2_cluster_sync (update, delete): update and delete are no-ops for this resource, and create already has timeouts implemented
  • rancher2_cluster_template (create, update, delete)
  • rancher2_global_role (create, update, delete)
  • rancher2_registry (create, update, delete)
  • rancher2_role_template (create, update, delete)
  • rancher2_secret (create, update, delete)
  • rancher2_token (create, delete)

Testing

Engineering Testing

Manual Testing

  • Tested all changed resources manually with a newly-built binary for creation, update, and deletion to make sure no changes in behaviour happened.

Automated Testing

  • Acceptance test suite only.

QA Testing Considerations

Regression Considerations

@stormqueen1990 self-assigned this on Nov 16, 2022
@stormqueen1990 changed the title from "[WIP] Implement retry logic to enforce timeouts" to "Implement retry logic to enforce timeouts" on Nov 18, 2022
@stormqueen1990 marked this pull request as ready for review on November 18, 2022 at 22:37
@a-blender (Contributor) left a comment

Looks pretty good!

There are a few other spots in the rancher2 provider that I noticed don't use the schema.Timeout values:

  • resourceRancher2CloudCredentialRead - add a retry?
  • resourceRancher2ClusterRead
  • resourceRancher2ClusterSyncRead
  • ...

In general, I notice the timeout is used in the resource.WaitForState function (https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource#StateChangeConf.WaitForState) for resource Create and Delete functions, but not Read. Is there a reason why you only implemented the retries for the specific files in this PR? IMO, we may want to implement retries in the Read functions for those other resources for code parity and to reduce the chance of an apply getting stuck. Let me know what you think.

Review comments on: rancher2/resource_rancher2_certificate.go (outdated), rancher2/resource_rancher2_role_template.go
@a-blender removed the request for review from Josh-Diamond on January 4, 2023 16:14
@stormqueen1990 (Contributor, Author) commented:

Hi there, @annablender!

> In general, I notice the timeout is used in the resource.WaitForState function (https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource#StateChangeConf.WaitForState) for resource Create and Delete functions, but not Read. Is there a reason why you only implemented the retries for the specific files in this PR? IMO, we may want to implement retries in the Read functions for those other resources for code parity and to reduce the chance of an apply getting stuck. Let me know what you think.

The main reason was that those were the files listed in the rancher/rancher#37161 issue, but I don't see why we couldn't expand this to all resources, since the lack of timeouts might affect every resource in this provider. I can update the PR to add those other resources you mentioned.

@a-blender (Contributor) left a comment

@stormqueen1990 Thank you! A few more things:

  • Additional updates look good. If you could add additional commits instead of using push -f, that'd be great; it will make it easier to see what's been added.
  • Could you add a retry to the rancher2_cluster Update func, here:
    err = client.APIBaseClient.ByID(managementClient.ClusterType, d.Id(), cluster)
    I think it will fix an issue I have on my plate.
  • Please roll back the removal of the cis v1 logging code; that has already been removed in another PR. If you are having build errors, rebase and push.
  • Please smoke test with a rancher2_cluster RKE cluster.

I will re-review after this

Add retry logic to the following resources/operations:
* resource_rancher2_certificate (create, update, delete)
* resource_rancher2_cluster_template (create, update, delete)
* resource_rancher2_global_role (create, update, delete)
* resource_rancher2_registry (create, update, delete)
* resource_rancher2_role_template (create, update, delete)
* resource_rancher2_secret (create, update, delete)
* resource_rancher2_token (create, delete)

Also add retry logic for all reads in these resources.
@a-blender (Contributor) commented Mar 8, 2023

Update: discussed with @stormqueen1990 offline. I think a retry in rancher2_cluster Update is needed to fix #1040 on upgrade. From the TF docs, WaitForState does not retry; it just waits for the timeout or an error condition and refreshes the upstream resource at intervals to see if the desired state has been achieved.

I will put in a separate PR for the retry I requested, since this PR is already so large. Otherwise, as long as these changes have been dev tested, re-request my review when ready.
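
(For reference, a minimal sketch of the WaitForState behaviour described above, using the terraform-plugin-sdk helper/resource package; the pending/target states and the lookupState helper are illustrative placeholders, not the provider's actual code.)

    package rancher2

    import (
        "fmt"
        "time"

        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
        "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
    )

    // waitForActive shows how StateChangeConf.WaitForState polls a refresh function
    // until the target state is reached or the schema-declared timeout expires; it
    // refreshes at intervals but does not retry failed API calls.
    func waitForActive(d *schema.ResourceData, lookupState func(id string) (interface{}, string, error)) error {
        stateConf := &resource.StateChangeConf{
            Pending: []string{"provisioning"}, // illustrative pending state
            Target:  []string{"active"},       // illustrative target state
            Refresh: func() (interface{}, string, error) {
                return lookupState(d.Id())
            },
            Timeout:    d.Timeout(schema.TimeoutCreate),
            Delay:      1 * time.Second,
            MinTimeout: 3 * time.Second,
        }
        if _, err := stateConf.WaitForState(); err != nil {
            return fmt.Errorf("waiting for resource %s to become active: %w", d.Id(), err)
        }
        return nil
    }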

Mauren Berti added 3 commits March 15, 2023 15:51
Remove the timeout implementation from the rancher2_feature and rancher2_setting
resources as they do not accept the timeouts configuration in their schemas.
Remove the timeout implementation from the rancher2_app and rancher2_multi_cluster_app
implementations as they are not compatible with Rancher v2.7 and this pull request
is targeting the Rancher v2.7 line for terraform-provider-rancher2.
Remove the timeout handling from the rancher2_project_role_template_binding resource as it
does not seem to work and requires more investigation.
@stormqueen1990 (Contributor, Author) commented Mar 15, 2023

Smoke test for the changed resources

I smoke tested the changed resources marked in the checklist below using the following methodology:

  1. Created a Terraform configuration using the resource.
  2. Added some timeout values as I saw fit.
  3. If the resource failed to provision in the allotted time, tweaked the timeout for the resource until it passed provisioning.
  4. Changed some values to test the update timeout.
  5. Deleted the created resources.

Note: all creations and updates were run using terraform apply and all deletions were run using terraform destroy.

  • rancher2/resource_rancher2_app_v2.go

    resource "rancher2_app_v2" "app" {
      name          = var.app_name
      chart_name    = var.chart_name
      cluster_id    = data.rancher2_cluster.local.id
      namespace     = rancher2_namespace.namespace.id
      repo_name     = var.repository_name
      chart_version = var.chart_version
    
      timeouts {
        create = "60s"
        delete = "60s"
        update = "60s"
      }
    }
  • rancher2/resource_rancher2_catalog_v2.go

    resource "rancher2_catalog_v2" "catalog" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.catalog_name
      git_repo   = var.git_repository_address
      git_branch = var.git_branch_name
    
      annotations = var.catalog_annotations
    
      timeouts {
        create = "30s"
        delete = "30s"
        update = "30s"
      }
    }
  • rancher2/resource_rancher2_certificate.go

    resource "rancher2_certificate" "certificate" {
      certs      = base64encode(file(var.certificate_file_path))
      key        = base64encode(file(var.key_file_path))
      project_id = data.rancher2_project.project.id
      name       = var.certificate_name
    
      labels = var.certificate_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_cloud_credential.go

    resource "rancher2_cloud_credential" "cloud_credential" {
      name        = var.cloud_credential_name
      annotations = var.cloud_credential_annotations
    
      digitalocean_credential_config {
        access_token = var.cloud_credential_token
      }
    
      timeouts {
        create = "1s"
        delete = "1s"
        update = "1s"
      }
    }
  • rancher2/resource_rancher2_cluster.go, rancher2/resource_rancher2_cluster_sync.go, rancher2/resource_rancher2_node_template.go, rancher2/resource_rancher2_node_pool.go

    resource "rancher2_node_template" "node_template" {
      name                = var.node_template_name
      cloud_credential_id = data.rancher2_cloud_credential.cloud_credential.id
      engine_install_url  = var.engine_install_url
    
      digitalocean_config {
        image  = var.os_image
        region = var.do_region
        size   = var.machine_size
      }
    
      timeouts {
        create = "60s"
        update = "60s"
        delete = "60s"
      }
    }
    
    resource "rancher2_node_pool" "node_pool" {
      depends_on = [rancher2_node_template.node_template]
    
      name             = var.node_pool_name
      hostname_prefix  = var.pool_hostname_prefix
      cluster_id       = rancher2_cluster.cluster.id
      node_template_id = rancher2_node_template.node_template.id
      quantity         = 3
      control_plane    = true
      etcd             = true
      worker           = true
    
      timeouts {
        create = "5m"
        update = "5m"
        delete = "5m"
      }
    }
    
    resource "rancher2_cluster" "cluster" {
      depends_on = [rancher2_node_template.node_template]
      name       = var.cluster_name
    
      rke_config {
        network {
          plugin = "calico"
        }
        kubernetes_version = var.cluster_kubernetes_version
      }
    
      enable_network_policy = true
    
      timeouts {
        create = "5m"
        update = "5m"
        delete = "5m"
      }
    }
    
    resource "rancher2_cluster_sync" "cluster_sync" {
      cluster_id    = rancher2_cluster.cluster.id
      wait_catalogs = true
      node_pool_ids = [rancher2_node_pool.node_pool.id]
      state_confirm = 4
    
      timeouts {
        create = "10m"
        update = "10m"
        delete = "10m"
      }
    }
  • rancher2/resource_rancher2_cluster_alert_group.go, rancher2/resource_rancher2_cluster_alert_rule.go

    resource "rancher2_cluster_alert_group" "alert_group" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.alert_group_name
    
      group_interval_seconds = var.alert_group_interval_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
    
    resource "rancher2_cluster_alert_rule" "alert_rule" {
      cluster_id = data.rancher2_cluster.local.id
      group_id   = rancher2_cluster_alert_group.alert_group.id
      name       = var.alert_rule_name
    
      group_wait_seconds = var.alert_rule_group_wait_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_cluster_driver.go

    resource "rancher2_cluster_driver" "cluster_driver" {
      name    = var.cluster_driver_name
      active  = false
      builtin = false
      url     = var.cluster_driver_url
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_role_template.go, rancher2/resource_rancher2_cluster_role_template_binding.go

    resource "rancher2_role_template" "role_template" {
      name = var.role_template_name
    
      rules {
        api_groups = var.role_template_api_groups
        resources  = var.role_template_resources
        verbs      = var.role_template_verbs
      }
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_cluster_role_template_binding" "role_template_binding" {
      name             = var.role_template_binding_name
      cluster_id       = data.rancher2_cluster.local.id
      role_template_id = rancher2_role_template.role_template.id
    
      user_id = data.rancher2_user.admin.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_cluster_template.go

    resource "rancher2_cluster_template" "template" {
      name   = var.cluster_template_name
      labels = var.cluster_template_labels
    
      template_revisions {
        name = "v1"
        cluster_config {
          rke_config {
            network {
              plugin = "canal"
            }
          }
        }
    
        default = true
      }
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_config_map_v2.go

    resource "rancher2_config_map_v2" "config_map" {
      name       = var.config_map_name
      cluster_id = data.rancher2_cluster.local.id
      data       = var.config_map_data
    
      labels = var.config_map_labels
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_etcd_backup.go

    resource "rancher2_etcd_backup" "backup" {
      cluster_id  = data.rancher2_cluster.local.id
      annotations = var.etcd_backup_annotations
    
      timeouts {
        create = "5m"
        delete = "5m"
        update = "5m"
      }
    }
  • rancher2/resource_rancher2_global_dns.go, rancher2/resource_rancher2_global_dns_provider.go

    resource "rancher2_global_dns_provider" "global_dns_provider" {
      name        = var.global_dns_provider_name
      root_domain = var.global_dns_provider_root_domain
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_global_dns" "global_dns" {
      fqdn        = var.global_dns_fqdn
      provider_id = rancher2_global_dns_provider.global_dns_provider.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_global_role.go, rancher2/resource_rancher2_global_role_binding.go

    resource "rancher2_global_role" "role" {
      name = var.global_role_name
    
      rules {
        api_groups = var.global_role_api_groups
        resources  = var.global_role_resources
        verbs      = var.global_role_verbs
      }
      
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
    
    resource "rancher2_global_role_binding" "role_binding" {
      global_role_id = rancher2_global_role.role.id
      user_id        = data.rancher2_user.admin.id
      annotations    = var.global_role_binding_annotations
      
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_namespace.go

    resource "rancher2_namespace" "namespace" {
      name       = var.namespace_name
      project_id = rancher2_project.project.id
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_node_driver.go

    resource "rancher2_node_driver" "node_driver" {
      active  = false
      builtin = false
      url     = var.node_driver_url
      name    = var.node_driver_name
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_notifier.go

    resource "rancher2_notifier" "notifier" {
      cluster_id = data.rancher2_cluster.local.id
      name       = var.notifier_name
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_pod_security_policy_template.go

    resource "rancher2_pod_security_policy_template" "psp-template" {
      name                       = var.psp_template_name
      allow_privilege_escalation = false
      host_pid                   = false
    
      se_linux {
        rule = "RunAsAny"
      }
    
      run_as_user {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      run_as_group {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      fs_group {
        rule = "MustRunAs"
        range {
          max = 1000
          min = 1000
        }
      }
    
      supplemental_group {
        rule = "MustRunAs"
        range {
          max = 1001
          min = 1001
        }
      }
    
      timeouts {
        create = "5m"
        delete = "5m"
        update = "5m"
      }
    }
  • rancher2/resource_rancher2_project.go

    resource "rancher2_project" "project" {
      name        = var.project_name
      cluster_id  = data.rancher2_cluster.local.id
      description = var.project_description
    
      enable_project_monitoring = true
    
      labels = var.project_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_project_alert_group.go, rancher2/resource_rancher2_project_alert_rule.go

    resource "rancher2_project_alert_group" "alert_group" {
      name       = var.project_alert_group_name
      project_id = rancher2_project.project.id
    
      group_interval_seconds = var.project_alert_group_interval_seconds
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
    
    resource "rancher2_project_alert_rule" "alert_rule" {
      name       = var.project_alert_rule_name
      group_id   = rancher2_project_alert_group.alert_group.id
      project_id = rancher2_project.project.id
      
      group_wait_seconds = var.project_alert_rule_wait_seconds
      
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_registry.go

    resource "rancher2_registry" "registry" {
      name       = var.registry_name
      project_id = rancher2_project.test_project.id
    
      labels = var.registry_labels
    
      registries {
        address = var.registry_address
      }
    
      timeouts {
        create = "5s"
        delete = "5s"
        update = "5s"
      }
    }
  • rancher2/resource_rancher2_secret.go

    resource "rancher2_secret" "secret" {
      data         = var.secret_data
      project_id   = rancher2_project.test_project.id
      namespace_id = rancher2_namespace.test_namespace.id
      name         = var.secret_name
    
      labels = var.secret_labels
    
      timeouts {
        create = "10s"
        delete = "10s"
        update = "10s"
      }
    }
  • rancher2/resource_rancher2_secret_v2.go

    resource "rancher2_secret_v2" "secret" {
      cluster_id = data.rancher2_cluster.local.id
      data       = var.secret_data
      name       = var.secret_name
    
      annotations = var.secret_annotations
    
      timeouts {
        create = "10s"
        delete = "10s"
        update = "10s"
      }
    }
  • rancher2/resource_rancher2_storage_class_v2.go

    resource "rancher2_storage_class_v2" "storage_class" {
      cluster_id      = data.rancher2_cluster.local.id
      k8s_provisioner = var.storage_class_provisioner_name
      name            = var.storage_class_name
      reclaim_policy  = var.storage_class_reclaim_policy
      labels          = var.storage_class_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_token.go

    resource "rancher2_token" "token" {
      cluster_id = data.rancher2_cluster.local.id
      ttl        = var.token_ttl
      labels     = var.token_labels
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }
  • rancher2/resource_rancher2_user.go

    resource "rancher2_user" "user" {
      password = var.user_password
      username = var.user_username
      enabled  = var.user_enabled
    
      timeouts {
        create = "5s"
        update = "5s"
        delete = "5s"
      }
    }

Additional considerations

  • Removed timeout implementation from rancher2/resource_rancher2_feature.go and rancher2/resource_rancher2_setting.go as those do not take timeout parameters in their schemas.
  • Removed timeout implementation from rancher2/resource_rancher2_app.go and rancher2/resource_rancher2_multi_cluster_app.go as this pull request is targeting Rancher v2.7+ and those resources are incompatible with Rancher v2.7.
  • Removed timeout implementation from rancher2/resource_rancher2_project_role_template_binding.go as it does not seem to be working with the latest released provider.
  • I have removed changes from rancher2_cluster_v2 and rancher2_machine_config_v2 as I have not been able to test those two resources. Changes to them will be added in a subsequent pull request.

@a-blender (Contributor) commented:

@stormqueen1990 Thank you for your work on this! Did you try testing provisioning using AWS instead of DO (I think DO was what you were using, if I'm correct) for rancher2_cluster_v2 and rancher2_machine_config_v2? Checking in on those last two, since a lot of TF issues I get pertain to those resources.

@stormqueen1990 (Contributor, Author) commented:

> @stormqueen1990 Thank you for your work on this! Did you try testing provisioning using AWS instead of DO (I think DO was what you were using, if I'm correct) for rancher2_cluster_v2 and rancher2_machine_config_v2? Checking in on those last two, since a lot of TF issues I get pertain to those resources.

Hi there, @a-blender! I unfortunately didn't get around to testing those two resources as I had a bunch of issues with AWS security configurations, and afterwards I had to pause my work on this. I could remove both of them from this pull request and create a separate one.

@a-blender (Contributor) commented Apr 4, 2023

@stormqueen1990 Yeah, if you could create a separate PR with those 2 resources, that'd be great. I'll approve this PR after you do that. I also mostly test TF rancher2 / rke with AWS as it's super easy to provision test nodes, so ping me internally; I want to help you get set up with that.

These resources will be added to a separate pull request as I was not able to check them at this time.
@stormqueen1990 (Contributor, Author) commented Apr 4, 2023

> @stormqueen1990 Yeah, if you could create a separate PR with those 2 resources, that'd be great. I'll approve this PR after you do that. I also mostly test TF rancher2 / rke with AWS as it's super easy to provision test nodes, so ping me internally; I want to help you get set up with that.

I've reverted both rancher2_cluster_v2 and rancher2_machine_config_v2 as discussed.

@a-blender added this to the v2.7.2 - Terraform milestone on Apr 6, 2023
@HarrisonWAffel (Contributor) left a comment

This looks good! Though we likely want to squash on merge.
