New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement retry logic to enforce timeouts #1033
Conversation
b841be6
to
03d983f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good!
There are a few other spots in the rancher2 provider I noticed doesn't use the schema.Timeout
resourceRancher2CloudCredentialRead
- add a retry?resourceRancher2ClusterRead
resourceRancher2ClusterSyncRead
- ...
In general, I notice the timeout is used in the resource.WaitForState
https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource#StateChangeConf.WaitForState function for resource Create and Delete functions, but not Read. Is there a reason why you only implemented the retries for the specific files in this PR? IMO, we may want to implement retries in the Read functions for those other resources for code parity and to reduce the chance of an apply getting stuck. Let me know what you think
Hi there, @annablender!
The main reason was that those were the files listed in the rancher/rancher#37161 issue, but I don't see why not expand it to all the resources, since the lack of timeouts might affect all the resources in this provider. I can update the PR to add those other resources you mentioned. |
4e976f5
to
41d6a15
Compare
41d6a15
to
6ba9867
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stormqueen1990 Thank you, a few more things
- Additional updates look good. If you could add additional commits instead of push -f that'd be great. It will make it easier to see what's been added
- Could you add a retry to the
rancher2_cluster
Update func? hereerr = client.APIBaseClient.ByID(managementClient.ClusterType, d.Id(), cluster) - Please rollback removing cis v1 logging code. That has already been removed in another PR. If you are having build errors, rebase and push
- Please smoketest with a rancher2_cluster rke cluster
I will re-review after this
Add retry logic to the following resources/operations: * resource_rancher2_certificate (create, update, delete) * resource_rancher2_cluster_template (create, update, delete) * resource_rancher2_global_role (create, update, delete) * resource_rancher2_registry (create, update, delete) * resource_rancher2_role_template (create, update, delete) * resource_rancher2_secret (create, update, delete) * resource_rancher2_token (create, delete) Plus add retry logic for all reads in the resource.
6ba9867
to
898b4f8
Compare
Update: discussed with @stormqueen1990 offline. I think a retry in I will put in a separate PR for the retry I requested since this PR is already so large. Otherwise, as long as these changes have been dev tested re-request my review when ready |
Remove the timeout implementation from the rancher2_feature and rancher2_setting resources as they do not take the timeouts configuration on their schemas.
Remove the timeout implementation from the rancher2_app and rancher2_multi_cluster_app implementations as they are not compatible with Rancher v2.7 and this pull request is targeting the Rancher v2.7 line for terraform-provider-rancher2.
Remove the timeout handling from the rancher2_project_role_template_binding as it does not seem to work and require more investigation.
Smoke test for the changed resourcesI smoke tested the changed resources marked in the checklist below using the following methodology:
Additional considerations
|
@stormqueen1990 Thank you for your work on this! Did you try testing prov using aws instead of DO (I think DO was what you were using if I'm correct) for |
Hi there, @a-blender! I unfortunately didn't get around to testing those two resources as I had a bunch of issues with AWS security configurations, and afterwards I had to pause my work on this. I could remove both of them from this pull request and create a separate one. |
@stormqueen1990 Yeah, if you could create a separate PR with those 2 resources that'd be great. I'll approve this PR after you do that. I also mostly test TF rancher2 / rke with aws as it's super easy to provision test nodes, so ping me internally. I want to help you get setup with that |
These resources will be added to a separate pull request as I was not able to check them at this time.
I've reverted both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! Thought we likely want to squash on merge
Issue: rancher/rancher#37161
Problem
Currently, some resources in the Terraform Rancher2 provider have timeouts declared and described in documentation, but do not consume these timeout values for their operations. While most of these operations are expected to be short and finish within a reasonable amount of time (often within seconds), it does seem that they might end up hanging indefinitely should something go wrong in the backend.
Solution
Use the
resource.Retry(time.Duration, RetryFunc)
function to enclose blocks that we would want to have timeouts applied. As a bonus, retry functionality may be easily implemented in the future, if deemed necessary.As outlined by @eliyamlevy in rancher/rancher#37161, the following resources will have this change applied:
rancher2_certificate
(create, update, delete)update and delete are no-ops for this resource, and create already has timeouts implementedrancher2_cluster_sync
(update, delete)rancher2_cluster_template
(create, update, delete)rancher2_global_role
(create, update, delete)rancher2_registry
(create, update, delete)rancher2_role_template
(create, update, delete)rancher2_secret
(create, update, delete)rancher2_token
(create, delete)Testing
Engineering Testing
Manual Testing
Automated Testing
QA Testing Considerations
Regressions Considerations