
Fix when Helm provider ignores FAILED release state #161

Merged
merged 5 commits into hashicorp:master on Jan 11, 2019

Conversation

@stepanstipl (Contributor) commented Nov 26, 2018

This moves the status field out of the computed metadata into a top-level non-computed field and uses CustomizeDiff to track the state of the Helm release and enforce that the state is DEPLOYED.

This fixes the behavior where Terraform would ignore a previously created release in a FAILED state and make no attempt to bring it back to the desired state (it assumed the state would always be DEPLOYED).

See #159 for more detail (the issue has detailed steps on how to reproduce it).
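The mechanism, as a minimal sketch: getRelease and the meta plumbing here are hypothetical stand-ins for this illustration, not the PR's literal code; only the CustomizeDiff/SetNew pattern and the hapi/release status codes are the real Terraform SDK and Helm v2 APIs.

```go
package helm

import (
	"github.com/hashicorp/terraform/helper/schema"
	"k8s.io/helm/pkg/proto/hapi/release"
)

// getRelease is a hypothetical stand-in for the provider's lookup of the
// live release through the Helm client; the real PR wires this differently.
func getRelease(meta interface{}, name string) (*release.Release, error) {
	// ... real code would query Tiller via the Helm client here ...
	return nil, nil
}

// resourceDiff sketches the CustomizeDiff hook: if the live release is not
// DEPLOYED, declare the desired status so Terraform plans an update instead
// of reporting "no changes" for a FAILED release.
func resourceDiff(d *schema.ResourceDiff, meta interface{}) error {
	r, err := getRelease(meta, d.Get("name").(string))
	if err != nil {
		// The release may not exist yet (first apply); nothing to enforce.
		return nil
	}
	if r.GetInfo().GetStatus().GetCode() != release.Status_DEPLOYED {
		// Setting the new value creates a diff, which triggers an update.
		return d.SetNew("status", release.Status_DEPLOYED.String())
	}
	return nil
}
```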

This PR adds a test for this issue. To run the integration test(s):

  • you need a Kubernetes cluster set up and kubectl configured (verify by running kubectl get pods)
  • you need Helm installed and the Helm client configured (verify by running helm list)
  • set the env variable TF_ACC to true to enable acceptance tests (export TF_ACC=true)
  • you need Go 1.10+ installed and this project checked out at the path Go conventions expect ($GOPATH/src/github.com/terraform-providers/terraform-provider-helm)
  • you can run all the tests with
    go test -v helm/*
    
    but expect some of them to fail (I believe the failures are unrelated to this change; they occurred for me both with and without it). To run just the test for this issue:
    go test -v -run "TestAccResourceRelease_updateExistingFailed" helm/*
    

Update: After rebasing on the latest master and fixing the 2 remaining tests (some fail only when running against GKE, not against minikube), all the tests are expected to pass against both minikube and GKE.

@ghost added the size/M label Nov 26, 2018
@ianks commented Nov 30, 2018

We desperately need this. This bug is causing our state to become unrecoverable in some cases.

@legal90 (Contributor) commented Nov 30, 2018

@stepanstipl Thank you! That's really needed.

But I just looked at the trick with a default value and verification. Maybe we can make the "status" attribute Computed and assign its desired value within CustomizeDiff, like we do with the "version" attribute? I think the resource UX would be a bit clearer, since the attribute would be computed-only, so the user would not be able to set it via configuration.
Check it out if you are interested: legal90@5ceda8d
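Schema-wise, the suggestion amounts to roughly this sketch, following the Terraform SDK's map-of-schema convention (the description string is illustrative, not quoted from the provider):

```go
package helm

import "github.com/hashicorp/terraform/helper/schema"

// Fragment of the helm_release schema (sketch): "status" becomes a
// computed-only attribute, so the provider tracks it and CustomizeDiff
// assigns its desired value, while users cannot set it in configuration.
var releaseStatusSchema = map[string]*schema.Schema{
	"status": {
		Type:        schema.TypeString,
		Computed:    true,
		Description: "Status of the release.",
	},
}
```

With the attribute Computed rather than Optional, the d.SetNew("status", ...) call in CustomizeDiff still forces a diff for a FAILED release, but any attempt to set status in configuration is rejected at plan time.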

@stepanstipl (Contributor, Author) commented

@legal90 thanks, that sounds good. I'll have a closer look and test whether it works as expected in our case, but I like it! I originally tried to keep the status nested under the computed metadata, but I wasn't able to figure that out, so I gave up and moved it one level up as an Optional field.

One other thing I noticed in your code - r.GetInfo().GetStatus().GetCode().String() vs r.Info.Status.Code.String(). I agree that the 2nd one is more readable, but at the same time I had the impression it's better to access objects via the accessor Get* methods if they're available. Being quite new to Go - can you elaborate on your preference?

@legal90 (Contributor) commented Dec 3, 2018

r.GetInfo().GetStatus().GetCode().String() vs r.Info.Status.Code.String() <...> Being quite new to Go - can you elaborate on your preference?

I was just keeping the same approach as in the rest of the code here. I believe this approach could be changed in a separate pull request if needed.
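One concrete difference between the two styles is worth noting: the Get* accessors that protoc generates for Go are nil-safe (they return the zero value when the receiver is nil), while direct field access panics on a nil pointer, so the accessor chain is the defensive choice. A minimal illustration using Helm v2's generated hapi/release types:

```go
package main

import (
	"fmt"

	"k8s.io/helm/pkg/proto/hapi/release"
)

func main() {
	var r *release.Release // nil, e.g. a release that was never fetched

	// Generated getters short-circuit on nil receivers and return zero
	// values, so this safely prints the zero status code ("UNKNOWN"):
	fmt.Println(r.GetInfo().GetStatus().GetCode().String())

	// Direct field access would panic here:
	// fmt.Println(r.Info.Status.Code.String()) // nil pointer dereference
}
```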

@stepanstipl (Contributor, Author) commented

I have refactored the code as per @legal90's suggestion (thanks, it's better this way) - the status is now a Computed field. The resulting functionality stays the same. I've tested that it still fixes the issue in our use case, it works for the steps described in #159, and the test code added originally stayed the same and passes.

@meyskens (Contributor) commented

@stepanstipl can you rebase on master? We fixed a lot of things in the acceptance tests.

@bluemalkin commented
Which scheduled release can we expect this to be in, please? Keep up the good work!

@stepanstipl changed the title from "Fix when Helm provider ignores FAILED release state" to "[WIP] Fix when Helm provider ignores FAILED release state" Dec 14, 2018
This moves the status field out of the computed metadata into a top-level
non-computed field and uses `CustomizeDiff` in order to track the state of
the Helm release and enforce that the state is `DEPLOYED`.

This fixes the behavior where Terraform would ignore a previously created
release in a `FAILED` state and make no attempt to bring it back to the
desired state (it assumed the state would always be `DEPLOYED`).

See
hashicorp#159
for more details (the issue has detailed steps on how to reproduce it).
@stepanstipl (Contributor, Author) commented

@meyskens done, nice job with the tests!

I've rebased on master, cleaned up my commits, and also fixed a bug introduced by the earlier refactoring.

TestAccResourceRelease_updateAfterFail was still failing for me (even on the original master, but I think this is related to running the tests on minikube, as a service of type LoadBalancer will never become ready), so I've fixed that one as well.

Finally, TestAccResourceRelease_repository was failing as well when running against a GKE cluster, with the message chart "telegraf" matching 1.0.2 not found in stable-repo index. I believe this has to do with a combination of the newer helm (0.12) and the fact that the telegraf chart is marked as deprecated everywhere (see https://github.com/helm/charts/tree/master/stable/telegraf).

Now the tests pass against both minikube and real GKE clusters.

@stepanstipl changed the title from "[WIP] Fix when Helm provider ignores FAILED release state" to "Fix when Helm provider ignores FAILED release state" Dec 14, 2018
When running against a GKE cluster, the test with the `telegraf` chart would
fail with: chart "telegraf" matching 1.0.2 not found in stable-repo index.
I believe it has to do with the combination of the new helm version (0.12)
and the fact that all telegraf charts are marked as deprecated.
@stepanstipl (Contributor, Author) commented

Hi @meyskens, is there anything else that's needed? Sorry for rushing - I'm just wondering whether it looks like we can get this in :)

@meyskens (Contributor) commented

@stepanstipl sorry for the delay, the code looks great! Just going to run the tests again, as it seems Travis timed out but still flagged it green.

@meyskens merged commit 1524e6f into hashicorp:master Jan 11, 2019
@legal90 mentioned this pull request Feb 4, 2019
Review thread on the getChart error handling in resourceDiff:

```go
if err == nil {
	return d.SetNew("version", c.Metadata.Version)
}
return nil
```
@legal90 (Contributor) commented
@stepanstipl Could you remind me - why did you make the call to getChart tolerant to errors?

As I see it, if the helm client (library) fails to fetch the chart, that should be a valid reason to raise an error, as happens in the other places where getChart is called (see the resourceReleaseCreate and resourceReleaseUpdate functions). Otherwise we can miss the diff and the update won't be triggered.
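For contrast, the strict handling in the create/update paths looks roughly like this - a sketch with getChart's signature assumed from the discussion, not a quote of the provider's code:

```go
// In resourceReleaseCreate/resourceReleaseUpdate (sketch): a chart that
// cannot be fetched is a hard error that aborts the whole operation.
c, _, err := getChart(d, meta)
if err != nil {
	return err // fail fast: no chart, no create/update
}
// ... proceed to install/upgrade using the resolved chart c ...
```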

@stepanstipl (Contributor, Author) commented

Hah, it took me a while to remember and re-figure out why I did that - it's been a long time :).

So CustomizeDiff, and therefore resourceDiff, is in some cases executed before the chart is available. In that scenario getChart would fail and thus fail the whole plan/apply run, even though the run would otherwise complete just fine.

A good example of this is the TestAccResourceRelease_repository test case, where the chart only becomes available after the repo is added via the helm_repository resource.
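Putting that together, the tolerant variant inside resourceDiff reads roughly like this (same caveat as above: getChart's exact signature is assumed from the discussion):

```go
// In resourceDiff (sketch): the chart's repo may only be registered later
// in the same run (e.g. by a helm_repository resource), so a resolution
// failure at plan time must not fail the whole plan/apply.
c, _, err := getChart(d, meta)
if err != nil {
	return nil // chart not resolvable yet; skip the version diff for now
}
return d.SetNew("version", c.Metadata.Version)
```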

@legal90 (Contributor) commented Feb 10, 2019

@stepanstipl Thank you for the explanation! Now I understand. Your test case failed because the CustomizeDiff functions of all resources are called before those resources are created/updated.
I asked just because I ran into the opposite situation - my custom repo was unavailable at the network level, but terraform apply didn't show any diff or error when one was expected.

It seems that downloading the chart at the diff evaluation stage is not a good idea, and we need to find a way to store everything needed in the Terraform state.
/cc @rporres

@stepanstipl (Contributor, Author) commented

but terraform apply didn't show any diff or error when one was expected

@legal90 do you mean that even the apply step, with the helm_release resource trying to install a chart from an unavailable repo, did not fail? I would expect TF to fail there.

@legal90 (Contributor) commented

@stepanstipl My case was pretty basic - the repository URL was unavailable and the HTTP connection to it was timing out with a very long delay (~3-4 minutes), which just caused terraform apply to run for a very long time without any error being thrown.
Though there was no diff expected in my case. You're right - if there were a diff, then terraform would eventually throw an error, because it downloads the chart once again in resourceReleaseUpdate.

@alexsomesan added this to the v0.8.0 milestone Feb 11, 2019
@hashicorp locked and limited conversation to collaborators Apr 27, 2020