
Fix when Helm provider ignores FAILED release state #161

Merged
merged 5 commits into terraform-providers:master from stepanstipl:fix-ignore-failed-up on Jan 11, 2019

Conversation

@stepanstipl (Contributor) commented Nov 26, 2018

This moves the status field out of the computed metadata block into a standalone non-computed field and uses CustomizeDiff to track the state of a Helm release and enforce that the state is DEPLOYED.

This fixes behavior where Terraform would ignore a previously created release in a FAILED state and would not attempt to bring it to the desired state (it assumed the state was always DEPLOYED).

See #159 for more detail (the issue has detailed steps on how to reproduce the problem).
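The core invariant behind the fix can be modelled with a small self-contained sketch (the names below are illustrative, not the provider's actual identifiers): the desired release status is always DEPLOYED, so any other stored status, such as FAILED, must produce a diff instead of being silently ignored.

```go
package main

import "fmt"

// desiredStatus is the state the provider should always converge to.
const desiredStatus = "DEPLOYED"

// needsUpdate models the heart of the CustomizeDiff logic in this PR:
// a release whose recorded status differs from DEPLOYED (e.g. a FAILED
// install) must show up in the plan so Terraform tries to repair it.
func needsUpdate(currentStatus string) bool {
	return currentStatus != desiredStatus
}

func main() {
	fmt.Println(needsUpdate("FAILED"))   // true: plan an update
	fmt.Println(needsUpdate("DEPLOYED")) // false: nothing to do
}
```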

This PR adds a test for this issue. To run the integration test(s):

  • you need a Kubernetes cluster set up & kubectl configured (verify by running kubectl get pods)
  • you need Helm installed & the Helm client configured (verify by running helm list)
  • set the env variable TF_ACC to true to enable acceptance tests (export TF_ACC=true)
  • you need Go 1.10+ installed and this project in the correct path as per Go conventions ($GOPATH/src/github.com/terraform-providers/terraform-provider-helm)
  • you can run all the tests with
    go test -v helm/*
    
    but expect some of them to fail (I believe this is not related to this change; they were failing for me both with and without it). To run just the test for this issue:
    go test -v -run "TestAccResourceRelease_updateExistingFailed" helm/*
    

Update: After rebasing on the latest master and fixing the 2 remaining tests (some fail only when running against GKE and not when running against minikube), all the tests are expected to pass against both minikube & GKE.

@hashibot hashibot bot added the size/M label Nov 26, 2018
@ianks commented Nov 30, 2018

We desperately need this. This bug is causing our state to become unrecoverable in some cases.

@legal90 (Contributor) commented Nov 30, 2018

@stepanstipl Thank you! That's really needed.

But I just looked at the trick with the default value and verification. Maybe we can make the "status" attribute Computed and assign its desired value in the scope of CustomizeDiff, like we do with the "version" attribute? I guess the resource UX would be a bit clearer, since the attribute would be computed-only, so the user would not be able to set it via configuration.
Check it out if you are interested: legal90@5ceda8d
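A rough self-contained model of the suggested approach, using a toy stand-in for the SDK's ResourceDiff (all names here are illustrative, not the real SDK API): the "status" attribute is computed-only, and CustomizeDiff always records DEPLOYED as its planned value, so a FAILED release shows a pending change without the user ever setting the attribute.

```go
package main

import "fmt"

// resourceDiff is a toy stand-in for the SDK's *schema.ResourceDiff:
// SetNew records the planned value of a computed attribute, and a
// change exists when the current and planned values disagree.
type resourceDiff struct {
	current map[string]string
	planned map[string]string
}

func (d *resourceDiff) SetNew(key, value string) {
	d.planned[key] = value
}

func (d *resourceDiff) HasChange(key string) bool {
	planned, ok := d.planned[key]
	return ok && planned != d.current[key]
}

// customizeDiff mirrors the suggestion: the desired status is always
// assigned during diff evaluation, never set via configuration.
func customizeDiff(d *resourceDiff) {
	d.SetNew("status", "DEPLOYED")
}

func main() {
	d := &resourceDiff{
		current: map[string]string{"status": "FAILED"},
		planned: map[string]string{},
	}
	customizeDiff(d)
	fmt.Println(d.HasChange("status")) // true: FAILED != DEPLOYED, update planned
}
```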

@stepanstipl (Contributor, Author) commented Dec 2, 2018

@legal90 thanks, that sounds good. I'll have a closer look and test whether it works as expected in our case, but I like it! I originally tried to keep the status nested under the computed metadata, but I wasn't able to figure that out, so I gave up on it and moved it one level up as an Optional field.

One other thing I noticed in your code: r.GetInfo().GetStatus().GetCode().String() vs r.Info.Status.Code.String(). I agree that the second is more readable, but at the same time I had the impression it's better to access objects via the Get* accessor methods when they're available. Being quite new to Go myself, can you elaborate on your preference?

@legal90 (Contributor) commented Dec 3, 2018

r.GetInfo().GetStatus().GetCode().String() vs r.Info.Status.Code.String() <...> Myself being quite new to go - can you elaborate on your preference?

I was just keeping the same approach as the rest of the code here. I believe this approach could be changed in a separate pull request if needed.

@stepanstipl (Contributor, Author) commented Dec 5, 2018

I have refactored the code as per @legal90's suggestion (thanks, it's better this way): the status is now a Computed field. The resulting functionality stays the same. I've tested that it still fixes the issue in our use case, it works for the steps described in #159, and the originally added test code stayed the same and passes.

@stepanstipl stepanstipl force-pushed the stepanstipl:fix-ignore-failed-up branch from c9048d6 to c74d30f Dec 5, 2018
@meyskens (Collaborator) commented Dec 13, 2018

@stepanstipl can you rebase on master? We fixed a lot of things in the acceptance tests.

@bluemalkin commented Dec 13, 2018

Which scheduled release can we expect this to be in please? Keep up the good work!

@stepanstipl stepanstipl changed the title Fix when Helm provider ignores FAILED release state [WIP] Fix when Helm provider ignores FAILED release state Dec 14, 2018
stepanstipl added 4 commits Dec 14, 2018
This moves the status field out of the computed metadata block into a
non-computed field and uses `CustomizeDiff` in order to track the state
of a Helm release and enforce that the state is `DEPLOYED`.

This fixes behavior where Terraform would ignore a previously created
release in a `FAILED` state and would not attempt to bring it to the
desired state (assuming it would always be `DEPLOYED`).

See #159 for more details (the issue has detailed steps on how to
reproduce the problem).
@stepanstipl stepanstipl force-pushed the stepanstipl:fix-ignore-failed-up branch from 5b0400a to e12727b Dec 14, 2018
@stepanstipl (Contributor, Author) commented Dec 14, 2018

@meyskens done, nice job with tests!

I've rebased on master and cleaned up my commits, also fixed a bug introduced by the earlier refactoring.

TestAccResourceRelease_updateAfterFail was still failing for me (even on the original master, but I think this is related to running tests on minikube, as a service of type LoadBalancer will never become ready there), so I've fixed that one as well.

Finally, TestAccResourceRelease_repository was failing as well when running against a GKE cluster, with the message chart "telegraf" matching 1.0.2 not found in stable-repo index. I believe this has to do with a combination of a newer Helm version (0.12) and the fact that the telegraf chart is marked as deprecated everywhere (see https://github.com/helm/charts/tree/master/stable/telegraf).

Now the tests pass both against minikube and real GKE clusters.

@stepanstipl stepanstipl changed the title [WIP] Fix when Helm provider ignores FAILED release state Fix when Helm provider ignores FAILED release state Dec 14, 2018
When running against a GKE cluster, the test with the `telegraf` chart
would fail with: chart "telegraf" matching 1.0.2 not found in stable-repo
index. I believe it has to do with a combination of the new Helm version
(0.12) and the fact that all telegraf charts are marked as deprecated.
@stepanstipl stepanstipl force-pushed the stepanstipl:fix-ignore-failed-up branch from e2a4224 to c959468 Dec 14, 2018
@meyskens meyskens self-requested a review Dec 17, 2018
@stepanstipl (Contributor, Author) commented Jan 10, 2019

Hi @meyskens, is there anything else that's needed? Sorry for rushing; I'm just wondering whether it looks like we can get this in :)

@meyskens (Collaborator) commented Jan 11, 2019

@stepanstipl sorry for the delay, the code looks great! I'm just going to run the tests again, as it seems Travis timed out but still flagged it green.

@meyskens meyskens merged commit 1524e6f into terraform-providers:master Jan 11, 2019
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@legal90 legal90 referenced this pull request Feb 4, 2019
    if err == nil {
    	return d.SetNew("version", c.Metadata.Version)
    }
    if err != nil {
    	return nil
    }

@legal90 (Contributor) commented Feb 10, 2019
@stepanstipl Could you please remind me why you made the call to getChart tolerant to errors?

As I see it, if the helm client (library) fails to fetch the chart, that should be a valid point for throwing an error, as happens in the other places where getChart is called (see the resourceReleaseCreate and resourceReleaseUpdate functions). Otherwise we can miss the diff and the update won't be triggered.

@stepanstipl (Author, Contributor) commented Feb 10, 2019

Hah, it took me a while to remember and re-figure out why I did that; it's been a long time :).

So CustomizeDiff, and therefore resourceDiff, is in some cases executed before the chart is available. In that scenario getChart would fail and thereby fail the whole plan/apply run, even though the run would otherwise complete just fine.

A good example of this is the TestAccResourceRelease_repository test case, where the chart only becomes available after the repo is added via the helm_repository resource.
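The tolerant behaviour being discussed can be sketched in a self-contained model (getChart and diffVersion here are illustrative stand-ins, not the provider's real functions): if the chart cannot be fetched at diff time, for example because the helm_repository resource has not been applied yet, the version diff is simply skipped instead of failing the whole plan.

```go
package main

import (
	"errors"
	"fmt"
)

// getChart stands in for the provider's chart lookup; it fails while
// the repository has not been added yet (a hypothetical condition).
func getChart(repoReady bool) (string, error) {
	if !repoReady {
		return "", errors.New("chart not found in repo index")
	}
	return "1.0.2", nil
}

// diffVersion mirrors the tolerant behaviour: a lookup error at diff
// time skips the version diff rather than aborting plan/apply.
func diffVersion(repoReady bool) (version string, ok bool) {
	v, err := getChart(repoReady)
	if err != nil {
		return "", false // tolerate the error at diff time
	}
	return v, true
}

func main() {
	if _, ok := diffVersion(false); !ok {
		fmt.Println("repo not ready: version diff skipped, plan continues")
	}
	if v, ok := diffVersion(true); ok {
		fmt.Println("repo ready: planned chart version", v)
	}
}
```

The trade-off legal90 raises below follows directly from this: tolerating the error also hides genuinely broken repositories at diff time.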

@legal90 (Contributor) commented Feb 10, 2019

@stepanstipl Thank you for the explanation! Now I understand: your test case failed because the CustomizeDiff functions of all resources are called before any of those resources are created/updated.
I asked just because I ran into the opposite situation: my custom repo was unavailable at the network level, but terraform apply didn't show any diff or error when one was expected.

It seems that downloading the chart at the diff-evaluation stage is not a good idea, and we need to find a way to store everything needed in the Terraform state.
/cc @rporres

@stepanstipl (Author, Contributor) commented Feb 11, 2019

but the terraform apply didn't show any diff or error, when it was expected

@legal90 do you mean that even the apply step, with the helm_release resource trying to install a chart from an unavailable repo, did not fail? I would expect TF to fail there.

@legal90 (Contributor) commented Feb 11, 2019

@stepanstipl My case was pretty dumb: the repository URL was unavailable and the HTTP connection to it was timing out with a very long delay (~3-4 minutes), which just caused terraform apply to run for a very long time without throwing any error.
Though there was no diff expected in my case. You're right: if there were a diff, then terraform would eventually throw an error, because it downloads the chart again in resourceReleaseUpdate.
