
Provide optional rollback support for HelmReleases #2006

Merged: 7 commits into master from helm/1960-rollback on Jul 1, 2019

Conversation

hiddeco
Member

@hiddeco hiddeco commented May 2, 2019

Fixes #1960

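For context, a rough sketch (not the actual definition) of the opt-in rollback section this PR adds to the HelmRelease spec. Field names are inferred from the discussion below (rollback.enabled, rollback.recreate, and the DisableHooks/Wait flags passed to Helm); the authoritative definition lives in integrations/apis/flux.weave.works/v1beta1/types.go.

```go
// Illustrative only: approximate shape of the rollback settings discussed in
// this PR; exact names and types may differ from the real types.go.
type Rollback struct {
	Enabled      bool `json:"enabled,omitempty"`      // opt-in switch (rollback.enabled)
	Recreate     bool `json:"recreate,omitempty"`     // recreate pods during rollback
	DisableHooks bool `json:"disableHooks,omitempty"` // skip hooks during rollback
	Wait         bool `json:"wait,omitempty"`         // wait for resources to become ready
}
```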

@hiddeco
Member Author

hiddeco commented May 8, 2019

@squaremo @2opremio although this PR is still a draft, please take a look whenever you have time.

The reason it is still a draft is described in #1960 (comment); but due to Helm not providing us with tools to observe successful roll-outs, I see no easy way to work around the issue described in this comment. (Given the feature is opt-in, maybe we can live with it for a while?)

@2opremio
Contributor

2opremio commented May 9, 2019

@hiddeco Won't the rollback mechanism change based on how #1960 (comment) is resolved? If it can differ, then I think it's better to review it once we have a design.

Given the feature is opt-in, maybe we can live with it for a while?

Uhm, I am not sure. An infinite loop is scary.

@hiddeco hiddeco force-pushed the helm/1960-rollback branch 2 times, most recently from 2786e0f to 60946e7 on May 15, 2019 19:30
@hiddeco
Member Author

hiddeco commented May 15, 2019

@2opremio @squaremo took me some time to get the mechanics sorted, but here it is.

The operator now keeps track of which generation of a resource it has seen; this is done by updating the ObservedGeneration status after it has performed a non-deleting operation. It also adds a status condition for rollbacks, and keeps track of both condition change and update timestamps.

When a rollback status condition is set, there are three options (sketched in Go after this list):

  1. If the generation > observedGeneration, we attempt an upgrade as the values may have changed (the generation of a resource is not increased on status updates, this is by design and widely utilized in the k8s core).
  2. If the chart has been rolled back, but we have observed an update of the Helm chart source after this happened (LastUpdateTime), we attempt an upgrade as the contents of the chart source may have changed (or the failure is old).
  3. Otherwise, we bail out and log a warning to the user that we skipped the release.
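A minimal sketch of that decision, using stand-in types (the real fields live on the HelmRelease status in types.go; all names here are illustrative):

```go
package sketch

import "time"

// Stand-ins for the relevant HelmRelease fields; the real definitions live in
// integrations/apis/flux.weave.works/v1beta1/types.go.
type rollbackCondition struct {
	Present        bool      // a rollback status condition is set
	LastUpdateTime time.Time // when the condition was last updated
}

type releaseState struct {
	Generation         int64 // metadata.generation of the HelmRelease
	ObservedGeneration int64 // generation recorded after the last non-deleting operation
	RolledBack         rollbackCondition
	ChartSourceUpdated time.Time // last observed update of the Helm chart source
}

// shouldUpgrade mirrors the three options above.
func shouldUpgrade(s releaseState) bool {
	if !s.RolledBack.Present {
		return true // no rollback recorded, business as usual
	}
	// 1. The spec changed since we last acted on it, so the values may differ.
	if s.Generation > s.ObservedGeneration {
		return true
	}
	// 2. The chart source was updated after the rollback, so the chart contents may differ.
	if s.ChartSourceUpdated.After(s.RolledBack.LastUpdateTime) {
		return true
	}
	// 3. Nothing observable changed: skip the release and log a warning.
	return false
}
```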

I am still on the fence about two scenarios:

1. In case the Helm operator is restarted, it will attempt an upgrade and a rollback afterwards. I think this is because it gets added to the queue through the AddFunc event handler on the informer, which does not perform any state checks at the moment; I need to investigate...
2. If the rollback was due to the contents of any valuesFrom source, we are unable to determine whether this has been fixed, as changes in those sources do not result in a new resource generation, and we resolve and merge those values just in time for the upgrade we want to perform, which is much later in the process.

@2opremio
Contributor

2. If the rollback was due to the contents of any valuesFrom source, we are unable to determine whether this has been fixed, as changes in those sources do not result in a new resource generation, and we resolve and merge those values just in time for the upgrade we want to perform, which is much later in the process.

Can't we store the valuesFrom in the status?

@hiddeco
Member Author

hiddeco commented May 16, 2019

We could store a SHA hash of the merged values in a status field (someone on Slack suggested something like this too). But to be able to make a comparison and detect a change, this would require us to resolve and merge all values before we schedule an upgrade (in operator.go), which does not look like an optimal solution to me.
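A minimal sketch of that idea, assuming the merged values are available as a map[string]interface{} (the implementation that eventually landed in this PR may hash at a different point or in a different form):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// valuesChecksum returns a hex-encoded SHA256 over the merged chart values so
// it can be stored in the HelmRelease status and compared on later runs.
// encoding/json sorts map keys, which gives a deterministic byte representation.
func valuesChecksum(values map[string]interface{}) (string, error) {
	b, err := json.Marshal(values)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```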

fhr.Spec.Rollback.Recreate, fhr.Spec.Rollback.DisableHooks, fhr.Spec.Rollback.Wait)
if err != nil {
	chs.logger.Log("warning", "unable to rollback chart release", "resource", fhr.ResourceID().String(), "release", name, "err", err)
	chs.setCondition(fhr, fluxv1beta1.HelmReleaseRolledBack, v1.ConditionFalse, ReasonRollbackFailed, err.Error())
Contributor


Maybe, only maybe, create variables for the reason and message and call chs.setCondition only once?
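A sketch of that suggestion against the excerpt above; ReasonSuccessfulRollback and the success message are hypothetical placeholders, everything else follows the existing names:

```go
// Decide status, reason and message once, then set the condition in one place.
// ReasonSuccessfulRollback is a hypothetical counterpart to ReasonRollbackFailed.
status, reason, message := v1.ConditionTrue, ReasonSuccessfulRollback, "rollback succeeded"
if err != nil {
	chs.logger.Log("warning", "unable to rollback chart release",
		"resource", fhr.ResourceID().String(), "release", name, "err", err)
	status, reason, message = v1.ConditionFalse, ReasonRollbackFailed, err.Error()
}
chs.setCondition(fhr, fluxv1beta1.HelmReleaseRolledBack, status, reason, message)
```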

@2opremio
Contributor

Code looks good, but I don't know enough about the Helm Operator to rubber-stamp this.

I am missing tests, though (which I guess you planned to add later).

@hiddeco
Member Author

hiddeco commented May 16, 2019

I don't know enough about the Helm Operator to rubber-stamp this

It may be an idea to pick up some of the FIXME/TODOs in integrations/helm/chartsync/chartsync.go and integrations/helm/release/release.go in the (near) future, to prevent it from becoming heavily opinionated by my views and to spread some knowledge amongst maintainers.

The operator is not that complex and is quite small as a whole, so you should be able to fit it into your head rapidly. (I was able to do this, and I am not the fastest person on earth in Go code bases at the moment.)

@2opremio
Contributor

Thanks @hiddeco , I will take a look at those once I am done with the e2e test refactoring.

@hiddeco hiddeco force-pushed the helm/1960-rollback branch 3 times, most recently from 1b9234b to bc2098c on May 20, 2019 16:35
@hiddeco
Member Author

hiddeco commented May 20, 2019

@2opremio @squaremo it now detects changes to external values files; how it does this is up for debate.

I personally am not very happy with it: using the *ChartChangeSync in the enqueueUpdateJob method feels wrong (decision making should be fast), but given that we need to resolve all external files, this was the only entrypoint I could think of. As we only do this for HelmReleases with rollbacks enabled, the overhead should be minimal.
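Roughly the shape of that check, reusing the valuesChecksum helper sketched earlier and assuming a hypothetical values-resolving helper and a hypothetical ValuesChecksum status field (none of these names are taken from the actual PR code):

```go
// Inside the enqueue decision: only HelmReleases with rollbacks enabled pay
// the cost of resolving external values. Field and helper names are illustrative.
if fhr.Spec.Rollback.Enabled {
	values, err := chs.mergedValuesFor(fhr) // hypothetical: resolves valuesFrom sources and chart files
	if err == nil {
		if sum, err := valuesChecksum(values); err == nil && sum != fhr.Status.ValuesChecksum {
			// The external values changed since the rollback was recorded,
			// so a new upgrade attempt is justified.
			return true
		}
	}
}
```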

Member

@squaremo squaremo left a comment


I could really do with a state machine diagram to see what's going on!

integrations/apis/flux.weave.works/v1beta1/types.go (review comments, resolved)
integrations/helm/status/status.go (review comments, resolved)
@adrian

adrian commented Jun 26, 2019

I've been testing this branch and came across a problem with StatefulSets.

If an upgrade involves recreating the pods in a StatefulSet then Kubernetes will, as normal, start replacing the pods starting at the last ordinal, e.g.

acme-statefulset-0                 1/1     Running            0          5m26s
acme-statefulset-1                 1/1     Running            0          5m21s
acme-statefulset-2                 0/1     ContainerCreating   0          1s

If that pod fails to come up (e.g. the image is missing), the release is rolled back but the problem pod is left as-is, i.e.

acme-statefulset-0                 1/1     Running            0          5m26s
acme-statefulset-1                 1/1     Running            0          5m21s
acme-statefulset-2                 0/1     ImagePullBackOff   0          33s

The problem is addressed by including recreate: true in the HelmRelease, as all pods in the release are deleted as part of the rollback. However, this seems heavy-handed, as it means all pods in the StatefulSet and any other workloads created by the Helm chart are also deleted.

Is there a way to address this?

Could the --cleanup-on-fail flag in Helm v2.14.0 be a solution?

@hiddeco
Member Author

hiddeco commented Jun 26, 2019

@adrian given that StatefulSets are known to be problematic to roll back (as mentioned in the issue linked in this PR), and given that we mimic helm upgrade --atomic, I am afraid this has little to do with the Helm operator itself and more with Helm.

I am however interested in any solution you may come across which solves the problem for Helm, so I can include the required flag(s) and/or add the solution to our docs. My advice would be to see if you can make it work with helm upgrade --atomic <maybe --cleanup-on-fail> <or --recreate-pods> (or maybe a different strategy).

Oh, and almost forgot -- awesome you are testing this :-)


@adrian adrian left a comment


FWIW I've been testing the operator on this branch for the past few days and it's working fine. I've tested:

  • upgrades with rollback.enabled: false
  • upgrades with rollback.enabled: true and rollback.recreate: false
  • upgrades with rollback.enabled: true and rollback.recreate: true

@adrian

adrian commented Jun 28, 2019

In relation to my question earlier about dealing with failed pods in StatefulSets...

--cleanup-on-fail isn't an option as it only deletes net new pods, i.e. pods for new workloads rather than updated pods.

--recreate recreates all pods belonging to all workloads deployed by the helm chart. This works, but leads to service downtime as the pods are being recreated.

One option we've implemented (not in an operator) is to delete all broken StatefulSet pods after issuing the helm rollback command. Maybe that's something that the operator could do?
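Not something the operator itself does, but for illustration, a sketch of that out-of-operator workaround using client-go (signatures follow the 2019-era client-go releases, which did not take a context argument; the label selector identifying the chart's pods is left to the caller):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteBrokenPods deletes pods matching the selector that are not Ready, so
// the StatefulSet controller recreates them from the rolled-back revision.
func deleteBrokenPods(client kubernetes.Interface, namespace, selector string) error {
	pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if podReady(&pod) {
			continue
		}
		if err := client.CoreV1().Pods(namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// podReady reports whether the pod's Ready condition is true.
func podReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```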

@hiddeco
Member Author

hiddeco commented Jun 28, 2019

FWIW I've been testing the operator on this branch for the past few days and it's working fine.

Awesome, going to rebase and tidy things up and bother @squaremo with reviewing again so it lands in master.

One option we've implemented (not in an operator) is to delete all broken StatefulSet pods after issuing the helm rollback command. Maybe that's something that the operator could do?

I am not very keen on implementing 'hacks' like this in the operator; the reason is that it may work for you but break things for other people. Instead, and this is probably not to your liking, I would just note that it is likely to give issues with some StatefulSet setups, due to shortcomings in Helm.

Member

@squaremo squaremo left a comment


Code looks fine to me, the design (in the state diagram) looks right, and we have some empirical evidence that it works as designed. Let's release it into the wild!

To be able to calculate a checksum for the values from external sources
we need to be able to retrieve the chart path for a HelmRelease outside
of the ReconcileReleaseDef method, as the path is required for chart
file references.
Before this change the actual rollback status was not consulted in
case the status of the chart fetch was successful, which could lead
to the wrong state being reported back.
@hiddeco hiddeco merged commit 6077df2 into master Jul 1, 2019
@hiddeco hiddeco deleted the helm/1960-rollback branch July 1, 2019 14:28
@semyonslepov

@hiddeco this is awesome, thanks a lot for implementing this feature!
Would you mind also updating the docs on Helm integration: https://github.com/weaveworks/flux/blob/master/site/helm-integration.md ?

@hiddeco
Member Author

hiddeco commented Jul 5, 2019

@semyonslepov thanks for your enthusiasm and interest in this feature. I merged a PR yesterday which documents the fields added to the HelmRelease.

Lastly, if you do not want to wait for the next Helm operator release, there is a prerelease build available which also adds experimental support for running multiple workers processing your releases (--workers=<num>).


Successfully merging this pull request may close these issues.

Please provide "helm rollback" option in the HelmRelease