
Provide optional rollback support for HelmReleases #2006

Merged: 7 commits into master from helm/1960-rollback on Jul 1, 2019

Conversation

hiddeco
Member

@hiddeco hiddeco commented May 2, 2019

Fixes #1960

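For context, a rough sketch (not the actual definition) of the opt-in rollback section this PR adds to the HelmRelease spec. Field names are inferred from the discussion below (rollback.enabled, rollback.recreate, and the DisableHooks/Wait flags passed to Helm); the authoritative definition lives in integrations/apis/flux.weave.works/v1beta1/types.go.

```go
// Illustrative only: approximate shape of the rollback settings discussed in
// this PR; exact names and types may differ from the real types.go.
type Rollback struct {
	Enabled      bool `json:"enabled,omitempty"`      // opt-in switch (rollback.enabled)
	Recreate     bool `json:"recreate,omitempty"`     // recreate pods during rollback
	DisableHooks bool `json:"disableHooks,omitempty"` // skip hooks during rollback
	Wait         bool `json:"wait,omitempty"`         // wait for resources to become ready
}
```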

@hiddeco
Member Author

hiddeco commented May 8, 2019

@squaremo @2opremio although this PR is still a draft, please take a look whenever you have time.

The reason it is still a draft is described in #1960 (comment); but due to Helm not providing us with tools to observe successful roll-outs, I see no easy way to work around the issue described in this comment. (Given the feature is opt-in, maybe we can live with it for a while?)

@2opremio
Contributor

2opremio commented May 9, 2019

@hiddeco Won't the rollback mechanism change based on how #1960 (comment) is resolved? If it can differ, then I think it's better to review it once we have a design.

Given the feature is opt-in, maybe we can live with it for a while?

Uhm, I am not sure. An infinite loop is scary.

@hiddeco hiddeco force-pushed the helm/1960-rollback branch 2 times, most recently from 2786e0f to 60946e7 on May 15, 2019 19:30
@hiddeco
Member Author

hiddeco commented May 15, 2019

@2opremio @squaremo took me some time to get the mechanics sorted, but here it is.

The operator now keeps track of which generation of a resource it has seen; this is done by updating the ObservedGeneration status after it has performed a non-deleting operation. It also adds a status condition for rollbacks, and keeps track of both condition change and update timestamps.

When a rollback status condition is set, there are three options (sketched in Go after this list):

  1. If the generation > observedGeneration, we attempt an upgrade as the values may have changed (the generation of a resource is not increased on status updates, this is by design and widely utilized in the k8s core).
  2. If the chart has been rolled back, but we have observed an update of the Helm chart source after this happened (LastUpdateTime), we attempt an upgrade as the contents of the chart source may have changed (or the failure is old).
  3. Otherwise, we bail out and log a warning to the user that we skipped the release.
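A minimal sketch of that decision, using stand-in types (the real fields live on the HelmRelease status in types.go; all names here are illustrative):

```go
package sketch

import "time"

// Stand-ins for the relevant HelmRelease fields; the real definitions live in
// integrations/apis/flux.weave.works/v1beta1/types.go.
type rollbackCondition struct {
	Present        bool      // a rollback status condition is set
	LastUpdateTime time.Time // when the condition was last updated
}

type releaseState struct {
	Generation         int64 // metadata.generation of the HelmRelease
	ObservedGeneration int64 // generation recorded after the last non-deleting operation
	RolledBack         rollbackCondition
	ChartSourceUpdated time.Time // last observed update of the Helm chart source
}

// shouldUpgrade mirrors the three options above.
func shouldUpgrade(s releaseState) bool {
	if !s.RolledBack.Present {
		return true // no rollback recorded, business as usual
	}
	// 1. The spec changed since we last acted on it, so the values may differ.
	if s.Generation > s.ObservedGeneration {
		return true
	}
	// 2. The chart source was updated after the rollback, so the chart contents may differ.
	if s.ChartSourceUpdated.After(s.RolledBack.LastUpdateTime) {
		return true
	}
	// 3. Nothing observable changed: skip the release and log a warning.
	return false
}
```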

I am still on the fence about two scenarios:

1. In case the Helm operator is restarted, it will attempt an upgrade and a rollback afterwards. I think this is because it gets added to the queue through the AddFunc event handler on the informer, which does not perform any state checks at the moment; I need to investigate...
2. If the rollback was due to the contents of any valuesFrom source, we are unable to determine whether this has been fixed, as changes in those sources do not result in a new resource generation, and we resolve and merge those values just in time for the upgrade we want to perform, which is much later in the process.

@2opremio
Contributor

2. If the rollback was due to the contents of any valuesFrom source, we are unable to determine whether this has been fixed, as changes in those sources do not result in a new resource generation, and we resolve and merge those values just in time for the upgrade we want to perform, which is much later in the process.

Can't we store the valuesFrom in the status?

@hiddeco
Member Author

hiddeco commented May 16, 2019

We could store a SHA hash of the merged values in a status field (someone on Slack suggested something like this too). But to be able to make a comparison and detect a change, this would require us to resolve and merge all values before we schedule an upgrade (in operator.go), which does not look like an optimal solution to me.
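A minimal sketch of that idea, assuming the merged values are available as a map[string]interface{} (the implementation that eventually landed in this PR may hash at a different point or in a different form):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// valuesChecksum returns a hex-encoded SHA256 over the merged chart values so
// it can be stored in the HelmRelease status and compared on later runs.
// encoding/json sorts map keys, which gives a deterministic byte representation.
func valuesChecksum(values map[string]interface{}) (string, error) {
	b, err := json.Marshal(values)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```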

fhr.Spec.Rollback.Recreate, fhr.Spec.Rollback.DisableHooks, fhr.Spec.Rollback.Wait)
if err != nil {
	chs.logger.Log("warning", "unable to rollback chart release", "resource", fhr.ResourceID().String(), "release", name, "err", err)
	chs.setCondition(fhr, fluxv1beta1.HelmReleaseRolledBack, v1.ConditionFalse, ReasonRollbackFailed, err.Error())
Contributor


Maybe, only maybe, create variables for the reason and message and call chs.setCondition only once?
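A sketch of that suggestion against the excerpt above; ReasonSuccessfulRollback and the success message are hypothetical placeholders, everything else follows the existing names:

```go
// Decide status, reason and message once, then set the condition in one place.
// ReasonSuccessfulRollback is a hypothetical counterpart to ReasonRollbackFailed.
status, reason, message := v1.ConditionTrue, ReasonSuccessfulRollback, "rollback succeeded"
if err != nil {
	chs.logger.Log("warning", "unable to rollback chart release",
		"resource", fhr.ResourceID().String(), "release", name, "err", err)
	status, reason, message = v1.ConditionFalse, ReasonRollbackFailed, err.Error()
}
chs.setCondition(fhr, fluxv1beta1.HelmReleaseRolledBack, status, reason, message)
```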

@2opremio
Contributor

Code looks good, but I don't know enough about the Helm Operator to rubber-stamp this.

I am missing tests, though (which I guess you planned to add later).

@hiddeco
Member Author

hiddeco commented May 16, 2019

I don't know enough about the Helm Operator to rubber-stamp this

It may be an idea to pick up some of the FIXME/TODOs in integrations/helm/chartsync/chartsync.go and integrations/helm/release/release.go in the (near) future, to prevent it from becoming heavily opinionated by my views and to spread some knowledge amongst maintainers.

The operator is not that complex and is quite small as a whole, so you should be able to fit it into your head rapidly. (I was able to do this, and I am not the fastest person on earth in Go code bases at the moment.)

@2opremio
Contributor

Thanks @hiddeco , I will take a look at those once I am done with the e2e test refactoring.

@hiddeco hiddeco force-pushed the helm/1960-rollback branch 3 times, most recently from 1b9234b to bc2098c on May 20, 2019 16:35
@hiddeco
Member Author

hiddeco commented May 20, 2019

@2opremio @squaremo it now detects changes to external values files; how it does this is up for debate.

I personally am not very happy with it: using the *ChartChangeSync in the enqueueUpdateJob method feels wrong (decision making should be fast), but given that we need to resolve all external files, this was the only entrypoint I could think of. As we only do this for HelmReleases with rollbacks enabled, the overhead should be minimal.
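Roughly the shape of that check, reusing the valuesChecksum helper sketched earlier and assuming a hypothetical values-resolving helper and a hypothetical ValuesChecksum status field (none of these names are taken from the actual PR code):

```go
// Inside the enqueue decision: only HelmReleases with rollbacks enabled pay
// the cost of resolving external values. Field and helper names are illustrative.
if fhr.Spec.Rollback.Enabled {
	values, err := chs.mergedValuesFor(fhr) // hypothetical: resolves valuesFrom sources and chart files
	if err == nil {
		if sum, err := valuesChecksum(values); err == nil && sum != fhr.Status.ValuesChecksum {
			// The external values changed since the rollback was recorded,
			// so a new upgrade attempt is justified.
			return true
		}
	}
}
```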

Member

@squaremo squaremo left a comment


I could really do with a state machine diagram to see what's going on!

integrations/apis/flux.weave.works/v1beta1/types.go (review comments, resolved)
integrations/helm/status/status.go (review comments, resolved)
@adrian

adrian commented Jun 26, 2019

I've been testing this branch and came across a problem with StatefulSets.

If an upgrade involves recreating the pods in a StatefulSet then Kubernetes will, as normal, start replacing the pods starting at the last ordinal, e.g.

acme-statefulset-0                 1/1     Running            0          5m26s
acme-statefulset-1                 1/1     Running            0          5m21s
acme-statefulset-2                 0/1     ContainerCreating   0          1s

If that pod fails to come up (e.g. the image is missing), the release is rolled back but the problem pod is left as-is, i.e.

acme-statefulset-0                 1/1     Running            0          5m26s
acme-statefulset-1                 1/1     Running            0          5m21s
acme-statefulset-2                 0/1     ImagePullBackOff   0          33s

The problem is addressed by including recreate: true in the HelmRelease, as all pods in the release are deleted as part of the rollback. However, this seems heavy-handed, as it means all pods in the StatefulSet and any other workloads created by the Helm chart are also deleted.

Is there a way to address this?

Could the --cleanup-on-fail flag in Helm v2.14.0 be a solution?

@hiddeco
Member Author

hiddeco commented Jun 26, 2019

@adrian given that StatefulSets are known to be problematic to roll back (as mentioned in the issue linked in this PR), and given that we mimic helm upgrade --atomic, I am afraid this has little to do with the Helm operator itself and more with Helm.

I am however interested in any solution you may come across which solves the problem for Helm, so I can include the required flag(s) and/or add the solution to our docs. My advice would be to see if you can make it work with helm upgrade --atomic <maybe --cleanup-on-fail> <or --recreate-pods> (or maybe a different strategy).

Oh, and almost forgot -- awesome you are testing this :-)


@adrian adrian left a comment


FWIW I've been testing the operator on this branch for the past few days and it's working fine. I've tested:

  • upgrades with rollback.enabled: false
  • upgrades with rollback.enabled: true and rollback.recreate: false
  • upgrades with rollback.enabled: true and rollback.recreate: true

@adrian

adrian commented Jun 28, 2019

In relation to my question earlier about dealing with failed pods in StatefulSets...

--cleanup-on-fail isn't an option as it only deletes net new pods, i.e. pods for new workloads rather than updated pods.

--recreate recreates all pods belonging to all workloads deployed by the helm chart. This works, but leads to service downtime as the pods are being recreated.

One option we've implemented (not in an operator) is to delete all broken StatefulSet pods after issuing the helm rollback command. Maybe that's something that the operator could do?
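Not something the operator itself does, but for illustration, a sketch of that out-of-operator workaround using client-go (signatures follow the 2019-era client-go releases, which did not take a context argument; the label selector identifying the chart's pods is left to the caller):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteBrokenPods deletes pods matching the selector that are not Ready, so
// the StatefulSet controller recreates them from the rolled-back revision.
func deleteBrokenPods(client kubernetes.Interface, namespace, selector string) error {
	pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if podReady(&pod) {
			continue
		}
		if err := client.CoreV1().Pods(namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// podReady reports whether the pod's Ready condition is true.
func podReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}
```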

@hiddeco
Member Author

hiddeco commented Jun 28, 2019

FWIW I've been testing the operator on this branch for the past few days and it's working fine.

Awesome, going to rebase and tidy things up and bother @squaremo with reviewing again so it lands in master.

One option we've implemented (not in an operator) is to delete all broken StatefulSet pods after issuing the helm rollback command. Maybe that's something that the operator could do?

I am not very keen on implementing 'hacks' like this in the operator; the reason is that it may work for you but break things for other people. Instead, and this is probably not to your liking, I would just note that it is likely to give issues with some StatefulSet setups, due to shortcomings in Helm.

Member

@squaremo squaremo left a comment


Code looks fine to me, the design (in the state diagram) looks right, and we have some empirical evidence that it works as designed. Let's release it into the wild!

To be able to calculate a checksum for the values from external sources
we need to be able to retrieve the chart path for a HelmRelease outside
of the ReconcileReleaseDef method, as the path is required for chart
file references.
Before this change the actual rollback status was not consulted in
case the status of the chart fetch was successful, which could lead
to the wrong state being reported back.
@hiddeco hiddeco merged commit 6077df2 into master Jul 1, 2019
@hiddeco hiddeco deleted the helm/1960-rollback branch July 1, 2019 14:28
@semyonslepov

@hiddeco this is awesome, thanks a lot for implementing this feature!
Would you mind also updating the docs on Helm integration: https://github.com/weaveworks/flux/blob/master/site/helm-integration.md ?

@hiddeco
Member Author

hiddeco commented Jul 5, 2019

@semyonslepov thanks for your enthusiasm and interest in this feature. I merged a PR yesterday which documents the fields added to the HelmRelease.

Lastly, if you do not want to wait for the next Helm operator release, there is a prerelease build available which also adds experimental support for running multiple workers processing your releases (--workers=<num>).


Successfully merging this pull request may close these issues.

Please provide "helm rollback" option in the HelmRelease