Kubernetes provider: deployment-type server groups' rollback/resize doesn't work #1460

igcherkaev · 2017-03-03T01:25:18Z

Title

Resize and rollback don't work for server groups of type "deployment" on the Kubernetes provider.

Cloud Provider

Kubernetes. Not sure how to assign labels :)

Environment

Running a self-hosted Kubernetes cluster (CoreOS stable).

Feature Area

Clusters.

Description

Whenever I try to roll back to a previous version, a bunch of tasks are executed (successfully) and I end up in a weird state. In this case I wanted to roll back from active V005 to disabled V004.

If I try to resize the group, I end up with the same number of replicas. I can see Spinnaker kills or adds pods, but eventually it reverts back to what it was before.

Feels like API calls that deal with deployments have a bug.

If server group is not a deployment, it works just fine.

Steps to Reproduce

Create any server group with the Kubernetes provider as "Deployment". Set up a pipeline to deploy and leave a few replica sets disabled. Then try to roll back or resize.

Additional Details

Will provide upon request if needed and if I have it.

ethanfrogers · 2017-04-22T21:16:50Z

It looks like resize actually works as expected. I just tried with the latest version of Clouddriver and was able to resize a Deployment from 5 -> 2 -> 5 instances fine. Rollback still doesn't work as expected, however.

lwander · 2017-06-07T22:24:42Z

Hey Ethan - you mentioned wanting to take this on in Slack today.

A couple of thoughts:

1. Typically rollbacks are synthesized from a number of enable/disable ops, but here we likely want to do a Deployment style rollback directly.
2. We might be able to get away with having the default enable operation modify the currently selected replica set for a deployment. If that's the case, then 1. is not necessary.
3. Toggling labels on deployment objects may cause grief, so we need to be aware of that.

ethanfrogers · 2017-06-08T12:24:09Z

Thanks for the direction @lwander! I'm going to start digging into this today, but my Java is a little rusty so it may take a few days to get acclimated.

ethanfrogers · 2017-06-09T19:03:20Z

Alright, time for 2 days worth of discovery:

The majority of the issue is that a Rollback is just an Enable -> Resize -> Disable operation which I imagine works fine for ReplicaSets because they can co-exist. We run into some funky behavior when one of those ReplicaSets is managed by a Deployment. I believe it's the Resize operation that modifies the Deployment such that it doesn't match an existing ReplicaSet and a new one is created (obviously outside of Spinnaker's management).

With the help of @lwander we've discovered a potential solution but implementation hasn't played out yet.

The solution would be to re-target the Deployment so that Enable does all of the work by deferring to Deployment replication semantics. From there, Resize and Disable should both be no-ops, since the Deployment handles all of that in Kubernetes.

As far as implementation goes, there are 2 potential solutions that I see.

1. Modify the Deployment in the same way that DeployKubernetesAtomicOperation does when rolling out a new Server Group. I've tried implementing it this way but there are some small issues that still cause a new Replica Set to be created (imagePullSecrets aren't copied when you call credentials.apiAdaptor.getReplicaSet). This is probably a bit more doable short term once the kinks get worked out.
2. Actually use the Deployment Rollback API provided by the Kubernetes API Server. This can be tested with kubectl by doing kubectl rollout undo deployment/myapp --to-revision {n}. This could be done easily by selecting the Deployment revision of the Replica Set to be enabled and rolling the Deployment back to that revision. Unfortunately, the Fabric8 library doesn't support the sub resource that is needed for this, DeploymentRollback.

Link to open issue: fabric8io/kubernetes-client#404
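
For illustration, the "select the revision of the Replica Set and roll back to it" idea can be sketched outside of Clouddriver. This is only a rough sketch under some assumptions: the ReplicaSet JSON is already in hand (for example from kubectl get rs -o json), the Deployment controller has recorded the revision in the deployment.kubernetes.io/revision annotation, and the helper names revision_of and rollout_undo_command are hypothetical. It only builds the kubectl command and does not run it.

```python
from typing import List

# Annotation the Deployment controller sets on the ReplicaSets it manages.
REVISION_ANNOTATION = "deployment.kubernetes.io/revision"


def revision_of(replica_set: dict) -> int:
    """Return the Deployment revision recorded on a ReplicaSet (hypothetical helper)."""
    annotations = replica_set.get("metadata", {}).get("annotations") or {}
    if REVISION_ANNOTATION not in annotations:
        raise ValueError("ReplicaSet has no Deployment revision annotation")
    return int(annotations[REVISION_ANNOTATION])


def rollout_undo_command(deployment: str, replica_set: dict,
                         namespace: str = "default") -> List[str]:
    """Build (but do not run) the kubectl command that rolls the Deployment back."""
    return [
        "kubectl", "--namespace", namespace,
        "rollout", "undo", f"deployment/{deployment}",
        f"--to-revision={revision_of(replica_set)}",
    ]
```

The resulting list could be handed to subprocess.run against a test cluster; a ReplicaSet that is not managed by a Deployment raises instead of silently picking a revision.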

ethanfrogers · 2017-06-09T23:12:52Z

Putting this here as a reminder:

The order of operations of the Orca Rollback operations causes some conflicts in the implementation of the 1st option. Since Orca Enables the target Server Group and then captures the source Server Group size, it ends up capturing 0 since the Deployment has already done its work. This causes the Resize operation to resize the target's Replica Set to 0 and you end up with no Pods running.
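
To make the ordering problem concrete, here is a toy simulation. It is not Orca or Clouddriver code: the Deployment class is a simplified stand-in that keeps exactly one Replica Set scaled up, and the assumption that Enable re-targets the Deployment at the source's current size is mine.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSet:
    name: str
    replicas: int


class Deployment:
    """Toy model: a Deployment keeps exactly one ReplicaSet scaled up."""

    def __init__(self, replica_sets, active, size):
        self.replica_sets = {rs.name: rs for rs in replica_sets}
        self.retarget(active, size)

    def retarget(self, name, size):
        # The Deployment scales the chosen ReplicaSet up and every other one to 0.
        for rs in self.replica_sets.values():
            rs.replicas = size if rs.name == name else 0


def generic_rollback(deployment, source, target):
    """Enable -> capture source size -> Resize -> Disable, in that order."""
    # Enable: re-targeting the Deployment immediately scales the source down.
    size_before = deployment.replica_sets[source].replicas
    deployment.retarget(target, size_before)
    # Capture happens after Enable, so the source already reports 0.
    captured = deployment.replica_sets[source].replicas
    # Resize: the target is set to the captured (wrong) size.
    deployment.replica_sets[target].replicas = captured
    # Disable: the source is scaled to 0 (it already is).
    deployment.replica_sets[source].replicas = 0
    return captured


d = Deployment([ReplicaSet("v004", 0), ReplicaSet("v005", 0)], active="v005", size=5)
captured = generic_rollback(d, source="v005", target="v004")
total_pods = sum(rs.replicas for rs in d.replica_sets.values())
print(captured, total_pods)  # prints "0 0"
```

In this model, capturing the source size before Enable would avoid the problem, but that ordering lives in Orca's generic flow rather than in the provider.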

ethanfrogers · 2017-06-12T18:08:41Z

And we're back with more information:

After talking with @lwander we've decided that supporting this with the default Rollback implementation (enable -> resize -> disable) isn't going to be possible since Kubernetes Deployments have their own rollback semantics.

In this case, the rollback strategy really depends on the cloud provider, and a generic set of steps causes conflicts. Furthermore, the strategy depends on the specific type of object we're using (Deployment vs ReplicaSet), which is way too specific to do anything about in Orca.

The best solution we could come up with is to give Orca the option of calling a cloud provider specific Rollback API if the cloud provider has implemented one, and to defer to its default enable -> resize -> disable flow for the others. This would leave the existing behavior in place while allowing platforms with different rollback semantics to handle them as needed.
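
Roughly, the dispatch could look like the sketch below. None of these names exist in Orca or Clouddriver; Provider, plan_rollback and the step strings are all hypothetical and only illustrate "use the provider's rollback if it has one, otherwise fall back to the generic flow".

```python
from typing import Callable, List, Optional

# A provider-specific rollback takes (source, target) and returns the steps to run.
RollbackFn = Callable[[str, str], List[str]]


class Provider:
    """Hypothetical cloud provider description; not a real Spinnaker type."""

    def __init__(self, name: str, rollback: Optional[RollbackFn] = None):
        self.name = name
        self.rollback = rollback  # None means "no provider-specific rollback"


def plan_rollback(provider: Provider, source: str, target: str) -> List[str]:
    """Prefer the provider's own rollback; otherwise use the generic flow."""
    if provider.rollback is not None:
        return provider.rollback(source, target)
    return [
        f"enable {target}",
        f"resize {target} to match {source}",
        f"disable {source}",
    ]


def deployment_rollback(source: str, target: str) -> List[str]:
    # Stand-in for a Deployment-style rollback handled by Kubernetes itself.
    return [f"roll deployment back to the revision of {target}"]


generic = Provider("some-provider")
kubernetes = Provider("kubernetes", rollback=deployment_rollback)
print(plan_rollback(generic, "v005", "v004"))
print(plan_rollback(kubernetes, "v005", "v004"))
```

Providers that never register a rollback keep today's behavior unchanged, which is the main point of the proposal.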

Another piece of this would be to use a library that supports sub-resources like DeploymentRollback so that we can actually call a Kubernetes API to handle the operation without having to retarget the Deployment manually.

sbyrne13 · 2017-10-25T11:29:43Z

Hi guys,

Any more updates on this?

Thanks

ethanfrogers · 2017-10-25T12:09:10Z

@sbyrne13 not at this time, unfortunately. when we explored the issue a large blocker was the client library used by clouddriver. it didn't support the correct kubernetes objects to utilize the actual api for rolling back deployments. now that an official one is implemented it may be more doable but would require rollback specific apis in clouddriver instead of synthesizing them from enable/disable/resize. it's something i wanted to work on but haven't been able to get the time yet.

lwander · 2018-05-16T17:09:51Z

Unfortunately this is largely unfixable for the v1 provider as explained above. Luckily the v2 provider has no problems handling these operations on deployments.

lwander self-assigned this Mar 3, 2017

lwander added bug provider/kubernetes-v1 labels Mar 3, 2017

lwander closed this as completed May 16, 2018
