Kubernetes provider: deployment-type server groups' rollback/resize doesn't work #1460

Closed
igcherkaev opened this issue Mar 3, 2017 · 9 comments

Comments

@igcherkaev

igcherkaev commented Mar 3, 2017

Title

Resize and rollback don't work for server groups of type "deployment" on the Kubernetes provider.

Cloud Provider

Kubernetes. Not sure how to assign labels :)

Environment

Running a self-hosted Kubernetes cluster (CoreOS stable).

Feature Area

Clusters.

Description

Whenever I try to roll back to a previous version, a bunch of tasks are executed (successfully) and I end up in a weird state. In this case I wanted to roll back from the active V005 to the disabled V004.

[Screenshot: pasted_image_at_2017_03_02_07_05_pm]

If I try to resize the group, I end up with the same number of replicas. I can see Spinnaker kills or adds pods, but eventually it reverts back to what it was before.

Feels like API calls that deal with deployments have a bug.

If the server group is not a deployment, it works just fine.

Steps to Reproduce

Create any server group with the Kubernetes provider as "Deployment". Set up a pipeline to deploy and leave a few replica sets disabled. Then try to roll back or resize.

Additional Details

Will provide upon request if needed and if I have it.

@ethanfrogers
Contributor

It looks like resize actually works as expected. I just tried with the latest version of Clouddriver and was able to resize a Deployment from 5 -> 2 -> 5 instances fine. Rollback still doesn't work as expected, however.

@lwander
Member

lwander commented Jun 7, 2017

Hey Ethan - you mentioned wanting to take this on in Slack today.

A couple of thoughts:

  1. Typically rollbacks are synthesized from a number of enable/disable ops, but here we likely want to do a Deployment style rollback directly.
  2. We might be able to get away with having the default enable operation modify the currently selected replica set for a deployment. If that's the case, then 1. is not necessary.
  3. Toggling labels on deployment objects may cause grief, so we need to be aware of that.

@ethanfrogers
Contributor

Thanks for the direction @lwander! I'm going to start digging into this today, but my Java is a little rusty so it may take a few days to get acclimated.

@ethanfrogers
Contributor

ethanfrogers commented Jun 9, 2017

Alright, time for 2 days worth of discovery:

The majority of the issue is that a Rollback is just an Enable -> Resize -> Disable operation which I imagine works fine for ReplicaSets because they can co-exist. We run into some funky behavior when one of those ReplicaSets is managed by a Deployment. I believe it's the Resize operation that modifies the Deployment such that it doesn't match an existing ReplicaSet and a new one is created (obviously outside of Spinnaker's management).

With the help of @lwander we've discovered a potential solution but implementation hasn't played out yet.

The solution would be to re-target the Deployment so that Enable does all of the work by deferring to Deployment replication semantics. From there, Resize and Disable should both be no-ops, since the Deployment handles all of that in Kubernetes.

As far as implementation goes, there are 2 potential solutions that I see.

  1. Modify the Deployment in the same way that DeployKubernetesAtomicOperation does when rolling out a new Server Group. I've tried implementing it this way but there are some small issues that still cause a new Replica Set to be created (imagePullSecrets aren't copied when you call credentials.apiAdaptor.getReplicaSet). This is probably a bit more doable short term once the kinks get worked out.

  2. Actually use the Deployment Rollback API provided by the Kubernetes API Server. This can be tested with kubectl by doing kubectl rollout undo deployment/myapp --to-revision {n}. This could be done easily by selecting the Deployment revision of the Replica Set to be enabled and rolling the Deployment back to that revision. Unfortunately, the Fabric8 library doesn't support the sub resource that is needed for this, DeploymentRollback.

Link to open issue: fabric8io/kubernetes-client#404
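To make option 2 a bit more concrete, here is a rough sketch of how the revision could be picked off the Replica Set and turned into the kubectl command above. It assumes the Replica Set is available as a plain dict and that it carries the deployment.kubernetes.io/revision annotation that Deployments put on the Replica Sets they manage. The function name rollout_undo_command is made up for illustration and is not part of Clouddriver or any client library.

```python
# Annotation that a Deployment sets on each Replica Set it manages.
REVISION_ANNOTATION = "deployment.kubernetes.io/revision"


def rollout_undo_command(deployment_name, replica_set):
    """Build the kubectl arguments that roll a Deployment back to the
    revision recorded on the given Replica Set (hypothetical helper)."""
    metadata = replica_set.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    revision = annotations.get(REVISION_ANNOTATION)
    if revision is None:
        # Not managed by a Deployment, so there is no revision to target.
        raise ValueError("replica set has no Deployment revision annotation")
    return [
        "kubectl", "rollout", "undo",
        f"deployment/{deployment_name}",
        f"--to-revision={int(revision)}",
    ]


# Example input: the Replica Set we want to re-enable (names are illustrative).
example_rs = {
    "metadata": {
        "name": "myapp-v004",
        "annotations": {REVISION_ANNOTATION: "4"},
    }
}
example_command = rollout_undo_command("myapp", example_rs)
```

This only builds the command. Calling the rollback from Clouddriver itself would still need client support for the DeploymentRollback sub resource mentioned above.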

@ethanfrogers
Contributor

Putting this here as a reminder:

The order of operations in Orca's Rollback causes some conflicts in the implementation of the 1st option. Since Orca Enables the target Server Group and only then captures the source Server Group size, it ends up capturing 0 because the Deployment has already done its work. This causes the Resize operation to resize the target's Replica Set to 0, and you end up with no Pods running.
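A toy model of that ordering problem, for anyone who wants to see it play out. This is not Orca or Clouddriver code: ToyDeployment and both rollback functions are hypothetical, and the model assumes that re-targeting a Deployment immediately moves all pods to the new Replica Set and scales the old one to 0. The capture-first variant is there for contrast only, not as a proposed fix.

```python
class ToyDeployment:
    """Hypothetical stand-in for a Deployment that owns several replica sets."""

    def __init__(self, sizes, active):
        self.sizes = dict(sizes)  # replica set name -> pod count
        self.active = active      # name of the currently targeted replica set

    def retarget(self, name):
        # Assumption: the Deployment moves every pod to the new target
        # and scales the previously active replica set down to 0.
        if name == self.active:
            return
        self.sizes[name] = self.sizes[self.active]
        self.sizes[self.active] = 0
        self.active = name


def rollback_enable_first(dep, source, target):
    """Order described above: Enable, then capture the source size, then Resize."""
    dep.retarget(target)          # Enable: the Deployment has already done its work
    captured = dep.sizes[source]  # source is already at 0 here
    dep.sizes[target] = captured  # Resize wipes out the target's pods
    return captured


def rollback_capture_first(dep, source, target):
    """Contrast only: capture the source size before Enable."""
    captured = dep.sizes[source]
    dep.retarget(target)
    dep.sizes[target] = captured
    return captured


broken = ToyDeployment({"v004": 0, "v005": 3}, active="v005")
broken_captured = rollback_enable_first(broken, source="v005", target="v004")

contrast = ToyDeployment({"v004": 0, "v005": 3}, active="v005")
contrast_captured = rollback_capture_first(contrast, source="v005", target="v004")
```

In the enable-first run both replica sets finish at 0 pods, which matches the "no Pods running" outcome described above.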

@ethanfrogers
Contributor

And we're back with more information:

After talking with @lwander we've decided that supporting this with the default Rollback implementation (enable -> resize -> disable) isn't going to be possible since Kubernetes Deployments have their own rollback semantics.

In this case, the rollback strategy really depends on the cloud provider, and a generic set of steps causes conflicts. Furthermore, the strategy depends on the specific type of object we're using (Deployment vs ReplicaSet), which is way too specific to do anything about in Orca.

The best solution we could come up with is to give Orca the option of calling a cloud provider specific Rollback API if the cloud provider has implemented one, while deferring to its default enable -> resize -> disable flow for the others. This would leave the existing behavior in place while allowing platforms with different rollback semantics to handle them as needed.
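Roughly, the dispatch could look like the sketch below. Everything in it is hypothetical: the provider classes, the rollback method name and the step strings are made up to show the shape of the idea, and none of it reflects an existing Orca or Clouddriver interface.

```python
class ReplicaSetProvider:
    """Hypothetical provider with no rollback API of its own."""


class DeploymentProvider:
    """Hypothetical provider that implements its own rollback."""

    def rollback(self, source, target):
        # Stand-in for a call to a provider-specific Rollback API.
        return [f"native-rollback:{source}->{target}"]


def default_rollback_steps(source, target):
    # The existing generic flow: enable -> resize -> disable.
    return [f"enable:{target}", f"resize:{target}", f"disable:{source}"]


def plan_rollback(provider, source, target):
    """Use the provider's own rollback when it has one, else the default flow."""
    native = getattr(provider, "rollback", None)
    if callable(native):
        return native(source, target)
    return default_rollback_steps(source, target)


generic_plan = plan_rollback(ReplicaSetProvider(), "v005", "v004")
native_plan = plan_rollback(DeploymentProvider(), "v005", "v004")
```

Providers that do nothing special keep today's behavior, and only the ones that opt in get a different path.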

Another piece of this would be to use a library that supports sub-resources like DeploymentRollback so that we can actually call a Kubernetes API to handle the operation without having to retarget the Deployment manually.

@sbyrne13

Hi guys,

Any more updates on this?

Thanks

@ethanfrogers
Contributor

ethanfrogers commented Oct 25, 2017

@sbyrne13 not at this time, unfortunately. when we explored the issue a large blocker was the client library used by clouddriver. it didn't support the correct kubernetes objects to utilize the actual api for rolling back deployments. now that an official one is implemented it may be more doable but would require rollback specific apis in clouddriver instead of synthesizing them from enable/disable/resize. it's something i wanted to work on but haven't been able to get the time yet.

@lwander
Member

lwander commented May 16, 2018

Unfortunately this is largely unfixable for the v1 provider as explained above. Luckily the v2 provider has no problems handling these operations on deployments.

@lwander lwander closed this as completed May 16, 2018