Kubernetes provider: deployment-type server groups' rollback/resize doesn't work #1460
Comments
It looks like resize actually works as expected. I just tried with the latest version of Clouddriver and was able to resize a Deployment from 5 -> 2 -> 5 instances fine. Rollback still doesn't work as expected, however.
Hey Ethan - you mentioned wanting to take this on in Slack today. A couple of thoughts:
Thanks for the direction @lwander! I'm going to start digging into this today, but my Java is a little rusty, so it may take a few days to get acclimated.
Alright, time for 2 days' worth of discovery.

The majority of the issue is that a Rollback is just an Enable -> Resize -> Disable operation, which I imagine works fine for ReplicaSets because they can co-exist. We run into some funky behavior when one of those ReplicaSets is managed by a Deployment. I believe it's the Resize operation that modifies the Deployment such that it no longer matches an existing ReplicaSet, and a new one is created (obviously outside of Spinnaker's management).

With the help of @lwander we've discovered a potential solution, but implementation hasn't played out yet. The solution would basically be to re-target the Deployment so that Enable does all of the work by deferring to Deployment replication semantics. From there, Resize should be a no-op, as should Disable, since the Deployment handles all of that in Kubernetes. As far as implementation goes, there are 2 potential solutions that I see.
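To illustrate the mechanism described above, here is a minimal, purely hypothetical Python model (not Clouddriver or Kubernetes code) of why a Deployment spawns a new ReplicaSet when an operation touches its pod template: Kubernetes derives each ReplicaSet's identity from a hash of the Deployment's pod template, so any template change re-targets the Deployment to a fresh ReplicaSet.

```python
import hashlib

def template_hash(template: dict) -> str:
    # Kubernetes keys each ReplicaSet off a hash of the pod template;
    # any template change produces a new hash, hence a new ReplicaSet.
    return hashlib.sha1(repr(sorted(template.items())).encode()).hexdigest()[:10]

class Deployment:
    """Toy model of a Deployment and the ReplicaSets it manages."""
    def __init__(self, template, replicas):
        self.template = template
        self.replicas = replicas
        self.replica_sets = {}  # template hash -> replica count

    def reconcile(self):
        h = template_hash(self.template)
        if h not in self.replica_sets:
            # Template no longer matches any existing RS: create a new one.
            self.replica_sets[h] = 0
        for key in self.replica_sets:
            # Active RS gets the desired count; old ones are scaled to 0.
            self.replica_sets[key] = self.replicas if key == h else 0

dep = Deployment({"image": "app:v5"}, replicas=5)
dep.reconcile()
assert len(dep.replica_sets) == 1

# A "resize" that accidentally modifies the pod template (as suspected
# above) re-targets the Deployment and a second ReplicaSet appears,
# outside of Spinnaker's management.
dep.template = {"image": "app:v5", "touched": "by-resize"}
dep.reconcile()
assert len(dep.replica_sets) == 2
assert sum(dep.replica_sets.values()) == 5
```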
Link to open issue: fabric8io/kubernetes-client#404
Putting this here as a reminder: the order of operations of the Orca Rollback operations causes some conflicts in the implementation of the 1st option. Since Orca Enables the target Server Group and then captures the source Server Group size, it ends up capturing 0, since the Deployment has already done its work. This causes the Resize operation to resize the target's ReplicaSet to 0, and you end up with no Pods running.
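The ordering conflict above can be walked through with a small simulation. This is an illustrative sketch only (the names and functions are made up, not actual Orca/Clouddriver code), assuming the re-targeting approach where Enable defers to the Deployment:

```python
DEPLOYMENT_REPLICAS = 5

source = {"name": "app-v005", "replicas": DEPLOYMENT_REPLICAS}  # active group
target = {"name": "app-v004", "replicas": 0}                    # disabled group

def enable(group):
    # Under the re-targeting approach, Enable defers to the Deployment,
    # which immediately shifts every replica to the newly targeted RS
    # and scales the old one down.
    group["replicas"] = DEPLOYMENT_REPLICAS
    source["replicas"] = 0

def resize(group, size):
    group["replicas"] = size

# Orca's default rollback order:
enable(target)                   # 1. Enable target -> Deployment does its work
captured = source["replicas"]    # 2. Capture source size -> already 0!
resize(target, captured)         # 3. Resize target to the captured size

assert captured == 0
assert target["replicas"] == 0   # no Pods running, matching the conflict above
```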
And we're back with more information: after talking with @lwander, we've decided that supporting this with the default Rollback implementation isn't feasible. In this case, the rollback strategy really depends on the cloud provider, and a generic set of steps causes conflicts. Furthermore, the strategy depends on the specific type of object we're using. The best solution we could come up with is to give Orca the option of calling a cloud-provider-specific Rollback API if one has been implemented by the cloud provider, while deferring to its default implementation otherwise. Another piece of this would be to use a library that supports sub-resources.
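The proposed dispatch could look something like the following sketch. All class and function names here are hypothetical, just to show the shape of the design: Orca looks up a provider-specific rollback handler and falls back to the generic Enable -> Resize -> Disable sequence when none exists.

```python
class GenericRollback:
    """Default strategy: synthesize rollback from generic steps."""
    def rollback(self, target: str) -> str:
        return f"enable/resize/disable {target}"

class KubernetesDeploymentRollback(GenericRollback):
    """Provider-specific strategy: defer to Deployment semantics
    (e.g. a rollback sub-resource) instead of generic steps."""
    def rollback(self, target: str) -> str:
        return f"deployment-native rollback of {target}"

def get_rollback_handler(provider: str) -> GenericRollback:
    # Providers that implement their own Rollback API register here;
    # everyone else gets the default implementation.
    handlers = {"kubernetes": KubernetesDeploymentRollback()}
    return handlers.get(provider, GenericRollback())

assert get_rollback_handler("kubernetes").rollback("app-v004").startswith("deployment-native")
assert get_rollback_handler("aws").rollback("app-v004").startswith("enable")
```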
Hi guys, any more updates on this? Thanks
@sbyrne13 not at this time, unfortunately. When we explored the issue, a large blocker was the client library used by Clouddriver: it didn't support the right Kubernetes objects to use the actual API for rolling back Deployments. Now that an official one is implemented it may be more doable, but it would require rollback-specific APIs in Clouddriver instead of synthesizing them from enable/disable/resize. It's something I wanted to work on but haven't been able to find the time for yet.
Unfortunately this is largely unfixable for the v1 provider, as explained above. Luckily, the v2 provider has no problems handling these operations on Deployments.
Title
Resize and rollback don't work for server groups of type "deployment" on the Kubernetes provider.
Cloud Provider
Kubernetes. Not sure how to assign labels :)
Environment
Running a self-hosted Kubernetes cluster (CoreOS stable).
Feature Area
Clusters.
Description
Whenever I try to roll back to a previous version, a bunch of tasks are executed (successfully) and I end up in a weird state. In this case I wanted to roll back from the active V005 to the disabled V004.
If I try to resize the group, I end up with the same number of replicas. I can see Spinnaker kill or add pods, but eventually the group reverts to what it was before.
Feels like the API calls that deal with deployments have a bug.
If the server group is not a deployment, it works just fine.
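The reverting behavior described above is consistent with how a Deployment controller reconciles its ReplicaSet. A minimal hypothetical model (illustrative only, not Kubernetes code): direct edits to the ReplicaSet's replica count are undone because the controller continuously restores the count declared on the Deployment.

```python
deployment = {"replicas": 5}   # desired state declared on the Deployment
replica_set = {"replicas": 5}  # the RS the Deployment manages

def reconcile():
    # The Deployment controller loop: force the RS back to the
    # Deployment's declared replica count.
    replica_set["replicas"] = deployment["replicas"]

# A resize applied directly to the ReplicaSet (bypassing the Deployment)...
replica_set["replicas"] = 2

# ...is undone on the controller's next reconcile pass, matching the
# "kills or adds pods, then reverts" symptom reported above.
reconcile()
assert replica_set["replicas"] == 5
```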
Steps to Reproduce
Create any server group with the Kubernetes provider as a "Deployment". Set up a pipeline to deploy and leave a few replica sets disabled. Then try to roll back or resize.
Additional Details
Will provide upon request if needed and if I have it.