Canary Deployments with Gloo Federation #6127

Open
guydc opened this issue Mar 21, 2022 · 9 comments
Assignees
Labels
Area: Gloo Fed Issues related to the Gloo Fed project in solo-projects Size: L 1 - 2 weeks stale Issues that are stale. These will not be prioritized without further engagement on the issue. Type: Enhancement New feature or request

Comments

@guydc

guydc commented Mar 21, 2022

Version

1.11.*

Is your feature request related to a problem? Please describe.

Gloo Edge supports in-place canary deployment: multiple control planes can reconcile the same CRs and produce XDS for two distinct data planes.

With Gloo Federation, it should be possible to perform a blue-green deployment that poses no upgrade risk to existing clusters. Furthermore, Gloo Federation itself should support a blue-green deployment model, where a new federation version can be tested before it assumes control over existing clusters.

Describe the solution you'd like

This can be achieved by deploying an additional gloo-fed instance and creating new edge clusters with the latest gloo-edge version. Traffic is then gradually shifted from the old clusters to the new ones. The same canary deployment concepts can be applied to Gloo Federation:

  • When upgrading a federated environment, two gloo-fed instances can co-exist in the federation cluster and reconcile the same CRs without collisions
  • Each gloo-fed instance is responsible for federating resources on GlooInstances with a matching gloo-edge version
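As a rough illustration of the second point, the sketch below shows how two co-existing gloo-fed instances could split ownership of registered instances by version. It is not Gloo Fed code: the `GlooInstance` dataclass, its fields, and the `owned_instances` helper are hypothetical, and it assumes that "matching version" means the same major.minor line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlooInstance:
    """Hypothetical stand-in for a registered Gloo Edge installation."""
    name: str
    cluster: str
    edge_version: str  # e.g. "1.11.3"


def minor_line(version: str) -> tuple[int, int]:
    """Return (major, minor) for a version such as "1.11.3" or "v1.11.3"."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def owned_instances(fed_version: str, instances: list[GlooInstance]) -> list[GlooInstance]:
    """Select the instances a gloo-fed of the given version would federate.

    Assumption: ownership is decided by an equal major.minor line.
    """
    line = minor_line(fed_version)
    return [inst for inst in instances if minor_line(inst.edge_version) == line]


instances = [
    GlooInstance("edge-a", "cluster-old", "1.11.3"),
    GlooInstance("edge-b", "cluster-new", "1.12.0"),
]

old_fed = owned_instances("1.11.7", instances)  # instances owned by the old gloo-fed
new_fed = owned_instances("1.12.0", instances)  # instances owned by the canary gloo-fed
```

Because every instance belongs to exactly one version line, the two selections are disjoint, which is what would let both gloo-fed instances reconcile the same CRs without colliding.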

Describe alternatives you've considered

No response

Additional Context

No response

@guydc guydc added the Type: Enhancement New feature or request label Mar 21, 2022
@chrisgaun

Need estimate or alternatives.

@chrisgaun chrisgaun added this to the April 15th milestone Mar 29, 2022
@chrisgaun

Need to understand level of effort on this one @sam-heilbron

@chrisgaun chrisgaun removed this from the Midterm milestone Apr 5, 2022
@chrisgaun chrisgaun added this to the Midterm milestone Apr 12, 2022
@sam-heilbron sam-heilbron added the Size: L 1 - 2 weeks label Apr 15, 2022
@sam-heilbron sam-heilbron removed their assignment Apr 15, 2022
@rinormaloku
Contributor

As an alternative, I tested whether we can run two Gloo Federation instances at once, with the second instance running in the opposite cluster (where, again, all clusters need to be registered and all resources deployed). I didn't like the UX of this alternative, hence it is crossed out.

But what I would like to circle back to is: How important is it to deploy gloo fed using the canary pattern?

Gloo Federation reads the Gloo Edge instances running in the clusters, picks up the configuration applied by the user, and applies it in the clusters so that cross-cluster traffic is possible and failover works.

From then on, there are no ongoing changes that Gloo Federation needs to reconcile.
If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

@guydc
Author

guydc commented Apr 25, 2022

> How important is it to deploy gloo fed using the canary pattern?

Gloo Fed is a privileged component that controls configuration for multiple edge control planes. I think that the blast radius from a malfunctioning new version can be significant. For example, consider a bug in the orphan termination functionality that erases configuration from all federated clusters, leading to a complete system outage.

There are also inherent compatibility risks when following canary deployment practices for the edge control and data planes in a federated environment. Gloo Fed CRDs and clients may be incompatible with edges that are still running an older version. AFAIK, k8s CRD versioning practices are not applied, breaking changes occur from time to time, and downgrading is difficult in Gloo Edge:

IMHO, the safest way to upgrade a federated environment is:

  • spin up a new federation cluster
  • spin up and register new edge clusters
  • apply federated state
  • gradually steer traffic towards the new environment, while keeping the old environment live and up-to-date.
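The last step can be sketched as a schedule of weight pairs that moves traffic in fixed increments while the old environment stays live. This is only an illustration: `shift_schedule` and `step_percent` are hypothetical names, and how the weights are actually applied (DNS, a global load balancer, or something else) is not specified in this thread.

```python
def shift_schedule(step_percent: int) -> list[tuple[int, int]]:
    """Return (old_weight, new_weight) pairs that move traffic in fixed steps.

    The schedule starts with all traffic on the old environment and always
    ends with all traffic on the new one, even if the step does not divide 100.
    """
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in the range 1..100")
    new_weights = list(range(0, 100, step_percent)) + [100]
    return [(100 - new, new) for new in new_weights]


for old_weight, new_weight in shift_schedule(25):
    # A real rollout would verify the health of the new environment here
    # and fall back to the previous pair on failure.
    print(f"old={old_weight}% new={new_weight}%")
```

Since every pair sums to 100 and the old environment is never torn down, rolling back amounts to reapplying an earlier pair.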

This scheme is not always feasible, especially when the federation clusters require state synchronization. The next best thing would be to support an in-cluster gloo fed canary deployment.

These solutions would only work if Federated CRDs are properly versioned and deprecated.

> If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

If Gloo Fed is down:

  • A canary deployment process that spins up new edge clusters will fail, as new edges are not federated.
  • DR for failed edges is impossible
  • Service degrades as the system enters a "read only" state

> Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

It's not always possible to have a pre-prod environment that completely simulates production.

@rinormaloku
Contributor

> It's not always possible to have a pre-prod environment that completely simulates production.

That is an issue.

> If Gloo Fed is down:

New edges and DR: those are very rare cases with a low likelihood of occurring, unless the feature is used in a way that I haven't seen up to now.

>   • Service degrades as the system enters a "read-only" state

The third issue is the most likely to occur, but its impact is negligible. The implementation of Gloo Edge is purposefully different from Istio: Gloo Edge doesn't configure the gateway proxy with endpoints (IP addresses for every pod; a luxury that a service mesh cannot afford as it would cause excessive load on the DNS proxy).

Summary: Gloo Fed only makes changes when you apply Gloo Fed CRDs or change the LoadBalancer service in one of the Gloo instances. (Those changes are infrequent and, at the very least, shouldn't be made while you are updating Gloo Fed.)

Without pre-prod environments, though, there is no alternative but to adopt some canary deployment approach to reduce the risk.

@chrisgaun

Can limit the scope to having Gloo Fed backwards compatible with GE.

@guydc
Author

guydc commented Jun 21, 2022

> Can limit the scope to having Gloo Fed backwards compatible with GE.

Right. For example, the Gloo Mesh Control Plane is compatible with n-1 version relay agents to support rolling upgrade scenarios. Ideally, Gloo Fed should have similar compatibility with Gloo Edge.

Otherwise, some form of protection is required to ensure that the state of n-1 GEs is not corrupted and that GF doesn't run into global failures due to an unexpected GE version under federation.
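One possible shape for such protection is a version-skew check that runs before GF touches an instance. The sketch below is an assumption-laden illustration, not existing Gloo Fed behavior: `is_supported_edge`, `partition_by_support`, and the `skew` parameter are hypothetical, and "n-1" is interpreted here as "same major version, at most one minor version behind".

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) for a version such as "1.11.3" or "v1.11.3"."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def is_supported_edge(fed_version: str, edge_version: str, skew: int = 1) -> bool:
    """True if the edge has the same major version and is at most `skew` minors behind."""
    fed_major, fed_minor = parse_minor(fed_version)
    edge_major, edge_minor = parse_minor(edge_version)
    return fed_major == edge_major and 0 <= fed_minor - edge_minor <= skew


def partition_by_support(fed_version: str, edge_versions: dict[str, str]):
    """Split {instance name: version} into (federate, skip).

    Instances in `skip` are left untouched rather than reconciled, so an
    unexpected version cannot corrupt their state or fail the whole sync.
    """
    federate, skip = {}, {}
    for name, version in edge_versions.items():
        target = federate if is_supported_edge(fed_version, version) else skip
        target[name] = version
    return federate, skip


federate, skip = partition_by_support(
    "1.12.0",
    {"edge-a": "1.11.3", "edge-b": "1.12.1", "edge-c": "1.10.9", "edge-d": "1.13.0"},
)
```

Skipping unsupported instances, instead of raising an error for the whole reconcile loop, keeps a single out-of-range GE from turning into a global GF failure.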

@chrisgaun chrisgaun modified the milestones: Midterm, Q3 Milestone Jun 29, 2022
@jenshu jenshu self-assigned this Sep 9, 2022
@jenshu
Contributor

jenshu commented Sep 26, 2022

breakdown of tasks (not necessarily in order):

@jenshu jenshu modified the milestones: Q3 Milestone, Fed canary Oct 11, 2022
@kcbabo kcbabo modified the milestones: Fed canary, Q4 Deliverables Oct 14, 2022
@sam-heilbron sam-heilbron self-assigned this Nov 6, 2022
@jenshu jenshu added the Area: Gloo Fed Issues related to the Gloo Fed project in solo-projects label Mar 21, 2023

github-actions bot commented Jun 2, 2024

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

@github-actions github-actions bot added the stale Issues that are stale. These will not be prioritized without further engagement on the issue. label Jun 2, 2024