Canary Deployments with Gloo Federation #6127

guydc · 2022-03-21T12:21:01Z

Version

1.11.*

Is your feature request related to a problem? Please describe.

Gloo Edge supports in-place canary deployment: multiple control planes can reconcile the same CRs and produce XDS for two distinct data planes.

With Gloo Federation, it should be possible to perform a blue-green deployment that does not create any upgrade risk to existing clusters. Furthermore, Gloo Federation itself should support a blue-green deployment model, where a new federation version can be tested before it assumes control over existing clusters.

Describe the solution you'd like

This can be achieved by deploying an additional gloo-fed instance and creating new edge clusters with the latest gloo-edge version. Traffic is then gradually shifted from the old clusters to the new ones. The canary deployment concepts can be applied to Gloo Federation:

When upgrading a federated environment, two gloo-fed instances can co-exist in the federation cluster and reconcile the same CRs without collisions
Each gloo-fed instance is responsible for federating resources on GlooInstances with a matching gloo-edge version
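
The version-scoped ownership described above could be sketched roughly as follows. This is only an illustration, not Gloo Fed's actual logic: the `GlooInstance` dataclass, its `edge_version` field, and the `owns` and `partition` functions are hypothetical names, and matching on major.minor is an assumption about what "a matching gloo-edge version" would mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlooInstance:
    """Hypothetical stand-in for a registered Gloo Edge instance."""
    name: str
    cluster: str
    edge_version: str  # e.g. "1.11.3"


def minor_version(version: str) -> tuple[int, int]:
    """Return (major, minor) of a 'major.minor.patch' version string."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def owns(fed_version: str, instance: GlooInstance) -> bool:
    """Assumed rule: a gloo-fed instance federates only to a matching major.minor."""
    return minor_version(fed_version) == minor_version(instance.edge_version)


def partition(fed_versions, instances):
    """Map each gloo-fed version to the names of the instances it would own."""
    return {
        fed: [i.name for i in instances if owns(fed, i)]
        for fed in fed_versions
    }


instances = [
    GlooInstance("edge-a", "cluster-1", "1.11.3"),
    GlooInstance("edge-b", "cluster-2", "1.11.7"),
    GlooInstance("edge-c", "cluster-3", "1.12.0"),
]
ownership = partition(["1.11.9", "1.12.0"], instances)
print(ownership)
```

Because every GlooInstance matches at most one gloo-fed version under this rule, the two instances never write to the same edge, which is the "without collisions" property the request asks for.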

Describe alternatives you've considered

No response

Additional Context

No response

chrisgaun · 2022-03-29T13:21:37Z

Need estimate or alternatives.

chrisgaun · 2022-03-29T13:36:13Z

Need to understand level of effort on this one @sam-heilbron

rinormaloku · 2022-04-21T14:53:09Z

As an alternative, I tested whether we can run two Gloo Federation instances at once, with the second instance running in the opposite cluster (where, again, all clusters need to be registered and all resources deployed). I didn't like the UX of this alternative, hence it is crossed out.

But what I would like to circle back to is: How important is it to deploy gloo fed using the canary pattern?

Gloo Federation reads the Gloo Edge instances running in the clusters, picks up some configuration applied by the user, and makes the configuration in the clusters so that cross-cluster traffic is possible and failover works.

From then on there aren't ongoing changes that Gloo Federation needs to reconcile.
If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

guydc · 2022-04-25T07:26:06Z

How important is it to deploy gloo fed using the canary pattern?

Gloo Fed is a privileged component that controls configuration for multiple edge control planes. I think that the blast radius from a malfunctioning new version can be significant. For example, consider a bug in the orphan termination functionality that erases configuration from all federated clusters, leading to a complete system outage.

There are also inherent compatibility risks when following canary deployment practices for the edge control and data planes in a federated environment. Gloo Fed CRDs and clients may be incompatible with edges that are still running an older version. AFAIK, k8s CRD versioning practices are not applied, breaking changes occur from time to time, and downgrading is difficult in Gloo Edge.

IMHO, the safest way to upgrade a federated environment is:

spin up a new federation cluster
spin up and register new edge clusters
apply federated state
gradually steer traffic towards the new environment, while keeping the old environment live and up-to-date.
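
The last step above could be sketched as a simple weight controller. The thread does not say how traffic is steered (DNS weights, a global load balancer, or something else), so the sketch only computes the percentage split; `next_split`, the 25-point step, and the health signal are hypothetical. Because the old environment stays live, the assumed reaction to a failed check is to send all traffic back to it.

```python
def next_split(new_weight: int, step: int, healthy: bool) -> tuple[int, int]:
    """Return the next (old, new) percentage split.

    Advances the new environment by `step` points when the last check was
    healthy; otherwise sends all traffic back to the old environment.
    """
    if not healthy:
        return 100, 0
    new_weight = min(100, new_weight + step)
    return 100 - new_weight, new_weight


# Simulated health-check results for four consecutive rollout steps.
checks = [True, True, False, True]
history = []
new = 0
for healthy in checks:
    old, new = next_split(new, step=25, healthy=healthy)
    history.append((old, new))
    print(f"old environment: {old}%  new environment: {new}%")
```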

This scheme is not always feasible, especially when the federation clusters require state synchronization. The next best thing would be to support an in-cluster gloo fed canary deployment.

These solutions would only work if Federated CRDs are properly versioned and deprecated.

If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

If Gloo Fed is down:

A canary deployment process that spins up new edge clusters will fail, as new edges are not federated.
DR for failed edges is impossible
Service degrades as the system enters a "read only" state

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

It's not always possible to have a pre-prod environment that completely simulates production.

rinormaloku · 2022-04-25T09:05:34Z

It's not always possible to have a pre-prod environment that completely simulates production.

That is an issue.

If Gloo Fed is down:
New edges & DR -- (those are very rare cases, with low likelihood to occur, unless the feature is used in a way that I haven't seen up to now)

Service degrades as the system enters a "read-only" state

The third issue is the most likely to occur, but its impact is completely negligible. The implementation of Gloo Edge is purposefully different from Istio: Gloo Edge doesn't configure the gateway proxy with endpoints (IP addresses for every pod; a luxury that a service mesh cannot afford as it would cause excessive load on the DNS proxy).

Summary: Gloo Fed will only make tweaks when you apply Gloo Fed CRDs, or if you change the LoadBalancer service in one of the Gloo instances. (Those changes are not frequent, and at least shouldn't be made while you are updating Gloo Fed.)

Though without pre-prod environments, there is no alternative but to have some canary deployment approach to reduce the risk.

chrisgaun · 2022-05-17T13:21:51Z

Can limit the scope to having Gloo Fed backwards compatible with GE.

guydc · 2022-06-21T07:15:50Z

Can limit the scope to having Gloo Fed backwards compatible with GE.

Right. For example, the Gloo Mesh Control Plane is compatible with n-1 version relay agents to support rolling upgrade scenarios. Ideally, Gloo Fed should have similar compatibility with Gloo Edge.

Otherwise, some form of protection is required to ensure that the state of n-1 GEs is not corrupted and that GF doesn't run into global failures due to an unexpected GE version under federation.
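
One way to picture such protection is a version guard that runs before Gloo Fed writes to an edge. This is a sketch under stated assumptions, not an existing Gloo Fed feature: `is_supported` and `select_targets` are hypothetical, and "n-1" is assumed to mean the same major version with a minor version equal to, or one behind, Gloo Fed's.

```python
def parse(version: str) -> tuple[int, int]:
    """Return (major, minor) of a 'major.minor.patch' version string."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def is_supported(fed_version: str, edge_version: str) -> bool:
    """Assumed n-1 rule: same major, edge minor equal to or one behind fed minor."""
    fed_major, fed_minor = parse(fed_version)
    edge_major, edge_minor = parse(edge_version)
    return fed_major == edge_major and 0 <= fed_minor - edge_minor <= 1


def select_targets(fed_version: str, edges: dict[str, str]):
    """Split edges into those safe to write to and those to leave untouched."""
    supported, skipped = [], []
    for name, version in edges.items():
        (supported if is_supported(fed_version, version) else skipped).append(name)
    return supported, skipped


edges = {
    "edge-a": "1.12.4",  # same minor
    "edge-b": "1.11.9",  # n-1
    "edge-c": "1.10.2",  # too old
    "edge-d": "1.13.0",  # newer than fed
}
supported, skipped = select_targets("1.12.0", edges)
print("federate to:", supported)
print("leave untouched:", skipped)
```

Skipping an unsupported edge, rather than raising an error, keeps one unexpected GE version from turning into a failure for every federated cluster.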

jenshu · 2022-09-26T07:52:50Z

breakdown of tasks (not necessarily in order):

make sure Gloo Fed CRDs are backwards-compatible: Fed canary upgrades: make sure CRDs are backwards compatible #7234
Gloo Fed controllers should ignore unknown fields when unmarshalling: Fed canary upgrades: ignore unknown fields #7235
Gloo Fed helm chart should install separate RBAC resources for each Gloo Fed install namespace: Fed canary upgrades: separate RBAC resources per Gloo Fed install #7236
Gloo Fed's federated resource statuses should be namespaced by Gloo Fed install namespace: Fed canary upgrades: namespaced federated resource statuses #7237
make sure status marshalling/unmarshalling is both forwards and backwards compatible: Fed canary upgrades: status should be forwards and backwards compatible #7238
push any Gloo Fed API updates to solo-apis: Fed canary upgrades: update solo-apis #7239
Docs: add a page about Gloo Fed canary upgrades: Fed canary upgrades: docs #7240
glooctl updates (if needed): Fed canary upgrades: glooctl #7241
UI updates (if needed): Fed canary upgrades: UI #7242
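
Two of the tasks above, ignoring unknown fields and namespacing statuses by install namespace, can be illustrated together. This is a rough sketch, not the Gloo Fed implementation (which is written in Go): the `FederatedStatus` shape, the state names, the `namespacedStatuses` key, and the `gloo-fed-blue` / `gloo-fed-green` namespaces are all hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class FederatedStatus:
    """Hypothetical status shape; the real Gloo Fed status fields may differ."""
    state: str = "PENDING"
    message: str = ""


def decode_status(raw: dict) -> FederatedStatus:
    """Build a status from a dict, dropping keys this version does not know."""
    known = {f.name for f in fields(FederatedStatus)}
    return FederatedStatus(**{k: v for k, v in raw.items() if k in known})


def write_status(resource: dict, fed_namespace: str, status: FederatedStatus) -> None:
    """Store the status under the writer's install namespace, leaving others intact."""
    statuses = resource.setdefault("status", {}).setdefault("namespacedStatuses", {})
    statuses[fed_namespace] = {"state": status.state, "message": status.message}


# A status written by a newer version, carrying a field the older one lacks.
newer = json.loads('{"state": "PLACED", "message": "", "observedGeneration": 3}')
decoded = decode_status(newer)

resource = {"metadata": {"name": "fed-upstream"}}
write_status(resource, "gloo-fed-blue", FederatedStatus("PLACED"))
write_status(resource, "gloo-fed-green", decoded)
write_status(resource, "gloo-fed-blue", FederatedStatus("FAILED", "placement error"))
print(json.dumps(resource["status"], indent=2))
```

Keying the status by install namespace lets two Gloo Fed installs report on the same resource without overwriting each other, and dropping unknown keys lets the older install read what the newer one wrote.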

github-actions · 2024-06-02T10:10:28Z

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

guydc added the Type: Enhancement New feature or request label Mar 21, 2022

chrisgaun added this to the April 15th milestone Mar 29, 2022

chrisgaun removed this from the Midterm milestone Apr 5, 2022

chrisgaun assigned sam-heilbron Apr 12, 2022

chrisgaun added this to the Midterm milestone Apr 12, 2022

sam-heilbron added the Size: L 1 - 2 weeks label Apr 15, 2022

sam-heilbron removed their assignment Apr 15, 2022

chrisgaun modified the milestones: Midterm, Q3 Milestone Jun 29, 2022

jenshu self-assigned this Sep 9, 2022

jenshu modified the milestones: Q3 Milestone, Fed canary Oct 11, 2022

kcbabo modified the milestones: Fed canary, Q4 Deliverables Oct 14, 2022

sam-heilbron self-assigned this Nov 6, 2022

jenshu added the Area: Gloo Fed Issues related to the Gloo Fed project in solo-projects label Mar 21, 2023

github-actions bot added the stale Issues that are stale. These will not be prioritized without further engagement on the issue. label Jun 2, 2024
