[RFC] Support Multiple Entity Versions within a Lattice #363

Closed

autodidaddict opened this issue Jun 22, 2023 · 3 comments

autodidaddict commented Jun 22, 2023

Entity Versioning RFC

This request for comments seeks feedback on potential solutions to the multi-layered problem of supporting multiple versions of the same entity in the same environment.

⚠️ Out of Scope

Note that the following scenarios are explicitly out of scope for this RFC and should not be considered when evaluating the options below.

  • Rolling updates - wadm should be able to provide this functionality, converting individual entities from one version to another without message loss.
  • Static update - the single-entity case of a rolling update, where one entity is converted in place. All current wasmCloud hosts should support the ability to perform a "live update".

Behavior Expectations for Modern Clusters

The following is an outline of the different types of behavior that developers are likely to expect, based on exposure to Kubernetes, Nomad, Mesos, and Docker, and on conversations with a large number of developers working in this space.

This section deliberately tries to avoid discussing specific technology implementations and designs so that it can focus solely on high-level user expectations and patterns.

Blue/Green Deployment

A blue/green deployment refers to two completely isolated but parallel environments. Typically the blue refers to what's running in production while green is a hot staging environment. For example, we might have v1 of our application in blue and v2 of the application in green.

Put another way, end users are only ever able to communicate with one environment (e.g. production/blue) at a time. Live customers cannot access the pending/green environment.

Also note that the choice of color is arbitrary here; the important aspect is the swap, not the colors.

We can perform tests and other verification against the green environment to make sure that everything is as we expect and all the acceptance tests pass. When we're ready to deploy the new version, we swap the "live" route to green.

Developers should expect to utilize some tooling to switch the live environment from one to the other. We also typically expect zero downtime during this switch, though having short gaps (measured in seconds or minutes) is also not unheard of here. As mentioned above, this is not related to rolling upgrades.

A/B Deployment

A/B deployments involve end users accessing two different versions of the same thing. These two versions could be in the same logical environment or they could be separate. This just depends on the tools and networking being employed.

An A/B deployment (also called A/B testing) is designed to gather live feedback and data on one version while still maintaining another. In some cases developers may want to route 50% of end user traffic to the "A" version while routing the other 50% to the "B".

Once developers are satisfied with the results of running both, a very common pattern is to wean end users off of the "old" version by increasing the ratio of traffic routed to the new version over time. For example, you might start out at 50/50, then move to 60% new/40% old, then 70/30, 80/20, 90/10, and ultimately drain all traffic away from the old entity, freeing it up for disposal.

Developers expect to be able to deploy two different versions of the same thing into the same environment (note that the end user is unaware of the distinction between environments), and control the ratio of request routing between the two.

In some advanced cases, A/B deployments may use content-based routing to provide smarter routing than just percentage splits.

Recap

To recap:

  • Blue/Green - isolated, share-nothing environments
  • A/B - multiple versions sharing the same end-user perceived environment

Considered Alternatives

The following is a list of some of the solutions we're considering to enable the end-user scenarios above and improve the developer and operations experience.

Status Quo (Use Multiple Lattices)

This option is to leave the software as is and recommend that people use multiple lattices to deal with concurrent versioning concerns. Even if we do end up implementing one of the other options, we should still ensure that our documentation contains guides, tutorials, and advice on best practices for when to use multiple lattices and why.

Enable Versioned RPC within a Lattice (Preferred)

The current version of the OTP host enforces a very specific rule: if you attempt to start an entity (identified uniquely by public key) whose version string or revision number differs from one that is already known to be running, the request is rejected. In short, we block you from running two versions of the same thing in the same lattice.

Enabling first-class versioning support involves a number of steps, the first of which is lifting this ban on concurrent entities.

Implementation Details
This whole solution amounts to a relatively small change, again relying on NATS for the lion's share of the work.

When an actor starts up, subscriptions will be made (or reused) to the following topics:

  • wasmbus.rpc.{lattice-id}.{actor-pk}
  • wasmbus.rpc.{lattice-id}.{actor-pk}.{version}

When a capability provider starts up, subscriptions will be made (or reused) to the following topics:

  • wasmbus.rpc.{lattice-id}.{provider-pk}.{link-name}
  • wasmbus.rpc.{lattice-id}.{provider-pk}.{link-name}.{version}

In all of the above cases, versions will be sanitized such that they will not interfere with normal NATS tokenization. For example, we might just base64 encode the version.
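As a rough illustration, here is a minimal sketch in Rust of how a host might derive both subjects, assuming the base64 crate (0.21+ API) with its URL-safe, unpadded alphabet; the function name and example keys are hypothetical, not actual wasmCloud host internals:

```rust
use base64::{engine::general_purpose::URL_SAFE_NO_PAD, Engine as _};

/// Derive the unversioned and versioned RPC subjects for an actor.
/// The version string is base64-encoded (URL-safe, no padding) so that
/// values like "1.0.0" cannot inject extra `.` tokens into the subject.
fn actor_rpc_subjects(lattice_id: &str, actor_pk: &str, version: &str) -> (String, String) {
    let sanitized = URL_SAFE_NO_PAD.encode(version);
    let unversioned = format!("wasmbus.rpc.{lattice_id}.{actor_pk}");
    let versioned = format!("{unversioned}.{sanitized}");
    (unversioned, versioned)
}

fn main() {
    let (unversioned, versioned) = actor_rpc_subjects("default", "MABCDACTORXYZ", "1.0.0");
    println!("{unversioned}"); // wasmbus.rpc.default.MABCDACTORXYZ
    println!("{versioned}");   // wasmbus.rpc.default.MABCDACTORXYZ.MS4wLjA
}
```

The URL-safe, unpadded encoding avoids `+`, `/`, and `=` and, more importantly, can never produce a `.`, so the encoded version is guaranteed to remain a single NATS subject token.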

To enable everything from A/B and blue/green deployments to things like failure simulation, we can use NATS subject mapping, as described in the NATS documentation.

Any mapping made on the unversioned topic will instead deliver to the mapped target. We will want to document exactly how to set this up for common A/B, blue/green patterns and how we can use changes to an account JWT to change these mappings live at runtime. Note again that subject mapping will not be used for rolling/live updates.
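As a sketch of what this could look like, the following NATS server configuration fragment uses weighted subject mapping to split unversioned actor RPC traffic 90/10 between two versions; the subjects, actor key, and encoded version tokens are illustrative:

```
# A/B: 90% of unversioned RPC traffic to v1, 10% to v2.
# "MS4wLjA" and "Mi4wLjA" are base64-encoded "1.0.0" and "2.0.0".
mappings: {
  "wasmbus.rpc.default.MABCDACTORXYZ": [
    { destination: "wasmbus.rpc.default.MABCDACTORXYZ.MS4wLjA", weight: 90% },
    { destination: "wasmbus.rpc.default.MABCDACTORXYZ.Mi4wLjA", weight: 10% }
  ]
}
```

Blue/green falls out of the same mechanism: weight one version at 100%, then flip the mapping to the other version at cutover time.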

To make this work, we will need to do (at least) the following:

  • Ensure that claims are cached with a combined key of public key and version
  • Modify cache behavior such that binary caches are differentiated by version
  • Make the control interface stop and scale operations accept an optional version; if the version is missing and multiple versions of that entity are running on the target host, the host should reject the stop/scale attempt with an appropriate error message
  • Ensure that actors queue subscribe to both the versioned and unversioned topics (see the sketch after this list)
  • Ensure that capability providers queue subscribe to both the versioned and unversioned topics
  • Stop using the numeric revision (rev) property on claims in this system (should we look at retiring this claim field altogether?)
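A minimal sketch of the dual queue subscription, in Rust and assuming the async-nats client; the function and queue-group naming are illustrative rather than the host's actual implementation:

```rust
use futures::StreamExt;

/// Subscribe an actor's RPC handler to both its unversioned and
/// versioned subjects. A queue group per subject means an invocation
/// published to either subject is handled by exactly one instance,
/// while the unversioned subject stays available as a subject-mapping
/// source for A/B or blue/green splits.
async fn subscribe_actor_rpc(
    client: &async_nats::Client,
    lattice_id: &str,
    actor_pk: &str,
    sanitized_version: &str,
) -> Result<(), async_nats::Error> {
    let unversioned = format!("wasmbus.rpc.{lattice_id}.{actor_pk}");
    let versioned = format!("{unversioned}.{sanitized_version}");

    let mut unversioned_sub = client
        .queue_subscribe(unversioned.clone(), unversioned)
        .await?;
    let mut versioned_sub = client
        .queue_subscribe(versioned.clone(), versioned)
        .await?;

    // In a real host, each subscription would be driven by its own task;
    // this sketch only shows that both subjects deliver invocations.
    for sub in [&mut unversioned_sub, &mut versioned_sub] {
        if let Some(msg) = sub.next().await {
            println!("invocation on {}", msg.subject);
        }
    }
    Ok(())
}
```

An invocation sent to the versioned subject is guaranteed to land on an instance of exactly that version, while unversioned traffic can be steered by the mappings shown earlier.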

This change would be part of the new host specification, and all supported hosts would need to support this behavior.

Benefits
The great part about this approach is that if people aren't using subject mapping or starting concurrent revisions, the whole system works exactly as it did before, with no breaking changes or even configuration changes. We don't need to deploy any sidecars, proxies, or other central components: the combination of the dual-subscription pattern and NATS subject mappings is enough.

Add Central Configurable Routing to a Lattice

In terms of solution breadth and depth, this option is the largest. Any implementation would also require adding a version key to claims and binary caches.

The overall concept of this option is to allow a lattice to obey custom routing rules by deploying a component that acts as a central proxy. The low-level details are extensive. If a host is started with the custom routing option enabled, the actor RPC supervisor that dispatches calls would not be allowed to queue subscribe to the actor public key; instead, it would have to "regular" subscribe to a topic that reflects the actor's public key, host, and version.

A routing component in the "middle" of invocations would be the thing that subscribes to today's actor and provider RPC topics. It would obtain the invocation, use the target selector information on it to dispatch, and then make a request: effectively acting as a proxy. This would require that the proxy be aware of the claims cache and a full inventory of the lattice so that it knows what version of any given entity is running where.

The routing component would listen on wasmbus.rpc.{lattice-id}.> and no other components in the lattice would be allowed to listen on that topic space. It would then proxy that call to the appropriate target based on its internal rules, which would have to be defined somewhere.
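A heavily simplified sketch of such a router, in Rust with async-nats; the `wasmbus.rpcdirect` target topic space, the host and version tokens, and the routing function are all hypothetical, invented only to show the proxying shape:

```rust
use futures::StreamExt;

/// Illustrative routing decision. A real router would consult stored
/// routing rules plus a lattice inventory (which version of each entity
/// runs on which host). The "wasmbus.rpcdirect" topic space is invented
/// here so that targets live outside the proxied topic space and the
/// router never re-consumes its own forwarded traffic.
fn route(lattice_id: &str, entity_pk: &str) -> String {
    format!("wasmbus.rpcdirect.{lattice_id}.NHOSTXYZ.{entity_pk}.MS4wLjA")
}

async fn run_router(
    client: async_nats::Client,
    lattice_id: &str,
) -> Result<(), async_nats::Error> {
    // The router is the only component allowed to listen on this space.
    let mut sub = client
        .subscribe(format!("wasmbus.rpc.{lattice_id}.>"))
        .await?;

    while let Some(msg) = sub.next().await {
        // wasmbus.rpc.{lattice-id}.{entity-pk}... -> the fourth token.
        let entity_pk = msg.subject.split('.').nth(3).unwrap_or_default().to_string();

        // Proxy: forward the invocation, then relay the reply back.
        let target = route(lattice_id, &entity_pk);
        let response = client.request(target, msg.payload.clone()).await?;
        if let Some(reply) = msg.reply {
            client.publish(reply, response.payload).await?;
        }
    }
    Ok(())
}
```

Requirements, trade-offs, and open questions for this option include: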

  • Need to track the version of every running entity (e.g. OTP process in the OTP host)
  • Refactor the topics used by actors and providers to subscribe for RPC
  • Control interface actions would need to take version into account or select a version (see previous option)
  • Invocations would need to carry the version of the source
  • Providers would have to be modified to support host-targeted RPC, and new providers would have to conform to these rules
  • In order for a router to decide where to proxy the call, it needs routing rules that match source version x with target version y. Routing rule storage/retrieval now becomes mandatory for the lattice.
  • Introduces a network hop into all RPC when routing is on
    • Creates a single point of failure (SPOF) in routed environments
  • Would give us enormous flexibility and room to grow via support for a central router.
  • ❓ It would be helpful if we could put routing logic directly into the NATS cluster
  • 🤔❓ A potentially simpler implementation would be to use leaf nodes / sidecars on every host so that proxying could take place without the host changing its subject patterns.

Summary

Our current strategy is to design and implement Enable Versioned RPC within a Lattice as outlined above. As always, we are looking for feedback, ideas, and commentary to gain community perspective before implementation.

connorsmith256 pushed a commit to connorsmith256/wasmCloud that referenced this issue Oct 17, 2023

brooksmtownsend added this to the wasmCloud 1.0.0 milestone Nov 28, 2023

brooksmtownsend commented

As per #1119, treating the actor reference as the unique identifier instead of the signed public key is worth considering, as it would enable running actors that are not specifically signed with wascap.

brooksmtownsend commented

This is proposed in #1389, which adds the ability to use routing groups for more manual control over specifying entity versions; we'll track this effort there for simplicity 🙂

brooksmtownsend closed this as not planned Jan 26, 2024