[RFC] Support Multiple Entity Versions within a Lattice #363
Comments
Signed-off-by: Brooks Townsend <brooks@cosmonic.com>
As per #1119, treating the actor reference as a unique identifier instead of the signed public key is worth considering, as it would enable running actors that are not specifically signed with
This is proposed in #1389 with the ability to use routing groups for more manual control over specifying entity versions and we'll track this effort there for simplicity 🙂
Entity Versioning RFC
This request for comment seeks commentary on potential solutions to the multi-layered problem of supporting multiple versions of the same entity in the same environment.
Note that the following scenarios are explicitly out of scope for this RFC, and should not be considered at all when going through the options.
`wadm` should be able to provide this functionality, where individual entities can be converted from one version to another without message loss.
Behavior Expectations for Modern Clusters
The following is an outline of the different types of behavior that developers are likely to expect, based on exposure to Kubernetes, Nomad, Mesos, and Docker, as well as conversations with a large number of developers working in this space.
This section deliberately tries to avoid discussing specific technology implementations and designs so that it can focus solely on high-level user expectations and patterns.
Blue/Green Deployment
A blue/green deployment refers to two completely isolated but parallel environments. Typically `blue` refers to what's running in production while `green` is a hot staging environment. For example, we might have `v1` of our application in blue and `v2` of the application in green.
Put another way, end users are only ever able to communicate with one environment (e.g. production/blue) at a time. Live customers cannot access the pending/green environment.
Also note that the choice of color is arbitrary here; the important aspect is the swap, not the colors.
We can perform tests and other verification against the green environment to make sure that everything is as we expect and all the acceptance tests pass. When we're ready to deploy the new version, we swap the "live" route to green.
Developers should expect to utilize some tooling to switch the live environment from one to the other. We also typically expect zero downtime during this switch, though having short gaps (measured in seconds or minutes) is also not unheard of here. As mentioned above, this is not related to rolling upgrades.
A/B Deployment
A/B deployments involve end users accessing two different versions of the same thing. These two versions could be in the same logical environment or they could be separate. This just depends on the tools and networking being employed.
An A/B deployment (also called A/B testing) is designed to gather live feedback and data on one version while still maintaining another. In some cases developers may want to route 50% of end user traffic to the "A" version while routing the other 50% to the "B".
Once developers are satisfied with the results of running both, a very common pattern is to wean end users off of the "old" version as the ratio of traffic routed to the new version increases over time. For example, you might start out at 50/50, then move to 60% new/40% old, then 70/30, 80/20, 90/10, and then ultimately drain all traffic away from the old entity, freeing it up for disposal.
Developers expect to be able to deploy two different versions of the same thing into the same environment (note that the end user is unaware of the distinction between environments), and control the ratio of request routing between the two.
In some advanced cases, A/B deployments may use content-based routing to provide smarter routing than just percentage splits.
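The percentage-split behavior described above can be sketched as a simple weighted selection per request (the function and weight format here are hypothetical, purely to illustrate the ratio-based routing idea):

```python
import random

def pick_version(weights: dict[str, float]) -> str:
    """Pick a version for one incoming request according to traffic
    weights, e.g. {"v1": 0.8, "v2": 0.2} routes ~80% of requests to v1."""
    versions = list(weights)
    return random.choices(versions, weights=[weights[v] for v in versions])[0]
```

Draining the old version then corresponds to gradually shifting the weights (50/50 → 80/20 → 100/0) until the old entity receives no traffic.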
Recap
To recap:
- A blue/green deployment swaps all live traffic between two isolated environments, with zero (or near-zero) downtime during the switch.
- An A/B deployment runs two versions side by side and controls the ratio (or content-based rules) of end-user traffic routed to each.
Considered Alternatives
The following is a list of some of the solutions that we're considering as ways to enable the end-user scenarios to improve the developer and operations experience.
Status Quo (Use Multiple Lattices)
This option is to leave the software as is and recommend that people use multiple lattices to deal with concurrent versioning concerns. Even if we do end up implementing one of the other options, we should still ensure that our documentation contains guides, tutorials, and advice on best practices for when to use multiple lattices and why.
Enable Versioned RPC within a Lattice (Preferred)
The current version of the OTP host enforces a very specific rule: if you attempt to start an entity (defined uniquely by public key) that has a different version string or revision number than one already known to be running, the request is rejected. In short, we block you from running two versions of the same thing in the same lattice.
Enabling first-class versioning support involves a number of steps, the first of which is lifting this ban on concurrent entities.
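The current rule can be sketched as a small check at entity-start time (names and data shape are hypothetical; the actual host logic lives in the OTP host):

```python
def can_start(running: dict[str, str], public_key: str, version: str) -> bool:
    """Sketch of the current host rule: an entity (keyed by public key)
    may only start if no *different* version of it is already running
    in the lattice."""
    existing = running.get(public_key)
    return existing is None or existing == version
```

Lifting the ban amounts to removing this check (or relaxing it to always return `True`) once versioned RPC topics make concurrent versions safe.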
Implementation Details
This whole solution amounts to a relatively small change, again relying on NATS for the lion's share of the work.
When an actor starts up, subscriptions will be made (or reused) to the following topics:
- `wasmbus.rpc.{lattice-id}.{actor-pk}`
- `wasmbus.rpc.{lattice-id}.{actor-pk}.{version}`
When a capability provider starts up, subscriptions will be made (or reused) to the following topics:
- `wasmbus.rpc.{lattice-id}.{provider-pk}.{link-name}`
- `wasmbus.rpc.{lattice-id}.{provider-pk}.{link-name}.{version}`
In all of the above cases, versions will be sanitized such that they will not interfere with normal NATS tokenization. For example, we might just base64 encode the version.
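A minimal sketch of that sanitization and the resulting subscription subjects, assuming the base64 approach mentioned above (function names are illustrative, not part of any host API):

```python
import base64

def sanitize_version(version: str) -> str:
    """URL-safe base64 without padding, so characters like '.' in a
    semver string (e.g. '1.0.0-rc.1') cannot interfere with NATS
    subject tokenization."""
    return base64.urlsafe_b64encode(version.encode()).decode().rstrip("=")

def actor_rpc_subjects(lattice_id: str, actor_pk: str, version: str) -> list[str]:
    """Both subjects an actor would subscribe to: unversioned and versioned."""
    base = f"wasmbus.rpc.{lattice_id}.{actor_pk}"
    return [base, f"{base}.{sanitize_version(version)}"]

def provider_rpc_subjects(lattice_id: str, provider_pk: str,
                          link_name: str, version: str) -> list[str]:
    """Both subjects a capability provider would subscribe to."""
    base = f"wasmbus.rpc.{lattice_id}.{provider_pk}.{link_name}"
    return [base, f"{base}.{sanitize_version(version)}"]
```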
To enable everything from A/B and blue/green to even things like failure simulation, we can use NATS subject mapping, as described in the NATS documentation.
Any mapping made on the unversioned topic will instead deliver to the mapped target. We will want to document exactly how to set this up for common A/B, blue/green patterns and how we can use changes to an account JWT to change these mappings live at runtime. Note again that subject mapping will not be used for rolling/live updates.
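As an illustration (subject tokens shortened, weights arbitrary, and the versioned suffixes shown as base64-encoded version strings), a static `nats-server.conf` mapping for a 90/10 A/B split on an actor's unversioned RPC subject might look like:

```
# Illustrative only: 90% of unversioned RPC traffic to v1, 10% to v2.
mappings = {
  "wasmbus.rpc.default.MACTOR": [
    { destination: "wasmbus.rpc.default.MACTOR.MS4wLjA", weight: 90% },
    { destination: "wasmbus.rpc.default.MACTOR.Mi4wLjA", weight: 10% }
  ]
}
```

A blue/green swap is the degenerate case: a single mapping at 100% to one versioned subject, later repointed to the other.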
To make this work, we will need to do (at least) the following:
- The `revision` (`rev`) property on claims will not be used by this system (should we look at retiring this claim field altogether?)

This change would be part of the new host specification, and all supported hosts would need to support this behavior.
Benefits
The great part about this is that if people aren't using version mapping or starting concurrent revisions, then the whole system works exactly as it did before without breaking changes or even config changes. To make this work, we don't need to deploy any sidecars, proxies, or any other central components. The combination of the dual subscription pattern and NATS subject mappings would be enough.
Add Central Configurable Routing to a Lattice
In terms of solution breadth and depth, this one is the largest. Any implementation of this solution would also require adding a version key to claims and binary caches.
The overall concept of this option is to allow for a lattice to obey custom routing rules by deploying a component that acts as a central proxy. The low-level details are extensive. If a host is started with the custom routing option enabled, then the actor RPC supervisor that dispatches calls would not be allowed to queue subscribe to the actor public key, but instead would have to "regular" subscribe to a topic that reflects the actor's public key, host, and version.
A routing component in the "middle" of invocations would be the thing that subscribes to today's actor and provider RPC topics. It would obtain the invocation, use the target selector information on it to dispatch, and then make a request: effectively acting as a proxy. This would require that the proxy be aware of the claims cache and a full inventory of the lattice so that it knows what version of any given entity is running where.
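A sketch of the rule-resolution step such a routing component might perform (the subject shape follows the RPC topics above; the rule format and function name are hypothetical):

```python
def resolve_target(subject: str, rules: dict[str, str]) -> str:
    """Map an unversioned RPC subject to a versioned target subject.
    Subject shape: wasmbus.rpc.{lattice-id}.{entity-pk}[.{link-name}];
    rules map an entity public key to a (sanitized) target version.
    Subjects with no matching rule pass through unchanged."""
    tokens = subject.split(".")
    entity_pk = tokens[3]  # works for both actor and provider subjects
    version = rules.get(entity_pk)
    return subject if version is None else f"{subject}.{version}"
```

The real component would additionally need the claims cache and lattice inventory to validate that the chosen version is actually running somewhere before re-publishing the invocation.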
The routing component would listen on `wasmbus.rpc.{lattice-id}.>` and no other components in the lattice would be allowed to listen on that topic space. It would then proxy each call to the appropriate target based on its internal rules, which would have to be defined somewhere, e.g. route invocations for entity `x` to target version `y`. Routing rule storage/retrieval now becomes mandatory for the lattice.
Summary
Our current strategy is to design and implement Enable Versioned RPC within a Lattice as outlined above. As usual we are always looking for feedback, ideas, and commentary on this to gain community perspective before implementation.