[RFC] Single process multi-tenancy support #195
Here's what a new manifest might look like with multi-tenant actors:

```yaml
---
labels:
  sample: "Multi-tenant Echo"
actors:
  - image_ref: wasmcloud.azurecr.io/echo:0.2.0
    group_id: tenant1
  - image_ref: wasmcloud.azurecr.io/echo:0.2.0
    group_id: tenant2
capabilities:
  - image_ref: wasmcloud.azurecr.io/httpserver:0.1.0
links:
  - actor: ${MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5}
    actor_group: tenant1
    contract_id: "wasmcloud:httpserver"
    provider_id: "VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4AYJ5RKAX2UNSCAPHA5M"
    values:
      PORT: 8080
  - actor: ${MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5}
    actor_group: tenant2
    contract_id: "wasmcloud:httpserver"
    provider_id: "VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4AYJ5RKAX2UNSCAPHA5M"
    values:
      PORT: 8081
```
Big 👍🏻 on keeping it all backwards-compatible. I like the idea that the groups and link names (and, for that matter, the lattice namespace prefix) can all be left off and work fine in a single-tenant scenario. Opt-in complexity lives on.
Checking if there are any objections to this from @stevelr |
Some initial probing on the code required to make this work: it essentially adds a group ID to the actor's identity. The entire system internals were originally built around the idea that the single characteristic of an actor that uniquely differentiated it from all other things was its public key. We already had the notion that capability providers had a complex key (contract ID and public key), so when we added the link name, it was easy for capability providers. Capability provider documentation instructs the provider never to infer any meaning from the inside of the value used for the link name.

This makes for some ugly code, and I would immediately classify the addition of code like this as code debt. It will make the code more difficult to read and maintain, because there will be single-variable references to an entity that we know contains 2 things.

The alternative is to break the API contract for capability providers, capability provider configuration, and for the dispatcher to deal with this new complex key. This is cleaner, more readable, and does not incur tech debt ... but it's a sweeping breaking change that will require every first-party (and third-party, if there are any yet) capability provider to upgrade, rebuild, and redeploy.
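To make the "clean break" option concrete, here's a minimal sketch of what a composite actor key could look like, mirroring the public-key-plus-link-name key that providers already have. These are hypothetical types for illustration, not the actual wasmCloud internals:

```rust
// Hypothetical sketch only -- not the real wasmCloud types. A composite
// actor key analogous to the provider's (public key, link name) key.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ActorKey {
    /// The actor's public key, e.g. "MBCFOPM6JW2A...".
    pub public_key: String,
    /// Tenant/config group; falls back to "default", like link names do.
    pub group_id: String,
}

impl ActorKey {
    pub fn new(public_key: &str, group_id: Option<&str>) -> Self {
        ActorKey {
            public_key: public_key.to_string(),
            group_id: group_id.unwrap_or("default").to_string(),
        }
    }
}
```

Because the key derives `Eq` and `Hash`, two copies of the same signed module with different group IDs would hash to distinct entries in any dispatch table, which is exactly the behavior the manifest above needs.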
So, it boils down to the following:

1. Overload the existing actor ID string to carry both the public key and the group ID (code debt, but no breaking change).
2. Break the API contract and give actors a proper composite key (clean, but a sweeping breaking change for every capability provider).
3. Don't add a group ID at all, and require differently configured copies of an actor to be signed differently or run in separate hosts.
I am inclined to say that the first option is the best, and that if people want two copies of the same actor (e.g. 2 echos) to be running with different configurations because they are being run on behalf of 2 tenants sharing the same host, then perhaps those two tenants should be forced to sign that actor differently, giving them unique identifiers.
What if we deprecate the current API surface in favor of an explicitly versioned one? This gives us a hook for avoiding hard-breaking API changes in the future, because we will always have a way to check which API version is in use. We would then be able to deprecate an API and give it a sunset period.
If we were going to overload the actor ID string, I recommend adding a prefix, like two ASCII chars plus a period ("ac."), so it's quick to check the string type. Then the library (because all code that peeks into the actor ID string would have to use the library!) would branch on that prefix tag and parse the rest of the string.
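A sketch of that suggestion follows. The "ac." tag and the field layout after it are illustrative assumptions (not a spec); the point is that a single shared parser branches on the prefix and legacy bare keys still work:

```rust
/// Hypothetical parser for a prefix-tagged actor ID.
/// Tagged form assumed here: "ac.<public_key>.<group_id>".
/// Returns (public_key, group_id).
pub fn parse_actor_id(id: &str) -> (String, String) {
    if let Some(rest) = id.strip_prefix("ac.") {
        // Tagged form: treat everything after the last '.' as the group ID.
        match rest.rsplit_once('.') {
            Some((key, group)) => (key.to_string(), group.to_string()),
            // Tagged but no group segment: fall back to "default".
            None => (rest.to_string(), "default".to_string()),
        }
    } else {
        // Legacy form: a bare public key with an implicit "default" group.
        (id.to_string(), "default".to_string())
    }
}
```

This only stays safe if every consumer goes through the library; any code that slices the string by hand (like the length assumption mentioned below) breaks.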
I know of at least one piece of code (that I wrote) in the http-server provider that makes an assumption about the length of the actor ID string.
I want to approach this from a different scenario, and see if I can appeal to a different use case that may affect which of the three choices is the best solution.

Imagine you have your prod deployment where you have an actor (an HTTP API with a few CRUD endpoints) running with the appropriate capability provider links. In an advanced world where we can do canary deployments or a gradual rollout, we may want to launch that same actor (albeit a newer version) with the same provider (potentially with an updated interface, for example). If we want to divert only 10% of production traffic to the new API so we can monitor for breaking changes, today we would have to use a different link name for the provider (fine) and sign the new actor with a different public key (less fine). I think this scenario can come up in a few different ways, as you can see with the krustlet use case of using a single wasmCloud host for multi-tenancy. This RFC already talks about a scenario like this, but just to put it in another perspective: it may be a pain if we don't have this group ID.

That being said, I also see the concern with hiding the information. I'm usually tempted to go with a solution that just works, but I have a feeling that breaking the API now is going to be easier than later. Breaking all capability providers is unfortunate, and potentially necessary, but would this mean that after this version there will be incompatibilities between capability providers and actors? E.g. will the newly updated http-server provider be incompatible with the echo actor that we built a while ago? If so, I really don't like having fragmented interactions between actors and providers 🤔
just leaving a note here so I don't forget - we may want to think more about how group ids are generated and disclosed, concerning security and privacy.
Whoops. That's one reason why we documented it a while back saying not to make any assumptions about the module contents.
@stevelr I have a few thoughts on your notes (2, 3): I think actors and providers shouldn't have a concept of the group ID; in my mind I see this (and the link name) only as a way to supply different configurations between identical instances of actors and providers. I also think (4) that the group ID should be similar to the provider link_name, e.g. text supplied by the configuring party. Agreed with (1), though I'm not sure what that attack vector ends up looking like.

edit: this wasn't meant to be a definitive answer, just what my thoughts were
There may be a common solution for blue-green deployments (as Brooks describes above), development vs. production, and multi-tenancy.
So here's some context you folks might not have: I've been looking into using named pipes for host-to-provider communication. This is almost entirely unrelated to the current issue, with the exception that if we wanted to bundle breaking changes into the same place, we could potentially wait to sort out this actor multi-tenancy thing until we finish with the named pipes spike and subsequent implementation.
Actors should not be aware of their own membership within a group. Runtime configuration should be the only way to place an actor into a group. Think of the edge case where two potential targets reside in the same host, differentiated only by a group ID: we now essentially have to ensure that the target and the source of the invocation in an actor-to-actor scenario have the same group ID for multi-tenant security.
For security purposes: if someone has access to a host to create link definitions (e.g. via the control interface), they effectively control which group an actor belongs to.
Presumably, when an actor invokes host_call, the host can tell which actor instance it came from without relying on the parameters provided in host_call, and from that look up its group ID?
Consumers can add call-specific security to verify that an actor is allowed to call another actor, or can restrict an actor from talking to a provider they already have signed claims for. Yes, the host can infer the source actor so it could keep the capability provider entirely ignorant of the group ID (unless the capability provider looked inside the module field...) |
Yes, this can be captured in the initial closure when we set the waPC callback function.
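A rough sketch of that closure-capture idea (the function name and signature here are hypothetical, loosely modeled on a waPC-style host callback, not the actual wasmCloud code): the actor's identity, including its group ID, is baked into the callback at registration time, so the host never has to trust anything the guest passes in:

```rust
// Hypothetical sketch: build the host-call handler for one actor
// instance, capturing its identity (and group ID) in the closure.
pub fn make_host_call_handler(
    actor_key: String,
    group_id: String,
) -> impl Fn(&str, &[u8]) -> Vec<u8> {
    move |operation: &str, _payload: &[u8]| {
        // The host knows exactly which (actor, group) instance is calling,
        // regardless of what the payload claims about itself.
        format!("{}/{} invoked {}", actor_key, group_id, operation).into_bytes()
    }
}
```

The handler returned here just echoes the resolved identity for demonstration; a real host would use the captured pair to route the call.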
Another tidbit of context: we'll also have to ensure that call aliases take the group ID into account.
Could all call aliases / links / actors be backwards-compatible by specifying that if a group ID is not specified, the group ID is `default`?
This is the subject of my question 3 above
If the cap provider can initiate a message to an actor (which it can do for async communication to an actor, such as forwarding a NATS subscription, or an http-server forwarding an HttpRequest), it needs to know the group ID so it can get to the correct actor. That means that a malicious (or buggy) capability provider can change the group ID, and we can no longer guarantee that one tenant's data doesn't spill over to other tenants. Off the top of my head, I don't know an easy way to prevent this, since a cap provider linked to actors affiliated with different tenants has to be able to initiate calls to any of them.

However, if there were a dedicated "channel" for each link (possibly named pipes?), it might make this kind of error less likely. Also, if the channel is managed by the host, and the group within the channel is not visible to the cap provider, it may provide slightly better privacy for the tenant's data.
Yes, that's the general idea. Call aliases, link definitions, actor public keys -- all just get a group ID of `default` when none is specified.
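That defaulting rule could be as small as this (an illustrative sketch, not the real host code); the whole backwards-compatibility story hinges on one consistent fallback applied everywhere an identifier is resolved:

```rust
/// Illustrative sketch of the rule discussed above: any call alias,
/// link definition, or actor reference that omits a group
/// resolves to the "default" group.
pub fn group_or_default(group: Option<&str>) -> &str {
    match group {
        Some(g) if !g.is_empty() => g,
        _ => "default",
    }
}
```

Treating the empty string the same as "absent" is a deliberate choice in this sketch, so that a blank field in a manifest can't create a distinct, invisible group.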
For an analogy to demonstrate the privacy issue: let's say an actor is your browser's HTTP proxy, and the cap provider is your ISP. You don't want your browsing history to be tracked by the ISP, so you don't want the provider to be able to associate your tenant ID with the data unless you have explicitly added it to the data protocol. We should design the API to minimize this kind of privacy leakage.
Right, so for those playing the home game: basically what we're getting at is that a capability provider could grab data bound for an actor in one tenant's group and deliver it to an actor in another tenant's group.

What bugs me about this edge case is that it's entirely possible to do something similar right now. A malicious capability provider has a list of bound actors. It could choose to take an HTTP request bound for actor A and deliver it to actor B just for giggles.
I'm starting to think that the risk here is the same one we already accept by trusting a capability provider at all. We'd have to mitigate this by including the root of trust for providers in what the host verifies.
Agreed. It's also the same as the cap provider emailing the data out or posting it to twitter. It's native code and not in a sandbox so there has to be some level of trust. I just want us to be really explicit about the threat model and affirmatively decide up-front what privacy guarantees we can and can't make. If we can identify a risk that can be mitigated in api & protocol design, then we should do that. |
💯 - whatever attack vectors there are for this, I want to ensure that we know which ones they are and the likelihood that they'll be exploited. We'll want to document this, not necessarily as a risk for first-party providers, but definitely for trusting community providers (native, not WASI, in the future).

But, to circle back around to the topic at hand: do we think it's worth implementing this "group ID" complex key for actors, or do we simply recommend that if you want multiple tenants, you run multiple wasmCloud host processes? (The empty wasmCloud process barely even registers in terms of resource usage.)
For context, another question: isn't the notion of using a single host process for multiple tenants at odds with the isolation those tenants expect?

Advantages of running each tenant in its own host:

- strong isolation: there are no shared group IDs for a confused or malicious provider to mix up, so one tenant's data can't spill into another tenant's actors
- no breaking changes to the capability provider API, and no composite-key code debt in the host
- hosts are cheap: an empty wasmCloud process has a very small footprint, so one host per tenant is affordable
Unless there are any other objections, I'll create an ADR entry that describes our stance on multi-tenancy, which, in short, is that a single wasmCloud host process serves a single tenant; multi-tenant workloads should run one host process per tenant.
No objections, though I would like to emphasize the need for #46 (benchmark tests for the runtime) as we should avoid long host startup times and keep the binary size low if we want to advocate for multiple hosts for multi-tenancy. I like framing the host as a "pod" in some sense, while the actors are containers. Does @thomastaylor312 have any thoughts? |
Unsure how the tenancy requirements will affect benchmarks/performance. What we're suggesting here for tenancy doesn't change the codebase, only the documentation. |
+1 for ADR |
Signed-off-by: Brooks Townsend <brooks@cosmonic.com>
Background
The way wasmCloud uniquely identifies actors is by their public key. The public key corresponds directly to the private key that was used to sign the module. This generally means that when people build their own actors, the actors will have unique public keys and will therefore never interfere with each other.
wasmCloud assumes that if a given actor occurs multiple times within the same lattice (or eventually within the same host, once we support #54 local actor pools), those occurrences are identical copies of one another. Essentially, the actor's public/primary key is the unit of horizontal scale-out uniqueness. If I want to run 5 copies of my actor, then all 5 of those copies are considered horizontally scaled actors and must be identical, including their configuration. This is the same concept as increasing the scale count of a pod: each of those pods must have been created with the same podspec, otherwise the horizontal scale group would be internally inconsistent.
The Problem
The problem is with multi-tenancy within a single host process. We never really considered the edge case that multiple different owning entities would ever want to run the same actors in the same host but with different configurations. However, this is actually a pretty common multi-tenant scenario, as evidenced by krustlet's use of a single wasmCloud host runtime per node (and not per pod).
Easy steps to reproduce the problem:

1. Start an `echo` actor in a wasmCloud host process.
2. Bind the `echo` actor to port `8080` of the HTTP server provider.
3. Add a second copy of the `echo` actor to the host.
4. There are now 2 `echo` actors running in the same host. You cannot then bind the echo actor to the HTTP server provider with port `8080` for tenant 1 and port `8081` for tenant 2.

Proposed Solution
The proposed solution to this multi-tenant problem would be to give actors a second field in a composite key. Capability providers already have this: they have both a public key and a link name. Actors could be started with a `group id`. This group ID would be part of that actor's unique identifier. Horizontal scale-out of an actor would then occur for the combination of the actor's public key and its group ID. Dispatch from capability providers to actors would be done through the new composite key of public key + group ID. Link definitions would optionally be able to specify the source actor's group ID (if left off, the group ID would be assumed to be `default`). This would enable the scenario where two different copies of the same exact echo actor could be running inside the same wasmCloud host with different configuration bindings, while still enabling each of those actors to horizontally scale independently.

This would operate very much like the concept of a `queue group` in message broker terminology. If no group ID is specified at the time an actor is started, then it would use `default`, just the way capability providers do. This would provide some measure of backwards compatibility, in that scenarios that don't require strict multi-tenancy would continue to operate just fine, as if nothing had changed.
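To illustrate the queue-group analogy, here is a minimal sketch (type and field names are hypothetical, not the host's real data structures) of scale-out membership keyed by the (public key, group ID) pair, so identical actors in different groups form separate pools:

```rust
use std::collections::HashMap;

// Hypothetical sketch, not the real host internals: under this proposal,
// horizontal scale-out is per (public key, group ID), like a NATS queue
// group. Each key owns a pool of interchangeable instances.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ScaleKey {
    pub public_key: String,
    pub group_id: String, // "default" when not specified at start time
}

#[derive(Default)]
pub struct Host {
    // Maps each scale key to its pool of instance handles.
    pub instances: HashMap<ScaleKey, Vec<u64>>,
}

impl Host {
    pub fn start_actor(&mut self, pk: &str, group: Option<&str>, handle: u64) {
        let key = ScaleKey {
            public_key: pk.to_string(),
            group_id: group.unwrap_or("default").to_string(),
        };
        // Same key => joins the existing pool; new key => new pool.
        self.instances.entry(key).or_default().push(handle);
    }
}
```

Starting the same signed `echo` module for `tenant1` and `tenant2` would produce two separate pools, while starting it twice for `tenant1` would grow one pool, which is exactly the independent-scaling behavior described above.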