
[RFC] Single process multi-tenancy support #195

Closed
autodidaddict opened this issue May 7, 2021 · 35 comments
Labels: enhancement (New feature or request), RFC, spike (Requires more investigation)

Comments

@autodidaddict
Member

Background

The way wasmCloud uniquely identifies actors is by their public key. The public key corresponds directly to the private key that was used to sign the module. This generally means that whenever people are building their own actors, those actors will have unique public keys and will therefore never interfere with each other.

wasmCloud assumes that if a given actor occurs multiple times within the same lattice (or eventually within the same host, once we support #54 local actor pools), those copies are identical instances of the same actor. Essentially, the actor's public/primary key is the unit of horizontal scale-out uniqueness. If I want to run 5 copies of my actor, then all 5 of those copies are considered horizontally scaled actors and must be identical, including their configuration. This is the same concept as increasing the scale count of a pod: each of those pods must have been created with the same podspec, otherwise the horizontal scale group will be internally inconsistent.

The Problem

The problem is with multi-tenancy within a single host process. We never really considered the edge case that multiple different owning entities would ever want to run the same actors in the same host but with different configurations. However, this is actually a pretty common multi-tenant scenario, as evidenced by krustlet's use of a single wasmCloud host runtime per node (and not per pod).

Easy steps to reproduce the problem:

  1. Follow any of the tutorials and add/start the echo actor in a wasmCloud host process
  2. Bind that echo actor to port 8080 of the HTTP server provider.
  3. Now, a second tenant also follows the same tutorial, and wants to add the echo actor to the host.
    1. Even if you could add the actor twice (which you currently cannot), the host would assume this is a horizontal scaleout operation, and not a multi-tenant operation.
  4. Now you have two echo actors running in the same host. You cannot then bind the echo actor to the HTTP server provider with port 8080 for tenant 1 and port 8081 for tenant 2.

Proposed Solution

The proposed solution to this multi-tenant problem would be to give actors a second field in a composite key. Capability providers already have this, where they have both a public key and a link name. Actors could be started with a group id. This group ID would be part of that actor's unique identifier. Horizontal scale-out of an actor would then occur for the combination of the actor's public key and its group ID. Dispatch from capability providers to actors would be done through the new composite key of public key + group ID. Link definitions would optionally be able to specify the source actor's group ID (if left off, the group ID would be assumed to be default). This would enable the scenario where two different copies of the same exact echo actor could be running inside the same wasmCloud host with different configuration bindings, while still enabling each of those actors to horizontally scale independently.

This would operate very much like the concept of a queue group in message broker terminology. If no group ID is specified at the time an actor is started, then it would use default, just the way capability providers do. This would provide some measure of backwards compatibility in that scenarios that don't require strict multi-tenancy would continue to operate just fine as if nothing had changed.
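For illustration, a minimal sketch of what such a composite key could look like on the host side (hypothetical type and field names, not actual wasmCloud code):

// Hypothetical composite identifier: the (public key, group ID) pair,
// rather than the public key alone, becomes the unit of uniqueness.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ActorGroupKey {
    pub public_key: String, // e.g. "MBCFOPM6JW2A..."
    pub group_id: String,   // e.g. "tenant1"; "default" when omitted
}

impl ActorGroupKey {
    pub fn new(public_key: impl Into<String>, group_id: Option<&str>) -> Self {
        ActorGroupKey {
            public_key: public_key.into(),
            // Queue-group style default, mirroring capability provider link names.
            group_id: group_id.unwrap_or("default").to_string(),
        }
    }
}

Horizontal scale-out, link resolution, and dispatch would then all be keyed off this pair instead of the bare public key.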

@autodidaddict added the enhancement, RFC, and spike labels May 7, 2021
@autodidaddict
Member Author

Here's what a new manifest might look like with multi-tenant actors:

---
labels:
    sample: "Multi-tenant Echo"
actors:
    - image_ref: wasmcloud.azurecr.io/echo:0.2.0
      group_id: tenant1
    - image_ref: wasmcloud.azurecr.io/echo:0.2.0
      group_id: tenant2
capabilities:
    - image_ref: wasmcloud.azurecr.io/httpserver:0.1.0
links:
  - actor: ${MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5}
    actor_group: tenant1
    contract_id: "wasmcloud:httpserver"
    provider_id: "VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4AYJ5RKAX2UNSCAPHA5M"
    values:
      PORT: 8080
  - actor: ${MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5}
    actor_group: tenant2
    contract_id: "wasmcloud:httpserver"
    provider_id: "VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4AYJ5RKAX2UNSCAPHA5M"
    values:
      PORT: 8081

@brooksmtownsend
Member

Big 👍🏻 on keeping it all backwards-compatible, I like the idea that the groups and link names (and for that matter, lattice namespace prefix) can all be left off and work fine in a single-tenant scenario. Opt-in complexity lives on.

@autodidaddict
Member Author

Checking if there are any objections to this from @stevelr

@autodidaddict
Member Author

Some initial probing on the code required to make this work essentially adds a group_id field to the WasmcloudEntity enum, and it requires that, everywhere we were dispatching a string for the actor's public key, we now pack that string to contain both the actor's public key and the actor's group ID. This might be a string like MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5|default rather than just MBCFOPM6JW2APJLXJD3Z5O4CN7CPYJ2B4FTKLJUR5YR5MITIU7HD3WD5.
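A rough sketch of that packing and unpacking (hypothetical helper functions, purely to illustrate the shape of the packed key):

fn pack_actor_key(public_key: &str, group_id: &str) -> String {
    // e.g. "MBCFOPM6JW2A...|default"
    format!("{}|{}", public_key, group_id)
}

fn unpack_actor_key(packed: &str) -> (&str, &str) {
    // A bare public key with no separator falls back to the default group.
    match packed.split_once('|') {
        Some((public_key, group_id)) => (public_key, group_id),
        None => (packed, "default"),
    }
}

The readability cost described below is already visible here: any plain string holding an "actor ID" now silently carries two pieces of information.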

The entire system internals were originally built around the idea that the single characteristic of an actor that uniquely differentiated it from all other things was just the public key. We already had the notion that capability providers had a complex key (Contract ID and public key) so when we added the link name, it was easy for capability providers.

Capability provider documentation instructs the provider never to infer any meaning on the inside of the value used for the module field in capability configuration for a link definition. So, all existing capability providers could continue to function and dispatch properly as long as the module field for configuration were updated to contain the new "packed" key string instead of the old one.

This makes for some ugly code and I would immediately classify the addition of code like this as code debt. It will make the code more difficult to read and maintain, because there will be single-variable references to an entity that we know contains 2 things.

The alternative is to break the API contract for capability providers, capability provider configuration, and for the dispatcher to deal with this new complex key. This is cleaner, more readable, and does not incur tech debt... but it's a sweeping breaking change that will require every first party (and third, if there are any yet) capability provider to upgrade, rebuild, and redeploy.

@autodidaddict
Member Author

autodidaddict commented May 13, 2021

So, it boils down to the following:

  • Do nothing. Require that within a single wasmCloud host, only a single horizontally scalable actor with a given public key can exist at a time. This option would not prevent the future implementation of "Actor pools", allowing for a single host to instantiate multiple copies of the same actor.
  • Use the "packed string" for the new dispatch key for actors, not breaking the existing API but adding hidden meaning to single variables and incurring tech debt and possible maintenance cost.
  • Break the API, require dispatching in and out and calling from actor-to-actor to use explicit group identifiers.

@autodidaddict
Member Author

I am inclined to say that the first option is the best, and that if people want two copies of the same actor (e.g. 2 echos) to be running with different configurations because they are being run on behalf of 2 tenants sharing the same host, then perhaps those two tenants should be forced to sign that actor differently, giving them unique identifiers.

@stevelr
Contributor

stevelr commented May 13, 2021

What if we deprecate the capability_provider! macro and make a new one that takes an extra parameter (an api version number), and define a second FFI entry point with the semantics of "get_api_version()", which returns the u32 version number? The host would check for the presence of this FFI when loading a capability provider, and if it's not there, revert to backwards-compatibility mode.

This gives us a hook for avoiding hard-breaking api changes in the future, because we will always have a way to check which api version is in use. We would then be able to deprecate an api and give it a sunset period.
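A sketch of what that optional version probe might look like on the provider side (hypothetical symbol name and constant, following the idea above rather than any existing macro output):

// Hypothetical extra FFI entry point a new capability_provider!-style macro
// could emit. The host probes for this symbol at load time; if it is absent,
// the provider predates API versioning and the host falls back to
// backwards-compatibility mode.
const CAPABILITY_API_VERSION: u32 = 2;

#[no_mangle]
pub extern "C" fn get_api_version() -> u32 {
    CAPABILITY_API_VERSION
}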

@stevelr
Contributor

stevelr commented May 13, 2021

If we were going to overload the actor ID string, I recommend adding a prefix, like two ascii chars plus a period, "ac." so it's quick to check the string type, and then the library (because all code that peeks into the actor id string would have to use the library!) would branch on that prefix tag and parse the rest of the string.
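A sketch of that prefix-tagged parsing, assuming a hypothetical "ac." tag and a pipe-delimited layout (the point being that only the shared library ever looks inside the string):

// Hypothetical library-owned parser; all code that peeks into an actor ID
// string would call this instead of making assumptions about its layout.
fn parse_actor_id(id: &str) -> (&str, Option<&str>) {
    if let Some(rest) = id.strip_prefix("ac.") {
        // Tagged form, e.g. "ac.MBCFOPM6...|tenant1" (illustrative layout).
        match rest.split_once('|') {
            Some((public_key, group_id)) => (public_key, Some(group_id)),
            None => (rest, None),
        }
    } else {
        // Untagged legacy form: a bare public key.
        (id, None)
    }
}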

@stevelr
Contributor

stevelr commented May 13, 2021

I know at least one piece of code (that I wrote) in http-server provider that makes an assumption on the length of the actor ID string

@brooksmtownsend
Member

I want to approach this from a different scenario and see if I can appeal to a different use case that may affect which of the three choices is the best solution.

Imagine you have your prod deployment where you have an actor (HTTP API with a few CRUD endpoints) running with the appropriate capability provider links. In an advanced world where we can do canary deployments or a gradual rollout, we may want to launch that same actor (albeit a newer version) with the same provider (potentially with an updated interface, for example). If we want to divert only 10% of production traffic to the new API so we can monitor for breaking changes, today we would have to use a different link name for the provider (fine) and sign the new actor with a different public key (less fine). I think this scenario can come up in a few different ways, as you can see with the krustlet use case of using a single wasmcloud host for multitenancy. This RFC already talks about a scenario like this, but I wanted to put it in another perspective: it may be a pain if we don't have this group id.

That being said, I also see the concern with hiding the information. I'm usually tempted to get a solution that just works, but I have a feeling that breaking the API now is going to be easier than later. Breaking all capability providers is unfortunate, potentially necessary, but would this mean that after this version there will be incompatibilities between capability providers and actors? E.g. will the newly updated http-server provider be incompatible with the echo actor that we built a while ago? If so, I really don't like having fragmented interactions between actors and providers 🤔

@stevelr
Contributor

stevelr commented May 13, 2021

just leaving a note here so I don't forget - we may want to think more about how group ids are generated and disclosed, concerning security and privacy.

  1. it must be impossible for an actor to pretend they are in a group they are not
  2. should an actor be able to know what group they are in?
  3. should the group id be hidden from the capability provider?
  4. should group id be numeric or random?

@autodidaddict
Member Author

I know at least one piece of code (that I wrote) in http-server provider that makes an assumption on the length of the actor ID string

Woops, that's one reason why we documented it a while back saying not to make any assumptions about the module contents.

@brooksmtownsend
Member

brooksmtownsend commented May 13, 2021

@stevelr I have a few thoughts on your notes (2, 3): I think actors and providers shouldn't have a concept of the group ID; in my mind I see this (and the link name) only as a way to supply different configurations between identical instances of actors and providers. I also think (4) that the group id should be similar to the provider link_name, e.g. text supplied by the configuring party. Agreed with (1), though I'm not sure what that attack vector ends up looking like.

edit: this wasn't meant to be a definitive answer, just what my thoughts were

@stevelr
Contributor

stevelr commented May 13, 2021

Imagine you have your prod deployment where you have an actor (HTTP API with a few CRUD endpoints) running with the appropriate capability provider links. In an advanced world where we can do canary deployments or a gradual rollout

There may be a common solution for blue-green deployments (as Brooks describes above), development vs production, and multi-tenancy

@autodidaddict
Member Author

autodidaddict commented May 13, 2021

So here's some context you folks might not have. I've been looking into using named pipes as a means by which the wasmcloud host communicates with capability providers. This might be a way to introduce any potential breaking changes, because this would be an entirely new means of communication. The thought there is that you have named pipes called Vxxx_default_in and Vxxx_default_out, where default is the provider's secondary key, the link_name. When a host makes a request, it writes to the "in" pipe and reads from "out". When a provider publishes to a host (e.g. delivers a subscribed message from pubsub), the provider writes it to "out".

This is almost entirely unrelated to the current issue, with the exception that if we wanted to bundle breaking changes into the same place, we could potentially wait to sort out this actor multi-tenancy thing until we finish with the named pipes spike and subsequent implementation.
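To make the naming convention concrete, a sketch of how those pipe names might be derived (hypothetical helper, not an artifact of the actual spike):

// Build the host->provider ("in") and provider->host ("out") pipe names
// from the provider's public key and its link name, following the
// "Vxxx_default_in" / "Vxxx_default_out" convention described above.
fn provider_pipe_names(provider_key: &str, link_name: &str) -> (String, String) {
    (
        format!("{}_{}_in", provider_key, link_name),
        format!("{}_{}_out", provider_key, link_name),
    )
}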

just leaving a note here so I don't forget - we may want to think more about how group ids are generated and disclosed, concerning security and privacy.

1. it must be impossible for an actor to pretend they are in a group they are not

2. should an actor be able to know what group they are in?

3. should the group id be hidden from the capability provider?

4. should group id be numeric or random?

Actors should not be aware of their own membership within a group. Runtime configuration should be the only way in which to place an actor into a group. Think of the edge case where two potential targets reside in the same host, differentiated only by a group ID - we now essentially have to ensure that the target and the source of the invocation in an actor-to-actor scenario have the same group ID for multi-tenant security.
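A minimal sketch of the check that scenario implies, assuming the host (not the actor) is the sole source of group membership:

// Hypothetical host-side guard for actor-to-actor dispatch when multiple
// targets differ only by group ID: the source's group (known to the host
// from runtime configuration) must match the target's group.
fn authorize_actor_to_actor(source_group: &str, target_group: &str) -> Result<(), String> {
    if source_group == target_group {
        Ok(())
    } else {
        Err(format!(
            "cross-group invocation denied ({} -> {})",
            source_group, target_group
        ))
    }
}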

@brooksmtownsend
Member

brooksmtownsend commented May 13, 2021

For security purposes, if someone has access to a host to create link definitions (e.g. via the control interface, wash ctl link), that means that they are successfully authenticated with NATS into that lattice. We make no attempts to restrict security after that point, correct? Just trying to understand the possible attack vector of an actor trying to join an existing group.

@stevelr
Contributor

stevelr commented May 13, 2021

presumably, when an actor invokes host_call, the host can tell which actor instance it came from, without relying on the parameters provided in host_call, and from that look up its group id?

@autodidaddict
Member Author

Consumers can add call-specific security to verify that an actor is allowed to call another actor, or can restrict an actor from talking to a provider they already have signed claims for. Yes, the host can infer the source actor so it could keep the capability provider entirely ignorant of the group ID (unless the capability provider looked inside the module field...)

@autodidaddict
Member Author

presumably, when an actor invokes host_call, the host can tell which actor instance it came from, without relying on the parameters provided in host_call, and from that look up its group id?

Yes this can be captured in the initial closure when we set the wapc callback function.

@autodidaddict
Member Author

Another tidbit of context: we'll also have to ensure that call aliases are scoped to a group ID as well.

@brooksmtownsend
Member

Could all call aliases / links / actors be backwards-compatible by specifying that if a group ID is not specified, that group ID is default?

@stevelr
Contributor

stevelr commented May 13, 2021

This is the subject of my question 3 above

(unless the capability provider looked inside the module field...)

If the cap provider can initiate a message to an actor (which it can do for async communication to an actor, such as forwarding a nats subscription, or an http-server forwarding an HttpRequest), it needs to know the group id so it can get to the correct actor.

that means that a malicious - or buggy - capability provider can change the group id, and we can no longer guarantee that one tenant's data doesn't spill over to other tenants. Off the top of my head, I don't know an easy way to prevent this, since a cap provider linked to actors affiliated with different tenants has to be able to initiate calls to any of them. However, if there were a dedicated "channel" for each link (possibly named pipes?), it might make this kind of error less likely. Also, if the channel is managed by the host, and the group within the channel is not visible to the cap provider, it may provide slightly better privacy for the tenant's data.

@autodidaddict
Member Author

yes that's the general idea. Call aliases, link definitions, actor public keys -- all just get default appended if they aren't using an explicit group ID.

@stevelr
Contributor

stevelr commented May 13, 2021

for an analogy to demonstrate the privacy issue, let's say an actor is your browser's http proxy, and the cap provider is your ISP. You don't want your browser history to be tracked by the ISP, so you don't want the provider to be able to associate your tenant id with the data, unless you have explicitly added it to the data protocol. We should design the api to minimize this kind of privacy leakage.

@autodidaddict
Member Author

This is the subject of my question 3 above

(unless the capability provider looked inside the module field...)

If the cap provider can initiate a message to an actor (which it can do for async communication to an actor, such as forwarding a nats subscription, or an http-server forwarding an HttpRequest), it needs to know the group id so it can get to the correct actor.

that means that a malicious - or buggy - capability provider can change the group id, and we can no longer guarantee that one tenant's data doesn't spill over to other tenants. Off the top of my head, I don't know an easy way to prevent this, since a cap provider linked to actors affiliated with different tenants has to be able to initiate calls to any of them. However, if there were a dedicated "channel" for each link (possibly named pipes?), it might make this kind of error less likely. Also, if the channel is managed by the host, and the group within the channel is not visible to the cap provider, it may provide slightly better privacy for the tenant's data.

Right, so for those playing the home game... basically what we're getting at is that a capability provider could grab data intended for an actor with a group ID of x and then manipulate the target, delivering it to y instead. This could be the result of a bug or of malicious intent. Malicious intent should be fairly well ruled out, since all providers have claims and signed verification provenance paths.

What bugs me about this edge case is that it's entirely possible to do something similar right now. A malicious capability provider has a list of bound actors. It could choose to take an HTTP request bound for actor A and deliver it to actor B just for giggles.

@autodidaddict
Member Author

autodidaddict commented May 13, 2021

I'm starting to think that the Occam's Razor of WarGames may apply here - the only winning code is not to write it. Everything feels safer when we use the wasmCloud host process as the multi-tenancy boundary. Of course, when/if we switch to named pipes, now we have the possibility that two processes could share data without being in the same lattice.

We'd have to mitigate this by including the root pid in the pipe name to guarantee that if the same provider is being used in a different host on the same box connected to a different lattice they don't talk to each other.

@stevelr
Contributor

stevelr commented May 13, 2021

What bugs me about this edge case is that it's entirely possible to do something similar right now. A malicious capability provider has a list of bound actors. It could choose to take an HTTP request bound for actor A and deliver it to actor B just for giggles.

Agreed. It's also the same as the cap provider emailing the data out or posting it to twitter. It's native code and not in a sandbox so there has to be some level of trust.

I just want us to be really explicit about the threat model and affirmatively decide up-front what privacy guarantees we can and can't make. If we can identify a risk that can be mitigated in api & protocol design, then we should do that.

@autodidaddict
Member Author

I just want us to be really explicit about the threat model and affirmatively decide up-front what privacy guarantees we can and can't make. If we can identify a risk that can be mitigated in api & protocol design, then we should do that.

💯 - whatever attack vectors there are for this, I want to ensure that we know which ones they are and the likelihood that they'll be exploited. We'll want to document this not necessarily as a risk for first party providers, but definitely for trusting community providers (native, not WASI in the future).

But, to circle back around to the topic at hand: do we think it's worth implementing this "group ID" complex key for actors, or do we simply recommend that if you want multiple tenants, you run multiple wasmCloud host processes (the empty wasmcloud process doesn't even register in top)?

@autodidaddict
Member Author

For context... another question: isn't the notion of using a single wasmcloud host process to run actors that belong to different tenants similar to allowing two different tenants to have containers running in the same pod in Kubernetes? Nobody would ever expect this kind of security hole in kube and, moreover, I don't think anyone would take our word for it that no data was ever exposed or exfiltrated, since the data's sitting in the same process as some other tenant's code.

@stevelr
Contributor

stevelr commented May 13, 2021

Advantages of running each tenant in its own host:

  • security model is simpler
  • api is simpler
  • debugging will be simpler (if you want to set a breakpoint in the host while reproducing a bug, it doesn't have to block all traffic)
  • accounting for multi-tenant hosting providers is simpler (if they want to bill tenants by proportional usage)
  • each tenant can decide which versions of wasmcloud host and capability providers to run and set the versions in the manifest

@stevelr
Contributor

stevelr commented May 13, 2021

  • also, it's not a breaking api change

@autodidaddict moved this from To do to In progress in 0.19 - The March of the Runtimes, May 18, 2021
@autodidaddict
Member Author

Unless there are any other objections, I'll create an ADR entry that describes our stance on multi-tenancy, which, in short, is that a single wasmCloud host + lattice prefix cannot hold multi-tenant actors. This means that the same actor cannot exist in the same host with different configurations for the same providers (e.g. multi-tenancy).

@brooksmtownsend
Member

No objections, though I would like to emphasize the need for #46 (benchmark tests for the runtime) as we should avoid long host startup times and keep the binary size low if we want to advocate for multiple hosts for multi-tenancy. I like framing the host as a "pod" in some sense, while the actors are containers. Does @thomastaylor312 have any thoughts?

@autodidaddict
Member Author

Unsure how the tenancy requirements will affect benchmarks/performance. What we're suggesting here for tenancy doesn't change the codebase, only the documentation.

@stevelr
Contributor

stevelr commented May 18, 2021

+1 for ADR

@autodidaddict added this to In progress in wasmCloud 1.0-RC0, May 24, 2021
wasmCloud 1.0-RC0 automation moved this from In progress to Done Jul 15, 2021
rvolosatovs pushed a commit to rvolosatovs/wasmCloud that referenced this issue Mar 21, 2023
Signed-off-by: Brooks Townsend <brooks@cosmonic.com>