
Move admin and meta component into cluster controller role #1180

Merged
tillrohrmann merged 9 commits into restatedev:main from separate-admin-from-worker on Feb 15, 2024

Conversation

tillrohrmann (Contributor)

This PR moves the Admin and Meta components from the worker role into the cluster controller role. For this to happen, we needed to do the following things:

  1. Expose an invocation control API on the Worker grpc service and modify the Admin service to use it for killing/cancelling invocations
  2. Add a Schema grpc service that allows fetching schema information, and update the worker role to periodically fetch schema information
  3. Update subscriptions based on the latest schema information (start new and stop old subscriptions)
  4. Expose a storage query API on the Worker grpc service and update the Admin service to use this API to query the storage. The query results are sent as arrow-flight encoded RecordBatches.

Currently, the grpc address is hard-coded to 127.0.0.1:5122, which prevents the system from being run as separate processes.
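For illustration, a rough sketch of how the worker-side periodic schema fetch (step 2 above) could look. The `SchemaSvcClient`, `FetchSchemasRequest`, `Schemas`, and `apply_update_commands` names are placeholders, not the actual API introduced by this PR:

```rust
use std::time::Duration;

use tokio::time::{interval, MissedTickBehavior};

// Hypothetical stand-ins for the tonic-generated Schema service client and the
// worker's schema registry; the names do not match the PR exactly.
use crate::schema_client::{FetchSchemasRequest, SchemaSvcClient};
use crate::schemas::Schemas;

/// Periodically asks the cluster controller's Schema service for the current
/// schema update commands and applies them to the local registry.
async fn reload_schemas_loop(
    mut client: SchemaSvcClient<tonic::transport::Channel>,
    schemas: Schemas,
    period: Duration,
) {
    let mut ticker = interval(period);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    loop {
        ticker.tick().await;

        match client.fetch_schemas(FetchSchemasRequest::default()).await {
            Ok(response) => {
                // The PR sends the full list of schema update commands on every
                // fetch, so applying them repeatedly must be idempotent.
                schemas.apply_update_commands(response.into_inner().update_commands);
            }
            Err(err) => {
                // Expected to become more likely once the cluster controller and
                // the worker run as separate processes.
                tracing::debug!("Failed fetching schema information: {err}");
            }
        }
    }
}
```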

@tillrohrmann (Contributor Author)

The e2e tests are failing: https://github.com/restatedev/restate/actions/runs/7888143370/job/21525412929?pr=1180. Investigating what I broke.

@tillrohrmann (Contributor Author)

The e2e tests were broken because of restatedev/e2e#265. They should be fixed now.

@tillrohrmann (Contributor Author)

Still failing. Investigating what is going wrong with the e2e tests.

@tillrohrmann (Contributor Author)

I suspect there is a race condition between the e2e test invoking a service and the schema information propagating to the ingress.

@AhmedSoliman (Contributor) left a comment:

Great work @tillrohrmann. The changes look good to me, but I'd love to get clarification on a couple of questions about potential data loss.

invocation_termination: InvocationTermination,
) -> Result<(), Error> {
    let invocation_termination =
        bincode::serde::encode_to_vec(invocation_termination, bincode::config::standard())
Contributor:

Is this a transitional step, or do you think we will continue with the nested serialization in the long run?

Contributor Author:

I thought of it as a transitional step to simplify things for the time being. I guess once we introduce the Admin grpc service we will specify the InvocationTermination request properly and could then reuse it here.
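As a sketch of the nested serialization being discussed, the termination request is bincode-encoded into a bytes field of the grpc message on the admin side and decoded again on the worker side. The struct fields below are made up for illustration; only the bincode calls mirror the snippet above:

```rust
use serde::{Deserialize, Serialize};

// Illustrative stand-in for the real InvocationTermination type.
#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct InvocationTermination {
    invocation_id: String,
    kill: bool,
}

fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let termination = InvocationTermination {
        invocation_id: "inv-123".to_owned(),
        kill: true,
    };

    // Admin side: encode into the bytes payload of the grpc request.
    let payload: Vec<u8> =
        bincode::serde::encode_to_vec(&termination, bincode::config::standard())?;

    // Worker side: decode the payload back out of the request.
    let (decoded, _bytes_read): (InvocationTermination, usize) =
        bincode::serde::decode_from_slice(&payload, bincode::config::standard())?;

    assert_eq!(termination, decoded);
    Ok(())
}
```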

crates/node/src/roles/cluster_controller.rs (resolved)
crates/node/src/roles/worker.rs (outdated, resolved)
crates/node/src/roles/worker.rs (outdated, resolved)
.map_err(WorkerRoleError::Worker),
);
component_set.spawn(
Self::reload_schemas(
Contributor:

I assumed we will need to perform a blocking initial schema fetch before starting the worker. Do we have a risk of partition processors failing/ignoring invocations because they can't see the schema on startup, or if it takes time? I'm worried about the risk of data loss if a PP drops an invocation because it thinks the service has been removed (vs. that we are running with old Meta information).

Contributor Author (@tillrohrmann, Feb 14, 2024):

The PP won't drop any invocations it does not know about. What can happen is that the ingress rejects new invocations for some time until it learns about the updated schema information. What can also happen, and this is probably not so nice, is that the PP will try to invoke some invocations for which it does not know the deployment information. This would then lead to a couple of retries, which causes some noise in the logs. To prevent this, fetching the schema before starting the worker is probably a good idea. This won't solve the problem in all cases, but in most.
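A minimal sketch of what the suggested "fetch the schema before starting the worker" step could look like, assuming the same hypothetical client and registry types as in the earlier sketch; the retry budget and backoff values are arbitrary:

```rust
use std::time::Duration;

use tokio::time::sleep;

// Same hypothetical client/registry types as in the earlier sketch.
use crate::schema_client::{FetchSchemasRequest, SchemaSvcClient};
use crate::schemas::Schemas;

/// Blocks worker startup until one schema fetch has succeeded (or the retry
/// budget is exhausted), so partition processors do not start with an empty
/// schema registry and needlessly retry invocations with unknown deployments.
async fn initial_schema_fetch(
    client: &mut SchemaSvcClient<tonic::transport::Channel>,
    schemas: &Schemas,
    max_attempts: usize,
) -> anyhow::Result<()> {
    let mut backoff = Duration::from_millis(100);
    let mut attempt = 0;

    loop {
        attempt += 1;
        match client.fetch_schemas(FetchSchemasRequest::default()).await {
            Ok(response) => {
                schemas.apply_update_commands(response.into_inner().update_commands);
                return Ok(());
            }
            Err(err) if attempt < max_attempts => {
                tracing::warn!("Initial schema fetch failed (attempt {attempt}): {err}");
                sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(5));
            }
            Err(err) => return Err(err.into()),
        }
    }
}
```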

Contributor:

This sounds like a fragile design assumption that might not stand the test of time. The good news is that it won't cause problems at the moment, but I think we can have a more robust approach with the following design:

  • The schema is versioned.
  • When the schema is updated on an admin node, the admin node informs all nodes with the "ingress" role about the update. This is either a notification of the higher version number or a full push of the new schema object. Ideally, we finish the ingress version notification before we respond to user-requested schema changes, in order to achieve monotonic reads.
  • The ingress writes its last-seen schema version into all messages going through bifrost or in RPCs to workers. A worker processing an invocation message compares the schema version; if it's running with an older version, it pulls the new schema before moving forward. If the worker has the same schema as the incoming message (whether from another PP or the ingress), it can more confidently reject invocations if we need to do that in the future (see the sketch after this list).
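A sketch of the version check described in the last bullet, with placeholder types that do not come from the codebase:

```rust
// Placeholder types illustrating the versioned-schema proposal; none of these
// names come from the actual codebase.

/// Monotonically increasing schema version attached to every message that an
/// ingress writes to bifrost or sends via RPC to a worker.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SchemaVersion(u64);

struct InvokeMessage {
    /// Last schema version the ingress had seen when it accepted the invocation.
    ingress_schema_version: SchemaVersion,
    // ... invocation payload elided ...
}

struct Worker {
    local_schema_version: SchemaVersion,
}

impl Worker {
    /// Makes sure the worker's schema is at least as new as the one the sender
    /// observed before processing the message. Only then can a "service not
    /// found" decision be trusted rather than blamed on stale Meta information.
    async fn ensure_schema_for(&mut self, msg: &InvokeMessage) -> anyhow::Result<()> {
        if self.local_schema_version < msg.ingress_schema_version {
            self.pull_schema_up_to(msg.ingress_schema_version).await?;
        }
        Ok(())
    }

    async fn pull_schema_up_to(&mut self, target: SchemaVersion) -> anyhow::Result<()> {
        // Hypothetical: fetch the update commands newer than the local version
        // from the admin node, apply them, then bump the local version.
        self.local_schema_version = target;
        Ok(())
    }
}
```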

Contributor Author:

Sending the schema information version with the invocation messages is a good idea to achieve monotonic reads :-)

}
}
Err(err) => {
debug!("Failed fetching schema information: {err}")
Contributor:

Should this be a debug in the single-node setup? Why do we expect failures to happen?

Contributor Author:

In the single-node setup this should indeed not happen. It is a bit of a preparatory step for allowing the cluster controller and the worker role to be run as two separate processes.

Contributor:

I'd rather make this scream loud if we are confident that it shouldn't happen :)

Contributor Author:

Once the admin role is separated from the worker role, which is the next step I intend to implement, it will be possible for the worker to be unable to reach the admin.

crates/node-services/proto/schema.proto (outdated, resolved)
crates/node-services/proto/worker.proto (outdated, resolved)

package dev.restate.schema;

service Schema {
Contributor:

Maybe this can be the seed for the Metadata service? I also think we should postfix all grpc services with something to make them distinct from potential message types, perhaps with Svc at the end or something.

Note: The postfix comment isn't related to this particular PR.

Contributor Author:

Yes, I thought so too concerning the Metadata service. Should we start calling it MetadataSvc right away? Given that these services are currently only used internally, it is probably also possible to rename them later.

Re postfix: +1, will change it.

crates/node/src/server/service.rs (resolved)
AhmedSoliman added a commit that referenced this pull request Feb 14, 2024
This introduces a central system to manage long-running and background restate async tasks.
The core of this proposal is to help us lean more towards spawning self-contained tasks that
are addressable, trackable, and cancellable, rather than the current deep future poll trees.

It also allows nice possibilities like:
- Limited structured concurrency (no auto-waiting for children though)
- Graceful cancellations/shutdown reduce the risk of cancellation-unsafe drops
- Potentially less memory; deeply nested future state machines become shallower.
- A single place where we can auto-propagate tracing context, flag tasks by priority, schedule tasks on different runtime, provide observability of what kind of tasks are running, etc.
- Scoped tasks allow scoped cancellation. (cooperatively cancel all tasks for a partition id, or a specific tenant in the future, or be specific and filter specific kinds of tasks)
- Limit concurrency of certain tasks by kind, partition, or tenant, etc.
- Distributing tasks among multiple tokio runtimes based on the task kind
- Support for different abort-policy based on the task kind.

At the moment, I only migrated small pieces of our code to this system to make the merging with #1180 easier.
Once #1180 is merged, I'll move the rest of our services and bifrost to use it.
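A heavily simplified sketch of what such a task center could look like; all names (`TaskCenter`, `TaskKind`, `TaskId`) are hypothetical and only illustrate the addressable, trackable, and cancellable properties described above, using tokio and tokio-util primitives:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TaskId(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskKind {
    PartitionProcessor,
    RpcServer,
    Background,
}

struct TaskHandle {
    kind: TaskKind,
    cancel: CancellationToken,
    join: JoinHandle<()>,
}

#[derive(Default)]
pub struct TaskCenter {
    next_id: AtomicU64,
    tasks: Mutex<HashMap<TaskId, TaskHandle>>,
}

impl TaskCenter {
    /// Spawns a cancellable, addressable task and returns its id.
    pub fn spawn<F, Fut>(&self, kind: TaskKind, f: F) -> TaskId
    where
        F: FnOnce(CancellationToken) -> Fut,
        Fut: std::future::Future<Output = ()> + Send + 'static,
    {
        let id = TaskId(self.next_id.fetch_add(1, Ordering::Relaxed));
        let cancel = CancellationToken::new();
        let join = tokio::spawn(f(cancel.clone()));
        self.tasks
            .lock()
            .unwrap()
            .insert(id, TaskHandle { kind, cancel, join });
        id
    }

    /// Cooperatively cancels every task of the given kind (scoped cancellation).
    pub fn cancel_kind(&self, kind: TaskKind) {
        for handle in self.tasks.lock().unwrap().values() {
            if handle.kind == kind {
                handle.cancel.cancel();
            }
        }
    }

    /// Cancels a single task by id and waits for it to finish (graceful shutdown).
    pub async fn cancel_task(&self, id: TaskId) {
        let handle = self.tasks.lock().unwrap().remove(&id);
        if let Some(handle) = handle {
            handle.cancel.cancel();
            let _ = handle.join.await;
        }
    }
}
```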
@AhmedSoliman mentioned this pull request Feb 14, 2024
@tillrohrmann (Contributor Author)

The current CI run fails because of conflicting changes on main. Will rebase onto the latest main to resolve the problem.

@AhmedSoliman (Contributor) left a comment:

🚀


As a first step to move the Meta service into the cluster controller role, this commit removes the restate_worker_api::Handle dependency from this component.

This commit decouples the meta and admin components from the worker and moves them over to the cluster controller role. The worker learns about schema information changes by periodically asking the schema service for updates. At the moment, every schema information fetch request sends the whole list of schema update commands over.

Currently, it is not possible to query the storage via the Admin service. This will be fixed with a follow-up commit.

The QueryStorage grpc method allows querying the storage of a worker. It returns the result as arrow-flight encoded RecordBatches. The admin service uses it to forward queries it receives on /query and to create responses for them.

This commit renames the following grpc services:

* Worker -> WorkerSvc
* ClusterController -> ClusterControllerSvc
* NodeCtrl -> NodeCtrlSvc

This commit enables the admin service to push schema updates to the worker in case of schema changes.
@tillrohrmann merged commit 429b74a into restatedev:main on Feb 15, 2024
@tillrohrmann deleted the separate-admin-from-worker branch on February 15, 2024 at 15:14
AhmedSoliman added a commit that referenced this pull request Feb 15, 2024
AhmedSoliman added a commit that referenced this pull request Feb 16, 2024
AhmedSoliman added a commit that referenced this pull request Feb 16, 2024