Add decommissioning state to ScalarDB Server #851

Merged

2 commits merged into master from add-decommissioning-state-to-server on Apr 25, 2023

Conversation

brfrn169
Collaborator

@brfrn169 brfrn169 commented Apr 20, 2023

This PR adds a decommissioning state to ScalarDB Server to achieve graceful shutdown. Please take a look!

@brfrn169 brfrn169 self-assigned this Apr 20, 2023
shutdown(MAX_WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS);
logger.info("The server shut down.");
logger.info("Signal received. Decommissioning ...");
decommission();
brfrn169 (Collaborator, Author) commented on the diff above:

Before shutting down ScalarDB Server, we change the healthService response to NOT_SERVING and sleep for some time. This way, we can wait until no new requests are coming in.
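
As a rough sketch of what this means in code (not the exact code in this PR; the class name and wait-time constant are assumptions), the decommissioning step boils down to flipping the standard gRPC health service to NOT_SERVING and then sleeping:

    import io.grpc.health.v1.HealthCheckResponse.ServingStatus;
    import io.grpc.protobuf.services.HealthStatusManager;

    class DecommissionSketch {
      // Assumed wait time; the real value is a ScalarDB Server setting.
      private static final long DECOMMISSIONING_DURATION_MILLIS = 30_000;

      private final HealthStatusManager healthService = new HealthStatusManager();

      // Report NOT_SERVING so Envoy stops routing here, then wait before the
      // actual gRPC server shutdown begins.
      void decommission() throws InterruptedException {
        healthService.setStatus(
            HealthStatusManager.SERVICE_NAME_ALL_SERVICES, ServingStatus.NOT_SERVING);
        Thread.sleep(DECOMMISSIONING_DURATION_MILLIS);
      }
    }

(Depending on the grpc-java version, HealthStatusManager may live in io.grpc.services instead of io.grpc.protobuf.services.)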

@komamitsu
Contributor

@brfrn169 In the current implementation, io.grpc.Server#shutdown(), which rejects new requests, and io.grpc.Server#awaitTermination() are called. So it looks to me like it should work, but I'm probably missing something. Can you share what issue we're going to solve?
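
For reference, the plain shutdown pattern described here uses only the standard io.grpc.Server API; a minimal sketch (the method and parameter names are illustrative, not ScalarDB code):

    import io.grpc.Server;
    import java.util.concurrent.TimeUnit;

    class ShutdownSketch {
      // shutdown() makes the server reject new RPCs; awaitTermination() waits for
      // RPCs that were already accepted to finish.
      static void stop(Server server, long maxWaitTimeMillis) throws InterruptedException {
        server.shutdown();
        if (!server.awaitTermination(maxWaitTimeMillis, TimeUnit.MILLISECONDS)) {
          // Deadline exceeded: forcibly cancel whatever is still running.
          server.shutdownNow();
        }
      }
    }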

@kota2and3kan
Contributor

@komamitsu (@brfrn169 )
The summary of the issues that we are going to resolve is as follows.

In the current deployment on Kubernetes, where we use Envoy with ScalarDB Server, there is a possibility that clients receive a 5xx error when we perform a rolling update of ScalarDB Server.

We already fixed a similar issue in ScalarDB Cluster, so I will explain the behavior of ScalarDB Cluster first. After that, I will explain the behavior (what we are going to resolve) of ScalarDB Server.

ScalarDB Cluster (issues already fixed)

When we shut down (send SIGTERM to) ScalarDB Cluster, it works as follows.

  1. ScalarDB Cluster returns NOT_SERVING and waits 30 seconds by default.
  2. Envoy detects NOT_SERVING (i.e., not healthy) via its health check feature.
  3. Envoy removes the ScalarDB Cluster node from the load balancing pool (i.e., Envoy doesn't send new requests to the ScalarDB Cluster node that is shutting down).
  4. (After waiting 30 seconds) ScalarDB Cluster starts its shutdown process.
  5. ScalarDB Cluster waits for all ongoing transactions to complete.
  6. After all ongoing transactions are completed, ScalarDB Cluster stops.

Note: This fix was done by this PR.
https://github.com/scalar-labs/scalardb-cluster/pull/44

Through the above process, ScalarDB Cluster can guarantee that it doesn't receive new requests after the shutdown process starts. So, from the client's perspective, we can avoid 5xx errors when we perform a rolling update.

ScalarDB Server (what we are going to fix)

When we shut down (send SIGTERM to) ScalarDB Server, it works as follows.

  1. ScalarDB Server starts its shutdown process immediately. After this, new requests receive a 5xx error.
  2. ScalarDB Server waits for all ongoing transactions to complete.
  3. The Endpoint resource (i.e., the service discovery information for Envoy) is updated by Kubernetes.
  4. Envoy removes the ScalarDB Server from the load balancing pool based on the update of the Endpoint resource.
  5. After all ongoing transactions are completed, ScalarDB Server stops.

Note: Steps 2 and 3 run in parallel.

Since there is a time lag between 1. and 4., there is a possibility that Envoy sends a new request to a ScalarDB Server that is in the process of stopping. In this case, ScalarDB Server returns a 5xx error to the client.

To fix this issue, we update ScalarDB Server to implement the same graceful shutdown process as ScalarDB Cluster. After this update, Envoy can detect and handle ScalarDB Server's shutdown properly.

@brfrn169
Collaborator Author

@kota2and3kan Thank you for your explanation!

@komamitsu
Contributor

@kota2and3kan Thanks for the detailed background! It's very helpful for me.

Just out of curiosity, doesn't Kubernetes (and/or Envoy?) have a rolling update orchestration feature like the following?

  1. Envoy marks one of the pods as disabled (or just removes it from the available pod list?)
  2. the target pod is restarted
  3. Envoy marks the restarted pod as enabled (or just adds it back to the available pod list?)

AWS's CodeDeploy has this kind of feature, and I just thought it would be great if k8s supported that.

Contributor

@komamitsu komamitsu left a comment

LGTM! 👍

@kota2and3kan
Contributor

@komamitsu

Just out of curiosity, doesn't Kubernetes (and/or Envoy?) have a rolling update orchestration feature like the following?

In short, Kubernetes has a rolling update feature, and Envoy has health check and service discovery features.

However, since Envoy is one of the application pods from the perspective of Kubernetes (i.e., it is not a system pod), I don't think there is a built-in feature on the Kubernetes side that makes Kubernetes and Envoy cooperate.

  1. Envoy marks one of the pods as disabled (or just removes it from the available pod list?)
  2. the target pod is restarted
  3. Envoy marks the restarted pod as enabled (or just adds it back to the available pod list?)

In your question, 1. and 3. are Envoy's behavior (from the perspective of Kubernetes, an application's behavior), and 2. is Kubernetes' behavior.

Since Envoy is one of the application pods, Kubernetes cannot control Envoy's behavior.

Also, Envoy is a generic proxy (it is not a dedicated proxy for Kubernetes). So, I think Envoy cannot control Kubernetes' pod deletion behavior.

Therefore, we have to achieve a graceful shutdown by combining several functions of Kubernetes and Envoy.


For your reference, I will explain Kubernetes' behavior and Envoy's behavior, respectively. I will also explain the ScalarDB Server-specific challenges.

Kubernetes' feature

As a Kubernetes feature, there is a Service resource. You can use this resource for service discovery (you can use it like a load balancer). The Service resource uses the pod list in the Endpoint resource to decide which pod to send a request to.

For example, if there are three pods, the Endpoint resource has the following information (the IP addresses of the pods).

  • Pods (3 pods)
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-76699489b4-fnz97   1/1     Running   0          55s
    scalardb-cluster-node-76699489b4-j4tqh   1/1     Running   0          55s
    scalardb-cluster-node-76699489b4-mtzxx   1/1     Running   0          55s
  • Service (you can use this resource for service discovery)
    $ kubectl get svc scalardb-cluster-metrics
    NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    scalardb-cluster-metrics   ClusterIP   10.111.18.11   <none>        9080/TCP   2m2s
  • Endpoint (manage list of pods)
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.151:9080,10.244.1.152:9080,10.244.1.153:9080   2m24s

As you can see, the Endpoint resource has a pod list that includes 3 IP addresses. So, if you access the Service scalardb-cluster-metrics, you reach one of the three pods. The Service resource sends your request to one of the pods like a load balancer.

And, if we delete a pod, Kubernetes updates this IP address list in the Endpoint resource. For example, if I delete one pod, we can see that the Endpoint resource (IP address list) is updated by Kubernetes as follows. (In this test case, a new pod is automatically created by the Deployment resource when I delete one pod manually.)

  • Initial state of pods (there are three pods)
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-ggx99   1/1     Running   0          2m52s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running   0          2m31s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running   0          2m10s
  • Initial state of Endpoint (there are three IP addresses of pods)
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.160:9080,10.244.1.161:9080,10.244.1.162:9080   17m
  • Delete one pod manually
    $ kubectl delete pod scalardb-cluster-node-8485886d5c-ggx99
    pod "scalardb-cluster-node-8485886d5c-ggx99" deleted
  • Pod status when the pod deletion is in progress (one pod is in Terminating status and a new pod is in ContainerCreating status)
    $ kubectl get pod
    NAME                                     READY   STATUS              RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-ggx99   1/1     Terminating         0          3m10s
    scalardb-cluster-node-8485886d5c-j8qjq   0/1     ContainerCreating   0          0s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running             0          2m49s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running             0          2m28s
  • Endpoint status when the pod deletion is in progress
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                             AGE
    scalardb-cluster-metrics   10.244.1.161:9080,10.244.1.162:9080   18m
    • As you can see, the Endpoint resource (a list of pod IPs) has only two IP addresses. The IP of the deleted pod is automatically removed from this list by Kubernetes.
  • Pod status after the new pod is created
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-j8qjq   1/1     Running   0          41s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running   0          3m30s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running   0          3m9s
  • Endpoint status after the new pod is created
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.161:9080,10.244.1.162:9080,10.244.1.163:9080   18m
    • As you can see, the Endpoint resource (a list of pod IPs) has three IP addresses. The IP of the created new pod is automatically added to this list by Kubernetes.

In the Kubernetes environment, restarting a pod means deleting the pod and creating a new one. When we restart a pod, Kubernetes updates the Endpoint resource (the IP address list) automatically, as shown above.

Note: You can see the description of this behavior in the official document Termination of Pods. Also, you can see the details on pod termination behavior in this article.

This is the behavior of Kubernetes' Service and Endpoint resources, which are native features of Kubernetes. So, basically, Kubernetes can perform a rolling update by updating the Endpoint resource (the list of pods for load balancing) based on pod status.

However, Kubernetes is an asynchronous system, so pod deletion and updating the Endpoint resource are done asynchronously. In other words, Kubernetes doesn't guarantee the order of these operations. So, there is a possibility that the Service resource sends new requests to a pod that is being deleted.

To mitigate this issue, users/applications can sleep before starting the shutdown process (i.e., use the sleep to wait for the Endpoint update). You can do this with Kubernetes' preStop hook feature or by implementing the sleep in your application.

Note: You can see the details on this workaround in this article and this article.
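
As a tiny illustration of the "implement the sleep in your application" option (the 30-second duration is an arbitrary example; the preStop hook alternative is configured in the pod spec instead, not in Java):

    import java.util.concurrent.TimeUnit;

    class SleepBeforeShutdownSketch {
      // Register a shutdown hook that waits before running the real shutdown logic,
      // giving Kubernetes time to propagate the Endpoint update.
      static void installShutdownHook(Runnable actualShutdown) {
        Runtime.getRuntime()
            .addShutdownHook(
                new Thread(
                    () -> {
                      try {
                        TimeUnit.SECONDS.sleep(30);
                      } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                      }
                      actualShutdown.run();
                    }));
      }
    }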

ScalarDB Server specific things

As I mentioned above, Kubernetes updates the pod list automatically and can perform a rolling update. So, we can do a rolling update by using the Service resource as a load balancer in the Kubernetes environment.

However, we cannot use the Service resource directly as a load balancer for ScalarDB Server.

Since the Service resource is an L4 load balancer, it cannot handle L7 connections (e.g., gRPC state) properly. And ScalarDB Server uses gRPC bidirectional streaming RPCs.

So, we have to use a load balancer that can handle L7 (gRPC bidirectional streaming RPCs) properly for connections between ScalarDB Server and clients. This is why we use Envoy as the load balancer for ScalarDB Server. Envoy can handle L7 (gRPC) connections.

Envoy's feature

Envoy has a health checking feature for upstream servers. If a health check fails, Envoy doesn't send new requests to the failed upstream server.
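
To make this concrete: for gRPC upstreams, Envoy's health checker essentially calls the standard grpc.health.v1.Health/Check RPC on each node. A minimal Java probe doing the same thing might look like this (the address and port are placeholders):

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import io.grpc.health.v1.HealthCheckRequest;
    import io.grpc.health.v1.HealthCheckResponse;
    import io.grpc.health.v1.HealthGrpc;

    class HealthProbeSketch {
      public static void main(String[] args) {
        // Placeholder address; Envoy would probe each upstream ScalarDB Server node.
        ManagedChannel channel =
            ManagedChannelBuilder.forAddress("localhost", 60051).usePlaintext().build();
        HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
        // An empty service name asks about the server as a whole.
        HealthCheckResponse response =
            stub.check(HealthCheckRequest.newBuilder().setService("").build());
        // SERVING -> keep routing to this node; NOT_SERVING -> stop sending new requests.
        System.out.println("Health status: " + response.getStatus());
        channel.shutdown();
      }
    }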

Also, regarding maintaining the upstream server list, Envoy can detect pod deletion via service discovery. If the service discovery information changes, Envoy updates its upstream server list accordingly. In the Kubernetes environment, the service discovery source is the Service resource. So, Envoy can detect pod deletion in the Kubernetes environment as follows.

  1. Envoy maintains its upstream server list based on the Service (strictly speaking, the Endpoint) resource.
  2. One pod is deleted.
  3. Kubernetes removes the pod from the Endpoint resource. In other words, the Service resource (service discovery for Envoy) is updated.
  4. Envoy detects the update of the Service resource.
  5. Envoy updates its upstream server list based on the updated Service resource of Kubernetes.

You can see the details of the behavior (service discovery + health check behavior) in the official document On eventually consistent service discovery.

Issues of ScalarDB Server

As I mentioned, Envoy can detect pod deletion and update the upstream server list. However, in both the health check and service discovery cases, there is a time lag between the pod deletion and Envoy detecting it.

It is difficult to remove this time lag completely, because the health check runs only every few seconds, and updating the Endpoint resource is done asynchronously (again, Kubernetes is an asynchronous system).

So, there is a possibility that Envoy sends a new request to a pod that is being deleted. This is the issue that we want to resolve.

How to resolve the issue

To resolve the ScalarDB Server issue mentioned above, we have to achieve a graceful shutdown by combining the following functions.

  • Endpoint updates (Kubernetes' feature)
  • Health check (Envoy's feature)
  • Service discovery (Envoy's feature)
  • gRPC Health Check (Implemented in ScalarDB Server)
  • Sleep (Implemented in ScalarDB Server)

By using these functions, we can achieve the following behavior (i.e., graceful shutdown) for ScalarDB Server; a rough server-side code sketch follows the list.

  1. Delete/stop the ScalarDB Server pod.
  2. Kubernetes removes the deleted pod's IP from the Endpoint resource.
  3. Kubernetes sends SIGTERM to the ScalarDB Server process in the pod.
  4. After receiving SIGTERM, ScalarDB Server returns NOT_SERVING as the result of the gRPC Health Check and waits for 30 seconds.
  5. Envoy detects the NOT_SERVING status of the pod. After that, Envoy doesn't send new requests to the pod.
  6. Envoy detects the service discovery (i.e., Endpoint resource) change and removes the pod from its upstream server list.
  7. After waiting for 30 seconds, ScalarDB Server starts its actual shutdown process.
  8. ScalarDB Server completes all ongoing transactions. After that, ScalarDB Server stops.

Note: Operations 2. through 6. run in parallel.
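
Putting the server side of this together, a minimal sketch of steps 4, 7, and 8 might look like the following. The class name, port, and constants are assumptions for illustration, not the actual ScalarDB Server code; the real implementation also registers ScalarDB's own gRPC services and waits for ongoing transactions to complete.

    import io.grpc.Server;
    import io.grpc.ServerBuilder;
    import io.grpc.health.v1.HealthCheckResponse.ServingStatus;
    import io.grpc.protobuf.services.HealthStatusManager;
    import java.util.concurrent.TimeUnit;

    public class GracefulShutdownSketch {
      private static final long DECOMMISSIONING_DURATION_MILLIS = 30_000; // assumed default
      private static final long MAX_WAIT_TIME_MILLIS = 60_000;            // assumed value

      public static void main(String[] args) throws Exception {
        HealthStatusManager healthService = new HealthStatusManager();
        Server server =
            ServerBuilder.forPort(60051) // placeholder port
                .addService(healthService.getHealthService())
                // ScalarDB's own gRPC services would also be registered here.
                .build()
                .start();
        healthService.setStatus(
            HealthStatusManager.SERVICE_NAME_ALL_SERVICES, ServingStatus.SERVING);

        Runtime.getRuntime()
            .addShutdownHook(
                new Thread(
                    () -> {
                      try {
                        // Step 4: report NOT_SERVING so Envoy stops routing here, then wait.
                        healthService.setStatus(
                            HealthStatusManager.SERVICE_NAME_ALL_SERVICES,
                            ServingStatus.NOT_SERVING);
                        Thread.sleep(DECOMMISSIONING_DURATION_MILLIS);
                        // Steps 7 and 8: stop accepting new RPCs and let ongoing ones finish.
                        server.shutdown();
                        server.awaitTermination(MAX_WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS);
                      } catch (Exception e) {
                        e.printStackTrace();
                      }
                    }));

        server.awaitTermination();
      }
    }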

Contributor

@kota2and3kan kota2and3kan left a comment

LGTM! Thank you!

@komamitsu
Contributor

@kota2and3kan Thanks for the great explanation!

Contributor

@feeblefakie feeblefakie left a comment

LGTM! Thank you!

Contributor

@Torch3333 Torch3333 left a comment

LGTM, thank you!

@brfrn169 brfrn169 merged commit 62c6c2a into master Apr 25, 2023
13 checks passed
@brfrn169 brfrn169 deleted the add-decommissioning-state-to-server branch April 25, 2023 06:28
@brfrn169 brfrn169 added this to Done in 3.9.0 Apr 27, 2023