Add decommissioning state to ScalarDB Server #851

Merged

2 commits merged into master from add-decommissioning-state-to-server on Apr 25, 2023

Conversation

brfrn169
Collaborator

@brfrn169 brfrn169 commented Apr 20, 2023

This PR adds a decommissioning state to ScalarDB Server to achieve graceful shutdown. Please take a look!

@brfrn169 brfrn169 self-assigned this Apr 20, 2023
shutdown(MAX_WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS);
logger.info("The server shut down.");
logger.info("Signal received. Decommissioning ...");
decommission();
brfrn169 (Collaborator, Author) commented on the diff above:

Before shutting down ScalarDB Server, we change the healthService response to NOT_SERVING and sleep for some time. This way, we can wait until no new requests are coming in.
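
As a rough sketch of what this means in code (not the exact code in this PR; the class name and wait-time constant are assumptions), the decommissioning step boils down to flipping the standard gRPC health service to NOT_SERVING and then sleeping:

    import io.grpc.health.v1.HealthCheckResponse.ServingStatus;
    import io.grpc.protobuf.services.HealthStatusManager;

    class DecommissionSketch {
      // Assumed wait time; the real value is a ScalarDB Server setting.
      private static final long DECOMMISSIONING_DURATION_MILLIS = 30_000;

      private final HealthStatusManager healthService = new HealthStatusManager();

      // Report NOT_SERVING so Envoy stops routing here, then wait before the
      // actual gRPC server shutdown begins.
      void decommission() throws InterruptedException {
        healthService.setStatus(
            HealthStatusManager.SERVICE_NAME_ALL_SERVICES, ServingStatus.NOT_SERVING);
        Thread.sleep(DECOMMISSIONING_DURATION_MILLIS);
      }
    }

(Depending on the grpc-java version, HealthStatusManager may live in io.grpc.services instead of io.grpc.protobuf.services.)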

@komamitsu
Contributor

@brfrn169 In the current implementation, io.grpc.Server#shutdown(), which rejects new requests, and io.grpc.Server#awaitTermination() are called. So it looks to me like it should work, but I'm probably missing something. Can you share what issue we're going to solve?
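
For reference, the plain shutdown pattern described here uses only the standard io.grpc.Server API; a minimal sketch (the method and parameter names are illustrative, not ScalarDB code):

    import io.grpc.Server;
    import java.util.concurrent.TimeUnit;

    class ShutdownSketch {
      // shutdown() makes the server reject new RPCs; awaitTermination() waits for
      // RPCs that were already accepted to finish.
      static void stop(Server server, long maxWaitTimeMillis) throws InterruptedException {
        server.shutdown();
        if (!server.awaitTermination(maxWaitTimeMillis, TimeUnit.MILLISECONDS)) {
          // Deadline exceeded: forcibly cancel whatever is still running.
          server.shutdownNow();
        }
      }
    }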

@kota2and3kan
Contributor

@komamitsu (@brfrn169 )
The summary of the issues that we are going to resolve is as follows.

In the current deployment on Kubernetes, where we use Envoy with ScalarDB Server, there is a possibility that clients receive a 5xx error when we perform a rolling update of ScalarDB Server.

We already fixed a similar issue in ScalarDB Cluster, so I will explain the behavior of ScalarDB Cluster first. After that, I will explain the behavior (what we are going to resolve) of ScalarDB Server.

ScalarDB Cluster (issues already fixed)

When we shut down (send SIGTERM to) ScalarDB Cluster, it works as follows.

  1. ScalarDB Cluster returns NOT_SERVING and waits 30 seconds by default.
  2. Envoy detects NOT_SERVING (i.e., not healthy) via its health check feature.
  3. Envoy removes the ScalarDB Cluster node from the load balancing pool (i.e., Envoy doesn't send new requests to the ScalarDB Cluster node that is shutting down).
  4. (After waiting 30 seconds) ScalarDB Cluster starts its shutdown process.
  5. ScalarDB Cluster waits for all ongoing transactions to complete.
  6. After all ongoing transactions are completed, ScalarDB Cluster stops.

Note: This fix was done by this PR.
https://github.com/scalar-labs/scalardb-cluster/pull/44

Through the above process, ScalarDB Cluster can guarantee that it doesn't receive new requests after the shutdown process starts. So, from the client's perspective, we can avoid 5xx errors when we perform a rolling update.

ScalarDB Server (what we are going to fix)

When we shut down (send SIGTERM to) ScalarDB Server, it works as follows.

  1. ScalarDB Server starts its shutdown process immediately. After this, new requests receive a 5xx error.
  2. ScalarDB Server waits for all ongoing transactions to complete.
  3. The Endpoint resource (i.e., the service discovery information for Envoy) is updated by Kubernetes.
  4. Envoy removes the ScalarDB Server from the load balancing pool based on the update of the Endpoint resource.
  5. After all ongoing transactions are completed, ScalarDB Server stops.

Note: Steps 2 and 3 run in parallel.

Since there is a time lag between 1. and 4., there is a possibility that Envoy sends a new request to a ScalarDB Server that is in the process of stopping. In this case, ScalarDB Server returns a 5xx error to the client.

To fix this issue, we update ScalarDB Server to implement the same graceful shutdown process as ScalarDB Cluster. After this update, Envoy can detect and handle ScalarDB Server's shutdown properly.

@brfrn169
Collaborator Author

@kota2and3kan Thank you for your explanation!

@komamitsu
Contributor

@kota2and3kan Thanks for the detailed background! It's very helpful for me.

Just out of curiosity, doesn't Kubernetes (and/or Envoy?) have a rolling update orchestration feature like the following?

  1. Envoy marks one of the pods as disabled (or just removes it from the available pod list?)
  2. the target pod is restarted
  3. Envoy marks the restarted pod as enabled (or just adds it back to the available pod list?)

AWS's CodeDeploy has this kind of feature, and I just thought it would be great if k8s supported that.

Contributor

@komamitsu komamitsu left a comment

LGTM! 👍

@kota2and3kan
Contributor

@komamitsu

Just out of curiosity, doesn't Kubernetes (and/or Envoy?) have a rolling update orchestration feature like the following?

In short, Kubernetes has a rolling update feature, and Envoy has health check and service discovery features.

However, since Envoy is one of the application pods from the perspective of Kubernetes (i.e., it is not a system pod), I don't think there is a built-in feature on the Kubernetes side that makes Kubernetes and Envoy cooperate.

  1. Envoy marks one of the pods as disabled (or just removes it from the available pod list?)
  2. the target pod is restarted
  3. Envoy marks the restarted pod as enabled (or just adds it back to the available pod list?)

In your question, 1. and 3. are Envoy's behavior (from the perspective of Kubernetes, an application's behavior), and 2. is Kubernetes' behavior.

Since Envoy is one of the application pods, Kubernetes cannot control Envoy's behavior.

Also, Envoy is a generic proxy (it is not a dedicated proxy for Kubernetes). So, I think Envoy cannot control Kubernetes' pod deletion behavior.

Therefore, we have to achieve a graceful shutdown by combining several functions of Kubernetes and Envoy.


For your reference, I will explain Kubernetes' behavior and Envoy's behavior, respectively. I will also explain the ScalarDB Server-specific challenges.

Kubernetes' feature

As a Kubernetes feature, there is a Service resource. You can use this resource for service discovery (you can use it like a load balancer). The Service resource uses the pod list in the Endpoint resource to decide which pod to send a request to.

For example, if there are three pods, the Endpoint resource has the following information (the IP addresses of the pods).

  • Pods (3 pods)
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-76699489b4-fnz97   1/1     Running   0          55s
    scalardb-cluster-node-76699489b4-j4tqh   1/1     Running   0          55s
    scalardb-cluster-node-76699489b4-mtzxx   1/1     Running   0          55s
  • Service (you can use this resource for service discovery)
    $ kubectl get svc scalardb-cluster-metrics
    NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    scalardb-cluster-metrics   ClusterIP   10.111.18.11   <none>        9080/TCP   2m2s
  • Endpoint (manage list of pods)
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.151:9080,10.244.1.152:9080,10.244.1.153:9080   2m24s

As you can see, the Endpoint resource has a pod list that includes 3 IP addresses. So, if you access the Service scalardb-cluster-metrics, you reach one of the three pods. The Service resource sends your request to one of the pods like a load balancer.

And, if we delete a pod, Kubernetes updates this IP address list in the Endpoint resource. For example, if I delete one pod, we can see that the Endpoint resource (IP address list) is updated by Kubernetes as follows. (In this test case, a new pod is automatically created by the Deployment resource when I delete one pod manually.)

  • Initial state of pods (there are three pods)
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-ggx99   1/1     Running   0          2m52s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running   0          2m31s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running   0          2m10s
  • Initial state of Endpoint (there are three IP addresses of pods)
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.160:9080,10.244.1.161:9080,10.244.1.162:9080   17m
  • Delete one pod manually
    $ kubectl delete pod scalardb-cluster-node-8485886d5c-ggx99
    pod "scalardb-cluster-node-8485886d5c-ggx99" deleted
  • Pod status when the pod deletion is in progress (one pod is in Terminating status and a new pod is in ContainerCreating status)
    $ kubectl get pod
    NAME                                     READY   STATUS              RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-ggx99   1/1     Terminating         0          3m10s
    scalardb-cluster-node-8485886d5c-j8qjq   0/1     ContainerCreating   0          0s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running             0          2m49s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running             0          2m28s
  • Endpoint status when the pod deletion is in progress
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                             AGE
    scalardb-cluster-metrics   10.244.1.161:9080,10.244.1.162:9080   18m
    • As you can see, the Endpoint resource (a list of pod IPs) has only two IP addresses. The IP of the deleted pod is automatically removed from this list by Kubernetes.
  • Pod status after the new pod is created
    $ kubectl get pod
    NAME                                     READY   STATUS    RESTARTS   AGE
    scalardb-cluster-node-8485886d5c-j8qjq   1/1     Running   0          41s
    scalardb-cluster-node-8485886d5c-r29hm   1/1     Running   0          3m30s
    scalardb-cluster-node-8485886d5c-svwm2   1/1     Running   0          3m9s
  • Endpoint status after the new pod is created
    $ kubectl get endpoints scalardb-cluster-metrics
    NAME                       ENDPOINTS                                               AGE
    scalardb-cluster-metrics   10.244.1.161:9080,10.244.1.162:9080,10.244.1.163:9080   18m
    • As you can see, the Endpoint resource (a list of pod IPs) has three IP addresses. The IP of the created new pod is automatically added to this list by Kubernetes.

In the Kubernetes environment, restarting a pod means deleting the pod and creating a new one. When we restart a pod, Kubernetes updates the Endpoint resource (the IP address list) automatically, as shown above.

Note: You can see the description of this behavior in the official document Termination of Pods. Also, you can see the details on pod termination behavior in this article.

This is the behavior of Kubernetes' Service and Endpoint resources, which are native features of Kubernetes. So, basically, Kubernetes can perform a rolling update by updating the Endpoint resource (the list of pods for load balancing) based on pod status.

However, Kubernetes is an asynchronous system, so pod deletion and updating the Endpoint resource are done asynchronously. In other words, Kubernetes doesn't guarantee the order of these operations. So, there is a possibility that the Service resource sends new requests to a pod that is being deleted.

To mitigate this issue, users/applications can sleep before starting the shutdown process (i.e., use the sleep to wait for the Endpoint update). You can do this with Kubernetes' preStop hook feature or by implementing the sleep in your application.

Note: You can see the details on this workaround in this article and this article.
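
As a tiny illustration of the "implement the sleep in your application" option (the 30-second duration is an arbitrary example; the preStop hook alternative is configured in the pod spec instead, not in Java):

    import java.util.concurrent.TimeUnit;

    class SleepBeforeShutdownSketch {
      // Register a shutdown hook that waits before running the real shutdown logic,
      // giving Kubernetes time to propagate the Endpoint update.
      static void installShutdownHook(Runnable actualShutdown) {
        Runtime.getRuntime()
            .addShutdownHook(
                new Thread(
                    () -> {
                      try {
                        TimeUnit.SECONDS.sleep(30);
                      } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                      }
                      actualShutdown.run();
                    }));
      }
    }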

ScalarDB Server specific things

As I mentioned above, Kubernetes updates the pod list automatically and can perform a rolling update. So, we can do a rolling update by using the Service resource as a load balancer in the Kubernetes environment.

However, we cannot use the Service resource directly as a load balancer for ScalarDB Server.

Since the Service resource is an L4 load balancer, it cannot handle L7 connections (e.g., gRPC state) properly. And ScalarDB Server uses gRPC bidirectional streaming RPCs.

So, we have to use a load balancer that can handle L7 (gRPC bidirectional streaming RPCs) properly for connections between ScalarDB Server and clients. This is why we use Envoy as the load balancer for ScalarDB Server. Envoy can handle L7 (gRPC) connections.

Envoy's feature

Envoy has a health checking feature for upstream servers. If a health check fails, Envoy doesn't send new requests to the failed upstream server.
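
To make this concrete: for gRPC upstreams, Envoy's health checker essentially calls the standard grpc.health.v1.Health/Check RPC on each node. A minimal Java probe doing the same thing might look like this (the address and port are placeholders):

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import io.grpc.health.v1.HealthCheckRequest;
    import io.grpc.health.v1.HealthCheckResponse;
    import io.grpc.health.v1.HealthGrpc;

    class HealthProbeSketch {
      public static void main(String[] args) {
        // Placeholder address; Envoy would probe each upstream ScalarDB Server node.
        ManagedChannel channel =
            ManagedChannelBuilder.forAddress("localhost", 60051).usePlaintext().build();
        HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
        // An empty service name asks about the server as a whole.
        HealthCheckResponse response =
            stub.check(HealthCheckRequest.newBuilder().setService("").build());
        // SERVING -> keep routing to this node; NOT_SERVING -> stop sending new requests.
        System.out.println("Health status: " + response.getStatus());
        channel.shutdown();
      }
    }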

Also, regarding maintaining the upstream server list, Envoy can detect pod deletion via service discovery. If the service discovery information changes, Envoy updates its upstream server list accordingly. In the Kubernetes environment, the service discovery source is the Service resource. So, Envoy can detect pod deletion in the Kubernetes environment as follows.

  1. Envoy maintains its upstream server list based on the Service (strictly speaking, the Endpoint) resource.
  2. One pod is deleted.
  3. Kubernetes removes the pod from the Endpoint resource. In other words, the Service resource (service discovery for Envoy) is updated.
  4. Envoy detects the update of the Service resource.
  5. Envoy updates its upstream server list based on the updated Service resource of Kubernetes.

You can see the details of the behavior (service discovery + health check behavior) in the official document On eventually consistent service discovery.

Issues of ScalarDB Server

As I mentioned, Envoy can detect pod deletion and update the upstream server list. However, in both the health check and service discovery cases, there is a time lag between the pod deletion and Envoy detecting it.

It is difficult to remove this time lag completely, because the health check runs only every few seconds, and updating the Endpoint resource is done asynchronously (again, Kubernetes is an asynchronous system).

So, there is a possibility that Envoy sends a new request to a pod that is being deleted. This is the issue that we want to resolve.

How to resolve the issue

To resolve the ScalarDB Server issue mentioned above, we have to achieve a graceful shutdown by combining the following functions.

  • Endpoint updates (Kubernetes' feature)
  • Health check (Envoy's feature)
  • Service discovery (Envoy's feature)
  • gRPC Health Check (Implemented in ScalarDB Server)
  • Sleep (Implemented in ScalarDB Server)

By using these functions, we can achieve the following behavior (i.e., graceful shutdown) for ScalarDB Server; a rough server-side code sketch follows the list.

  1. Delete/stop the ScalarDB Server pod.
  2. Kubernetes removes the deleted pod's IP from the Endpoint resource.
  3. Kubernetes sends SIGTERM to the ScalarDB Server process in the pod.
  4. After receiving SIGTERM, ScalarDB Server returns NOT_SERVING as the result of the gRPC Health Check and waits for 30 seconds.
  5. Envoy detects the NOT_SERVING status of the pod. After that, Envoy doesn't send new requests to the pod.
  6. Envoy detects the service discovery (i.e., Endpoint resource) change and removes the pod from its upstream server list.
  7. After waiting for 30 seconds, ScalarDB Server starts its actual shutdown process.
  8. ScalarDB Server completes all ongoing transactions. After that, ScalarDB Server stops.

Note: Operations 2. through 6. run in parallel.
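
Putting the server side of this together, a minimal sketch of steps 4, 7, and 8 might look like the following. The class name, port, and constants are assumptions for illustration, not the actual ScalarDB Server code; the real implementation also registers ScalarDB's own gRPC services and waits for ongoing transactions to complete.

    import io.grpc.Server;
    import io.grpc.ServerBuilder;
    import io.grpc.health.v1.HealthCheckResponse.ServingStatus;
    import io.grpc.protobuf.services.HealthStatusManager;
    import java.util.concurrent.TimeUnit;

    public class GracefulShutdownSketch {
      private static final long DECOMMISSIONING_DURATION_MILLIS = 30_000; // assumed default
      private static final long MAX_WAIT_TIME_MILLIS = 60_000;            // assumed value

      public static void main(String[] args) throws Exception {
        HealthStatusManager healthService = new HealthStatusManager();
        Server server =
            ServerBuilder.forPort(60051) // placeholder port
                .addService(healthService.getHealthService())
                // ScalarDB's own gRPC services would also be registered here.
                .build()
                .start();
        healthService.setStatus(
            HealthStatusManager.SERVICE_NAME_ALL_SERVICES, ServingStatus.SERVING);

        Runtime.getRuntime()
            .addShutdownHook(
                new Thread(
                    () -> {
                      try {
                        // Step 4: report NOT_SERVING so Envoy stops routing here, then wait.
                        healthService.setStatus(
                            HealthStatusManager.SERVICE_NAME_ALL_SERVICES,
                            ServingStatus.NOT_SERVING);
                        Thread.sleep(DECOMMISSIONING_DURATION_MILLIS);
                        // Steps 7 and 8: stop accepting new RPCs and let ongoing ones finish.
                        server.shutdown();
                        server.awaitTermination(MAX_WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS);
                      } catch (Exception e) {
                        e.printStackTrace();
                      }
                    }));

        server.awaitTermination();
      }
    }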

Contributor

@kota2and3kan kota2and3kan left a comment

LGTM! Thank you!

@komamitsu
Contributor

@kota2and3kan Thanks for the great explanation!

Contributor

@feeblefakie feeblefakie left a comment

LGTM! Thank you!

Contributor

@Torch3333 Torch3333 left a comment

LGTM, thank you!

@brfrn169 brfrn169 merged commit 62c6c2a into master Apr 25, 2023
13 checks passed
@brfrn169 brfrn169 deleted the add-decommissioning-state-to-server branch April 25, 2023 06:28
@brfrn169 brfrn169 added this to Done in 3.9.0 Apr 27, 2023