Add decommissioning state to ScalarDB Server #851
shutdown(MAX_WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS);
logger.info("The server shut down.");
logger.info("Signal received. Decommissioning ...");
decommission();
Before shutting down ScalarDB Server, change the healthService response to NOT_SERVING
and sleep for some time. This way, we can wait until no new requests are coming in.
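The sequence described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `DecommissionSketch`, the nested `HealthStatus` enum, the injected sleeper, and the 100 ms wait are all hypothetical stand-ins, and the real server would report its status through the gRPC health service rather than a plain field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongConsumer;

/** Sketch of a server that passes through a decommissioning state before shutdown. */
final class DecommissionSketch {
  /** Hypothetical stand-in for the status reported by the health service. */
  enum HealthStatus { SERVING, NOT_SERVING }

  private final AtomicReference<HealthStatus> health =
      new AtomicReference<>(HealthStatus.SERVING);
  private final List<String> events = new ArrayList<>();
  private final long decommissionWaitMillis; // assumed to be configurable
  private final LongConsumer sleeper;        // injected so tests can skip the real sleep

  DecommissionSketch(long decommissionWaitMillis, LongConsumer sleeper) {
    this.decommissionWaitMillis = decommissionWaitMillis;
    this.sleeper = sleeper;
  }

  HealthStatus health() {
    return health.get();
  }

  List<String> events() {
    return events;
  }

  /** Step 1: report NOT_SERVING. Step 2: wait. Step 3: shut down. */
  void decommission() {
    health.set(HealthStatus.NOT_SERVING);
    events.add("health=NOT_SERVING");
    // While we wait, health checkers can notice the new status and stop
    // routing new requests to this server.
    sleeper.accept(decommissionWaitMillis);
    events.add("waited=" + decommissionWaitMillis);
    shutdown();
  }

  private void shutdown() {
    // Placeholder for the real shutdown logic.
    events.add("shutdown");
  }

  /** Sleeps without propagating the checked InterruptedException. */
  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    DecommissionSketch server =
        new DecommissionSketch(100, DecommissionSketch::sleepQuietly);
    server.decommission();
    server.events().forEach(System.out::println);
  }
}
```

Injecting the sleeper keeps the ordering (status change, wait, shutdown) easy to check without waiting for the real delay.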
@brfrn169 In the current implementation,
@komamitsu (@brfrn169) In the current deployment on the Kubernetes environment, we use Envoy with ScalarDB Server, and there is a possibility that the clients receive a
And, we already fixed a similar issue in ScalarDB Cluster. So, I will explain the behavior of ScalarDB Cluster first. After that, I will explain the behavior (what we are going to resolve) in ScalarDB Server.
ScalarDB Cluster (already fixed some issues)
When we shutdown (send
Note: This fix was done by this PR.
By the above process, ScalarDB Cluster can guarantee that
ScalarDB Server (what we are going to fix)
When we shutdown (send
Note: Since there is a time lag between
To fix this issue, we update ScalarDB Server to implement the same
@kota2and3kan Thank you for your explanation!
@kota2and3kan Thanks for the detailed background! It's very helpful for me. Just out of curiosity, doesn't Kubernetes (and/or Envoy?) have a rolling update orchestration feature like the following?
AWS's CodeDeploy has this kind of feature, and I just thought it would be great if k8s supported that.
LGTM! 👍
In conclusion, Kubernetes has a rolling update feature and Envoy has health check/service discovery features. However, since Envoy is one of the application pods (i.e., it is not a system pod) from the perspective of Kubernetes, I think Kubernetes itself has no feature that makes Kubernetes and Envoy cooperate with each other.
In your question,
Since Envoy is one of the application pods, Kubernetes cannot control Envoy's behavior. Also, Envoy is a generic proxy (it is not a dedicated proxy for Kubernetes). So, I think Envoy cannot control Kubernetes' pod deletion behavior. Therefore, we have to achieve a graceful shutdown by combining several functions of Kubernetes and Envoy.
For your reference, I will explain Kubernetes' behavior and Envoy's behavior respectively. I will also explain the challenges of ScalarDB Server.
Kubernetes' feature
As a Kubernetes feature, there is a
For example, if there are three pods,
As you can see, the endpoint resource has a pod list that includes 3 IP addresses. So, if you access the Service
And, if we delete the pod, Kubernetes updates this IP address list in the
In the Kubernetes environment, the restarting pod means
Note: You can see the description of this behavior in the official document Termination of Pods. Also, you can see the details on pod termination behavior in this article.
This is a behavior of Kubernetes'
However, Kubernetes is an asynchronous system. So, pod deletion and updating
To reduce this issue, users/applications can sleep before starting the shutdown process (i.e., wait for updating
Note: You can see the details on this workaround in this article and this article.
ScalarDB Server specific things
As I mentioned above, Kubernetes updates a pod list automatically and can perform the rolling update. So, we can do the rolling update by using the
However, we cannot use the
Since the
So, we have to use some load balancer that can treat L7 (
Envoy's feature
Envoy has a Health checking feature for upstream servers. If the health check fails, Envoy doesn't send new requests to the failed upstream server. Also, regarding maintaining the upstream server list, Envoy can detect pod deletion based on service discovery. If the service discovery information changes, Envoy updates its upstream server list accordingly.
For example, in the Kubernetes environment, the service discovery means a
You can see the details of the behavior (service discovery + health check behavior) in the official document On eventually consistent service discovery.
Issues of ScalarDB Server
As I mentioned, Envoy can detect pod deletion and update the upstream server list. However, in both
It is difficult to remove this time lag completely, because the health check runs every few seconds, and the Endpoint resource is updated asynchronously (basically, Kubernetes is an asynchronous system). So, there is a possibility that Envoy sends a new request to a pod that is being deleted. This is the issue that we want to resolve.
How to resolve the issue
To resolve the ScalarDB Server issue that I mentioned above, we have to achieve a graceful shutdown by combining the following functions.
By using these functions, we can achieve the following behavior (i.e., graceful shutdown) for the ScalarDB Server.
Note: From
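To make the time lag discussed above more concrete, here is a rough sketch of how long a server might need to stay in the decommissioning state before it shuts down. It uses a deliberately simplified model that is an assumption, not Envoy's exact behavior: the proxy probes at a fixed interval and stops routing after a given number of consecutive failed probes. The class name, method names, and the numbers in `main` are hypothetical, and the model ignores probe timeouts, jitter, and Endpoint propagation delay.

```java
/** Simplified model of the lag between NOT_SERVING and the proxy noticing it. */
final class DrainWindowSketch {

  /**
   * Worst case in this model: the status flips right after a probe, so the
   * proxy needs `unhealthyThreshold` more full intervals to mark the host down.
   */
  static long worstCaseDetectionMillis(long intervalMillis, int unhealthyThreshold) {
    return intervalMillis * unhealthyThreshold;
  }

  /** True if the decommission wait is at least the worst-case detection lag. */
  static boolean waitCoversDetection(
      long decommissionWaitMillis, long intervalMillis, int unhealthyThreshold) {
    return decommissionWaitMillis
        >= worstCaseDetectionMillis(intervalMillis, unhealthyThreshold);
  }

  /**
   * Time at which the proxy marks the host unhealthy, assuming probes run at
   * t = 0, interval, 2 * interval, ... and the server turns NOT_SERVING at
   * `notServingAtMillis`.
   */
  static long detectionTimeMillis(
      long notServingAtMillis, long intervalMillis, int unhealthyThreshold) {
    // First probe at or after the status change (ceiling division).
    long firstFailedProbe =
        ((notServingAtMillis + intervalMillis - 1) / intervalMillis) * intervalMillis;
    // The remaining failures arrive one interval apart.
    return firstFailedProbe + (long) (unhealthyThreshold - 1) * intervalMillis;
  }

  public static void main(String[] args) {
    long interval = 5_000; // hypothetical: probe every 5 seconds
    int threshold = 2;     // hypothetical: 2 consecutive failures
    System.out.println("worst case lag (ms): "
        + worstCaseDetectionMillis(interval, threshold));
    System.out.println("15s wait is enough: "
        + waitCoversDetection(15_000, interval, threshold));
    System.out.println("5s wait is enough: "
        + waitCoversDetection(5_000, interval, threshold));
  }
}
```

In practice the wait would probably need some extra margin on top of this estimate, because the Endpoint resource is updated asynchronously as well.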
LGTM! Thank you!
@kota2and3kan Thanks for the great explanation!
LGTM! Thank you!
LGTM, thank you!
This PR adds a decommissioning state to ScalarDB Server to achieve graceful shutdown. Please take a look!