Using LB for a Scylla cluster and introducing single DB node disruptions, we get "failed: Connection refused (Connection refused)" responses #1841
Comments
That's not what the LB does - its job is to send traffic only to nodes that are considered ready / in the pool.
Is this done gracefully (drain + shutdown) or ungracefully? For an ungraceful shutdown, it always takes a few seconds before the LB notices the node is down and takes it out of the pool. Graceful shutdowns are being fixed in #1077
I think it is quite expected that a load balancer will have a (hopefully!) short delay in recognizing that one of the nodes is down, and will send a request to that dead node, which then fails. The DynamoDB documentation explains that a DynamoDB client should be ready for such failures and retry the request:
And explains that Amazon's client libraries already do this retry automatically:
What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), which is always responsive?
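For illustration, here is a minimal sketch (not taken from the issue) of the automatic retry behavior described above, using boto3 against a DynamoDB-compatible (Alternator) endpoint. The endpoint URL, credentials, and table name are placeholders, not values from this cluster:

```python
# Sketch only: boto3's built-in retry handler transparently retries transient
# connection failures, which covers the brief window while an LB still routes
# to a node that just went down. All literals below are placeholders.
import boto3
from botocore.config import Config

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.example:8000",  # hypothetical LB address
    region_name="us-east-1",
    aws_access_key_id="none",
    aws_secret_access_key="none",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

# The SDK retries this call automatically on transient connection errors.
dynamodb.Table("usertable").put_item(Item={"p": "key1", "value": "v"})
```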
@tnozicka and @nyh thanks for the answers.
We use the one provisioned in K8S as a Service, @tnozicka.
There are multiple ways, and different LBs can be configured differently. With a pure Service-based LB, it's just iptables rules on the backend. But in the end the client is expected to get an error on ungraceful node terminations.
To my knowledge, this is not possible, nor desired. The LB doesn't know whether a query can be retried; that's why this is left to clients. (It can fail in the middle, do half of the work, trigger an external event, or hit other edge cases.)
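To make that concrete, here is a hypothetical sketch (not from this thread) of application-level retry logic: only the client knows whether an operation is idempotent and therefore safe to re-send after a connection-level failure. The helper name and parameters are invented for illustration:

```python
# Hypothetical helper: retry only operations the application knows to be
# idempotent. A load balancer cannot make this decision, because it cannot
# tell whether the failed request already did part of its work.
import botocore.exceptions

def call_with_retry(operation, idempotent, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except botocore.exceptions.EndpointConnectionError:
            if not idempotent or attempt == max_attempts:
                raise  # non-idempotent or out of attempts: surface the error
```

A plain read such as get_item would typically be passed with idempotent=True, while a non-idempotent or conditional write would be left to fail and be handled by the application.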
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/lifecycle rotten
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
/close not-planned
@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What happened?
We ran 3-hour stress commands using the YCSB load tool, which covers the Alternator feature.
We used the Scylla load-balancer endpoint in K8S created by the scylla-operator for it.
While applying various disruptions to some DB nodes (compatible with quorum), we get failed: Connection refused (Connection refused) errors, which are unexpected because this is exactly what the load-balancer solution should address.
What did you expect to happen?
I expected the load balancer to resend a request that failed with the connection refused error to another alive endpoint/node.
How can we reproduce it (as minimally and precisely as possible)?
Scylla Operator version
v1.13.0-alpha.0-49-gf356138-latest
Kubernetes platform name and version
v1.27.11-eks-b9c9ed7
Please attach the must-gather archive.
Jenkins job URL
Argus
Anything else we need to know?
No response