
Using an LB for a Scylla cluster and introducing single-DB-node disruptions, we get "failed: Connection refused (Connection refused)" responses #1841

Open
vponomaryov opened this issue Mar 18, 2024 · 4 comments
Labels
kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@vponomaryov
Contributor

What happened?

We ran 3-hour stress commands using the YCSB load tool, which exercises the Alternator feature.
We used the Scylla load-balancer endpoint in K8S created for it by the scylla-operator.
While performing various quorum-compatible disruptions on some DB nodes, we got the following errors:

...
85599 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
3366487 [Thread-17] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
6441893 [Thread-14] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
7282246 [Thread-15] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
10442006 [Thread-11] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
10442145 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...

These errors are unexpected, because avoiding them is exactly what the load-balancer solution is supposed to provide.

What did you expect to happen?

I expected the load balancer to resend a request that failed with the connection-refused error to another live endpoint/node.

How can we reproduce it (as minimally and precisely as possible)?

  • Set up the scylla-operator with the Alternator feature enabled
  • Create a 3-node Scylla cluster
  • Start some load against Scylla using the Alternator port and the LB address
  • Take one of the DB nodes down while the stress command(s) are running

Scylla Operator version

v1.13.0-alpha.0-49-gf356138-latest

Kubernetes platform name and version

v1.27.11-eks-b9c9ed7

Please attach the must-gather archive.

Jenkins job URL
Argus

Anything else we need to know?

No response

@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka
Member

I expected that load-balancer resends the failed request

That's not what the LB does; its job is to send traffic only to nodes that are considered ready / in the pool.

Make one of DB node go down during running stress command(s)

Is this done gracefully (drain+shutdown) or ungracefully?

For an ungraceful shutdown, it always takes the LB a few seconds to notice that the node is down and take it out of the pool.

Graceful shutdowns are being fixed in #1077

@tnozicka tnozicka added kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka tnozicka added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 18, 2024
@nyh

nyh commented Mar 18, 2024

I think it is quite expected that a load balancer will have a (hopefully!) short delay in recognizing that one of the nodes is down, and during that delay it may send a request to the dead node and fail. The DynamoDB documentation explains that a DynamoDB client should be ready for such failures and retry the request:

Numerous components on a network, such as DNS servers, switches, load balancers, and others, can generate errors anywhere in the life of a given request. The usual technique for dealing with these error responses in a networked environment is to implement retries in the client application. This technique increases the reliability of the application.

And it explains that Amazon's client libraries already perform this retry automatically:

Each AWS SDK implements retry logic automatically.
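
As a minimal sketch of that retry knob, assuming the AWS SDK for Java v1 (the com.amazonaws exception in the log suggests this is what the YCSB DynamoDB binding uses), the SDK's built-in retry policy can be tuned so that transient connection failures during the LB's detection window are retried instead of surfaced immediately. The endpoint string and retry count below are illustrative assumptions, not the exact YCSB configuration:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class AlternatorRetryConfigExample {
    public static void main(String[] args) {
        // Alternator endpoint taken from the error messages; assumed to be TLS on 8043.
        String endpoint = "https://sct-cluster-client.scylla.svc:8043";

        // Use the SDK's DynamoDB retry policy, but allow more attempts so that
        // connection errors hit while the LB still routes to a dead node are
        // retried with exponential back-off rather than failing the operation.
        ClientConfiguration config = new ClientConfiguration()
                .withRetryPolicy(
                        PredefinedRetryPolicies.getDynamoDBDefaultRetryPolicyWithCustomMaxRetries(10));

        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration(endpoint, "us-east-1"))
                .withClientConfiguration(config)
                .build();

        // ... run the workload with `client`; requests that fail with a retriable
        // I/O error (e.g. Connection refused) are re-sent by the SDK automatically.
    }
}
```

Whether a given failure is retried still depends on the SDK's retry condition, so this is a sketch of the mechanism rather than a guaranteed fix for the errors above.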

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

@vponomaryov
Contributor Author

@tnozicka and @nyh thanks for the answers.

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

We use the one provisioned in K8S as a Kubernetes LoadBalancer Service, configured by the scylla-operator.
If it were always responsive, we would not have filed this bug report.
So, as I understand it, that load balancer returns query results from the endpoints it routes to.

@tnozicka
Do I understand you correctly that it is not possible to configure that K8S LB Service to resend queries to other live endpoints in case of the Connection refused error?

@tnozicka
Member

tnozicka commented Mar 19, 2024

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

There are multiple ways LBs can be set up, and different LBs can be configured differently. With a pure Service-based LB, it's just iptables rules on the backend. But in the end, the client is expected to get an error on ungraceful node terminations.

Do I understand you correctly, that it is not possible to configure that K8S LB service to resend queries to other alive endpoints in case of the Connection refused error?

To my knowledge, this is not possible, nor desirable. The LB doesn't know whether a query can be retried; that's why this is left to clients. (A query can fail in the middle, do half of the work, trigger an external event, or hit other edge cases.)
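
A minimal sketch of what that client-side responsibility can look like, retrying only an operation the application knows is idempotent (hypothetical helper, assuming the AWS SDK for Java v1 from the logs; back-off values are illustrative):

```java
import com.amazonaws.SdkClientException;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.GetItemResult;

public final class RetryingReads {
    // Retry only operations the application knows are idempotent (a GetItem here);
    // writes that may have partially executed need application-level handling.
    static GetItemResult getItemWithRetry(AmazonDynamoDB client, GetItemRequest request,
                                          int maxAttempts) throws InterruptedException {
        SdkClientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.getItem(request);
            } catch (SdkClientException e) {
                last = e;
                // Exponential back-off, capped, to give the LB time to drop the dead node.
                Thread.sleep(Math.min(1000L * (1L << (attempt - 1)), 8000L));
            }
        }
        throw last;
    }
}
```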
