
Using an LB for a Scylla cluster and introducing single DB node disruptions, we get "failed: Connection refused (Connection refused)" responses #1841

Closed
vponomaryov opened this issue Mar 18, 2024 · 8 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@vponomaryov
Contributor

What happened?

We ran 3-hour stress commands using the YCSB load tool, which exercises the alternator feature.
We used the Scylla load-balancer endpoint in K8S created by the scylla-operator for it.
While applying various disruptions to individual DB nodes (keeping quorum intact), we got the following errors:

...
85599 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
3366487 [Thread-17] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
6441893 [Thread-14] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
7282246 [Thread-15] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
10442006 [Thread-11] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
10442145 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...

These errors are unexpected, because this is exactly the problem the load-balancer solution is supposed to solve.

What did you expect to happen?

I expected the load balancer to resend a request that failed with a connection-refused error to another live endpoint/node.

How can we reproduce it (as minimally and precisely as possible)?

  • Set up scylla-operator with the alternator feature enabled
  • Create a 3-node Scylla cluster
  • Start some load against Scylla using the alternator port and the LB address (a minimal client sketch follows this list)
  • Make one of the DB nodes go down while the stress command(s) are running
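
For illustration, a minimal sketch of the kind of client used in the third step, assuming the AWS SDK for Java v1 (the SDK the YCSB DynamoDB binding is built on) and the LB hostname/port from the log above; the scheme, region, and credentials are placeholder assumptions:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class AlternatorClientSketch {
    public static AmazonDynamoDB build() {
        // Point the DynamoDB client at the alternator LB address from the report.
        // The "https" scheme, region, and credentials are assumptions for this sketch.
        return AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "https://sct-cluster-client.scylla.svc:8043", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("alternator", "placeholder-secret")))
                .withClientConfiguration(new ClientConfiguration()
                        .withConnectionTimeout(5_000)
                        .withSocketTimeout(10_000))
                .build();
    }
}
```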

Scylla Operator version

v1.13.0-alpha.0-49-gf356138-latest

Kubernetes platform name and version

v1.27.11-eks-b9c9ed7

Please attach the must-gather archive.

Jenkins job URL
Argus

Anything else we need to know?

No response

@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka
Member

I expected that load-balancer resends the failed request

That's not what the LB does - its job is to send the traffic only to nodes that are considered ready / in the pool.

Make one of DB node go down during running stress command(s)

Is this done gracefully (drain+shutdown) or ungracefully?

For an ungraceful shutdown, it always takes a few seconds before the LB notices the node is down and takes it out of the pool.

Graceful shutdowns are being fixed in #1077

@tnozicka tnozicka added kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka tnozicka added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 18, 2024
@nyh

nyh commented Mar 18, 2024

I think it is quite expected that a load balancer will have a (hopefully!) short delay in recognizing that one of the nodes is down, and during that window it may send a request to the dead node and fail. The DynamoDB documentation explains that a DynamoDB client should be ready for such failures and retry the request:

Numerous components on a network, such as DNS servers, switches, load balancers, and others, can generate errors anywhere in the life of a given request. The usual technique for dealing with these error responses in a networked environment is to implement retries in the client application. This technique increases the reliability of the application.

And it explains that Amazon's client libraries already do this retry automatically:

Each AWS SDK implements retry logic automatically.

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?
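
As a concrete illustration of the SDK-level retries mentioned above, here is a minimal sketch for the AWS SDK for Java v1 (the SDK behind the YCSB DynamoDB client shown in the log); the retry count of 10 is an arbitrary example value, not a recommendation:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class RetryingClientSketch {
    public static AmazonDynamoDB build() {
        // The SDK already retries retryable failures (throttling, connection-level
        // errors) with its default policy; this sketch only raises the maximum
        // number of attempts, widening the window a dead backend can be ridden out.
        ClientConfiguration config = new ClientConfiguration()
                .withRetryPolicy(PredefinedRetryPolicies
                        .getDefaultRetryPolicyWithCustomMaxRetries(10));
        return AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```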

@vponomaryov
Contributor Author

@tnozicka and @nyh thanks for the answers.

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

We use the one provisioned in K8S as a K8S LoadBalancer service configured by the scylla-operator.
If it were always responsive, we would not have filed this bug report.
So, as I understand it, that load balancer returns query results from the endpoints it routed to.

@tnozicka
Do I understand you correctly that it is not possible to configure that K8S LB service to resend queries to other live endpoints in case of a Connection refused error?

@tnozicka
Member

tnozicka commented Mar 19, 2024

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

There are multiple ways to set this up, and different LBs can be configured differently. With a pure Service-based LB, it's just iptables rules on the backend. But in the end the client is expected to get an error on ungraceful node terminations.

Do I understand you correctly, that it is not possible to configure that K8S LB service to resend queries to other alive endpoints in case of the Connection refused error?

To my knowledge, this is not possible, nor desired. The LB doesn't know whether a query can be retried, which is why this is left to clients. (A query can fail in the middle, do half of the work, trigger an external event, or hit other edge cases.)
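
To illustrate the point that retries belong in the client, here is a hedged sketch of an application-level retry loop intended for idempotent reads; the attempt count, backoff, and helper name are arbitrary example values, not part of any Scylla or AWS API:

```java
import com.amazonaws.SdkClientException;
import java.util.function.Supplier;

public final class RetrySketch {
    // Retry an idempotent operation on retryable SDK failures (such as
    // connection-level errors) with a simple linear backoff. Non-idempotent
    // writes should not be blindly retried, for the reasons given above.
    public static <T> T withRetries(Supplier<T> op, int maxAttempts, long backoffMillis) {
        SdkClientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (SdkClientException e) {
                last = e;
                if (!e.isRetryable()) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}
```

A call site would wrap a read, for example `RetrySketch.withRetries(() -> client.getItem(request), 5, 200)`, where `client` and `request` are whatever the application already uses.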

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2024
@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 7, 2024
@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

@scylla-operator-bot
Contributor

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to this:

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@scylla-operator-bot scylla-operator-bot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 6, 2024