
Using an LB for a Scylla cluster and introducing single DB node disruptions, we get "failed: Connection refused (Connection refused)" responses #1841

Closed
vponomaryov opened this issue Mar 18, 2024 · 8 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@vponomaryov
Contributor

What happened?

We ran 3-hour stress commands using the YCSB load tool, which exercises the alternator feature.
We used the Scylla load-balancer endpoint in K8S created by the scylla-operator for it.
While applying various disruptions to individual DB nodes (keeping quorum intact), we got the following errors:

...
85599 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
3366487 [Thread-17] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
6441893 [Thread-14] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
7282246 [Thread-15] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...
10442006 [Thread-11] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
10442145 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sct-cluster-client.scylla.svc:8043 [sct-cluster-client.scylla.svc/172.20.73.197] failed: Connection refused (Connection refused)
...

These errors are unexpected, because this is exactly the problem the load-balancer solution is supposed to solve.

What did you expect to happen?

I expected the load balancer to resend a request that failed with a connection-refused error to another live endpoint/node.

How can we reproduce it (as minimally and precisely as possible)?

  • Set up scylla-operator with the alternator feature enabled
  • Create a 3-node Scylla cluster
  • Start some load against Scylla using the alternator port and the LB address (a minimal client sketch follows this list)
  • Make one of the DB nodes go down while the stress command(s) are running
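
For illustration, a minimal sketch of the kind of client used in the third step, assuming the AWS SDK for Java v1 (the SDK the YCSB DynamoDB binding is built on) and the LB hostname/port from the log above; the scheme, region, and credentials are placeholder assumptions:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class AlternatorClientSketch {
    public static AmazonDynamoDB build() {
        // Point the DynamoDB client at the alternator LB address from the report.
        // The "https" scheme, region, and credentials are assumptions for this sketch.
        return AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "https://sct-cluster-client.scylla.svc:8043", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("alternator", "placeholder-secret")))
                .withClientConfiguration(new ClientConfiguration()
                        .withConnectionTimeout(5_000)
                        .withSocketTimeout(10_000))
                .build();
    }
}
```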

Scylla Operator version

v1.13.0-alpha.0-49-gf356138-latest

Kubernetes platform name and version

v1.27.11-eks-b9c9ed7

Please attach the must-gather archive.

Jenkins job URL
Argus

Anything else we need to know?

No response

@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka
Member

I expected that load-balancer resends the failed request

That's not what the LB does - its job is to send the traffic only to nodes that are considered ready / in the pool.

Make one of DB node go down during running stress command(s)

Is this done gracefully (drain+shutdown) or ungracefully?

For an ungraceful shutdown, it always takes a few seconds before the LB notices the node is down and takes it out of the pool.

Graceful shutdowns are being fixed in #1077

@tnozicka tnozicka added kind/support Categorizes issue or PR as a support question. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 18, 2024
@tnozicka tnozicka added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 18, 2024
@nyh

nyh commented Mar 18, 2024

I think it is quite expected that a load balancer will have a (hopefully!) short delay in recognizing that one of the nodes is down, and during that window it may send a request to the dead node and fail. The DynamoDB documentation explains that a DynamoDB client should be ready for such failures and retry the request:

Numerous components on a network, such as DNS servers, switches, load balancers, and others, can generate errors anywhere in the life of a given request. The usual technique for dealing with these error responses in a networked environment is to implement retries in the client application. This technique increases the reliability of the application.

And it explains that Amazon's client libraries already do this retry automatically:

Each AWS SDK implements retry logic automatically.

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?
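
As a concrete illustration of the SDK-level retries mentioned above, here is a minimal sketch for the AWS SDK for Java v1 (the SDK behind the YCSB DynamoDB client shown in the log); the retry count of 10 is an arbitrary example value, not a recommendation:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class RetryingClientSketch {
    public static AmazonDynamoDB build() {
        // The SDK already retries retryable failures (throttling, connection-level
        // errors) with its default policy; this sketch only raises the maximum
        // number of attempts, widening the window a dead backend can be ridden out.
        ClientConfiguration config = new ClientConfiguration()
                .withRetryPolicy(PredefinedRetryPolicies
                        .getDefaultRetryPolicyWithCustomMaxRetries(10));
        return AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```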

@vponomaryov
Contributor Author

@tnozicka and @nyh thanks for the answers.

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

We use the one provisioned in K8S as a K8S LoadBalancer service configured by the scylla-operator.
If it were always responsive, we would not have filed this bug report.
So, as I understand it, that load balancer returns query results from the endpoints it routed to.

@tnozicka
Do I understand you correctly that it is not possible to configure that K8S LB service to resend queries to other live endpoints in case of a Connection refused error?

@tnozicka
Member

tnozicka commented Mar 19, 2024

What I don't understand, though (I don't know what kind of load balancer you are using...), is how the client saw a "connection refused". Isn't the connection made to the load balancer (not the dead node), and that is always responsive?

There are multiple ways to set this up, and different LBs can be configured differently. With a pure Service-based LB, it's just iptables rules on the backend. But in the end the client is expected to get an error on ungraceful node terminations.

Do I understand you correctly, that it is not possible to configure that K8S LB service to resend queries to other alive endpoints in case of the Connection refused error?

To my knowledge, this is not possible, nor desired. The LB doesn't know whether a query can be retried, which is why this is left to clients. (A query can fail in the middle, do half of the work, trigger an external event, or hit other edge cases.)
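
To illustrate the point that retries belong in the client, here is a hedged sketch of an application-level retry loop intended for idempotent reads; the attempt count, backoff, and helper name are arbitrary example values, not part of any Scylla or AWS API:

```java
import com.amazonaws.SdkClientException;
import java.util.function.Supplier;

public final class RetrySketch {
    // Retry an idempotent operation on retryable SDK failures (such as
    // connection-level errors) with a simple linear backoff. Non-idempotent
    // writes should not be blindly retried, for the reasons given above.
    public static <T> T withRetries(Supplier<T> op, int maxAttempts, long backoffMillis) {
        SdkClientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (SdkClientException e) {
                last = e;
                if (!e.isRetryable()) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}
```

A call site would wrap a read, for example `RetrySketch.withRetries(() -> client.getItem(request), 5, 200)`, where `client` and `request` are whatever the application already uses.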

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2024
@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 7, 2024
@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

@scylla-operator-bot
Contributor

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to this:

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@scylla-operator-bot scylla-operator-bot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 6, 2024