cluster elasticity - new nodes never get balanced work with previous nodes, until the busiest node is killed, stopping the workload #15335

Closed

fgelcer opened this issue Sep 10, 2023 · 16 comments

Labels: triage/master (Looking for assignee)

fgelcer commented Sep 10, 2023

Issue description

  • This issue is a regression.
  • This issue is NOT a regression. (at least, I think it is not)

The test performs elasticity stress on the cluster, adding and decommissioning 1 node at a time, multiple times in a row.

Impact

The user impact is that once the busiest node (after many nodes were added and removed) is decommissioned, the client starts getting NoHostAvailableException, stopping the test:

java.io.IOException: Operation x10 on key(s) [304e3535394f33363830]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (tried: /10.4.2.253:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.4.2.253:9042] Write attempt on defunct connection))

In the monitor we can see that, after adding a new node and then decommissioning a node, we get into a situation where some of the nodes are very close to 100% load while others are around 25%:

Screenshot from 2023-09-10 14-09-12

but this can already be seen a bit earlier:

Screenshot from 2023-09-10 14-13-35

There is a live monitor for the test, if that would help.

How frequently does it reproduce?

Very frequently, but not in 100% of the cases. Out of the last 6 runs on master, 2 passed and 4 failed with the same error, at different times:
build11 - failed
build12 - failed (SCT issue, already fixed)
build13 - passed
build14 - failed
build15 - failed
build16 - passed

More information on the other failures is in the links above.

Installation details

Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): 5.4.0~dev-20230907.cfc70810d335 with build-id 6dec2e4d10afbef8a279fc0d01018ac4bd74d6a6

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-9 (54.171.151.221 | 10.4.1.79) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-8 (54.73.51.202 | 10.4.1.85) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-7 (34.245.27.225 | 10.4.3.237) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-6 (34.243.76.60 | 10.4.3.7) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-5 (34.242.66.129 | 10.4.3.15) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-4 (34.244.57.5 | 10.4.2.253) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-3 (54.216.86.39 | 10.4.1.69) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-23 (54.77.67.92 | 10.4.2.12) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-22 (54.154.60.224 | 10.4.1.58) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-21 (54.72.114.181 | 10.4.2.162) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-20 (34.248.205.45 | 10.4.1.53) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-2 (52.17.213.168 | 10.4.3.197) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-19 (54.246.18.38 | 10.4.1.200) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-18 (54.171.101.93 | 10.4.0.79) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-17 (54.246.249.8 | 10.4.1.17) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-16 (54.73.68.224 | 10.4.1.245) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-15 (34.241.18.125 | 10.4.3.0) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-14 (52.49.253.216 | 10.4.3.130) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-13 (54.171.106.163 | 10.4.3.36) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-12 (3.250.221.76 | 10.4.3.44) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-11 (54.246.4.198 | 10.4.2.18) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-10 (3.250.196.116 | 10.4.3.24) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-1 (34.254.100.10 | 10.4.0.129) (shards: 2)

OS / Image: ami-03f6dcf8dab61bf9e (aws: undefined_region)

Test: grow_shrink_cluster
Test id: 06b78ec0-f454-461d-b965-d8cf98ff1a75
Test name: scylla-staging/fabio/seed_node_failure_on_add_new_node/grow_shrink_cluster
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 06b78ec0-f454-461d-b965-d8cf98ff1a75
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 06b78ec0-f454-461d-b965-d8cf98ff1a75

Logs:

Jenkins job URL
Argus

fgelcer added the triage/master label Sep 10, 2023

fgelcer commented Sep 10, 2023

In addition to the master runs (including the one reported in this issue, which are my staging jobs), it happened in 3 out of 3 jobs:

roydahan commented:
The problem you see is that a new node that was added isn't serving requests.
From the live monitor it started here:

Screenshot 2023-09-10 at 14 42 22

mykaul commented Sep 10, 2023

HWLB is extremely low on all nodes.

fgelcer commented Sep 10, 2023

> HWLB is extremely low on all nodes.

what does it mean?

fgelcer commented Sep 10, 2023

> The problem you see is that a new node that was added isn't serving requests. From the live monitor it started here:

@asias, following your https://github.com/scylladb/scylla-dtest/pull/3463#issuecomment-1707541714, I have a few questions for you.
This test doesn't "replace" a node; it adds a new node to the cluster and then decommissions a node from the cluster (not necessarily the node we just added). At the time of the image above we are adding a node and it is bootstrapping, but we never get `init - serving` nor `init - Scylla version 5.4.0~dev-0.20230907.cfc70810d335 initialization completed.`. However, because the new node most likely already shows as UN, we start decommissioning another node, which makes the joining node fail... From that point on we no longer have a full cluster running, and c-s eventually fails with NoHostAvailableException once the last serving node is decommissioned...

What can we do to reliably know when a node has fully joined the cluster (and is serving I/O - reads)? Checking nodetool status no longer seems reliable, as the node "IS" reported Up/Normal but is not yet serving I/O...
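
A minimal sketch of the kind of log-based check implied here, waiting for the `init - serving` message quoted above instead of relying on nodetool status alone; `run_on_node` and the `scylla-server` systemd unit name are illustrative assumptions, not SCT's actual API:

```python
def node_log_shows_serving(run_on_node) -> bool:
    """Check the node's own log for the 'init - serving' message.

    `run_on_node` is a hypothetical callable that executes a shell command
    on the node and returns its stdout (for example over SSH); the
    'scylla-server' systemd unit name is an assumption.
    """
    out = run_on_node("journalctl -u scylla-server --no-pager -n 2000")
    return "init - serving" in out
```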

mykaul commented Sep 10, 2023

> HWLB is extremely low on all nodes.
>
> what does it mean?

That they never get their cache 'warm enough' to accept enough traffic.

fgelcer commented Sep 10, 2023

> That they never get their cache 'warm enough' to accept enough traffic.

No, that doesn't seem to be the problem... the problem seems to be that nodetool status reports the new node as UN, so we moved forward to decommissioning a node, but in fact that new node was not yet serving requests...
@yarongilor has been working on a dtest related to this, and he was told by @asias / @bhalevy that even though it has a bug (in his scenario), it is not very likely to be fixed... but in this case we are shooting ourselves in the foot.

fgelcer commented Sep 10, 2023

As can be seen in this output, we have node 10.4.1.17's status (3 times):

Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.1.245  8.33 GB    256          ?       ae23b947-2c15-450c-a555-3b930dc0c029  1a
UN  10.4.1.17   4.66 GB    256          ?       054a8e62-4e6a-4402-b87f-ea4f87a55b22  1a
UN  10.4.3.130  8.54 GB    256          ?       f816253b-8d66-451c-9114-a3f3ab972175  1a
UN  10.4.1.79   9.4 GB     256          ?       7188cef5-229d-4493-bfd3-c0823220ed7b  1a
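
For illustration, a minimal sketch (not SCT's actual code) of parsing this kind of `nodetool status` output to get each node's state and load; as the rest of this thread shows, a `UN` state here does not by itself guarantee the node is serving reads:

```python
import re

def parse_nodetool_status(output: str):
    """Extract (state, address, load_gb) tuples from `nodetool status` output.

    Simplified: only handles GB/MB load units and the Up/Down Normal/Joining/
    Leaving states shown in the output above.
    """
    rows = []
    for line in output.splitlines():
        m = re.match(r"^(UN|UJ|UL|DN)\s+(\S+)\s+([\d.]+)\s+(GB|MB)", line)
        if m:
            state, addr, load, unit = m.groups()
            load_gb = float(load) / 1024 if unit == "MB" else float(load)
            rows.append((state, addr, load_gb))
    return rows
```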

mykaul commented Sep 11, 2023

> no, it doesn't seem to be the problem... the problem seems to be that nodetool status reports the new node as UN, so we moved forward to decommissioning a node, but in fact that new node was not yet serving requests...

You need to check that the CQL port is available and responsive before moving to the next node.
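
A minimal sketch of such a check in Python; the default port and timeouts are illustrative assumptions, and, as the following comments point out, an open port alone is not sufficient:

```python
import socket
import time

def wait_for_cql_port(host: str, port: int = 9042, timeout: float = 600.0) -> bool:
    """Poll until the node's CQL (native transport) port accepts TCP connections.

    Note: this only proves the port is open, not that the node serves reads;
    see the follow-up comments below. Host, port and timeout are illustrative.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(5)
    return False
```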

roydahan commented:

> You need to check that the CQL port is available and responsive before moving to the next node.

This was reliable in the past but is no longer, since when replacing a node the new node starts getting writes during the replacement, even before the node is "READY" to serve.

mykaul commented Sep 11, 2023

> This was reliable in the past but is no longer, since when replacing a node the new node starts getting writes during the replacement, even before the node is "READY" to serve.

That's OK - RPC may be ready before CQL, but you shouldn't move on to the next node's operation before this one is ready to serve via CQL.
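
A sketch of a query-level readiness check along these lines using the Python cassandra-driver, pinning the request to the new node so it must answer the read itself (the host, port, and chosen query are illustrative assumptions, not the check SCT ended up with):

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

def node_serves_cql(host: str, port: int = 9042) -> bool:
    """Return True if `host` itself answers a trivial CQL read.

    Pinning the load-balancing policy to this single host keeps another
    coordinator from answering the query on its behalf.
    """
    profile = ExecutionProfile(load_balancing_policy=WhiteListRoundRobinPolicy([host]))
    cluster = Cluster([host], port=port,
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()
```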

dorlaor commented Sep 12, 2023

It sounds pretty hacky to rely on undocumented checks to figure out whether
a node is ready to serve requests and to figure out its capacity. Ideally we can
piggyback on nodetool status and use the existing UN as a means of readiness,
ignoring HWLB.

If needed, we can develop an official API for a node's capacity, especially with tablets
coming soon.

fgelcer pushed a commit to fgelcer/scylla-cluster-tests that referenced this issue Sep 13, 2023
Following the issue investigation on the grow-and-shrink
cluster individual nemesis, we observed that
the nodes being added take much longer
than we expected (we check for `nodetool status`
reporting `UN` before we continue), but in this case
the nodes were not yet fully serving I/O, which
eventually made the cluster lose quorum, stopping
the c-s workload with `NoHostAvailableException`.

Refs: scylladb/scylladb#15335
fgelcer pushed a commit to fgelcer/scylla-cluster-tests that referenced this issue Sep 13, 2023 (same commit message as above)
fgelcer commented Sep 13, 2023

@mykaul, I'm not sure this is the proper fix, but I'm working around it in SCT --> scylladb/scylla-cluster-tests#6601

fgelcer commented Sep 13, 2023

refs to #12015

fgelcer commented Sep 13, 2023

refs to #8275

fgelcer pushed three commits to fgelcer/scylla-cluster-tests that referenced this issue Sep 14, 2023 (same commit message as above)
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Sep 14, 2023 (same commit message as above)
mykaul commented Jan 1, 2024

I assume this is a test issue that was fixed/improved (and nevertheless, a parallel discussion is how we ensure a node is 'healthy' and 'ready'...)

mykaul closed this as completed Jan 1, 2024