cluster elasticity - new nodes never get balanced work with previous nodes, until the busiest node is killed, stopping the workload #15335

Closed

fgelcer opened this issue Sep 10, 2023 · 16 comments

Labels: triage/master (Looking for assignee)

fgelcer commented Sep 10, 2023

Issue description

  • This issue is a regression.
  • This issue is NOT a regression. (at least, I think it is not)

The test performs elasticity stress on the cluster, adding and decommissioning 1 node at a time, multiple times in a row.

Impact

The user impact is that once the busiest node (after many nodes were added and removed) is decommissioned, the client starts getting NoHostAvailableException, stopping the test:

java.io.IOException: Operation x10 on key(s) [304e3535394f33363830]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (tried: /10.4.2.253:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.4.2.253:9042] Write attempt on defunct connection))

In the monitor we can see that, after adding a new node and then decommissioning a node, we get into a situation where some of the nodes are very close to 100% load while others are around 25%:

Screenshot from 2023-09-10 14-09-12

but this can already be seen a bit earlier:

Screenshot from 2023-09-10 14-13-35

There is a live monitor for the test, if that would help.

How frequently does it reproduce?

Very frequently, but not in 100% of the cases. Out of the last 6 runs on master, 2 passed and 4 failed with the same error, at different times:
build11 - failed
build12 - failed (SCT issue, already fixed)
build13 - passed
build14 - failed
build15 - failed
build16 - passed

More information on the other failures is in the links above.

Installation details

Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): 5.4.0~dev-20230907.cfc70810d335 with build-id 6dec2e4d10afbef8a279fc0d01018ac4bd74d6a6

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-9 (54.171.151.221 | 10.4.1.79) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-8 (54.73.51.202 | 10.4.1.85) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-7 (34.245.27.225 | 10.4.3.237) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-6 (34.243.76.60 | 10.4.3.7) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-5 (34.242.66.129 | 10.4.3.15) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-4 (34.244.57.5 | 10.4.2.253) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-3 (54.216.86.39 | 10.4.1.69) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-23 (54.77.67.92 | 10.4.2.12) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-22 (54.154.60.224 | 10.4.1.58) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-21 (54.72.114.181 | 10.4.2.162) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-20 (34.248.205.45 | 10.4.1.53) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-2 (52.17.213.168 | 10.4.3.197) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-19 (54.246.18.38 | 10.4.1.200) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-18 (54.171.101.93 | 10.4.0.79) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-17 (54.246.249.8 | 10.4.1.17) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-16 (54.73.68.224 | 10.4.1.245) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-15 (34.241.18.125 | 10.4.3.0) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-14 (52.49.253.216 | 10.4.3.130) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-13 (54.171.106.163 | 10.4.3.36) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-12 (3.250.221.76 | 10.4.3.44) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-11 (54.246.4.198 | 10.4.2.18) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-10 (3.250.196.116 | 10.4.3.24) (shards: 2)
  • longevity-5gb-1h-GrowShrinkClusterN-db-node-06b78ec0-1 (34.254.100.10 | 10.4.0.129) (shards: 2)

OS / Image: ami-03f6dcf8dab61bf9e (aws: undefined_region)

Test: grow_shrink_cluster
Test id: 06b78ec0-f454-461d-b965-d8cf98ff1a75
Test name: scylla-staging/fabio/seed_node_failure_on_add_new_node/grow_shrink_cluster
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 06b78ec0-f454-461d-b965-d8cf98ff1a75
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 06b78ec0-f454-461d-b965-d8cf98ff1a75

Logs:

Jenkins job URL
Argus

fgelcer added the triage/master label Sep 10, 2023

fgelcer commented Sep 10, 2023

In addition to the master runs (including the one reported in this issue, which are my staging jobs), it happened in 3 out of 3 jobs:

roydahan commented:
The problem you see is that a new node that was added isn't serving requests.
From the live monitor it started here:

Screenshot 2023-09-10 at 14 42 22

mykaul commented Sep 10, 2023

HWLB is extremely low on all nodes.

fgelcer commented Sep 10, 2023

> HWLB is extremely low on all nodes.

what does it mean?

fgelcer commented Sep 10, 2023

> The problem you see is that a new node that was added isn't serving requests. From the live monitor it started here:

@asias, following your https://github.com/scylladb/scylla-dtest/pull/3463#issuecomment-1707541714, I have a few questions for you.
This test doesn't "replace" a node; it adds a new node to the cluster and then decommissions a node from the cluster (not necessarily the node we just added). At the time of the image above we are adding a node and it is bootstrapping, but we never get `init - serving` nor `init - Scylla version 5.4.0~dev-0.20230907.cfc70810d335 initialization completed.`. However, because the new node most likely already shows as UN, we start decommissioning another node, which makes the joining node fail... From that point on we no longer have a full cluster running, and c-s eventually fails with NoHostAvailableException once the last serving node is decommissioned...

What can we do to reliably know when a node has fully joined the cluster (and is serving I/O - reads)? Checking nodetool status no longer seems reliable, as the node "IS" reported Up/Normal but is not yet serving I/O...
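
A minimal sketch of the kind of log-based check implied here, waiting for the `init - serving` message quoted above instead of relying on nodetool status alone; `run_on_node` and the `scylla-server` systemd unit name are illustrative assumptions, not SCT's actual API:

```python
def node_log_shows_serving(run_on_node) -> bool:
    """Check the node's own log for the 'init - serving' message.

    `run_on_node` is a hypothetical callable that executes a shell command
    on the node and returns its stdout (for example over SSH); the
    'scylla-server' systemd unit name is an assumption.
    """
    out = run_on_node("journalctl -u scylla-server --no-pager -n 2000")
    return "init - serving" in out
```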

mykaul commented Sep 10, 2023

> HWLB is extremely low on all nodes.
>
> what does it mean?

That they never get their cache 'warm enough' to accept enough traffic.

fgelcer commented Sep 10, 2023

> That they never get their cache 'warm enough' to accept enough traffic.

No, that doesn't seem to be the problem... the problem seems to be that nodetool status reports the new node as UN, so we moved forward to decommissioning a node, but in fact that new node was not yet serving requests...
@yarongilor has been working on a dtest related to this, and he was told by @asias / @bhalevy that even though it has a bug (in his scenario), it is not very likely to be fixed... but in this case we are shooting ourselves in the foot.

fgelcer commented Sep 10, 2023

As can be seen in this output, we have node 10.4.1.17's status (3 times):

Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.4.1.245  8.33 GB    256          ?       ae23b947-2c15-450c-a555-3b930dc0c029  1a
UN  10.4.1.17   4.66 GB    256          ?       054a8e62-4e6a-4402-b87f-ea4f87a55b22  1a
UN  10.4.3.130  8.54 GB    256          ?       f816253b-8d66-451c-9114-a3f3ab972175  1a
UN  10.4.1.79   9.4 GB     256          ?       7188cef5-229d-4493-bfd3-c0823220ed7b  1a
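
For illustration, a minimal sketch (not SCT's actual code) of parsing this kind of `nodetool status` output to get each node's state and load; as the rest of this thread shows, a `UN` state here does not by itself guarantee the node is serving reads:

```python
import re

def parse_nodetool_status(output: str):
    """Extract (state, address, load_gb) tuples from `nodetool status` output.

    Simplified: only handles GB/MB load units and the Up/Down Normal/Joining/
    Leaving states shown in the output above.
    """
    rows = []
    for line in output.splitlines():
        m = re.match(r"^(UN|UJ|UL|DN)\s+(\S+)\s+([\d.]+)\s+(GB|MB)", line)
        if m:
            state, addr, load, unit = m.groups()
            load_gb = float(load) / 1024 if unit == "MB" else float(load)
            rows.append((state, addr, load_gb))
    return rows
```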

mykaul commented Sep 11, 2023

> no, it doesn't seem to be the problem... the problem seems to be that nodetool status reports the new node as UN, so we moved forward to decommissioning a node, but in fact that new node was not yet serving requests...

You need to check that the CQL port is available and responsive before moving to the next node.
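
A minimal sketch of such a check in Python; the default port and timeouts are illustrative assumptions, and, as the following comments point out, an open port alone is not sufficient:

```python
import socket
import time

def wait_for_cql_port(host: str, port: int = 9042, timeout: float = 600.0) -> bool:
    """Poll until the node's CQL (native transport) port accepts TCP connections.

    Note: this only proves the port is open, not that the node serves reads;
    see the follow-up comments below. Host, port and timeout are illustrative.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(5)
    return False
```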

roydahan commented:

> You need to check that the CQL port is available and responsive before moving to the next node.

This was reliable in the past but is no longer, since when replacing a node the new node starts getting writes during the replacement, even before the node is "READY" to serve.

mykaul commented Sep 11, 2023

> This was reliable in the past but is no longer, since when replacing a node the new node starts getting writes during the replacement, even before the node is "READY" to serve.

That's OK - RPC may be ready before CQL, but you shouldn't move on to the next node's operation before this one is ready to serve via CQL.
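
A sketch of a query-level readiness check along these lines using the Python cassandra-driver, pinning the request to the new node so it must answer the read itself (the host, port, and chosen query are illustrative assumptions, not the check SCT ended up with):

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

def node_serves_cql(host: str, port: int = 9042) -> bool:
    """Return True if `host` itself answers a trivial CQL read.

    Pinning the load-balancing policy to this single host keeps another
    coordinator from answering the query on its behalf.
    """
    profile = ExecutionProfile(load_balancing_policy=WhiteListRoundRobinPolicy([host]))
    cluster = Cluster([host], port=port,
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()
```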

dorlaor commented Sep 12, 2023

It sounds pretty hacky to rely on undocumented checks to figure out whether
a node is ready to serve requests and to figure out its capacity. Ideally we can
piggyback on nodetool status and use the existing UN as a means of readiness,
ignoring HWLB.

If needed, we can develop an official API for a node's capacity, especially with tablets
coming soon.

fgelcer pushed a commit to fgelcer/scylla-cluster-tests that referenced this issue Sep 13, 2023
Following the issue investigation on the grow-and-shrink
cluster individual nemesis, we observed that
the nodes being added take much longer
than we expected (we check for `nodetool status`
reporting `UN` before we continue), but in this case
the nodes were not yet fully serving I/O, which
eventually made the cluster lose quorum, stopping
the c-s workload with `NoHostAvailableException`.

Refs: scylladb/scylladb#15335
fgelcer pushed a commit to fgelcer/scylla-cluster-tests that referenced this issue Sep 13, 2023 (same commit message as above)
fgelcer commented Sep 13, 2023

@mykaul, I'm not sure this is the proper fix, but I'm working around it in SCT --> scylladb/scylla-cluster-tests#6601

fgelcer commented Sep 13, 2023

refs to #12015

fgelcer commented Sep 13, 2023

refs to #8275

fgelcer pushed three commits to fgelcer/scylla-cluster-tests that referenced this issue Sep 14, 2023 (same commit message as above)
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Sep 14, 2023 (same commit message as above)
mykaul commented Jan 1, 2024

I assume this is a test issue that was fixed/improved (and nevertheless, a parallel discussion is how we ensure a node is 'healthy' and 'ready'...)

mykaul closed this as completed Jan 1, 2024