cluster elasticity - new nodes never get balanced work with previous nodes, until the busiest node is killed, stopping the workload #15335
Comments
In addition to the master ones, it happened in 3 out of 3 of my staging jobs (including the one reported in this issue):
HWLB is extremely low on all nodes.
What does it mean?
@asias, following your https://github.com/scylladb/scylla-dtest/pull/3463#issuecomment-1707541714 I have a few questions for you. What can we do to reliably know when a node has fully joined the cluster (and is serving I/O - reads)? Checking
That their cache never gets warm enough to accept enough traffic.
No, that doesn't seem to be the problem... the problem seems to be that we see the result of
As can be seen in this output, we have the node's IP
You need to check that the CQL port is available and responsive before moving to the next node.
This was reliable in the past but no longer is, since when replacing a node, the new node starts getting writes during the replacement even before it is "READY" to serve.
That's OK - RPC may be ready before CQL, but you shouldn't move on to the next node's operation before this one is ready to serve via CQL.
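A minimal sketch of such a gate, assuming the Python cassandra/scylla driver; the helper name `wait_for_cql`, the timeouts, and the probe query are illustrative and not taken from SCT or dtest code:

```python
# Sketch: block until a newly added node itself answers a trivial CQL read.
# Illustrative only; names, timeouts, and the probe query are assumptions.
import time

from cassandra import OperationTimedOut
from cassandra.cluster import (Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT,
                               NoHostAvailable)
from cassandra.policies import WhiteListRoundRobinPolicy


def wait_for_cql(node_ip, port=9042, timeout=900, poll=10):
    """Return once node_ip serves CQL on its native port, or raise TimeoutError."""
    # Pin queries to the new node only, so an answer from an older node
    # cannot mask the fact that this node is not serving yet.
    profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy([node_ip]))
    deadline = time.time() + timeout
    while time.time() < deadline:
        cluster = Cluster(contact_points=[node_ip], port=port,
                          execution_profiles={EXEC_PROFILE_DEFAULT: profile})
        try:
            session = cluster.connect()
            # A local system-table read proves CQL is actually being served,
            # not merely that the TCP port is open.
            session.execute("SELECT release_version FROM system.local")
            return
        except (NoHostAvailable, OperationTimedOut):
            time.sleep(poll)
        finally:
            cluster.shutdown()
    raise TimeoutError(f"{node_ip} did not start serving CQL within {timeout}s")
```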
It sounds pretty hacky to rely on undocumented checks to figure out whether a node is ready to serve. If needed, we can develop an official API for a node's capacity, especially with tablets.
Following the issue investigation on the grow and shrink cluster individual nemesis, we observed that the nodes being added take much longer than we expected (we check for `nodetool status` being `UN` before we continue), but in this case the nodes were not yet fully serving I/O, which eventually made the cluster lose quorum, stopping the c-s workload with `NoHostAvailableException`. Refs: scylladb/scylladb#15335
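For reference, the `UN` gate the test relies on looks roughly like the sketch below (the invocation and parsing are illustrative; as this issue shows, `UN` in `nodetool status` alone does not imply the node is already serving CQL reads):

```python
# Sketch of a "wait for UN" check; illustrative only. Assumes `nodetool`
# can be run locally against the cluster.
import subprocess


def node_is_un(node_ip: str) -> bool:
    """Return True if `nodetool status` reports the node as Up/Normal (UN)."""
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Data rows look like: "UN  10.0.1.23  1.21 GB  256  ...  rack1"
        if len(fields) >= 2 and fields[1] == node_ip:
            return fields[0] == "UN"
    return False
```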
@mykaul, I'm not sure this is the proper fix, but I'm working around it in SCT --> scylladb/scylla-cluster-tests#6601
Refs #12015
Refs #8275
I assume this is a test issue that was fixed/improved (nevertheless, a parallel discussion is how we ensure a node is 'healthy' and 'ready' ...)
Issue description
The test does elasticity stress on the cluster, adding and decommissioning 1 node at a time, multiple times in a row.
Impact
The user impact is that, once the busiest node (busiest because many nodes were added and removed) is decommissioned, the client starts getting
NoHostAvailableException
(stopping the test). We see in the monitor that, after adding a new node and then decommissioning a node, we got into a situation where part of the nodes are very close to 100% load, while others are around 25%:
But that is already visible a bit earlier:
There is a live monitor for the test, if it would help.
How frequently does it reproduce?
Very frequently, but not in 100% of the cases. Out of the last 6 runs on master, 2 passed and 4 failed with the same error, at different times:
build11 - failed
build12 - failed (SCT issue, already fixed)
build13 - passed
build14 - failed
build15 - failed
build16 - passed
More information on the other failures is in the links above.
Installation details
Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash):
5.4.0~dev-20230907.cfc70810d335
with build-id 6dec2e4d10afbef8a279fc0d01018ac4bd74d6a6
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image:
ami-03f6dcf8dab61bf9e
(aws: undefined_region)
Test:
grow_shrink_cluster
Test id:
06b78ec0-f454-461d-b965-d8cf98ff1a75
Test name:
scylla-staging/fabio/seed_node_failure_on_add_new_node/grow_shrink_cluster
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 06b78ec0-f454-461d-b965-d8cf98ff1a75
$ hydra investigate show-logs 06b78ec0-f454-461d-b965-d8cf98ff1a75
Logs:
Jenkins job URL
Argus