TestUpdateClusterLayout::test_simple_decommission_node_while_query_info failed #17903
@fruch, that's different. This time it's the RF=1 case and, correspondingly, a different error.
So, any ideas what should be done here? Are retries needed at the test level too?
Yes, this particular test case (RF=1) also needs a retry policy.
By default there is a retry, but without any wait time between tries. In the latest driver I've introduced retries with exponential backoff, which might help. Worst case we'll use the …
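A minimal sketch of what a test-level retry with exponential backoff could look like with the Python driver; the helper name and backoff parameters are illustrative, not something the driver ships:

```python
import time

from cassandra import ReadTimeout, Unavailable
from cassandra.cluster import NoHostAvailable


def execute_with_backoff(session, query, max_retries=5, base_delay=0.5):
    """Retry a query at the test level, sleeping exponentially longer
    between attempts (0.5s, 1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return session.execute(query)
        except (Unavailable, ReadTimeout, NoHostAvailable):
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```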
We don't know exactly when the failure happened, but there is some time skew between the gossiper state on the different nodes; maybe that is related.
Maybe without consistent topology changes this test is bound to be flaky...
@Annamikhlin the recent case is with RF=2, so it could be a distinct issue (compared to the RF=1 one).
A wild guess:
@xemul mentions that #17124 (comment) could be related. |
This seems like a race between the node state in the e_r_m and the gossip state:

```cpp
inet_address_vector_replica_set
storage_proxy::get_endpoints_for_reading(const sstring& ks_name, const locator::effective_replication_map& erm, const dht::token& token) const {
    auto endpoints = erm.get_endpoints_for_reading(token);
    auto it = boost::range::remove_if(endpoints, std::not_fn(std::bind_front(&storage_proxy::is_alive, this)));
    endpoints.erase(it, endpoints.end());
    // ...
```
Maybe we should reconsider returning an unavailable error in this case, when we don't have enough live nodes to satisfy the CL.
Also, as @xemul suggests, we could keep the node liveness state in locator::node (available via locator::topology) and update it via the storage_service, so it would stick to the effective_replication_map / token_metadata / topology version in use.
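A hedged Python sketch of those two ideas together (every name here is invented for illustration; this is not the actual storage_proxy code): liveness is read from the same topology snapshot the erm pins, and the coordinator raises an unavailable error instead of silently dropping below the CL:

```python
class Unavailable(Exception):
    """Not enough live replicas to satisfy the consistency level."""


def get_endpoints_for_reading(erm, token, cl_required):
    # Endpoints and liveness both come from the same erm/topology
    # snapshot, so a concurrent decommission cannot flip a replica to
    # "dead" underneath a read that still uses the old topology.
    endpoints = erm.get_endpoints_for_reading(token)
    live = [ep for ep in endpoints if erm.topology.is_alive(ep)]
    if len(live) < cl_required:
        raise Unavailable(f"need {cl_required} live replicas, have {len(live)}")
    return live
```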
A similar situation here: #17786 (comment)
Reads to the decommissioning node should be drained before we mark it as dead or purge it from the cluster: first stop routing new reads to the node, then wait for in-flight reads that still use the old topology to drain, and only then mark the node as dead and remove it from the cluster.
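A minimal sketch of that ordering, with invented helper names (this is not the actual node_ops implementation):

```python
def decommission(node, cluster):
    # 1. Publish a topology in which `node` is no longer a replica,
    #    so new reads stop being routed to it.
    cluster.topology.remove_replica(node)  # hypothetical helper
    # 2. Wait for in-flight reads that still hold the old topology
    #    (e.g. long scans) to drain; gossip mode approximates this
    #    with ring_delay sleeps.
    cluster.wait_for_reads_to_drain()  # hypothetical helper
    # 3. Only now mark the node dead / purge it from gossip.
    cluster.mark_dead(node)  # hypothetical helper
```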
We don't need any hacks like storing liveness in locator::topology or retrying in the driver or whatever. I believe we did follow the correct order of operations. There must have been a regression at some point, which caused this issue to pop up, as well as issues like #17786 (comment). The problem is that it's very easy to break the ordering if we're not extra careful. We need good, precise tests in test.py that would not merely become flaky once we introduce a regression like that, but catch the problem instantly.
@kbr-scylla This is a run without consistent topology, so there's no synchronization.
There is -- through node_ops RPCs and ring_delay sleeps. It is not fault tolerant and it is not reliable, but in the happy case, when there are no network problems and no node failures, it should work. We should not be observing data path failures. It worked at some point; this issue did not exist forever. There was a regression.
Seen again in https://jenkins.scylladb.com/job/scylla-master/job/next/7549/ (still running)
Spotted in enterprise as well; see https://jenkins.scylladb.com/job/scylla-enterprise/job/gating-dtest-release/1427/artifact/logs-full.release.002/
Seen again in 6.0 - https://jenkins.scylladb.com/job/scylla-6.0/job/next/7/
Where exactly? There is no ring delay with node_ops between the write_both_read_old and write_both_read_new states, because there is no write_both_read_new state. It goes directly from write_both_read_old to write_new_read_new (AKA normal). The problem here happens during a table scan ("select * from ..."). This is a long-running operation that holds an erm with a topology. If a node is decommissioned while a scan read is running, the read will still target it according to the topology it is working with, but the node will be declared dead (non-existent nodes are dead in the gossiper::is_alive implementation), so the read will fail.
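A dtest-style sketch of that race; `session` stands in for a python-driver session and `node3.decommission()` for whatever cluster-control helper the test uses -- both are assumptions, not the actual test code:

```python
import threading

errors = []


def scan(session):
    try:
        # Long-running full scan: holds one erm (and its topology
        # snapshot) for the whole iteration.
        for _ in session.execute("SELECT * FROM ks.big_table"):
            pass
    except Exception as e:
        errors.append(e)


t = threading.Thread(target=scan, args=(session,))
t.start()

# Decommission a replica while the scan is still running. The scan keeps
# routing reads using its old topology, which still lists the node as a
# replica, but gossip now reports the removed node as dead, so the read
# fails instead of being drained first.
node3.decommission()  # hypothetical helper

t.join()
assert not errors, f"scan failed during decommission: {errors}"
```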
At what point do you think it worked?
Reproduces more easily with:
|
So it looks like this is not a regression; the problem has been there since forever. The test got modified at some point and started catching it: #17124 (comment). @fruch mentioned:
There were attempts to fix the flakiness, but they only partially succeeded: https://github.com/scylladb/scylla-dtest/pull/4036. There's no point in fixing it for gossip mode; we moved to raft-based topology to fix issues like this one, among other things. The most sensible way out, IMO, is to disable the test in gossip mode. @temichus reassigning to you, and taking off the P1 label.
PR to disable this test for gossip-topology-changes: https://github.com/scylladb/scylla-dtest/pull/4398
The test was disabled in gossip mode and should not be flaky in raft-topo mode, so closing. We can reopen if we observe it in raft-topo.
Seen on https://jenkins.scylladb.com/job/scylla-master/job/next/7407/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/Build___x86___dtest___test_simple_decommission_node_while_query_info_1_/
Link to full log:
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/8608/artifact/logs-full.release.001/dtest-gw3.log/*view*/