
There is no way to wait until the UP/NORMAL states of other nodes are processed #12015

Closed
kbr- opened this issue Nov 17, 2022 · 21 comments · Fixed by #12540

kbr- (Contributor) commented Nov 17, 2022

In test.py topology tests we often perform topology operations as soon as all nodes have finished bootstrapping or restarting. Such topology operations race with the gossiper, which may still be in the middle of notifying different Scylla modules that some other node has entered the NORMAL state or that it has restarted. This may disrupt the topology operations.

Examples of failures:

We could take the approach that dtests are taking: "sleep and pray" (pray that everything is ready once we finish sleeping).
But:

  • sleeping may wait longer than necessary, while we want our tests to be as fast as possible - the test.py tests are part of regular development
  • sleeping does not guarantee that everything has been processed.

Instead, there should be a way of waiting until node X notices that Y is UP and NORMAL, and X finishes all processing related to this fact (i.e. all gossiper notifications, storage_service::handle_state_normal, etc.).

It seems that this may require introducing some additional state for each peer (to answer the question: "have I finished processing this peer?") and, for example, a REST API endpoint to query this state.
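
For illustration, a test-side helper could poll such an endpoint until the peer is fully processed. This is only a sketch: the endpoint path (/storage_service/peer_state_processed) and its response shape are hypothetical placeholders for whatever REST API ends up exposing the proposed per-peer state; only the default REST API port (10000) is real.

```python
import time

import requests


def wait_for_peer_processed(node_api: str, peer_ip: str, timeout: float = 60.0) -> None:
    """Poll node X's REST API until it reports that it has finished all processing
    related to peer Y being UP and NORMAL (gossiper notifications, handle_state_normal)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Hypothetical endpoint: returns {"processed": true} once the node has
        # finished handling the peer's NORMAL state.
        resp = requests.get(
            f"http://{node_api}:10000/storage_service/peer_state_processed",
            params={"peer": peer_ip}, timeout=5)
        if resp.ok and resp.json().get("processed"):
            return
        time.sleep(0.1)
    raise TimeoutError(f"{node_api} did not finish processing the state of {peer_ip}")
```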

Note: our documentation says that one should check that all other nodes are UP and NORMAL before performing a topology operation, using nodetool status. This works in practice, because the administrator is typically slower than Scylla's internal processing of other nodes' states. But in theory this is incorrect. I checked how nodetool status deduces that a node is NORMAL: it checks whether the node is joining or leaving; if neither is true, it assumes the node is normal. This doesn't guarantee that storage_service::handle_state_normal and similar functions have finished. I'm not even sure whether it guarantees that the gossiper endpoint_state for this node actually holds the NORMAL status. Is it always true that ~joining && ~leaving implies normal? Even if it were true, this is not good enough for tests, which need a stronger guarantee (that all processing has finished).

kbr- added the area/test label Nov 17, 2022
kbr- (Contributor, Author) commented Nov 17, 2022

cc @nyh you might find this interesting.

mykaul (Contributor) commented Nov 17, 2022

I think this is also important for a rolling upgrade in a K8s environment - you don't want to continue to the next K8s node drain before a Scylla node is truly up, communicating and doing work.

kbr- (Contributor, Author) commented Nov 17, 2022

Indeed. Or more generally, if you want to "automate the administrator", this cannot be done correctly unless we have a reliable way to query how a node views the status of other nodes.

nyh (Contributor) commented Nov 17, 2022

@kbr- thanks, this is interesting because it may explain a few, though not all, of the failures we saw when trying to eliminate the silly sleeps in dtest (https://github.com/scylladb/scylla-dtest/pull/2455).

While I agree with you that we could add all sorts of operations (REST API, nodetool, etc.) that check whether node X is aware of Y, and so on, I think there is a better solution: we need to uphold the requirement that:

Node X must not accept CQL requests until it is sure it can process them

Here it means that either node X should not listen for CQL requests until it knows it can perform them (including topology-change requests), or that if X receives a request it temporarily can't perform (because the gossip hasn't finished), it waits until it can - or sends the request to a different node if that's possible. This is, for example, what we do for nodes joining a cluster - they receive CQL requests, but if they receive a read request, it is sent to replicas which hold the data, and the node doesn't attempt to (incorrectly) respond from its own data.

I think adding additional synchronization requests would be a mistake. If we add those, we'll need to start using them in test.py, in dtest, in Kubernetes, and who knows where else. We'll need to document this weird new request and explain it. Everything will be so much simpler if we just had the invariant: If a node accepts a CQL request, then it will work. That's it.

UPDATE: I got a bit confused above - we're talking about topology change requests, not CQL requests (CQL requests would be used for schema changes, not topology changes). So maybe we need to delay not CQL requests but some other type of request (nodetool?). But I still think that having a node refuse to accept a request it can't yet perform is the best solution.

zimnx commented Nov 17, 2022

> Everything will be so much simpler if we just had the invariant: If a node accepts a CQL request, then it will work. That's it.

In K8s we are struggling without this invariant. Our tests are flaky because requests time out or return errors even when our readiness check passes.

Currently we qualify a node as ready to serve requests based on whether the node itself reports UN via GET /gossiper/endpoint/live/ and GET /storage_service/host_id, and whether the CQL port has been opened (GET /storage_service/native_transport).
If one of these conditions fails, we remove the node from the passthrough load balancer, and requests to it have nowhere to go.
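
A minimal sketch of such a readiness probe, using only the endpoints mentioned above (10000 is the default Scylla REST API port; the exact response shapes and the way they are combined here are assumptions, not the actual operator code):

```python
import requests


def node_ready(host: str, node_ip: str) -> bool:
    """Approximate readiness check: the node reports itself live, knows its host ID,
    and has opened the CQL (native transport) port."""
    base = f"http://{host}:10000"
    try:
        live = requests.get(f"{base}/gossiper/endpoint/live/", timeout=5).json()
        host_id = requests.get(f"{base}/storage_service/host_id", timeout=5).json()
        cql = requests.get(f"{base}/storage_service/native_transport", timeout=5).json()
    except (requests.RequestException, ValueError):
        return False
    # Assumed shapes: a list of live endpoints, a non-empty host-id mapping,
    # and a truthy value once the native transport is running.
    return node_ip in live and bool(host_id) and bool(cql)
```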

We already proposed such an endpoint in #8275, but it hasn't gotten much traction. We haven't checked the suggested endpoint though; we are still using the ones above.

We don't have anything for topology changes; we perform them once all nodes report themselves (we don't cross-check) as UN. An endpoint for topology readiness checks would be beneficial to us.

kbr- (Contributor, Author) commented Nov 17, 2022

> Node X must not accept CQL requests until it is sure it can process them

and some non-CQL requests too (topology change requests are done through REST API endpoints), but yes, this should probably be the way.

BTW. Raft-based topology changes can solve this:

  • we move the STATUS of peers into group 0
  • move STATUS processing into state_machine::apply (apply doesn't return until processing finishes)
  • before performing a topology operation, we call raft read_barrier, which ensures that we have the latest group 0 state
  • after calling read_barrier, verify that the STATUS of all peers is NORMAL, otherwise call read_barrier again, and so on (sketched below)

This will result in what @nyh is proposing.
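
Conceptually, the loop from the last step could look like the following pseudocode. This is not a real API: group0.read_barrier() and group0.peer_status() are hypothetical bindings standing in for the group 0 state machine internals.

```python
def wait_for_all_peers_normal(group0, peers) -> None:
    """Keep issuing read barriers until the locally applied group 0 state
    shows every peer's STATUS as NORMAL."""
    while True:
        group0.read_barrier()  # ensure we have applied the latest group 0 state
        if all(group0.peer_status(p) == "NORMAL" for p in peers):
            return  # safe to start the topology operation
        # some peer's STATUS change has not been applied yet; repeat the barrier
```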

kbr- (Contributor, Author) commented Nov 17, 2022

> If a node accepts a CQL request, then it will work

Well, obviously you cannot have such a guarantee, but you can try to guarantee something like:
If a node accepts a CQL request, and there is no subsequent failure, the request will work
(although it depends on how you define "failure"; e.g. is a temporary increase in network latency a failure or not?)

avikivity (Member) commented

Agree with adding a REST endpoint to improve the test. Re having the CQL server (and Alternator) wait for stability before listening, that seems like a good idea, but we need to think about it carefully. Especially if we move administrative functions to CQL, we could end up losing the ability to fix a problem.

Perhaps we need several CQL listeners (on unix-domain sockets): a local administrative listener, and a local pre-boot listener that is started very early (like a serial console for servers).

denesb (Contributor) commented Nov 18, 2022

> Agree with adding a REST endpoint to improve the test. Re having the CQL server (and Alternator) wait for stability before listening, that seems like a good idea, but we need to think about it carefully. Especially if we move administrative functions to CQL, we could end up losing the ability to fix a problem.
>
> Perhaps we need several CQL listeners (on unix-domain sockets): a local administrative listener, and a local pre-boot listener that is started very early (like a serial console for servers).

This was also proposed in the context of the maintenance mode (#5489): have a separate CQL port for the local administrator.

tgrabiec (Contributor) commented Jan 13, 2023

> BTW. Raft-based topology changes can solve this:
>
>   • we move the STATUS of peers into group 0
>   • move STATUS processing into state_machine::apply (apply doesn't return until processing finishes)
>   • before performing a topology operation, we call raft read_barrier, which ensures that we have the latest group 0 state
>   • after calling read_barrier, verify that the STATUS of all peers is NORMAL, otherwise call read_barrier again, and so on

I don't get why you need the loop to check the STATUSes. In raft-based topology, STATUS is held in system.token_metadata and managed by group 0, so handle_state_normal() is not called by the gossiper anymore but by the group 0 state machine. When a topology change starts, it alone has the right to change the STATUS. It is a logical error for STATUS to not be NORMAL for all nodes. The previous topology change should either complete or abort.

alecco (Contributor) commented May 4, 2023

Fixed in #12540 (already merged)

@alecco alecco closed this as completed May 4, 2023
kbr-scylla (Contributor) commented

Using the mechanism from #12540 does not guarantee that handle_state_normal has finished though.

Recently storage_service::wait_for_normal_state_handled_on_boot was introduced. It is used internally by Scylla on boot. However, since it works only on boot, it doesn't help for the case of another node restarting.

A full solution could work like this:

  • define some per-node value that is incremented on each restart and exchanged with other nodes. I think the gossiper generation_number is such a value, so I'll assume we use it below
  • in handle_state_normal, remember the greatest generation_number for each node that was handled by this function. We could extend the existing std::unordered_set<gms::inet_address> _normal_state_handled_on_boot by turning it into unordered_map<gms::inet_address, generation_type> for example.
  • expose a REST API that allows querying the data structure from previous point
  • in the test, after restarting a node, fetch its generation_number (there's already a REST API for that IIRC); then on another node, check whether it has handled the normal state for the restarted node with this or a greater number. If not, sleep, retry, and so on (see the sketch below).
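
A rough test-side sketch of the last step. The generation-number endpoint path and the handled-state endpoint (/storage_service/normal_state_handled) are assumptions - the latter is the hypothetical REST API proposed above - and the response shapes are guesses:

```python
import time

import requests


def wait_until_normal_handled(observer: str, restarted_ip: str, timeout: float = 60.0) -> None:
    """After restarting `restarted_ip`, wait until `observer` has run
    handle_state_normal for it with the new gossiper generation number."""
    # Fetch the restarted node's current generation number (endpoint path assumed).
    gen = requests.get(
        f"http://{restarted_ip}:10000/gossiper/generation_number/{restarted_ip}",
        timeout=5).json()
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Hypothetical endpoint exposing the proposed
        # unordered_map<gms::inet_address, generation_type> of handled peers.
        handled = requests.get(
            f"http://{observer}:10000/storage_service/normal_state_handled",
            timeout=5).json()  # assumed shape: {"<ip>": <generation>, ...}
        if handled.get(restarted_ip, -1) >= gen:
            return
        time.sleep(0.1)
    raise TimeoutError(f"{observer} did not handle NORMAL state of {restarted_ip} (generation {gen})")
```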

@kbr-scylla kbr-scylla reopened this May 4, 2023
@kostja kostja modified the milestones: 5.3, 5.4 May 12, 2023
@kostja kostja assigned kbr-scylla and unassigned alecco Jul 17, 2023
mykaul added the Backport candidate and backport/5.2 labels Aug 28, 2023
mykaul (Contributor) commented Aug 28, 2023

@DoronArazii - we probably want #12540 backported to 5.2 and 5.1, even if it's not a complete solution.

DoronArazii commented

@scylladb/scylla-maint please consider backport

kbr-scylla (Contributor) commented

#12540 and #13240 need to go together

kbr-scylla (Contributor) commented

Queued #12540 and #13240 to next-5.1 and next-5.2

DoronArazii commented

Why is it open, @kbr-scylla? What needs to be done?

mykaul (Contributor) commented Oct 17, 2023

> Why is it open, @kbr-scylla? What needs to be done?

ping @kbr-scylla

kbr-scylla (Contributor) commented

The problem described in the original post does not occur in raft topology mode: you can safely start a new topology operation as soon as the previous one has finished. Even better, you can queue up operations; the coordinator will handle them in a serialized fashion.

Almost all of our test.py tests run in raft-topology mode now. The ones that don't use their old workarounds. In any case, we no longer struggle with the problems described in the original issue.

Nevertheless, we should probably create a tool for administrators to communicate with the new topology coordinator to check committed node statuses and so on. It would be useful for observability, debugging and whatnot.

But creating such a tool deserves a separate issue.

So, I'm closing this one.

avikivity (Member) commented

Why is this still marked as a backport candidate? Is any work remaining?

kbr-scylla removed the Backport candidate, backport/5.2 and Requires-Backport-to-5.1 labels Nov 2, 2023
kbr-scylla (Contributor) commented

Not that I know of. Removing the labels.
