node.start(wait_other_notice=True, ...) may return too early because implementation of wait_other_notice may observe past state #563

Closed
kbr-scylla opened this issue Mar 8, 2024 · 4 comments · Fixed by #564

wait_other_notice in ccmlib/scylla_node.py currently works by querying the /gossiper/live REST API endpoint:

    if wait_other_notice:

    def watch_rest_for_alive(self, nodes, timeout=120, wait_normal_token_owner=True):

If other nodes see the node as UP, and the node sees other nodes as UP, the check will succeed.
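
For illustration, here is a minimal sketch (not the actual ccm code) of what such a REST-based check boils down to: poll each other node's API until this node's address shows up in its list of live endpoints. The endpoint path and JSON handling are simplified assumptions.

    import time
    import requests

    def wait_seen_as_live(node_address, other_api_addresses, timeout=120):
        # Hypothetical helper: poll each other node's gossiper REST API until
        # it reports node_address among its live endpoints, or time out.
        deadline = time.time() + timeout
        remaining = set(other_api_addresses)
        while remaining and time.time() < deadline:
            for api in list(remaining):
                try:
                    live = requests.get(f"http://{api}/gossiper/live", timeout=5).json()
                except (requests.RequestException, ValueError):
                    continue  # that node's REST API may not be reachable yet
                if node_address in live:
                    remaining.discard(api)
            time.sleep(1)
        if remaining:
            raise TimeoutError(f"{sorted(remaining)} never reported {node_address} as live")

Note that nothing in such a check ties the "live" observation to this particular restart: if the other node never noticed the kill, the address is already in its live list and the check passes immediately, which is exactly the race described next.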

But consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

So node B may still consider A as UP because it was UP before it was killed; B's failure detector just didn't react to that event yet.

That's why the old implementation in ccmlib/node.py used log markers: it watched other nodes' logs from a point after the node was restarted, effectively serializing the events.
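
A rough sketch of that marker-based idea, using ccm-style mark_log / watch_log_for helpers (the exact "is now UP" log pattern is an approximation):

    # Record a log position on each other node right before (re)starting the
    # node, then only accept an "is now UP" line written after that position.
    markers = {other: other.mark_log() for other in other_nodes}
    node.start(wait_other_notice=False)
    for other, mark in markers.items():
        other.watch_log_for(f"{node.address()} is now UP", from_mark=mark, timeout=120)

Because a match can only come from log lines written after the marker, a stale pre-kill UP state cannot satisfy the wait.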

This bug was introduced in #462.

cc @temichus

@kbr-scylla (author) commented:

We could fix it by combining the old behavior (with log markers) with the new behavior (watching REST). For example -- we could wait for the "is now UP" message first, then use watch_rest_for_alive.

Although, looking at the Scylla code, I don't really understand why "is now UP" is not sufficient; it happens strictly after the live_endpoints set in the gossiper is modified, which is what the /gossiper/live REST API reads.
Perhaps the REST check is needed for some older versions of Scylla.
And I see that watch_rest_for_alive contains additional logic for waiting for NORMAL too, so it is still useful.
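
Concretely, a sketch of that combination could look roughly like this (helper names follow ccm's node API; the call site inside the real start path is simplified):

    def start_and_wait_other_notice(node, other_nodes):
        # 1. Serialize on log markers: only accept an "is now UP" message
        #    logged after the node was (re)started.
        marks = [(other, other.mark_log()) for other in other_nodes]
        node.start(wait_other_notice=False)
        for other, mark in marks:
            other.watch_log_for_alive(node, from_mark=mark)
        # 2. Keep the REST check, which additionally waits for the node to be
        #    seen as a NORMAL token owner.
        node.watch_rest_for_alive(other_nodes, wait_normal_token_owner=True)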

@kbr-scylla (author) commented:

Example failure:
https://github.com/scylladb/scylla-dtest/pull/4042/files

the test sometimes fails with

        # starting node1 - it should reconnect and run as is
        node1.start(wait_other_notice=True, wait_for_binary_proto=True)
        #  due to CL=1 it does not make sense to read while not all nodes are UP
        # for node in cluster.nodelist():
        #     wait_for_nodes_status(node, ["UN", "UN", "UN"])
    
>       result = list(session.execute("SELECT * FROM cf"))

update_cluster_layout_tests.py:1136: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cassandra/cluster.py:2726: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl ONE. Requires 1, alive 0" info={'consistency': 'ONE', 'required_replicas': 1, 'alive_replicas': 0}

(https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/artsiom_mishuta/job/dtest-pytest-gating/240/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/test_simple_kill_remained_node_while_decommissioning_96_100_/)

because we execute the SELECT too fast: start returned too early, then node2 marked node1 as DOWN, and we executed the SELECT before node2 marked node1 as UP again.

bhalevy added a commit to bhalevy/scylla-ccm that referenced this issue Mar 12, 2024
…r_alive

scylladb#462 introduced
watch_rest_for_alive, which replaced the calls to watch_log_for_alive
on the scylla node(s) start path.

But if a node is killed and then restarted, and other nodes
miss the kill event, `watch_rest_for_alive` will consider
that node as already up as seen by the other nodes,
whereas previously watch_log_for_alive waited until
other nodes discovered this node as up again, based
on markers taken right before (re)starting that node.

This change brings this call back.

Fixes scylladb#563

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@nyh (contributor) commented Mar 14, 2024:

But consider the following:

* 2 nodes A, B
* kill node A
* start node A with `wait_other_notice=True`

The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

I see that you and Benny liked #564 so I'll merge it, but I want to point out that I think the real problem in your example above is that "kill node A" neglects to wait for node A to be killed. If you want to do "kill node A" followed in quick succession by "start node A", it doesn't make sense to not wait for node A to be killed before starting it again.

The benefit of #564 is that it allows the test to succeed despite the test bug of forgetting to wait for "kill node A" to finish, so in that sense it is a good patch: it makes tests harder to write incorrectly.

@kbr-scylla (author) commented:

but I want to point out that I think the real problem in your example above is that "kill node A" neglects to wait for node A to be killed. If you want to do "kill node A" followed in quick succession by "start node A", it doesn't make sense to not wait for node A to be killed before starting it again.

I see it differently. The semantics of wait_other_notice should be as the name suggests: we wait until other nodes notice this particular start event. (Not some earlier start event.)

The test is correct when written under this assumption, with these semantics. (And it was indeed so before #462.)

fruch pushed a commit that referenced this issue Mar 15, 2024