Tests / Sanity Tests / replace_address_test.TestReplaceAddress.test_replace_shutdown_node[use_host_id-rbo_disabled] #15602

Closed
Tracked by #17493
denesb opened this issue Oct 2, 2023 · 9 comments · Fixed by #18184
Labels: Backport candidate, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

Comments

@denesb (Contributor) commented Oct 2, 2023

Seen in #15581 (comment).
Jenkins job: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4005/
Test result link: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4005/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_shutdown_node_use_host_id_rbo_disabled_/

self = <replace_address_test.TestReplaceAddress object at 0x7f7363d4a550>
use_host_id = True

    @pytest.mark.parametrize("use_host_id", [True, False], ids=["use_host_id", "use_endpoint"])
    def test_replace_shutdown_node(self, use_host_id: bool):
        """
        @jira_ticket CASSANDRA-9871
        Test that we can replace a node that is shutdown gracefully.
        """
>       self._replace_node_test(gently=True, use_host_id=use_host_id)

replace_address_test.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
replace_address_test.py:135: in _replace_node_test
    node4.start(replace_node_host_id=replace_node_host_id,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:677: in start
    scylla_process = self._start_scylla(args, marks, update_pid,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:330: in _start_scylla
    node.watch_rest_for_alive(self, timeout=t)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7f7349437c10>
nodes = <ccmlib.scylla_node.ScyllaNode object at 0x7f734abdd8d0>, timeout = 120

    def watch_rest_for_alive(self, nodes, timeout=120):
        """
        Use the REST API to wait until this node detects that the nodes listed
        in "nodes" become fully operational and knows of its tokens.
        This is similar to watch_log_for_alive but uses ScyllaDB's REST API
        instead of the log file and waits for the node to be really useable,
        not just "UP" (see issue #461)
        """
        logging.getLogger('urllib3.connectionpool').disabled = True
        try:
            tofind = nodes if isinstance(nodes, list) else [nodes]
            tofind = set([node.address() for node in tofind])
            url_live = f"http://{self.address()}:10000/gossiper/endpoint/live"
            url_joining = f"http://{self.address()}:10000/storage_service/nodes/joining"
            url_tokens = f"http://{self.address()}:10000/storage_service/tokens/"
            endtime = time.time() + timeout
            while time.time() < endtime:
                live = set()
                response = requests.get(url=url_live)
                if response.text:
                    live = set(response.json())
                response = requests.get(url=url_joining)
                if response.text:
                    live = live - set(response.json())
                # Verify that node knows not only about the existance of the
                # other node, but also its tokens:
                if tofind.issubset(live):
                    # This node thinks that all given nodes are alive and not
                    # "joining", we're almost done, but still need to verify
                    # that the node knows the others' tokens.
                    check = tofind
                    tofind = set()
                    for n in check:
                        response = requests.get(url=url_tokens+n)
                        if response.text == '[]':
                            tofind.add(n)
                if not tofind:
                    return
                time.sleep(0.1)
>           raise TimeoutError(f"watch_rest_for_alive() timeout after {timeout} seconds")
E           ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:1372: TimeoutError
denesb added the "symptom/ci stability" label (Issues that failed in ScyllaDB CI - tests and framework) on Oct 2, 2023
@fruch (Contributor) commented Nov 16, 2023

Happened again on Nov 14:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/98885b60-844f-11ee-bf28-5927cc3d3ce3

It seems like we don't wait long enough in wait_other_notice, and the replacement might take more than 2 minutes.

For now, taking it out of gating.

@kbr-scylla (Contributor) commented:
Logs are gone and the Jenkins job was garbage collected, so this cannot be investigated.
Please upload logs to GitHub whenever reporting an issue.

kbr-scylla closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 22, 2024
kbr-scylla reopened this on Feb 22, 2024
@kbr-scylla (Contributor) commented:
Node 4 (replacing node) booted.

INFO  2024-02-13 23:24:07,532 [shard 0:main] init - Scylla version 5.5.0~dev-0.20240213.3d81138852ba initialization completed.

But node 3 (replaced node) tried to resurrect?

INFO  2024-02-13 23:24:17,474 [shard 0: gms] gossip - FatClient 127.0.67.3 has been silent for 30000ms, removing from gossip
INFO  2024-02-13 23:24:17,475 [shard 0: gms] token_metadata - Removed node 99497bfb-ef2b-4856-a536-26bad13a4f40 as pending replacing endpoint which replaces existing node d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5
INFO  2024-02-13 23:24:17,477 [shard 0: gms] gossip - removed 127.0.67.3 from _seeds, updated _seeds list = {127.0.67.1, 127.0.67.2}
INFO  2024-02-13 23:24:17,478 [shard 0: gms] gossip - Removed endpoint 127.0.67.3
WARN  2024-02-13 23:24:17,478 [shard 0: gms] gossip - === Gossip round FAIL: std::runtime_error (Host 127.0.67.3 does not have HOST_ID application_state)
INFO  2024-02-13 23:25:17,482 [shard 0: gms] gossip - 60000 ms elapsed, 127.0.67.3 gossip quarantine over
INFO  2024-02-13 23:25:17,922 [shard 0: gms] storage_service - Node 127.0.67.3 is in normal state, tokens: {-4294785196284927379, -4861343634213618721, -5321786174098909553, -7397747633982920162, 7356494811708677683, 220211549138195190, -3254444366566174500, 2893846497263795551, 1441074784552911625, -1770083480539845664, 6007430040555390715, 6298074233657520839, -5960709600521919108, 3759709800052062132, 6805735493811173483, 4505968732975790374, 1149706327595469562, 5561802904431211825, 9187846645534030414, 1376818661132105206, -7526210123239036990, 4994893635795998427, -4907920732886584587, -7438962222159366326, 4562078233970137158, -4435835693242168289, -6400216993867311901, 1771918098767231915, -9050458712746831038, 7875718615884556084, 7264855817667468185, -5400037845737851586, 5073381457322254605, 2824382474789100424, 1806474726863825245, -7233051299448812091, -7812644691462924492, 3782161773407008067, 9124504504748572572, -6482356163169012491, -9124395635065338387, -895994578848459920, -637670179518427525, -8442617363914781338, 1391960502142093578, 6273010418711457372, 7772116188879005730, -5246242773263942917, -4363078225417525730, 4452444831117025462, 7368453204489438542, -1276346978399395942, 7470723079546312477, 5268357887550599716, 2285911414331956501, 4810542258267088311, 4878869439099407527, -6314861654871981437, -6228806299448937051, -2289513586494258769, 257533754088335416, -7613515624575858020, -786993627763809860, 3170824115594718290, 5263507339014608373, -5655713019788863819, -6589766235180477050, 5709688465830681285, 5851064951209203948, 5931866822586243795, -4883515015159375676, 3355841149544094328, 5055070262128511759, 8930290523081110999, 8500175884397641440, -4859260425629062993, -8679122196560786309, 8169752934807027700, -2773038958057637852, -524753512615049682, -7785490701440818959, 8546311487042443236, -3454699277745797070, 4177504123408392224, 1825432326136456062, 3607858911278382877, 312646185654790247, -1745076696763312518, -6187171015933714575, -8393204340568429202, 8895628278764738676, -9110163120909716488, -6319871966704173715, -7593413510641732723, 1532135982859705596, 4753304106197110164, 3482505779338347803, -131947579693855985, 6881869152325684405, -3382109113091462735, 9211772746952534052, 7318561116742014360, -6338515371628010655, -4929451052734580635, -1326037342839268022, -462965217257465356, 523481979744513702, 5480025941145932833, -5713619175497303017, 683569722471203099, -1650146971812744986, -9012684641488856011, 1632158356957350685, -4698198776414384507, -1501620467428039534, -6607498917374653067, -1394657947395999699, 3082922541147126250, -6074024783379394679, 2175428808194677779, 1768623148019483736, 5578103676273252472, 601866373353952494, -7291149617792803441, 5185311057854867375, -441681849224078849, -7409230421985838758, -5952600469158351141, -8941737782836266266, -1740202383987325083, 8550313474396886194, -3822436722759924766, -3468586080835976864, 1216806325736339257, -667205661643043192, 8049373115413880167, -8175843797999452956, -6291542050531447671, -2842551664367602489, -1940402746507924194, -5484225238694228081, 8223168795917985376, 5721931271423970623, 6908714333930132185, 8868984960055389249, -7964813283601137826, 6149077196891026304, -7502931205442085762, -2517476347390353261, -1987582794588111576, -2864167392456114800, -5694342848502422040, -4513403407461767128, 1544576961942315033, 7393184154717769750, 6356748234595481964, -7765717711278636795, 3877565178975028780, 2288325340133808294, -3115383758561304203, 
-8848641033274969543, 7121276507929140888, 5624362942731640273, -8547475470942775717, 4078982383972396217, 7804907891580699721, 5033765915163216701, -9199456479663479593, -8391572929571693779, 6290985590214301596, 6306663374322376978, 1100121287315314867, 6435408865483207858, 8628504625178044975, -8093117753331404749, 6152418604516091038, 615347797078586772, 6480770966382294525, -973907173050842359, 6985886940278869013, -9210864788055687243, -7651965112643874660, 4243845615799226895, 1238459330137714742, 7716482442987566531, -1363671779322788083, 8873299752325832993, -7351272378371454174, -5340289963575795604, -2131577493191595822, -7172142046894206032, -3224033468766845683, -4413088492262412567, 8448642054028785646, -7839840566333865337, 7391063644050345412, 8589910247401964878, 3265297810234072654, 6848218286232705568, -959720364147361659, -549133060901738335, -1520112576649944912, 8581783051264586275, -6401421775981519373, 2608856350729682597, 1534715803071568445, -2113540808685775866, 2759609806641798944, -306116465711719864, 7860950987566022033, -3407600168957683258, 4914743676836525302, -7261070961936075513, -8456921759205566053, -4596626740219748345, 1707709669183986173, 8697481040494849216, -7744149612602688377, -6957651067445711848, -6484547393173871378, 6253476950012317369, 950752870410446712, -177944226174345963, 3482110815350016143, 1403223431196117681, -4170240261107926102, 4999053211028683431, -1476671084191114625, 8779212983580631516, 8538586552099952393, 4013303631696412294, -8206639392587447640, 6176385437575686727, -2609203090447589598, -2525053388271646900, 2814357603368803028, -939634251239111665, 1871520776113811337, -248224517866277714, -7144003872446926546, -4641291601409223626, 1366891219659876773, -5528838293582775189, -4598540042863463619, -6390904993812440085, 3698914605541515809, -643378177740342220, 3396178958383039687, -5702016786751655492, -3453094970352278960, 1203509602083182956, 6166974658942478270, -96444361315035726, 2179108537336109039, -1192228410685037644, 5141257828416945090}
INFO  2024-02-13 23:25:17,923 [shard 0: gms] storage_service - Set host_id=d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5 to be owned by node=127.0.67.3
INFO  2024-02-13 23:25:17,923 [shard 0: gms] storage_service - handle_state_normal: Nodes 127.0.67.3 and 127.0.67.4 have the same token -4294785196284927379. Ignoring 127.0.67.3
INFO  2024-02-13 23:25:17,924 [shard 0: gms] storage_service - handle_state_normal: endpoints_to_remove endpoint=127.0.67.3
WARN  2024-02-13 23:25:17,924 [shard 0: gms] gossip - Fail to send EchoMessage to 127.0.67.3: seastar::rpc::closed_error (connection is closed)
INFO  2024-02-13 23:25:17,930 [shard 0: gms] gossip - Removed endpoint 127.0.67.3
INFO  2024-02-13 23:25:17,930 [shard 0: gms] storage_service - handle_state_normal for 127.0.67.3/d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5 finished
INFO  2024-02-13 23:25:17,935 [shard 0:comp] compaction - [Compact system.scylla_local 25f797f0-cac7-11ee-a26b-f5fb68f403a6] Compacting [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gdl_1t25_5jbxc2gx3hyllwq5ra-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gdl_1szn_07i282gx3hyllwq5ra-big-Data.db:level=0:origin=compaction]
INFO  2024-02-13 23:25:17,936 [shard 0: gms] gossip - InetAddress 127.0.67.3 is now DOWN, status = shutdown

Could be unrelated.

In any case, if node 4 booted successfully, the "wait for alive" check should succeed.

@kbr-scylla (Contributor) commented:
The wait started at

23:23:45,180 985     ccm                            DEBUG    cluster.py          :762  | test_replace_shutdown_node[use_host_id-rbo_disabled]: node4: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.67.4', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.67.4', '--replace-node-first-boot', 'd4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True

and the boot finished at

INFO  2024-02-13 23:24:07,532 [shard 0:main] init - Scylla version 5.5.0~dev-0.20240213.3d81138852ba initialization completed.

That is within the 120-second timeout, so @fruch it's not a problem of a too-short timeout. There is some problem with node 1 not seeing node 4 in the correct state:

23:26:07,597 985     ccm                            DEBUG    cluster.py          :762  | test_replace_shutdown_node[use_host_id-rbo_disabled]: node1: watch_rest_for_alive: tofind={'127.0.67.4'} found=set(): tofind_host_id_map={'127.0.67.4': '99497bfb-ef2b-4856-a536-26bad13a4f40'} found_host_id_map={}
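
For reference, the condition that node 1 never reaches can be re-checked by hand against its REST API, using the same endpoints as the ccmlib helper in the traceback above. This is a minimal sketch; the loopback addresses and port 10000 are taken from this run's logs and are assumptions about the local setup.

```
import requests

API = "http://127.0.67.1:10000"   # node 1's REST API address/port in this run
TARGET = "127.0.67.4"             # the replacing node that node 1 should see

def node1_sees_target() -> bool:
    """True once node 1 reports TARGET as live, not joining, and with known tokens."""
    resp = requests.get(f"{API}/gossiper/endpoint/live")
    live = set(resp.json()) if resp.text else set()
    resp = requests.get(f"{API}/storage_service/nodes/joining")
    if resp.text:
        live -= set(resp.json())
    if TARGET not in live:
        return False
    # Seeing the node as "UP" is not enough; node 1 must also know its tokens.
    resp = requests.get(f"{API}/storage_service/tokens/{TARGET}")
    return resp.text != "[]"

print(node1_sees_target())  # stayed False for the whole 120 s in this failure
```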

@kbr-scylla (Contributor) commented Feb 22, 2024

Ah, node 1 is waiting for node 4 to enter NORMAL status, but it never observes it.

@kbr-scylla (Contributor) commented:
Could it be yet another instance of the same root cause?
#16902

@kbr-scylla (Contributor) commented:
Nodes not observing node 4 entering NORMAL could also be the reason why node 4 received a gossip event about node 3 (from some other node that thinks node 3 has not yet been replaced).

@kbr-scylla (Contributor) commented:
> Could it be yet another instance of the same root cause?
> #16902

And #16668

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent ~100%
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
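
To make the race described in the commit message above concrete, here is a minimal, self-contained simulation of the versioning logic: illustrative Python, not ScyllaDB code, and all names are invented.

```
# Simulates the interleaving described above: B's heart_beat is bumped to a
# version X between the two phases of replicate(), so A never requests B's
# newly versioned application states.

version_counter = 0  # stands in for the monotonic version generator

def next_version() -> int:
    global version_counter
    version_counter += 1
    return version_counter

# Node B's live endpoint state: state name -> version (values omitted).
b_live_state = {"heart_beat": next_version(), "STATUS": next_version()}

# add_local_application_state, phase 1: copy the state and bump the copy.
b_pending_copy = dict(b_live_state)
b_pending_copy["STATUS"] = next_version()    # e.g. STATUS=NORMAL with a new version

# The yield between phase 1 and phase 2: the gossiper loop bumps heart_beat.
b_live_state["heart_beat"] = next_version()  # this is "X", larger than the STATUS version above

# B answers a SYN/ACK from A using its *live* state; A records the max version seen.
a_max_seen_version_of_b = max(b_live_state.values())   # == X

# Phase 2 of replicate: overwrite the live state with the older-versioned copy.
b_live_state = b_pending_copy

# On the next round, A only asks for (and applies) versions greater than X.
newer = {name: v for name, v in b_live_state.items() if v > a_max_seen_version_of_b}
print(newer)  # {} -> A never learns B's new STATUS, matching the hang seen in this test
# The fix described above: bump heart_beat under the same per-endpoint lock that
# replicate() holds, so the bump cannot interleave between the two phases.
```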
denesb added a commit that referenced this issue Apr 16, 2024
Closes #18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
mergify bot pushed a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 17, 2024
(cherry picked from commit a0b331b)

(cherry picked from commit 7295509)

Refs #18184

Closes #18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024