Tests / Sanity Tests / replace_address_test.TestReplaceAddress.test_replace_shutdown_node[use_host_id-rbo_disabled] #15602

Closed
Tracked by #17493
denesb opened this issue Oct 2, 2023 · 9 comments · Fixed by #18184
Labels: Backport candidate, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

Comments

@denesb (Contributor) commented Oct 2, 2023

Seen in #15581 (comment).
Jenkins job: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4005/
Test result link: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4005/testReport/junit/replace_address_test/TestReplaceAddress/Tests___Sanity_Tests___test_replace_shutdown_node_use_host_id_rbo_disabled_/

self = <replace_address_test.TestReplaceAddress object at 0x7f7363d4a550>
use_host_id = True

    @pytest.mark.parametrize("use_host_id", [True, False], ids=["use_host_id", "use_endpoint"])
    def test_replace_shutdown_node(self, use_host_id: bool):
        """
        @jira_ticket CASSANDRA-9871
        Test that we can replace a node that is shutdown gracefully.
        """
>       self._replace_node_test(gently=True, use_host_id=use_host_id)

replace_address_test.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
replace_address_test.py:135: in _replace_node_test
    node4.start(replace_node_host_id=replace_node_host_id,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:677: in start
    scylla_process = self._start_scylla(args, marks, update_pid,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:330: in _start_scylla
    node.watch_rest_for_alive(self, timeout=t)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7f7349437c10>
nodes = <ccmlib.scylla_node.ScyllaNode object at 0x7f734abdd8d0>, timeout = 120

    def watch_rest_for_alive(self, nodes, timeout=120):
        """
        Use the REST API to wait until this node detects that the nodes listed
        in "nodes" become fully operational and knows of its tokens.
        This is similar to watch_log_for_alive but uses ScyllaDB's REST API
        instead of the log file and waits for the node to be really useable,
        not just "UP" (see issue #461)
        """
        logging.getLogger('urllib3.connectionpool').disabled = True
        try:
            tofind = nodes if isinstance(nodes, list) else [nodes]
            tofind = set([node.address() for node in tofind])
            url_live = f"http://{self.address()}:10000/gossiper/endpoint/live"
            url_joining = f"http://{self.address()}:10000/storage_service/nodes/joining"
            url_tokens = f"http://{self.address()}:10000/storage_service/tokens/"
            endtime = time.time() + timeout
            while time.time() < endtime:
                live = set()
                response = requests.get(url=url_live)
                if response.text:
                    live = set(response.json())
                response = requests.get(url=url_joining)
                if response.text:
                    live = live - set(response.json())
                # Verify that node knows not only about the existance of the
                # other node, but also its tokens:
                if tofind.issubset(live):
                    # This node thinks that all given nodes are alive and not
                    # "joining", we're almost done, but still need to verify
                    # that the node knows the others' tokens.
                    check = tofind
                    tofind = set()
                    for n in check:
                        response = requests.get(url=url_tokens+n)
                        if response.text == '[]':
                            tofind.add(n)
                if not tofind:
                    return
                time.sleep(0.1)
>           raise TimeoutError(f"watch_rest_for_alive() timeout after {timeout} seconds")
E           ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:1372: TimeoutError
denesb added the "symptom/ci stability" label (Issues that failed in ScyllaDB CI - tests and framework) on Oct 2, 2023
@fruch (Contributor) commented Nov 16, 2023

Happened again on Nov 14:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/98885b60-844f-11ee-bf28-5927cc3d3ce3

It seems like we don't wait long enough in wait_other_notice, and the replacement might take more than 2 minutes.

For now, taking it out of gating.

@kbr-scylla (Contributor) commented:
Logs are gone and the Jenkins job was garbage collected, so this cannot be investigated.
Please upload logs to GitHub whenever reporting an issue.

kbr-scylla closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 22, 2024
kbr-scylla reopened this on Feb 22, 2024
@kbr-scylla (Contributor) commented:
Node 4 (replacing node) booted.

INFO  2024-02-13 23:24:07,532 [shard 0:main] init - Scylla version 5.5.0~dev-0.20240213.3d81138852ba initialization completed.

But node 3 (replaced node) tried to resurrect?

INFO  2024-02-13 23:24:17,474 [shard 0: gms] gossip - FatClient 127.0.67.3 has been silent for 30000ms, removing from gossip
INFO  2024-02-13 23:24:17,475 [shard 0: gms] token_metadata - Removed node 99497bfb-ef2b-4856-a536-26bad13a4f40 as pending replacing endpoint which replaces existing node d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5
INFO  2024-02-13 23:24:17,477 [shard 0: gms] gossip - removed 127.0.67.3 from _seeds, updated _seeds list = {127.0.67.1, 127.0.67.2}
INFO  2024-02-13 23:24:17,478 [shard 0: gms] gossip - Removed endpoint 127.0.67.3
WARN  2024-02-13 23:24:17,478 [shard 0: gms] gossip - === Gossip round FAIL: std::runtime_error (Host 127.0.67.3 does not have HOST_ID application_state)
INFO  2024-02-13 23:25:17,482 [shard 0: gms] gossip - 60000 ms elapsed, 127.0.67.3 gossip quarantine over
INFO  2024-02-13 23:25:17,922 [shard 0: gms] storage_service - Node 127.0.67.3 is in normal state, tokens: {-4294785196284927379, -4861343634213618721, -5321786174098909553, -7397747633982920162, 7356494811708677683, 220211549138195190, -3254444366566174500, 2893846497263795551, 1441074784552911625, -1770083480539845664, 6007430040555390715, 6298074233657520839, -5960709600521919108, 3759709800052062132, 6805735493811173483, 4505968732975790374, 1149706327595469562, 5561802904431211825, 9187846645534030414, 1376818661132105206, -7526210123239036990, 4994893635795998427, -4907920732886584587, -7438962222159366326, 4562078233970137158, -4435835693242168289, -6400216993867311901, 1771918098767231915, -9050458712746831038, 7875718615884556084, 7264855817667468185, -5400037845737851586, 5073381457322254605, 2824382474789100424, 1806474726863825245, -7233051299448812091, -7812644691462924492, 3782161773407008067, 9124504504748572572, -6482356163169012491, -9124395635065338387, -895994578848459920, -637670179518427525, -8442617363914781338, 1391960502142093578, 6273010418711457372, 7772116188879005730, -5246242773263942917, -4363078225417525730, 4452444831117025462, 7368453204489438542, -1276346978399395942, 7470723079546312477, 5268357887550599716, 2285911414331956501, 4810542258267088311, 4878869439099407527, -6314861654871981437, -6228806299448937051, -2289513586494258769, 257533754088335416, -7613515624575858020, -786993627763809860, 3170824115594718290, 5263507339014608373, -5655713019788863819, -6589766235180477050, 5709688465830681285, 5851064951209203948, 5931866822586243795, -4883515015159375676, 3355841149544094328, 5055070262128511759, 8930290523081110999, 8500175884397641440, -4859260425629062993, -8679122196560786309, 8169752934807027700, -2773038958057637852, -524753512615049682, -7785490701440818959, 8546311487042443236, -3454699277745797070, 4177504123408392224, 1825432326136456062, 3607858911278382877, 312646185654790247, -1745076696763312518, -6187171015933714575, -8393204340568429202, 8895628278764738676, -9110163120909716488, -6319871966704173715, -7593413510641732723, 1532135982859705596, 4753304106197110164, 3482505779338347803, -131947579693855985, 6881869152325684405, -3382109113091462735, 9211772746952534052, 7318561116742014360, -6338515371628010655, -4929451052734580635, -1326037342839268022, -462965217257465356, 523481979744513702, 5480025941145932833, -5713619175497303017, 683569722471203099, -1650146971812744986, -9012684641488856011, 1632158356957350685, -4698198776414384507, -1501620467428039534, -6607498917374653067, -1394657947395999699, 3082922541147126250, -6074024783379394679, 2175428808194677779, 1768623148019483736, 5578103676273252472, 601866373353952494, -7291149617792803441, 5185311057854867375, -441681849224078849, -7409230421985838758, -5952600469158351141, -8941737782836266266, -1740202383987325083, 8550313474396886194, -3822436722759924766, -3468586080835976864, 1216806325736339257, -667205661643043192, 8049373115413880167, -8175843797999452956, -6291542050531447671, -2842551664367602489, -1940402746507924194, -5484225238694228081, 8223168795917985376, 5721931271423970623, 6908714333930132185, 8868984960055389249, -7964813283601137826, 6149077196891026304, -7502931205442085762, -2517476347390353261, -1987582794588111576, -2864167392456114800, -5694342848502422040, -4513403407461767128, 1544576961942315033, 7393184154717769750, 6356748234595481964, -7765717711278636795, 3877565178975028780, 2288325340133808294, -3115383758561304203, 
-8848641033274969543, 7121276507929140888, 5624362942731640273, -8547475470942775717, 4078982383972396217, 7804907891580699721, 5033765915163216701, -9199456479663479593, -8391572929571693779, 6290985590214301596, 6306663374322376978, 1100121287315314867, 6435408865483207858, 8628504625178044975, -8093117753331404749, 6152418604516091038, 615347797078586772, 6480770966382294525, -973907173050842359, 6985886940278869013, -9210864788055687243, -7651965112643874660, 4243845615799226895, 1238459330137714742, 7716482442987566531, -1363671779322788083, 8873299752325832993, -7351272378371454174, -5340289963575795604, -2131577493191595822, -7172142046894206032, -3224033468766845683, -4413088492262412567, 8448642054028785646, -7839840566333865337, 7391063644050345412, 8589910247401964878, 3265297810234072654, 6848218286232705568, -959720364147361659, -549133060901738335, -1520112576649944912, 8581783051264586275, -6401421775981519373, 2608856350729682597, 1534715803071568445, -2113540808685775866, 2759609806641798944, -306116465711719864, 7860950987566022033, -3407600168957683258, 4914743676836525302, -7261070961936075513, -8456921759205566053, -4596626740219748345, 1707709669183986173, 8697481040494849216, -7744149612602688377, -6957651067445711848, -6484547393173871378, 6253476950012317369, 950752870410446712, -177944226174345963, 3482110815350016143, 1403223431196117681, -4170240261107926102, 4999053211028683431, -1476671084191114625, 8779212983580631516, 8538586552099952393, 4013303631696412294, -8206639392587447640, 6176385437575686727, -2609203090447589598, -2525053388271646900, 2814357603368803028, -939634251239111665, 1871520776113811337, -248224517866277714, -7144003872446926546, -4641291601409223626, 1366891219659876773, -5528838293582775189, -4598540042863463619, -6390904993812440085, 3698914605541515809, -643378177740342220, 3396178958383039687, -5702016786751655492, -3453094970352278960, 1203509602083182956, 6166974658942478270, -96444361315035726, 2179108537336109039, -1192228410685037644, 5141257828416945090}
INFO  2024-02-13 23:25:17,923 [shard 0: gms] storage_service - Set host_id=d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5 to be owned by node=127.0.67.3
INFO  2024-02-13 23:25:17,923 [shard 0: gms] storage_service - handle_state_normal: Nodes 127.0.67.3 and 127.0.67.4 have the same token -4294785196284927379. Ignoring 127.0.67.3
INFO  2024-02-13 23:25:17,924 [shard 0: gms] storage_service - handle_state_normal: endpoints_to_remove endpoint=127.0.67.3
WARN  2024-02-13 23:25:17,924 [shard 0: gms] gossip - Fail to send EchoMessage to 127.0.67.3: seastar::rpc::closed_error (connection is closed)
INFO  2024-02-13 23:25:17,930 [shard 0: gms] gossip - Removed endpoint 127.0.67.3
INFO  2024-02-13 23:25:17,930 [shard 0: gms] storage_service - handle_state_normal for 127.0.67.3/d4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5 finished
INFO  2024-02-13 23:25:17,935 [shard 0:comp] compaction - [Compact system.scylla_local 25f797f0-cac7-11ee-a26b-f5fb68f403a6] Compacting [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gdl_1t25_5jbxc2gx3hyllwq5ra-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gdl_1szn_07i282gx3hyllwq5ra-big-Data.db:level=0:origin=compaction]
INFO  2024-02-13 23:25:17,936 [shard 0: gms] gossip - InetAddress 127.0.67.3 is now DOWN, status = shutdown

Could be unrelated.

In any case, if node 4 booted successfully, the "wait for alive" check should succeed.

@kbr-scylla (Contributor) commented:
The wait started at

23:23:45,180 985     ccm                            DEBUG    cluster.py          :762  | test_replace_shutdown_node[use_host_id-rbo_disabled]: node4: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-_1dg7k7h/test/node4/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.67.4', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.67.4', '--replace-node-first-boot', 'd4a0cf4d-b0c1-40bb-ae48-5d35fa86b5c5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=True wait_for_binary_proto=True

and the boot finished at

INFO  2024-02-13 23:24:07,532 [shard 0:main] init - Scylla version 5.5.0~dev-0.20240213.3d81138852ba initialization completed.

That is within the 120-second timeout, so @fruch it's not a problem of a too-short timeout. There is some problem with node 1 not seeing node 4 in the correct state:

23:26:07,597 985     ccm                            DEBUG    cluster.py          :762  | test_replace_shutdown_node[use_host_id-rbo_disabled]: node1: watch_rest_for_alive: tofind={'127.0.67.4'} found=set(): tofind_host_id_map={'127.0.67.4': '99497bfb-ef2b-4856-a536-26bad13a4f40'} found_host_id_map={}
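
For reference, the condition that node 1 never reaches can be re-checked by hand against its REST API, using the same endpoints as the ccmlib helper in the traceback above. This is a minimal sketch; the loopback addresses and port 10000 are taken from this run's logs and are assumptions about the local setup.

```
import requests

API = "http://127.0.67.1:10000"   # node 1's REST API address/port in this run
TARGET = "127.0.67.4"             # the replacing node that node 1 should see

def node1_sees_target() -> bool:
    """True once node 1 reports TARGET as live, not joining, and with known tokens."""
    resp = requests.get(f"{API}/gossiper/endpoint/live")
    live = set(resp.json()) if resp.text else set()
    resp = requests.get(f"{API}/storage_service/nodes/joining")
    if resp.text:
        live -= set(resp.json())
    if TARGET not in live:
        return False
    # Seeing the node as "UP" is not enough; node 1 must also know its tokens.
    resp = requests.get(f"{API}/storage_service/tokens/{TARGET}")
    return resp.text != "[]"

print(node1_sees_target())  # stayed False for the whole 120 s in this failure
```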

@kbr-scylla (Contributor) commented Feb 22, 2024

Ah, node 1 is waiting for node 4 to enter NORMAL status, but it never observes it.

@kbr-scylla (Contributor) commented:
Could it be yet another instance of the same root cause?
#16902

@kbr-scylla (Contributor) commented:
Nodes not observing node 4 entering NORMAL could also be the reason why node 4 received a gossip event about node 3 (from some other node that thinks node 3 has not yet been replaced).

@kbr-scylla (Contributor) commented:
> Could it be yet another instance of the same root cause?
> #16902

And #16668

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent ~100%
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
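
To make the race described in the commit message above concrete, here is a minimal, self-contained simulation of the versioning logic: illustrative Python, not ScyllaDB code, and all names are invented.

```
# Simulates the interleaving described above: B's heart_beat is bumped to a
# version X between the two phases of replicate(), so A never requests B's
# newly versioned application states.

version_counter = 0  # stands in for the monotonic version generator

def next_version() -> int:
    global version_counter
    version_counter += 1
    return version_counter

# Node B's live endpoint state: state name -> version (values omitted).
b_live_state = {"heart_beat": next_version(), "STATUS": next_version()}

# add_local_application_state, phase 1: copy the state and bump the copy.
b_pending_copy = dict(b_live_state)
b_pending_copy["STATUS"] = next_version()    # e.g. STATUS=NORMAL with a new version

# The yield between phase 1 and phase 2: the gossiper loop bumps heart_beat.
b_live_state["heart_beat"] = next_version()  # this is "X", larger than the STATUS version above

# B answers a SYN/ACK from A using its *live* state; A records the max version seen.
a_max_seen_version_of_b = max(b_live_state.values())   # == X

# Phase 2 of replicate: overwrite the live state with the older-versioned copy.
b_live_state = b_pending_copy

# On the next round, A only asks for (and applies) versions greater than X.
newer = {name: v for name, v in b_live_state.items() if v > a_max_seen_version_of_b}
print(newer)  # {} -> A never learns B's new STATUS, matching the hang seen in this test
# The fix described above: bump heart_beat under the same per-endpoint lock that
# replicate() holds, so the bump cannot interleave between the two phases.
```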
denesb added a commit that referenced this issue Apr 16, 2024
Closes #18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
mergify bot pushed a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 17, 2024
(cherry picked from commit a0b331b)

(cherry picked from commit 7295509)

Refs #18184

Closes #18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024