
dtest: replace_node_when_two_nodes_dead scenario of repair_based_node_operations_test.TestRepairBasedNodeOperations.test_ignore_dead_nodes_for_replace_option is failing with ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds #17349

Closed
enaydanov opened this issue Feb 15, 2024 · 7 comments
Labels: area/raft, area/repair-based operations (repair-based node operations), area/topology changes, status/regression, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

enaydanov (Contributor) commented Feb 15, 2024

replace_node_when_two_nodes_dead scenario of repair_based_node_operations_test.TestRepairBasedNodeOperations.test_ignore_dead_nodes_for_replace_option is failing with ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds:

dtest log:

...
06:09:19,940 805     repair_based_node_operations_test DEBUG    repair_based_node_operations_test.py:170  | test_ignore_dead_nodes_for_replace_option: Start 127.0.46.6 and 127.0.46.7 nodes
06:09:19,946 805     ccm                            DEBUG    cluster.py          :762  | test_ignore_dead_nodes_for_replace_option: node6: Starting scylla: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.46.6', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.46.6', '--unsafe-bypass-fsync', '1'] wait_other_notice=True wait_for_binary_proto=True
06:09:21,427 805     cassandra.cluster              INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: New Cassandra host <Host: 127.0.46.8:9042 datacenter1> discovered
06:09:21,586 805     cassandra.pool                 WARNING  thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Error attempting to reconnect to 127.0.46.7:9042, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.46.7', 9042)]. Last error: Connection refused
06:09:22,989 805     cassandra.pool                 INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Successful reconnection to 127.0.46.6:9042, marking node up if it isn't already
06:09:22,989 805     cassandra.cluster              INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Host 127.0.46.6:9042 may be up; will prepare queries and open connection pool
06:09:25,592 805     cassandra.pool                 WARNING  thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Error attempting to reconnect to 127.0.46.7:9042, scheduling retry in 3.92 seconds: [Errno 111] Tried connecting to [('127.0.46.7', 9042)]. Last error: Connection refused
...
06:11:20,891 805     ccm                            DEBUG    cluster.py          :762  | test_ignore_dead_nodes_for_replace_option: node1: watch_rest_for_alive: tofind={'127.0.46.6'} found=set(): tofind_host_id_map={'127.0.46.6': 'cc76206a-052a-471e-9bc3-3f0b7166b4c3'} found_host_id_map={}
06:11:21,011 805     errors                         ERROR    conftest.py         :202  | test_ignore_dead_nodes_for_replace_option: test failed: 
self = <repair_based_node_operations_test.TestRepairBasedNodeOperations object at 0x7fbfb888bfd0>

    def test_ignore_dead_nodes_for_replace_option(self):
        """
        --ignore-dead-nodes-for-replace was added by commit
        https://github.com/scylladb/scylla/commit/eba4a4fba4e6f203742307c16448034f40713711
    
        This option allows ignoring dead nodes for replace operation.
        If this option is not set, replace operation will fail because one node is down.
        """
        enable_repair_based_node_ops = True
        self.prepare_cluster(nodes=7, enable_repair_based_node_ops=enable_repair_based_node_ops)
        self.prepare_schema(node=self.cluster.nodelist()[0])
    
        rbnos = RepairBasedNodeOperationsScenarios(tester=self)
>       rbnos.run_scenarios(rbno_enabled=enable_repair_based_node_ops, scenarios=[rbnos.replace_with_dead_nodes_scenario])

repair_based_node_operations_test.py:455: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
repair_based_node_operations_test.py:306: in run_scenarios
    tested_node = scenario.operation()
repair_based_node_operations_test.py:172: in replace_node_when_two_nodes_dead
    node.start(wait_other_notice=True, wait_for_binary_proto=True)
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:683: in start
    scylla_process = self._start_scylla(args=args, marks=marks, update_pid=update_pid,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:337: in _start_scylla
    node.watch_rest_for_alive(self, timeout=t, wait_normal_token_owner=wait_normal_token_owner)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7fbfb9742690>
nodes = <ccmlib.scylla_node.ScyllaNode object at 0x7fbedb9d6590>, timeout = 120
wait_normal_token_owner = True

    def watch_rest_for_alive(self, nodes, timeout=120, wait_normal_token_owner=True):
        """
        Use the REST API to wait until this node detects that the nodes listed
        in "nodes" become fully operational.
        This is similar to watch_log_for_alive but uses ScyllaDB's REST API
        instead of the log file and waits for the node to be really useable,
        not just "UP" (see issue #461)
    
        Params:
          - wait_normal_token_owner: return only when this node sees all other nodes as normal token owner (True by default).
        """
        logging.getLogger('urllib3.connectionpool').disabled = True
        try:
            nodes_tofind = nodes if isinstance(nodes, list) else [nodes]
            tofind = set([node.address() for node in nodes_tofind])
            tofind_host_id_map = dict([(node.address(), node.hostid()) for node in nodes_tofind])
            found = set()
            found_host_id_map = dict()
            url_live = f"http://{self.address()}:10000/gossiper/endpoint/live"
            url_joining = f"http://{self.address()}:10000/storage_service/nodes/joining"
            url_tokens = f"http://{self.address()}:10000/storage_service/tokens/"
            url_host_ids = f"http://{self.address()}:10000/storage_service/host_id"
            endtime = time.time() + timeout
            while time.time() < endtime:
                live = set()
                response = requests.get(url=url_live)
                if response.status_code == requests.codes.ok:
                    live = set(response.json())
                response = requests.get(url=url_joining)
                if response.status_code == requests.codes.ok:
                    live = live - set(response.json())
                # Verify that node knows not only about the existance of the
                # other node, but also its host_id as a normal token owner:
                if tofind.issubset(live):
                    # This node thinks that all given nodes are alive and not
                    # "joining", we're almost done, but still need to verify
                    # that the node knows the others' tokens.
                    check = tofind
                    have_no_tokens = set()
                    for n in check:
                        response = requests.get(url=url_tokens+n)
                        if response.text == '[]':
                            have_no_tokens.add(n)
                    if not have_no_tokens:
                        if not wait_normal_token_owner:
                            return
                        # and that the node knows that the others' are normal token owners.
                        host_id_map = dict()
                        response = requests.get(url=url_host_ids)
                        if response.status_code == requests.codes.ok:
                            for r in response.json():
                                host_id_map[r['key']] = r['value']
                        # Verify that the other nodes are considered normal token owners on this node
                        # and their host_id matches the host_id the client knows about
                        normal = set([addr for addr, id in host_id_map.items() \
                                    if addr in tofind_host_id_map and \
                                    (id == tofind_host_id_map[addr] or not tofind_host_id_map[addr])])
                        tofind = tofind.difference(normal)
                        if not tofind:
                            return
                        # Update cumulative maps for debugging
                        found = found.union(normal)
                        found_host_id_map.update(host_id_map)
                time.sleep(0.1)
            self.debug(f"watch_rest_for_alive: tofind={tofind} found={found}: tofind_host_id_map={tofind_host_id_map} found_host_id_map={found_host_id_map}")
>           raise TimeoutError(f"watch_rest_for_alive() timeout after {timeout} seconds")
E           ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds
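For reference, the same REST endpoints that watch_rest_for_alive() polls can be queried by hand when debugging this kind of timeout. A minimal sketch, assuming the default API port 10000 and node1's API address 127.0.46.1 (the address is an assumption based on the 127.0.46.x addresses in the dtest log above):

    import requests

    # Assumed API address of the node doing the waiting (node1); adjust as needed.
    api = "http://127.0.46.1:10000"

    # Same endpoints as watch_rest_for_alive() above.
    live = requests.get(f"{api}/gossiper/endpoint/live").json()
    joining = requests.get(f"{api}/storage_service/nodes/joining").json()
    host_ids = {e["key"]: e["value"]
                for e in requests.get(f"{api}/storage_service/host_id").json()}

    print("live endpoints:", live)
    print("joining endpoints:", joining)
    print("normal token owners (host_id map):", host_ids)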

node6.log:

...
INFO  2024-02-15 06:09:20,782 [shard 0:main] init - serving
INFO  2024-02-15 06:09:20,782 [shard 0:main] init - Scylla version 5.5.0~development-0.20240215.53f4805ebaf9 initialization completed.
INFO  2024-02-15 06:09:20,782 [shard 0:comp] compaction - [Compact system_schema.columns c23c3f10-cbc8-11ee-a23d-b5f1215606f0] Compacted 2 sstables to [/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-3gdn_0h3k_4l1cw2gtjbdeos8b8w-big-Data.db:level=0]. 37kB to 21kB (~57% of original) in 12ms = 3MB/s. ~256 total partitions merged to 7.
WARN  2024-02-15 06:10:50,281 [shard 0:main] service_level_controller - update_from_distributed_data: failed to update configuration for more than  90 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
INFO  2024-02-15 06:11:21,050 [shard 0:main] compaction_manager - Asked to stop
INFO  2024-02-15 06:11:21,050 [shard 0:main] view_update_generator - Terminating background fiber
INFO  2024-02-15 06:11:21,050 [shard 0:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
WARN  2024-02-15 06:11:21,050 [shard 0:strm] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO  2024-02-15 06:11:21,050 [shard 0:strm] gossip - failure_detector_loop: Finished main loop
INFO  2024-02-15 06:11:21,050 [shard 0:main] init - Signal received; shutting down
...

Scylla version: 5.5.0~development-0.20240215.53f4805ebaf9
This is a regression from at least 5.5.0~dev-0.20240211.13e16475fabb

https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/gating-dtest-release-with-consistent-topology-changes/2/artifact/logs-full.release.005/1707977484079_repair_based_node_operations_test.py%3A%3ATestRepairBasedNodeOperations%3A%3Atest_ignore_dead_nodes_for_replace_option/

yaronkaikov added the triage/master (Looking for assignee) and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Feb 15, 2024
enaydanov (Contributor, Author) commented:

5.5.0~dev-0.20240214.710d1826547e is affected too.

mykaul removed the triage/master (Looking for assignee) label Feb 15, 2024
kbr-scylla (Contributor) commented:

Similar to #15602. Possibly falls under #17493.

kbr-scylla added this to the 6.0 milestone Mar 13, 2024
kbr-scylla added the status/release blocker (Preventing from a release to be promoted) label Mar 19, 2024
gleb-cloudius (Contributor) commented:

The test tries to restart previously ignored nodes:

        logger.debug(f"Start {dead_node1_address} and {dead_node2_address} nodes")
        for node in [dead_node1, dead_node2]:
            node.start(wait_other_notice=True, wait_for_binary_proto=True)

kbr-scylla (Contributor) commented:

@gleb-cloudius please elaborate -- is this expected? Does the test need to be adjusted?

kbr-scylla removed the status/release blocker (Preventing from a release to be promoted) label Mar 25, 2024
gleb-cloudius (Contributor) commented:

> @gleb-cloudius please elaborate -- is this expected? Does the test need to be adjusted?

Ignored nodes are banned. They can no longer be restarted (they will not be able to communicate with the cluster). I am not sure why the test tries to restart them, but if we drop this part, the test should pass.

kbr-scylla (Contributor) commented:

Ok -- @temichus I reassigned the issue to you; please adjust the test. Since ignored nodes are now banned in raft-topology mode, it is expected that other nodes in the cluster will no longer see them in gossip, even when they are restarted.
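A minimal sketch of one possible adjustment, assuming only the restart step quoted above is changed (the surrounding scenario code is not shown in this issue, so the variable names follow the quoted snippet and the assertion is illustrative):

    # Hypothetical adjustment: ignored nodes are banned in raft-topology mode, so do not
    # restart them -- node.start(...) would time out in watch_rest_for_alive().
    logger.debug(f"Leave {dead_node1_address} and {dead_node2_address} down: ignored nodes are banned")
    for node in [dead_node1, dead_node2]:
        # The nodes were stopped earlier in the scenario; just check they stay down.
        assert not node.is_running(), f"{node.name} should remain down after being ignored during replace"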
