
dtest: replace_node_when_two_nodes_dead scenario of repair_based_node_operations_test.TestRepairBasedNodeOperations.test_ignore_dead_nodes_for_replace_option is failing with ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds #17349

Closed
enaydanov opened this issue Feb 15, 2024 · 7 comments
Labels: area/raft, area/repair-based operations (repair-based node operations), area/topology changes, status/regression, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

enaydanov (Contributor) commented Feb 15, 2024

replace_node_when_two_nodes_dead scenario of repair_based_node_operations_test.TestRepairBasedNodeOperations.test_ignore_dead_nodes_for_replace_option is failing with ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds:

dtest log:

...
06:09:19,940 805     repair_based_node_operations_test DEBUG    repair_based_node_operations_test.py:170  | test_ignore_dead_nodes_for_replace_option: Start 127.0.46.6 and 127.0.46.7 nodes
06:09:19,946 805     ccm                            DEBUG    cluster.py          :762  | test_ignore_dead_nodes_for_replace_option: node6: Starting scylla: args=['/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc', '1', '--abort-on-internal-error', '1', '--api-address', '127.0.46.6', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.46.6', '--unsafe-bypass-fsync', '1'] wait_other_notice=True wait_for_binary_proto=True
06:09:21,427 805     cassandra.cluster              INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: New Cassandra host <Host: 127.0.46.8:9042 datacenter1> discovered
06:09:21,586 805     cassandra.pool                 WARNING  thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Error attempting to reconnect to 127.0.46.7:9042, scheduling retry in 4.0 seconds: [Errno 111] Tried connecting to [('127.0.46.7', 9042)]. Last error: Connection refused
06:09:22,989 805     cassandra.pool                 INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Successful reconnection to 127.0.46.6:9042, marking node up if it isn't already
06:09:22,989 805     cassandra.cluster              INFO     thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Host 127.0.46.6:9042 may be up; will prepare queries and open connection pool
06:09:25,592 805     cassandra.pool                 WARNING  thread.py           :58   | test_ignore_dead_nodes_for_replace_option: Error attempting to reconnect to 127.0.46.7:9042, scheduling retry in 3.92 seconds: [Errno 111] Tried connecting to [('127.0.46.7', 9042)]. Last error: Connection refused
...
06:11:20,891 805     ccm                            DEBUG    cluster.py          :762  | test_ignore_dead_nodes_for_replace_option: node1: watch_rest_for_alive: tofind={'127.0.46.6'} found=set(): tofind_host_id_map={'127.0.46.6': 'cc76206a-052a-471e-9bc3-3f0b7166b4c3'} found_host_id_map={}
06:11:21,011 805     errors                         ERROR    conftest.py         :202  | test_ignore_dead_nodes_for_replace_option: test failed: 
self = <repair_based_node_operations_test.TestRepairBasedNodeOperations object at 0x7fbfb888bfd0>

    def test_ignore_dead_nodes_for_replace_option(self):
        """
        --ignore-dead-nodes-for-replace was added by commit
        https://github.com/scylladb/scylla/commit/eba4a4fba4e6f203742307c16448034f40713711
    
        This option allows ignoring dead nodes for replace operation.
        If this option is not set, replace operation will fail because one node is down.
        """
        enable_repair_based_node_ops = True
        self.prepare_cluster(nodes=7, enable_repair_based_node_ops=enable_repair_based_node_ops)
        self.prepare_schema(node=self.cluster.nodelist()[0])
    
        rbnos = RepairBasedNodeOperationsScenarios(tester=self)
>       rbnos.run_scenarios(rbno_enabled=enable_repair_based_node_ops, scenarios=[rbnos.replace_with_dead_nodes_scenario])

repair_based_node_operations_test.py:455: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
repair_based_node_operations_test.py:306: in run_scenarios
    tested_node = scenario.operation()
repair_based_node_operations_test.py:172: in replace_node_when_two_nodes_dead
    node.start(wait_other_notice=True, wait_for_binary_proto=True)
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:683: in start
    scylla_process = self._start_scylla(args=args, marks=marks, update_pid=update_pid,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:337: in _start_scylla
    node.watch_rest_for_alive(self, timeout=t, wait_normal_token_owner=wait_normal_token_owner)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7fbfb9742690>
nodes = <ccmlib.scylla_node.ScyllaNode object at 0x7fbedb9d6590>, timeout = 120
wait_normal_token_owner = True

    def watch_rest_for_alive(self, nodes, timeout=120, wait_normal_token_owner=True):
        """
        Use the REST API to wait until this node detects that the nodes listed
        in "nodes" become fully operational.
        This is similar to watch_log_for_alive but uses ScyllaDB's REST API
        instead of the log file and waits for the node to be really useable,
        not just "UP" (see issue #461)
    
        Params:
          - wait_normal_token_owner: return only when this node sees all other nodes as normal token owner (True by default).
        """
        logging.getLogger('urllib3.connectionpool').disabled = True
        try:
            nodes_tofind = nodes if isinstance(nodes, list) else [nodes]
            tofind = set([node.address() for node in nodes_tofind])
            tofind_host_id_map = dict([(node.address(), node.hostid()) for node in nodes_tofind])
            found = set()
            found_host_id_map = dict()
            url_live = f"http://{self.address()}:10000/gossiper/endpoint/live"
            url_joining = f"http://{self.address()}:10000/storage_service/nodes/joining"
            url_tokens = f"http://{self.address()}:10000/storage_service/tokens/"
            url_host_ids = f"http://{self.address()}:10000/storage_service/host_id"
            endtime = time.time() + timeout
            while time.time() < endtime:
                live = set()
                response = requests.get(url=url_live)
                if response.status_code == requests.codes.ok:
                    live = set(response.json())
                response = requests.get(url=url_joining)
                if response.status_code == requests.codes.ok:
                    live = live - set(response.json())
                # Verify that node knows not only about the existance of the
                # other node, but also its host_id as a normal token owner:
                if tofind.issubset(live):
                    # This node thinks that all given nodes are alive and not
                    # "joining", we're almost done, but still need to verify
                    # that the node knows the others' tokens.
                    check = tofind
                    have_no_tokens = set()
                    for n in check:
                        response = requests.get(url=url_tokens+n)
                        if response.text == '[]':
                            have_no_tokens.add(n)
                    if not have_no_tokens:
                        if not wait_normal_token_owner:
                            return
                        # and that the node knows that the others' are normal token owners.
                        host_id_map = dict()
                        response = requests.get(url=url_host_ids)
                        if response.status_code == requests.codes.ok:
                            for r in response.json():
                                host_id_map[r['key']] = r['value']
                        # Verify that the other nodes are considered normal token owners on this node
                        # and their host_id matches the host_id the client knows about
                        normal = set([addr for addr, id in host_id_map.items() \
                                    if addr in tofind_host_id_map and \
                                    (id == tofind_host_id_map[addr] or not tofind_host_id_map[addr])])
                        tofind = tofind.difference(normal)
                        if not tofind:
                            return
                        # Update cumulative maps for debugging
                        found = found.union(normal)
                        found_host_id_map.update(host_id_map)
                time.sleep(0.1)
            self.debug(f"watch_rest_for_alive: tofind={tofind} found={found}: tofind_host_id_map={tofind_host_id_map} found_host_id_map={found_host_id_map}")
>           raise TimeoutError(f"watch_rest_for_alive() timeout after {timeout} seconds")
E           ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds
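For reference, the same REST endpoints that watch_rest_for_alive() polls can be queried by hand when debugging this kind of timeout. A minimal sketch, assuming the default API port 10000 and node1's API address 127.0.46.1 (the address is an assumption based on the 127.0.46.x addresses in the dtest log above):

    import requests

    # Assumed API address of the node doing the waiting (node1); adjust as needed.
    api = "http://127.0.46.1:10000"

    # Same endpoints as watch_rest_for_alive() above.
    live = requests.get(f"{api}/gossiper/endpoint/live").json()
    joining = requests.get(f"{api}/storage_service/nodes/joining").json()
    host_ids = {e["key"]: e["value"]
                for e in requests.get(f"{api}/storage_service/host_id").json()}

    print("live endpoints:", live)
    print("joining endpoints:", joining)
    print("normal token owners (host_id map):", host_ids)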

node6.log:

...
INFO  2024-02-15 06:09:20,782 [shard 0:main] init - serving
INFO  2024-02-15 06:09:20,782 [shard 0:main] init - Scylla version 5.5.0~development-0.20240215.53f4805ebaf9 initialization completed.
INFO  2024-02-15 06:09:20,782 [shard 0:comp] compaction - [Compact system_schema.columns c23c3f10-cbc8-11ee-a23d-b5f1215606f0] Compacted 2 sstables to [/jenkins/workspace/scylla-master/releng-testing/gating-dtest-release-with-consistent-topology-changes/scylla/.dtest/dtest-1k0j2lwn/test/node6/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-3gdn_0h3k_4l1cw2gtjbdeos8b8w-big-Data.db:level=0]. 37kB to 21kB (~57% of original) in 12ms = 3MB/s. ~256 total partitions merged to 7.
WARN  2024-02-15 06:10:50,281 [shard 0:main] service_level_controller - update_from_distributed_data: failed to update configuration for more than  90 seconds : exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
INFO  2024-02-15 06:11:21,050 [shard 0:main] compaction_manager - Asked to stop
INFO  2024-02-15 06:11:21,050 [shard 0:main] view_update_generator - Terminating background fiber
INFO  2024-02-15 06:11:21,050 [shard 0:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
WARN  2024-02-15 06:11:21,050 [shard 0:strm] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO  2024-02-15 06:11:21,050 [shard 0:strm] gossip - failure_detector_loop: Finished main loop
INFO  2024-02-15 06:11:21,050 [shard 0:main] init - Signal received; shutting down
...

Scylla version: 5.5.0~development-0.20240215.53f4805ebaf9
This is a regression from at least 5.5.0~dev-0.20240211.13e16475fabb

https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/gating-dtest-release-with-consistent-topology-changes/2/artifact/logs-full.release.005/1707977484079_repair_based_node_operations_test.py%3A%3ATestRepairBasedNodeOperations%3A%3Atest_ignore_dead_nodes_for_replace_option/

yaronkaikov added the triage/master (Looking for assignee) and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Feb 15, 2024
enaydanov (Contributor, Author) commented:

5.5.0~dev-0.20240214.710d1826547e is affected too.

mykaul removed the triage/master (Looking for assignee) label Feb 15, 2024
kbr-scylla (Contributor) commented:

Similar to #15602. Possibly falls under #17493.

kbr-scylla added this to the 6.0 milestone Mar 13, 2024
kbr-scylla added the status/release blocker (Preventing from a release to be promoted) label Mar 19, 2024
gleb-cloudius (Contributor) commented:

The test tries to restart previously ignored nodes:

        logger.debug(f"Start {dead_node1_address} and {dead_node2_address} nodes")
        for node in [dead_node1, dead_node2]:
            node.start(wait_other_notice=True, wait_for_binary_proto=True)

kbr-scylla (Contributor) commented:

@gleb-cloudius please elaborate -- is this expected? Does the test need to be adjusted?

kbr-scylla removed the status/release blocker (Preventing from a release to be promoted) label Mar 25, 2024
gleb-cloudius (Contributor) commented:

> @gleb-cloudius please elaborate -- is this expected? Does the test need to be adjusted?

Ignored nodes are banned. They can no longer be restarted (they will not be able to communicate with the cluster). I am not sure why the test tries to restart them, but if we drop this part, the test should pass.

kbr-scylla (Contributor) commented:

Ok -- @temichus I reassigned the issue to you; please adjust the test. Since ignored nodes are now banned in raft-topology mode, it is expected that other nodes in the cluster will no longer see them in gossip, even when they are restarted.
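A minimal sketch of one possible adjustment, assuming only the restart step quoted above is changed (the surrounding scenario code is not shown in this issue, so the variable names follow the quoted snippet and the assertion is illustrative):

    # Hypothetical adjustment: ignored nodes are banned in raft-topology mode, so do not
    # restart them -- node.start(...) would time out in watch_rest_for_alive().
    logger.debug(f"Leave {dead_node1_address} and {dead_node2_address} down: ignored nodes are banned")
    for node in [dead_node1, dead_node2]:
        # The nodes were stopped earlier in the scenario; just check they stay down.
        assert not node.is_running(), f"{node.name} should remain down after being ignored during replace"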
