Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_putget_with_internode_rack_ssl_with_dc_compression: joining node did not start bootstrap #16099

Closed
bhalevy opened this issue Nov 20, 2023 · 10 comments
Assignees
Labels
triage/master Looking for assignee

Comments

@bhalevy
Copy link
Member

bhalevy commented Nov 20, 2023

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/266/testReport/internode_ssl_test/TestInternodeSSL/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split043___test_putget_with_internode_rack_ssl_with_dc_compression/
Scylla version 7b08886

self = <ccmlib.scylla_node.ScyllaNode object at 0x7f86884715d0>
args = ['/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/bin/scylla', '--options-file', ...scylla/.dtest/dtest-m3fkpk0f/test/node3/conf/scylla.yaml', '--log-to-stdout', '1', '--abort-on-seastar-bad-alloc', ...]
marks = [(<ccmlib.scylla_node.ScyllaNode object at 0x7f8688470c50>, 90074), (<ccmlib.scylla_node.ScyllaNode object at 0x7f86895652d0>, 108616)]
update_pid = False, wait_other_notice = True, wait_for_binary_proto = True
ext_env = {'ASAN_OPTIONS': 'disable_coredump=0:abort_on_error=1', 'BOOST_TEST_CATCH_SYSTEM_ERRORS': 'no', 'LLVM_PROFILE_FILE': '...test-debug/scylla-dtest/logs-full.debug.043/coverage/%3m.profraw', 'UBSAN_OPTIONS': 'halt_on_error=1:abort_on_error=1'}

    def _start_scylla(self, args, marks, update_pid, wait_other_notice,
                      wait_for_binary_proto, ext_env):
        log_file = os.path.join(self.get_path(), 'logs', 'system.log')
        # In case we are restarting a node
        # we risk reading the old cassandra.pid file
        self._delete_old_pid()
    
        env = self._get_environ(ext_env, SCYLLA_HOME=self.get_path())
    
        with open(log_file, 'a') as scylla_log:
            self._process_scylla = \
                subprocess.Popen(args, stdout=scylla_log, stderr=scylla_log, close_fds=True, env=env)
        self._process_scylla.poll()
        # When running on ccm standalone, the waiter thread would block
        # the create commands. Besides in that mode, waiting is unnecessary,
        # since the original popen reference is garbage collected.
        standalone = os.environ.get('SCYLLA_CCM_STANDALONE', None)
        if standalone is None:
            self._process_scylla_waiter = threading.Thread(target=self._wait_for_scylla)
            # Don't block the main thread on abnormal shutdown
            self._process_scylla_waiter.daemon = True
            self._process_scylla_waiter.start()
        pid_filename = os.path.join(self.get_path(), 'cassandra.pid')
        with open(pid_filename, 'w') as pid_file:
            pid_file.write(str(self._process_scylla.pid))
    
        if update_pid:
            self._update_pid(self._process_scylla)
            if not self.is_running():
                raise NodeError(f"Error starting node {self.name}",
                                self._process_scylla)
    
        if wait_for_binary_proto:
            t = self.cluster.default_wait_for_binary_proto
            from_mark = self.mark
            try:
>               self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=t)

../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:322: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../scylla/.local/lib/python3.11/site-packages/ccmlib/node.py:542: in wait_for_binary_interface
    self.watch_log_for("Starting listening for CQL clients", **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7f86884715d0>
exprs = 'Starting listening for CQL clients', from_mark = 0, timeout = 900
process = <Popen: returncode: None args: ['/jenkins/workspace/scylla-master/dtest-debu...>
verbose = False, filename = 'system.log'

    def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbose=False, filename='system.log'):
        """
        Watch the log until one or more (regular) expression are found.
        This methods when all the expressions have been found or the method
        timeouts (a TimeoutError is then raised). On successful completion,
        a list of pair (line matched, match object) is returned.
        """
        elapsed = 0
        tofind = [exprs] if isinstance(exprs, str) else exprs
        tofind = [re.compile(e) for e in tofind]
        matchings = []
        reads = ""
        if len(tofind) == 0:
            return None
    
        log_file = os.path.join(self.get_path(), 'logs', filename)
        while not os.path.exists(log_file):
            time.sleep(.1)
            if process:
                process.poll()
                if process.returncode is not None:
                    self.print_process_output(self.name, process, verbose)
                    if process.returncode != 0:
                        raise RuntimeError()  # Shouldn't reuse RuntimeError but I'm lazy
    
        with open(log_file) as f:
            if from_mark:
                f.seek(from_mark)
    
            while True:
                # First, if we have a process to check, then check it.
                # Skip on Windows - stdout/stderr is cassandra.bat
                if not common.is_win():
                    if process:
                        process.poll()
                        if process.returncode is not None:
                            self.print_process_output(self.name, process, verbose)
                            if process.returncode != 0:
                                raise RuntimeError()  # Shouldn't reuse RuntimeError but I'm lazy
    
                line = f.readline()
                if line:
                    reads = reads + line
                    for e in tofind:
                        m = e.search(line)
                        if m:
                            matchings.append((line, m))
    
                            # This is bogus! - One should not modify the list from inside the loop which iterates on
                            # that list.
                            # However since we are going to break from the loop a few lines below that should be ok.
                            tofind.remove(e)
    
                            if len(tofind) == 0:
                                return matchings[0] if isinstance(exprs, str) else matchings
                            break
                else:
                    # yep, it's ugly
                    time.sleep(0.01)
                    elapsed = elapsed + 1
                    if elapsed > 100 * timeout:
>                       raise TimeoutError(time.strftime("%d %b %Y %H:%M:%S", time.gmtime()) + " [" + self.name + "] Missing: " + str(
                            [e.pattern for e in tofind]) + ":\n" + reads[:50] + f".....\nSee {filename} for remainder")
E                       ccmlib.node.TimeoutError: 19 Nov 2023 08:17:41 [node3] Missing: ['Starting listening for CQL clients']:
E                       Scylla version 5.5.0~dev-0.20231113.7b08886e8dd8 w.....
E                       See system.log for remainder

../scylla/.local/lib/python3.11/site-packages/ccmlib/node.py:495: TimeoutError

During handling of the above exception, another exception occurred:

self = <internode_ssl_test.TestInternodeSSL object at 0x7f86891f09d0>

    def test_putget_with_internode_rack_ssl_with_dc_compression(self):
        """
        Simple putget test with internode ssl enabled
        with 'dc' internode compression and 'rack' internode encryption.
        """
>       self.__putget_with_internode_ssl_test('dc', internode_encryption='rack', dcs=2)

internode_ssl_test.py:56: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
internode_ssl_test.py:84: in __putget_with_internode_ssl_test
    cluster.populate([3 for i in range(dcs)]).start(
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_cluster.py:169: in start
    started = self.start_nodes(**kwargs)
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_cluster.py:137: in start_nodes
    p = node.start(update_pid=False, jvm_args=jvm_args,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:677: in start
    scylla_process = self._start_scylla(args, marks, update_pid,
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:324: in _start_scylla
    self.wait_for_starting(from_mark=self.mark, timeout=t)
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:259: in wait_for_starting
    if not self.watch_log_for(f"{starting_message}|{bootstrap_message}|{resharding_message}", from_mark=from_mark, timeout=timeout, process=process):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7f86884715d0>
exprs = 'Starting listening for CQL clients|storage_service .* Starting to bootstrap|(compaction|database) -.*Resharding'
from_mark = 0, timeout = 900
process = <Popen: returncode: None args: ['/jenkins/workspace/scylla-master/dtest-debu...>
verbose = False, filename = 'system.log'

    def watch_log_for(self, exprs, from_mark=None, timeout=600, process=None, verbose=False, filename='system.log'):
        """
        Watch the log until one or more (regular) expression are found.
        This methods when all the expressions have been found or the method
        timeouts (a TimeoutError is then raised). On successful completion,
        a list of pair (line matched, match object) is returned.
        """
        elapsed = 0
        tofind = [exprs] if isinstance(exprs, str) else exprs
        tofind = [re.compile(e) for e in tofind]
        matchings = []
        reads = ""
        if len(tofind) == 0:
            return None
    
        log_file = os.path.join(self.get_path(), 'logs', filename)
        while not os.path.exists(log_file):
            time.sleep(.1)
            if process:
                process.poll()
                if process.returncode is not None:
                    self.print_process_output(self.name, process, verbose)
                    if process.returncode != 0:
                        raise RuntimeError()  # Shouldn't reuse RuntimeError but I'm lazy
    
        with open(log_file) as f:
            if from_mark:
                f.seek(from_mark)
    
            while True:
                # First, if we have a process to check, then check it.
                # Skip on Windows - stdout/stderr is cassandra.bat
                if not common.is_win():
                    if process:
                        process.poll()
                        if process.returncode is not None:
                            self.print_process_output(self.name, process, verbose)
                            if process.returncode != 0:
                                raise RuntimeError()  # Shouldn't reuse RuntimeError but I'm lazy
    
                line = f.readline()
                if line:
                    reads = reads + line
                    for e in tofind:
                        m = e.search(line)
                        if m:
                            matchings.append((line, m))
    
                            # This is bogus! - One should not modify the list from inside the loop which iterates on
                            # that list.
                            # However since we are going to break from the loop a few lines below that should be ok.
                            tofind.remove(e)
    
                            if len(tofind) == 0:
                                return matchings[0] if isinstance(exprs, str) else matchings
                            break
                else:
                    # yep, it's ugly
                    time.sleep(0.01)
                    elapsed = elapsed + 1
                    if elapsed > 100 * timeout:
>                       raise TimeoutError(time.strftime("%d %b %Y %H:%M:%S", time.gmtime()) + " [" + self.name + "] Missing: " + str(
                            [e.pattern for e in tofind]) + ":\n" + reads[:50] + f".....\nSee {filename} for remainder")
E                       ccmlib.node.TimeoutError: 19 Nov 2023 08:32:50 [node3] Missing: ['Starting listening for CQL clients|storage_service .* Starting to bootstrap|(compaction|database) -.*Resharding']:
E                       Scylla version 5.5.0~dev-0.20231113.7b08886e8dd8 w.....
E                       See system.log for remainder

../scylla/.local/lib/python3.11/site-packages/ccmlib/node.py:495: TimeoutError

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node1.log

INFO  2023-11-19 08:02:36,562 [shard 0: gms] gossip - InetAddress 127.0.8.3 is now UP, status = UNKNOWN
INFO  2023-11-19 08:02:36,761 [shard 0:comp] compaction - [Compact system.scylla_local 007a4da0-86b2-11ee-9c48-c2d393da6d24] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcc_3b64g2dk4w7aj6h06c-big-Data.db:level=0]. 12kB to 6kB (~51% of original) in 201ms = 59kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:02:37,346 [shard 0:comp] compaction - [Compact system.scylla_local 00f4d4d0-86b2-11ee-9c48-c2d393da6d24] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_1qssw2dk4w7aj6h06c-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcc_3b64g2dk4w7aj6h06c-big-Data.db:level=0:origin=compaction]
INFO  2023-11-19 08:02:37,464 [shard 0:comp] compaction - [Compact system.scylla_local 00f4d4d0-86b2-11ee-9c48-c2d393da6d24] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_260n42dk4w7aj6h06c-big-Data.db:level=0]. 12kB to 6kB (~51% of original) in 97ms = 123kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:32:50,302 [shard 0:main] init - Signal received; shutting down
INFO  2023-11-19 08:32:50,302 [shard 0:main] init - Shutting down view builder ops
INFO  2023-11-19 08:32:50,302 [shard 0:main] view - Draining view builder
INFO  2023-11-19 08:32:50,303 [shard 0:main] compaction_manager - Asked to stop
INFO  2023-11-19 08:32:50,303 [shard 0:main] view_update_generator - Terminating background fiber
INFO  2023-11-19 08:32:50,303 [shard 0:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
INFO  2023-11-19 08:32:50,304 [shard 1:main] compaction_manager - Asked to stop
INFO  2023-11-19 08:32:50,304 [shard 1:main] view_update_generator - Terminating background fiber
INFO  2023-11-19 08:32:50,304 [shard 1:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
WARN  2023-11-19 08:32:50,305 [shard 0:strm] gossip - failure_detector_loop: Got error in the loop, live_nodes={127.0.8.3, 127.0.8.2}: seastar::sleep_aborted (Sleep is aborted)

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node2.log

INFO  2023-11-19 08:02:36,544 [shard 0: gms] gossip - InetAddress 127.0.8.3 is now UP, status = UNKNOWN
INFO  2023-11-19 08:02:36,757 [shard 0:comp] compaction - [Compact system.scylla_local 007a4da0-86b2-11ee-8298-d8a6adc483a1] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node2/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcc_3c8pc1zhvn97d8n4pt-big-Data.db:level=0]. 12kB to 6kB (~51% of original) in 200ms = 60kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:02:37,521 [shard 0:comp] compaction - [Compact system.scylla_local 010fd6e0-86b2-11ee-8298-d8a6adc483a1] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node2/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_2sits1zhvn97d8n4pt-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node2/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcc_3c8pc1zhvn97d8n4pt-big-Data.db:level=0:origin=compaction]
INFO  2023-11-19 08:02:37,583 [shard 0:comp] compaction - [Compact system.scylla_local 010fd6e0-86b2-11ee-8298-d8a6adc483a1] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node2/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_34ycw1zhvn97d8n4pt-big-Data.db:level=0]. 12kB to 6kB (~51% of original) in 46ms = 260kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:32:50,302 [shard 0:main] compaction_manager - Asked to stop
INFO  2023-11-19 08:32:50,303 [shard 0:main] view_update_generator - Terminating background fiber
INFO  2023-11-19 08:32:50,303 [shard 0:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
INFO  2023-11-19 08:32:50,304 [shard 0:main] init - Signal received; shutting down
INFO  2023-11-19 08:32:50,304 [shard 0:main] init - Shutting down view builder ops
INFO  2023-11-19 08:32:50,304 [shard 0:main] view - Draining view builder
INFO  2023-11-19 08:32:50,304 [shard 1:main] compaction_manager - Asked to stop
INFO  2023-11-19 08:32:50,305 [shard 1:main] view_update_generator - Terminating background fiber
INFO  2023-11-19 08:32:50,305 [shard 0:main] compaction_manager - Stopped
INFO  2023-11-19 08:32:50,305 [shard 1:strm] view_update_generator - leaving 0 unstaged sstables unprocessed
INFO  2023-11-19 08:32:50,306 [shard 1:main] view - Draining view builder
WARN  2023-11-19 08:32:50,306 [shard 0:strm] gossip - failure_detector_loop: Got error in the loop, live_nodes={127.0.8.3, 127.0.8.1}: seastar::sleep_aborted (Sleep is aborted)

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node3.log

INFO  2023-11-19 08:02:37,029 [shard 0:strm] storage_service - entering JOINING mode
INFO  2023-11-19 08:02:37,036 [shard 0:strm] storage_service - waiting for ring information
INFO  2023-11-19 08:02:37,036 [shard 0:strm] storage_service - waiting for schema information to complete
INFO  2023-11-19 08:02:37,058 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,232 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,362 [shard 0:comp] compaction - [Compact system.scylla_local 00b581e0-86b2-11ee-98fa-594b220ab4df] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5xh6o2bqz1stvi3t9b-big-Data.db:level=0]. 11kB to 6kB (~55% of original) in 308ms = 36kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:02:37,382 [shard 0:comp] compaction - [Compact system.scylla_local 00f941a0-86b2-11ee-98fa-594b220ab4df] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5jz2o2bqz1stvi3t9b-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_05sc02bqz1stvi3t9b-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_16nhs2bqz1stvi3t9b-big-Data.db:level=0:origin=memtable]
INFO  2023-11-19 08:02:37,382 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,509 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,627 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,650 [shard 0:comp] compaction - [Compact system.scylla_local 00f941a0-86b2-11ee-98fa-594b220ab4df] Compacted 3 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_2jqgw2bqz1stvi3t9b-big-Data.db:level=0]. 16kB to 6kB (~37% of original) in 245ms = 69kB/s. ~384 total partitions merged to 2.
INFO  2023-11-19 08:02:37,653 [shard 0:comp] compaction - [Compact system.scylla_local 01242230-86b2-11ee-98fa-594b220ab4df] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5xh6o2bqz1stvi3t9b-big-Data.db:level=0:origin=compaction,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_2jqgw2bqz1stvi3t9b-big-Data.db:level=0:origin=compaction]
INFO  2023-11-19 08:02:37,715 [shard 0:comp] compaction - [Compact system.scylla_local 01242230-86b2-11ee-98fa-594b220ab4df] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_3xob42bqz1stvi3t9b-big-Data.db:level=0]. 12kB to 6kB (~49% of original) in 45ms = 277kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:32:50,302 [shard 0:main] compaction_manager - Asked to stop
@bhalevy bhalevy added the triage/master Looking for assignee label Nov 20, 2023
@bhalevy
Copy link
Member Author

bhalevy commented Nov 20, 2023

@kbr-scylla can you please look for clues in the logs?

@mykaul
Copy link
Contributor

mykaul commented Nov 20, 2023

I can see:
https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node1.log :

INFO  2023-11-19 08:01:01,609 [shard 0:main] init - starting native transport
INFO  2023-11-19 08:01:01,622 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 127.0.8.1:9042 (unencrypted, non-shard-aware)
INFO  2023-11-19 08:01:01,622 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 127.0.8.1:19042 (unencrypted, shard-aware)
INFO  2023-11-19 08:01:01,632 [shard 0:main] init - serving
INFO  2023-11-19 08:01:01,632 [shard 0:main] init - Scylla version 5.5.0~dev-0.20231113.7b08886e8dd8 initialization completed.

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node2.log :

INFO  2023-11-19 08:02:31,376 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 127.0.8.2:19042 (unencrypted, shard-aware)
INFO  2023-11-19 08:02:31,376 [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 127.0.8.2:9042 (unencrypted, non-shard-aware)
INFO  2023-11-19 08:02:31,389 [shard 0:main] init - serving
INFO  2023-11-19 08:02:31,390 [shard 0:main] init - Scylla version 5.5.0~dev-0.20231113.7b08886e8dd8 initialization completed.

Node 3 is the challenging one - https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/266/artifact/logs-full.debug.043/1700382774310_internode_ssl_test.py%3A%3ATestInternodeSSL%3A%3Atest_putget_with_internode_rack_ssl_with_dc_compression/node3.log has some errors. Gossip issues?

INFO  2023-11-19 08:02:37,036 [shard 0:strm] storage_service - waiting for schema information to complete
INFO  2023-11-19 08:02:37,058 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,232 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,362 [shard 0:comp] compaction - [Compact system.scylla_local 00b581e0-86b2-11ee-98fa-594b220ab4df] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5xh6o2bqz1stvi3t9b-big-Data.db:level=0]. 11kB to 6kB (~55% of original) in 308ms = 36kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:02:37,382 [shard 0:comp] compaction - [Compact system.scylla_local 00f941a0-86b2-11ee-98fa-594b220ab4df] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5jz2o2bqz1stvi3t9b-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_05sc02bqz1stvi3t9b-big-Data.db:level=0:origin=memtable,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_16nhs2bqz1stvi3t9b-big-Data.db:level=0:origin=memtable]
INFO  2023-11-19 08:02:37,382 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,509 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,627 [shard 0:strm] schema_tables - Schema version changed to 59adb24e-f3cd-3e02-97f0-5b395827453f
INFO  2023-11-19 08:02:37,650 [shard 0:comp] compaction - [Compact system.scylla_local 00f941a0-86b2-11ee-98fa-594b220ab4df] Compacted 3 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_2jqgw2bqz1stvi3t9b-big-Data.db:level=0]. 16kB to 6kB (~37% of original) in 245ms = 69kB/s. ~384 total partitions merged to 2.
INFO  2023-11-19 08:02:37,653 [shard 0:comp] compaction - [Compact system.scylla_local 01242230-86b2-11ee-98fa-594b220ab4df] Compacting [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/md-3gb7_0mcc_5xh6o2bqz1stvi3t9b-big-Data.db:level=0:origin=compaction,/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_2jqgw2bqz1stvi3t9b-big-Data.db:level=0:origin=compaction]
INFO  2023-11-19 08:02:37,715 [shard 0:comp] compaction - [Compact system.scylla_local 01242230-86b2-11ee-98fa-594b220ab4df] Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-debug/scylla/.dtest/dtest-m3fkpk0f/test/node3/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-3gb7_0mcd_3xob42bqz1stvi3t9b-big-Data.db:level=0]. 12kB to 6kB (~49% of original) in 45ms = 277kB/s. ~256 total partitions merged to 2.
INFO  2023-11-19 08:32:50,302 [shard 0:main] compaction_manager - Asked to stop

@bhalevy
Copy link
Member Author

bhalevy commented Nov 21, 2023

has some errors

Which? The Schema version changed message doesn't indicate an error, it's is informational only.

@mykaul
Copy link
Contributor

mykaul commented Nov 21, 2023

INFO 2023-11-19 08:02:37,036 [shard 0:strm] storage_service - waiting for schema information to complete

It never got this:

Schema version changed to 337d53c0-c601-3fad-a204-ad0ca5ccba2e

For some reason, it did not get this schema.

@mykaul
Copy link
Contributor

mykaul commented Nov 21, 2023

BTW, is Raft disabled here?

@mykaul
Copy link
Contributor

mykaul commented Nov 27, 2023

@kbr-scylla - unsure, but I think your team should look at this.

@kbr-scylla
Copy link
Contributor

Looks to be the same as #15393

@kbr-scylla
Copy link
Contributor

Since it looks the same as #15393, the potential root cause of which was suspected to be #17493, which was most likely fixed by fd32e2e, I'm moving it out of the backlog, as there is nothing we can currently do about it other than wait and keep fingers crossed that it doesn't reproduce.

image

@mykaul
Copy link
Contributor

mykaul commented Mar 14, 2024

Closing. If it happens again, we'll re-open / open a new issue.

@mykaul mykaul closed this as completed Mar 14, 2024
@kbr-scylla kbr-scylla closed this as not planned Won't fix, can't repro, duplicate, stale Mar 14, 2024
@kbr-scylla
Copy link
Contributor

not sure if we should consider it "completed" or "not planned"

@mykaul mykaul closed this as not planned Won't fix, can't repro, duplicate, stale Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage/master Looking for assignee
Projects
None yet
Development

No branches or pull requests

4 participants