
[dtest] Tests / Sanity Tests-RAFT / throughput_limits_test.TestCompactionLimitThroughput.test_can_limit_compaction_throughput is flakey #15721

Closed
avikivity opened this issue Oct 16, 2023 · 11 comments
Labels: area/commitlog (Issues related to the commit log.), Backport candidate, P2 High Priority, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/dtest, triage/master (Looking for assignee)
Milestone: 6.0

Comments

@avikivity
Member

https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/6668/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests_RAFT___test_can_limit_compaction_throughput/

Did not run on r3; seems like a real bug.

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-10-15 19:26:03,281 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a9c162f06b9011ee883fcdd4cd97304b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"
request = <SubRequest 'fixture_dtest_setup' for <Function test_can_limit_compaction_throughput>>
dtest_config = <dtest_config.DTestConfig object at 0x7f2dd1b340d0>
fixture_dtest_setup_overrides = <dtest_setup_overrides.DTestSetupOverrides object at 0x7f2dd0d6e090>
fixture_logging_setup = None, fixture_dtest_cluster_name = 'test'
fixture_dtest_create_cluster_func = <function DTestSetup.create_ccm_cluster at 0x7f2dd489a160>

    @pytest.fixture(scope='function', autouse=False)
    def fixture_dtest_setup(request,
                            dtest_config,
                            fixture_dtest_setup_overrides,
                            fixture_logging_setup,
                            fixture_dtest_cluster_name,
                            fixture_dtest_create_cluster_func):
    
        # do all of our setup operations to get the environment ready for the actual test
        # to run (e.g. bring up a cluster with the necessary config, populate variables, etc.)
        initial_environment = copy.deepcopy(os.environ)
        dtest_setup = DTestSetup(dtest_config=dtest_config,
                                 setup_overrides=fixture_dtest_setup_overrides,
                                 cluster_name=fixture_dtest_cluster_name)
    
        cassandra_cluster = dtest_config.cassandra_version
    
        if not cassandra_cluster:
            if ((request.node.get_closest_marker('single_node') or
                    not request.node.get_closest_marker('no_boot_speedups'))
                    and not dtest_setup.dtest_config.tablets):  # workaround for https://github.com/scylladb/scylladb/issues/15152
                dtest_setup.cluster_options.setdefault('skip_wait_for_gossip_to_settle', 0)
    
            # Reduce waiting time for the nodes to hear from others before joining the ring.
            # Since all test cases run on localhost and there are no large test clusters
            # it's safe to reduce the value to save a lot of time while testing.
            # (Default value for the option is 30s)
            dtest_setup.cluster_options.setdefault('ring_delay_ms', 10000)
    
        cluster_options = request.node.get_closest_marker('cluster_options')
        if cluster_options:
            for name, value in cluster_options.kwargs.items():
                dtest_setup.cluster_options.setdefault(name, value)
    
        manager_install_dir = dtest_setup.prepare_scylla_manager() if request.node.get_closest_marker('scylla_manager') else None
        dtest_setup.initialize_cluster(fixture_dtest_create_cluster_func, manager_install_dir=manager_install_dir)
    
        # at this point we're done with our setup operations in this fixture
        # yield to allow the actual test to run
        yield dtest_setup
    
        # phew! we're back after executing the test, now we need to do
        # all of our teardown and cleanup operations
    
        reset_environment_vars(initial_environment)
        dtest_setup.jvm_args = []
    
        for con in dtest_setup.connections:
            con.cluster.shutdown()
        dtest_setup.connections = []
    
        rep_setup = getattr(request.node, "rep_setup", None)
        rep_call = getattr(request.node, "rep_call", None)
        failed = getattr(rep_setup, 'failed', False) or getattr(rep_call, 'failed', False)
    
        try:
            dtest_setup.cluster.stop(gently=True)
        except Exception as e:
            logger.error("Error stopping cluster: %s", str(e))
    
        try:
            if not dtest_setup.allow_log_errors:
                try:
>                   dtest_setup.check_errors_all_nodes()

conftest.py:346: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <dtest_setup.DTestSetup object at 0x7f2dd0d84c10>
nodes = [<ccmlib.scylla_node.ScyllaNode object at 0x7f2dd0d89790>]
exclude_errors = None, search_str = None, regex = False

    def check_errors_all_nodes(self, nodes=None, exclude_errors=None, search_str=None, regex=False):
        if nodes is None:
            nodes = self.cluster.nodelist()
    
        critical_errors = []
        found_errors = []
        for node in nodes:
            try:
                critical_errors_pattern = r'Assertion.*failed|AddressSanitizer'
                if self.ignore_cores_log_patterns:
                    expr = '|'.join(["({})".format(p) for p in set(self.ignore_cores_log_patterns)])
                    matches = node.grep_log(expr)
                    if matches:
                        logger.debug("Will ignore cores on {}. Found the following log messages: {}".format(
                            node.name, matches))
                        self.ignore_cores.append(node)
                if node not in self.ignore_cores:
                    critical_errors_pattern += "|Aborting"
                matches = node.grep_log(critical_errors_pattern)
                if matches:
                    critical_errors.append((node.name, [m[0].strip() for m in matches]))
            except FileNotFoundError:
                pass
            errors = self.check_errors(node=node, exclude_errors=exclude_errors, search_str=search_str, regex=regex,
                                       return_errors=True)
            if len(errors):
                found_errors.append((node.name, errors))
    
        if critical_errors:
            raise AssertionError('Critical errors found: {}\nOther errors: {}'.format(
                critical_errors, found_errors))
        if found_errors:
>           raise AssertionError('Unexpected errors found: {}'.format(found_errors))
E           AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-10-15 19:26:03,281 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a9c162f06b9011ee883fcdd4cd97304b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]

dtest_setup.py:558: AssertionError
@avikivity avikivity added area/commitlog Issues related to the commit log. tests/dtest symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Oct 16, 2023
@avikivity
Member Author

/cc @elcallio

@elcallio
Contributor

The log message would indicate a mutation is being written after/during commitlog shutdown. Since we are writing to compaction history, I guess it could be part of shutting down sstables? @raphaelsc - any ideas?
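
For context, here is a minimal sketch of the seastar::gate mechanics behind that error, using only the public gate API; the names commitlog_gate, late_write and shutdown are hypothetical stand-ins, not the actual Scylla code path:

    // Sketch: why a write arriving after shutdown surfaces as
    // seastar::gate_closed_exception ("gate closed").
    #include <seastar/core/gate.hh>
    #include <seastar/core/future.hh>

    seastar::gate commitlog_gate;  // hypothetical stand-in for the commitlog's gate

    seastar::future<> late_write() {
        // with_gate() enters the gate for the duration of the operation.
        // If the gate has already been closed, it throws
        // seastar::gate_closed_exception, which the caller reports as
        // "Could not write mutation ... to commitlog".
        return seastar::with_gate(commitlog_gate, [] {
            return seastar::make_ready_future<>();  // stand-in for the actual append
        });
    }

    seastar::future<> shutdown() {
        // close() waits for writers already inside the gate; any writer
        // arriving afterwards gets gate_closed_exception.
        return commitlog_gate.close();
    }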

@raphaelsc
Member

> The log message would indicate a mutation is being written after/during commitlog shutdown. Since we are writing to compaction history, I guess it could be part of shutting down sstables? @raphaelsc - any ideas?

Yes, indeed, this looks like a race in shutdown.

In database::stop, we have:

    co_await close_tables(database::table_kind::user);
    co_await close_tables(database::table_kind::system);

so user tables are correctly closed first

but it can happen that a system table (or even a user table from another shard) itself writes into compaction_history after the latter was already closed. The effect is that the history is missing a record for a table. I'm not sure it's worth fixing. A possible fix is to fail the compaction if the write to history failed, but I don't think that's a good idea: the weight of having compaction completed is much higher than that of missing an entry in history.
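
To make the window concrete, here is illustrative pseudocode of that ordering, not the real database::stop; it assumes, as noted later in this thread, that close_tables stops the tables of a given kind in parallel:

    // Illustrative pseudocode of the race window.
    future<> database::stop() {
        // User tables are correctly closed first.
        co_await close_tables(database::table_kind::user);

        // System tables are closed in parallel, so system.compaction_history
        // can be stopped (its commitlog gate closed) while a compaction of
        // another table, e.g. system.raft, is still completing elsewhere.
        // When that compaction finishes, its history-record write hits the
        // closed gate and fails with gate_closed_exception.
        co_await close_tables(database::table_kind::system);
    }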

@avikivity
Member Author

We should quiesce compactions before closing the tables.
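
A sketch of that ordering; the drain operation matches what the compaction_manager logs later in this thread call "Asked to drain", but the exact call site and names here are assumptions, not the final patch:

    // Illustrative pseudocode of the proposed ordering.
    future<> database::stop() {
        // Quiesce compactions first: wait for running compaction tasks,
        // including their pending compaction_history writes, to finish or
        // abort, and prevent new ones from starting.
        co_await _compaction_manager.drain();  // assumed API and call site

        // Only then close the tables; no compaction can now race with the
        // closure of system.compaction_history.
        co_await close_tables(database::table_kind::user);
        co_await close_tables(database::table_kind::system);
    }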

@nyh
Contributor

nyh commented Nov 21, 2023

Seen again failing in CI run
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4808/:

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-11-21 17:02:11,105 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.50.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a81f52b0888f11eeb27253faead7e27b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"

See full details in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4808/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests___test_can_limit_compaction_throughput/

@mykaul mykaul added the P1 Urgent label Nov 28, 2023
@mykaul mykaul added this to the 6.0 milestone Nov 28, 2023
@mykaul mykaul added the triage/master Looking for assignee label Nov 28, 2023
@mykaul
Contributor

mykaul commented Nov 28, 2023

> We should quiesce compactions before closing the tables.

@bhalevy - please prioritize and assign

@bhalevy bhalevy added P2 High Priority and removed P1 Urgent labels Nov 28, 2023
@bhalevy
Member

bhalevy commented Nov 28, 2023

@raphaelsc Is database::close_tables safe? It calls table->stop() under parallel_for_each_table, so the compaction_history table can be closed while, e.g., system.raft is still active and compacting.

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/6119/artifact/logs-full.release.001/1700586132081_throughput_limits_test.py%3A%3ATestCompactionLimitThroughput%3A%3Atest_can_limit_compaction_throughput/node1.log

INFO  2023-11-21 17:02:11,079 [shard 0:main] compaction_manager - Asked to drain
INFO  2023-11-21 17:02:11,079 [shard 0:main] compaction_manager - Drained
INFO  2023-11-21 17:02:11,079 [shard 1:main] compaction_manager - Asked to drain
INFO  2023-11-21 17:02:11,079 [shard 1:main] compaction_manager - Drained
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushing non-system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushed non-system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushing non-system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushed non-system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushing system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushed system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushing system tables
INFO  2023-11-21 17:02:11,083 [shard 1:main] database - Flushed system tables
INFO  2023-11-21 17:02:11,104 [shard 0:comp] compaction - [Compact keyspace1.standard1 a30ec8a0-888f-11ee-b272-53faead7e27b] Compacting of 2 sstables interrupted due to: sstables::compaction_stopped_exception (Compaction for keyspace1/standard1 was stopped due to: user requested abort)
INFO  2023-11-21 17:02:11,104 [shard 0:comp] compaction_manager - Compaction task 0x6000042ca100 for table keyspace1.standard1 compaction_group=0/1 [0x600004e20400]: Compaction for keyspace1/standard1 was stopped due to: user requested abort: stopping
INFO  2023-11-21 17:02:11,105 [shard 0:comp] compaction - [Compact system.raft a81f52b0-888f-11ee-b272-53faead7e27b] Compacted 2 sstables to [/jenkins/workspace/scylla-master/gating-dtest-release/scylla/.dtest/dtest-3r_frxsc/test/node1/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gb9_1bb1_2tt4g2pox9t35halx7-big-Data.db:level=0]. 70kB to 65kB (~92% of original) in 16524ms = 4kB/s. ~256 total partitions merged to 1.
ERROR 2023-11-21 17:02:11,105 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.50.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a81f52b0888f11eeb27253faead7e27b}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2023-11-21 17:02:11,105 [shard 0:comp] system_keyspace - update compaction history failed: exceptions::mutation_write_failure_exception (Operation failed for system.compaction_history - received 0 responses and 1 failures from 1 CL=ONE.): ignored
INFO  2023-11-21 17:02:11,107 [shard 0:main] init - Shutting down local storage was successful

@nyh
Contributor

nyh commented Nov 30, 2023

Seen again in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4992/:

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-11-30 18:12:52,591 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.99.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{001004523f708fac11eebf529d8d3c58f7ea}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"

(see details in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4992/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests___test_can_limit_compaction_throughput/)

@denesb denesb assigned lkshminarayanan and unassigned raphaelsc Jan 30, 2024
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 8, 2024
… shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, stop all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 12, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 14, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
denesb pushed a commit that referenced this issue Feb 19, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes #15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes #17218

(cherry picked from commit 3b7b315)
@denesb
Contributor

denesb commented Feb 19, 2024

Backported to 5.4.

@raphaelsc / @lkshminarayanan is this needed in 5.2?

dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb#17218