
[dtest] Tests / Sanity Tests-RAFT / throughput_limits_test.TestCompactionLimitThroughput.test_can_limit_compaction_throughput is flakey #15721

Closed
avikivity opened this issue Oct 16, 2023 · 11 comments
Labels: area/commitlog (Issues related to the commit log.), Backport candidate, P2 High Priority, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/dtest, triage/master (Looking for assignee)
Milestone: 6.0

Comments

@avikivity
Member

https://jenkins.scylladb.com/view/nexts/job/scylla-master/job/next/6668/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests_RAFT___test_can_limit_compaction_throughput/

Did not run on r3; seems like a real bug.

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-10-15 19:26:03,281 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a9c162f06b9011ee883fcdd4cd97304b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"
request = <SubRequest 'fixture_dtest_setup' for <Function test_can_limit_compaction_throughput>>
dtest_config = <dtest_config.DTestConfig object at 0x7f2dd1b340d0>
fixture_dtest_setup_overrides = <dtest_setup_overrides.DTestSetupOverrides object at 0x7f2dd0d6e090>
fixture_logging_setup = None, fixture_dtest_cluster_name = 'test'
fixture_dtest_create_cluster_func = <function DTestSetup.create_ccm_cluster at 0x7f2dd489a160>

    @pytest.fixture(scope='function', autouse=False)
    def fixture_dtest_setup(request,
                            dtest_config,
                            fixture_dtest_setup_overrides,
                            fixture_logging_setup,
                            fixture_dtest_cluster_name,
                            fixture_dtest_create_cluster_func):
    
        # do all of our setup operations to get the environment ready for the actual test
        # to run (e.g. bring up a cluster with the necessary config, populate variables, etc.)
        initial_environment = copy.deepcopy(os.environ)
        dtest_setup = DTestSetup(dtest_config=dtest_config,
                                 setup_overrides=fixture_dtest_setup_overrides,
                                 cluster_name=fixture_dtest_cluster_name)
    
        cassandra_cluster = dtest_config.cassandra_version
    
        if not cassandra_cluster:
            if ((request.node.get_closest_marker('single_node') or
                    not request.node.get_closest_marker('no_boot_speedups'))
                    and not dtest_setup.dtest_config.tablets):  # workaround for https://github.com/scylladb/scylladb/issues/15152
                dtest_setup.cluster_options.setdefault('skip_wait_for_gossip_to_settle', 0)
    
            # Reduce waiting time for the nodes to hear from others before joining the ring.
            # Since all test cases run on localhost and there are no large test clusters
            # it's safe to reduce the value to save a lot of time while testing.
            # (Default value for the option is 30s)
            dtest_setup.cluster_options.setdefault('ring_delay_ms', 10000)
    
        cluster_options = request.node.get_closest_marker('cluster_options')
        if cluster_options:
            for name, value in cluster_options.kwargs.items():
                dtest_setup.cluster_options.setdefault(name, value)
    
        manager_install_dir = dtest_setup.prepare_scylla_manager() if request.node.get_closest_marker('scylla_manager') else None
        dtest_setup.initialize_cluster(fixture_dtest_create_cluster_func, manager_install_dir=manager_install_dir)
    
        # at this point we're done with our setup operations in this fixture
        # yield to allow the actual test to run
        yield dtest_setup
    
        # phew! we're back after executing the test, now we need to do
        # all of our teardown and cleanup operations
    
        reset_environment_vars(initial_environment)
        dtest_setup.jvm_args = []
    
        for con in dtest_setup.connections:
            con.cluster.shutdown()
        dtest_setup.connections = []
    
        rep_setup = getattr(request.node, "rep_setup", None)
        rep_call = getattr(request.node, "rep_call", None)
        failed = getattr(rep_setup, 'failed', False) or getattr(rep_call, 'failed', False)
    
        try:
            dtest_setup.cluster.stop(gently=True)
        except Exception as e:
            logger.error("Error stopping cluster: %s", str(e))
    
        try:
            if not dtest_setup.allow_log_errors:
                try:
>                   dtest_setup.check_errors_all_nodes()

conftest.py:346: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <dtest_setup.DTestSetup object at 0x7f2dd0d84c10>
nodes = [<ccmlib.scylla_node.ScyllaNode object at 0x7f2dd0d89790>]
exclude_errors = None, search_str = None, regex = False

    def check_errors_all_nodes(self, nodes=None, exclude_errors=None, search_str=None, regex=False):
        if nodes is None:
            nodes = self.cluster.nodelist()
    
        critical_errors = []
        found_errors = []
        for node in nodes:
            try:
                critical_errors_pattern = r'Assertion.*failed|AddressSanitizer'
                if self.ignore_cores_log_patterns:
                    expr = '|'.join(["({})".format(p) for p in set(self.ignore_cores_log_patterns)])
                    matches = node.grep_log(expr)
                    if matches:
                        logger.debug("Will ignore cores on {}. Found the following log messages: {}".format(
                            node.name, matches))
                        self.ignore_cores.append(node)
                if node not in self.ignore_cores:
                    critical_errors_pattern += "|Aborting"
                matches = node.grep_log(critical_errors_pattern)
                if matches:
                    critical_errors.append((node.name, [m[0].strip() for m in matches]))
            except FileNotFoundError:
                pass
            errors = self.check_errors(node=node, exclude_errors=exclude_errors, search_str=search_str, regex=regex,
                                       return_errors=True)
            if len(errors):
                found_errors.append((node.name, errors))
    
        if critical_errors:
            raise AssertionError('Critical errors found: {}\nOther errors: {}'.format(
                critical_errors, found_errors))
        if found_errors:
>           raise AssertionError('Unexpected errors found: {}'.format(found_errors))
E           AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-10-15 19:26:03,281 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a9c162f06b9011ee883fcdd4cd97304b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]

dtest_setup.py:558: AssertionError
@avikivity avikivity added area/commitlog Issues related to the commit log. tests/dtest symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Oct 16, 2023
@avikivity
Member Author

/cc @elcallio

@elcallio
Contributor

The log message would indicate a mutation is being written after/during commitlog shutdown. Since we are writing to compaction history, I guess it could be part of shutting down sstables? @raphaelsc - any ideas?
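
For context, here is a minimal sketch of the seastar::gate mechanics behind that error, using only the public gate API; the names commitlog_gate, late_write and shutdown are hypothetical stand-ins, not the actual Scylla code path:

    // Sketch: why a write arriving after shutdown surfaces as
    // seastar::gate_closed_exception ("gate closed").
    #include <seastar/core/gate.hh>
    #include <seastar/core/future.hh>

    seastar::gate commitlog_gate;  // hypothetical stand-in for the commitlog's gate

    seastar::future<> late_write() {
        // with_gate() enters the gate for the duration of the operation.
        // If the gate has already been closed, it throws
        // seastar::gate_closed_exception, which the caller reports as
        // "Could not write mutation ... to commitlog".
        return seastar::with_gate(commitlog_gate, [] {
            return seastar::make_ready_future<>();  // stand-in for the actual append
        });
    }

    seastar::future<> shutdown() {
        // close() waits for writers already inside the gate; any writer
        // arriving afterwards gets gate_closed_exception.
        return commitlog_gate.close();
    }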

@raphaelsc
Member

> The log message would indicate a mutation is being written after/during commitlog shutdown. Since we are writing to compaction history, I guess it could be part of shutting down sstables? @raphaelsc - any ideas?

Yes, indeed, this looks like a race in shutdown.

In database::stop, we have:

    co_await close_tables(database::table_kind::user);
    co_await close_tables(database::table_kind::system);

so user tables are correctly closed first

but it can happen that a system table (or even a user table from another shard) itself writes into compaction_history after the latter was already closed. The effect is that the history is missing a record for a table. I'm not sure it's worth fixing. A possible fix is to fail the compaction if the write to history failed, but I don't think that's a good idea: the weight of having compaction completed is much higher than that of missing an entry in history.
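
To make the window concrete, here is illustrative pseudocode of that ordering, not the real database::stop; it assumes, as noted later in this thread, that close_tables stops the tables of a given kind in parallel:

    // Illustrative pseudocode of the race window.
    future<> database::stop() {
        // User tables are correctly closed first.
        co_await close_tables(database::table_kind::user);

        // System tables are closed in parallel, so system.compaction_history
        // can be stopped (its commitlog gate closed) while a compaction of
        // another table, e.g. system.raft, is still completing elsewhere.
        // When that compaction finishes, its history-record write hits the
        // closed gate and fails with gate_closed_exception.
        co_await close_tables(database::table_kind::system);
    }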

@avikivity
Member Author

We should quiesce compactions before closing the tables.
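
A sketch of that ordering; the drain operation matches what the compaction_manager logs later in this thread call "Asked to drain", but the exact call site and names here are assumptions, not the final patch:

    // Illustrative pseudocode of the proposed ordering.
    future<> database::stop() {
        // Quiesce compactions first: wait for running compaction tasks,
        // including their pending compaction_history writes, to finish or
        // abort, and prevent new ones from starting.
        co_await _compaction_manager.drain();  // assumed API and call site

        // Only then close the tables; no compaction can now race with the
        // closure of system.compaction_history.
        co_await close_tables(database::table_kind::user);
        co_await close_tables(database::table_kind::system);
    }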

@nyh
Contributor

nyh commented Nov 21, 2023

Seen again failing in CI run
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4808/:

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-11-21 17:02:11,105 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.50.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a81f52b0888f11eeb27253faead7e27b}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"

See full details in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4808/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests___test_can_limit_compaction_throughput/

@mykaul mykaul added the P1 Urgent label Nov 28, 2023
@mykaul mykaul added this to the 6.0 milestone Nov 28, 2023
@mykaul mykaul added the triage/master Looking for assignee label Nov 28, 2023
@mykaul
Contributor

mykaul commented Nov 28, 2023

> We should quiesce compactions before closing the tables.

@bhalevy - please prioritize and assign

@bhalevy bhalevy added P2 High Priority and removed P1 Urgent labels Nov 28, 2023
@bhalevy
Member

bhalevy commented Nov 28, 2023

@raphaelsc Is database::close_tables safe? It calls table->stop() under parallel_for_each_table, so the compaction_history table can be closed while, e.g., system.raft is still active and compacting.

https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/6119/artifact/logs-full.release.001/1700586132081_throughput_limits_test.py%3A%3ATestCompactionLimitThroughput%3A%3Atest_can_limit_compaction_throughput/node1.log

INFO  2023-11-21 17:02:11,079 [shard 0:main] compaction_manager - Asked to drain
INFO  2023-11-21 17:02:11,079 [shard 0:main] compaction_manager - Drained
INFO  2023-11-21 17:02:11,079 [shard 1:main] compaction_manager - Asked to drain
INFO  2023-11-21 17:02:11,079 [shard 1:main] compaction_manager - Drained
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushing non-system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushed non-system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushing non-system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushed non-system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushing system tables
INFO  2023-11-21 17:02:11,079 [shard 0:main] database - Flushed system tables
INFO  2023-11-21 17:02:11,079 [shard 1:main] database - Flushing system tables
INFO  2023-11-21 17:02:11,083 [shard 1:main] database - Flushed system tables
INFO  2023-11-21 17:02:11,104 [shard 0:comp] compaction - [Compact keyspace1.standard1 a30ec8a0-888f-11ee-b272-53faead7e27b] Compacting of 2 sstables interrupted due to: sstables::compaction_stopped_exception (Compaction for keyspace1/standard1 was stopped due to: user requested abort)
INFO  2023-11-21 17:02:11,104 [shard 0:comp] compaction_manager - Compaction task 0x6000042ca100 for table keyspace1.standard1 compaction_group=0/1 [0x600004e20400]: Compaction for keyspace1/standard1 was stopped due to: user requested abort: stopping
INFO  2023-11-21 17:02:11,105 [shard 0:comp] compaction - [Compact system.raft a81f52b0-888f-11ee-b272-53faead7e27b] Compacted 2 sstables to [/jenkins/workspace/scylla-master/gating-dtest-release/scylla/.dtest/dtest-3r_frxsc/test/node1/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gb9_1bb1_2tt4g2pox9t35halx7-big-Data.db:level=0]. 70kB to 65kB (~92% of original) in 16524ms = 4kB/s. ~256 total partitions merged to 1.
ERROR 2023-11-21 17:02:11,105 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.50.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{0010a81f52b0888f11eeb27253faead7e27b}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2023-11-21 17:02:11,105 [shard 0:comp] system_keyspace - update compaction history failed: exceptions::mutation_write_failure_exception (Operation failed for system.compaction_history - received 0 responses and 1 failures from 1 CL=ONE.): ignored
INFO  2023-11-21 17:02:11,107 [shard 0:main] init - Shutting down local storage was successful

@nyh
Contributor

nyh commented Nov 30, 2023

Seen again in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4992/:

failed on teardown with "AssertionError: Unexpected errors found: [('node1', ['ERROR 2023-11-30 18:12:52,591 [shard 0:comp] storage_proxy - exception during mutation write to 127.0.99.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:compaction_history (pk{001004523f708fac11eebf529d8d3c58f7ea}) to commitlog): seastar::gate_closed_exception (gate closed)'])]"

(see details in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4992/testReport/junit/throughput_limits_test/TestCompactionLimitThroughput/Tests___Sanity_Tests___test_can_limit_compaction_throughput/)

@denesb denesb assigned lkshminarayanan and unassigned raphaelsc Jan 30, 2024
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 8, 2024
… shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, stop all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 12, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Feb 14, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
denesb pushed a commit that referenced this issue Feb 19, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes #15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes #17218

(cherry picked from commit 3b7b315)
@denesb
Contributor

denesb commented Feb 19, 2024

Backported to 5.4.

@raphaelsc / @lkshminarayanan is this needed in 5.2?

dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
…ing shutdown

During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes scylladb#15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb#17218