
Shutting down auth service may hang #13545

Closed
bhalevy opened this issue Apr 17, 2023 · 3 comments
Labels: area/test (Issues related to the testing system code and environment)

@bhalevy (Member) commented Apr 17, 2023

Seen in https://jenkins.scylladb.com/job/scylla-master/job/build/1423/artifact/testlog/aarch64/debug/topology_raft_disabled.test_raft_upgrade.1.log

______________________ test_recovery_after_majority_loss _______________________

manager = <test.pylib.manager_client.ManagerClient object at 0xffff863520d0>
random_tables = <test.pylib.random_tables.RandomTables object at 0xffff7716d750>

    @pytest.mark.asyncio
    @log_run_time
    async def test_recovery_after_majority_loss(manager: ManagerClient, random_tables: RandomTables):
        """
        We successfully upgrade a cluster. Eventually however all servers but one fail - group 0
        is left without a majority. We create a new group 0 by entering RECOVERY, using `removenode`
        to get rid of the other servers, clearing Raft data and restarting. The Raft upgrade procedure
        runs again to establish a single-node group 0. We also verify that schema changes performed
        using the old group 0 are still there.
        Note: in general there's no guarantee that all schema changes will be present; the minority
        used to recover group 0 might have missed them. However in this test the driver waits
        for schema agreement to complete before proceeding, so we know that every server learned
        about the schema changes.
    
        kbr-: the test takes about 22 seconds in dev mode on my laptop.
        """
        servers = await manager.running_servers()
    
        logging.info(f"Enabling Raft on {servers} and restarting")
        await asyncio.gather(*(enable_raft_and_restart(manager, srv) for srv in servers))
        cql = await reconnect_driver(manager)
    
        logging.info("Cluster restarted, waiting until driver reconnects to every server")
        hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
    
        logging.info(f"Driver reconnected, hosts: {hosts}. Waiting until upgrade finishes")
        await asyncio.gather(*(wait_until_upgrade_finishes(cql, h, time.time() + 60) for h in hosts))
    
        logging.info("Upgrade finished. Creating a bunch of tables")
        tables = await asyncio.gather(*(random_tables.add_table(ncolumns=5) for _ in range(5)))
    
        srv1, *others = servers
    
        logging.info(f"Killing all nodes except {srv1}")
        await asyncio.gather(*(manager.server_stop_gracefully(srv.server_id) for srv in others))
    
        logging.info(f"Entering recovery state on {srv1}")
        host1 = next(h for h in hosts if h.address == srv1.ip_addr)
        await cql.run_async("update system.scylla_local set value = 'recovery' where key = 'group0_upgrade_state'", host=host1)
>       await restart(manager, srv1)

test/topology_raft_disabled/test_raft_upgrade.py:324: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/topology_raft_disabled/test_raft_upgrade.py:37: in restart
    await manager.server_stop_gracefully(server.server_id)
test/pylib/manager_client.py:139: in server_stop_gracefully
    await self.client.get_text(f"/cluster/server/{server_id}/stop_gracefully")
test/pylib/rest_client.py:82: in get_text
    ret = await self._fetch("GET", resource_uri, response_type = "text", host = host,
test/pylib/rest_client.py:64: in _fetch
    async with request(method, uri,
/usr/lib64/python3.11/site-packages/aiohttp/client.py:1189: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.11/site-packages/aiohttp/client.py:560: in _request
    await resp.start(conn)
/usr/lib64/python3.11/site-packages/aiohttp/client_reqrep.py:894: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0xffff77460c10>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0xffff772bb400>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()
    
        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

topology_raft_disabled.test_raft_upgrade.1.log:

07:23:06.494 INFO> Killing all nodes except Server(211, 127.17.167.39)
07:23:06.494 DEBUG> ManagerClient stopping gracefully 212
07:23:06.494 DEBUG> RESTClient fetching GET http+unix://api/cluster/server/212/stop_gracefully
07:23:06.495 DEBUG> ManagerClient stopping gracefully 213
07:23:06.495 DEBUG> RESTClient fetching GET http+unix://api/cluster/server/213/stop_gracefully
07:23:06.526 DEBUG> Message pushed from server: <EventMessage(event_type='STATUS_CHANGE', event_args={'change_type': 'DOWN', 'address': ('127.17.167.20', 9042)}, stream_id=-1, trace_id=None)>
07:23:06.547 DEBUG> Connection <AsyncoreConnection(281472683252432) 127.17.167.20:9042> closed by server
07:23:06.547 DEBUG> Closing connection (281472683252432) to 127.17.167.20:9042
07:23:06.547 DEBUG> Closed socket to 127.17.167.20:9042
07:23:14.049 INFO> Entering recovery state on Server(211, 127.17.167.39)
07:23:14.066 INFO> Stopping Server(211, 127.17.167.39) gracefully
...
07:28:14.950 DEBUG> after_test for test_recovery_after_majority_loss (success: False)
07:28:14.950 DEBUG> RESTClient fetching GET http+unix://api/cluster/after-test/False
07:28:14.952 INFO> Cluster after test test_recovery_after_majority_loss: ScyllaCluster(name: d1ed22c8-d81f-11ed-bfbc-0afa8b88e8c3, running: ScyllaServer(211, 127.17.167.39, 6def9dbc-9ab9-4261-bcb5-6ba1d8e6c04e), stopped: ScyllaServer(213, 127.17.167.11, f703b6b9-f03e-4553-9950-edc8cab37a6b), ScyllaServer(212, 127.17.167.20, bd39a7f6-233c-4752-96cc-61903b04003f))

https://jenkins.scylladb.com/job/scylla-master/job/build/1423/artifact/testlog/aarch64/debug/scylla-211.log shows it's hung on shutdown. The last printout is "Shutting down auth service", followed by suppressed backtraces (reactor stalls?).

@xemul does that look familiar?

INFO  2023-04-11 07:23:07,227 [shard 0] gossip - Got shutdown message from 127.17.167.11, received_generation=1681186958, local_generation=1681186958
INFO  2023-04-11 07:23:07,229 [shard 0] gossip - InetAddress 127.17.167.11 is now DOWN, status = shutdown
INFO  2023-04-11 07:23:07,257 [shard 0] gossip - Got shutdown message from 127.17.167.20, received_generation=1681186955, local_generation=1681186955
INFO  2023-04-11 07:23:07,259 [shard 0] gossip - InetAddress 127.17.167.20 is now DOWN, status = shutdown
INFO  2023-04-11 07:23:10,163 [shard 0] raft_group_registry - marking Raft server f703b6b9-f03e-4553-9950-edc8cab37a6b as dead for raft groups
INFO  2023-04-11 07:23:10,255 [shard 0] raft_group_registry - marking Raft server bd39a7f6-233c-4752-96cc-61903b04003f as dead for raft groups
INFO  2023-04-11 07:23:14,070 [shard 0] compaction_manager - Asked to stop
WARN  2023-04-11 07:23:14,071 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes=[]: seastar::sleep_aborted (Sleep is aborted)
INFO  2023-04-11 07:23:14,072 [shard 0] gossip - failure_detector_loop: Finished main loop
INFO  2023-04-11 07:23:14,072 [shard 0] init - Signal received; shutting down
INFO  2023-04-11 07:23:14,072 [shard 0] init - Shutting down view builder ops
INFO  2023-04-11 07:23:14,072 [shard 0] view - Draining view builder
INFO  2023-04-11 07:23:14,073 [shard 1] compaction_manager - Asked to stop
INFO  2023-04-11 07:23:14,074 [shard 1] view - Draining view builder
INFO  2023-04-11 07:23:14,076 [shard 1] compaction_manager - Stopped
INFO  2023-04-11 07:23:14,082 [shard 0] compaction_manager - Stopped
INFO  2023-04-11 07:23:14,083 [shard 0] init - Shutting down view builder ops was successful
INFO  2023-04-11 07:23:14,083 [shard 0] init - Shutting down local storage
INFO  2023-04-11 07:23:14,085 [shard 0] storage_service - Stop transport: starts
INFO  2023-04-11 07:23:14,085 [shard 0] migration_manager - stopping migration service
INFO  2023-04-11 07:23:14,086 [shard 1] migration_manager - stopping migration service
INFO  2023-04-11 07:23:14,087 [shard 0] storage_service - Shutting down native transport server
INFO  2023-04-11 07:23:14,122 [shard 0] cql_server_controller - CQL server stopped
INFO  2023-04-11 07:23:14,122 [shard 0] storage_service - Shutting down native transport server was successful
INFO  2023-04-11 07:23:14,122 [shard 0] storage_service - Shutting down rpc server
INFO  2023-04-11 07:23:14,123 [shard 0] storage_service - Shutting down rpc server was successful
INFO  2023-04-11 07:23:14,123 [shard 0] storage_service - Shutting down alternator server
INFO  2023-04-11 07:23:14,126 [shard 0] storage_service - Shutting down alternator server was successful
INFO  2023-04-11 07:23:14,126 [shard 0] storage_service - Shutting down redis server
INFO  2023-04-11 07:23:14,126 [shard 0] storage_service - Shutting down redis server was successful
INFO  2023-04-11 07:23:14,127 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
INFO  2023-04-11 07:23:14,129 [shard 0] gossip - My status = NORMAL
INFO  2023-04-11 07:23:14,129 [shard 0] gossip - Announcing shutdown
INFO  2023-04-11 07:23:14,168 [shard 0] storage_service - Node 127.17.167.39 state jump to normal
INFO  2023-04-11 07:23:17,447 [shard 0] gossip - Disable and wait for gossip loop started
INFO  2023-04-11 07:23:17,449 [shard 0] gossip - Gossip is now stopped
INFO  2023-04-11 07:23:17,450 [shard 0] storage_service - Stop transport: stop_gossiping done
INFO  2023-04-11 07:23:17,450 [shard 0] messaging_service - Stopping nontls server
INFO  2023-04-11 07:23:17,450 [shard 0] messaging_service - Stopping tls server
INFO  2023-04-11 07:23:17,451 [shard 0] messaging_service - Stopping client for address: 127.17.167.20:0
INFO  2023-04-11 07:23:17,451 [shard 0] messaging_service - Stopping client for address: 127.17.167.11:0
INFO  2023-04-11 07:23:17,451 [shard 0] messaging_service - Stopping client for address: 127.17.167.11:0
INFO  2023-04-11 07:23:17,451 [shard 0] messaging_service - Stopping client for address: 127.17.167.20:0
INFO  2023-04-11 07:23:17,452 [shard 0] messaging_service - Stopping client for address: 127.17.167.39:0
INFO  2023-04-11 07:23:17,453 [shard 0] messaging_service - Stopping client for address: 127.17.167.11:0 - Done
INFO  2023-04-11 07:23:17,453 [shard 0] messaging_service - Stopping client for address: 127.17.167.20:0 - Done
INFO  2023-04-11 07:23:17,453 [shard 0] messaging_service - Stopping client for address: 127.17.167.20:0 - Done
INFO  2023-04-11 07:23:17,453 [shard 0] messaging_service - Stopping client for address: 127.17.167.11:0 - Done
INFO  2023-04-11 07:23:17,454 [shard 0] messaging_service - Stopping tls server - Done
INFO  2023-04-11 07:23:17,454 [shard 1] messaging_service - Stopping nontls server
INFO  2023-04-11 07:23:17,454 [shard 1] messaging_service - Stopping tls server
INFO  2023-04-11 07:23:17,454 [shard 1] messaging_service - Stopping client for address: 127.17.167.11:0
INFO  2023-04-11 07:23:17,455 [shard 1] messaging_service - Stopping tls server - Done
INFO  2023-04-11 07:23:17,455 [shard 1] messaging_service - Stopping client for address: 127.17.167.11:0 - Done
INFO  2023-04-11 07:23:17,456 [shard 1] messaging_service - Stopping nontls server - Done
INFO  2023-04-11 07:23:17,456 [shard 0] messaging_service - Stopping client for address: 127.17.167.39:0 - Done
INFO  2023-04-11 07:23:17,457 [shard 0] messaging_service - Stopping nontls server - Done
INFO  2023-04-11 07:23:17,457 [shard 0] storage_service - messaging_service stopped
INFO  2023-04-11 07:23:17,457 [shard 0] storage_service - Stop transport: shutdown messaging_service done
INFO  2023-04-11 07:23:17,458 [shard 0] storage_service - Stop transport: shutdown stream_manager done
INFO  2023-04-11 07:23:17,458 [shard 0] storage_service - Stop transport: done
INFO  2023-04-11 07:23:17,460 [shard 0] tracing - Asked to shut down
INFO  2023-04-11 07:23:17,460 [shard 0] tracing - Tracing is down
INFO  2023-04-11 07:23:17,460 [shard 1] tracing - Asked to shut down
INFO  2023-04-11 07:23:17,461 [shard 1] tracing - Tracing is down
INFO  2023-04-11 07:23:17,461 [shard 0] batchlog_manager - Asked to drain
INFO  2023-04-11 07:23:17,462 [shard 0] batchlog_manager - Drained
INFO  2023-04-11 07:23:17,462 [shard 1] batchlog_manager - Asked to drain
INFO  2023-04-11 07:23:17,462 [shard 1] batchlog_manager - Drained
INFO  2023-04-11 07:23:17,463 [shard 0] compaction_manager - Asked to drain
INFO  2023-04-11 07:23:17,463 [shard 0] compaction_manager - Drained
INFO  2023-04-11 07:23:17,464 [shard 1] compaction_manager - Asked to drain
INFO  2023-04-11 07:23:17,464 [shard 1] compaction_manager - Drained
INFO  2023-04-11 07:23:17,465 [shard 0] database - Flushing non-system tables
INFO  2023-04-11 07:23:17,465 [shard 1] database - Flushing non-system tables
INFO  2023-04-11 07:23:17,480 [shard 0] database - Flushed non-system tables
INFO  2023-04-11 07:23:17,589 [shard 1] database - Flushed non-system tables
INFO  2023-04-11 07:23:17,589 [shard 0] database - Flushing system tables
INFO  2023-04-11 07:23:17,590 [shard 1] database - Flushing system tables
Reactor stalled for 33 ms on shard 1. Backtrace: 0xce47e57 0xffff8810f23b 0xffff8810e653 0xffff87f1da97 0xffff87f17de3 0xffff87f17847 0xffff87f18213 0xffff87f1f073 0x7db 0x1278eb07 0x1278e9ab 0x1278e27f 0x1278d717 0x1278d297 0x1278cf6f 0x1278cd87 0x1278cb2f 0x1276b777 0x12848263 0x128472af 0x12846b63 0x1284697b 0x12846723 0x12846497 0x1284636b 0x120bd33f 0x11d99c8b 0x120d1c5f 0x11ddbd4f 0x12c91507 0x13753a9f 0x12849b73 0x12849633 0x12849083 0x11ebe41b 0x11ebdf6f 0x11ebdd53 0x11ebdb3f 0x11ebd917 0x11ebd6eb 0x11ebd53b 0x11ebd30f 0xffff87b53c0f 0xffff8850fde3
INFO  2023-04-11 07:23:18,276 [shard 0] raft_group_registry - marking Raft server 6def9dbc-9ab9-4261-bcb5-6ba1d8e6c04e as dead for raft groups
INFO  2023-04-11 07:23:18,520 [shard 1] database - Flushed system tables
INFO  2023-04-11 07:23:20,596 [shard 0] database - Flushed system tables
INFO  2023-04-11 07:23:20,610 [shard 0] init - Shutting down local storage was successful
INFO  2023-04-11 07:23:20,610 [shard 0] init - Shutting down view update generator
INFO  2023-04-11 07:23:20,611 [shard 0] view_update_generator - leaving 0 unstaged sstables unprocessed
INFO  2023-04-11 07:23:20,611 [shard 1] view_update_generator - leaving 0 unstaged sstables unprocessed
INFO  2023-04-11 07:23:20,614 [shard 0] init - Shutting down view update generator was successful
INFO  2023-04-11 07:23:20,614 [shard 0] init - Shutting down service level controller update loop
INFO  2023-04-11 07:23:20,616 [shard 0] service_level_controller - update_from_distributed_data: configuration polling loop aborted
INFO  2023-04-11 07:23:20,617 [shard 0] init - Shutting down service level controller update loop was successful
INFO  2023-04-11 07:23:20,617 [shard 0] init - Shutting down repair
INFO  2023-04-11 07:23:20,618 [shard 0] init - Shutting down repair was successful
INFO  2023-04-11 07:23:20,618 [shard 0] init - Shutting down rpc controller API
INFO  2023-04-11 07:23:20,619 [shard 0] init - Shutting down rpc controller API was successful
INFO  2023-04-11 07:23:20,619 [shard 0] init - Shutting down transport controller API
INFO  2023-04-11 07:23:20,620 [shard 0] init - Shutting down transport controller API was successful
INFO  2023-04-11 07:23:20,620 [shard 0] init - Shutting down native transport
INFO  2023-04-11 07:23:20,621 [shard 0] init - Shutting down native transport was successful
INFO  2023-04-11 07:23:20,621 [shard 0] init - Shutting down view builder API
INFO  2023-04-11 07:23:20,622 [shard 0] init - Shutting down view builder API was successful
INFO  2023-04-11 07:23:20,622 [shard 0] init - Shutting down view builder
INFO  2023-04-11 07:23:20,622 [shard 0] view - Stopping view builder
INFO  2023-04-11 07:23:20,622 [shard 1] view - Stopping view builder
INFO  2023-04-11 07:23:20,624 [shard 0] init - Shutting down view builder was successful
INFO  2023-04-11 07:23:20,625 [shard 0] init - Shutting down hinted handoff API
INFO  2023-04-11 07:23:20,626 [shard 0] init - Shutting down hinted handoff API was successful
INFO  2023-04-11 07:23:20,626 [shard 0] init - Shutting down view update backlog broker
INFO  2023-04-11 07:23:20,628 [shard 0] init - Shutting down view update backlog broker was successful
INFO  2023-04-11 07:23:20,628 [shard 0] init - Shutting down cf cache hit rate calculator
INFO  2023-04-11 07:23:20,630 [shard 0] init - Shutting down cf cache hit rate calculator was successful
INFO  2023-04-11 07:23:20,630 [shard 0] init - Shutting down load meter
INFO  2023-04-11 07:23:20,630 [shard 0] init - Shutting down load meter was successful
INFO  2023-04-11 07:23:20,630 [shard 0] init - Shutting down batchlog manager
INFO  2023-04-11 07:23:20,631 [shard 0] batchlog_manager - Asked to stop
INFO  2023-04-11 07:23:20,631 [shard 0] batchlog_manager - Asked to drain
INFO  2023-04-11 07:23:20,631 [shard 0] batchlog_manager - Drained
INFO  2023-04-11 07:23:20,631 [shard 0] batchlog_manager - Stopped
INFO  2023-04-11 07:23:20,631 [shard 1] batchlog_manager - Asked to stop
INFO  2023-04-11 07:23:20,632 [shard 1] batchlog_manager - Asked to drain
INFO  2023-04-11 07:23:20,632 [shard 1] batchlog_manager - Drained
INFO  2023-04-11 07:23:20,632 [shard 1] batchlog_manager - Stopped
INFO  2023-04-11 07:23:20,633 [shard 0] init - Shutting down batchlog manager was successful
INFO  2023-04-11 07:23:20,634 [shard 0] init - Shutting down snapshots API
INFO  2023-04-11 07:23:20,635 [shard 0] init - Shutting down snapshots API was successful
INFO  2023-04-11 07:23:20,635 [shard 0] init - Shutting down snapshots
INFO  2023-04-11 07:23:20,636 [shard 0] init - Shutting down snapshots was successful
INFO  2023-04-11 07:23:20,636 [shard 0] init - Shutting down authorization cache api
INFO  2023-04-11 07:23:20,637 [shard 0] init - Shutting down authorization cache api was successful
INFO  2023-04-11 07:23:20,637 [shard 0] init - Shutting down auth service
Rate-limit: suppressed 23 backtraces on shard 0
------ Ending test debug/topology_raft_disabled.test_raft_upgrade.1::test_recovery_after_majority_loss ------

Forked off of #8079

@kbr-scylla (Contributor) commented:

Observed in another run
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/5928/artifact/testlog/x86_64/debug/scylla-585.log

INFO  2023-04-27 09:15:33,181 [shard 0] batchlog_manager - Asked to stop
INFO  2023-04-27 09:15:33,181 [shard 0] batchlog_manager - Asked to drain
INFO  2023-04-27 09:15:33,181 [shard 0] batchlog_manager - Drained
INFO  2023-04-27 09:15:33,181 [shard 0] batchlog_manager - Stopped
INFO  2023-04-27 09:15:33,181 [shard 1] batchlog_manager - Asked to stop
INFO  2023-04-27 09:15:33,181 [shard 1] batchlog_manager - Asked to drain
INFO  2023-04-27 09:15:33,181 [shard 1] batchlog_manager - Drained
INFO  2023-04-27 09:15:33,182 [shard 1] batchlog_manager - Stopped
INFO  2023-04-27 09:15:33,182 [shard 0] init - Shutting down batchlog manager was successful
INFO  2023-04-27 09:15:33,182 [shard 0] init - Shutting down snapshots API
INFO  2023-04-27 09:15:33,182 [shard 0] init - Shutting down snapshots API was successful
INFO  2023-04-27 09:15:33,182 [shard 0] init - Shutting down snapshots
INFO  2023-04-27 09:15:33,183 [shard 0] init - Shutting down snapshots was successful
INFO  2023-04-27 09:15:33,183 [shard 0] init - Shutting down authorization cache api
INFO  2023-04-27 09:15:33,183 [shard 0] init - Shutting down authorization cache api was successful
INFO  2023-04-27 09:15:33,183 [shard 0] init - Shutting down auth service
------ Ending test debug/topology_raft_disabled.test_raft_upgrade.2::test_recover_stuck_raft_upgrade ------

@mykaul added the Eng-1 and area/test labels on May 4, 2023
@kbr-scylla (Contributor) commented:

Reproducer: kbr-scylla@1ae2f86
It reproduces in about half of the runs because it depends on the token ring distribution.

There's a query in the auth service that we perform after a node starts. This query uses an infinite timeout:

future<bool> default_role_row_satisfies(
        cql3::query_processor& qp,
        std::function<bool(const cql3::untyped_result_set_row&)> p) {
    static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
            meta::roles_table::qualified_name,
            meta::roles_table::role_col_name);

    return do_with(std::move(p), [&qp](const auto& p) {
        return qp.execute_internal(
                query,
                db::consistency_level::ONE,
                {meta::DEFAULT_SUPERUSER_NAME},
                cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {

With the right conditions and timing (we kill a node which serves this data right before another node attempts the query), shutdown will hang, since it is waiting on this query:

            _stopped = auth::do_after_system_ready(_as, [this] {
                return seastar::async([this] {
                    _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get0();

                    if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
                        if (this->legacy_metadata_exists()) {
                            log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
                        }

                        return;
                    }

                    if (this->legacy_metadata_exists()) {
                        this->migrate_legacy_metadata().get0();
                        return;
                    }

                    create_default_role_if_missing().get0();
                });
            });

(The query is issued from create_default_role_if_missing.)
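
To make the hang mechanism concrete, here is a minimal asyncio analogue (illustrative only: the real code is the Seastar/C++ service quoted above, and the names AuthServiceAnalogue and roles_table_query below are hypothetical). The background fiber started after system-ready issues a query with no timeout, and stop() waits for that fiber, so if the query's replicas are dead, shutdown never completes:

import asyncio

async def roles_table_query() -> None:
    # stands in for the internal SELECT whose replicas never answer
    await asyncio.Future()

class AuthServiceAnalogue:
    def start(self) -> None:
        # analogous to do_after_system_ready(...) storing the future in _stopped
        self._stopped = asyncio.ensure_future(roles_table_query())

    async def stop(self) -> None:
        # shutdown awaits the background task; with an infinite timeout
        # on the query, this await never returns
        await self._stopped

async def main() -> None:
    svc = AuthServiceAnalogue()
    svc.start()
    try:
        await asyncio.wait_for(svc.stop(), timeout=1)  # guard only for this demo
    except asyncio.TimeoutError:
        print("stop() hung, just like 'Shutting down auth service' in the log")

asyncio.run(main())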

This looks like an oversight, because commit 620e950 (for issue #3603) was supposed to remove infinite timeouts from distributed table queries.

The fix is simple: use a timeout.
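
In the same analogue, the fix looks roughly like this (a sketch only: the actual patch is in C++ and, per the commit message below, reuses the timeout configuration that other queries in the auth module already use; the 10-second bound here is an arbitrary illustrative value, not the real configured one):

import asyncio

async def roles_table_query() -> None:
    try:
        # bounded internal query instead of an infinite one
        await asyncio.wait_for(asyncio.Future(), timeout=10)
    except asyncio.TimeoutError:
        # the query fails with a timeout, but the background task completes,
        # so stop() returns and shutdown can proceed
        pass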

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 5, 2023
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: scylladb#3603. There was also a fix:
620e950. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.

Use the same timeout configuration as other queries in this module do.

Fixes scylladb#13545.
avikivity pushed a commit that referenced this issue Jun 6, 2023
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.

Use the same timeout configuration as other queries in this module do.

Fixes #13545.

Closes #14134

(cherry picked from commit f51312e)
avikivity pushed the same cherry-picked commit to two more branches on Jun 6, 2023.
@avikivity (Member) commented:

Backported to 5.1, 5.2, 5.3.

@DoronArazii added this to the 5.4 milestone Jun 20, 2023