Description
Jira Link: DB-17233
Version: 2.27.0.0-b169
The test fails consistently (100% repro rate). It last passed on 2.27.0.0-b143.
Error:
The system waited ~15 seconds and then timed out with a "leader not found" error:
Could not locate the leader master: GetMasterRegistration RPC (request call id 11) to 172.151.23.76:7100 timed out after 14.979s\n
Log from one of the nodes:
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0602 03:26:22.175999 33480 catalog_manager.cc:1681] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:12001): Node a4d4f9bf10b64dfe83834e60d4441c9f peer not initialized.
W0602 03:26:22.214972 33487 catalog_manager.cc:1681] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:12001): Node a4d4f9bf10b64dfe83834e60d4441c9f peer not initialized.
W0602 03:26:22.339953 33497 catalog_manager_bg_tasks.cc:195] Catalog manager background task thread going to sleep: Service unavailable (yb/master/scoped_leader_shared_lock.cc:92): Catalog manager is not initialized. State: 1
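For triage, the header format quoted above can be split into fields with a small parser. This is a minimal sketch that assumes only the `[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg` layout shown in this log. The names `GLOG_LINE`, `LogRecord`, and `parse_glog_line` are hypothetical helpers, not part of YugabyteDB.

```python
import re
from typing import NamedTuple, Optional

# Regex for the header layout quoted in the log:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+)\s+"
    r"(?P<file>[^:\s]+):(?P<line>\d+)\]\s"
    r"(?P<msg>.*)$"
)


class LogRecord(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    level: str
    month: int
    day: int
    time: str
    thread: int
    file: str
    line: int
    msg: str


def parse_glog_line(text: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None for lines without a header
    (for example stack-trace frames that start with '@')."""
    m = GLOG_LINE.match(text)
    if m is None:
        return None
    return LogRecord(
        level=m["level"],
        month=int(m["month"]),
        day=int(m["day"]),
        time=m["time"],
        thread=int(m["thread"]),
        file=m["file"],
        line=int(m["line"]),
        msg=m["msg"],
    )
```

Filtering the parsed records on `level == "W"` and on `file` is one way to isolate the `catalog_manager.cc` warnings from the rest of the output.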
Master log:
I0602 03:38:30.074385 47208 xcluster_manager.cc:538] IsAlterXClusterReplicationDone: replication_group_id: "cac7db1c-005d-44a4-abc1-c916b8a074fe_rep" target_master_addresses { host: "172.151.26.129" port: 7100 } target_master_addresses { host: "172.151.29.112" port: 7100 } target_master_addresses { host: "172.151.31.154" port: 7100 }, from: 172.151.23.76:45478
I0602 03:38:30.074442 47208 secure.cc:138] SetupSecureContext: kInternal, 1
I0602 03:38:30.074671 47208 secure.cc:192] Certs directory: /home/yugabyte/yugabyte-tls-producer/cac7db1c-005d-44a4-abc1-c916b8a074fe_rep, node name:
I0602 03:38:30.093333 47208 thread_pool.cc:231] Starting thread pool { name: xcluster-remote max_workers: 1024 idle_timeout: 15.000s }
I0602 03:38:30.098526 47850 client-internal.cc:2922] New master addresses: [172.151.26.129:7100,172.151.29.112:7100,172.151.31.154:7100]
W0602 03:38:30.105381 47849 outbound_call.cc:169] Failed to schedule invoking callback on response for request yb.master.MasterService.GetMasterRegistration to 172.151.31.154: Aborted (yb/rpc/thread_pool.cc:57): Service is shutting down
W0602 03:38:30.183352 47856 async_rpc_tasks_base.cc:579] CreateTablet RPC for tablet f1a2ba93f8ab421f884f018ce4aa914e (yb_db_dr_rand_c97a330_1_employees_table_c97a330_3 [id=00004003000030008000000000004024]) on TS=d1d276ac3a0e47b3911d4d0cb1c4693c (task=0x00007273379302d8, state=kRunning): TS 0x72733dbcaa98: Create Tablet RPC failed for tablet f1a2ba93f8ab421f884f018ce4aa914e: Network error (yb/rpc/connection.cc:274): Connect timeout Connection (0x0000727337f081e0) client 172.151.29.157:34717 => 172.151.27.63:9100, passed: 14.960s, timeout: 15.000s: kConnectFailed (network error 1)
I0602 03:38:30.183410 47856 async_rpc_tasks_base.cc:355] CreateTablet RPC for tablet f1a2ba93f8ab421f884f018ce4aa914e (yb_db_dr_rand_c97a330_1_employees_table_c97a330_3 [id=00004003000030008000000000004024]) on TS=d1d276ac3a0e47b3911d4d0cb1c4693c (task=0x00007273379302d8, state=kRunning): Scheduling retry with a delay of 16397ms (attempt = 11 / 2147483647)...
W0602 03:38:31.021175 47208 catalog_manager.cc:11919] Expected replicas 3 but found 2 for tablet 7276bd4ff48348128c3bd6a45e046187: tablet_id: "7276bd4ff48348128c3bd6a45e046187" replicas { ts_info { permanent_uuid: "dd6cb43d0850497797f466bc5d1f91a6" private_rpc_addresses { host: "172.151.29.157" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" } placement_uuid: "7e66dfe1-b56f-4921-b5e7-7249e0d0e165" } role: FOLLOWER member_type: VOTER state: RUNNING } replicas { ts_info { permanent_uuid: "82a8ffe8345244ca8b2f3c1498787203" private_rpc_addresses { host: "172.151.23.76" port: 9100 } cloud_info { placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" } placement_uuid: "7e66dfe1-b56f-4921-b5e7-7249e0d0e165" } role: LEADER member_type: VOTER state: RUNNING } stale: false partition { partition_key_start: "" partition_key_end: "" } table_id: "00004004000030008000000000004018.colocation.parent.uuid" table_ids: "00004004000030008000000000004018.colocation.parent.uuid" table_ids: "00004004000030008000000000004015" table_ids: "0000400400003000800000000000401b" table_ids: "00004004000030008000000000004020" table_ids: "00004004000030008000000000004025" split_depth: 0 expected_live_replicas: 3 expected_read_replicas: 0 split_parent_tablet_id: "" raft_config_opid_index: -1
I0602 03:38:31.022809 47208 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:42680: stream_id: "9ebecae21ce30483624b44163aea1a13"
W0602 03:38:35.095098 43502 consensus_peers.cc:625] T 00000000000000000000000000000000 P 3d9095ec6c514ed283e842e583ec2b2a -> Peer 1bb56204b298490d99eef93574187e52 ([host: "172.151.27.63" port: 7100], []): Couldn't send request. Status: Timed out (yb/rpc/outbound_call.cc:647): UpdateConsensus RPC (request call id 368056) to 172.151.27.63:7100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 73 times. State: 2
W0602 03:38:35.587921 48139 async_rpc_tasks_base.cc:579] CreateTablet RPC for tablet c91bef1129ea49739bc83939fb96175c (yb_db_dr_rand_c97a330_1_employees_table_c97a330_2 [id=0000400300003000800000000000401f]) on TS=d1d276ac3a0e47b3911d4d0cb1c4693c (task=0x0000727337931358, state=kRunning): TS 0x72733dbcaa98: Create Tablet RPC failed for tablet c91bef1129ea49739bc83939fb96175c: Network error (yb/rpc/connection.cc:274): Connect timeout Connection (0x000072733c2c6020) client 172.151.29.157:43239 => 172.151.27.63:9100, passed: 14.960s, timeout: 15.000s: kConnectFailed (network error 1)
I0602 03:38:35.587986 48139 async_rpc_tasks_base.cc:355] CreateTablet RPC for tablet c91bef1129ea49739bc83939fb96175c (yb_db_dr_rand_c97a330_1_employees_table_c97a330_2 [id=0000400300003000800000000000401f]) on TS=d1d276ac3a0e47b3911d4d0cb1c4693c (task=0x0000727337931358, state=kRunning): Scheduling retry with a delay of 32781ms (attempt = 12 / 2147483647)...
I0602 03:38:36.710337 33412 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54632 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:36.747010 33411 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54636 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:36.786448 33412 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54634 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:36.812523 33412 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54642 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:36.950429 33411 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54648 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:37.610963 33411 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54656 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:37.760815 33412 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54662 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
I0602 03:38:38.234839 33411 secure_stream.cc:878] SECURE[S] kEnabled { local: 172.151.29.157:7100 remote: 172.151.21.199:54668 }: Network error (yb/rpc/secure_stream.cc:877): SSL read failed: no error (6)
W0602 03:38:38.762108 33432 master.cc:541] ListMasters: Network error (yb/rpc/connection.cc:274): Unable to get registration information for peer ([172.151.27.63:7100]) id (1bb56204b298490d99eef93574187e52): Connect timeout Connection (0x00007273383e2fe0) client 172.151.29.157:37209 => 172.151.27.63:7100, passed: 14.980s, timeout: 15.000s: kConnectFailed (network error 1)
W0602 03:38:38.762390 33432 scoped_leader_shared_lock.cc:159] RPC took a long time (/share/jenkins/workspace/github-yugabyte-db-alma8-master-clang19-release-aarch64/yugabyte-db/src/yb/master/catalog_manager_bg_tasks.cc:193, Run): 15.002s
@ 0xaaaacef9550c yb::master::CatalogManagerBgTasks::Run()
@ 0xaaaad0125458 yb::Thread::SuperviseThread()
@ 0xffffabd878b8 start_thread
@ 0xffffabde3afc thread_start
I0602 03:38:39.767758 33432 cluster_balance_util.cc:254] Master leader not received heartbeat from ts d1d276ac3a0e47b3911d4d0cb1c4693c. Only performing leader balancing for tables with replicas in this TS.
I0602 03:38:39.768054 33432 cluster_balance.cc:722] Skipping adding replicas for under-replicated tablet 7276bd4ff48348128c3bd6a45e046187: no valid tservers to place tablet
I0602 03:38:39.768079 33432 cluster_balance.cc:570] Skipping removing replicas. Only leader balancing table 00004003000030008000000000004024
I0602 03:38:39.768085 33432 cluster_balance.cc:595] Skipping adding replicas. Only leader balancing table 00004003000030008000000000004024
W0602 03:38:41.780747 43502 consensus_peers.cc:625] T 00000000000000000000000000000000 P 3d9095ec6c514ed283e842e583ec2b2a -> Peer 1bb56204b298490d99eef93574187e52 ([host: "172.151.27.63" port: 7100], []): Couldn't send request. Status: Timed out (yb/rpc/outbound_call.cc:647): UpdateConsensus RPC (request call id 368072) to 172.151.27.63:7100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 75 times. State: 2
I0602 03:38:43.067974 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:51138: stream_id: "9ebecae21ce30483624b44163aea1a13"
I0602 03:38:43.068217 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:33643: stream_id: "2b6f4d3cbababd8fe1469741115ca6b5"
I0602 03:38:43.068439 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:38385: stream_id: "a80456531d64efabea4d1876e999ce11"
I0602 03:38:43.068630 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:34229: stream_id: "edf50c40741084b42c491c8fac35c45f"
I0602 03:38:43.068835 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:35971: stream_id: "46bf503e32a08aa41c4449f67b0979bd"
I0602 03:38:43.283470 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:39303: stream_id: "2b6f4d3cbababd8fe1469741115ca6b5"
I0602 03:38:43.283721 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:40677: stream_id: "a80456531d64efabea4d1876e999ce11"
I0602 03:38:43.283937 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:37909: stream_id: "edf50c40741084b42c491c8fac35c45f"
I0602 03:38:43.284116 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:38595: stream_id: "46bf503e32a08aa41c4449f67b0979bd"
I0602 03:38:45.871347 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:33643: stream_id: "2b6f4d3cbababd8fe1469741115ca6b5"
I0602 03:38:45.871559 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:38385: stream_id: "a80456531d64efabea4d1876e999ce11"
I0602 03:38:45.871729 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:34229: stream_id: "edf50c40741084b42c491c8fac35c45f"
I0602 03:38:45.871878 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.29.157:35971: stream_id: "46bf503e32a08aa41c4449f67b0979bd"
I0602 03:38:45.986842 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:42692: stream_id: "2b6f4d3cbababd8fe1469741115ca6b5"
I0602 03:38:45.987051 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:39303: stream_id: "a80456531d64efabea4d1876e999ce11"
I0602 03:38:45.987226 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:40677: stream_id: "edf50c40741084b42c491c8fac35c45f"
I0602 03:38:45.987396 47213 xrepl_catalog_manager.cc:2894] GetCDCStream from 172.151.23.76:37909: stream_id: "46bf503e32a08aa41c4449f67b0979bd"
W0602 03:38:48.437584 43502 consensus_peers.cc:625] T 00000000000000000000000000000000 P 3d9095ec6c514ed283e842e583ec2b2a -> Peer 1bb56204b298490d99eef93574187e52 ([host: "172.151.27.63" port: 7100], []): Couldn't send request. Status: Timed out (yb/rpc/outbound_call.cc:647): UpdateConsensus RPC (request call id 368090) to 172.151.27.63:7100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 77 times. State: 2
W0602 03:38:54.769824 33432 master.cc:541] ListMasters: Network error (yb/rpc/connection.cc:274): Unable to get registration information for peer ([172.151.27.63:7100]) id (1bb56204b298490d99eef93574187e52): Connect timeout Connection (0x000072733c2aa1e0) client 172.151.29.157:33585 => 172.151.27.63:7100, passed: 14.970s, timeout: 15.000s: kConnectFailed (network error 1)
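In the master log excerpt above, every RPC timeout and connect timeout targets 172.151.27.63 (ports 7100 and 9100). A quick way to confirm that pattern across the full attached logs is to tally timeouts by destination endpoint. The sketch below assumes only the two message shapes visible in the excerpt. The function name `count_unreachable_endpoints` and both regexes are hypothetical, not a YugabyteDB tool.

```python
import re
from collections import Counter
from typing import Iterable

# Two message shapes seen in the excerpt (other shapes are ignored):
#   "... RPC (request call id N) to HOST:PORT timed out after ..."
#   "... Connect timeout Connection (...) client A:P => HOST:PORT, passed: ..."
RPC_TIMEOUT = re.compile(r"\bto (\d+\.\d+\.\d+\.\d+:\d+) timed out")
CONNECT_TIMEOUT = re.compile(r"Connect timeout .*?=> (\d+\.\d+\.\d+\.\d+:\d+)")


def count_unreachable_endpoints(lines: Iterable[str]) -> Counter:
    """Count timeout messages per destination HOST:PORT."""
    counts: Counter = Counter()
    for line in lines:
        for pattern in (RPC_TIMEOUT, CONNECT_TIMEOUT):
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1
                break  # count each log line at most once
    return counts
```

Usage would look like `count_unreachable_endpoints(open(path))` followed by `most_common()`, where `path` points at a downloaded master log. A single dominant host in the result would be consistent with that node being on the far side of the injected partition.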
Steps to perform:
1. Create multiple databases (50% colocated)
2. Create tables for each database
3. Setup replication
4. Start cycle:
a. Start parallel nemesis on both source and target
b. Create new databases (non-colocated/colocated)
c. Create tables/indexes (both sides)
d. Add databases to the replication
A network partition is injected while the add-database-to-replication operation is in progress.

Logs are attached to the Jira ticket.
Issue Type
kind/bug