
[DocDB] Fatal: Couldn't find connection for any index to Connection (0x000017e13de943d8) #21738

Closed
shishir2001-yb opened this issue Mar 29, 2024 · 2 comments
Labels
2024.1 Backport Required 2024.1_blocker area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue qa_stress Bugs identified via Stress automation QA QA filed bugs

Comments


shishir2001-yb commented Mar 29, 2024

Jira Link: DB-10612

Description

Tried on version 2.23.00-b65
Logs: https://drive.google.com/file/d/1mgQ4cbpVThLNgCoS4HObGKN3uwmGO_hp/view?usp=sharing
Encountered the following fatal error while running the cross-DB DDL test with PITR and backup/restore.

F20240328 23:26:43 ../../src/yb/rpc/reactor.cc:861] Check failed: erased Couldn't find connection for any index to Connection (0x000017e13de943d8) client 172.151.24.91:44447 => 172.151.24.4:9100
    @     0xaaaab0df389c  google::LogMessage::SendToLog()
    @     0xaaaab0df4740  google::LogMessage::Flush()
    @     0xaaaab0df4ddc  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaaab200c348  yb::rpc::Reactor::DestroyConnection()
    @     0xaaaab1fcc6bc  yb::rpc::FunctorReactorTaskWithWeakPtr<>::Run()
    @     0xaaaab2005adc  ev::base<>::method_thunk<>()
    @     0xaaaab1193630  ev_invoke_pending
    @     0xaaaab11966bc  ev_run
    @     0xaaaab20081fc  yb::rpc::Reactor::RunThread()
    @     0xaaaab28901d8  yb::Thread::SuperviseThread()
    @     0xffff8a3a78b8  start_thread
    @     0xffff8a403afc  thread_start

Test details:

Test Description:
        1. Create a cluster with the required g-flags
        2. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated
           databases and 100 non-colocated databases); run this for 20-30 mins
        3. Create a PITR schedule on 10 random databases
        4. Start a while loop and run it for 120 mins
          a. Note down the time for PITR(0)
          b. Create a backup of 1 random database
          c. Start the cross-DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross-DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times (between 1-9 sec ago)
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup db
          k. Drop the database
          l. Restore the backup
          m. Create the snapshot schedule for this new DB

A coredump was observed as well:

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.23.0.0-b65-almalinux8-aarch64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_31868_1711669027_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b65-almalinux8-aarch64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_31868_1711669027_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b65-almalinux8-aarch64!postgres!bin!postgres' (aarch64) was loaded.
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000ffff92c375dc libyb_util.so`std::__1::__hash_const_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, void*>*> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::__unordered_map_hasher<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::__unordered_map_equal<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>(this=<unavailable>, __k=<unavailable>) const at __hash_table:2168:31
    frame #1: 0x0000ffff92c371d4 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, long, yb::AggregationFunction, unsigned int, char const*, char const*) [inlined] std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find[abi:v170002](this=0x0000ffff84dfee60, __k="metric_type") const at unordered_map:1534:69
    frame #2: 0x0000ffff92c371d0 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(this=0x0000ffff79a5d0a8, attr=0x0000ffff84dfee60, name="handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_count", value=98334, aggregation_function=kSum, default_levels=<unavailable>, type="unknown", description="unknown") at metrics_writer.cc:139:30
    frame #3: 0x0000ffff84dd97ec libyb_pggate_webserver.so`yb::pggate::PgPrometheusMetricsHandler(req=<unavailable>, resp=<unavailable>) at pgsql_webserver_wrapper.cc:496:5
    frame #4: 0x0000ffff84d43580 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::__function::__value_func<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator()[abi:v170002](this=0x0000147b3f860000, __args=0x0000ffff79a5f5a0, __args=0x0000ffff79a5d480) const at function.h:517:16
    frame #5: 0x0000ffff84d43564 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::function<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator()(this= Function = yb::pggate::PgPrometheusMetricsHandler(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) , __arg=0x0000ffff79a5f5a0, __arg=0x0000ffff79a5d5a0) const at function.h:1168:12
    frame #6: 0x0000ffff84d43564 libserver_process.so`yb::Webserver::Impl::RunPathHandler(this=0x0000147b3f81c500, handler=0x0000147b3fdd1540, connection=0x0000147b3f5a1000, request_info=<unavailable>) at webserver.cc:648:5
    frame #7: 0x0000ffff84d42e60 libserver_process.so`yb::Webserver::Impl::BeginRequestCallback(this=0x0000147b3f81c500, connection=<unavailable>, request_info=0x0000147b3f5a1000) at webserver.cc:567:33
    frame #8: 0x0000ffff84d4e2d4 libserver_process.so`worker_thread + 5524
    frame #9: 0x0000ffff973878b8 libpthread.so.0`start_thread + 392
    frame #10: 0x0000ffff97223afc libc.so.6`thread_start + 12
  thread #2, stop reason = signal 0
    frame #0: 0x0000ffff94fc8a04 libyb_client.so`__do_fini
    frame #1: 0x0000ffff98724cd4 ld-linux-aarch64.so.1`_dl_fini at dl-fini.c:141:9
    frame #2: 0x0000ffff9723899c libc.so.6`__run_exit_handlers + 252
    frame #3: 0x0000ffff97238b1c libc.so.6`exit + 28
    frame #4: 0x0000aaaae04b8b94 postgres`proc_exit(code=0) at ipc.c:157:2
    frame #5: 0x0000ffff84bb3840 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:443:3
    frame #6: 0x0000aaaae04137e4 postgres`StartBackgroundWorker at bgworker.c:849:2
    frame #7: 0x0000aaaae042cb74 postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x0000147b3fd8c580) at postmaster.c:6095:4
    frame #8: 0x0000aaaae042cb18 postgres`maybe_start_bgworkers at postmaster.c:6321:9
    frame #9: 0x0000aaaae04290bc postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1432:2
    frame #10: 0x0000aaaae0323c84 postgres`PostgresServerProcessMain(argc=25, argv=0x0000147b3fd0c0d0) at main.c:234:3
    frame #11: 0x0000aaaadffe5e38 postgres`main + 36
    frame #12: 0x0000ffff97224384 libc.so.6`__libc_start_main + 220
    frame #13: 0x0000aaaadffe5cf4 postgres`_start + 52
  thread #3, stop reason = signal 0
    frame #0: 0x0000ffff972ca91c libc.so.6`__poll + 236
    frame #1: 0x0000ffff84d4ca24 libserver_process.so`master_thread + 740
    frame #2: 0x0000ffff973878b8 libpthread.so.0`start_thread + 392
    frame #3: 0x0000ffff97223afc libc.so.6`thread_start + 12

G-flags:

tserver_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "ysql_max_connections": "500",
                "client_read_write_timeout_ms": str(30 * 60 * 1000),
                "yb_client_admin_operation_timeout_sec": str(30 * 60),
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
            },
            master_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "tablet_split_high_phase_shard_count_per_node": 20000,
                "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2MB
                # low_phase_size 100KB
                "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
                "tablet_split_low_phase_shard_count_per_node": 10000,
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
            }

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shishir2001-yb shishir2001-yb added area/docdb YugabyteDB core features QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_stress Bugs identified via Stress automation labels Mar 29, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Mar 29, 2024
rthallamko3 (Contributor) commented:

Probably related to #20661, as indicated by Shishir on Slack.

@rthallamko3 rthallamko3 assigned basavaraj29 and unassigned bmatican Mar 29, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 29, 2024
spolitov added a commit that referenced this issue Apr 6, 2024
Summary:
It could happen that a connection gets two failures simultaneously, especially when the underlying TCP stream breaks.
There is a flag, `queued_destroy_connection_`, to avoid calling DestroyConnection multiple times.
But in the case of an RPC heartbeat timeout, the connection is destroyed without this flag being set.
As a result, Reactor::DestroyConnection can be called twice for the same connection.
For a client connection, the first call removes the connection from `client_conns_`, so the second call cannot find it in that map; the sanity check fails and the process dies of the check failure.

Fixed RPC heartbeat timeout handling to check and set `queued_destroy_connection_`.
Also added logic to avoid removing a connection from `client_conns_` if it was already destroyed.
Jira: DB-10612

Test Plan: Jenkins

Reviewers: bogdan, mbautin

Reviewed By: bogdan

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33685
shishir2001-yb (Author) commented:

The fix needs to be backported to 2024.1

spolitov added a commit that referenced this issue Apr 20, 2024

Summary: same as the original commit above.
Jira: DB-10612

Original commit: 136b6e4 / D33685

Test Plan: Jenkins

Reviewers: bogdan, mbautin, rthallam

Reviewed By: rthallam

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34294
ZhenYongFan pushed a commit to ZhenYongFan/yugabyte-db that referenced this issue Jun 15, 2024