
[DocDB] Fatal: Couldn't find connection for any index to Connection (0x000017e13de943d8) #21738

Closed
shishir2001-yb opened this issue Mar 29, 2024 · 2 comments
Labels
2024.1 Backport Required 2024.1_blocker area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue qa_stress Bugs identified via Stress automation QA QA filed bugs

Comments


shishir2001-yb commented Mar 29, 2024

Jira Link: DB-10612

Description

Tried on version 2.23.00-b65
Logs: https://drive.google.com/file/d/1mgQ4cbpVThLNgCoS4HObGKN3uwmGO_hp/view?usp=sharing
Encountered the following fatal error while running the cross-DB DDL test with PITR and backup/restore.

F20240328 23:26:43 ../../src/yb/rpc/reactor.cc:861] Check failed: erased Couldn't find connection for any index to Connection (0x000017e13de943d8) client 172.151.24.91:44447 => 172.151.24.4:9100
    @     0xaaaab0df389c  google::LogMessage::SendToLog()
    @     0xaaaab0df4740  google::LogMessage::Flush()
    @     0xaaaab0df4ddc  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaaab200c348  yb::rpc::Reactor::DestroyConnection()
    @     0xaaaab1fcc6bc  yb::rpc::FunctorReactorTaskWithWeakPtr<>::Run()
    @     0xaaaab2005adc  ev::base<>::method_thunk<>()
    @     0xaaaab1193630  ev_invoke_pending
    @     0xaaaab11966bc  ev_run
    @     0xaaaab20081fc  yb::rpc::Reactor::RunThread()
    @     0xaaaab28901d8  yb::Thread::SuperviseThread()
    @     0xffff8a3a78b8  start_thread
    @     0xffff8a403afc  thread_start

Test details:

Test Description:
        1. Create a cluster with the required g-flags
        2. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated
           databases and 100 non-colocated databases); run this for 20-30 mins
        3. Create a PITR schedule on 10 random databases
        4. Start a while loop and run it for 120 mins
          a. Note down the time for PITR(0)
          b. Create a backup of 1 random database
          c. Start the cross-DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross-DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times (between 1-9 sec ago)
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup db
          k. Drop the database
          l. Restore the backup
          m. Create the snapshot schedule for this new DB

A coredump was observed as well:

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.23.0.0-b65-almalinux8-aarch64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_31868_1711669027_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b65-almalinux8-aarch64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_31868_1711669027_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b65-almalinux8-aarch64!postgres!bin!postgres' (aarch64) was loaded.
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000ffff92c375dc libyb_util.so`std::__1::__hash_const_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, void*>*> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::__unordered_map_hasher<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::__unordered_map_equal<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>(this=<unavailable>, __k=<unavailable>) const at __hash_table:2168:31
    frame #1: 0x0000ffff92c371d4 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, long, yb::AggregationFunction, unsigned int, char const*, char const*) [inlined] std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find[abi:v170002](this=0x0000ffff84dfee60, __k="metric_type") const at unordered_map:1534:69
    frame #2: 0x0000ffff92c371d0 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(this=0x0000ffff79a5d0a8, attr=0x0000ffff84dfee60, name="handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_count", value=98334, aggregation_function=kSum, default_levels=<unavailable>, type="unknown", description="unknown") at metrics_writer.cc:139:30
    frame #3: 0x0000ffff84dd97ec libyb_pggate_webserver.so`yb::pggate::PgPrometheusMetricsHandler(req=<unavailable>, resp=<unavailable>) at pgsql_webserver_wrapper.cc:496:5
    frame #4: 0x0000ffff84d43580 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::__function::__value_func<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator()[abi:v170002](this=0x0000147b3f860000, __args=0x0000ffff79a5f5a0, __args=0x0000ffff79a5d480) const at function.h:517:16
    frame #5: 0x0000ffff84d43564 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::function<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator()(this= Function = yb::pggate::PgPrometheusMetricsHandler(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) , __arg=0x0000ffff79a5f5a0, __arg=0x0000ffff79a5d5a0) const at function.h:1168:12
    frame #6: 0x0000ffff84d43564 libserver_process.so`yb::Webserver::Impl::RunPathHandler(this=0x0000147b3f81c500, handler=0x0000147b3fdd1540, connection=0x0000147b3f5a1000, request_info=<unavailable>) at webserver.cc:648:5
    frame #7: 0x0000ffff84d42e60 libserver_process.so`yb::Webserver::Impl::BeginRequestCallback(this=0x0000147b3f81c500, connection=<unavailable>, request_info=0x0000147b3f5a1000) at webserver.cc:567:33
    frame #8: 0x0000ffff84d4e2d4 libserver_process.so`worker_thread + 5524
    frame #9: 0x0000ffff973878b8 libpthread.so.0`start_thread + 392
    frame #10: 0x0000ffff97223afc libc.so.6`thread_start + 12
  thread #2, stop reason = signal 0
    frame #0: 0x0000ffff94fc8a04 libyb_client.so`__do_fini
    frame #1: 0x0000ffff98724cd4 ld-linux-aarch64.so.1`_dl_fini at dl-fini.c:141:9
    frame #2: 0x0000ffff9723899c libc.so.6`__run_exit_handlers + 252
    frame #3: 0x0000ffff97238b1c libc.so.6`exit + 28
    frame #4: 0x0000aaaae04b8b94 postgres`proc_exit(code=0) at ipc.c:157:2
    frame #5: 0x0000ffff84bb3840 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:443:3
    frame #6: 0x0000aaaae04137e4 postgres`StartBackgroundWorker at bgworker.c:849:2
    frame #7: 0x0000aaaae042cb74 postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x0000147b3fd8c580) at postmaster.c:6095:4
    frame #8: 0x0000aaaae042cb18 postgres`maybe_start_bgworkers at postmaster.c:6321:9
    frame #9: 0x0000aaaae04290bc postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) at postmaster.c:1432:2
    frame #10: 0x0000aaaae0323c84 postgres`PostgresServerProcessMain(argc=25, argv=0x0000147b3fd0c0d0) at main.c:234:3
    frame #11: 0x0000aaaadffe5e38 postgres`main + 36
    frame #12: 0x0000ffff97224384 libc.so.6`__libc_start_main + 220
    frame #13: 0x0000aaaadffe5cf4 postgres`_start + 52
  thread #3, stop reason = signal 0
    frame #0: 0x0000ffff972ca91c libc.so.6`__poll + 236
    frame #1: 0x0000ffff84d4ca24 libserver_process.so`master_thread + 740
    frame #2: 0x0000ffff973878b8 libpthread.so.0`start_thread + 392
    frame #3: 0x0000ffff97223afc libc.so.6`thread_start + 12

G-flags:

tserver_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "ysql_max_connections": "500",
                "client_read_write_timeout_ms": str(30 * 60 * 1000),
                "yb_client_admin_operation_timeout_sec": str(30 * 60),
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
            },
            master_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "tablet_split_high_phase_shard_count_per_node": 20000,
                "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2MB
                # low_phase_size 100KB
                "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
                "tablet_split_low_phase_shard_count_per_node": 10000,
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
            }

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shishir2001-yb shishir2001-yb added area/docdb YugabyteDB core features QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_stress Bugs identified via Stress automation labels Mar 29, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Mar 29, 2024
rthallamko3 (Contributor) commented:

Probably related to #20661, as indicated by Shishir on Slack.

@rthallamko3 rthallamko3 assigned basavaraj29 and unassigned bmatican Mar 29, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 29, 2024
spolitov added a commit that referenced this issue Apr 6, 2024
Summary:
It could happen that a connection gets two failures simultaneously, especially when the underlying TCP stream breaks.
There is a flag, `queued_destroy_connection_`, to avoid calling DestroyConnection multiple times.
But in the case of an RPC heartbeat timeout, the connection is destroyed without this flag being set.
As a result, Reactor::DestroyConnection can be called twice for the same connection.
For a client connection, the first call removes the connection from `client_conns_`, so the second call cannot find it in that map; the sanity check fails and the process dies of the check failure.

Fixed RPC heartbeat timeout handling to check and set `queued_destroy_connection_`.
Also added logic to avoid removing a connection from `client_conns_` if it was already destroyed.
Jira: DB-10612

Test Plan: Jenkins

Reviewers: bogdan, mbautin

Reviewed By: bogdan

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33685
shishir2001-yb (Author) commented:

The fix needs to be backported to 2024.1

spolitov added a commit that referenced this issue Apr 20, 2024

Summary: same as the original commit above.
Jira: DB-10612

Original commit: 136b6e4 / D33685

Test Plan: Jenkins

Reviewers: bogdan, mbautin, rthallam

Reviewed By: rthallam

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34294
ZhenYongFan pushed a commit to ZhenYongFan/yugabyte-db that referenced this issue Jun 15, 2024