Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Increased follower lag observed in production with stuck OutboundCall #18744

Closed
1 task done
karan-yb opened this issue Aug 18, 2023 · 2 comments
Closed
1 task done

Comments

@karan-yb
Copy link
Contributor

karan-yb commented Aug 18, 2023

Jira Link: DB-7637

Description

In production, we saw quite a few cases where Peer is waiting for the response of a Update RPC call. Since Peer holds the performing_update_mutex for the duration of the RPC, we get into a situation where leader never sends any data to this peer.

After analyzing the dump, we found that the OutboundCall for the RPC is in SENT state for a long time. There was no clear way to confirm if connection (on which this call was sent) was active or not.

To avoid this follower lag issue, add a mechanism to detect the situation, log additional details to root cause and automatically recover from it by cancelling the pending OutstandingCall.

Issue #18685 is created to track all stuck OutboundCalls and possibly recover from it.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@karan-yb karan-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Aug 18, 2023
@karan-yb karan-yb self-assigned this Aug 18, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue kind/bug This issue is a bug and removed status/awaiting-triage Issue awaiting triage kind/enhancement This is an enhancement of an existing feature labels Aug 18, 2023
karan-yb added a commit that referenced this issue Aug 23, 2023
…uck OutboundCall

Summary:
In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D27859
karan-yb added a commit that referenced this issue Aug 25, 2023
… lag caused by stuck OutboundCall

Summary:
Original commit: 30157f5df21928215b9d4b227bd2c59c6d0d565c / D28057
In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28076
@bmatican
Copy link
Contributor

Found some TSAN issues with the above commit in
https://detective-gcp.dev.yugabyte.com/stability/test?branch=master&build_type=all&class=RaftConsensusITest&fail_tag=tsan_race&name=MultiThreadedInsertWithFailovers&platform=linux
https://detective-gcp.dev.yugabyte.com/stability/test?branch=master&build_type=all&class=MasterFailoverTestIndexCreation%2FMasterFailoverTestIndexCreation&fail_tag=tsan_race&name=TestPauseAfterCreateIndexIssued%2F0&platform=linux
https://detective-gcp.dev.yugabyte.com/stability/test?branch=master&build_type=all&class=KVTableTsFailoverWriteIfTest&fail_tag=tsan_race&name=KillTabletServerDuringReplication&platform=linux

Flavor 1

[ts-2] WARNING: ThreadSanitizer: data race (pid=2163)
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)
[ts-2]     #5 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bb07)
[ts-2]     #6 decltype(std::declval<yb::consensus::Peer::Init()::$_0&>()()) std::__invoke[abi:v160006]<yb::consensus::Peer::Init()::$_0&>(yb::consensus::Peer::Init()::$_0&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libconsensus.so+0x9bb07)
[ts-2]     #7 void std::__invoke_void_return_wrapper<void, true>::__call<yb::consensus::Peer::Init()::$_0&>(yb::consensus::Peer::Init()::$_0&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libconsensus.so+0x9bb07)
[ts-2]     #8 std::__function::__alloc_func<yb::consensus::Peer::Init()::$_0, std::allocator<yb::consensus::Peer::Init()::$_0>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libconsensus.so+0x9bb07)
[ts-2]     #9 std::__function::__func<yb::consensus::Peer::Init()::$_0, std::allocator<yb::consensus::Peer::Init()::$_0>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libconsensus.so+0x9bb07)
[ts-2]     #10 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyrpc.so+0x11bc78)
[ts-2]     #11 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyrpc.so+0x11bc78)
[ts-2]     #12 yb::rpc::PeriodicTimer::Callback(long) ${BUILD_ROOT}/../../src/yb/rpc/periodic.cc:189:5 (libyrpc.so+0x11bc78)
[ts-2]     #13 yb::rpc::PeriodicTimer::Callback(long)::$_0::operator()(yb::Status const&) const ${BUILD_ROOT}/../../src/yb/rpc/periodic.cc:208:14 (libyrpc.so+0x11c4ef)
[ts-2]     #14 yb::rpc::ScheduledTask<yb::rpc::PeriodicTimer::Callback(long)::$_0>::Run(yb::Status const&) ${BUILD_ROOT}/../../src/yb/rpc/scheduler.h:55:5 (libyrpc.so+0x11c4ef)
[ts-2]     #15 yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()::operator()() const ${BUILD_ROOT}/../../src/yb/rpc/scheduler.cc:142:57 (libyrpc.so+0x170a99)
[ts-2]     #16 void boost::asio::asio_handler_invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, ...) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/handler_invoke_hook.hpp:88:3 (libyrpc.so+0x170a99)
[ts-2]     #17 void boost_asio_handler_invoke_helpers::invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libyrpc.so+0x170a99)
[ts-2]     #18 void boost::asio::detail::handler_work<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void>::complete<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/handler_work.hpp:520:7 (libyrpc.so+0x170a99)
[ts-2]     #19 boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/completion_handler.hpp:74:9 (libyrpc.so+0x170a99)
[ts-2]     #20 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/scheduler_operation.hpp:40:5 (libtserver_main_impl.so+0x1c2da)
[ts-2]     #21 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/impl/scheduler.ipp:492:12 (libtserver_main_impl.so+0x1c2da)
[ts-2]     #22 boost::asio::detail::scheduler::run(boost::system::error_code&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/impl/scheduler.ipp:210:10 (libtserver_main_impl.so+0x1be21)
[ts-2]     #23 boost::asio::io_context::run(boost::system::error_code&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/impl/io_context.ipp:71:16 (libyrpc.so+0xf2d00)
[ts-2]     #24 yb::rpc::IoThreadPool::Impl::Execute() ${BUILD_ROOT}/../../src/yb/rpc/io_thread_pool.cc:76:17 (libyrpc.so+0xf2d00)
[ts-2]     #25 decltype(*std::declval<yb::rpc::IoThreadPool::Impl*&>().*std::declval<void (yb::rpc::IoThreadPool::Impl::*&)()>()()) std::__invoke[abi:v160006]<void (yb::rpc::IoThreadPool::Impl::*&)(), yb::rpc::IoThreadPool::Impl*&, void>(void (yb::rpc::IoThreadPool::Impl::*&)(), yb::rpc::IoThreadPool::Impl*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:359:23 (libyrpc.so+0xf915b)
[ts-2]     #26 std::__bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, 0ul, std::tuple<>>(void (yb::rpc::IoThreadPool::Impl::*&)(), std::tuple<yb::rpc::IoThreadPool::Impl*>&, std::__tuple_indices<0ul>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0xf915b)
[ts-2]     #27 std::__bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>>::value>::type std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0xf915b)
[ts-2]     #28 decltype(std::declval<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&>()()) std::__invoke[abi:v160006]<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&>(std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0xf915b)
[ts-2]     #29 std::__bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>, __is_valid_bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>(std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&, std::tuple<>&, std::__tuple_indices<>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0xf915b)
[ts-2]     #30 std::__bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>, __is_valid_bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>::value>::type std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0xf915b)
[ts-2]     #31 decltype(std::declval<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>()()) std::__invoke[abi:v160006]<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>(std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0xf915b)
[ts-2]     #32 void std::__invoke_void_return_wrapper<void, true>::__call<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>(std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libyrpc.so+0xf915b)
[ts-2]     #33 std::__function::__alloc_func<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>, std::allocator<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libyrpc.so+0xf915b)
[ts-2]     #34 std::__function::__func<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>, std::allocator<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libyrpc.so+0xf915b)
[ts-2]     #35 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyb_util.so+0x3af852)
[ts-2]     #36 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyb_util.so+0x3af852)
[ts-2]     #37 yb::Thread::SuperviseThread(void*) ${BUILD_ROOT}/../../src/yb/util/thread.cc:842:3 (libyb_util.so+0x3af852)
[ts-2] 
[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
[ts-2]     #5 yb::rpc::Connection::CleanupExpirationQueue(std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>) ${BUILD_ROOT}/../../src/yb/rpc/connection.cc:240:13 (libyrpc.so+0xd8ad7)
[ts-2]     #6 yb::rpc::Connection::HandleTimeout(ev::timer&, int) ${BUILD_ROOT}/../../src/yb/rpc/connection.cc:221:3 (libyrpc.so+0xd84d6)
[ts-2]     #7 void ev::base<ev_timer, ev::timer>::method_thunk<yb::rpc::Connection, &yb::rpc::Connection::HandleTimeout(ev::timer&, int)>(ev_loop*, ev_timer*, int) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/common/include/ev++.h:479:7 (libyrpc.so+0xe7ee8)
[ts-2]     #8 ev_invoke_pending <null> (libev.so.4+0x896a)
[ts-2]     #9 decltype(*std::declval<yb::rpc::Reactor*&>().*std::declval<void (yb::rpc::Reactor::*&)()>()()) std::__invoke[abi:v160006]<void (yb::rpc::Reactor::*&)(), yb::rpc::Reactor*&, void>(void (yb::rpc::Reactor::*&)(), yb::rpc::Reactor*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:359:23 (libyrpc.so+0x13ba1b)
[ts-2]     #10 std::__bind_return<void (yb::rpc::Reactor::*)(), std::tuple<yb::rpc::Reactor*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::Reactor::*)(), std::tuple<yb::rpc::Reactor*>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<void (yb::rpc::Reactor::*)(), std::tuple<yb::rpc::Reactor*>, 0ul, std::tuple<>>(void (yb::rpc::Reactor::*&)(), std::tuple<yb::rpc::Reactor*>&, std::__tuple_indices<0ul>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0x13ba1b)
[ts-2]     #11 std::__bind_return<void (yb::rpc::Reactor::*)(), std::tuple<yb::rpc::Reactor*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::Reactor::*)(), std::tuple<yb::rpc::Reactor*>, std::tuple<>>::value>::type std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0x13ba1b)
[ts-2]     #12 decltype(std::declval<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>&>()()) std::__invoke[abi:v160006]<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>&>(std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0x13ba1b)
[ts-2]     #13 void std::__invoke_void_return_wrapper<void, true>::__call<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>&>(std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libyrpc.so+0x13ba1b)
[ts-2]     #14 std::__function::__alloc_func<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>, std::allocator<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libyrpc.so+0x13ba1b)
[ts-2]     #15 std::__function::__func<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>, std::allocator<std::__bind<void (yb::rpc::Reactor::* const&)(), yb::rpc::Reactor* const&>>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libyrpc.so+0x13ba1b)
[ts-2]     #16 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyb_util.so+0x3af852)
[ts-2]     #17 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyb_util.so+0x3af852)
[ts-2]     #18 yb::Thread::SuperviseThread(void*) ${BUILD_ROOT}/../../src/yb/util/thread.cc:842:3 (libyb_util.so+0x3af852)

Flavor 2:

[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)
[m-2]     #5 decltype(*std::declval<std::shared_ptr<yb::consensus::Peer>&>().*std::declval<void (yb::consensus::Peer::*&)()>()()) std::__invoke[abi:v160006]<void (yb::consensus::Peer::*&)(), std::shared_ptr<yb::consensus::Peer>&, void>(void (yb::consensus::Peer::*&)(), std::shared_ptr<yb::consensus::Peer>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:359:23 (libconsensus.so+0xa269b)
[m-2]     #6 std::__bind_return<void (yb::consensus::Peer::*)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>, std::tuple<>, __is_valid_bind_return<void (yb::consensus::Peer::*)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<void (yb::consensus::Peer::*)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>, 0ul, std::tuple<>>(void (yb::consensus::Peer::*&)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>&, std::__tuple_indices<0ul>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libconsensus.so+0xa269b)
[m-2]     #7 std::__bind_return<void (yb::consensus::Peer::*)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>, std::tuple<>, __is_valid_bind_return<void (yb::consensus::Peer::*)(), std::tuple<std::shared_ptr<yb::consensus::Peer>>, std::tuple<>>::value>::type std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libconsensus.so+0xa269b)
[m-2]     #8 decltype(std::declval<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>&>()()) std::__invoke[abi:v160006]<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>&>(std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libconsensus.so+0xa269b)
[m-2]     #9 void std::__invoke_void_return_wrapper<void, true>::__call<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>&>(std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libconsensus.so+0xa269b)
[m-2]     #10 std::__function::__alloc_func<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>, std::allocator<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libconsensus.so+0xa269b)
[m-2]     #11 std::__function::__func<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>, std::allocator<std::__bind<void (yb::consensus::Peer::*)(), std::shared_ptr<yb::consensus::Peer>&>>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libconsensus.so+0xa269b)
[m-2]     #12 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyrpc.so+0x10d123)
[m-2]     #13 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyrpc.so+0x10d123)
[m-2]     #14 yb::rpc::OutboundCall::InvokeCallbackSync() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:390:3 (libyrpc.so+0x10d123)
[m-2]     #15 yb::rpc::InvokeCallbackTask::Run() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:125:10 (libyrpc.so+0x10d03d)
[m-2]     #16 yb::rpc::(anonymous namespace)::Worker::Execute() ${BUILD_ROOT}/../../src/yb/rpc/thread_pool.cc:104:15 (libyrpc.so+0x1a1184)
[m-2]     #17 decltype(*std::declval<yb::rpc::(anonymous namespace)::Worker*&>().*std::declval<void (yb::rpc::(anonymous namespace)::Worker::*&)()>()()) std::__invoke[abi:v160006]<void (yb::rpc::(anonymous namespace)::Worker::*&)(), yb::rpc::(anonymous namespace)::Worker*&, void>(void (yb::rpc::(anonymous namespace)::Worker::*&)(), yb::rpc::(anonymous namespace)::Worker*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:359:23 (libyrpc.so+0x1a15db)
[m-2]     #18 std::__bind_return<void (yb::rpc::(anonymous namespace)::Worker::*)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::(anonymous namespace)::Worker::*)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<void (yb::rpc::(anonymous namespace)::Worker::*)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>, 0ul, std::tuple<>>(void (yb::rpc::(anonymous namespace)::Worker::*&)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>&, std::__tuple_indices<0ul>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0x1a15db)
[m-2]     #19 std::__bind_return<void (yb::rpc::(anonymous namespace)::Worker::*)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::(anonymous namespace)::Worker::*)(), std::tuple<yb::rpc::(anonymous namespace)::Worker*>, std::tuple<>>::value>::type std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0x1a15db)
[m-2]     #20 decltype(std::declval<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>&>()()) std::__invoke[abi:v160006]<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>&>(std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0x1a15db)
[m-2]     #21 void std::__invoke_void_return_wrapper<void, true>::__call<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>&>(std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libyrpc.so+0x1a15db)
[m-2]     #22 std::__function::__alloc_func<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>, std::allocator<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libyrpc.so+0x1a15db)
[m-2]     #23 std::__function::__func<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>, std::allocator<std::__bind<void (yb::rpc::(anonymous namespace)::Worker::* const&)(), yb::rpc::(anonymous namespace)::Worker* const&>>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libyrpc.so+0x1a15db)
[m-2]     #24 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyb_util.so+0x3afa02)
[m-2]     #25 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyb_util.so+0x3afa02)
[m-2]     #26 yb::Thread::SuperviseThread(void*) ${BUILD_ROOT}/../../src/yb/util/thread.cc:842:3 (libyb_util.so+0x3afa02)
[m-2] 
[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
[m-2]     #5 decltype(std::declval<yb::consensus::Peer::Init()::$_0&>()()) std::__invoke[abi:v160006]<yb::consensus::Peer::Init()::$_0&>(yb::consensus::Peer::Init()::$_0&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libconsensus.so+0x9bc67)
[m-2]     #6 void std::__invoke_void_return_wrapper<void, true>::__call<yb::consensus::Peer::Init()::$_0&>(yb::consensus::Peer::Init()::$_0&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libconsensus.so+0x9bc67)
[m-2]     #7 std::__function::__alloc_func<yb::consensus::Peer::Init()::$_0, std::allocator<yb::consensus::Peer::Init()::$_0>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libconsensus.so+0x9bc67)
[m-2]     #8 std::__function::__func<yb::consensus::Peer::Init()::$_0, std::allocator<yb::consensus::Peer::Init()::$_0>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libconsensus.so+0x9bc67)
[m-2]     #9 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyrpc.so+0x11bc88)
[m-2]     #10 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyrpc.so+0x11bc88)
[m-2]     #11 yb::rpc::PeriodicTimer::Callback(long) ${BUILD_ROOT}/../../src/yb/rpc/periodic.cc:189:5 (libyrpc.so+0x11bc88)
[m-2]     #12 yb::rpc::PeriodicTimer::Callback(long)::$_0::operator()(yb::Status const&) const ${BUILD_ROOT}/../../src/yb/rpc/periodic.cc:208:14 (libyrpc.so+0x11c4ff)
[m-2]     #13 yb::rpc::ScheduledTask<yb::rpc::PeriodicTimer::Callback(long)::$_0>::Run(yb::Status const&) ${BUILD_ROOT}/../../src/yb/rpc/scheduler.h:55:5 (libyrpc.so+0x11c4ff)
[m-2]     #14 yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()::operator()() const ${BUILD_ROOT}/../../src/yb/rpc/scheduler.cc:142:57 (libyrpc.so+0x170b39)
[m-2]     #15 void boost::asio::asio_handler_invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, ...) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/handler_invoke_hook.hpp:88:3 (libyrpc.so+0x170b39)
[m-2]     #16 void boost_asio_handler_invoke_helpers::invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/handler_invoke_helpers.hpp:54:3 (libyrpc.so+0x170b39)
[m-2]     #17 void boost::asio::detail::handler_work<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void>::complete<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&, yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/handler_work.hpp:520:7 (libyrpc.so+0x170b39)
[m-2]     #18 boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/completion_handler.hpp:74:9 (libyrpc.so+0x170b39)
[m-2]     #19 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/scheduler_operation.hpp:40:5 (libyrpc.so+0xf7c2a)
[m-2]     #20 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/impl/scheduler.ipp:492:12 (libyrpc.so+0xf7c2a)
[m-2]     #21 boost::asio::detail::scheduler::run(boost::system::error_code&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/detail/impl/scheduler.ipp:210:10 (libyrpc.so+0xf7771)
[m-2]     #22 boost::asio::io_context::run(boost::system::error_code&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/include/boost/asio/impl/io_context.ipp:71:16 (libyrpc.so+0xf2d10)
[m-2]     #23 yb::rpc::IoThreadPool::Impl::Execute() ${BUILD_ROOT}/../../src/yb/rpc/io_thread_pool.cc:76:17 (libyrpc.so+0xf2d10)
[m-2]     #24 decltype(*std::declval<yb::rpc::IoThreadPool::Impl*&>().*std::declval<void (yb::rpc::IoThreadPool::Impl::*&)()>()()) std::__invoke[abi:v160006]<void (yb::rpc::IoThreadPool::Impl::*&)(), yb::rpc::IoThreadPool::Impl*&, void>(void (yb::rpc::IoThreadPool::Impl::*&)(), yb::rpc::IoThreadPool::Impl*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:359:23 (libyrpc.so+0xf916b)
[m-2]     #25 std::__bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, 0ul, std::tuple<>>(void (yb::rpc::IoThreadPool::Impl::*&)(), std::tuple<yb::rpc::IoThreadPool::Impl*>&, std::__tuple_indices<0ul>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0xf916b)
[m-2]     #26 std::__bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>, __is_valid_bind_return<void (yb::rpc::IoThreadPool::Impl::*)(), std::tuple<yb::rpc::IoThreadPool::Impl*>, std::tuple<>>::value>::type std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0xf916b)
[m-2]     #27 decltype(std::declval<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&>()()) std::__invoke[abi:v160006]<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&>(std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0xf916b)
[m-2]     #28 std::__bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>, __is_valid_bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>::value>::type std::__apply_functor[abi:v160006]<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>(std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&, std::tuple<>&, std::__tuple_indices<>, std::tuple<>&&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:263:12 (libyrpc.so+0xf916b)
[m-2]     #29 std::__bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>, __is_valid_bind_return<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>, std::tuple<>, std::tuple<>>::value>::type std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>::operator()[abi:v160006]<>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/bind.h:295:20 (libyrpc.so+0xf916b)
[m-2]     #30 decltype(std::declval<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>()()) std::__invoke[abi:v160006]<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>(std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:394:23 (libyrpc.so+0xf916b)
[m-2]     #31 void std::__invoke_void_return_wrapper<void, true>::__call<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&>(std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/invoke.h:487:9 (libyrpc.so+0xf916b)
[m-2]     #32 std::__function::__alloc_func<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>, std::allocator<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>>, void ()>::operator()[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:185:16 (libyrpc.so+0xf916b)
[m-2]     #33 std::__function::__func<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>, std::allocator<std::__bind<std::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>>>, void ()>::operator()() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:356:12 (libyrpc.so+0xf916b)
[m-2]     #34 std::__function::__value_func<void ()>::operator()[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:510:16 (libyb_util.so+0x3afa02)
[m-2]     #35 std::function<void ()>::operator()() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__functional/function.h:1156:12 (libyb_util.so+0x3afa02)
[m-2]     #36 yb::Thread::SuperviseThread(void*) ${BUILD_ROOT}/../../src/yb/util/thread.cc:842:3 (libyb_util.so+0x3afa02)

karan-yb added a commit that referenced this issue Aug 29, 2023
…ulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28161
karan-yb added a commit that referenced this issue Aug 30, 2023
…_ reset which resulted in tsan race in connection

Summary:
Original commit: 0dc13e6 / D28161
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: bogdan, ybase, mbautin

Differential Revision: https://phorge.dev.yugabyte.com/D28166
vaibhav-yb pushed a commit to vaibhav-yb/yugabyte-db that referenced this issue Sep 7, 2023
…ed by stuck OutboundCall

Summary:
In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D27859
vaibhav-yb pushed a commit to vaibhav-yb/yugabyte-db that referenced this issue Sep 7, 2023
…hich resulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     yugabyte#1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     yugabyte#2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     yugabyte#3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     yugabyte#4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     yugabyte#1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     yugabyte#2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     yugabyte#3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     yugabyte#4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28161
karan-yb added a commit that referenced this issue Sep 27, 2023
…ag caused by stuck OutboundCall

Summary:
Original commit: 98586ef / D27859 and 0dc13e6 / D28161

In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: bogdan, ybase, rthallam, yyan, yql, amitanand

Differential Revision: https://phorge.dev.yugabyte.com/D28057
karan-yb added a commit that referenced this issue Sep 27, 2023
…ulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28161
karan-yb added a commit that referenced this issue Sep 28, 2023
…ag caused by stuck OutboundCall

Summary:
Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28057

In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28695
karan-yb added a commit that referenced this issue Sep 28, 2023
…ag caused by stuck OutboundCall

Summary:
Original commit: 98586ef / D27859 and 0dc13e6 / D28161

In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: bogdan, ybase, rthallam, yyan, yql, amitanand

Differential Revision: https://phorge.dev.yugabyte.com/D28048
karan-yb added a commit that referenced this issue Sep 28, 2023
…ulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28161
@karan-yb karan-yb reopened this Sep 28, 2023
@karan-yb
Copy link
Contributor Author

Need to backport change to 2.14.10 branch.

karan-yb added a commit that referenced this issue Sep 28, 2023
…r lag caused by stuck OutboundCall

Summary:
Original commit: Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28695

In production, we saw a few cases where follower lag was continuously increasing for few tablets. Also we noticed that the replica which was lagging was not removed from quorum by ConsensusQueue either. After capturing a dump of a node where leader was hosted, we were able to figure out that the performing_update_mutex was held for a long time for the affected Peer.

Peer acquires the performing_update_mutex when it is building a request to send to peer and keeps the mutex locked until it receives a response. In the captured dump, we noticed that the OutboundCall was in SENT state, but we were not able to confirm if the connection (on which it was sent) was active or not -- we did an analysis of OutboundCall references and we believe that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or heartbeat to the peer (in `SignalRequest`), we check if we can acquire the `performing_update_mutex`. If the mutex is already held, then we try to see if it has been more than the time specified by `FLAGS_stuck_peer_call_threshold_ms` + request timeout since the call start time.
* If the lock is held for more than timeout + `FLAGS_stuck_peer_call_threshold_ms` time duration, then we log additional details which can help identify the root cause of the issue (see below for examples).
* And when `FLAGS_force_recover_from_stuck_peer_call` is set to true, we try to mark the stuck call as failed. If the call object is not present, then we won't be able to recover from this situation.

Another change is that whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task to ensure that all queued operations on the socket are executed in order. However, since the socket is closed, all of these operations in queue will encounter write failures, resulting in all of them scheduling a DestroyConnection task. After the first DestroyConnection task is executed, we will not be able to find this connection in the reactor-tracked connections, which will lead to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```

1. OutboundCall timing details - it includes the time when call was created, when it was sent on network and when callback call was received.
2. Connection - we dump the connection state on which the call was sent. This state includes whether connection is active or not. If alive, then we will see the active calls count, and whether the current call is present in active calls or not. If connection is not alive, then you will see the time when connection was shutdown. We also log the number of active call present during shutdown and queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For ex, in the above sample logs, as you can see the connection was closed after the data was already sent on the socket.

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*
Jenkins: urgent

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: bogdan, ybase, rthallam, yyan, yql, amitanand

Differential Revision: https://phorge.dev.yugabyte.com/D28915
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants