Skip to content

2.29.0.0-b67

@druzac druzac tagged this 17 Oct 15:30
Summary:
Before the refactor in 583053e943922b97cc47f510ddf4ad137230754a, any failure in `tablet_server_main_impl` would result in a `LOG(FATAL)` and immediately crash the program. After that commit, failures in `StartServices` return a non-ok `Status` and call the destructor on the locally allocated Services struct. This results in destructors called on partially initialized services, exposing an existing check-fail bug in `CQLServer::Start` and `CQLServer::Shutdown`.

This diff refactors `StartServices` so it doesn't own the `Services` struct. This way on an early return service destructors are not invoked.
Instead we will hit the `LOG_AND_RETURN_FROM_MAIN_NOT_OK` wrapping the `StartServices` call, thus skipping any destructor invocations. The `LOG_AND_RETURN_FROM_MAIN_NOT_OK` contains a `LOG(FATAL)` invocation on non-ok `Status` as long as glogging is initialized, so it will immediately abort the program on non-ok `Status`.
Jira: DB-18651

Test Plan:
Check fail stack trace observed by qa, indicating shutdown calls via destructor on early return from `StartServices`:
```
F20251013 00:29:26 ../../src/yb/rpc/acceptor.cc:131] Check failed: sockets_to_add_.empty()
@ 0xaaaac089d5a4 google::LogMessage::SendToLog()
@ 0xaaaac089e4c4 google::LogMessage::Flush()
@ 0xaaaac089ec1c google::LogMessageFatal::~LogMessageFatal()
@ 0xaaaac1c68cec yb::rpc::Acceptor::Shutdown()
@ 0xaaaac1c688f8 yb::rpc::Acceptor::~Acceptor()
@ 0xaaaac1c97358 yb::rpc::Messenger::ShutdownAcceptor()
@ 0xaaaac1dc5540 yb::server::RpcAndWebServerBase::~RpcAndWebServerBase()
@ 0xaaaac26bd714 yb::cqlserver::CQLServer::~CQLServer()
@ 0xaaaac22e5db0 yb::tserver::Services::~Services()
@ 0xaaaac22e3cd4 yb::tserver::StartServices()
@ 0xaaaac084cf04 main
@ 0xffff82592340 __libc_start_call_main
@ 0xffff82592418 __libc_start_main_alias_2
@ 0xaaaac06d2034 (unknown)
```

With this diff[0], and this patch, the std error for the tserver on startup is:

```
% yb-ctl create --rf 1 --tserver_flags "TEST_webserver_fail_on_startup=true"
```

```
F1017 10:49:56.988938 4164516544 tablet_server_main_impl.cc:480] Illegal state (yb/server/webserver.cc:463): foo
Fatal failure details written to /Users/zdrudi/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.FATAL.details.2025-10-17T10_49_56.pid9311.txt
F20251017 10:49:56 ../../src/yb/tserver/tablet_server_main_impl.cc:480] Illegal state (yb/server/webserver.cc:463): foo
    @        0x10a5d6510  google::LogDestination::LogToSinks()
    @        0x10a5d5654  google::LogMessage::SendToLog()
    @        0x10a5d6034  google::LogMessage::Flush()
    @        0x10a5d9f3c  google::LogMessageFatal::~LogMessageFatal()
    @        0x10a5d6ee0  google::LogMessageFatal::~LogMessageFatal()
    @        0x104f3d654  yb::tserver::TabletServerMain()
    @        0x1903020e0  start

*** Check failure stack trace: ***
    @        0x10a5d5814  google::LogMessage::SendToLog()
    @        0x10a5d6034  google::LogMessage::Flush()
    @        0x10a5d9f3c  google::LogMessageFatal::~LogMessageFatal()
    @        0x10a5d6ee0  google::LogMessageFatal::~LogMessageFatal()
    @        0x104f3d654  yb::tserver::TabletServerMain()
    @        0x1903020e0  start

```

On master and the same diff [0], there is no std err and we see in the INFO logs messages about destructors being called:
```
I1017 10:54:45.258898 4164516544 service_pool.cc:146] yb.stateful_service.PgAutoAnalyzeService: yb::rpc::ServicePoolImpl created at 0x1029dc9a0
I1017 10:54:45.258975 4164516544 webserver.cc:365] Starting webserver on 127.0.0.1:9000
I1017 10:54:45.258980 4164516544 webserver.cc:374] Document root: /Users/zdrudi/code/yugabyte-db/www
I1017 10:54:45.258991 4164516544 webserver.cc:360] Webserver listen spec is 127.0.0.1:9000
I1017 10:54:45.259048 4164516544 tablet_server.cc:777] TabletServer shutting down...
```

[0]
```
% git diff
diff --git a/src/yb/server/webserver.cc b/src/yb/server/webserver.cc
index ce660c3b88..6c9279bcee 100644
--- a/src/yb/server/webserver.cc
+++ b/src/yb/server/webserver.cc
@@ -117,6 +117,10 @@ DEFINE_test_flag(bool, use_custom_varz, false,
     "Cause /varz endpoint of TServers to output for selected gflags their values at "
     "startup rather than their current values.");

+DEFINE_test_flag(
+    bool, webserver_fail_on_startup, false,
+    "If set fail webserver startup");
+
 namespace yb {

 using std::string;
@@ -455,6 +459,9 @@ Status Webserver::Impl::Start() {
   // default callback. That method unpacks the pointer to this and calls the real
   // callback.
   context_ = sq_start(&callbacks, reinterpret_cast<void*>(this), &options[0]);
+  if (FLAGS_TEST_webserver_fail_on_startup) {
+    return STATUS(IllegalState, "foo");
+  }

   // Restore the child signal handler so wait() works properly.
   signal(SIGCHLD, sig_chld);
```

Reviewers: bkolagani, hsunder, asrivastava

Reviewed By: asrivastava

Subscribers: ybase, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D47478
Assets 2
Loading