Skip to content

Commit

Permalink
[BACKPORT 2.18][#18948] YSQL: Explicitly destruct YSQL webserver on S…
Browse files Browse the repository at this point in the history
…IGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35171
  • Loading branch information
karthik-ramanathan-3006 committed Jun 6, 2024
1 parent c3ed204 commit f05a060
Show file tree
Hide file tree
Showing 4 changed files with 32 additions and 2 deletions.
6 changes: 6 additions & 0 deletions src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c
Original file line number Diff line number Diff line change
Expand Up @@ -438,6 +438,12 @@ webserver_worker_main(Datum unused)
MemoryContextSwitchTo(oldcontext);
}

if (webserver)
{
DestroyWebserver(webserver);
webserver = NULL;
}

if (rc & WL_POSTMASTER_DEATH)
proc_exit(1);

Expand Down
5 changes: 5 additions & 0 deletions src/yb/server/pgsql_webserver_wrapper.cc
Original file line number Diff line number Diff line change
Expand Up @@ -451,6 +451,11 @@ YBCStatus StartWebserver(WebserverWrapper *webserver_wrapper) {
return ToYBCStatus(WithMaskedYsqlSignals([webserver]() { return webserver->Start(); }));
}

void DestroyWebserver(struct WebserverWrapper *webserver) {
Webserver *webserver_impl = reinterpret_cast<Webserver *>(webserver);
delete webserver_impl;
}

void SetWebserverConfig(
WebserverWrapper *webserver_wrapper, bool enable_access_logging, bool enable_tcmalloc_logging,
int webserver_profiler_sample_freq_bytes) {
Expand Down
1 change: 1 addition & 0 deletions src/yb/server/pgsql_webserver_wrapper.h
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ typedef struct {
} YbConnectionMetrics;

struct WebserverWrapper *CreateWebserver(char *listen_addresses, int port);
void DestroyWebserver(struct WebserverWrapper *webserver);
void RegisterMetrics(ybpgmEntry *tab, int num_entries, char *metric_node_name);
void RegisterRpczEntries(
postgresCallbacks *callbacks, int *num_backends_ptr, rpczEntry **rpczEntriesPointer,
Expand Down
22 changes: 20 additions & 2 deletions src/yb/server/webserver.cc
Original file line number Diff line number Diff line change
Expand Up @@ -254,6 +254,9 @@ class Webserver::Impl {
// Mutex guarding against concurrenct calls to Stop().
std::mutex stop_mutex_;

// Variable to notify handlers to stop serving requests.
std::atomic<bool> stop_initiated = false;

mutable std::mutex auto_flags_mutex_;
// The AutoFlags that are associated with this particular server. In LTO builds we use the same
// process for both yb-master and yb-tserver, so the process may have more AutoFlags than this
Expand Down Expand Up @@ -448,11 +451,19 @@ Status Webserver::Impl::Start() {
}

void Webserver::Impl::Stop() {
std::lock_guard<std::mutex> lock_(stop_mutex_);
// Indicate to the handlers that they can now stop serving requests. If any further signals are
// received to terminate the webserver, the handlers will no longer access non-static variables
// and will return immediately.
stop_initiated = true;

std::lock_guard lock_(stop_mutex_);
if (context_ != nullptr) {
LOG(INFO) << "Stopping webserver on " << http_address_;
sq_stop(context_);
context_ = nullptr;
}

LOG(INFO) << "Webserver stopped";
}

Status Webserver::Impl::GetInputHostPort(HostPort* hp) const {
Expand Down Expand Up @@ -583,6 +594,13 @@ sq_callback_result_t Webserver::Impl::BeginRequestCallback(struct sq_connection*
sq_callback_result_t Webserver::Impl::RunPathHandler(const PathHandler& handler,
struct sq_connection* connection,
struct sq_request_info* request_info) {
constexpr auto SERVICE_UNAVAILABLE_MSG = "HTTP/1.1 503 Service Unavailable\r\n";
// If we're in the process of stopping, do not parse or route the request. Just return.
if (PREDICT_FALSE(stop_initiated)) {
sq_printf(connection, SERVICE_UNAVAILABLE_MSG);
return SQ_HANDLED_CLOSE_CONNECTION;
}

// Should we render with css styles?
bool use_style = true;

Expand Down Expand Up @@ -647,7 +665,7 @@ sq_callback_result_t Webserver::Impl::RunPathHandler(const PathHandler& handler,
for (const PathHandlerCallback& callback_ : handler.callbacks()) {
callback_(req, resp_ptr);
if (resp_ptr->code == 503) {
sq_printf(connection, "HTTP/1.1 503 Service Unavailable\r\n");
sq_printf(connection, SERVICE_UNAVAILABLE_MSG);
return SQ_HANDLED_CLOSE_CONNECTION;
}
}
Expand Down

0 comments on commit f05a060

Please sign in to comment.