Skip to content

Fix crash when COM_BINLOG_DUMP received with idle backends (#5556)#5559

Merged
renecannao merged 2 commits intov3.0from
v3.0-issue5556
Apr 1, 2026
Merged

Fix crash when COM_BINLOG_DUMP received with idle backends (#5556)#5559
renecannao merged 2 commits intov3.0from
v3.0-issue5556

Conversation

@renecannao
Copy link
Copy Markdown
Contributor

@renecannao renecannao commented Apr 1, 2026

Summary

  • Fix: Clean up idle backend connections from other hostgroups when a session transitions to FAST_FORWARD via COM_BINLOG_DUMP / COM_BINLOG_DUMP_GTID / COM_REGISTER_SLAVE
  • Root cause: GPFC_Replication_SwitchToFastForward() only managed the active backend, leaving idle backends from other hostgroups in the session's mybes array. When MySQL's wait_timeout closed these idle connections, the POLLIN event triggered handler() with async_state_machine==ASYNC_IDLE, hitting assert(0).
  • Regression test: test_binlog_dump_multi_backend_crash-t reproduces the exact scenario from the bug report

Fixes #5556

Test plan

  • New test test_binlog_dump_multi_backend_crash-t confirms crash without fix, passes with fix
  • Run legacy-binlog-g1 test group in CI
  • Verify no regressions in other binlog tests (test_binlog_fast_forward-t, test_binlog_reader-t, etc.)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a crash during binary-log replication when switching replication modes with multiple active backends by improving backend cleanup to maintain stable connections.
  • Tests

    • Added a regression test that reproduces the multi-backend replication scenario and verifies stability after mode transitions and idle backend closures.

… hostgroups (#5556)

When a session has backend connections to multiple hostgroups and receives
COM_BINLOG_DUMP, GPFC_Replication_SwitchToFastForward() transitions the
session to FAST_FORWARD mode but only manages the active backend. Idle
backends from other hostgroups remain in the session's mybes array with
async_state_machine==ASYNC_IDLE. If MySQL's wait_timeout closes one of
these idle connections, the resulting POLLIN event causes handler() to be
called with ASYNC_IDLE, which has no case in the switch and hits assert(0).

Fix: Clean up idle backends from other hostgroups during the FAST_FORWARD
transition in GPFC_Replication_SwitchToFastForward(). These connections
will never be used again, so they are destroyed immediately.

Includes a regression test (test_binlog_dump_multi_backend_crash-t) that:
- Sets up two hostgroups with query rules routing SELECTs to reader HG
- Creates backend connections to both hostgroups on one session
- Sends COM_BINLOG_DUMP to enter FAST_FORWARD mode
- Waits for wait_timeout to close the idle backend
- Verifies ProxySQL does not crash
Copilot AI review requested due to automatic review settings April 1, 2026 09:52
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Added backend cleanup in MySQL_Session::GPFC_Replication_SwitchToFastForward() to remove and destroy non-selected backends when transitioning to FAST_FORWARD, and added a TAP regression test and test-group entry that reproduces the multi-backend idle-closure scenario (issue #5556).

Changes

Cohort / File(s) Summary
Core Fix
lib/MySQL_Session.cpp
Added a loop in GPFC_Replication_SwitchToFastForward() that logs and destroys pooled server connections for non-selected backends, removes them from the backend collection with remove_index_fast(i), resets their state, and deletes the objects.
Test Configuration
test/tap/groups/groups.json
Added mapping entry for the new test: "test_binlog_dump_multi_backend_crash-t"["legacy-binlog-g1"].
Regression Test
test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
New TAP test that configures query rules, creates reader/writer backend connections, opens a replication stream (COM_BINLOG_DUMP), forces a backend to be closed by MySQL via reduced wait_timeout, triggers POLLIN handling on the idle backend, and verifies ProxySQL stability.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

status/triage

Poem

🐇 I hopped through backends, sniffed each little thread,

Found idle friends sleeping, their sockets cold and dead.
With tidy paws I cleared the mess, one-by-one they flew,
Now FAST_FORWARD skips the crash — the stream runs fresh and true. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the primary fix: preventing a crash when COM_BINLOG_DUMP is received with idle backends.
Linked Issues check ✅ Passed The PR addresses all key coding objectives: fixes the crash handler in GPFC_Replication_SwitchToFastForward(), cleans up idle backends, adds regression test, and includes assertions to validate preconditions.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #5556: backend cleanup logic in MySQL_Session.cpp, test registration in groups.json, and new regression test—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v3.0-issue5556

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #5556, where ProxySQL could crash due to unhandled events on idle backends when a session transitions to FAST_FORWARD mode. The fix involves cleaning up and removing idle backends from other hostgroups during this transition in lib/MySQL_Session.cpp. A comprehensive regression test has also been added. The review feedback suggests refactoring the cleanup logic into a reusable helper method to cover all FAST_FORWARD transition paths and improving the robustness of the test code by passing client flags directly to the connection function.

Comment thread lib/MySQL_Session.cpp
Comment on lines +5104 to +5129
// Issue #5556: Clean up idle backends from other hostgroups.
// When transitioning to FAST_FORWARD, only the active backend (mybe) is
// needed. Any other backends from different hostgroups are now idle and
// will never be used again. If left alive, they can receive POLLIN events
// (e.g. from MySQL wait_timeout) which causes a crash in handler() because
// ASYNC_IDLE is not handled in the switch-case.
{
unsigned int i = 0;
while (i < mybes->len) {
MySQL_Backend* _mybe = (MySQL_Backend*)mybes->index(i);
if (_mybe != mybe) {
if (_mybe->server_myds && _mybe->server_myds->myconn) {
proxy_info(
"Cleaning up idle backend from hostgroup %d during FAST_FORWARD transition\n",
_mybe->hostgroup_id
);
_mybe->server_myds->destroy_MySQL_Connection_From_Pool(false);
}
mybes->remove_index_fast(i);
_mybe->reset();
delete _mybe;
} else {
i++;
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cleanup logic for idle backends is correct and effectively prevents the reported crash by ensuring that only the active backend remains registered in the session and monitored for events during FAST_FORWARD mode.

However, this logic is currently specific to replication-related transitions. Since FAST_FORWARD can also be enabled via query rules (handled in GPFC_WaitingClientData_FastForwardSession), sessions with multiple backends might still be vulnerable to similar crashes if a rule triggers a transition to FAST_FORWARD.

Consider refactoring this cleanup logic into a reusable helper method (e.g., cleanup_idle_backends_for_fast_forward()) and applying it to all code paths that transition a session to FAST_FORWARD status.

Comment on lines +170 to +173
proxy->options.client_flag |= CLIENT_DEPRECATE_EOF;

if (!mysql_real_connect(proxy, cl.root_host, cl.root_username,
cl.root_password, NULL, cl.root_port, NULL, 0)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the regression test, manually modifying proxy->options.client_flag before calling mysql_real_connect is generally discouraged as the connection function might overwrite these options or ignore them if they are not passed as arguments. While this might work with the specific MariaDB client library version used by ProxySQL, it is more robust to pass the required flags directly to the mysql_real_connect call.

if (!mysql_real_connect(proxy, cl.root_host, cl.root_username,
			cl.root_password, NULL, cl.root_port, NULL, CLIENT_DEPRECATE_EOF)) {

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp`:
- Around line 143-156: The test currently only pins ^SELECT to reader_hg so
non-SELECTs still go to the client's default hostgroup, hiding the
multi-hostgroup session; update the setup to explicitly ensure writer_hg is used
for non-SELECTs by adding a mysql_query_rules entry that routes a rule matching
DO|SET (or a catch-all for non-^SELECT) to writer_hg, or add an assertion using
admin stats to confirm the session has backends in both reader_hg and writer_hg
before calling mariadb_rpl_open(); modify the block that inserts into
mysql_query_rules (and the corresponding block around lines 189-209) to create
the explicit writer-side rule and/or add a call that checks session ownership of
backends for reader_hg and writer_hg prior to mariadb_rpl_open().
- Around line 95-107: The test currently mutates MySQL/ProxySQL state without
reliably capturing and restoring it: ensure you read and validate the real
@@global.wait_timeout into orig_wait_timeout (do not default silently to 28800
on read failure) by checking mysql_query/mysql_store_result/mysql_fetch_row
errors, capture the existing mysql_query_rules before issuing any DELETE, then
add only a test-specific rule (instead of wiping all rules) and on test teardown
restore the exact original query rules and the original wait_timeout (don’t
hardcode setting mysql-set_query_lock_on_hostgroup=1); update the logic around
the functions/variables orig_wait_timeout, set_wt, and the DELETE/insert rule
statements so the test is order-independent and fully restores state even on
partial failures (also apply same changes to the other occurrences noted).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99be3f2d-1e6d-490b-90cf-8945089e4d9f

📥 Commits

Reviewing files that changed from the base of the PR and between d0625dc and b2f4b4f.

📒 Files selected for processing (3)
  • lib/MySQL_Session.cpp
  • test/tap/groups/groups.json
  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: CI-legacy-clickhouse-g1 / tests (clickhouse23)
  • GitHub Check: run / trigger
  • GitHub Check: claude-review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{cpp,h,hpp}: Class names must use PascalCase with protocol prefixes (MySQL_, PgSQL_, ProxySQL_)
Member variables must use snake_case
Constants and macros must use UPPER_SNAKE_CASE
C++17 is required; use conditional compilation via #ifdef PROXYSQLGENAI, #ifdef PROXYSQL31, etc. for feature flags
Use pthread mutexes for synchronization and std::atomic<> for counters

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
  • lib/MySQL_Session.cpp
**/*.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Use RAII for resource management and jemalloc for memory allocation

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
  • lib/MySQL_Session.cpp
test/tap/tests/{test_*,*-t}.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Test files in test/tap/tests/ must follow the naming pattern test_*.cpp or *-t.cpp

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
lib/**/*.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

One class per file is the typical convention

Files:

  • lib/MySQL_Session.cpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to test/tap/tests/unit/**/*.cpp : Unit tests in test/tap/tests/unit/ must use test_globals.h and test_init.h and link against libproxysql.a via the custom test harness

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to test/tap/tests/{test_*,*-t}.cpp : Test files in test/tap/tests/ must follow the naming pattern test_*.cpp or *-t.cpp

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to **/*.{cpp,h,hpp} : C++17 is required; use conditional compilation via `#ifdef` PROXYSQLGENAI, `#ifdef` PROXYSQL31, etc. for feature flags

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-02-13T05:55:42.693Z
Learnt from: mevishalr
Repo: sysown/proxysql PR: 5364
File: lib/MySQL_Logger.cpp:1211-1232
Timestamp: 2026-02-13T05:55:42.693Z
Learning: In ProxySQL, the MySQL_Logger and PgSQL_Logger destructors run after all worker threads have been joined during shutdown. The sequence in src/main.cpp is: (1) join all worker threads, (2) call ProxySQL_Main_shutdown_all_modules() which deletes the loggers. Therefore, there is no concurrent thread access during logger destruction, and lock ordering in the destructors cannot cause deadlocks.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-26T16:39:02.446Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5548
File: lib/mysql_connection.cpp:1837-1843
Timestamp: 2026-03-26T16:39:02.446Z
Learning: In ProxySQL's lib/mysql_connection.cpp, `SHOW WARNINGS` detection for both `update_warning_count_from_connection()` and the `add_eof()` call in `ASYNC_USE_RESULT_CONT` intentionally uses `myds->sess->CurrentQuery.QueryParserArgs.digest_text` (comment-stripped digest text). This means the fix/feature does not work when `mysql-query_digests_keep_comment=1` (digest_text contains comments) or `mysql-query_digests=0` (digest_text is unavailable) — these configurations are explicitly excluded from the regression test for `reg_test_5306-show_warnings_with_comment-t`. This design is consistent across the codebase and is an accepted, documented limitation.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
🪛 Clang (14.0.6)
test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp

[error] 26-26: 'cstdlib' file not found

(clang-diagnostic-error)

🔇 Additional comments (2)
test/tap/groups/groups.json (1)

255-255: LGTM! Test group mapping is correctly configured.

The new regression test test_binlog_dump_multi_backend_crash-t is properly assigned to the legacy-binlog-g1 group, which is consistent with other binlog-related tests in the suite. The JSON syntax is valid, the alphabetical ordering is correct, and the naming convention matches existing patterns.

lib/MySQL_Session.cpp (1)

5111-5126: Good prune loop for FAST_FORWARD cleanup.

Keeping i unchanged after remove_index_fast(i) is the right move here; it ensures the swapped-in backend is still examined, so this transition won't skip any stale backend entries.

Comment on lines +95 to +107
long orig_wait_timeout = 28800;
if (mysql_query(backend, "SELECT @@global.wait_timeout") == 0) {
MYSQL_RES* res = mysql_store_result(backend);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row && row[0]) orig_wait_timeout = strtol(row[0], NULL, 10);
mysql_free_result(res);
}
}
diag("Original wait_timeout: %ld, setting to %d", orig_wait_timeout, SHORT_WAIT_TIMEOUT);

std::string set_wt = "SET GLOBAL wait_timeout=" + std::to_string(SHORT_WAIT_TIMEOUT);
if (mysql_query(backend, set_wt.c_str()) != 0) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Snapshot and restore the original server state.

This test mutates shared MySQL/ProxySQL config but does not restore it faithfully: orig_wait_timeout falls back to 28800 if the read fails, DELETE FROM mysql_query_rules drops every rule, and cleanup hardcodes mysql-set_query_lock_on_hostgroup=1. That makes the test order-dependent and can leave the environment changed after partial failures. Please capture the original values/rules first, add only a test-specific rule, and restore exactly what this test modified.

Also applies to: 144-156, 298-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp` around lines 95 -
107, The test currently mutates MySQL/ProxySQL state without reliably capturing
and restoring it: ensure you read and validate the real @@global.wait_timeout
into orig_wait_timeout (do not default silently to 28800 on read failure) by
checking mysql_query/mysql_store_result/mysql_fetch_row errors, capture the
existing mysql_query_rules before issuing any DELETE, then add only a
test-specific rule (instead of wiping all rules) and on test teardown restore
the exact original query rules and the original wait_timeout (don’t hardcode
setting mysql-set_query_lock_on_hostgroup=1); update the logic around the
functions/variables orig_wait_timeout, set_wt, and the DELETE/insert rule
statements so the test is order-independent and fully restores state even on
partial failures (also apply same changes to the other occurrences noted).

Comment thread test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a ProxySQL crash triggered when a session switches to FAST_FORWARD for replication commands (COM_BINLOG_DUMP/_GTID/COM_REGISTER_SLAVE) while the session still holds idle backend connections from other hostgroups. It also adds a TAP regression test covering the reported scenario (issue #5556).

Changes:

  • Clean up/remove non-active MySQL_Backend entries from mybes during GPFC_Replication_SwitchToFastForward() to prevent idle backend poll events from hitting an invalid state.
  • Add a new TAP regression test reproducing the multi-backend + wait_timeout close scenario.
  • Register the new test in the legacy-binlog-g1 TAP group.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
lib/MySQL_Session.cpp Removes/destroys idle backends from other hostgroups when transitioning to FAST_FORWARD.
test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp New regression test to ensure ProxySQL doesn’t crash when an idle backend is closed during FAST_FORWARD.
test/tap/groups/groups.json Adds the new test to the legacy-binlog-g1 group.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +134 to +138
if (writer_hg < 0 || reader_hg < 0) {
diag("Could not discover replication hostgroups. Skipping.");
ok(1, "SKIP: No replication hostgroups configured");
mysql_close(backend);
mysql_close(admin);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If replication hostgroups can’t be discovered, the test returns early after having already executed SET GLOBAL wait_timeout=..., leaving the backend’s global wait_timeout modified for subsequent tests. This can create flaky failures in the same TAP group. Ensure wait_timeout is restored (or move the global change until after this skip condition is ruled out) before returning.

Copilot uses AI. Check for mistakes.
- Move wait_timeout mutation after hostgroup discovery skip check so
  early returns don't leave backend state modified (Copilot review)
- Add precondition assertion verifying connections exist in both writer
  and reader hostgroups before sending COM_BINLOG_DUMP, using ConnOK
  pool stats (CodeRabbit review)
@renecannao
Copy link
Copy Markdown
Contributor Author

Addressed the actionable review feedback in 499896e:

  1. Copilot (line 138): Moved SET GLOBAL wait_timeout after the hostgroup discovery skip check, so early returns (e.g., no replication hostgroups) don't leave the backend's wait_timeout modified.

  2. CodeRabbit (line 156): Added a precondition assertion that verifies connections were established to both writer and reader hostgroups before sending COM_BINLOG_DUMP. Uses ConnOK from stats_mysql_connection_pool (total connections ever made) rather than ConnUsed (currently checked out), since the writer backend may be returned to the pool after DO 1 completes while still remaining in the session's mybes array.

Regarding the other review comments:

  • Gemini (line 5129) — Extending cleanup to GPFC_WaitingClientData_FastForwardSession (fast-forward via query rules) is a valid architectural improvement but out of scope for this issue. Could be addressed separately.
  • Gemini (line 173) — Setting client_flag before mysql_real_connect via proxy->options.client_flag |= CLIENT_DEPRECATE_EOF follows the same pattern used in test_binlog_fast_forward-t.cpp.
  • CodeRabbit (line 107) — Full save/restore of query rules and variables: the test runner already resets ProxySQL state between tests, and this pattern is consistent with the test suite. The test does restore wait_timeout and set_query_lock_on_hostgroup in its cleanup.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 1, 2026

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp (2)

123-137: ⚠️ Potential issue | 🟠 Major

Restore the exact shared state, not hardcoded defaults.

This still mutates shared MySQL/ProxySQL state without a faithful snapshot: orig_wait_timeout silently falls back to 28800, the test deletes every mysql_query_rules row, and teardown always resets mysql-set_query_lock_on_hostgroup to 1. That makes the test order-dependent and can leave the environment altered after partial failures. Please capture the real values/rules first and restore only what this test changed.

Also applies to: 145-157, 327-363

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp` around lines 123 -
137, The test currently uses a hardcoded fallback for orig_wait_timeout and
unconditionally deletes or resets shared state; change it to capture and persist
exact pre-test state: read and require the actual @@global.wait_timeout (set
orig_wait_timeout from the query result only, avoid silent fallback to 28800),
dump existing mysql_query_rules rows before you delete them and restore those
exact rows in teardown, and read the current value of the
mysql-set_query_lock_on_hostgroup variable (don’t assume 1) and restore that
original value during teardown; update the setup/teardown code paths that
reference orig_wait_timeout, mysql_query_rules handling, and
mysql-set_query_lock_on_hostgroup so only the captured state is restored even on
partial failures.

144-150: ⚠️ Potential issue | 🟠 Major

Make the writer route and precondition session-scoped.

Only ^SELECT is routed explicitly. DO 1 and the later SET @master_* statements still depend on implicit routing, yet MySQL_Session::GPFC_Replication_SwitchToFastForward() uses previous_hostgroup directly when COM_BINLOG_DUMP arrives. The stats_mysql_connection_pool.ConnOK check is pool-wide and cumulative, so unrelated traffic can satisfy it without proving that this client still owns both backends or that previous_hostgroup is writer_hg right before mariadb_rpl_open(). Add an explicit writer-side rule, or assert on session-scoped state after the SET @master_* calls and before opening replication.

Also applies to: 197-225, 234-240

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp` around lines 144 -
150, The test relies on implicit routing for writer operations so make the
writer route session-scoped and/or assert the session hostgroup before opening
replication: add an explicit mysql_query_rules INSERT that routes non-SELECT
statements (or a rule matching the specific "DO" / "SET" patterns) to writer_hg
(using MYSQL_QUERY and the existing mysql_query_rules logic) so this client owns
both backends, and immediately after the SET `@master_`* calls assert that
MySQL_Session::previous_hostgroup == writer_hg (or otherwise verify
session-scoped ownership) before calling mariadb_rpl_open()/handling
COM_BINLOG_DUMP; avoid relying on stats_mysql_connection_pool.ConnOK as it is
pool-wide.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp`:
- Around line 95-137: The test sets GLOBAL wait_timeout only on the current
backend (variable backend) but the crash requires lowering timeout on the actual
server(s) serving reader_hg; modify the setup after discovering
writer_hg/reader_hg so you query mysql_replication_hostgroups to find the
host/port(s) for reader_hg and run "SET GLOBAL wait_timeout=SHORT_WAIT_TIMEOUT"
on each of those backend connections (not just cl.mysql_host/cl.mysql_port),
ensuring this is done before creating the reader-side backend/connection used
later in the test; update the same logic in the other occurrence that changes
wait_timeout (the block around where set_wt is constructed and
mysql_query(backend, ...) is called).

---

Duplicate comments:
In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp`:
- Around line 123-137: The test currently uses a hardcoded fallback for
orig_wait_timeout and unconditionally deletes or resets shared state; change it
to capture and persist exact pre-test state: read and require the actual
@@global.wait_timeout (set orig_wait_timeout from the query result only, avoid
silent fallback to 28800), dump existing mysql_query_rules rows before you
delete them and restore those exact rows in teardown, and read the current value
of the mysql-set_query_lock_on_hostgroup variable (don’t assume 1) and restore
that original value during teardown; update the setup/teardown code paths that
reference orig_wait_timeout, mysql_query_rules handling, and
mysql-set_query_lock_on_hostgroup so only the captured state is restored even on
partial failures.
- Around line 144-150: The test relies on implicit routing for writer operations
so make the writer route session-scoped and/or assert the session hostgroup
before opening replication: add an explicit mysql_query_rules INSERT that routes
non-SELECT statements (or a rule matching the specific "DO" / "SET" patterns) to
writer_hg (using MYSQL_QUERY and the existing mysql_query_rules logic) so this
client owns both backends, and immediately after the SET `@master_`* calls assert
that MySQL_Session::previous_hostgroup == writer_hg (or otherwise verify
session-scoped ownership) before calling mariadb_rpl_open()/handling
COM_BINLOG_DUMP; avoid relying on stats_mysql_connection_pool.ConnOK as it is
pool-wide.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3087441b-d53c-447d-a034-ec27d82db9ad

📥 Commits

Reviewing files that changed from the base of the PR and between b2f4b4f and 499896e.

📒 Files selected for processing (1)
  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{cpp,h,hpp}: Class names must use PascalCase with protocol prefixes (MySQL_, PgSQL_, ProxySQL_)
Member variables must use snake_case
Constants and macros must use UPPER_SNAKE_CASE
C++17 is required; use conditional compilation via #ifdef PROXYSQLGENAI, #ifdef PROXYSQL31, etc. for feature flags
Use pthread mutexes for synchronization and std::atomic<> for counters

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
**/*.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Use RAII for resource management and jemalloc for memory allocation

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
test/tap/tests/{test_*,*-t}.cpp

📄 CodeRabbit inference engine (CLAUDE.md)

Test files in test/tap/tests/ must follow the naming pattern test_*.cpp or *-t.cpp

Files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
🧠 Learnings (11)
📓 Common learnings
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:27.165Z
Learning: In ProxySQL test files (test/tap/tests/), resource leaks (such as not calling `mysql_close()` on early return paths) are not typically fixed because test processes are short-lived and the OS frees resources on process exit. This is a common pattern across the test suite.
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to test/tap/tests/unit/**/*.cpp : Unit tests in test/tap/tests/unit/ must use test_globals.h and test_init.h and link against libproxysql.a via the custom test harness

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-01-20T07:40:34.938Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:24-28
Timestamp: 2026-01-20T07:40:34.938Z
Learning: In ProxySQL test files, calling `mysql_error(NULL)` after `mysql_init()` failure is safe because the MariaDB client library implementation returns an empty string for NULL handles (not undefined behavior).

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to test/tap/tests/{test_*,*-t}.cpp : Test files in test/tap/tests/ must follow the naming pattern test_*.cpp or *-t.cpp

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to **/*.{cpp,h,hpp} : C++17 is required; use conditional compilation via `#ifdef` PROXYSQLGENAI, `#ifdef` PROXYSQL31, etc. for feature flags

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-02-13T09:29:39.713Z
Learnt from: wazir-ahmed
Repo: sysown/proxysql PR: 5372
File: test/tap/tap/mcp_client.cpp:355-385
Timestamp: 2026-02-13T09:29:39.713Z
Learning: In ProxySQL MCP implementation (test/tap/tap/mcp_client.cpp), the `check_server()` method uses the ping endpoint which is designed to work without authentication. The `ping` method at the `config` endpoint should not require the `Authorization: Bearer` header, unlike tool invocation endpoints which do require authentication when `auth_token_` is set.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-26T16:39:02.446Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5548
File: lib/mysql_connection.cpp:1837-1843
Timestamp: 2026-03-26T16:39:02.446Z
Learning: In ProxySQL's lib/mysql_connection.cpp, `SHOW WARNINGS` detection for both `update_warning_count_from_connection()` and the `add_eof()` call in `ASYNC_USE_RESULT_CONT` intentionally uses `myds->sess->CurrentQuery.QueryParserArgs.digest_text` (comment-stripped digest text). This means the fix/feature does not work when `mysql-query_digests_keep_comment=1` (digest_text contains comments) or `mysql-query_digests=0` (digest_text is unavailable) — these configurations are explicitly excluded from the regression test for `reg_test_5306-show_warnings_with_comment-t`. This design is consistent across the codebase and is an accepted, documented limitation.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Feature tiers are controlled by build flags: PROXYSQL31=1 for v3.1.x, PROXYSQLGENAI=1 for v4.0.x; PROXYSQLGENAI=1 implies PROXYSQL31=1

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-03-22T14:38:16.093Z
Learnt from: CR
Repo: sysown/proxysql PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-22T14:38:16.093Z
Learning: Applies to **/*.{cpp,h,hpp} : Class names must use PascalCase with protocol prefixes (MySQL_, PgSQL_, ProxySQL_)

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
📚 Learning: 2026-02-13T05:55:42.693Z
Learnt from: mevishalr
Repo: sysown/proxysql PR: 5364
File: lib/MySQL_Logger.cpp:1211-1232
Timestamp: 2026-02-13T05:55:42.693Z
Learning: In ProxySQL, the MySQL_Logger and PgSQL_Logger destructors run after all worker threads have been joined during shutdown. The sequence in src/main.cpp is: (1) join all worker threads, (2) call ProxySQL_Main_shutdown_all_modules() which deletes the loggers. Therefore, there is no concurrent thread access during logger destruction, and lock ordering in the destructors cannot cause deadlocks.

Applied to files:

  • test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp
🪛 Clang (14.0.6)
test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp

[error] 26-26: 'cstdlib' file not found

(clang-diagnostic-error)

Comment on lines +95 to +137
// Step 2: Discover hostgroup topology
// ========================================================================
int writer_hg = -1;
int reader_hg = -1;
{
const char* q = "SELECT writer_hostgroup, reader_hostgroup "
"FROM mysql_replication_hostgroups LIMIT 1";
if (mysql_query(admin, q) == 0) {
MYSQL_RES* res = mysql_store_result(admin);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row && row[0] && row[1]) {
writer_hg = atoi(row[0]);
reader_hg = atoi(row[1]);
}
mysql_free_result(res);
}
}
}
if (writer_hg < 0 || reader_hg < 0) {
diag("Could not discover replication hostgroups. Skipping.");
ok(1, "SKIP: No replication hostgroups configured");
mysql_close(backend);
mysql_close(admin);
return exit_status();
}
diag("Discovered hostgroups: writer=%d, reader=%d", writer_hg, reader_hg);

// Save and lower wait_timeout (after skip checks so early returns don't
// leave it modified — addresses Copilot review feedback)
long orig_wait_timeout = 28800;
if (mysql_query(backend, "SELECT @@global.wait_timeout") == 0) {
MYSQL_RES* res = mysql_store_result(backend);
if (res) {
MYSQL_ROW row = mysql_fetch_row(res);
if (row && row[0]) orig_wait_timeout = strtol(row[0], NULL, 10);
mysql_free_result(res);
}
}
diag("Original wait_timeout: %ld, setting to %d", orig_wait_timeout, SHORT_WAIT_TIMEOUT);

std::string set_wt = "SET GLOBAL wait_timeout=" + std::to_string(SHORT_WAIT_TIMEOUT);
if (mysql_query(backend, set_wt.c_str()) != 0) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Lower the timeout on the backend that actually becomes idle.

The crash path depends on MySQL closing the idle reader_hg socket, but this code only changes wait_timeout on cl.mysql_host / cl.mysql_port. In a normal writer/reader topology, reader_hg can point at different backend server(s), so the sleep/fetch sequence can go green without ever producing the POLLIN that reproduces #5556. Please discover the server(s) behind reader_hg and apply the short timeout there before creating the reader-side backend.

Also applies to: 275-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/tap/tests/test_binlog_dump_multi_backend_crash-t.cpp` around lines 95 -
137, The test sets GLOBAL wait_timeout only on the current backend (variable
backend) but the crash requires lowering timeout on the actual server(s) serving
reader_hg; modify the setup after discovering writer_hg/reader_hg so you query
mysql_replication_hostgroups to find the host/port(s) for reader_hg and run "SET
GLOBAL wait_timeout=SHORT_WAIT_TIMEOUT" on each of those backend connections
(not just cl.mysql_host/cl.mysql_port), ensuring this is done before creating
the reader-side backend/connection used later in the test; update the same logic
in the other occurrence that changes wait_timeout (the block around where set_wt
is constructed and mysql_query(backend, ...) is called).

@renecannao renecannao merged commit 63007cf into v3.0 Apr 1, 2026
4 checks passed
This was referenced Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash when backend connection from another hostgroup is closed by MySQL while session is in FAST_FORWARD mode

2 participants