Handling of pxc_maint_mode drops active connections #3182

Closed
Flourid opened this issue Dec 3, 2020 · 0 comments · Fixed by #3394

Flourid commented Dec 3, 2020

Problem:

Setting pxc_maint_mode to MAINTENANCE or SHUTDOWN marks the node's entry in the reader hostgroup as OFFLINE_HARD, resulting in dropped connections and errors in the application.

Configuration:

A Percona XtraDB Cluster with three nodes and one ProxySQL host with the following version:

[root@XXXXXXXX proxysql]# proxysql --version
ProxySQL version 2.0.14-percona-1.1, codename Truls

[root@XXXXXXXX proxysql]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

All Percona nodes were added to mysql_servers in hostgroup 10:

mysql> select * from mysql_servers;
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
3 rows in set (0.00 sec)
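
For reference, a configuration like this is loaded through the ProxySQL admin interface roughly like the sketch below (shown for one of the three nodes; the other two are added the same way, and the comment column is omitted):

-- Run against the ProxySQL admin interface (port 6032 by default).
INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections, use_ssl)
VALUES (10, '192.168.50.236', 3306, 200, 1);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;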

The Galera hostgroup is configured as follows (note the writer_is_also_reader=1 and max_writers=1):

mysql> select * from mysql_galera_hostgroups;
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
| writer_hostgroup | backup_writer_hostgroup | reader_hostgroup | offline_hostgroup | active | max_writers | writer_is_also_reader | max_transactions_behind | comment                    |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
| 10               | 12                      | 11               | 13                | 1      | 1           | 1                     | 100                     | XXXXXXXXXXXXXXXXXX         |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
1 row in set (0.00 sec)
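
The hostgroup definition itself would have been created with something like the following sketch (the columns match the table above):

-- Run against the ProxySQL admin interface.
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
   active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 12, 11, 13, 1, 1, 1, 100);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;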

The runtime servers are therefore configured as follows: one node is ONLINE as the writer, all three nodes are readers, and the other two writers are SHUNNED in the writer hostgroup and act as backup writers. This seems correct.

mysql> select * from runtime_mysql_servers order by hostgroup_id;
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | SHUNNED | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | SHUNNED | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.236 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.237 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.238 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.236 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.237 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
8 rows in set (0.01 sec)

Current behaviour:

When I now set pxc_maint_mode to MAINTENANCE on the active writer node (192.168.50.238)

root@localhost [(none)]> set global pxc_maint_mode=MAINTENANCE;
Query OK, 0 rows affected (10.00 sec)

runtime_mysql_servers changes to the following state:

mysql> select * from runtime_mysql_servers order by hostgroup_id;
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status       | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | SHUNNED      | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | SHUNNED      | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.236 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.237 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.238 | 3306 | 0         | OFFLINE_HARD | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.236 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 13           | 192.168.50.238 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+

Since the node's reader-hostgroup entry is OFFLINE_HARD (at least for a few seconds), existing connections are dropped, which shows up as errors in the application.
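
The 10.00 sec execution time of the SET GLOBAL above should correspond to pxc_maint_transition_period (10 seconds by default), which can be verified on the Percona node:

-- Run on the Percona XtraDB Cluster node, not on ProxySQL.
SHOW GLOBAL VARIABLES LIKE 'pxc_maint%';
-- Expected output (values assumed from this setup):
--   pxc_maint_mode              = MAINTENANCE
--   pxc_maint_transition_period = 10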

(The proxysql.log is also attached.)

Expected behaviour:

As described here, the expected behaviour is as follows:

"If the user needs to take a node into maintenance mode, the user can simply set pxc_maint_mode to MAINTENANCE. With that, pxc_maint_mode is updated and the client connection updating it goes into sleep for x seconds (as dictated by pxc_maint_transition_period) before giving back control to the user. ProxySQL auto-detects this change and marks the node as OFFLINE. With this change ProxySQL avoids opening new connections for any DML transactions but continues to service existing queries."

With the native ProxySQL/Percona integration, however, existing queries are evidently not serviced. As I understand it, the node should instead be marked OFFLINE_SOFT so that established connections continue to be served until they become inactive.

Right now, the only workaround seems to be to implement a custom script via the scheduler and to configure the cluster not as Galera, but as a generic group replication hostgroup (see the sketch below).
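
For completeness, the manual equivalent of what such a scheduler script would do is to drain the node through the ProxySQL admin interface before maintenance starts. This is only a sketch: while mysql_galera_hostgroups is active, the Galera monitor may override a manually set status, which is why the workaround involves dropping the native Galera configuration.

-- Run against the ProxySQL admin interface before starting maintenance.
UPDATE mysql_servers SET status='OFFLINE_SOFT'
  WHERE hostname='192.168.50.238' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;
-- OFFLINE_SOFT keeps serving established connections but accepts no new
-- ones; set the status back to 'ONLINE' once maintenance is done.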

I'd appreciate any confirmation, tips or workaround for this issue.

renecannao added a commit that referenced this issue Apr 6, 2021
Closes #3182: Handling of pxc_maint_mode drops active connections