Handling of pxc_maint_mode drops active connections #3182

Closed
Flourid opened this issue Dec 3, 2020 · 0 comments · Fixed by #3394

Flourid commented Dec 3, 2020

Problem:

Setting pxc_maint_mode to MAINTENANCE or SHUTDOWN marks the node's entry in the reader hostgroup as OFFLINE_HARD, resulting in dropped connections and errors in the application.

Configuration:

A Percona XtraDB Cluster with three nodes and one ProxySQL host with the following version:

[root@XXXXXXXX proxysql]# proxysql --version
ProxySQL version 2.0.14-percona-1.1, codename Truls

[root@XXXXXXXX proxysql]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

All Percona nodes were added to mysql_servers in hostgroup 10:

mysql> select * from mysql_servers;
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | ONLINE | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
3 rows in set (0.00 sec)
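
For reference, a configuration like this is loaded through the ProxySQL admin interface roughly like the sketch below (shown for one of the three nodes; the other two are added the same way, and the comment column is omitted):

-- Run against the ProxySQL admin interface (port 6032 by default).
INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections, use_ssl)
VALUES (10, '192.168.50.236', 3306, 200, 1);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;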

The Galera hostgroup is configured as follows (note the writer_is_also_reader=1 and max_writers=1):

mysql> select * from mysql_galera_hostgroups;
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
| writer_hostgroup | backup_writer_hostgroup | reader_hostgroup | offline_hostgroup | active | max_writers | writer_is_also_reader | max_transactions_behind | comment                    |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
| 10               | 12                      | 11               | 13                | 1      | 1           | 1                     | 100                     | XXXXXXXXXXXXXXXXXX         |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+----------------------------+
1 row in set (0.00 sec)
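
The hostgroup definition itself would have been created with something like the following sketch (the columns match the table above):

-- Run against the ProxySQL admin interface.
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
   active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 12, 11, 13, 1, 1, 1, 100);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;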

The runtime servers are therefore configured as follows: one node is ONLINE as the writer, all three nodes are readers, and the other two writers are SHUNNED in the writer hostgroup and act as backup writers. This seems correct.

mysql> select * from runtime_mysql_servers order by hostgroup_id;
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | SHUNNED | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | SHUNNED | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.236 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.237 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.238 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.236 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.237 | 3306 | 0         | ONLINE  | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
8 rows in set (0.01 sec)

Current behaviour:

When I now set pxc_maint_mode to MAINTENANCE on the active writer node (192.168.50.238)

root@localhost [(none)]> set global pxc_maint_mode=MAINTENANCE;
Query OK, 0 rows affected (10.00 sec)

runtime_mysql_servers changes to the following state:

mysql> select * from runtime_mysql_servers order by hostgroup_id;
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| hostgroup_id | hostname       | port | gtid_port | status       | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment            |
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+
| 10           | 192.168.50.236 | 3306 | 0         | SHUNNED      | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.237 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 10           | 192.168.50.238 | 3306 | 0         | SHUNNED      | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.236 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.237 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 11           | 192.168.50.238 | 3306 | 0         | OFFLINE_HARD | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 12           | 192.168.50.236 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
| 13           | 192.168.50.238 | 3306 | 0         | ONLINE       | 1      | 0           | 200             | 0                   | 1       | 0              | XXXXXXXXXXXXXXXXXX |
+--------------+----------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+--------------------+

Since the node's reader-hostgroup entry is OFFLINE_HARD (at least for a few seconds), existing connections are dropped, which shows up as errors in the application.
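
The 10.00 sec execution time of the SET GLOBAL above should correspond to pxc_maint_transition_period (10 seconds by default), which can be verified on the Percona node:

-- Run on the Percona XtraDB Cluster node, not on ProxySQL.
SHOW GLOBAL VARIABLES LIKE 'pxc_maint%';
-- Expected output (values assumed from this setup):
--   pxc_maint_mode              = MAINTENANCE
--   pxc_maint_transition_period = 10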

(The proxysql.log is also attached.)

Expected behaviour:

As described here, the expected behaviour is as follows:

"If the user needs to take a node into maintenance mode, the user can simply set pxc_maint_mode to MAINTENANCE. With that, pxc_maint_mode is updated and the client connection updating it goes into sleep for x seconds (as dictated by pxc_maint_transition_period) before giving back control to the user. ProxySQL auto-detects this change and marks the node as OFFLINE. With this change ProxySQL avoids opening new connections for any DML transactions but continues to service existing queries."

With the native ProxySQL/Percona integration, however, existing queries are evidently not serviced. As I understand it, the node should instead be marked OFFLINE_SOFT so that established connections continue to be served until they become inactive.

Right now, the only workaround seems to be to implement a custom script via the scheduler and to configure the cluster not as Galera, but as a generic group replication hostgroup (see the sketch below).
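
For completeness, the manual equivalent of what such a scheduler script would do is to drain the node through the ProxySQL admin interface before maintenance starts. This is only a sketch: while mysql_galera_hostgroups is active, the Galera monitor may override a manually set status, which is why the workaround involves dropping the native Galera configuration.

-- Run against the ProxySQL admin interface before starting maintenance.
UPDATE mysql_servers SET status='OFFLINE_SOFT'
  WHERE hostname='192.168.50.238' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;
-- OFFLINE_SOFT keeps serving established connections but accepts no new
-- ones; set the status back to 'ONLINE' once maintenance is done.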

I'd appreciate any confirmation, tips or workaround for this issue.

renecannao added a commit that referenced this issue Apr 6, 2021
Closes #3182: Handling of pxc_maint_mode drops active connections