
SHUNNED_REPLICATION_LAG doesn't stop queries #774

Closed
estehnet opened this issue Nov 2, 2016 · 10 comments

@estehnet commented Nov 2, 2016

Step 1. The replication slave works fine.
Step 2. An error occurs in replication on the slave and replication stops.
Step 3. The MySQL server status in ProxySQL's runtime_mysql_servers becomes SHUNNED.
Step 4. SHOW PROCESSLIST on the replication slave keeps showing SQL queries until we restart the slave, which automatically kills those connections.

Why doesn't ProxySQL stop sending queries? Even after several hours it didn't stop.
FYI: our client application is based on Java and has its own connection pool.

We tried 1.3.0e and 1.2.4, and both versions seem to have the same issue. Here is the status from 1.2.4:

mysql> select * from global_variables;
+----------------------------------------+-----------------------------------------+
| variable_name | variable_value |
+----------------------------------------+-----------------------------------------+
| mysql-shun_on_failures | 5 |
| mysql-shun_recovery_time_sec | 10 |
| mysql-query_retries_on_failure | 1 |
| mysql-connect_retries_delay | 1 |
| mysql-connection_max_age_ms | 0 |
| mysql-connect_timeout_server_max | 10000 |
| mysql-eventslog_filename | |
| mysql-eventslog_filesize | 104857600 |
| mysql-default_charset | utf8 |
| mysql-free_connections_pct | 10 |
| mysql-client_found_rows | true |
| mysql-monitor_enabled | true |
| mysql-monitor_connect_timeout | 200 |
| mysql-monitor_ping_max_failures | 3 |
| mysql-monitor_ping_timeout | 100 |
| mysql-monitor_replication_lag_interval | 10000 |
| mysql-monitor_replication_lag_timeout | 1000 |
| mysql-monitor_username | root |
| mysql-monitor_password | Fuckoff2013 |
| mysql-monitor_query_interval | 60000 |
| mysql-monitor_query_timeout | 100 |
| mysql-monitor_slave_lag_when_null | 600 |
| mysql-monitor_writer_is_also_reader | true |
| mysql-max_allowed_packet | 4194304 |
| mysql-max_transaction_time | 14400000 |
| mysql-multiplexing | true |
| mysql-enforce_autocommit_on_reads | false |
| mysql-threshold_query_length | 524288 |
| mysql-threshold_resultset_size | 4194304 |
| mysql-wait_timeout | 28800000 |
| mysql-default_max_latency_ms | 1000 |
| mysql-query_processor_iterations | 0 |
| mysql-long_query_time | 1000 |
| mysql-query_cache_size_MB | 256 |
| mysql-ping_interval_server_msec | 10000 |
| mysql-poll_timeout_on_failure | 100 |
| mysql-server_capabilities | 45578 |
| mysql-query_digests | true |
| mysql-servers_stats | true |
| mysql-default_reconnect | true |
| mysql-ssl_p2s_ca | (null) |
| mysql-ssl_p2s_cert | (null) |
| mysql-ssl_p2s_key | (null) |
| mysql-ssl_p2s_cipher | (null) |
| mysql-init_connect | (null) |
| admin-stats_credentials | stats:stats |
| admin-telnet_admin_ifaces | (null) |
| admin-telnet_stats_ifaces | (null) |
| admin-refresh_interval | 2000 |
| admin-read_only | false |
| admin-hash_passwords | true |
| admin-version | 1.2.4 |
| admin-admin_credentials | admin:admin |
| admin-mysql_ifaces | 127.0.0.1:6032;/tmp/proxysql_admin.sock |
| mysql-threads | 4 |
| mysql-max_connections | 2048 |
| mysql-default_query_delay | 0 |
| mysql-default_query_timeout | 36000000 |
| mysql-have_compress | true |
| mysql-poll_timeout | 2000 |
| mysql-interfaces | 0.0.0.0:6033;/tmp/proxysql.sock |
| mysql-default_schema | information_schema |
| mysql-stacksize | 1048576 |
| mysql-server_version | 5.5.30 |
| mysql-connect_timeout_server | 3000 |
| mysql-monitor_history | 600000 |
| mysql-monitor_connect_interval | 60000 |
| mysql-monitor_ping_interval | 10000 |
| mysql-monitor_read_only_interval | 1500 |
| mysql-monitor_read_only_timeout | 500 |
| mysql-ping_timeout_server | 500 |
| mysql-commands_stats | true |
| mysql-sessions_sort | true |
| mysql-connect_retries_on_failure | 10 |
| mysql-session_debug | (null) |
+----------------------------------------+-----------------------------------------+
75 rows in set (0.00 sec)

mysql> select * from stats_mysql_connection_pool;
+-----------+---------------+----------+-------------------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+-------------------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 1 | 78.46.168.94 | 3306 | ONLINE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 221 |
| 2 | 78.46.168.108 | 3306 | SHUNNED_REPLICATION_LAG | 15 | 0 | 148 | 0 | 3710 | 1226206 | 477462674 | 282 |
| 2 | 78.46.168.94 | 3306 | ONLINE | 0 | 0 | 2 | 0 | 10 | 1192 | 524 | 221 |
+-----------+---------------+----------+-------------------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
3 rows in set (0.00 sec)

mysql> select * from runtime_mysql_servers;
+--------------+---------------+------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1 | 78.46.168.94 | 3306 | ONLINE | 1 | 0 | 1000 | 0 | 0 | 0 | |
| 2 | 78.46.168.108 | 3306 | SHUNNED | 99 | 0 | 1000 | 200 | 0 | 0 | |
| 2 | 78.46.168.94 | 3306 | ONLINE | 1 | 0 | 1000 | 0 | 0 | 0 | |
+--------------+---------------+------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

Please note that we have mysql-monitor_slave_lag_when_null set to 600, and our server 78.46.168.108 is configured with a maximum replication lag of 200 seconds.
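
For context, a per-server lag limit like this is normally applied through the ProxySQL admin interface. A minimal sketch, assuming the standard mysql_servers admin table (the exact statements used in this setup aren't shown in the ticket):

UPDATE mysql_servers SET max_replication_lag = 200
WHERE hostname = '78.46.168.108' AND hostgroup_id = 2;   -- 200-second limit, hostgroup 2 as above
LOAD MYSQL SERVERS TO RUNTIME;   -- apply to the running configuration
SAVE MYSQL SERVERS TO DISK;      -- persist across proxysql restarts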

@renecannao (Contributor)

Hi @estehnet ,

When a server becomes shunned, its connections aren't automatically destroyed. This is by design: connections are destroyed when they are returned to the connection pool.
It is therefore possible that a server becomes shunned while queries are still running, and those queries continue to run until the connection is sent back to the connection pool. This allows graceful handling of this scenario (no error is returned to the application).

What is probably happening in your case is that some condition is preventing the connection from going back to the connection pool, where it would be destroyed. There can be several such conditions, from a running transaction (unlikely in your case, as this is a slave) to the use of session variables: these conditions implicitly disable multiplexing, so the connection is never sent back to the connection pool where it would be destroyed.
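
For illustration (these example statements are not from the original comment), conditions like the following implicitly disable multiplexing and keep the backend connection pinned to the client session:

SET SESSION sql_mode = 'STRICT_ALL_TABLES';   -- changing a session variable pins the connection
START TRANSACTION;                            -- an open transaction pins the connection
SELECT @last_id := id FROM orders LIMIT 1;    -- a user-defined variable pins it ("orders" is a hypothetical table)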

I will consider this a bug: the status of the server should be checked while a connection is in use, and not only when it is returned to the connection pool.
Thank you for the report.

One last note: I don't know if your MySQL server is accessible from the Internet, but I think you have exposed access credentials in this ticket. If that's the case, please change your password.

@estehnet (Author)

estehnet commented Nov 3, 2016

Hi Rene,
you are great! Thanks for the prompt response.
It is strange that this problem hasn't affected more people running connection pools in production.

Thanks for the note; the servers are secured.

renecannao added a commit that referenced this issue Nov 10, 2016
renecannao added a commit that referenced this issue Nov 13, 2016
@estehnet (Author)

Hi,
we've tested the fix in production. The moment the replication-lag red line was crossed (300 s), this is what started to appear in the proxysql log file:

HID: 2 , address: 78.46.168.94 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
2016-11-16 15:04:42 MySQL_HostGroups_Manager.cpp:934:replication_lag_action(): [WARNING] Shunning server 78.46.168.108:3306 with replication lag of 303 second
2016-11-16 15:04:44 MySQL_Session.cpp:2220:handler(): [WARNING] Error during query on (2,78.46.168.108,3306): 0,
2016-11-16 15:04:45 MySQL_Session.cpp:2220:handler(): [WARNING] Error during query on (2,78.46.168.108,3306): 0,
...
and after many log lines and about 7 minutes:
...
2016-11-16 15:11:21 MySQL_Session.cpp:2220:handler(): [WARNING] Error during query on (2,78.46.168.108,3306): 0,
2016-11-16 15:11:22 MySQL_Session.cpp:2220:handler(): [WARNING] Error during query on (2,78.46.168.108,3306): 0,
2016-11-16 15:11:23 MySQL_Monitor.cpp:1127:monitor_ping(): [ERROR] Server 78.46.168.108:3306 missed 3 heartbeats, shunning it and killing all the connections

At Wed Nov 16 15:06:54 GMT 2016 our app started reporting the following Java exceptions:
com.sun.ejb.containers.EJBLocalObjectInvocationHandler.invoke(EJBLocalObjectInvocationHandler.java:212)
... 88 more
Caused by: java.sql.SQLException: ResultSet is from UPDATE. No Data.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:937)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at ...

We waited for about 10 minutes and decided to restart the app. After the restart all connections went to the 78.46.168.94 server.

After about 15 minutes the 78.46.168.108 server became ONLINE again and new connections were created on it, but all connections created on the 78.46.168.94 server during the SHUNNED period remained active, and all queries kept being forwarded to it despite the fact that 78.46.168.108 was ONLINE again.

After this incident we had two more occurrences during the day, but no issues.

  1. Something may be wrong with the way you are handling replication lag - are you closing connections at that point? Please note that at some point proxysql closed all the connections to the lagging server. The shunned server was available the whole time; it was only lagging, yet for some reason there is a message that 3 heartbeats were missed.
  2. How exactly do you handle the recovered server - do you start routing requests to it immediately after it is ONLINE, or only for new connections?

@estehnet (Author)

Yesterday morning we had another issue after the replication lag threshold was crossed: result sets started to get mixed. Our app is actually an API gateway, and database queries started to return result sets belonging to other queries. Immediately after we switched back to direct database access (without proxysql), the issue was resolved. Do you think this could be related?

@renecannao (Contributor)

1a. Connections are closed only when proxysql tries to use them again while the server is marked as lagging.
1b. It is possible that the server, busy catching up on replication, missed 3 heartbeats because the default timeout is very small; see #803. Please increase mysql-monitor_ping_timeout (see the sketch below).
2. It is immediately used to serve traffic, with both old and new connections, although an algorithm gradually ramps up traffic to a server that has just come back online, to avoid bursts: if the server was lagging, sending it a lot of requests at once would probably cause it to start lagging again.
Maybe this algorithm can be tuned.
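
A minimal sketch of raising that timeout via the admin interface (the 500 ms value is an illustrative choice, not a recommendation from this thread):

UPDATE global_variables SET variable_value = '500'
WHERE variable_name = 'mysql-monitor_ping_timeout';   -- value in milliseconds
LOAD MYSQL VARIABLES TO RUNTIME;   -- apply immediately
SAVE MYSQL VARIABLES TO DISK;      -- persist the change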

About "results sets started to get mixed" : what this means exactly?

Thank you.

@estehnet (Author)

Hi Rene,
regarding the mixed results: the SELECTs and their results didn't match - it looked like a complete mismatch.
It never happened before (2 years in production, the last 30 days on different proxysql versions) - this was version 1.3.0g. These requests are all SELECTs. It looked as if a SELECT from A and a SELECT from B happening at the same time returned B's and A's results respectively.
We immediately switched back to a direct database connection and it didn't happen again.
This was a production server (10,000 SELECT queries per minute). Is it possible that this is related to proxysql multiplexing?

@renecannao (Contributor)

That sounds terribly bad!
Did the result sets get mixed only after replication started lagging and the slave got shunned?

@renecannao (Contributor)

@estehnet : reviewing the code, it seems that the patch introduced to fix this issue created an unexpected side effect.
If the slave becomes shunned while the proxy is waiting for the result set, the client abandons the backend connection instead of destroying it. Nonetheless, the slave still executes the query and returns the result set to the proxy, which stores it in an internal buffer. When the slave is no longer lagging, another client can pick up that backend connection and find the result set of the previous query.
I will immediately start working on fixing this bug.

@renecannao (Contributor)

New version v1.3.0h solves this issue.

@Krishnakumarklk

Krishnakumarklk commented Feb 21, 2023

Hi,

I am facing an issue connecting to MySQL RDS through ProxySQL: the RDS server's state becomes SHUNNED. After restarting ProxySQL it comes back ONLINE, but after one minute it goes to the SHUNNED state again.

Can anyone kindly help me overcome this issue?
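
A hedged starting point, not from this thread: ProxySQL's monitor tables usually record why a server keeps being shunned. These queries assume the standard monitor schema on the admin interface:

SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC LIMIT 10;              -- recent ping attempts and errors
SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 10;           -- recent connect attempts and errors
SELECT * FROM monitor.mysql_server_replication_lag_log ORDER BY time_start_us DESC LIMIT 10;   -- recent measured replication lag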
