
Help in configuring ProxySQL for MySQL HA #959

Closed
govindarajan opened this issue Mar 17, 2017 · 11 comments

@govindarajan

Hi René,

Thanks for the brilliant work.

I am trying to set up highly available MySQL using ProxySQL. For that I have master M and replicas R1 and R2. I am using query rules to route read traffic to the replicas (R1, R2) as well.

My configuration is as follows.

select hostgroup_id, hostname, max_replication_lag from mysql_servers;
+--------------+----------+---------------------+
| hostgroup_id | hostname | max_replication_lag |
+--------------+----------+---------------------+
| 0            | M        | 0                   |
| 2            | M        | 1                   |
| 2            | R1       | 1                   |
| 2            | R2       | 1                   |
+--------------+----------+---------------------+

select rule_id, active, match_digest, destination_hostgroup from mysql_query_rules;
+---------+--------+-----------------------+-----------------------+
| rule_id | active | match_digest          | destination_hostgroup |
+---------+--------+-----------------------+-----------------------+
| 2       | 1      | ^SELECT .* FOR UPDATE | 0                     |
| 3       | 1      | ^SELECT .*            | 2                     |
+---------+--------+-----------------------+-----------------------+
2 rows in set (0.00 sec)

select * from mysql_replication_hostgroups;
+------------------+------------------+---------+
| writer_hostgroup | reader_hostgroup | comment |
+------------------+------------------+---------+
| 0                | 2                | NULL    |
+------------------+------------------+---------+
1 row in set (0.00 sec)
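
The tables above could have been produced with admin-interface statements along these lines (a sketch using the thread's M/R1/R2 shorthand for real hostnames; ports and other columns omitted):

```sql
-- Populate servers, routing rules, and the writer/reader hostgroup pair,
-- then push the configuration to runtime and persist it.
INSERT INTO mysql_servers (hostgroup_id, hostname, max_replication_lag)
VALUES (0, 'M', 0), (2, 'M', 1), (2, 'R1', 1), (2, 'R2', 1);

INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT .* FOR UPDATE', 0, 1),
       (3, 1, '^SELECT .*', 2, 1);

INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup)
VALUES (0, 2);

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
```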

Since I want to use the replica nodes for reading as well, I set the max_replication_lag option to 1 second so that the application won't read stale data. I have also set up MHA for automatic failover. When I bring down M, R1 gets promoted to master and the read_only field is set correctly, but R1 is in the SHUNNED_REPLICATION_LAG state.

select hostgroup,srv_host,status from stats_mysql_connection_pool;
+-----------+----------+-------------------------+
| hostgroup | srv_host | status                  |
+-----------+----------+-------------------------+
| 0         | R1       | SHUNNED_REPLICATION_LAG |
| 2         | M        | SHUNNED                 |
| 2         | R1       | SHUNNED_REPLICATION_LAG |
| 2         | R2       | ONLINE                  |
+-----------+----------+-------------------------+

That's because we set max_replication_lag initially. Since this is the only node in the writer hostgroup, can we make it ignore replication lag? Or is there a workaround? Or am I doing something wrong? Can you please point me in the right direction?
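
One possible workaround (an untested sketch, not something confirmed in this thread): ProxySQL skips the replication-lag check for servers whose max_replication_lag is 0, so after a failover the writer-hostgroup entry could be forced to ignore lag from the admin interface:

```sql
-- Disable the replication-lag check for whatever server is currently
-- in the writer hostgroup, then apply at runtime and persist.
UPDATE mysql_servers SET max_replication_lag = 0 WHERE hostgroup_id = 0;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```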

@renecannao
Contributor

Hi.
Can you please share proxysql error log around the time of the failover?
Thanks

@govindarajan
Author

Please find the log in the attachment.

proxysql.txt

@renecannao
Contributor

@govindarajan : from the log, I assume 10.0.41.33 is the new master.

I think the issue you are seeing is caused by some misconfiguration in your new master.

Shunning server 10.0.41.33:3306 with replication lag of 60 second

ProxySQL is disabling the server because it has "replication lag" (details soon).
It reports that it is lagging by 60 seconds, and it is always 60 seconds, as for 10.0.41.78.
60 seconds is the default value of mysql-monitor_slave_lag_when_null.
So what I believe is happening is:

  • the new master still has replication configured, but broken (Seconds_Behind_Master=NULL)
  • the replication lag check returns Seconds_Behind_Master=NULL, therefore it assumes that the replication lag is 60 seconds
  • the master is shunned

Note that this behavior is expected and generally correct.
Even if 10.0.41.33 is now a master, because Seconds_Behind_Master=NULL the server is also a slave with broken replication.
As far as I know, MHA executes RESET SLAVE [ALL] on the new master, therefore this shouldn't be a problem at all.
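
The monitor variable involved here can be inspected and, if desired, tuned from the admin interface (a sketch; 60 is the default mentioned above, and the lower value is purely illustrative, not a recommendation from this thread):

```sql
SELECT variable_name, variable_value FROM global_variables
WHERE variable_name = 'mysql-monitor_slave_lag_when_null';

-- Treat Seconds_Behind_Master=NULL as a smaller assumed lag (illustrative):
UPDATE global_variables SET variable_value = '10'
WHERE variable_name = 'mysql-monitor_slave_lag_when_null';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```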

@govindarajan
Author

govindarajan commented Mar 18, 2017

I tested multiple times with mysql-monitor_replication_lag_interval = 1000.
When the switch happened, the new master node went into the SHUNNED_REPLICATION_LAG state. I checked the slave status on the new master, and it was empty.

mysql> show slave status;
Empty set (0.00 sec)

I even ran RESET SLAVE ALL manually on the new master. Even then the server status remained SHUNNED_REPLICATION_LAG. When I do LOAD MYSQL SERVERS TO RUNTIME, the server status becomes ONLINE immediately. Similarly, if I change the read_only field, the server status also becomes ONLINE.

Edit:
The same issue happened with mysql-monitor_replication_lag_interval = 2000.
When I tested with mysql-monitor_replication_lag_interval = 3000, everything worked fine: the server state became ONLINE soon after the switch. My read_only variable settings are below. Is some race condition happening?

mysql-monitor_read_only_interval =1500
mysql-monitor_read_only_timeout =500
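
For reference, monitor settings like those above are applied from the ProxySQL admin interface along these lines (a sketch of the settings quoted in this comment):

```sql
UPDATE global_variables SET variable_value = '1500'
WHERE variable_name = 'mysql-monitor_read_only_interval';
UPDATE global_variables SET variable_value = '500'
WHERE variable_name = 'mysql-monitor_read_only_timeout';
-- Varied between 1000, 2000, and 3000 in the tests described above:
UPDATE global_variables SET variable_value = '1000'
WHERE variable_name = 'mysql-monitor_replication_lag_interval';
LOAD MYSQL VARIABLES TO RUNTIME;
```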

@renecannao
Contributor

@govindarajan : can you please share the new error log and point at what time exactly the failover happens?

Also, when you say :

the server status remained SHUNNED_REPLICATION_LAG

Do you refer to the status in mysql_servers, runtime_mysql_servers, or stats_mysql_connection_pool?

Finally, and most importantly: are you only checking the status of one of the tables above, or also sending traffic? If you are sending traffic, does the traffic hang because it cannot find the master?

Thanks

@govindarajan
Author

govindarajan commented Mar 18, 2017

I am referring to the stats_mysql_connection_pool table. I also checked the status in the mysql_servers table, and there the server status was ONLINE.

When I send read (SELECT) traffic, it works fine, because one node in the reader hostgroup is online. But when I send write traffic, it fails with ERROR 9001 (HY000): Max connect timeout reached while reaching hostgroup 0 after 10000ms.

I have attached the logs for both the working and non-working cases.
Expected behaviour settings:

mysql-monitor_read_only_interval =1500
mysql-monitor_replication_lag_interval = 3000.

Unexpected behaviour settings:

mysql-monitor_read_only_interval =1500
mysql-monitor_replication_lag_interval = 1000.

expected_behaviour_proxysql.txt
unexpected_behaviour_proxysql.txt

@renecannao
Contributor

I haven't reproduced it yet, but the fact that you reported different behavior with different intervals made me suspect a race condition.
Looking at the code, I believe it is possible that if the check fails (Seconds_Behind_Master=NULL) before MHA resets the slave, it never recovers.
Digging more.

renecannao added a commit that referenced this issue Mar 18, 2017
If a slave was shunned because replication was broken, when it becomes a master it was staying shunned
@renecannao
Contributor

@govindarajan , can you please compile branch v1.3.5-dev ?
Commit f70037c should fix this issue.
If you need it, I can compile it for you: what Linux version are you running?

@govindarajan
Author

govindarajan commented Mar 19, 2017

I am using CentOS 7. If you can compile it for me, that would be great. I will check and let you know.

@govindarajan
Author

Tested, and the issue is resolved. Thanks for fixing this.

@renecannao
Contributor

Great!
Thank you for the report

renecannao added a commit that referenced this issue Mar 23, 2017
If a slave was shunned because replication was broken, when it becomes a master it was staying shunned