Skip to content

Architecture

liusifan edited this page Apr 11, 2017 · 12 revisions

Abstract

PhxSQL is a high-availability and strong-consistency MySQL cluster. PhxSQL is distinguished for its unique support of linearizable consistency, high availability, both of which are as strong as that of Zookeeper, and serializable isolation level of transaction.

PhxSQL consists of a master node and some slave nodes. With a quorum of nodes working, PhxSQL can provide an uninterrupted service.

PhxSQL is built on Percona 5.6, a branch of MySQL, therefore PhxSQL provides almost complete compliance with MySQL. MySQL can be easily replaced by PhxSQL without any change of client codes.

We use MySQL and Percona interchangeably for they are similar.

Most of MySQL cluster solutions risk data loss and therefore data inconsistency for loss-less replication between a master and slaves. Following are two typical cases:

  • The binlog which is being replicated to the slaves before the master crashes, is called pending binlog. The status of the pending binlog is uncertain in a semi-sync system, and there is not a safe way to query the status of the pending binlog. When the master performs crash recovery, it don't known how to process the pending binlog. For example, if the pending binlog hadn't been replicated to any slaves, and a new master had been elected during the old master crashes, the old master should discard the pending binlog. And the situation could be reversed, the old master should commit the binlog.
  • When the old master finish the crash recovery, MySQL client could continue to read from the old master, and receive the not up-to-date data.

There are some bug reports on these problems:

We have published some articles about the defect of mysql replication (written in Chinese)

PhxSQL aims to solve this defect of MySQL replication and presents a MySQL cluster guaranteeing consistency and availability as strong as that of Zookeeper.

The Architecture of PhxSQL

Figure 1

A PhxSQL node is composed of two modules(phxbinlogsvr, phxsqlproxy) and one MySQL plugin(phxsync_master), as figure 1 illustrated.

  • phxbinlogsvr is responsible for MySQL binlog replication, MySQL health monitoring, and master election
  • phxsqlproxy is responsible for forwarding requests to MySQL master
  • phxsync_master plugin in MySQL master is responsible for intercepting and forwarding binlog to phxbinlogsvr

PhxSQL Replication Process

Figure 2

In PhxSQL, MySQL slaves will not pull binlog from MySQL master directly any more. The phxsync_master plugin intercepts and forwards binlog to a local phxbinlogsvr, the phxbinlogsvr sends it to phxbinlogsvrs on other PhxSQL nodes via Paxos, and MySQL slaves pull binlog from local phxbinlogsvrs.

HINT: PhxSQL team developed a in-house production-level Paxos library called PhxPaxos, which can be found on https://github.com/tencent-wechat/phxpaxos.

Figure 3

The cluster which is composed of multiple phxbinlogsvr can be regarded as a reliable log storage because it use Paxos protocol to replicate binlog among the multipile phxbinlogsvr. PhxSQL uses a reliable log storage to save MySQL's binlog, and the binlog will be global consistency. Every pending binlog can be queried from phxbinlogsvr to make sure whether it has been saved or not.

The process of the master and slave:

  • Master: When the MySQL master performs crash recovery, the pending binlog is queried from phxbinlogsvr to decide whether to discard (truncate the binlog file) or to commit (apply to innodb). Like figure 4 shows.
  • Slave: The binlog which slaves have received is accepted by Paxos protocol, it is global consistency. All slaves have same binlog, there are not pending binlog on slaves.

The pending binlog problem can been solved by the reliable log storage which is composed of multiple phxbinlogsvr.

PhxSQL Master Management

Figure 5

As a normal MySQL failover architecture:

  • Host A crashes before the transaction 3 had been replicated to Host B
  • Host B had been elected as a new master
  • Host A was recovery
  • The transaction 4 had been executed on Host B, and had been replicated to Host A
  • The transaction 3 is missing

This defect dues to the incompleteness of the failover operations. Even by adding an extra service like ZooKeeper, it still doesn't work. It is not an atomic operation between master query and the operations which depends on the query results.

For example, Host A execute the following two operations:

  1. Host A got the master lock
  2. Execute a transaction and replicate binlog to slave

The Host A had been hung up for a long time before step 2, and during these time the Host B got the master lock, the problem is still existing.

This problem can been solved by using the Paxos for both binlog replication and master election. With Paxos, there is a total ordering for the binlog event and the master election event. For a complete discussion of this problem is very complex, we do not make more discussion in this article.

PhxSQL uses Paxos for both binlog replication and master election. The phxbinlogsvr as the agent of PhxSQL node will initiate master election, renew or check master lease periodically, check the progress of the binlog had been applied and switch MySQL's role as need.

Here are the principle of the PhxSQL master:

  1. All PhxSQL node are slave as the begining
  2. Phxbinlogsvr initiates a master election by using PhxPaxos library if the master lease is expired
  3. The master phxbinlogsvr renews lease periodically by using PhxPaxos library
  4. There is one master at the most or none
  5. Any binlog which is sent from expired master will be rejected

PhxSQL Master Election

Figure 6

PhxSQL has a well-designed master switching process to guarantee data consistency during the master failover period. The details of the master switching process:

  1. Phxbinlogsvr on slave nodes check master lease periodically. Once master lease is expired, goto step 2;
  2. Phxbinlogsvrs goto step3 if the lastest binlog has been applied to local MySQL, else goto step1;
  3. Phxbinlogsvrs initiate a master election. One of the phxbinlogsvrs will been elected as master, and the new master will switch the local MySQL as master role (switch the "read-only" variable to "off"). The new master phxbinlogsvr gotos step4, other phxbinlogsvrs goto step1
  4. The master phxbinlogsvr keeps a master lease, and renews lease periodically;
  5. The old master gets the definitely status of the pending binlog from the phxbinlogsvr when it performs crash recovery, so it can decide to discard or to commit the pending binlog. If the new master has been elected, phxbinlogsvr will switch the local MySQL as slave role (switch the "read-only" variable to "on"), and configure the local MySQL to receive binlog from the local phxbinlogsvr.

Request Trasfering

PhxSQL solves the server side problem of failover by using phxbinlogsvr, because it can get the definitely status of the pending binlog from phxbinlogsvr and is switched role automatically by phxbinlogsvr. But there are still some problems at the client side:

  1. MySQL clients still send write requests to the expired master, and will been rejected by MySQL
  2. MySQL clients still send read requests to the expired master, and will read the not up-to-date data

Figure 7

These two problems can not be avoid because MySQL clients are not able to gain the updated information about the master. A proxy has been added to address these problems in PhxSQL, that is called phxsqlproxy.

Figure 8

As figure 8 shows, any query to a slave Node will be transfered to the current Master Node by the phxsqlproxy. Consequently Phxsqlproxy ensure that the clients can be always executed queries on the current Master Node since there aren't more than 1 Master globally.

PhxSQL Performance

We used sysbench as a test tool to compare a performance difference between PhxSQL and MySQL semi-sync. Following is the result:

  • Read: PhxSQL is slightly lower than MySQL semi-sync since there phxsqlproxy spends some CPU.
  • Write: PhxSQL is better (15%~) than MySQL semi-sync since the implementing is more effectively.
Read Write
Client Threads QPS Avg Response Time QPS Avg Response Time
200 Decrease ~3% Increase ~2% Increase ~25% Decrease ~20%
500 Decrease ~13% Increase ~10% Increase ~16% Decrease ~10%

Testing Enviroment and Detail

  • Host Information

    • CPU : Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz * 24
    • Memory : 32G
    • Disk : SSD Raid10
    • Ping
      • Master -> Slave : 3 ~ 4ms
      • Client -> Master : 4ms
  • Tools And Arguments

    • sysbench
    • --oltp-tables-count=10
    • --oltp-table-size=1000000
    • --num-threads=500
    • --max-requests=100000
    • --report-interval=1
    • --max-time=200
  • Test Content

    • Apply different test sets between PhxSQL and MySQL semi-sync with different client threads.
      • insert.lua (100% write queries)
      • select.lua (0% write queries)
      • OLTP.lua (20% write queris)
  • Result

Client Threads:200

Test Clusters Test Sets
insert.lua select.lua OLTP.lua
QPS Response Time QPS Response Time QPS Response Time
PhxSQL 5076 39.34/56.93 46334 4.21/5.12 25657 140.16/186.39
MySQL semi-sync 4055 49.27/66.64 47528 4.10/5.00 20391 176.39/226.76

Client Threads:500

Test Clusters Test Sets
insert.lua select.lua OLTP.lua
QPS Response Time QPS Response Time QPS Response Time
PhxSQL 8260 60.41/83.14 105928 4.58/5.81 46543 192.93/242.85
MySQL semi-sync 7072 70.60/91.72 121535 4.17/5.08 33229 270.38/345.84
** NOTE:The 2 values of Costs represent average costs and 95% percentile costs. The unit is MS.**

Summary

PhxSQL has been solved the defects about MySQL replication and Master switching so that data remains consistent while master failover. Plus PhxSQL has a better writing performance than MySQL semi-sync replication although the reading queries perform slightly lower.