Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raftstore: improve lease read #1327

Merged
merged 26 commits into from Nov 25, 2016
Merged

raftstore: improve lease read #1327

merged 26 commits into from Nov 25, 2016

Conversation

@hhkbp2
Copy link
Contributor

hhkbp2 commented Nov 21, 2016

This PR tries to fix issue #964

It uses monotonic clocktime to calculate the accurate leader lease.
If the leader holds the lease, it serve local read requests as lease reads. If the lease has expired, it serves local read requests and quorum read requests as consistent reads. All write requests are consistent.
The consistent read/write would be written to raft log and replicated to the raft quorum. Everytime a leader perform a consistent read/write, it will try to renew its leader lease.

Related to #1301, #1296.

The sysbench results show that the performance doesn't change significantly.

}
}

#[cfg(not(any(target_os = "macos")))]

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

why macos? Now we can even
I think it is better to use linux and not linux.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

Not sure how far we should go to support this on macos/windows. Each of them requires a different implementations. The perfect layout would be:

#[cfg os = linux]
// impl
#[cfg os = macos]
// impl
#[cfg os = windows]
// impl

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

I think now we can only support linux first, for other platforms, we will use time lib function instead.

// See the License for the specific language governing permissions and
// limitations under the License.

pub use self::inner::now_monotonic_raw_clocktime;

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

I prefer moving this to util mod.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

It seems dozens of lines of code are needed to implement this in MacOS or Windows. It's better to put them together in a single module.

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

What I mean is move this file to util as a common lib, not here.

@hhkbp2

This comment has been minimized.

Copy link
Contributor Author

hhkbp2 commented Nov 21, 2016

};
let res = unsafe { libc::clock_gettime(libc::CLOCK_MONOTONIC_RAW, &mut t) };
if res == -1 {
panic!(io::Error::last_os_error());

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

panic with more detailed information like get raw monotonic clock failed {:?}

@@ -507,6 +542,30 @@ impl Peer {

let t = SlowTimer::new();

// Update leader lease variables when the raft state changes
if ready.ss.is_some() {

This comment has been minimized.

Copy link
@ngaut

ngaut Nov 21, 2016

Member

Extract to a method.

@@ -187,6 +191,10 @@ pub struct Peer {
pub tag: String,

pub last_compacted_idx: u64,

become_candidate_time: Option<Timespec>,

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

I don't think this is necessary. When a peer becomes leader, all lease read will fallback to quorum read.

pub fn now_monotonic_raw_clocktime() -> Timespec {
// TODO Add monotonic raw clock time impl for macos
// Currently use the `time::get_time()` instead.
time::get_time()

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

get_time is real time, not monotonic time.
you can use get_precise_ns and then covert to Timespec instead.

https://github.com/rust-lang-deprecated/time/blob/master/src/sys.rs#L210

@hhkbp2 hhkbp2 changed the title [DNM] raftstore: improve lease read raftstore: improve lease read Nov 21, 2016
@hhkbp2

This comment has been minimized.

Copy link
Contributor Author

hhkbp2 commented Nov 21, 2016

// during its term.
if self.is_leader() && term == self.raft_group.raft.term {
let now = clocktime::now_monotonic_raw_clocktime();
let next_expired_time =

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

This will keep updating self.leader_lease_expired_time to now plus an election timeout. I think L530 has update the time properly, no need to do it again here.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

The leader lease is calculated to be more precise by
lease = election_timeout - duration_used_to_commit_quorum
https://github.com/pingcap/tikv/blob/hhkbp2/improve-lease-read/src/raftstore/store/peer.rs#L509

For every write to raft, duration_used_to_commit_quorum is the time gap from start_to_replate log_ts to quorum_commit_ts. So self.leader_lease_expired_time should be updated when the leader commits of the first empty entry on a new term.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

L530 updates the leader_lease_expired_time to become_leader_time + election_timeout; Here updates it again to now + election_timeout - (now - become_leader_time), which is exactly become_leader_time + election_timeout. So what's point to update here? And it's not guaranteed that all empty entries are the first entry at their term.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

Why not just update leader_lease_expire_time to become_leader_time + election_timeout directly at L530? That way we don't need an extract become_leader_time field and don't need to guarantee all empty entries should be the first entry at their term.


const NANOSECONDS_PER_SECOND: u64 = 1_000_000_000;

pub fn now_monotonic_raw_clocktime() -> Timespec {

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 21, 2016

Contributor

can we use a shorter name like raw_now() or other? The mod name is clocktime, so the function looks redundant with postfix clocktime.

hhkbp2 added 2 commits Nov 21, 2016
let cb = cb.unwrap();
let (cb, lease_renew_time) = cmd_cb.unwrap();
// Try to renew the leader lease as this command asks to.
if let Some(renew_time) = lease_renew_time {

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

lease_renew_time can't be None.

-> Option<Timespec> {
// The valid leader lease should be
// "lease = election_timeout - (quorum_commit_ts - send_to_quorum_ts)"
// And the expired timestamp for that leader lease is "quorum_commit_ts + lease".

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

I guess the expired timestamp should be send_to_quorum_ts + election_timeout, so quorum_commit_ts is not necessary?

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

Good point. quorum_commit_ts is still useful for L504.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

Actually you always need to compare self.leader_lease_expire_time to now, so no need to calculate the diff and check it here.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

It's better to use a valid value for leader_lease_expire_time and None for invalid value. Stricter style, easier to understand.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

You can't guarantee the value set to leader_lease_expire_time is always valid. And that's exactly the reason why you have to check if the time is still valid every time accessing. So it's ok to set an invalid expire_time as long as it won't step back.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 21, 2016

Author Contributor

If it takes a longer time than election_timeout to commit a log, it will rollback the expired_time. This should not happen since it's obvious not the lease time needed for this case.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 21, 2016

Contributor

L1129 should handle this properly.

hhkbp2 added 3 commits Nov 21, 2016
@hhkbp2

This comment has been minimized.

Copy link
Contributor Author

hhkbp2 commented Nov 22, 2016

@@ -484,6 +494,34 @@ impl Peer {
StaleState::Valid
}

fn calculate_lease_expired_time(&self, send_to_quorum_ts: Timespec) -> Timespec {

This comment has been minimized.

Copy link
@ngaut

ngaut Nov 22, 2016

Member

s/calculate/next/

let ss = ready.ss.as_ref().unwrap();
match ss.raft_state {
StateRole::Leader => {
self.leader_lease_expired_time =

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 22, 2016

Contributor

If the process block for a long time before ready, the raft_state is still leader but the leader is stale now. So the next leader_lease_expired_time is not safe for following lease read.

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 22, 2016

Author Contributor

PTAL @ngaut

This comment has been minimized.

Copy link
@ngaut

ngaut Nov 23, 2016

Member

Do you mean those processing inside of raft library ? @siddontang

This comment has been minimized.

Copy link
@hhkbp2

hhkbp2 Nov 23, 2016

Author Contributor

The leader lease which is calculated locally by leader would be affected by the scheduling of CPU/OS of the leader node. It hurts the accurateness of lease.

This comment has been minimized.

Copy link
@siddontang

siddontang Nov 23, 2016

Contributor

Do you mean those processing inside of raft library

No, still in peer, not in raft.

I proposed a simple solution before.
We can use t2 - t1 < 1s to do more safe check.
t1 is the start time when proposing, and t2 is the end time when applying, if the total time is less than the boundary, we can think the leader is still the leader.

Here and following leader_lease_expired_time update is not safe. Let's assume the following case:

  1. The peer sends votes at t1
  2. All peers votes for it
  3. The peer becomes leader at t2, but call on_ready at t3

If t3 - t2 > election timeout, the leader may be stale. So we can't think [t3, t3 + 5s) is safe for lease read.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 23, 2016

Contributor

I don't get it. Even [t3, t3 + 5s) is not safe for lease read, how can it cause any problem? Local read doesn't perform until the applied term is equal to the leader term. When applied term is equal to the leader term, [t3, t3 +5s) becomes a valid lease, it is certainly safe to perform lease read.

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 23, 2016

Contributor

And to be noted, the lease time set here is not the lease produce by election but the time when first empty entry is proposed. No matter how the leader is delay, the first empty entry will not be sent until lease time is updated. So as long as the first empty entry is applied at some point, the lease set here is correct and safe.

@@ -589,6 +629,19 @@ impl Peer {
if local_read {
PEER_PROPOSAL_COUNTER_VEC.with_label_values(&["local_read"]).inc();

// If the leader lease has been expired, local read should not be performed.
let now = clocktime::raw_now();

This comment has been minimized.

Copy link
@BusyJay

BusyJay Nov 22, 2016

Contributor

I think we can put these checks into is_local_read.

debug!("{} leader lease expired time {:?} is outdated",
self.tag,
self.leader_lease_expired_time);
// Reset leader lease expired time.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

expired -> expiring

@@ -1012,7 +1084,7 @@ impl Peer {
}
while let Some(mut head) = self.pending_cmds.pop_normal(term) {
if head.uuid == uuid {
return Some(head.cb.take().unwrap());
return Some((head.cb.take().unwrap(), head.renew_lease_time.unwrap()));
}
// because of the lack of original RaftCmdRequest, we skip calling

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

because -> Because

if let Some(current_expired_time) = self.leader_lease_expired_time {
// This peer is leader and has recorded leader lease.
// Calculate the renewed lease for this command. If the renewed lease lives longer
// than current leader lease, update the leader lease to the renewed lease.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

-> than the current leader lease, update the current leader lease to the renewed lease.

// Calculate the renewed lease for this command. If the renewed lease lives longer
// than current leader lease, update the leader lease to the renewed lease.
let next_expired_time = self.next_lease_expired_time(lease_renew_time);
// Use the lease expired timestamp comparation here, so that these codes still

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

Use the lease expired timestamp comparation here -> Use the lease expiring timestamp comparison here

self.leader_lease_expired_time = Some(next_expired_time)
}
} else if self.is_leader() {
// This peer is leader but its leader lease has been expired.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

has been expired -> has expired

self.leader_lease_expired_time);
self.leader_lease_expired_time = Some(next_expired_time);
}
// Involve post appy hook.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

appy -> apply?

self.leader_lease_expired_time);
// Reset leader lease expired time.
self.leader_lease_expired_time = None;
// Perform a consistent read to raft quorum and try to renew the leader lease.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

In all the comments, please use "raft" or "Raft" consistently. In my opinion, "raft" needs to be changed to "Raft".


// A helper function for testing the lease reads and lease renewing.
// The leader keeps a record of its leader lease, and uses the system's
// monotonic raw clocktime to check whether its lease has been expired.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

has been expired ->has expired

// and serves quorum read requests and consistent reads. If the leader lease has expired,
// the leader handles both local read and quorum read requests as consistent reads.
// All writes are consistent.
// The consistent read/write would be written to raft log and replicated to the raft quorum.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

would -> will
raft log -> the raft log

fn test_renew_lease<T: Simulator>(cluster: &mut Cluster<T>) {
// Avoid triggering the log compaction in this test case.
cluster.cfg.raft_store.raft_log_gc_threshold = 100;
// Increase raft tick interval to make this test case running reliably.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

raft tick interval -> the raft tick interval

// Issue a read request and check the value on response.
must_read_on_peer(cluster, peer.clone(), region.clone(), key, b"v1");

// Check the leader did a local read.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

Can we change this to "Check if the leader does a local read."?

// Issue a read request and check the value on response.
must_read_on_peer(cluster, peer.clone(), region.clone(), key, b"v1");

// Check the leader did a consistent read and renewed its lease.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

-> "Check if the leader does a consistent read and renews its lease."?

// Issue a write request.
cluster.must_put(key, b"v2");

// Check the leader has renewed its lease so that it could do lease read.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

-> Check if the leader has renewed its lease so that it can do lease read.

// Issue a read request and check the value on response.
must_read_on_peer(cluster, peer.clone(), region.clone(), key, b"v2");

// Check the leader did a local read.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

Check if the leader does a local read.

}

// A helper function for testing the lease reads when the lease has expired.
// When the leader lease is expired, there may be new leader elected and

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

->If the leader lease has expired,


// Avoid triggering the log compaction in this test case.
cluster.cfg.raft_store.raft_log_gc_threshold = 100;
// Increase raft tick interval to make this test case running reliably.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

raft tick interval -> the raft tick interval

let region_id = region.get_id();
cluster.must_transfer_leader(region_id, peer.clone());

// Isolate `peer` from other peers.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

-> Isolate the leaderpeer from other peers.

hhkbp2 added 2 commits Nov 25, 2016
@@ -110,7 +110,7 @@ fn test_renew_lease<T: Simulator>(cluster: &mut Cluster<T>) {
// Issue a read request and check the value on response.
must_read_on_peer(cluster, peer.clone(), region.clone(), key, b"v1");

// Check the leader did a local read.
// Check the leader does a local read.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

Check -> Check if

@@ -120,7 +120,7 @@ fn test_renew_lease<T: Simulator>(cluster: &mut Cluster<T>) {
// Issue a read request and check the value on response.
must_read_on_peer(cluster, peer.clone(), region.clone(), key, b"v1");

// Check the leader did a consistent read and renewed its lease.
// Check the leader does a consistent read and renewed its lease.

This comment has been minimized.

Copy link
@queenypingcap

queenypingcap Nov 25, 2016

Contributor

Check -> Check if

hhkbp2 added 2 commits Nov 25, 2016
@queenypingcap

This comment has been minimized.

Copy link
Contributor

queenypingcap commented Nov 25, 2016

LGTM

@ngaut ngaut merged commit 4a92e28 into master Nov 25, 2016
3 checks passed
3 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
licence/cla Contributor License Agreement is signed.
Details
@ngaut ngaut deleted the hhkbp2/improve-lease-read branch Nov 25, 2016
@hhkbp2

This comment has been minimized.

Copy link
Contributor Author

hhkbp2 commented Nov 25, 2016

The sysbench results on a cluster of AWS i2.xlarge 4 nodes(1 node for tidb and pd, 3 nodes for tikv):

binary sysbench type thread read/write requests per second response time average (ms) response time min (ms) response time max (ms) response time 99% (ms)
master select 256 6899.05 37.10 7.06 81.86 57.53
master select 256 6894.14 37.13 3.08 78.81 57.67
lease improved select 256 6455.28 39.65 2.70 177.38 62.02
lease improved select 256 6597.75 38.80 11.50 176.02 60.43
master insert 128 2309.83 55.41 17.14 773.92 91.11
master insert 128 2240.75 57.12 18.01 812.12 97.87
lease improved insert 128 2578.37 49.64 16.40 763.21 89.28
lease improved insert 128 2224.45 57.54 16.80 769.37 94.99

PTAL @BusyJay @siddontang @ngaut

@ngaut

This comment has been minimized.

Copy link
Member

ngaut commented Nov 26, 2016

Good. It's reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.