Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: cache region heartbeat #2701

Closed
wants to merge 9 commits into from

Conversation

overvenus
Copy link
Member

Caching region heartbeats if the current heartbeat is same as the previous one, it should reduce a large number of unnecessary heartbeats.

// After ttl, there must be exact one heartbeat.
sleep(Duration::from_millis(heartbeat_interval_ms * heartbeat_ttl));
assert_eq!(pd_client.get_region_heartbeat_count(), count + 1);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a test that heartbeat is sent when ttl is not expired and heartbeat is updated.

@disksing
Copy link
Contributor

Can we make default region heartbeat interval shorter? (to make updating regions update faster)

@zhangjinpeng87
Copy link
Member

zhangjinpeng87 commented Jan 23, 2018

Also need to handle read statistics at https://github.com/pingcap/tikv/blob/master/src/pd/pd.rs#L511.

@overvenus
Copy link
Member Author

A friendly ping.

overvenus added a commit that referenced this pull request Mar 15, 2018
In tests, the mock pd-server(TestPdClient::Cluster) does not push region heartbeat responses. This PR try to simulate pushing by polling responses in the background.

It's part of region heartbeat cache #2701, for fixing tests.
src/pd/pd.rs Outdated
let cache = self.heartbeat_cache
.entry(region_id)
.or_insert_with(RegionHeartbeatCache::default);
info!("debug: before merge {:?}\n{:?}", cache, hb_task);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these debug logs.

src/pd/pd.rs Outdated
self.store_stat
.region_bytes_written
.observe(region_stat.written_bytes as f64);
.observe(cache.written_bytes_delta as f64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that the previous statistics are wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are both right, the written_bytes is a delta. See line 555 https://github.com/pingcap/tikv/pull/2701/files#diff-a2b393fbe5f436bb88e05e100af8c1b4L555

src/pd/pd.rs Outdated
self.store_stat
.region_keys_read
.observe(region_stat.read_keys as f64);
.observe(cache.read_keys_delta as f64);

// Now we use put region protocol for heartbeat.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wha is "put region protocol"?

@overvenus
Copy link
Member Author

PTAL

src/pd/pd.rs Outdated
fn handle_reconnect(&mut self) {
if self.is_handle_reconnect_scheduled {
if self.invalidate_cache.swap(false, Ordering::Relaxed) {
info!("invalidate region heartbeat cache");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what situation will the runner be re-run?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we need to reset is_handle_reconnect_scheduled here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the client connects to a new PD leader.

src/pd/pd.rs Outdated
self.last_written_bytes = total_written_bytes;
self.valid = false;
}
// TODO: what if last_written_bytes > total_written_bytes?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why could it happend?

@overvenus
Copy link
Member Author

A friendly ping.

Copy link
Contributor

@nolouch nolouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@overvenus
Copy link
Member Author

A friendly ping.

src/pd/pd.rs Outdated
@@ -486,20 +591,36 @@ impl<T: PdClient> Runner<T> {
self.is_hb_receiver_scheduled = true;
}

fn handle_reconnect(&mut self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does two things:

  1. If is_handle_reconnect_scheduled is false, it means that the reconnect handler in pd client hasn't been initialized, we should set it.
  2. If is_handle_reconnect_scheduled is true, it means that we have initialized the reconnect handler, then we should check if a reconnect has happened, and invalidate the cache if it did.

Am I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I prefer that keep things together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is very confusing, it takes me a while to figure this.
I think you can just initialize the reconnect handler somewhere, and just handle reconnect here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need comments here.

src/pd/pd.rs Outdated
self.store_stat
.region_bytes_written
.observe(region_stat.written_bytes as f64);
.observe(cache.written_bytes_delta as f64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if written_bytes_delta is not changed between multi handle_heartbeat?

@@ -957,8 +957,8 @@ impl Peer {
}
self.mut_store().apply_state = res.apply_state.clone();
self.mut_store().applied_index_term = res.applied_index_term;
self.peer_stat.written_keys += res.metrics.written_keys;
self.peer_stat.written_bytes += res.metrics.written_bytes;
self.peer_stat.total_written_keys += res.metrics.written_keys;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update peer_stat for followers and learners? And, I think we also need to clear peer_stat in on_role_changed is we do so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change peer_stat's behavior, it may incompatible with PD and it becomes a breaking change.

src/pd/pd.rs Show resolved Hide resolved
@hicqu
Copy link
Contributor

hicqu commented Apr 19, 2018

rest LGTM.

@@ -486,20 +595,29 @@ impl<T: PdClient> Runner<T> {
self.is_hb_receiver_scheduled = true;
}

fn maybe_handle_reconnect(&mut self) {
if self.invalidate_cache.swap(false, Ordering::Relaxed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer Ordering::SeqCst to save other's brain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relaxed means that there is no memory ordering problem, so I think Relaxed is more clear.

@Hoverbear
Copy link
Contributor

🙄 The build failed because of the Rust version upgrade having a weird Rustfmt incompatability. Please trigger a rebuild without the cache @overvenus .

@overvenus
Copy link
Member Author

@Hoverbear Never mind, we are not going to merge this PR since PD works smoothly even if there are many region heartbeats.

@overvenus overvenus added the DNM label May 4, 2018
@Hoverbear
Copy link
Contributor

@overvenus So should we close this then? Or is it just delayed for now?

@overvenus
Copy link
Member Author

Let's close it.

@overvenus overvenus closed this May 5, 2018
sticnarf pushed a commit to sticnarf/tikv that referenced this pull request Oct 27, 2019
In tests, the mock pd-server(TestPdClient::Cluster) does not push region heartbeat responses. This PR try to simulate pushing by polling responses in the background.

It's part of region heartbeat cache tikv#2701, for fixing tests.
@overvenus overvenus deleted the region-heartbeat-cache branch January 16, 2020 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants