Seems region leaders can be lost after one TiKV instance fails #14547

Closed
hicqu opened this issue Apr 10, 2023 · 1 comment

Comments

hicqu commented Apr 10, 2023

Bug Report

What version of TiKV are you using?

TiKV
Release Version: 7.1.0-alpha
Edition: Community
Git Commit Hash: abb672b
Git Commit Branch: heads/refs/tags/v7.1.0-alpha
UTC Build Time: 2023-04-08 14:33:07
Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile: dist_release

What operating system and CPU are you using?

Steps to reproduce

  1. Deploy a TiDB cluster with 4 TiKV instances.
  2. Create 1000 sysbench tables with pre-split = 8, so that we get 256K Regions (see the sketch after this list).
  3. Create a TiCDC changefeed, and then kill one TiKV instance.
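
For context, a minimal sketch of where the 256K figure likely comes from, assuming pre-split = 8 means each table is pre-split into 2^8 = 256 Regions; the issue states only the total, so that interpretation and the variable names below are ours:

// Back-of-envelope check of the Region count in step 2. The assumption
// that pre-split = 8 yields 2^8 Regions per table is inferred from the
// stated total of 256K, not from the issue itself.
fn main() {
    let tables: u64 = 1000;
    let regions_per_table: u64 = 1 << 8; // pre-split = 8 -> 256 Regions per table
    assert_eq!(tables * regions_per_table, 256_000);
    println!("total regions: {}", tables * regions_per_table);
}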

What did you expect?

Changefeed checkpoint lag may increase, but should decrease after a while.

What happened?

Changefeed lag keeps increasing for about 1 hour.
This may involve several issues; let's focus on this one: it seems some Region leaders are lost in about 10 minutes.

hicqu commented Apr 14, 2023

I think the root cause is that when a TiKV node fails, other nodes can still dispatch queued Raft messages from it even after StoreUnreachable has been reported. This incorrectly moves Raft peers from GroupState::Chaos back to GroupState::Ordered.
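
To make the suspected transition concrete, here is a minimal standalone sketch. GroupState::Chaos and GroupState::Ordered are the states named above; the Peer struct, the method names, and the unreachable-store bookkeeping are simplified illustrations of the idea, not TiKV's actual raftstore code:

#[derive(Debug, Clone, Copy, PartialEq)]
enum GroupState {
    Ordered, // a live leader is known; followers may hibernate
    Chaos,   // a store became unreachable; peers must keep ticking
}

struct Peer {
    state: GroupState,
    unreachable_stores: Vec<u64>,
}

impl Peer {
    // On StoreUnreachable: wake the group so an election can happen.
    fn on_store_unreachable(&mut self, store_id: u64) {
        self.unreachable_stores.push(store_id);
        self.state = GroupState::Chaos;
    }

    // Buggy behavior described above: any delivered message, even one
    // already queued from the failed store, flips Chaos back to Ordered,
    // letting followers hibernate without a live leader.
    fn on_raft_message_buggy(&mut self, _from_store: u64) {
        self.state = GroupState::Ordered;
    }

    // Sketch of the intended behavior: messages from stores already
    // reported unreachable must not mark the group healthy again.
    fn on_raft_message_fixed(&mut self, from_store: u64) {
        if self.unreachable_stores.contains(&from_store) {
            return; // stale message from a failed node; stay in Chaos
        }
        self.state = GroupState::Ordered;
    }
}

fn main() {
    let mut peer = Peer { state: GroupState::Ordered, unreachable_stores: vec![] };
    peer.on_store_unreachable(1);  // store 1 (holding the leader) dies
    peer.on_raft_message_buggy(1); // a queued MsgAppend from store 1 arrives
    assert_eq!(peer.state, GroupState::Ordered); // wrong: follower may hibernate

    peer.state = GroupState::Chaos;
    peer.on_raft_message_fixed(1);
    assert_eq!(peer.state, GroupState::Chaos); // correct: group stays awake
}

Under this model, the fix amounts to refusing to treat traffic from an already-unreachable store as evidence that the group is healthy, which matches the commit message below ("peers shouldn't hibernate incorrectly when one node fails").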

#14574 adds a test case for this. In its log below, stores 2 and 3 receive the StoreUnreachable broadcast for store 1, yet both followers still handle a stale MsgAppend from peer 1, stop ticking, and hibernate:

test_store_disconnect_with_hibernate 2023/04/14 13:45:58.500 store.rs:2887: [INFO] broadcasting unreachable, unreachable_store_id: 1, store_id: 2
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.522 store.rs:2887: [INFO] broadcasting unreachable, unreachable_store_id: 1, store_id: 3
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.631 peer.rs:2446: [DEBG] handle raft message, to_peer_id: 3, from_peer_id: 1, message_type: MsgAppend, peer_id: 3, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.631 peer.rs:2446: [DEBG] handle raft message, to_peer_id: 2, from_peer_id: 1, message_type: MsgAppend, peer_id: 2, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.653 peer.rs:2175: [DEBG] stop ticking, election_elapsed: 1, peer_id: 2, region_id: 1, res: Some(CheckTickResult { leader: false, up_to_date: false, reason: "" })
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.675 peer.rs:2175: [DEBG] stop ticking, election_elapsed: 1, peer_id: 3, region_id: 1, res: Some(CheckTickResult { leader: false, up_to_date: false, reason: "" })
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.363 peer.rs:2136: [DEBG] follower hibernates, missing_ticks: 6, election_elapsed: 1, peer_id: 2, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.380 peer.rs:2136: [DEBG] follower hibernates, missing_ticks: 6, election_elapsed: 1, peer_id: 3, region_id: 1

hicqu self-assigned this Apr 14, 2023
ti-chi-bot added a commit that referenced this issue Apr 21, 2023
ref #14547

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Apr 21, 2023
ref tikv#14547

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
lidezhu pushed a commit to lidezhu/tikv that referenced this issue Apr 27, 2023
…#14574)

ref tikv#14547

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: lidezhu <lidezhu@pingcap.com>
ti-chi-bot bot added a commit that referenced this issue May 5, 2023
…) (#14649)

ref #14547, ref #14574

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: qupeng <qupeng@pingcap.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue May 31, 2023
…) (#14632)

ref #14547, ref #14574

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: qupeng <qupeng@pingcap.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
hicqu closed this as completed Sep 13, 2023