Seems region leaders can be lost after one TiKV instance fails #14547

Closed
hicqu opened this issue Apr 10, 2023 · 1 comment

Comments

hicqu commented Apr 10, 2023

Bug Report

What version of TiKV are you using?

TiKV
Release Version: 7.1.0-alpha
Edition: Community
Git Commit Hash: abb672b
Git Commit Branch: heads/refs/tags/v7.1.0-alpha
UTC Build Time: 2023-04-08 14:33:07
Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)
Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile: dist_release

What operating system and CPU are you using?

Steps to reproduce

  1. Deploy a TiDB cluster with 4 TiKV instances.
  2. Create 1000 sysbench tables with pre-split = 8, so that we get 256K Regions (see the sketch after this list).
  3. Create a TiCDC changefeed, and then kill one TiKV instance.
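
For context, a minimal sketch of where the 256K figure likely comes from, assuming pre-split = 8 means each table is pre-split into 2^8 = 256 Regions; the issue states only the total, so that interpretation and the variable names below are ours:

// Back-of-envelope check of the Region count in step 2. The assumption
// that pre-split = 8 yields 2^8 Regions per table is inferred from the
// stated total of 256K, not from the issue itself.
fn main() {
    let tables: u64 = 1000;
    let regions_per_table: u64 = 1 << 8; // pre-split = 8 -> 256 Regions per table
    assert_eq!(tables * regions_per_table, 256_000);
    println!("total regions: {}", tables * regions_per_table);
}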

What did you expect?

Changefeed checkpoint lag may increase, but should decrease after a while.

What happened?

Changefeed lag keeps increasing for about 1 hour.
This may involve several issues; let's focus on this one: it seems some Region leaders are lost in about 10 minutes.

hicqu commented Apr 14, 2023

I think the root cause is that when a TiKV node fails, other nodes can still dispatch queued Raft messages from it even after StoreUnreachable has been reported. This incorrectly moves Raft peers from GroupState::Chaos back to GroupState::Ordered.
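
To make the suspected transition concrete, here is a minimal standalone sketch. GroupState::Chaos and GroupState::Ordered are the states named above; the Peer struct, the method names, and the unreachable-store bookkeeping are simplified illustrations of the idea, not TiKV's actual raftstore code:

#[derive(Debug, Clone, Copy, PartialEq)]
enum GroupState {
    Ordered, // a live leader is known; followers may hibernate
    Chaos,   // a store became unreachable; peers must keep ticking
}

struct Peer {
    state: GroupState,
    unreachable_stores: Vec<u64>,
}

impl Peer {
    // On StoreUnreachable: wake the group so an election can happen.
    fn on_store_unreachable(&mut self, store_id: u64) {
        self.unreachable_stores.push(store_id);
        self.state = GroupState::Chaos;
    }

    // Buggy behavior described above: any delivered message, even one
    // already queued from the failed store, flips Chaos back to Ordered,
    // letting followers hibernate without a live leader.
    fn on_raft_message_buggy(&mut self, _from_store: u64) {
        self.state = GroupState::Ordered;
    }

    // Sketch of the intended behavior: messages from stores already
    // reported unreachable must not mark the group healthy again.
    fn on_raft_message_fixed(&mut self, from_store: u64) {
        if self.unreachable_stores.contains(&from_store) {
            return; // stale message from a failed node; stay in Chaos
        }
        self.state = GroupState::Ordered;
    }
}

fn main() {
    let mut peer = Peer { state: GroupState::Ordered, unreachable_stores: vec![] };
    peer.on_store_unreachable(1);  // store 1 (holding the leader) dies
    peer.on_raft_message_buggy(1); // a queued MsgAppend from store 1 arrives
    assert_eq!(peer.state, GroupState::Ordered); // wrong: follower may hibernate

    peer.state = GroupState::Chaos;
    peer.on_raft_message_fixed(1);
    assert_eq!(peer.state, GroupState::Chaos); // correct: group stays awake
}

Under this model, the fix amounts to refusing to treat traffic from an already-unreachable store as evidence that the group is healthy, which matches the commit message below ("peers shouldn't hibernate incorrectly when one node fails").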

#14574 adds a test case for this. In its log below, stores 2 and 3 receive the StoreUnreachable broadcast for store 1, yet both followers still handle a stale MsgAppend from peer 1, stop ticking, and hibernate:

test_store_disconnect_with_hibernate 2023/04/14 13:45:58.500 store.rs:2887: [INFO] broadcasting unreachable, unreachable_store_id: 1, store_id: 2
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.522 store.rs:2887: [INFO] broadcasting unreachable, unreachable_store_id: 1, store_id: 3
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.631 peer.rs:2446: [DEBG] handle raft message, to_peer_id: 3, from_peer_id: 1, message_type: MsgAppend, peer_id: 3, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.631 peer.rs:2446: [DEBG] handle raft message, to_peer_id: 2, from_peer_id: 1, message_type: MsgAppend, peer_id: 2, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.653 peer.rs:2175: [DEBG] stop ticking, election_elapsed: 1, peer_id: 2, region_id: 1, res: Some(CheckTickResult { leader: false, up_to_date: false, reason: "" })
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.675 peer.rs:2175: [DEBG] stop ticking, election_elapsed: 1, peer_id: 3, region_id: 1, res: Some(CheckTickResult { leader: false, up_to_date: false, reason: "" })
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.363 peer.rs:2136: [DEBG] follower hibernates, missing_ticks: 6, election_elapsed: 1, peer_id: 2, region_id: 1
test_store_disconnect_with_hibernate 2023/04/14 13:45:58.380 peer.rs:2136: [DEBG] follower hibernates, missing_ticks: 6, election_elapsed: 1, peer_id: 3, region_id: 1

hicqu self-assigned this Apr 14, 2023
ti-chi-bot added a commit that referenced this issue Apr 21, 2023
ref #14547

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Apr 21, 2023
ref tikv#14547

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
lidezhu pushed a commit to lidezhu/tikv that referenced this issue Apr 27, 2023
…#14574)

ref tikv#14547

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: lidezhu <lidezhu@pingcap.com>
ti-chi-bot bot added a commit that referenced this issue May 5, 2023
…) (#14649)

ref #14547, ref #14574

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: qupeng <qupeng@pingcap.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue May 31, 2023
…) (#14632)

ref #14547, ref #14574

raft: peers shouldn't hibernate incorrectly when one node fails

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: qupeng <qupeng@pingcap.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
hicqu closed this as completed Sep 13, 2023