Raftstore deadlock during migrating regions #10909

Closed
Lily2025 opened this issue Sep 6, 2021 · 6 comments · Fixed by #10910
Labels: severity/critical, type/bug (Type: Issue - Confirmed a bug)

Comments


Lily2025 commented Sep 6, 2021

Bug Report

What version of TiKV are you using?

TiKV
Release Version: 5.2.0
Edition: Community
Git Commit Hash: 556783c
Git Commit Branch: heads/refs/tags/v5.2.0
UTC Build Time: 2021-08-26 05:50:07
Rust Version: rustc 1.56.0-nightly (2faabf579 2021-07-27)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile: dist_release

What operating system and CPU are you using?

Kubernetes (k8s): 2 PD, 2 TiDB, 5 TiKV
4 CPU cores, 8 GB memory

Steps to reproduce

1. Set up 2 PD, 2 TiDB, and 4 TiKV.
2. Run TiKVFailover001 (takes one TiKV down). The TiKV does not recover because of the case logic (the chaos is not deleted).
3. Scale out one TiKV with k8s (now 4 TiKV).
4. Delete the chaos and recover the TiKV (now 5 TiKV).

What did you expect?

Store size, leaders, and regions are balanced.

What happened?

Store size, leaders, and regions are not balanced.
[Screenshots: Feishu 20210906-171837, Feishu 20210906-171900, Feishu 20210906-171916]

Author

Lily2025 commented Sep 6, 2021

/assign 5kbpers

Author

Lily2025 commented Sep 6, 2021

/severity Critical

Author

Lily2025 commented Sep 6, 2021

/type bug

ti-chi-bot added the type/bug (Type: Issue - Confirmed a bug) label Sep 6, 2021
github-actions bot added this to Need Triage in Question and Bug Reports Sep 6, 2021
Contributor

nolouch commented Sep 6, 2021

#0  0x00007f2536da14ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2536d9cdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f2536d9cc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055a6ffc455c2 in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::maybe_create_peer_internal::hb4a38ea8f75478f8 ()
#4  0x000055a6ffc385ac in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::maybe_create_peer::h8f442248e8d62dfc ()
#5  0x000055a6ffc37521 in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::on_raft_message::h4d43b5dfc6a12f45 ()
#6  0x000055a6ffc3453a in raftstore::store::fsm::store::StoreFsmDelegate$LT$EK$C$ER$C$T$GT$::handle_msgs::h6fa8c55f9c0a2df8 ()

5kbpers changed the title from "Store size 、leader and region is not balanced" to "Raftstore deadlock during migrating regions" Sep 6, 2021
Member

5kbpers commented Sep 6, 2021

(gdb) bt
#0  0x00007f2536da14ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2536d9cdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f2536d9cc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055a6ffc1505e in raftstore::store::fsm::peer::PeerFsmDelegate$LT$EK$C$ER$C$T$GT$::on_ready_apply_snapshot::hb8ba6dd7648d255a ()
#4  0x000055a6ffc104b6 in raftstore::store::fsm::peer::PeerFsmDelegate$LT$EK$C$ER$C$T$GT$::post_raft_ready_append::h1fcc32c1524e2da0 ()
#5  0x000055a6ffc278b2 in raftstore::store::fsm::store::RaftPoller$LT$EK$C$ER$C$T$GT$::handle_raft_ready_write::h01da8c1bdafe94bc ()
#6  0x000055a6ff94e036 in _$LT$raftstore..store..fsm..store..RaftPoller$LT$EK$C$ER$C$T$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..peer..PeerFsm$LT$EK$C$ER$GT$$C$raftstore..store..fsm..store..StoreFsm$LT$EK$GT$$GT$$GT$::end::ha140758f8328ea0d ()
#7  0x000055a6ff903e8e in batch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h5c7b49aadf7c2a97 ()
#8  0x000055a6ffce20ae in std::sys_common::backtrace::__rust_begin_short_backtrace::h900083fbcc827393 ()
#9  0x000055a6ffcf8afd in std::panicking::try::do_call::hfbcd62ff2215291e ()
#10 0x000055a6ffe967ad in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hb9b7ea1369649798 ()
#11 0x000055a7003b6897 in call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
#12 call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572
#13 std::sys::unix::thread::Thread::new::thread_start::hb71b17934c5f5e68 () at library/std/src/sys/unix/thread.rs:91
#14 0x00007f2536d9add5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f25363a3ead in clone () from /lib64/libc.so.6

Member

5kbpers commented Sep 6, 2021

According to the stack frames above, we can confirm that the deadlock was caused by the order in which store_meta and global_replication_state are locked, which was introduced in #10802.
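For readers unfamiliar with this class of bug, here is a minimal sketch of the usual remedy: every code path acquires the two locks in the same order. This is not TiKV's code and not necessarily how #10910 fixed it. The types `StoreMeta`, `GlobalReplicationState`, and `Shared` and the functions `create_peer`, `apply_snapshot`, and `run` are hypothetical stand-ins that only loosely mirror the names in the traces, and the traces do not show which lock each real path takes first.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-ins for the two pieces of shared state; not TiKV's real types.
#[derive(Default)]
struct StoreMeta {
    pending_peers: u64,
}

#[derive(Default)]
struct GlobalReplicationState {
    applied_snapshots: u64,
}

#[derive(Default)]
struct Shared {
    store_meta: Mutex<StoreMeta>,
    global_replication_state: Mutex<GlobalReplicationState>,
}

// Rule assumed for this sketch: always lock store_meta first, then
// global_replication_state. If one path used the opposite order, two
// threads could each hold one lock and wait forever for the other.
fn create_peer(shared: &Shared) {
    let mut meta = shared.store_meta.lock().unwrap();
    let _state = shared.global_replication_state.lock().unwrap();
    meta.pending_peers += 1;
}

fn apply_snapshot(shared: &Shared) {
    let _meta = shared.store_meta.lock().unwrap();
    let mut state = shared.global_replication_state.lock().unwrap();
    state.applied_snapshots += 1;
}

// Runs both paths concurrently and returns the two counters.
fn run(iterations: u64) -> (u64, u64) {
    let shared = Arc::new(Shared::default());

    let s = Arc::clone(&shared);
    let store_thread = thread::spawn(move || {
        for _ in 0..iterations {
            create_peer(&s);
        }
    });

    let s = Arc::clone(&shared);
    let peer_thread = thread::spawn(move || {
        for _ in 0..iterations {
            apply_snapshot(&s);
        }
    });

    store_thread.join().unwrap();
    peer_thread.join().unwrap();

    let peers = shared.store_meta.lock().unwrap().pending_peers;
    let snapshots = shared.global_replication_state.lock().unwrap().applied_snapshots;
    (peers, snapshots)
}

fn main() {
    let (peers, snapshots) = run(10_000);
    println!("pending_peers={peers} applied_snapshots={snapshots}");
}
```

Because both functions follow the same acquisition order, the two threads can contend but cannot end up each holding one lock while waiting on the other, which is the pattern the two backtraces suggest.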
