Deadlock between the PD client thread and other threads calling PD sync requests. #8610

Closed
youjiali1995 opened this issue Sep 7, 2020 · 3 comments · Fixed by #8611
Labels
component/pd-client Component: pd severity/critical type/bug Type: Issue - Confirmed a bug
Comments

@youjiali1995
Contributor

Bug Report

What version of TiKV are you using?

v4.0.3, v4.0.4, v4.0.5 and master.

Steps to reproduce

#7028 introduced a feature where the PD client periodically reconnects to the PD cluster on the PD client thread. However, when another thread issues a PD sync request, it holds a read lock while waiting for the PD client thread to complete the request; the periodic reconnect on the PD client thread meanwhile tries to acquire the write lock on the same state, so the two threads deadlock.
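The shape of the problem, as a minimal standalone sketch (the names are illustrative, not the actual TiKV code): a caller takes the read lock on the shared client state and then blocks until the PD client thread finishes the request, while the periodic reconnect on that same PD client thread blocks on the write lock, so neither side can make progress.

```rust
// Minimal sketch of the deadlock pattern; running it hangs forever, which is the point.
use std::sync::{Arc, Condvar, Mutex, RwLock};
use std::thread;

struct Inner {
    // In TiKV this holds the gRPC stub, heartbeat streams, etc.
    leader: String,
}

fn main() {
    let client = Arc::new(RwLock::new(Inner { leader: "pd-0".to_owned() }));
    // Completion signal for the pending sync request.
    let done = Arc::new((Mutex::new(false), Condvar::new()));

    // Caller side (e.g. addr-resolver): take the READ lock to issue the request,
    // then wait for the PD client thread to finish it.
    let read_guard = client.read().unwrap();

    // "PD client thread": the periodic reconnect runs first and needs the WRITE
    // lock; only afterwards would it get around to completing pending requests.
    let pd_client_thread = {
        let client = Arc::clone(&client);
        let done = Arc::clone(&done);
        thread::spawn(move || {
            let mut inner = client.write().unwrap(); // blocks: the read lock is held
            inner.leader = "pd-1".to_owned();
            drop(inner);
            // Never reached: the completion the caller is waiting for.
            let (lock, cvar) = &*done;
            *lock.lock().unwrap() = true;
            cvar.notify_all();
        })
    };

    // The caller waits while still holding the read lock -> deadlock.
    let (lock, cvar) = &*done;
    let mut finished = lock.lock().unwrap();
    while !*finished {
        finished = cvar.wait(finished).unwrap();
    }
    drop(read_guard);
    pd_client_thread.join().unwrap();
}
```

This matches the two backtraces below: addr-resolver is parked in a condvar wait with the read lock held, while the PD client thread is blocked in __pthread_rwlock_wrlock.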

Backtraces of the relevant threads:

addr-resolver

Thread 49 (Thread 0x7f2f30dff700 (LWP 19448)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f2f304020ec) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f2f30402090, cond=0x7f2f304020c0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f2f304020c0, mutex=0x7f2f30402090) at pthread_cond_wait.c:638
#3  0x0000562317f28bb2 in futures::task_impl::std::ThreadNotify::park::hb9c6d2336011224b (self=0x7f2f30441010) at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/libstd/sys/unix/condvar.rs:69
#4  0x000056231800dc04 in kvproto::protos::pdpb_grpc::PdClient::get_store_opt::h1d3ede4757ceaee5 (self=<optimized out>, req=<optimized out>, opt=...) at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/task_impl/std/mod.rs:237
#5  0x000056231805f47e in _$LT$pd_client..client..RpcClient$u20$as$u20$pd_client..PdClient$GT$::get_store::h1116aaaf5646e934 (self=<optimized out>, store_id=<optimized out>) at components/pd_client/src/client.rs:268
#6  0x0000562317d6992c in std::sys_common::backtrace::__rust_begin_short_backtrace::h911d4b4d33daad2e (f=...) at /home/jenkins/agent/workspace/tikv_ghpr_build_release/tikv/src/server/resolve.rs:76
#7  0x0000562317d68cfc in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h8aa8753cfbf47b27 () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/libstd/thread/mod.rs:475
#8  0x000056231849b03a in _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h13d34828db364579 () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/liballoc/boxed.rs:1078
#9  _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hd51b619e0f884abf () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/liballoc/boxed.rs:1078
#10 std::sys::unix::thread::Thread::new::thread_start::h02c6e34c2c73f344 () at src/libstd/sys/unix/thread.rs:87
#11 0x00007f2f4ac03ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007f2f4ab08daf in clone () from /lib/x86_64-linux-gnu/libc.so.6

pd-0

Thread 46 (Thread 0x7f2f333fc700 (LWP 18989)):
#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x7f2f3b899580) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x7f2f3b899580) at pthread_rwlock_wrlock.c:27
#3  0x000056231804f5b6 in _$LT$std..sync..rwlock..RwLock$LT$T$GT$$u20$as$u20$tikv_util..HandyRwLock$LT$T$GT$$GT$::wl::h2609cd093c320029 (self=0x7f2f32c09010) at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/libstd/sys/unix/rwlock.rs:73
#4  0x000056231804bee2 in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hc8b2c9fb4410b36c (self=..., cx=0x7f2f333db538) at components/pd_client/src/util.rs:191
#5  0x000056231804ac2e in _$LT$futures_util..future..future..unit_error..UnitError$LT$Fut$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hd9b722a5e3118f92 (self=..., cx=0x7f2f333db538) at components/pd_client/src/client.rs:84
#6  0x000056231804a93f in _$LT$futures_util..compat..compat03as01..Compat$LT$Fut$GT$$u20$as$u20$futures..future..Future$GT$::poll::hf842b363148bae94 (self=0x7f2f4a43d0c0) at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/libcore/future/future.rs:119
#7  0x0000562317f3ebe8 in grpcio::task::executor::poll::h72a0e0c492811446 (cq=<optimized out>, task=..., woken=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/futures-0.1.29/src/future/mod.rs:113
#8  0x0000562317f3d9e8 in std::sys_common::backtrace::__rust_begin_short_backtrace::h02fc2e0ce18af3bb (f=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.5.3/src/task/executor.rs:136
#9  0x0000562317f3b765 in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hd6ffd35e8771791c () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/libstd/thread/mod.rs:475
#10 0x000056231849b03a in _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h13d34828db364579 () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/liballoc/boxed.rs:1078
#11 _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hd51b619e0f884abf () at /rustc/16957bd4d3a5377263f76ed74c572aad8e4b7e59/src/liballoc/boxed.rs:1078
#12 std::sys::unix::thread::Thread::new::thread_start::h02c6e34c2c73f344 () at src/libstd/sys/unix/thread.rs:87
#13 0x00007f2f4ac03ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f2f4ab08daf in clone () from /lib/x86_64-linux-gnu/libc.so.6

What happened?

Deadlock between the PD client thread and other threads calling PD sync requests.

@youjiali1995
Contributor Author

youjiali1995 commented Sep 7, 2020

The workaround is to modify the config of TiKV. For tiup:

  tikv:
    pd.update-interval: 68719476734ms

68719476734ms is the maximum duration supported by tokio-timer (about 2.2 years). With this setting the periodic reconnect, and with it the chance of hitting the bug, only comes around roughly once every 2.2 years.
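For reference, a quick check of the "about 2.2 years" figure (plain arithmetic, not TiKV code):

```rust
fn main() {
    // 68719476734 ms expressed in days and years.
    let ms: u64 = 68_719_476_734;
    let days = ms as f64 / 1000.0 / 60.0 / 60.0 / 24.0;
    println!("{:.1} days ≈ {:.2} years", days, days / 365.25);
    // prints: 795.4 days ≈ 2.18 years
}
```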

@youjiali1995
Contributor Author

Steps to confirm it:

  1. Use pd-ctl store to check for disconnected or down stores.
  2. Find the TiKV log file that covers the store's last_heartbeat_ts.
  3. Search for the last "connected to PD leader" entry in that log file. If it is not followed by logs like the ones below, i.e., "connected to PD leader" is the last log emitted by util.rs, the deadlock has occurred.
[2020/09/03 15:34:21.315 +08:00] [INFO] [util.rs:452] ["connected to PD leader"] [endpoints=http://172.16.5.32:6379]
[2020/09/03 15:34:21.315 +08:00] [INFO] [util.rs:195] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/09/03 15:34:21.315 +08:00] [INFO] [client.rs:411] ["cancel region heartbeat sender"]
[2020/09/03 15:34:21.315 +08:00] [WARN] [util.rs:212] ["updating PD client done"] [spend=2.467084ms]

@BusyJay
Member

BusyJay commented Sep 10, 2020

Such errors have happened many times. Maybe we should just avoid locks, as raft_client has done.
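One way to do that, sketched under assumptions (the type names are made up and the gRPC stub is assumed to be cheaply cloneable; this is not the code from #8611 or raft_client): copy what you need out of the RwLock and drop the guard before blocking, so the reconnect loop can take the write lock at any time.

```rust
// Sketch only: avoid holding the client lock across a blocking wait.
use std::sync::{Arc, RwLock};

#[derive(Clone)]
struct PdStub; // stands in for the cloneable gRPC client stub

impl PdStub {
    fn get_store_blocking(&self, _store_id: u64) -> Result<String, String> {
        Ok("store".to_owned())
    }
}

struct Inner {
    stub: PdStub,
}

fn get_store(client: &Arc<RwLock<Inner>>, store_id: u64) -> Result<String, String> {
    // Clone the stub out of the lock and release the guard immediately,
    // instead of holding the read lock across the blocking call.
    let stub = client.read().unwrap().stub.clone();
    // The write lock is free for the periodic reconnect from here on.
    stub.get_store_blocking(store_id)
}

fn main() {
    let client = Arc::new(RwLock::new(Inner { stub: PdStub }));
    println!("{:?}", get_store(&client, 1));
}
```

Whatever shape the final fix takes, the invariant is the same: hold the client lock only long enough to read or swap shared state, never across a wait.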
