Bug Report
What version of TiKV are you using?
v5.3.0
What operating system and CPU are you using?
-
Steps to reproduce
The exact way to trigger the issue is currently uncertain.
What did you expect?
TiKV runs normally.
What happened?
- Some queries in TiDB report this error:
MySQL [test]> insert into t1(id,name) value(1,'dfdfd'),(2,'dfdfdffff');
ERROR 9011 (HY000): TiKV max timestamp is not synced
This means the async commit / 1PC transaction fails because the max_ts associated with this region was not updated.
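For context, a minimal sketch of the guard behavior described above. This is hypothetical and not TiKV's actual code: the struct `RegionState`, the flag `max_ts_synced`, and the function `check_async_commit` are all illustrative names, assuming only that an async-commit/1PC request is rejected while the region's max timestamp has not been resynced.

```rust
// Hypothetical sketch (not TiKV source): async-commit/1PC requests are
// rejected until the region's max_ts has been resynced, which is what
// surfaces in TiDB as "ERROR 9011: TiKV max timestamp is not synced".
struct RegionState {
    max_ts_synced: bool,
}

fn check_async_commit(state: &RegionState) -> Result<(), String> {
    if !state.max_ts_synced {
        return Err("max timestamp is not synced".to_string());
    }
    Ok(())
}

fn main() {
    let state = RegionState { max_ts_synced: false };
    // The request is rejected while the flag is unset.
    println!("{:?}", check_async_commit(&state));
}
```

If the component responsible for resyncing max_ts is stuck (as suspected below), this check would fail indefinitely.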
-
One of the TiKV processes is alive, but PD shows it as down. The TiKV log is nearly empty, except for some PD-reconnection logs printed every 10 minutes.
-
The metrics show that this TiKV node's PD heartbeat count is zero; however, TiKV's GC worker is updating the GC safepoint normally.
This looks like a deadlock involving the PD worker thread, which was confirmed by @BusyJay .
Partial stack provided by @BusyJay :
Thread 36 (Thread 0x7fb7b5df3700 (LWP 13739)):
#0 0x00007fb82e2dd54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb82e2d8e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007fb82e2d8d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00005597d049cd69 in std::sys::unix::mutex::Mutex::lock (self=<optimized out>) at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys/unix/mutex.rs:63
#4 std::sys_common::mutex::MovableMutex::raw_lock (self=<optimized out>) at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/mutex.rs:76
#5 std::sync::mutex::Mutex<T>::lock (self=<optimized out>) at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sync/mutex.rs:261
#6 <tikv_util::future::BatchCommandsWaker as futures_task::arc_wake::ArcWake>::wake_by_ref (arc_self=<optimized out>) at components/tikv_util/src/future.rs:84
36 Thread 0x7fb7b5df3700 (LWP 13739) "pd-worker-0" 0x00007fb82e2dd54d in __lll_lock_wait () from /lib64/libpthread.so.0
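The stack suggests the hazard pattern: `BatchCommandsWaker::wake_by_ref` acquires a mutex, so if a thread calls the waker while it already holds that same (non-reentrant) lock, it blocks on itself forever, matching `pd-worker-0` stuck in `__lll_lock_wait`. A minimal sketch of that pattern, with hypothetical names (`Waker`, `state`), using `try_lock` to demonstrate the would-be self-deadlock without actually hanging:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch of a waker that takes a mutex in wake(), similar in
// spirit to the lock taken in BatchCommandsWaker::wake_by_ref.
struct Waker {
    state: Mutex<Option<String>>,
}

impl Waker {
    // If the calling thread already holds `state`, a non-reentrant
    // std::sync::Mutex blocks here forever (self-deadlock).
    #[allow(dead_code)]
    fn wake(&self) {
        let _guard = self.state.lock().unwrap();
    }
}

fn main() {
    let w = Arc::new(Waker { state: Mutex::new(None) });

    // Simulate the hazard: the worker thread holds the lock, then the waker
    // is triggered on the same thread. try_lock reports the contention that
    // a plain lock() would deadlock on.
    let _held = w.state.lock().unwrap();
    let would_deadlock = w.state.try_lock().is_err();
    println!("re-entrant lock would deadlock: {}", would_deadlock);
}
```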