Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

other tikv panic when inject one of tikv failure with ticdc changfeed running #16526

Closed
Lily2025 opened this issue Feb 18, 2024 · 3 comments · Fixed by #16645 or #16650
Closed

other tikv panic when inject one of tikv failure with ticdc changfeed running #16526

Lily2025 opened this issue Feb 18, 2024 · 3 comments · Fixed by #16645 or #16650
Assignees
Labels
component/CDC Component: Change Data Capture severity/critical type/bug Type: Issue - Confirmed a bug

Comments

@Lily2025
Copy link

Bug Report

What version of TiKV are you using?

./tikv-server -V
TiKV
Release Version: 8.0.0-alpha
Edition: Community
Git Commit Hash: 55c572a
Git Commit Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-02-10 11:56:41
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure trace-async-tasks openssl-vendored
Profile: dist_release
2024-02-13T21:21:47.678+0800

What operating system and CPU are you using?

8c/32g

Steps to reproduce

1、run ticdc changfeed
2、run workload
3、inject one of tikv failure

What did you expect?

no panic

What did happened?

other tikv panic when inject one of tikv failure with ticdc changfeed running

{"container":"tikv","log":"[lib.rs:477] ["region 33816 commit_ts: TimeStamp(447697680388849796), resolved_ts: TimeStamp(447697683350290435)"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:476:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2029:9\n std::panicking::rust_panic_with_hook\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:783:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:657:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:171:18\n 4: rust_begin_unwind\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:645:5\n 5: core::panicking::panic_fmt\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:72:14\n 6: cdc::delegate::Delegate::sink_txn_put\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:878:21\n cdc::delegate::Delegate::sink_put\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:837:13\n cdc::delegate::Delegate::sink_data\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:696:33\n cdc::delegate::Delegate::on_batch\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:558:17\n 7: cdc::endpoint::Endpoint<T,E,S>::on_multi_batch\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:886:33\n <cdc::endpoint::Endpoint<T,E,S> as tikv_util::worker::pool::Runnable>::run\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1275:18\n 8: tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/worker/pool.rs:517:25\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tracker/src/tls.rs:64:23\n std::thread::local::LocalKey::try_with\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:270:16\n std::thread::local::LocalKey::with\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:246:9\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tracker/src/tls.rs:62:27\n tikv_util::yatp_pool::future_pool::PoolInner::spawn::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/future_pool.rs:220:28\n yatp::task::future::RawTask::poll\n at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:59:9\n 9: yatp::task::future::TaskCell::poll\n at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:103:9\n <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:387:20\n 10: <tikv_util::yatp_pool::YatpPoolRunner as yatp::pool::runner::Runner>::handle\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/mod.rs:198:24\n yatp::pool::worker::WorkerThread<T,R>::run\n at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/pool/worker.rs:48:13\n yatp::pool::builder::LazyBuilder::build::{{closure}}\n at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/pool/builder.rs:114:25\n std::sys_common::backtrace::rust_begin_short_backtrace\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:155:18\n 11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:529:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9\n std::panicking::try::do_call\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:552:40\n std::panicking::try\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:516:19\n std::panic::catch_unwind\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:142:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:528:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5\n 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2015:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2015:9\n std::sys::unix::thread::Thread::new::thread_start\n at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n 13: start_thread\n 14: clone\n"] [location=components/cdc/src/delegate.rs:878] [thread_name=cdc-0] [thread_id=30]","level":"FATAL","namespace":"endless-ha-test-ticdc-tps-6541512-1-295","pod":"tc-tikv-1"}

@Lily2025 Lily2025 added the type/bug Type: Issue - Confirmed a bug label Feb 18, 2024
@Lily2025
Copy link
Author

/type bug
/severity critical
/assign hicqu

@hicqu
Copy link
Contributor

hicqu commented Feb 20, 2024

We reprocude the problem in another case. Here is some traces:

[2024/02/08 04:01:45.554 +08:00] [INFO] [initializer.rs:574] ["cdc incremental scan uses ts filter: true"] [tables=1] [filtered_versions=0] [mvcc_versions=0] [hint_min_ts=447567913016623946] [region_id=31469] [thread_id=230]
[2024/02/08 04:01:47.811 +08:00] [FATAL] [lib.rs:477] ["region 31469 commit_ts: TimeStamp(447567913200124524), resolved_ts: TimeStamp(447567916411387919)"] [backtrace="   0: tikv_util::set_panic_hook::{{closure}}\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:476:18\n   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2029:9\n      std::panicking::rust_panic_with_hook\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:783:13\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:657:13\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:171:18\n   4: rust_begin_unwind\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:645:5\n   5: core::panicking::panic_fmt\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:72:14\n   6: cdc::delegate::Delegate::sink_txn_put\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:876:21\n      cdc::delegate::Delegate::sink_put\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:835:13\n      cdc::delegate::Delegate::sink_data\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:694:33\n      cdc::delegate::Delegate::on_batch\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:556:17\n   7: cdc::endpoint::Endpoint<T,E,S>::on_multi_batch\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:886:33\n      <cdc::endpoint::Endpoint<T,E,S> as tikv_util::worker::pool::Runnable>::run\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1275:18\n   8: tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/worker/pool.rs:517:25\n      <tracker::tls::TrackedFuture<F> as core::future::future::Future>::poll::{{closure}}\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tracker/src/tls.rs:64:23\n      std::thread::local::LocalKey<T>::try_with\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:270:16\n      std::thread::local::LocalKey<T>::with\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:246:9\n      <tracker::tls::TrackedFuture<F> as core::future::future::Future>::poll\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tracker/src/tls.rs:62:27\n      tikv_util::yatp_pool::future_pool::PoolInner::spawn::{{closure}}\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/future_pool.rs:220:28\n      yatp::task::future::RawTask<F>::poll\n             at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:59:9\n   9: yatp::task::future::TaskCell::poll\n             at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:103:9\n      <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n             at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/task/future.rs:387:20\n  10: <tikv_util::yatp_pool::YatpPoolRunner<T> as yatp::pool::runner::Runner>::handle\n             at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/mod.rs:198:24\n      yatp::pool::worker::WorkerThread<T,R>::run\n             at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/pool/worker.rs:48:13\n      yatp::pool::builder::LazyBuilder<T>::build::{{closure}}\n             at rust/git/checkouts/yatp-e704b73c3ee279b6/793be4d/src/pool/builder.rs:114:25\n      std::sys_common::backtrace::__rust_begin_short_backtrace\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:155:18\n  11: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:529:17\n      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9\n      std::panicking::try::do_call\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:552:40\n      std::panicking::try\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:516:19\n      std::panic::catch_unwind\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:142:14\n      std::thread::Builder::spawn_unchecked_::{{closure}}\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:528:30\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5\n  12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2015:9\n      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2015:9\n      std::sys::unix::thread::Thread::new::thread_start\n             at rust/toolchains/nightly-2023-12-28-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n  13: start_thread\n  14: __GI___clone3\n"] [location=components/cdc/src/delegate.rs:876] [thread_name=cdc-0] [thread_id=36]

The region is registered with ts 447567913016623946, which is 2024-02-08 04:01:33.966 +0800 CST;
But its resolved ts is advanced to 447567916411387919, which is 2024-02-08 04:01:46.916 +0800 CST;
However, it receives a command batch from raftstore with ts 447567913200124524, which is 2024-02-08 04:01:34.666 +0800 CST.

It's very strange.

@hicqu
Copy link
Contributor

hicqu commented Mar 6, 2024

I have re-produced the issue in an integration test.
The bug is directly introduced by #16252 . The PR is expected to reduce I/O usage if only partial of a region is subscribed. However, lock tracker also needs to be updated, which was not considered when I wrote the PR.

Suppose we want to subscribe a range [a, j) in a region, and there is a key k already pre-written into the region. So we can see

  1. incremental scan will skip k because it's not in the required range;
  2. and, k won't be tracked in the resolver of the region;
  3. so resolved timestamp can be advanced;
  4. then k is committed with a given commit_ts, which can be less than ResolvedTs;
  5. so the panic occurs.

@jebter jebter added the component/CDC Component: Change Data Capture label Mar 6, 2024
ti-chi-bot bot added a commit that referenced this issue Mar 14, 2024
…16650)

close #16526

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
dbsid pushed a commit to dbsid/tikv that referenced this issue Mar 24, 2024
… (tikv#16650)

close tikv#16526

Signed-off-by: qupeng <qupeng@pingcap.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: dbsid <chenhuansheng@pingcap.com>
ti-chi-bot bot pushed a commit that referenced this issue Jun 17, 2024
…cription better (#16645)

close #16526

Signed-off-by: qupeng <qupeng@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/CDC Component: Change Data Capture severity/critical type/bug Type: Issue - Confirmed a bug
Projects
None yet
3 participants