TiKV panic 'error: Corruption: L6 has overlapping ranges' #8243

Closed
MyonKeminta opened this issue Jul 14, 2020 · 14 comments · Fixed by #9013
Labels
component/rocksdb Component: RocksDB engine · priority/high Priority: High · severity/critical · sig/engine SIG: Engine · status/discussion Status: Under discussion or need discussion · type/bug Type: Issue - Confirmed a bug

Comments

@MyonKeminta
Contributor

MyonKeminta commented Jul 14, 2020

Bug Report

What version of TiKV are you using?

v4.0.1

Steps to reproduce

Unknown

What did you expect?

TiKV runs properly

What happened?

TiKV panics:

[2020/07/03 12:24:11.015 +00:00] [FATAL] [lib.rs:481] ["rocksdb background error. db: kv, reason: compaction, error: Corruption: L6 have overlapping ranges '7A7480000000000112FFBC5F698000000000FF0000040380000000FF000E1xxxx......' seq:0, type:1 vs. '7A7480000000000112FFBC5F698000000000FF0000040380000000FF000E0xxxx....' seq:0, type:1"] [backtrace="stack backtrace:\n   0: tikv_util::set_panic_hook::{{closure}}\n             at components/tikv_util/src/lib.rs:480\n   1: std::panicking::rust_panic_with_hook\n             at src/libstd/panicking.rs:475\n   2: rust_begin_unwind\n             at src/libstd/panicking.rs:375\n   3: std::panicking::begin_panic_fmt\n             at src/libstd/panicking.rs:326\n   4: <engine_rocks::event_listener::RocksEventListener as rocksdb::event_listener::EventListener>::on_background_error\n             at components/engine_rocks/src/event_listener.rs:66\n   5: rocksdb::event_listener::on_background_error\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/src/event_listener.rs:254\n   6: _ZN24crocksdb_eventlistener_t17OnBackgroundErrorEN7rocksdb21BackgroundErrorReasonEPNS0_6StatusE\n             at crocksdb/c.cc:2140\n   7: _ZN7rocksdb12EventHelpers23NotifyOnBackgroundErrorERKSt6vectorISt10shared_ptrINS_13EventListenerEESaIS4_EENS_21BackgroundErrorReasonEPNS_6StatusEPNS_17InstrumentedMutexEPb\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/db/event_helpers.cc:53\n   8: _ZN7rocksdb12ErrorHandler10SetBGErrorERKNS_6StatusENS_21BackgroundErrorReasonE\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/db/error_handler.cc:220\n   9: _ZN7rocksdb6DBImpl20BackgroundCompactionEPbPNS_10JobContextEPNS_9LogBufferEPNS0_19PrepickedCompactionENS_3Env8PriorityE\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2797\n  10: _ZN7rocksdb6DBImpl24BackgroundCallCompactionEPNS0_19PrepickedCompactionENS_3Env8PriorityE\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2317\n  11: _ZN7rocksdb6DBImpl16BGWorkCompactionEPv\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2092\n  12: _ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266\n  13: _ZN7rocksdb14ThreadPoolImpl4Impl15BGThreadWrapperEPv\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d472363/librocksdb_sys/rocksdb/util/threadpool_imp.cc:307\n  14: execute_native_thread_routine\n  15: start_thread\n  16: clone\n"] [location=components/engine_rocks/src/event_listener.rs:66] [thread_name=<unnamed>]

The key in the panic log cannot be found in TiKV's logs about ingesting SSTs, nor in RocksDB's logs and manifests.

After the panic mark file is deleted and the TiKV node restarted, it soon panics again, printing the same key in the log.

cc @Little-Wallace @yiwu-arbug @zhangjinpeng1987

@MyonKeminta MyonKeminta added the component/rocksdb Component: RocksDB engine label Jul 14, 2020
@yiwu-arbug yiwu-arbug added the status/discussion Status: Under discussion or need discussion label Jul 14, 2020
@yiwu-arbug

@MyonKeminta does the instance still exist?

@Little-Wallace is this the issue you mentioned?

@MyonKeminta
Contributor Author

MyonKeminta commented Jul 15, 2020

@yiwu-arbug The TiKV node was stopped, and the data and logs still exist.

@yiwu-arbug yiwu-arbug added sig/engine SIG: Engine type/bug Type: Issue - Confirmed a bug labels Jul 17, 2020
@yiwu-arbug yiwu-arbug added this to To Do in Engine SIG via automation Jul 17, 2020
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Jul 17, 2020
@yiwu-arbug

The above error is caused by RocksDB detecting unordered SST files after an L5->L6 compaction. The two unordered SSTs were generated by the same compaction. After investigating the remaining DB, we found that one of the compaction input SSTs is unordered internally. The problematic SST was generated as the result of multiple compactions, and the intermediate results are gone, so we are not able to investigate further this time. The sequence of compactions that led to the problematic SST also had ingested SSTs participating in it, so we cannot rule out that one of the ingested files was unordered.

The follow-up will be to add logic that fails the compaction once an unordered result is generated, so that next time we reproduce the problem we can examine the data and see how it could happen.
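
The idea, roughly, as a minimal sketch (the class name and the plain byte-wise comparison are assumptions here; the actual patch would compare with RocksDB's internal key comparator):

```cpp
// Hypothetical sketch of the planned check, not the actual patch: remember
// the last key written to a compaction output and report corruption as soon
// as an out-of-order key shows up, instead of persisting an unordered SST.
#include <string>

class CompactionOutputOrderChecker {
 public:
  // Returns false if `key` does not sort strictly after the previous key.
  bool AddKey(const std::string& key) {
    if (has_prev_ && key <= prev_key_) {
      return false;  // unordered output: fail the compaction job here
    }
    prev_key_ = key;
    has_prev_ = true;
    return true;
  }

 private:
  std::string prev_key_;
  bool has_prev_ = false;
};
```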

@yiwu-arbug yiwu-arbug self-assigned this Jul 17, 2020
@MyonKeminta
Contributor Author

@yiwu-arbug Thank you :)

@yiwu-arbug

Got another reproduction from another user's POC test. Following up.

@yiwu-arbug

The issue, at least in the last reproduction, is due to a RocksDB block cache key conflict, which causes compaction to read the wrong block content from the file. facebook/rocksdb#7405 (comment)
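
For background, on Linux the block cache key prefix was derived from filesystem metadata of the SST file. A simplified sketch of the scheme (the function name is made up, and the real encoding also includes the device number and uses varint encoding):

```cpp
// Simplified illustration of the pre-fix cache key scheme: the key prefix
// comes from the file's inode number and inode generation. If the filesystem
// hands two different SST files the same (inode, generation) pair, their
// cache keys collide, and a compaction can read a cached block that actually
// belongs to another file.
#include <cstdint>
#include <string>

std::string MakeBlockCacheKey(uint64_t inode, uint64_t generation,
                              uint64_t block_offset) {
  std::string key;
  key.append(reinterpret_cast<const char*>(&inode), sizeof(inode));
  key.append(reinterpret_cast<const char*>(&generation), sizeof(generation));
  key.append(reinterpret_cast<const char*>(&block_offset),
             sizeof(block_offset));
  return key;  // identical (inode, generation, offset) => same cache entry
}
```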

@yiwu-arbug

The cache key conflict is more likely to happen AFTER this kernel patch, which changed the inode generation number from sequential to random: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=232530680290ba94ca37852ab10d9556ea28badf

@yiwu-arbug

Adding compaction and read-path consistency checks (by @Connor1996): tikv/rocksdb#195

@yiwu-arbug

yiwu-arbug commented Oct 15, 2020

Known affected operating systems:
Ubuntu 20.04 LTS, kernel 5.4.0
Ubuntu 18.04 LTS, kernel 5.3.0
Ubuntu 18.04 LTS, kernel 4.15
Debian 10, kernel 4.19.0

Known unaffected operating systems:
CentOS 8.0.1905, kernel 4.18.0
Debian 9, kernel 4.9.0

The test:
Run the program https://gist.github.com/ajkr/2eac6fe4d918d0c8819e9656ec4eab41 as ./a.out <path> 10, where <path> is a new file on the disk that stores TiKV data. Check whether the second number on each line is sequential or random. The operating system is affected if the numbers are random.
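
For reference, the gist amounts to something like the following approximation (my reconstruction, not the gist's exact code): create files under the given path and print each file's inode number and inode generation, the latter queried via the FS_IOC_GETVERSION ioctl that RocksDB itself uses on Linux:

```cpp
// Approximate reconstruction of the test: print (inode, generation) pairs
// for a series of freshly created files. On affected kernels the second
// column is random; on unaffected kernels it is sequential.
#include <fcntl.h>
#include <linux/fs.h>   // FS_IOC_GETVERSION
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <string>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <path> <count>\n", argv[0]);
    return 1;
  }
  int count = std::atoi(argv[2]);
  for (int i = 0; i < count; i++) {
    std::string name = std::string(argv[1]) + "." + std::to_string(i);
    int fd = open(name.c_str(), O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st;
    long version = 0;  // inode generation, as RocksDB reads it on Linux
    if (fstat(fd, &st) != 0 || ioctl(fd, FS_IOC_GETVERSION, &version) != 0) {
      std::perror("fstat/ioctl");
      return 1;
    }
    std::printf("%llu %ld\n", (unsigned long long)st.st_ino, version);
    close(fd);
    unlink(name.c_str());  // clean up; the gist's exact pattern may differ
  }
  return 0;
}
```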

@yiwu-arbug yiwu-arbug assigned Connor1996 and unassigned yiwu-arbug Nov 3, 2020
@scsldb scsldb added this to the v4.0.9 milestone Nov 4, 2020
@zhangjinpeng87 zhangjinpeng87 changed the title TiKV panic 'error: Corruption: L6 have overlapping ranges' TiKV panic 'error: Corruption: L6 has overlapping ranges' Nov 5, 2020
@Connor1996
Member

Connor1996 commented Nov 10, 2020

tikv/rocksdb#205 will fix it by generating the unique ID based on the DB instance and SST file number instead of the inode number.
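
Sketched, the idea looks roughly like this (simplified; the helper name and exact layout are assumptions, see tikv/rocksdb#205 for the real change): derive the cache key prefix from identifiers RocksDB assigns itself, so filesystem inode reuse can no longer make two files share a prefix.

```cpp
// Simplified post-fix scheme: the prefix comes from a per-DB-instance id and
// the SST file number, both assigned by RocksDB, instead of inode metadata.
#include <cstdint>
#include <string>

std::string MakeBlockCacheKeyFixed(const std::string& db_instance_id,
                                   uint64_t sst_file_number,
                                   uint64_t block_offset) {
  std::string key = db_instance_id;  // unique per DB instance
  key.append(reinterpret_cast<const char*>(&sst_file_number),
             sizeof(sst_file_number));
  key.append(reinterpret_cast<const char*>(&block_offset),
             sizeof(block_offset));
  return key;
}
```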

@yiwu-arbug

@Connor1996 Can you also help cherry-pick the change to disable force-consistency-checks and the change to check key ordering?

@Connor1996
Member

@Connor1996 Can you also help cherry-pick the change to disable force-consistency-checks and the change to check key ordering?

okay

Question and Bug Reports automation moved this from Need Triage to Closed(This Week) Nov 12, 2020
Engine SIG automation moved this from To Do to Done Nov 12, 2020
@lcl401615068

The same thing happened on kernel 4.18.20-2.el7.x86_64: error: L5 have overlapping ranges

@yiwu-arbug

yiwu-arbug commented Nov 19, 2020

The same thing happened on kernel 4.18.20-2.el7.x86_64: error: L5 have overlapping ranges

What's your Linux distribution and its version, and what file system do you use? Just want to keep a record. Also, can you run the test program from this comment on the server hosting TiKV and report the result? #8243 (comment)
