
storage: crash at offset_translator_state.cc:88) 'data_offset >= min_data_offset' #3323

Closed
jcsp opened this issue Dec 20, 2021 · 7 comments · Fixed by #3372

@jcsp (Contributor) commented Dec 20, 2021

Version 21.11.1

Node 5 crashed first, then nodes 3 and 4 crashed a few minutes later.

Summary of crashes (including one line of context before each match, for the timestamp):

$ grep -i -e Assertion -e assert.failure -B1 /tmp/node[345].log
/tmp/node3.log-INFO  2021-12-20 11:26:16,539 [shard 3] cloud_storage - [fiber4~1 kafka/verifier-1k/652] - remote_segment.cc:166 - Start hydrating segment {"debb3a25/kafka/verifier-1k/652_40/1633-1-v1.log"}, 1 consumers are awaiting
/tmp/node3.log:redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
--
/tmp/node4.log-INFO  2021-12-20 11:24:52,601 [shard 14] raft - [group_id:410, {kafka/verifier-1k/386}] consensus.cc:1427 - Received vote request with larger term from node {id: {5}, revision: {40}}, received 6, current 5
/tmp/node4.log:redpanda: /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
--
/tmp/node5.log-INFO  2021-12-20 11:20:38,603 [shard 6] offset_translator - ntp: {kafka/verifier-1k/658} - offset_translator.cc:248 - prefix_truncate at offset: 1578, new state: {base offset/delta: {1578}/3, map size: 3, last delta: 5}
/tmp/node5.log:ERROR 2021-12-20 11:20:38,606 [shard 6] assert - Assert failure: (../../../src/v/storage/offset_translator_state.cc:88) 'data_offset >= min_data_offset' ntp {kafka/verifier-1k/658}: data offset 1132 must be inside translation range (starting at 1576)

Attached logs: node3.log, node4.log, node5.log

Workload leading up to crash:

1 leader transfer script iterating through partitions and changing the leader of each, one after the other.
1 producer writing 500000 128k messages and then sleeping for 300s, in a loop
3 random readers continuously picking a random offset, constructing a franz-go client and reading a batch
1 sequential reader, consuming continuously from offset 0 until the end, then repeating.

Topic details:

SUMMARY
=======
NAME        verifier-1k
PARTITIONS  1000
REPLICAS    3

CONFIGS
=======
KEY                     VALUE                          SOURCE
cleanup.policy          delete                         DYNAMIC_TOPIC_CONFIG
compression.type        producer                       DEFAULT_CONFIG
message.timestamp.type  CreateTime                     DEFAULT_CONFIG
partition_count         1000                           DYNAMIC_TOPIC_CONFIG
redpanda.datapolicy     function_name:  script_name:   DEFAULT_CONFIG
replication_factor      3                              DYNAMIC_TOPIC_CONFIG
retention.bytes         134217728                      DYNAMIC_TOPIC_CONFIG
retention.ms            604800000                      DEFAULT_CONFIG
segment.bytes           67108864                       DYNAMIC_TOPIC_CONFIG

@jcsp jcsp added kind/bug Something isn't working area/cloud-storage Shadow indexing subsystem labels Dec 20, 2021
@jcsp jcsp changed the title from "cloud_storage: crash with ~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed." to "cloud_storage: crash at offset_translator_state.cc:88) 'data_offset >= min_data_offset'" Dec 20, 2021
@ztlpn (Contributor) commented Dec 20, 2021

Backtrace of the assert at node5:

/opt/redpanda/libexec/redpanda: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /opt/redpanda/lib/ld.so, for GNU/Linux 3.2.0, BuildID[sha1]=d9a764cd1162c67d907d1be6df49bb2a2db82940, stripped

0x3a3cb04 0x3205302 0x1c4bde3 0x1c1078b 0x1c1658a 0x37b8154 0x37bb617 0x380b45d 0x376913f /opt/redpanda/lib/libpthread.so.0+0x9298 /opt/redpanda/lib/libc.so.6+0x1006a2
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_3>(seastar::current_backtrace_tasklocal()::$_3&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::current_backtrace_tasklocal() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:86
 (inlined by) seastar::current_tasktrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:137
 (inlined by) seastar::current_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:170
storage::offset_translator_state::to_log_offset(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >) const at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-058fbd9702ee6e370-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/storage/offset_translator_state.cc:82
kafka::replicated_partition::aborted_transactions(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, seastar::lw_shared_ptr<storage::offset_translator_state const>) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-058fbd9702ee6e370-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/kafka/server/replicated_partition.cc:115
kafka::partition_proxy::aborted_transactions(detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, detail::base_named_type<long, model::model_offset_type, std::__1::integral_constant<bool, true> >, seastar::lw_shared_ptr<storage::offset_translator_state const>) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-058fbd9702ee6e370-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/kafka/server/partition_proxy.h:76
 (inlined by) kafka::read_from_partition(kafka::partition_proxy, kafka::fetch_config, bool, std::__1::optional<std::__1::chrono::time_point<seastar::lowres_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > > >) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-058fbd9702ee6e370-1/vectorized/redpanda/vbuild/release/clang/../../../src/v/kafka/server/handlers/fetch.cc:163
std::experimental::coroutines_v1::coroutine_handle<void>::resume() at /vectorized/llvm/bin/../include/c++/v1/experimental/coroutine:121
 (inlined by) seastar::internal::coroutine_traits_base<kafka::read_result>::promise_type::run_and_dispose() at /vectorized/include/seastar/core/coroutine.hh:74
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2230
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2639
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2808
operator() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3996
 (inlined by) _ZNSt3__18__invokeIRZN7seastar3smp9configureEN5boost15program_options13variables_mapENS1_14reactor_configEE4$_92JEEEDTclclsr3std3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS9_DpOSA_ at /vectorized/llvm/bin/../include/c++/v1/type_traits:3694
 (inlined by) void std::__1::__invoke_void_return_wrapper<void, true>::__call<seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92&>(seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92&) at /vectorized/llvm/bin/../include/c++/v1/__functional_base:348
 (inlined by) std::__1::__function::__alloc_func<seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92, std::__1::allocator<seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92>, void ()>::operator()() at /vectorized/llvm/bin/../include/c++/v1/functional:1558
 (inlined by) std::__1::__function::__func<seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92, std::__1::allocator<seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_92>, void ()>::operator()() at /vectorized/llvm/bin/../include/c++/v1/functional:1732
std::__1::__function::__value_func<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/functional:1885
 (inlined by) std::__1::function<void ()>::operator()() const at /vectorized/llvm/bin/../include/c++/v1/functional:2560
 (inlined by) seastar::posix_thread::start_routine(void*) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/posix.cc:60
/opt/redpanda/lib/libpthread.so.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7b0cdaf878ab4f99078439d864af70a5fd7b5a2c, for GNU/Linux 3.2.0, stripped

__libpthread_freeres at ??:?
/opt/redpanda/lib/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e3c087aca9b39549e4ba92c451f1e399b586e28, for GNU/Linux 3.2.0, stripped

clone at ??:?

So I guess it is not directly related to shadow indexing - I think what happened is that prefix truncation happened right after we read some data from the local start of the log, but before we got the aborted transactions for that range of offsets. And the fix is to hold onto segments that we read from for a little longer in the fetch session.
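
For context, here is a minimal sketch of the shape of the check that fires at offset_translator_state.cc:88 (invented names and a plain assert, not the actual Redpanda source): once prefix_truncate has advanced the translation state's base, any lookup for a data offset below it trips the assertion.

#include <cassert>
#include <cstdint>

// Stand-in for the translation state: prefix_truncate() advances the base,
// and to_log_offset() refuses to translate anything below it.
struct offset_translator_state_sketch {
    int64_t base_data_offset = 0; // lowest data offset we can still translate
    int64_t base_delta = 0;       // delta to apply at or after the base

    int64_t to_log_offset(int64_t data_offset) const {
        // The condition reported in the node5 crash: "data offset 1132 must
        // be inside translation range (starting at 1576)".
        assert(data_offset >= base_data_offset &&
               "data_offset >= min_data_offset");
        return data_offset + base_delta; // the real code walks a delta map
    }

    void prefix_truncate(int64_t new_base, int64_t new_delta) {
        base_data_offset = new_base;
        base_delta = new_delta;
    }
};

A fetch started before the truncation can still ask about offsets below the new base, which matches the node5 log above: prefix_truncate at 1578, followed a few milliseconds later by a translation request for data offset 1132.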

@jcsp (Contributor, Author) commented Dec 20, 2021

Just reproduced the ~file_data_source_impl() assertion (Assertion `_reads_in_progress == 0' failed.) on 21.11.3-beta2. Going to try again on a non-SI topic to see if it reproduces there too.

@ztlpn (Contributor) commented Dec 20, 2021

The ~file_data_source_impl() assertion is probably shadow indexing-specific. So I guess these are two distinct problems.

@jcsp (Contributor, Author) commented Dec 20, 2021

Just confirmed both assertions are reproducible with 21.11.3-beta2.

Agree with Alexey that these are distinct. On subsequent runs they happened independently of one another, so I think it was just a coincidence that in my original test run they happened at about the same time. I'll open a separate ticket.

@jcsp jcsp changed the title from "cloud_storage: crash at offset_translator_state.cc:88) 'data_offset >= min_data_offset'" to "storage: crash at offset_translator_state.cc:88) 'data_offset >= min_data_offset'" Dec 20, 2021
@jcsp (Contributor, Author) commented Dec 20, 2021

Just reproduced this on a non-SI topic -- 4/6 nodes asserted out in about 20 minutes of runtime.

The case of an extremely short retention window is likely to be more common in SI systems, but this appears to be a preexisting bug.

@ztlpn ztlpn assigned ztlpn and unassigned Lazin Dec 21, 2021
@ztlpn (Contributor) commented Dec 24, 2021

The problem arises because log eviction proceeds in the following stages:

  1. There is a background loop in the log that periodically checks which segments can be evicted according to the eviction policy. When it notices that some segments can be evicted, it sends a notification to the raft layer via _eviction_monitor, but doesn't delete anything yet.
  2. The raft layer creates a snapshot, prefix-truncates the configurations and the offset translator, and notifies the log (via log::set_collectible_offset) that it can now actually delete the segments. After this step, new fetches will start from the new start offset.
  3. Some time after that, the log eviction loop takes the locks and actually deletes the segments.

The assert fires because between steps 2 and 3 we can still have readers for the prefix-truncated offset range, but the offset translation information is already gone.
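
A condensed, self-contained sketch of that ordering (stand-in names and plain integers, not the actual storage/raft code), to make the window explicit:

#include <cstdint>
#include <optional>

struct eviction_state {
    std::optional<int64_t> pending_evict_offset; // set in stage 1
    int64_t collectible_offset = -1;             // set in stage 2
    int64_t translator_base = 0;                 // advanced in stage 2
    int64_t log_start = 0;                       // advanced only in stage 3
};

// Stage 1: the background loop notices evictable segments and notifies raft
// (via _eviction_monitor in the real code) without deleting anything.
void eviction_loop_notices(eviction_state& st, int64_t evictable_up_to) {
    st.pending_evict_offset = evictable_up_to;
}

// Stage 2: raft snapshots, prefix-truncates the offset translator and marks
// the segments collectible (log::set_collectible_offset in the real code).
void raft_handles_notification(eviction_state& st) {
    if (!st.pending_evict_offset) {
        return;
    }
    st.translator_base = *st.pending_evict_offset + 1;
    st.collectible_offset = *st.pending_evict_offset;
}

// Stage 3: only a later iteration of the eviction loop deletes the segments.
void eviction_loop_deletes(eviction_state& st) {
    st.log_start = st.collectible_offset + 1;
}

// Between stages 2 and 3, log_start has not moved yet, so an in-flight fetch
// can still be reading the old segments, but translator_base has already
// advanced past them - exactly the state in which to_log_offset() asserts.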

Discussed mitigation strategies:

  • (Stopgap measure suggested by @jcsp): throw an exception from the offset translator, convert it into the OFFSET_OUT_OF_RANGE kafka error, and rely on the client to retry (a rough sketch of this conversion follows after this list). This kind of works for non-SI partitions: if we fetch from o_1 to o_2 and prefix truncation concurrently happens at an offset o_t inside the [o_1 - o_2] range, the client will retry, but there will be a gap [o_1 - o_t] in the fetched offset range (not really a problem, because the client can expect gaps if it reads close to the log start). But for an SI partition this doesn't work, because o_t is actually deep inside the partition offset range that we report to the client (o_t is not the log start but the boundary between remote and local data). So the client will retry but start from the beginning - leading to data duplication.
  • RW-lock inside raft. Just as we do with segments, we can take the read lock when fetching and take the write lock when doing a destructive operation (like prefix truncation). This will work but can lead to latency spikes on the read path. That could be solved by some kind of range-locking scheme (so that reads starting after the prefix-truncation offset don't wait for truncation to finish), but this looks complicated.
  • Split the notifications that the log sends to raft into two: "eviction intention" (after which raft will take a snapshot) and "actual segments deleted" (after which raft will prefix-truncate offset translator).
  • Actually delete the segments immediately after writing the raft snapshot instead of relying on the next iteration of the log eviction loop (but before prefix-truncating the offset translator). This can be done but is disabled now: write_snapshot_cfg::should_prefix_truncate::no is passed to consensus::write_snapshot, but according to @mmaslankaprv there is no good reason for this.
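
For the first (stopgap) option, a rough sketch of the proposed conversion (hypothetical names, not the actual fetch path): the offset translator would throw instead of asserting, and the fetch handler would map that to OFFSET_OUT_OF_RANGE.

#include <stdexcept>

// Hypothetical error type the offset translator would throw instead of
// vassert-ing when asked about a prefix-truncated range.
struct offset_range_truncated : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class kafka_error_code { none, offset_out_of_range };

kafka_error_code fetch_from_partition_sketch(/* partition, fetch config, ... */) {
    try {
        // ... read batches, then collect aborted transactions, which goes
        // through offset_translator_state::to_log_offset() and may now throw:
        // auto log_offset = translator.to_log_offset(fetch_offset);
        return kafka_error_code::none;
    } catch (const offset_range_truncated&) {
        // The client retries from the reported partition start. Fine for
        // non-SI partitions; for SI partitions the truncation point is deep
        // inside the advertised offset range, so the retry duplicates data.
        return kafka_error_code::offset_out_of_range;
    }
}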

@ztlpn (Contributor) commented Dec 29, 2021

The linked PR #3372 implements the last strategy (actually deleting the segments immediately after writing the raft snapshot instead of relying on the next iteration of the log eviction loop).
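
Roughly the shape of that strategy, as a stand-alone sketch (stand-in types; the real change goes through consensus::write_snapshot and write_snapshot_cfg::should_prefix_truncate as described above):

#include <cstdint>

enum class should_prefix_truncate { no, yes };

struct log_stub {
    int64_t start_offset = 0;
    void prefix_truncate(int64_t new_start) { start_offset = new_start; }
};

struct consensus_stub {
    log_stub& log;
    int64_t translator_base = 0;

    void write_snapshot(int64_t last_included_offset, should_prefix_truncate t) {
        // ... persist the snapshot and prefix-truncate configurations ...
        if (t == should_prefix_truncate::yes) {
            // Delete the covered segments right away instead of leaving that
            // to the next eviction-loop pass...
            log.prefix_truncate(last_included_offset + 1);
        }
        // ...and only then drop the offset-translation state for them, so
        // there is no window with readable segments whose translation info
        // is already gone.
        translator_base = last_included_offset + 1;
    }
};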
