cloud_storage: Timequery starts from the beginning of the read-replica #16479

Closed
Lazin opened this issue Feb 5, 2024 · 2 comments · Fixed by #16503

Labels
area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

Lazin commented Feb 5, 2024

Version & Environment

Redpanda version: v23.2.8

What went wrong?

From the client's perspective, the rpk consume command hangs and then fails with a timeout error. The command is executed against the read replica. The request uses a timestamp as the query parameter, and that timestamp is larger than the timestamp of any segment in the manifest. Redpanda starts scanning the partition from the first segment available in the manifest. It correctly ignores all segments in all spillover manifests because their timestamps are smaller, but it then performs a full scan of the partition.
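
A minimal sketch of the expected lookup, assuming segment metadata shaped like the dumps under "Additional information" below; the function name is illustrative and this is not Redpanda's actual implementation:

# Illustrative only: `segments` is the STM manifest's segment list in offset order.
def find_timequery_start(segments, query_ts_ms):
    """Return the base_offset of the first segment whose max_timestamp is at
    least the queried timestamp, or None when the query is past the end."""
    for seg in segments:
        if seg["max_timestamp"] >= query_ts_ms:
            return seg["base_offset"]
    return None  # timestamp is newer than anything in the manifest

# Observed behavior in this issue: the reader is instead constructed with
# start_offset 0 (the archive start), so the scan walks the whole partition.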

What should have happened instead?

The command should return data batches.

How to reproduce the issue?

Issue a timequery with a timestamp larger than the timestamp of the last segment, then check the logs.
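
The report uses rpk consume against the read replica; as a rough client-side sketch, any client that issues a timestamp-based ListOffsets request should hit the same code path. The broker address, topic, and partition below are placeholders, not values from this issue:

from kafka import KafkaConsumer, TopicPartition

# Placeholders: point these at the read-replica cluster and topic.
consumer = KafkaConsumer(bootstrap_servers="read-replica:9092",
                         enable_auto_commit=False)
tp = TopicPartition("example-topic", 0)
consumer.assign([tp])

# Timestamp (ms) larger than the max_timestamp of the last uploaded segment.
query_ts_ms = 1706511600000

# The ListOffsets-by-timestamp call drives the timequery on the broker;
# with this bug it stalls while the partition is scanned.
offsets = consumer.offsets_for_times({tp: query_ts_ms})
print(offsets)
result = offsets[tp]
if result is not None:
    consumer.seek(tp, result.offset)
    print(consumer.poll(timeout_ms=10_000))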

Additional information

The partition manifest has the following characteristics:

  • It contains replaced segments and spillover manifests (likely irrelevant, but noted for completeness).
  • It contains 4563 segments in the main part of the log.
  • The start offset of the manifest is 12446351348.
  • archive_start_offset is 0; nine spillover manifests are present.
  • Metadata of the first segment:
{
   "is_compacted":false,
   "size_bytes":128226770,
   "committed_offset":12446827209,
   "base_offset":12446351348,
   "base_timestamp":1702285778854,
   "max_timestamp":1702286084142,
   "delta_offset":41921,
   "archiver_term":17,
   "segment_term":17,
   "delta_offset_end":41923,
   "sname_format":3,
   "metadata_size_hint": 0
 }
  • Metadata of the last segment:
{
   "is_compacted":false,
   "size_bytes":45004,
   "committed_offset":14329186424,
   "base_offset":14329186216,
   "base_timestamp":1706302318051,
   "max_timestamp":1706302795000,
   "delta_offset":51084,
   "archiver_term":17,
   "segment_term":17,
   "delta_offset_end":51086,
   "sname_format":3,
   "metadata_size_hint": 0
 }
  • There are no bogus timestamps.
  • Both base_timestamp and max_timestamp columns are monotonic.
  • max_timestamp is greater than base_timestamp for every segment in the manifest (a quick check of both properties is sketched below).
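
A minimal check of the last two properties, assuming the manifest's segment metadata has been loaded as a list of dicts in offset order (the two JSON objects above show the expected shape):

def check_timestamps(segments):
    # base_timestamp and max_timestamp are non-decreasing across segments.
    for prev, cur in zip(segments, segments[1:]):
        assert cur["base_timestamp"] >= prev["base_timestamp"]
        assert cur["max_timestamp"] >= prev["max_timestamp"]
    # Within each segment, max_timestamp exceeds base_timestamp.
    for seg in segments:
        assert seg["max_timestamp"] > seg["base_timestamp"]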

The timestamp in the query is Jan 29 2024 02:00:00, which is 1706511600000. This value is larger than the max_timestamp of the last segment, which is 1706302795000. When the query is executed, we see the following lines in the log (a quick sanity check of these values follows the excerpt):

DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3 kafka/.../0] - remote_partition.cc:1048 - remote partition make_reader invoked, config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}, num segments 0

TRACE 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:204 - Constructing reader {kafka/.../0}
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:209 - abort_source is set
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:808 - Checking timestamp {timestamp: 1706493600000} using timequery
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:577 - creating_cursor: begin: 0, end: 14329186424, stm_range[{12446351348}/14329186424]
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:168 - Manifest is not initialized
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:808 - Checking timestamp {timestamp: 1706493600000} using timequery
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:1204 - Query {{timestamp: 1706493600000}} matches with STM manifest
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:246 - Spill manifest range: [12446351348/14329186424], cursor range: [0/14329186424]
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:517 - partition_record_batch_reader_impl initialize reader state
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3~1608 kafka/.../0 [12446351348:12446827209]] - remote_segment.cc:234 - total 7 chunks in segment of size 128226770, max hydrated chunks at a time: 4, chunk size: 16777216
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3 kafka/.../0] - segment_state.cc:79 - creating new reader, config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:560 - segment log offset range 12446351348-12446827209, next log offset: 12446827210, log reader config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
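
As a sanity check of the timestamps involved, the epoch values below are copied from the paragraph above and from the logged reader config; both query values are well past the last segment's max_timestamp:

from datetime import datetime, timezone

query_ts_ms = 1706511600000         # timestamp from the report
logged_first_ts_ms = 1706493600000  # first_timestamp in the reader config above
last_seg_max_ts_ms = 1706302795000  # max_timestamp of the last segment

for ms in (query_ts_ms, logged_first_ts_ms, last_seg_max_ts_ms):
    print(ms, datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat())

assert query_ts_ms > last_seg_max_ts_ms
assert logged_first_ts_ms > last_seg_max_ts_ms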

The reader scans the segments and discards data batches because their timestamps are too small (a simplified model of this filter follows the excerpt):

TRACE 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1145 - accept_batch_start {header_crc:3562908567, size_bytes:32804, base_offset:{12446826761}, type:batch_type::raft_data, crc:-1116384543, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:129, first_timestamp:{timestamp: 1702286083739}, max_timestamp:{timestamp: 1702286083937}, producer_id:808367, producer_epoch:0, base_sequence:9601186, record_count:130, ctx:{term:{-9223372036854775808}, owner_shard:{2}}}, current delta: 41923
DEBUG 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1194 - accept_batch_start skip because header timestamp is {timestamp: 1702286083739}
DEBUG 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1223 - skip_batch_start called for 12446826761
TRACE 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1145 - accept_batch_start {header_crc:2004804145, size_bytes:83883, base_offset:{12446826891}, type:batch_type::raft_data, crc:-505034865, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:318, first_timestamp:{timestamp: 1702286083942}, max_timestamp:{timestamp: 1702286084142}, producer_id:808367, producer_epoch:0, base_sequence:9601316, record_count:319, ctx:{term:{-9223372036854775808}, owner_shard:{2}}}, current delta: 41923
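
The filter producing these skip messages can be modeled roughly like this; it is a simplified illustration of the shape of the check, not the actual remote_segment.cc logic, and which header timestamp field is compared is an implementation detail:

def accept_batch_start(header_ts_ms: int, query_first_ts_ms: int) -> bool:
    # Batches whose timestamps fall before the queried timestamp cannot
    # satisfy the timequery, so the reader skips them. Because the scan
    # starts at offset 0, every batch in the older segments fails this test.
    return header_ts_ms >= query_first_ts_ms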

There are a lot of messages like this in the log:

cat redpanda.log | grep "accept_batch_start skip because header timestamp is" | wc -l
8664

The client doesn't see any data batches. Eventually it gives up waiting and closes the connection. On the tiered-storage side we can see this in the log:

INFO  2024-02-02 23:21:25,716 [shard 0] kafka - connection_context.h:141 - Connection input_shutdown; aborting operations

Lazin commented Feb 5, 2024

Classified as sev/high because it looks like a loss of availability from the client's perspective. It can also consume a lot of resources and degrade performance or availability.


Lazin commented Feb 6, 2024

Update: to reproduce this bug, the start offset of the request needs to be smaller than the start offset of the STM part of the manifest, and the max offset needs to be larger than the last offset of the manifest.
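
A minimal sketch of that triggering condition over plain integer offsets; the function and parameter names are illustrative, not Redpanda internals:

def triggers_full_scan(request_start_offset: int, request_max_offset: int,
                       stm_start_offset: int, last_manifest_offset: int) -> bool:
    # Per the update above: the request must span past both ends of the STM
    # part of the manifest for the timequery to fall into the full scan.
    return (request_start_offset < stm_start_offset
            and request_max_offset > last_manifest_offset)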
