cloud_storage: Timequery starts from the beginning of the read-replica #16479

Closed
Lazin opened this issue Feb 5, 2024 · 2 comments · Fixed by #16503

Labels
area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

Lazin commented Feb 5, 2024

Version & Environment

Redpanda version: v23.2.8

What went wrong?

From the client's perspective, the rpk consume command hangs and then fails with a timeout error. The command is executed against the read replica. The request uses a timestamp as the query parameter, and that timestamp is larger than the timestamp of any segment in the manifest. Redpanda starts scanning the partition from the first segment available in the manifest. It correctly ignores all segments in all spillover manifests because their timestamps are smaller, but it then performs a full scan of the partition.
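
A minimal sketch of the expected lookup, assuming segment metadata shaped like the dumps under "Additional information" below; the function name is illustrative and this is not Redpanda's actual implementation:

# Illustrative only: `segments` is the STM manifest's segment list in offset order.
def find_timequery_start(segments, query_ts_ms):
    """Return the base_offset of the first segment whose max_timestamp is at
    least the queried timestamp, or None when the query is past the end."""
    for seg in segments:
        if seg["max_timestamp"] >= query_ts_ms:
            return seg["base_offset"]
    return None  # timestamp is newer than anything in the manifest

# Observed behavior in this issue: the reader is instead constructed with
# start_offset 0 (the archive start), so the scan walks the whole partition.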

What should have happened instead?

The command should return data batches.

How to reproduce the issue?

Issue a timequery with a timestamp larger than the timestamp of the last segment, then check the logs.
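
The report uses rpk consume against the read replica; as a rough client-side sketch, any client that issues a timestamp-based ListOffsets request should hit the same code path. The broker address, topic, and partition below are placeholders, not values from this issue:

from kafka import KafkaConsumer, TopicPartition

# Placeholders: point these at the read-replica cluster and topic.
consumer = KafkaConsumer(bootstrap_servers="read-replica:9092",
                         enable_auto_commit=False)
tp = TopicPartition("example-topic", 0)
consumer.assign([tp])

# Timestamp (ms) larger than the max_timestamp of the last uploaded segment.
query_ts_ms = 1706511600000

# The ListOffsets-by-timestamp call drives the timequery on the broker;
# with this bug it stalls while the partition is scanned.
offsets = consumer.offsets_for_times({tp: query_ts_ms})
print(offsets)
result = offsets[tp]
if result is not None:
    consumer.seek(tp, result.offset)
    print(consumer.poll(timeout_ms=10_000))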

Additional information

The partition manifest has the following characteristics:

  • It contains replaced segments and spillover manifests (likely irrelevant, but noted for completeness).
  • It contains 4563 segments in the main part of the log.
  • The start offset of the manifest is 12446351348.
  • archive_start_offset is 0; nine spillover manifests are present.
  • Metadata of the first segment:
{
   "is_compacted":false,
   "size_bytes":128226770,
   "committed_offset":12446827209,
   "base_offset":12446351348,
   "base_timestamp":1702285778854,
   "max_timestamp":1702286084142,
   "delta_offset":41921,
   "archiver_term":17,
   "segment_term":17,
   "delta_offset_end":41923,
   "sname_format":3,
   "metadata_size_hint": 0
 }
  • Metadata of the last segment:
{
   "is_compacted":false,
   "size_bytes":45004,
   "committed_offset":14329186424,
   "base_offset":14329186216,
   "base_timestamp":1706302318051,
   "max_timestamp":1706302795000,
   "delta_offset":51084,
   "archiver_term":17,
   "segment_term":17,
   "delta_offset_end":51086,
   "sname_format":3,
   "metadata_size_hint": 0
 }
  • There are no bogus timestamps.
  • Both base_timestamp and max_timestamp columns are monotonic.
  • max_timestamp is greater than base_timestamp for every segment in the manifest (a quick check of both properties is sketched below).
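
A minimal check of the last two properties, assuming the manifest's segment metadata has been loaded as a list of dicts in offset order (the two JSON objects above show the expected shape):

def check_timestamps(segments):
    # base_timestamp and max_timestamp are non-decreasing across segments.
    for prev, cur in zip(segments, segments[1:]):
        assert cur["base_timestamp"] >= prev["base_timestamp"]
        assert cur["max_timestamp"] >= prev["max_timestamp"]
    # Within each segment, max_timestamp exceeds base_timestamp.
    for seg in segments:
        assert seg["max_timestamp"] > seg["base_timestamp"]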

The timestamp in the query is Jan 29 2024 02:00:00, which is 1706511600000. This value is larger than the max_timestamp of the last segment, which is 1706302795000. When the query is executed, we see the following lines in the log (a quick sanity check of these values follows the excerpt):

DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3 kafka/.../0] - remote_partition.cc:1048 - remote partition make_reader invoked, config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}, num segments 0

TRACE 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:204 - Constructing reader {kafka/.../0}
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:209 - abort_source is set
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:808 - Checking timestamp {timestamp: 1706493600000} using timequery
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:577 - creating_cursor: begin: 0, end: 14329186424, stm_range[{12446351348}/14329186424]
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:168 - Manifest is not initialized
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:808 - Checking timestamp {timestamp: 1706493600000} using timequery
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:1204 - Query {{timestamp: 1706493600000}} matches with STM manifest
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber2 kafka/.../0] - async_manifest_view.cc:246 - Spill manifest range: [12446351348/14329186424], cursor range: [0/14329186424]
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:517 - partition_record_batch_reader_impl initialize reader state
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3~1608 kafka/.../0 [12446351348:12446827209]] - remote_segment.cc:234 - total 7 chunks in segment of size 128226770, max hydrated chunks at a time: 4, chunk size: 16777216
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber3 kafka/.../0] - segment_state.cc:79 - creating new reader, config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2024-02-02 23:20:36,408 [shard 2] cloud_storage - [fiber48134 kafka/.../0] - remote_partition.cc:560 - segment log offset range 12446351348-12446827209, next log offset: 12446827210, log reader config: {start_offset:{0}, max_offset:{14329135339}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1706493600000}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
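
As a sanity check of the timestamps involved, the epoch values below are copied from the paragraph above and from the logged reader config; both query values are well past the last segment's max_timestamp:

from datetime import datetime, timezone

query_ts_ms = 1706511600000         # timestamp from the report
logged_first_ts_ms = 1706493600000  # first_timestamp in the reader config above
last_seg_max_ts_ms = 1706302795000  # max_timestamp of the last segment

for ms in (query_ts_ms, logged_first_ts_ms, last_seg_max_ts_ms):
    print(ms, datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat())

assert query_ts_ms > last_seg_max_ts_ms
assert logged_first_ts_ms > last_seg_max_ts_ms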

The reader scans the segments and discards data batches because their timestamps are too small (a simplified model of this filter follows the excerpt):

TRACE 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1145 - accept_batch_start {header_crc:3562908567, size_bytes:32804, base_offset:{12446826761}, type:batch_type::raft_data, crc:-1116384543, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:129, first_timestamp:{timestamp: 1702286083739}, max_timestamp:{timestamp: 1702286083937}, producer_id:808367, producer_epoch:0, base_sequence:9601186, record_count:130, ctx:{term:{-9223372036854775808}, owner_shard:{2}}}, current delta: 41923
DEBUG 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1194 - accept_batch_start skip because header timestamp is {timestamp: 1702286083739}
DEBUG 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1223 - skip_batch_start called for 12446826761
TRACE 2024-02-02 23:20:36,413 [shard 2] cloud_storage - [fiber3~1608~0~0 kafka/.../0] - remote_segment.cc:1145 - accept_batch_start {header_crc:2004804145, size_bytes:83883, base_offset:{12446826891}, type:batch_type::raft_data, crc:-505034865, attrs:{compression:none, type:CreateTime, transactional: 0, control: 0}, last_offset_delta:318, first_timestamp:{timestamp: 1702286083942}, max_timestamp:{timestamp: 1702286084142}, producer_id:808367, producer_epoch:0, base_sequence:9601316, record_count:319, ctx:{term:{-9223372036854775808}, owner_shard:{2}}}, current delta: 41923
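
The filter producing these skip messages can be modeled roughly like this; it is a simplified illustration of the shape of the check, not the actual remote_segment.cc logic, and which header timestamp field is compared is an implementation detail:

def accept_batch_start(header_ts_ms: int, query_first_ts_ms: int) -> bool:
    # Batches whose timestamps fall before the queried timestamp cannot
    # satisfy the timequery, so the reader skips them. Because the scan
    # starts at offset 0, every batch in the older segments fails this test.
    return header_ts_ms >= query_first_ts_ms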

There are a lot of messages like this in the log:

cat redpanda.log | grep "accept_batch_start skip because header timestamp is" | wc -l
8664

The client doesn't see any data batches. Eventually it gives up waiting and closes the connection. On the tiered-storage side we can see this in the log:

INFO  2024-02-02 23:21:25,716 [shard 0] kafka - connection_context.h:141 - Connection input_shutdown; aborting operations

Lazin commented Feb 5, 2024

Classified as sev/high because it looks like a loss of availability from the client's perspective. It can also consume a lot of resources and degrade performance or availability.


Lazin commented Feb 6, 2024

Update: to reproduce this bug, the start offset of the request needs to be smaller than the start offset of the STM part of the manifest, and the max offset needs to be larger than the last offset of the manifest.
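
A minimal sketch of that triggering condition over plain integer offsets; the function and parameter names are illustrative, not Redpanda internals:

def triggers_full_scan(request_start_offset: int, request_max_offset: int,
                       stm_start_offset: int, last_manifest_offset: int) -> bool:
    # Per the update above: the request must span past both ends of the STM
    # part of the manifest for the timequery to fall into the full scan.
    return (request_start_offset < stm_start_offset
            and request_max_offset > last_manifest_offset)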
