CI Failure (consumer timed out) in CloudStorageTimingStressTest.test_cloud_storage #15042

Closed
abhijat opened this issue Nov 20, 2023 · 27 comments · Fixed by #16645
Assignees: andrwng
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

abhijat (Contributor) commented Nov 20, 2023

https://buildkite.com/redpanda/redpanda/builds/41212
https://buildkite.com/redpanda/redpanda/builds/41374

Module: rptest.tests.cloud_storage_timing_stress_test
Class: CloudStorageTimingStressTest
Method: test_cloud_storage
Arguments: {
    "cleanup_policy": "compact,delete"
}
test_id:    CloudStorageTimingStressTest.test_cloud_storage
status:     FAIL
run time:   141.244 seconds

TimeoutError('Workload did not complete within 120s: Consumer consumed only 1225 out of 15360 messages')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 505, in test_cloud_storage
    while not self.is_complete():
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 370, in is_complete
    raise TimeoutError(
TimeoutError: Workload did not complete within 120s: Consumer consumed only 1225 out of 15360 messages
abhijat (Contributor, Author) commented Nov 23, 2023

rockwotj (Contributor) commented Dec 1, 2023

Different test (with_partition_moves), same failures: https://buildkite.com/redpanda/redpanda/builds/42019

dotnwat added the area/cloud-storage (Shadow indexing subsystem) label Dec 15, 2023
andrwng (Contributor) commented Dec 20, 2023

Looked at https://ci-artifacts.dev.vectorized.cloud/redpanda/42577/018c5cd0-0abd-4319-af6a-3a479f61f53b/vbuild/ducktape/results/final/CloudStorageTimingStressTest/test_cloud_storage/cleanup_policy=compact.delete/70/RedpandaService-0-281472799218544/

On the client side, at a certain point we see several attempts to connect, but they are immediately shut down:

time="2023-12-12T07:23:26Z" level=debug msg="Calling PollFetches (lwm=[1218] status {\"name\":\"\",\"valid_reads\":1250,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":1349}})"
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] opening connection to broker; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] connection opened to broker; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] connection initialized successfully; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] wrote Fetch v11; broker: 1, bytes_written: 133, write_wait: 606.124µs, time_to_write: 21.956µs, err: <nil>
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] read Fetch v11; broker: 1, bytes_read: 0, read_wait: 37.193µs, time_to_read: 3.532833ms, err: EOF
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[WARN] read from broker errored, killing connection after 0 successful responses (is SASL missing?); addr: docker-rp-31:9092, broker: 1, err: EOF
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] opening connection to broker; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] connection opened to broker; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] connection initialized successfully; addr: docker-rp-31:9092, broker: 1
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] wrote Fetch v11; broker: 1, bytes_written: 133, write_wait: 597.131µs, time_to_write: 24.516µs, err: <nil>
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[DEBUG] read Fetch v11; broker: 1, bytes_read: 0, read_wait: 31.835µs, time_to_read: 3.761087ms, err: EOF
time="2023-12-12T07:23:26Z" name=KgoVerifierSeqConsumer-0-281471353003584[WARN] read from broker errored, killing connection after 0 successful responses (is SASL missing?); addr: docker-rp-31:9092, broker: 1, err: EOF

On the broker side, we see connections being repeatedly shut down as well:

TRACE 2023-12-12 07:23:24,650 [shard 0:main] kafka - server.cc:178 - shutting down connection 172.16.16.13:53406
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:119 - put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:449 - do_put(::)
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:457 - do_put: release directly to home allowance={ 2040/2048 reclaim=false}
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:449 - do_put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:465 - do_put: release to local borrowed
TRACE 2023-12-12 07:23:24,650 [shard 0:main] rpc - conn_quota.cc:131 - leaving put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] kafka - connection_context.h:168 - stopped connection context for 172.16.16.13:53398
TRACE 2023-12-12 07:23:24,650 [shard 1:main] kafka - server.cc:178 - shutting down connection 172.16.16.13:53398
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:119 - put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:449 - do_put(::)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:465 - do_put: release to local borrowed
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:449 - do_put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:457 - do_put: release directly to home allowance={ 1021/1024 reclaim=false}
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:131 - leaving put(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:143 - get(172.16.16.13)
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:180 - got local borrowed token
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:363 - home_get_units(172.16.16.13) allowance={ 1022/1024 reclaim=false}
TRACE 2023-12-12 07:23:24,650 [shard 1:main] rpc - conn_quota.cc:379 - home_get_units: fast path got=true
TRACE 2023-12-12 07:23:24,650 [shard 1:main] kafka - server.cc:272 - kafka rpc protocol - Incoming connection from 172.16.16.13:53422 on "dnslistener"
DEBUG 2023-12-12 07:23:24,650 [shard 1:main] kafka - server.cc:307 - max_reauth_ms: 0

andrwng (Contributor) commented Jan 5, 2024

Marking sev/high since this results in a stuck consumer.

Looking at https://ci-artifacts.dev.vectorized.cloud/redpanda/43310/018c9296-a767-47d6-9645-d7176cee188e/vbuild/ducktape/results/final/CloudStorageTimingStressTest/test_cloud_storage/cleanup_policy=delete/97/

Similar to the above, there are a number of "is SASL missing?" warnings in the consumer logs.

This time around I noticed that they correspond to some warnings at the Kafka layer: the broker is reporting a NotFound error when performing a timequery against S3. It appears the timestamp of interest leads us to attempt to download a segment that should no longer exist in the archive.

DEBUG 2023-12-22 18:05:15,929 [shard 1:au  ] archival - [fiber4~179|0|30000ms kafka/test-topic/0] - ntp_archiver_service.cc:1148 - Uploading segment {source segment offsets: {term:1, base_offset:7278, committed_offset:7410, dirty_offset:7410}, exposed_name: {7278-1-v1.log}, starting_offset: 7278, file_offset: 0, content_length: 4213540, final_offset: 7410, final_file_offset: 4213540, term: 1, source names: {/var/lib/redpanda/data/kafka/test-topic/0_28/7278-1-v1.log}} to {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"}
DEBUG 2023-12-22 18:05:15,930 [shard 1:au  ] cloud_storage - [fiber4~179~0|1|30000ms] - remote.cc:536 - Uploading segment to path {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"}, length 4213540
...
INFO  2023-12-22 18:05:46,072 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2313 - Enqueuing spillover segment delete from cloud storage: {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"}
...
DEBUG 2023-12-22 18:05:46,675 [shard 1:main] s3 - s3_client.cc:436 - NoSuchKey response received {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1.index"}
WARN  2023-12-22 18:05:46,675 [shard 1:main] cloud_storage - [fiber3~36~2~0|1|59998ms] - remote.cc:877 - Downloading index from {panda-bucket-940eaa78-a0f4-11ee-a4e1-0242ac10101c}, {key_not_found}, index at {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1.index"} not available
WARN  2023-12-22 18:05:46,675 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:1587 - Failed to hydrate index e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1.index: cloud_storage::download_exception (NotFound)
...
TRACE 2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - cache_service.cc:1145 - reserve_space_release: releasing 4215598/1 reserved bytes/objects (wrote 0/0)
WARN  2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:1587 - Failed to hydrate segment e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1: cloud_storage::download_exception (NotFound)
WARN  2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:468 - exception thrown while reading from remote_partition: NotFound
DEBUG 2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber3~36~206 kafka/test-topic/0] - remote_segment.cc:1466 - [{nullopt}] remote_segment_batch_reader::stop
TRACE 2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:261 - Destructing reader {kafka/test-topic/0}
WARN  2023-12-22 18:06:54,962 [shard 0:main] kafka - connection_context.cc:675 - Error processing request: cloud_storage::download_exception (NotFound)

Here's where we can see it's a timequery:

TRACE 2023-12-22 18:05:37,629 [shard 1:main] kafka - requests.cc:96 - [172.16.16.6:50878] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-140130444990960, mem_units: 8182, ctx_size: 41
TRACE 2023-12-22 18:05:37,629 [shard 1:main] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-140130444990960}] handling list_offsets v4 request {replica_id=-1 isolation_level=0 topics={{name={test-topic} partitions={{partition_index=0 current_leader_epoch=1 timestamp={timestamp: 1703268307239} max_num_offsets=0}}}}}
...
WARN  2023-12-22 18:05:37,632 [shard 1:main] kafka - connection_context.cc:675 - Error processing request: cloud_storage::download_exception (NotFound)
DEBUG 2023-12-22 18:05:37,632 [shard 1:main] kafka - connection_context.cc:85 - Connection input_shutdown; aborting operations for 172.16.16.6:50878
TRACE 2023-12-22 18:05:37,632 [shard 1:main] kafka - connection_context.cc:103 - stopping connection context for 172.16.16.6:50878
TRACE 2023-12-22 18:05:37,632 [shard 1:main] kafka - connection_context.cc:111 - stopped connection context for 172.16.16.6:50878

And we can see the archive start offset has already been moved ahead of segment 7278-7410:

DEBUG 2023-12-22 18:05:46,072 [shard 1:main] cluster - ntp: {kafka/test-topic/0} - archival_metadata_stm.cc:1432 - Updating archive start offset, current value 7147, update 7411
...
DEBUG 2023-12-22 18:05:47,486 [shard 1:main] cluster - ntp: {kafka/test-topic/0} - archival_metadata_stm.cc:1432 - Updating archive start offset, current value 7411, update 7804
...
INFO  2023-12-22 18:05:47,486 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2266 - Garbage collecting archive segments in offest range [7411, 7804)
INFO  2023-12-22 18:05:47,486 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2313 - Enqueuing spillover segment delete from cloud storage: {"3cadd0c9/kafka/test-topic/0_28/7411-7541-4213287-1-v1.log.1"}
INFO  2023-12-22 18:05:47,486 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2313 - Enqueuing spillover segment delete from cloud storage: {"3f1b03ed/kafka/test-topic/0_28/7542-7672-4213287-1-v1.log.1"}
INFO  2023-12-22 18:05:47,486 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2313 - Enqueuing spillover segment delete from cloud storage: {"8f769b1d/kafka/test-topic/0_28/7673-7803-4213287-1-v1.log.1"}

What's odd is that we appear to be repeatedly attempting this download. Granted, this segment may not be expected to exist according to the archive, but then why do we keep trying to download it?

cc @Lazin
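
To make the loop above concrete, here is a deliberately simplified model of the failure mode. Everything in it (the types, function names, and hard-coded offsets) is hypothetical and only loosely based on the logs above, not Redpanda's actual code: the timequery keeps resolving to a manifest entry for a segment that retention has already deleted, hydration fails with NotFound, the connection is torn down, and the client retries the same timestamp.

// Hypothetical sketch only: invented types and values, not the real
// cloud_storage code paths.
#include <cstdint>
#include <iostream>
#include <optional>
#include <stdexcept>

struct segment_ref {
    int64_t base_offset;
    int64_t committed_offset;
};

// The (stale) manifest still maps this old timestamp to segment [7278, 7410],
// even though the archive start offset has since moved to 7804.
std::optional<segment_ref> resolve_timequery(int64_t /*timestamp*/) {
    return segment_ref{7278, 7410};
}

void hydrate_from_s3(const segment_ref& /*seg*/) {
    // The object was already deleted by retention, so the GET returns
    // NoSuchKey and the download surfaces as a NotFound exception.
    throw std::runtime_error("NotFound");
}

int main() {
    constexpr int64_t query_ts = 1703268307239; // timestamp from the logs above
    for (int attempt = 0; attempt < 3; ++attempt) {
        try {
            auto seg = resolve_timequery(query_ts);
            hydrate_from_s3(*seg); // never succeeds
        } catch (const std::exception& e) {
            // The broker logs "Error processing request: ... (NotFound)" and
            // shuts the connection; the client reconnects and retries the same
            // timestamp, so the consumer makes no progress.
            std::cout << "attempt " << attempt << " failed: " << e.what()
                      << ", connection closed, client retries\n";
        }
    }
}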

andrwng added the sev/high (loss of availability, pathological performance degradation, recoverable corruption) label Jan 5, 2024
andrwng (Contributor) commented Jan 5, 2024

Here's a more complete sequence of a download failure:

DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cluster - partition.cc:661 - timequery (cloud) {kafka/test-topic/0} t={timestamp: 1703268307239} max_offset(k)=15360
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3 kafka/test-topic/0] - remote_partition.cc:1113 - remote partition make_reader invoked (waiting for units), config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}, num segments 2
TRACE 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3 kafka/test-topic/0] - remote_partition.cc:1125 - remote partition make_reader invoked (units acquired), config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}, num segments 2
TRACE 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:226 - Constructing reader {kafka/test-topic/0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:231 - abort_source is set
TRACE 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:250 - partition_record_batch_reader_impl:: start: creating cursor: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:602 - creating_cursor: begin: 7804, end: 15802, stm_range[{13875}/15802]
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:178 - Manifest is not initialized
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:873 - Checking timestamp {timestamp: 1703268307239} using timequery
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:873 - Checking timestamp {timestamp: 1703268307239} using timequery
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:1472 - search_spillover_manifest query: {{timestamp: 1703268307239}}, num manifests: 5, first: {timestamp: 1703268315071}, last: {timestamp: 1703268340066}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:1311 - Found spillover manifest meta: {{is_compacted: false, size_bytes: 42133382, base_offset: 7278, committed_offset: 8592, base_timestamp: {timestamp: 1703268315071}, max_timestamp: {timestamp: 1703268320067}, delta_offset: 184, ntp_revision: 28, archiver_term: 1, segment_term: 1, delta_offset_end: 219, sname_format: {v3}, metadata_size_hint: 3128}}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - materialized_manifest_cache.cc:220 - Cache GET will return [{kafka/test-topic/0}:7278]
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:268 - STM manifest range: [7278/8592], cursor range: [7804/15802]
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:580 - partition_record_batch_reader_impl initialize reader state, config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3 kafka/test-topic/0] - segment_state.cc:79 - creating new reader, config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:624 - segment log offset range 7278-7410, next log offset: 7411, log reader config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
TRACE 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:255 - partition_record_batch_reader_impl:: start: created cursor: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:634 - maybe_reset_reader called
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:656 - maybe_reset_reader, config start_offset: 7606, reader max_offset: 7410
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:773 - maybe_reset_reader completed, reader is present: true, is end of stream: false
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber712 kafka/test-topic/0] - remote_partition.cc:390 - Invoking 'read_some' on current log reader with config: {start_offset:{7606}, max_offset:{15360}, min_bytes:0, max_bytes:2048, type_filter:batch_type::raft_data, first_timestamp:{timestamp: 1703268307239}, bytes_consumed:0, over_budget:0, strict_max_bytes:0, skip_batch_cache:0, abortable:1, aborted:0}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36~206 kafka/test-topic/0] - remote_segment.cc:1421 - remote_segment_batch_reader::init_parser (creating stream), start_offset: 7606, client: {nullopt}
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:285 - remote segment file input stream at offset 7606
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:958 - segment {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"} hydration requested
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:808 - adding full segment to hydrate paths list
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:814 - removing index from hydrate paths list
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:824 - Segment {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"} requested, 1 consumers are awaiting, data file is not available
DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - cache_service.cc:1069 - Checking e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1.tx in archival cache.
...
WARN  2023-12-22 18:06:54,962 [shard 1:main] s3 - s3_client.cc:571 - S3 GET request failed: Not Found [Accept-Ranges: bytes];[Content-Length: 445];[Content-Security-Policy: block-all-mixed-content];[Content-Type: application/xml];[Server: MinIO];[Strict-Transport-Security: max-age=31536000; includeSubDomains];[Vary: Origin];[Vary: Accept-Encoding];[X-Amz-Request-Id: 17A339995EF1E0A0];[X-Content-Type-Options: nosniff];[X-Xss-Protection: 1; mode=block];[Date: Fri, 22 Dec 2023 18:06:54 GMT];
DEBUG 2023-12-22 18:06:54,962 [shard 1:main] s3 - s3_client.cc:436 - NoSuchKey response received {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"}
WARN  2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber3~36~208~0|1|59998ms] - remote.cc:801 - Downloading segment from {panda-bucket-940eaa78-a0f4-11ee-a4e1-0242ac10101c}, {key_not_found}, segment at {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"} not available
DEBUG 2023-12-22 18:06:54,962 [shard 1:main] client_pool - client_pool.cc:490 - releasing a client, pool size: 19, capacity: 20
DEBUG 2023-12-22 18:06:54,962 [shard 1:main] cloud_storage - [fiber3~36 kafka/test-topic/0 [7278:7410]] - remote_segment.cc:525 - Failed to hydrating a segment {"e6cf5f2c/kafka/test-topic/0_28/7278-7410-4213540-1-v1.log.1"}, 1 waiter will be invoked

andrwng (Contributor) commented Jan 5, 2024

https://drive.google.com/file/d/1wZnmolWJW0_ks4iiyZ74Z7w-dsTezLZy/view?usp=sharing

Here are the logs, in case the ones linked above become unavailable.

andrwng (Contributor) commented Jan 5, 2024

It appears that timequeries just aren't taking into account the archive start offset. From the above logs:

Here's a given cursor. begin refers to the archive clean offset, though it doesn't seem to actually be clean, given that some archive segments below it are still present in the manifest.

DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:602 - creating_cursor: begin: 7804, end: 15802, stm_range[{13875}/15802]

Here's a given query:

DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:1472 - search_spillover_manifest query: {{timestamp: 1703268307239}}, num manifests: 5, first: {timestamp: 1703268315071}, last: {timestamp: 1703268340066}

And the found manifest:

DEBUG 2023-12-22 18:06:54,960 [shard 1:main] cloud_storage - [fiber2 kafka/test-topic/0] - async_manifest_view.cc:1311 - Found spillover manifest meta: {{is_compacted: false, size_bytes: 42133382, base_offset: 7278, committed_offset: 8592, base_timestamp: {timestamp: 1703268315071}, max_timestamp: {timestamp: 1703268320067}, delta_offset: 184, ntp_revision: 28, archiver_term: 1, segment_term: 1, delta_offset_end: 219, sname_format: {v3}, metadata_size_hint: 3128}}

So there seem to be a couple of things off here:

  • Why is the archive clean offset not actually clean? I was under the impression "clean" meant no segments below exist.
  • Shouldn't search_spillover_manifest skip over anything below the archive start and archive clean offsets? (A rough sketch of that idea follows below.)
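
For illustration only, here is a rough sketch of the kind of clamping the second question is getting at. The types, names, and values below are invented (loosely modeled on the logs above), not the actual async_manifest_view code: before searching spillover manifests by timestamp, drop any manifest whose entire offset range lies below the archive start offset, and clamp older timestamps to the first surviving manifest.

// Hypothetical sketch: not the real search_spillover_manifest implementation.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct spillover_manifest_meta {
    int64_t base_offset;
    int64_t committed_offset;
    int64_t base_timestamp;
    int64_t max_timestamp;
};

std::optional<spillover_manifest_meta> search_spillover_manifests(
  const std::vector<spillover_manifest_meta>& manifests, // sorted by offset
  int64_t query_ts,
  int64_t archive_start_offset) {
    // Drop manifests whose whole offset range sits below the archive start
    // offset; their segments may already be garbage collected.
    auto first_valid = std::find_if(
      manifests.begin(), manifests.end(),
      [&](const spillover_manifest_meta& m) {
          return m.committed_offset >= archive_start_offset;
      });
    // Pick the first remaining manifest whose max_timestamp covers the query;
    // a timestamp older than everything left clamps to the first survivor.
    auto it = std::find_if(
      first_valid, manifests.end(), [&](const spillover_manifest_meta& m) {
          return query_ts <= m.max_timestamp;
      });
    if (it == manifests.end()) {
        return std::nullopt;
    }
    return *it;
}

int main() {
    // Illustrative values only, loosely based on the logs above.
    std::vector<spillover_manifest_meta> manifests = {
      {7278, 8592, 1703268315071, 1703268320067},
      {8593, 9900, 1703268320068, 1703268325000},
      {9901, 13874, 1703268325001, 1703268330000},
    };
    // Pretend the archive start offset has advanced to 9901: the first two
    // manifests are skipped and the old timestamp clamps to the third.
    auto m = search_spillover_manifests(manifests, 1703268307239, 9901);
    if (m) {
        std::cout << "selected manifest [" << m->base_offset << ", "
                  << m->committed_offset << "]\n";
    }
}

Note that in the failing run the selected manifest ([7278, 8592]) straddles the archive start offset (7804), so a manifest-level check alone wouldn't fully cover this case; the fix referenced below also skips individual segments that fall below the start offset.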

andrwng self-assigned this Jan 31, 2024

andrwng added a commit to andrwng/redpanda that referenced this issue Feb 20, 2024
This fixes a bug in timequery handling that could result in an attempt to
download a segment that has already been removed. The sequence of events is as
follows:

1. Spillover manifest [7278, 8592]
2. Spillover more manifests... [...13574]
3. Eventually our stm manifest is [13875, 15802]
4. We housekeep and truncate to 7804. This is the new archive start offset and
   archive clean offset
5. A timequery comes in whose timestamp falls in manifest [7278, 8592]
6. We attempt to download segment [7278-7410], which has already been
   removed

To fix this, as we're seeking within a given manifest, we'll now check against
the overall start offset of the partition, skipping any segments that fall
below the start.

Fixes redpanda-data#15042
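
As a rough illustration of the approach the commit message describes, here is a hedged sketch; the struct, function, and parameter names are invented for this example and are not the code in #16645: while scanning a manifest's segment metadata for a timequery, skip any entry whose offset range falls entirely below the partition's overall start offset, so a segment that retention has already removed is never selected for download.

// Hypothetical sketch of the segment-level skip, not the actual patch.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct segment_meta {
    int64_t base_offset;
    int64_t committed_offset;
    int64_t base_timestamp;
    int64_t max_timestamp;
};

// Returns the first segment that may contain `ts`, ignoring segments whose
// offset range lies entirely below `partition_start_offset` (hypothetical
// parameter name).
std::optional<segment_meta> find_segment_for_timequery(
  const std::vector<segment_meta>& manifest,
  int64_t ts,
  int64_t partition_start_offset) {
    for (const auto& seg : manifest) {
        if (seg.committed_offset < partition_start_offset) {
            // Segment was truncated away by retention; its data may already
            // be deleted from object storage, so do not select it.
            continue;
        }
        if (ts <= seg.max_timestamp) {
            return seg;
        }
    }
    return std::nullopt;
}

int main() {
    // Offsets loosely modeled on the logs above (a few segments omitted).
    std::vector<segment_meta> manifest = {
      {7278, 7410, 1703268315071, 1703268316000},
      {7411, 7541, 1703268316001, 1703268317000},
      {7804, 8592, 1703268317001, 1703268320067},
    };
    // The archive start offset has advanced to 7804; the query for an old
    // timestamp now resolves to the first segment at or above that offset.
    auto seg = find_segment_for_timequery(manifest, 1703268307239, 7804);
    if (seg) {
        std::cout << "selected segment " << seg->base_offset << "-"
                  << seg->committed_offset << "\n";
    }
}

With the skip in place, the old timestamp resolves to segment [7804, 8592] instead of the already-deleted [7278, 7410].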
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 20, 2024
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 21, 2024
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Feb 21, 2024
(cherry picked from commit d056080)