
Failure in CloudRetentionTest.test_cloud_retention #7708

Closed
VadimPlh opened this issue Dec 12, 2022 · 11 comments · Fixed by #8110
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working

Comments


VadimPlh commented Dec 12, 2022

https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed

FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20 (1/1 runs)
  failure at 2022-12-09T15:24:03.345Z: TimeoutError(None)
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed
FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None (1/1 runs)
  failure at 2022-12-09T15:24:03.345Z: TimeoutError(None)
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/cloud_retention_test.py", line 128, in test_cloud_retention
    consumer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 162, in wait_node
    self._redpanda.wait_until(lambda: self._status.active is False or self.
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1068, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
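For context, the unhelpful TimeoutError(None) comes straight from the bottom frame of the traceback: ducktape's wait_until raises with whatever err_msg it was given, and here none was supplied. A minimal sketch of that pattern (simplified from ducktape; parameter names approximate):

```python
import time

class TimeoutError(Exception):
    pass

def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=None):
    # Poll `condition` until it returns truthy or the timeout elapses;
    # on timeout, raise TimeoutError carrying err_msg -- which may be None,
    # producing the bare "TimeoutError: None" seen above.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg)
```

Since the kgo_verifier wait passed no err_msg, the only way to tell *why* the consumer never finished is to dig into the client and broker logs, as done below.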
@VadimPlh VadimPlh added kind/bug Something isn't working ci-failure labels Dec 12, 2022
@jcsp jcsp added the area/cloud-storage Shadow indexing subsystem label Dec 12, 2022
BenPope commented Dec 30, 2022

This test doesn't seem to succeed:

FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20 (14/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T15:33:23.538Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4926#01855ce9-9185-4a0f-a5a0-405a19882192
failure at 2022-12-29T04:31:38.661Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T15:45:22.952Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4921#018557c4-2990-4c6b-9b28-9ec4be83e9cb
failure at 2022-12-28T04:28:12.819Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T15:23:38.216Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4916#0185529d-3893-43cc-a82e-6524ae0839d8
failure at 2022-12-27T04:39:59.859Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T15:30:53.221Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4911#01854d77-3803-4f06-94df-09d2713c4046
failure at 2022-12-26T04:32:21.108Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-25T15:25:50.631Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4903#0185484f-5cfb-4d8e-acbf-0f796ad567d8
failure at 2022-12-24T15:28:11.263Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
failure at 2022-12-23T15:23:44.002Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4875#01853e04-0484-4b8e-8b95-458bcbc3db86
FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None (14/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T15:33:23.538Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4926#01855ce9-9185-4a0f-a5a0-405a19882192
failure at 2022-12-29T04:31:38.661Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T15:45:22.952Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4921#018557c4-2990-4c6b-9b28-9ec4be83e9cb
failure at 2022-12-28T04:28:12.819Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T15:23:38.216Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4916#0185529d-3893-43cc-a82e-6524ae0839d8
failure at 2022-12-27T04:39:59.859Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T15:30:53.221Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4911#01854d77-3803-4f06-94df-09d2713c4046
failure at 2022-12-26T04:32:21.108Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-25T15:25:50.631Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4903#0185484f-5cfb-4d8e-acbf-0f796ad567d8
failure at 2022-12-24T15:28:11.263Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
failure at 2022-12-23T15:23:44.002Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4875#01853e04-0484-4b8e-8b95-458bcbc3db86

Lazin commented Jan 5, 2023

andrwng commented Jan 9, 2023

I think this is caused by a bug in offset translation. I ran the test with trace logging enabled in kgo-verifier, franz-go, and Redpanda, and found several messages (newly added for debugging) in the client logs like:

time="2023-01-07T00:05:12Z" level=debug msg="Not done reading after reading si_test_topic/1: 0 < 77194"

...after the last_pass call is made.

On the servers, it looks like we keep on fetching the same offset, but end up not returning any results, e.g.:

WARN 2023-01-07 00:05:10,525 [shard 1] kafka - fetch.cc:198 - fetch offset out of range for {kafka/si_test_topic/1}, requested offset: 83272, partition start offset: 84186, high watermark: 87788, ec: { error_code: offset_out_of_range [1] }

There's an RPC to list offsets:
TRACE 2023-01-07 00:05:10,526 [shard 1] kafka - handler.h:68 - [client_id: {KgoVerifierConsumerGroupConsumer-0-140251922618768}] handling list_offsets request {replica_id=-1 isolation_level=0 topics={{name={si_test_topic} partitions={{partition_index=1 current_leader_epoch=23 timestamp={timestamp: -2} max_num_offsets=0}}}}}

It seems like a segment is truncated:
INFO 2023-01-07 00:05:10,527 [shard 1] archival - [fiber34581 kafka/si_test_topic/1] - ntp_archiver_service.cc:1185 - Deleted segment from cloud storage: {"e65edc1c/kafka/si_test_topic/1_18/83439-84353-11268782-22-v1.log.22"}

We then return the offset 84186, which is presumably in the remaining segments:
TRACE 2023-01-07 00:05:10,527 [shard 1] kafka - request_context.h:168 - [172.24.0.10:51958] sending 2:list_offsets for {KgoVerifierConsumerGroupConsumer-0-140251922618768}, response {throttle_time_ms=0 topics={{name={si_test_topic} partitions={{partition_index=1 error_code={ error_code: none [0] } old_style_offsets={} timestamp={timestamp: missing} offset=84186 leader_epoch=23}}}}}

The fetches thereafter for offset 84186 all return 0 bytes:
DEBUG 2023-01-07 00:05:12,778 [shard 1] cloud_storage - partition_manifest.cc:288 - Metadata lookup using kafka offset 84186

DEBUG 2023-01-07 00:05:12,778 [shard 1] cloud_storage - [fiber30979 kafka/si_test_topic/1] - remote_partition.cc:347 - partition_record_batch_reader_impl initialize reader state - segment not found

TRACE 2023-01-07 00:05:12,778 [shard 1] kafka - fetch.cc:348 - fetch_ntps_in_parallel: for 1 partitions returning 0 total bytes

Looking at the offset translation code for segment lookup, we first do an offset lookup at model offset 84186, then traverse forward looking for the first segment with a higher base kafka offset, and return the previous segment, expecting it to contain the kafka offset. In this case, that first lookup returns no segments (because model offset 84186 has been truncated) and we return immediately.
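The lookup described above can be sketched as follows. This is a simplified Python model of the manifest lookup, not Redpanda's actual C++ code; the segment layout, field names, and function name are all illustrative:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Segment:
    base_model_offset: int   # raw log offset (includes non-data batches)
    base_kafka_offset: int   # translated offset visible to Kafka clients

# Simplified manifest after prefix truncation: earlier segments removed,
# so the first remaining segment starts well above model offset 0.
segments = [
    Segment(base_model_offset=20, base_kafka_offset=10),
    Segment(base_model_offset=30, base_kafka_offset=18),
]

def buggy_segment_for_kafka_offset(segments, ko):
    # Old behavior: first look up by the *casted* offset, i.e. treat the
    # kafka offset as if it were a model offset...
    bases = [s.base_model_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    if i < 0:
        # The casted offset falls before the truncated manifest start,
        # so we wrongly conclude that no segment holds the data.
        return None
    # ...then traverse forward to the first segment whose base kafka
    # offset exceeds `ko` and take its predecessor.
    while i + 1 < len(segments) and segments[i + 1].base_kafka_offset <= ko:
        i += 1
    return segments[i]

# Kafka offset 12 lives in the first remaining segment, but the initial
# casted lookup (model offset 12 < 20) misses it:
print(buggy_segment_for_kafka_offset(segments, 12))  # None
```

When the manifest has not been truncated the casted offset always lands at or after the first segment's base model offset, which is why the early return only bites after retention kicks in.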

andrwng added a commit to andrwng/redpanda that referenced this issue Jan 9, 2023
When serving a fetch request by a kafka::offset, we previously used the
`partition_manifest` to perform a segment lookup by
`kafka::offset_cast(ko)` first in order to traverse the segments forward
to find the actual segment that contained `ko`.

This doesn't work when the manifest has been truncated, e.g. if the
casted offset falls before the start of the manifest, we would
previously return that no segment exists for the fetch. This could
result in segments erroneously not being returned, and fetches
erroneously being met with no data when some existed.

This commit fixes the behavior to no longer use the casted segment
lookup and adds some test coverage for kafka::offset lookups.

Fixes redpanda-data#7708
andrwng added a commit to andrwng/redpanda that referenced this issue Jan 9, 2023
Lazin commented Jan 9, 2023

Isn't it the expected behavior? The consumer tries to fetch offset below start offset and gets an error. Or it just acts as if the partition is empty in this case?

andrwng commented Jan 9, 2023

Isn't it the expected behavior? The consumer tries to fetch offset below start offset and gets an error. Or it just acts as if the partition is empty in this case?

I don't think so. In this case, the partition has kafka offsets 84186 - 87788 (see partition start offset: 84186, high watermark: 87788), so a fetch for 84186 should be valid and met with some results.
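The range check implied here is simple: a fetch is valid when it falls between the partition start offset and the high watermark, and offsets below the start were removed by retention. A minimal sketch (hypothetical helper, not Redpanda code) using the values from the logs above:

```python
def fetch_offset_valid(requested, start_offset, high_watermark):
    # A fetch is in range when start_offset <= requested < high_watermark;
    # anything below start_offset has been truncated away by retention.
    return start_offset <= requested < high_watermark

print(fetch_offset_valid(84186, 84186, 87788))  # True: must return data
print(fetch_offset_valid(83272, 84186, 87788))  # False: offset_out_of_range, as logged
```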

Lazin commented Jan 9, 2023

So basically, prefix truncation breaks segment lookup so it can't find the first segment anymore?

andrwng commented Jan 9, 2023

Right, a simple example of what I think is happening is:

mo: model offset
ko: kafka offset

mo: 10     20     30
    [b    ][c    ]end
ko: 5      10     15

We may see a fetch at kafka offset 7; in today's code, the segment_contains(kafka::offset) method reports that there is no data, because no segment contains model::offset(7).
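The diagram above can be reproduced as a small runnable sketch contrasting the buggy lookup with a lookup on the kafka-offset column directly. Segment names, fields, and functions are illustrative, not the actual implementation:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    base_model_offset: int
    base_kafka_offset: int

# The layout from the diagram: "b" starts at mo=10/ko=5, "c" at mo=20/ko=10.
segments = [
    Segment("b", 10, 5),
    Segment("c", 20, 10),
]

def lookup_by_model_cast(segments, ko):
    # Buggy: treats the kafka offset as a model offset for the initial lookup.
    bases = [s.base_model_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    return segments[i] if i >= 0 else None

def lookup_by_kafka_offset(segments, ko):
    # Fixed: search directly on the kafka-offset column.
    bases = [s.base_kafka_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    return segments[i] if i >= 0 else None

print(lookup_by_model_cast(segments, 7))    # None -- fetch wrongly comes back empty
print(lookup_by_kafka_offset(segments, 7))  # segment "b", which holds kafka offset 7
```

Kafka offset 7 is below every base model offset (10 and 20), so the casted lookup finds nothing, even though segment "b" (base kafka offset 5) clearly contains it.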

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jan 9, 2023

(cherry picked from commit 10f87c1)
andrwng added a commit to andrwng/redpanda that referenced this issue Jan 10, 2023