
Failure in CloudRetentionTest.test_cloud_retention #7708

Closed
VadimPlh opened this issue Dec 12, 2022 · 11 comments · Fixed by #8110
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working

Comments


VadimPlh commented Dec 12, 2022

https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed

FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20 (1/1 runs)
  failure at 2022-12-09T15:24:03.345Z: TimeoutError(None)
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed
FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None (1/1 runs)
  failure at 2022-12-09T15:24:03.345Z: TimeoutError(None)
      on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4587#0184f5eb-1f40-4102-a3ec-f323deb5d8ed
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/cloud_retention_test.py", line 128, in test_cloud_retention
    consumer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 162, in wait_node
    self._redpanda.wait_until(lambda: self._status.active is False or self.
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1068, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
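For context, the unhelpful TimeoutError(None) comes straight from the bottom frame of the traceback: ducktape's wait_until raises with whatever err_msg it was given, and here none was supplied. A minimal sketch of that pattern (simplified from ducktape; parameter names approximate):

```python
import time

class TimeoutError(Exception):
    pass

def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=None):
    # Poll `condition` until it returns truthy or the timeout elapses;
    # on timeout, raise TimeoutError carrying err_msg -- which may be None,
    # producing the bare "TimeoutError: None" seen above.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg)
```

Since the kgo_verifier wait passed no err_msg, the only way to tell *why* the consumer never finished is to dig into the client and broker logs, as done below.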
@VadimPlh VadimPlh added kind/bug Something isn't working ci-failure labels Dec 12, 2022
@jcsp jcsp added the area/cloud-storage Shadow indexing subsystem label Dec 12, 2022
BenPope commented Dec 30, 2022

This test doesn't seem to succeed:

FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20 (14/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T15:33:23.538Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4926#01855ce9-9185-4a0f-a5a0-405a19882192
failure at 2022-12-29T04:31:38.661Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T15:45:22.952Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4921#018557c4-2990-4c6b-9b28-9ec4be83e9cb
failure at 2022-12-28T04:28:12.819Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T15:23:38.216Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4916#0185529d-3893-43cc-a82e-6524ae0839d8
failure at 2022-12-27T04:39:59.859Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T15:30:53.221Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4911#01854d77-3803-4f06-94df-09d2713c4046
failure at 2022-12-26T04:32:21.108Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-25T15:25:50.631Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4903#0185484f-5cfb-4d8e-acbf-0f796ad567d8
failure at 2022-12-24T15:28:11.263Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
failure at 2022-12-23T15:23:44.002Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4875#01853e04-0484-4b8e-8b95-458bcbc3db86
FAIL test: CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None (14/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T15:33:23.538Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4926#01855ce9-9185-4a0f-a5a0-405a19882192
failure at 2022-12-29T04:31:38.661Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T15:45:22.952Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4921#018557c4-2990-4c6b-9b28-9ec4be83e9cb
failure at 2022-12-28T04:28:12.819Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T15:23:38.216Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4916#0185529d-3893-43cc-a82e-6524ae0839d8
failure at 2022-12-27T04:39:59.859Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T15:30:53.221Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4911#01854d77-3803-4f06-94df-09d2713c4046
failure at 2022-12-26T04:32:21.108Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-25T15:25:50.631Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4903#0185484f-5cfb-4d8e-acbf-0f796ad567d8
failure at 2022-12-24T15:28:11.263Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError(None)
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
failure at 2022-12-23T15:23:44.002Z: TimeoutError(None)
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4875#01853e04-0484-4b8e-8b95-458bcbc3db86

Lazin commented Jan 5, 2023

andrwng commented Jan 9, 2023

I think this is caused by a bug in offset translation. I ran the test with trace logging enabled in kgo-verifier, franz-go, and Redpanda, and found several messages (newly added for debugging) in the client logs like:

time="2023-01-07T00:05:12Z" level=debug msg="Not done reading after reading si_test_topic/1: 0 < 77194"

...after the last_pass call is made.

On the servers, it looks like we keep on fetching the same offset, but end up not returning any results, e.g.:

WARN 2023-01-07 00:05:10,525 [shard 1] kafka - fetch.cc:198 - fetch offset out of range for {kafka/si_test_topic/1}, requested offset: 83272, partition start offset: 84186, high watermark: 87788, ec: { error_code: offset_out_of_range [1] }

There's an RPC to list offsets:
TRACE 2023-01-07 00:05:10,526 [shard 1] kafka - handler.h:68 - [client_id: {KgoVerifierConsumerGroupConsumer-0-140251922618768}] handling list_offsets request {replica_id=-1 isolation_level=0 topics={{name={si_test_topic} partitions={{partition_index=1 current_leader_epoch=23 timestamp={timestamp: -2} max_num_offsets=0}}}}}

It seems like a segment is truncated:
INFO 2023-01-07 00:05:10,527 [shard 1] archival - [fiber34581 kafka/si_test_topic/1] - ntp_archiver_service.cc:1185 - Deleted segment from cloud storage: {"e65edc1c/kafka/si_test_topic/1_18/83439-84353-11268782-22-v1.log.22"}

We then return the offset 84186, which is presumably in the remaining segments:
TRACE 2023-01-07 00:05:10,527 [shard 1] kafka - request_context.h:168 - [172.24.0.10:51958] sending 2:list_offsets for {KgoVerifierConsumerGroupConsumer-0-140251922618768}, response {throttle_time_ms=0 topics={{name={si_test_topic} partitions={{partition_index=1 error_code={ error_code: none [0] } old_style_offsets={} timestamp={timestamp: missing} offset=84186 leader_epoch=23}}}}}

The fetches thereafter for offset 84186 all return 0 bytes:
DEBUG 2023-01-07 00:05:12,778 [shard 1] cloud_storage - partition_manifest.cc:288 - Metadata lookup using kafka offset 84186

DEBUG 2023-01-07 00:05:12,778 [shard 1] cloud_storage - [fiber30979 kafka/si_test_topic/1] - remote_partition.cc:347 - partition_record_batch_reader_impl initialize reader state - segment not found

TRACE 2023-01-07 00:05:12,778 [shard 1] kafka - fetch.cc:348 - fetch_ntps_in_parallel: for 1 partitions returning 0 total bytes

Looking at the offset translation code for segment lookup, we first do an offset lookup at model offset 84186, then traverse forward looking for the first segment with a higher base kafka offset, and return the previous segment, expecting it to contain the kafka offset. In this case, that first lookup returns no segments (because model offset 84186 has been truncated) and we return immediately.
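The lookup described above can be sketched as follows. This is a simplified Python model of the manifest lookup, not Redpanda's actual C++ code; the segment layout, field names, and function name are all illustrative:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Segment:
    base_model_offset: int   # raw log offset (includes non-data batches)
    base_kafka_offset: int   # translated offset visible to Kafka clients

# Simplified manifest after prefix truncation: earlier segments removed,
# so the first remaining segment starts well above model offset 0.
segments = [
    Segment(base_model_offset=20, base_kafka_offset=10),
    Segment(base_model_offset=30, base_kafka_offset=18),
]

def buggy_segment_for_kafka_offset(segments, ko):
    # Old behavior: first look up by the *casted* offset, i.e. treat the
    # kafka offset as if it were a model offset...
    bases = [s.base_model_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    if i < 0:
        # The casted offset falls before the truncated manifest start,
        # so we wrongly conclude that no segment holds the data.
        return None
    # ...then traverse forward to the first segment whose base kafka
    # offset exceeds `ko` and take its predecessor.
    while i + 1 < len(segments) and segments[i + 1].base_kafka_offset <= ko:
        i += 1
    return segments[i]

# Kafka offset 12 lives in the first remaining segment, but the initial
# casted lookup (model offset 12 < 20) misses it:
print(buggy_segment_for_kafka_offset(segments, 12))  # None
```

When the manifest has not been truncated the casted offset always lands at or after the first segment's base model offset, which is why the early return only bites after retention kicks in.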

andrwng added a commit to andrwng/redpanda that referenced this issue Jan 9, 2023
When serving a fetch request by a kafka::offset, we previously used the
`partition_manifest` to perform a segment lookup by
`kafka::offset_cast(ko)` first in order to traverse the segments forward
to find the actual segment that contained `ko`.

This doesn't work when the manifest has been truncated, e.g. if the
casted offset falls before the start of the manifest, we would
previously return that no segment exists for the fetch. This could
result in segments erroneously not being returned, and fetches
erroneously being met with no data when some existed.

This commit fixes the behavior to no longer use the casted segment
lookup and adds some test coverage for kafka::offset lookups.

Fixes redpanda-data#7708
andrwng added a commit to andrwng/redpanda that referenced this issue Jan 9, 2023
Lazin commented Jan 9, 2023

Isn't it the expected behavior? The consumer tries to fetch offset below start offset and gets an error. Or it just acts as if the partition is empty in this case?

andrwng commented Jan 9, 2023

Isn't it the expected behavior? The consumer tries to fetch offset below start offset and gets an error. Or it just acts as if the partition is empty in this case?

I don't think so. In this case, the partition has kafka offsets 84186 - 87788 (see partition start offset: 84186, high watermark: 87788), so a fetch for 84186 should be valid and met with some results.
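The range check implied here is simple: a fetch is valid when it falls between the partition start offset and the high watermark, and offsets below the start were removed by retention. A minimal sketch (hypothetical helper, not Redpanda code) using the values from the logs above:

```python
def fetch_offset_valid(requested, start_offset, high_watermark):
    # A fetch is in range when start_offset <= requested < high_watermark;
    # anything below start_offset has been truncated away by retention.
    return start_offset <= requested < high_watermark

print(fetch_offset_valid(84186, 84186, 87788))  # True: must return data
print(fetch_offset_valid(83272, 84186, 87788))  # False: offset_out_of_range, as logged
```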

Lazin commented Jan 9, 2023

So basically, prefix truncation breaks segment lookup so it can't find the first segment anymore?

andrwng commented Jan 9, 2023

Right, a simple example of what I think is happening is:

mo: model offset
ko: kafka offset

mo: 10     20     30
    [b    ][c    ]end
ko: 5      10     15

We may see a fetch at kafka offset 7; in today's code, the segment_contains(kafka::offset) method reports that there is no data, because no segment contains model::offset(7).
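The diagram above can be reproduced as a small runnable sketch contrasting the buggy lookup with a lookup on the kafka-offset column directly. Segment names, fields, and functions are illustrative, not the actual implementation:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    base_model_offset: int
    base_kafka_offset: int

# The layout from the diagram: "b" starts at mo=10/ko=5, "c" at mo=20/ko=10.
segments = [
    Segment("b", 10, 5),
    Segment("c", 20, 10),
]

def lookup_by_model_cast(segments, ko):
    # Buggy: treats the kafka offset as a model offset for the initial lookup.
    bases = [s.base_model_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    return segments[i] if i >= 0 else None

def lookup_by_kafka_offset(segments, ko):
    # Fixed: search directly on the kafka-offset column.
    bases = [s.base_kafka_offset for s in segments]
    i = bisect.bisect_right(bases, ko) - 1
    return segments[i] if i >= 0 else None

print(lookup_by_model_cast(segments, 7))    # None -- fetch wrongly comes back empty
print(lookup_by_kafka_offset(segments, 7))  # segment "b", which holds kafka offset 7
```

Kafka offset 7 is below every base model offset (10 and 20), so the casted lookup finds nothing, even though segment "b" (base kafka offset 5) clearly contains it.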

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jan 9, 2023

(cherry picked from commit 10f87c1)
andrwng added a commit to andrwng/redpanda that referenced this issue Jan 10, 2023