CI Failure (consumer timed out) in CloudStorageTimingStressTest.test_cloud_storage
#15042
Different test (with_partition_moves), same failures: https://buildkite.com/redpanda/redpanda/builds/42019
On the client side, at a certain point we see several attempts to connect, but they're immediately shut down.
On the broker side, we see connections being repeatedly shut down as well.
Marking sev/high since this results in a stuck consumer. Similar to above, there are a bunch of [...]. This time around I noticed that they correspond to some warnings at the Kafka layer. The broker is reporting a NotFound error when performing a timequery from S3. It appears the timestamp of interest is leading us to attempt to download a segment that shouldn't exist in the archive.
Here's where we can see it's a timequery:
And we can see the archive start offset has already been moved ahead of that segment:
What's odd is that we appear to be repeatedly attempting this download. Sure, this segment may not be expected to exist according to the archive, but why then do we continue to attempt to download it? cc @Lazin
Here's a more complete sequence of a download failure:
Here are the logs, in case the above ones get lost: https://drive.google.com/file/d/1wZnmolWJW0_ks4iiyZ74Z7w-dsTezLZy/view?usp=sharing
It appears that timequeries just aren't taking into account the archive start offset. From the above logs, here's a given cursor:
Here's a given query:
And the found manifest:
So there seem to be a couple things off here:
This fixes a bug for timequeries that could result in a removed segment being unsuccessfully downloaded during a timequery. The sequence of events is as follows:

1. Spillover manifest [7278, 8592]
2. Spillover more manifests... [...13574]
3. Eventually our stm manifest is [13875, 15802]
4. We housekeep and truncate to 7804. This is the new archive start offset and archive clean offset
5. A timequery comes in whose timestamp falls in manifest [7278, 8592]
6. Segment [7278-7410] is attempted to be downloaded, which has already been removed

To fix this, as we're seeking within a given manifest, we'll now check against the overall start offset of the partition, skipping any segments that fall below the start.

Fixes redpanda-data#15042
(cherry picked from commit d056080)
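For readers who want the idea behind the fix in a compact form, here is a minimal sketch of the start-offset check described in the commit message. It is not the actual Redpanda code: the `Segment` type, the `max_timestamp` field, the `seek_by_timestamp` function, and all timestamp values are hypothetical and exist only for illustration. Only the offsets 7278, 7410, 8592 and the start offset 7804 come from the sequence above; the other segment boundaries are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    # Hypothetical stand-in for a segment entry in a spillover manifest.
    base_offset: int
    last_offset: int
    max_timestamp: int  # illustrative; not taken from the issue


def seek_by_timestamp(
    segments: Iterable[Segment], timestamp: int, start_offset: int
) -> Optional[Segment]:
    """Return the first segment that may hold `timestamp`, ignoring any
    segment that lies entirely below the partition's start offset."""
    for seg in segments:  # assumed sorted by base_offset
        if seg.last_offset < start_offset:
            # Segment was truncated away by housekeeping; do not try to
            # download it.
            continue
        if seg.max_timestamp >= timestamp:
            return seg
    return None


# Manifest [7278, 8592]; only the first segment's offsets are from the issue.
manifest = [
    Segment(7278, 7410, max_timestamp=1000),
    Segment(7411, 7900, max_timestamp=2000),
    Segment(7901, 8592, max_timestamp=3000),
]

# The timestamp falls in the removed segment, but the start offset is 7804.
hit = seek_by_timestamp(manifest, timestamp=500, start_offset=7804)
```

With the start offset at 7804, `hit` is the segment starting at 7411 rather than the removed [7278-7410] segment. Passing `start_offset=0` reproduces the buggy behaviour, where the removed segment is selected for download.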
https://buildkite.com/redpanda/redpanda/builds/41212
https://buildkite.com/redpanda/redpanda/builds/41374