Upload error after all cloud segments get GCed #8945
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)
Comments
andrwng added the kind/bug, area/cloud-storage, and sev/high labels on Feb 17, 2023.
Theory from the chat just now: this could be an off-by-one error in the local retention code.
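For context on what that theory would look like, here is a standalone toy, not Redpanda code and with invented names, illustrating the hypothesized class of bug: a `<` where a `<=` belongs in the retention walk evicts one segment more than the byte budget requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Sizes in bytes of closed log segments, oldest first.
using segment_sizes = std::vector<std::uint64_t>;

// Buggy variant: `<` where `<=` belongs, so a tail that fits the budget
// exactly still loses its oldest segment.
std::size_t segments_to_evict_buggy(const segment_sizes& segs,
                                    std::uint64_t budget) {
    std::uint64_t kept = 0;
    std::size_t keep_from = segs.size();
    while (keep_from > 0 && kept + segs[keep_from - 1] < budget) { // off by one
        kept += segs[--keep_from];
    }
    return keep_from; // count of oldest segments to delete
}

// Correct variant: keep segments while the running total stays within budget.
std::size_t segments_to_evict_fixed(const segment_sizes& segs,
                                    std::uint64_t budget) {
    std::uint64_t kept = 0;
    std::size_t keep_from = segs.size();
    while (keep_from > 0 && kept + segs[keep_from - 1] <= budget) {
        kept += segs[--keep_from];
    }
    return keep_from;
}

int main() {
    segment_sizes segs{100, 100, 100}; // oldest first
    std::cout << "buggy evicts " << segments_to_evict_buggy(segs, 200)
              << " segment(s), fixed evicts "
              << segments_to_evict_fixed(segs, 200) << " segment(s)\n";
    // Prints: buggy evicts 2 segment(s), fixed evicts 1 segment(s)
}
```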
andrwng added a commit to andrwng/redpanda that referenced this issue on Feb 17, 2023:

We previously predicated the choice of upload start offset on whether there were existing segments. This doesn't work when the manifest is entirely truncated away. Without this fix, an upload attempted after GCing all cloud segments could fail with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945
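For readers skimming the thread, here is a minimal sketch of the shape of the fix as the commit message describes it. All types and names here are invented for illustration and do not match Redpanda's actual archival code; the real change is in the referenced commit. The key idea: when the manifest's segment list is empty, derive the next upload offset from the manifest's start offset rather than a stale offset that can sit below the local log's translation range.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct segment_meta {
    std::int64_t base_offset;      // first offset covered by the segment
    std::int64_t committed_offset; // last offset covered by the segment
};

struct partition_manifest {
    // Keeps advancing as cloud retention truncates the manifest, so it
    // stays meaningful even after every segment entry is GCed away.
    std::int64_t start_offset = 0;
    std::vector<segment_meta> segments;
};

// Pre-fix shape: the start offset is only derived from existing segment
// entries, so a fully truncated manifest falls back to a stale offset that
// can sit below the local log's offset translation range.
std::int64_t upload_start_before(const partition_manifest& m,
                                 std::int64_t stale_fallback) {
    if (!m.segments.empty()) {
        return m.segments.back().committed_offset + 1;
    }
    return stale_fallback;
}

// Post-fix shape: consult the manifest's own start offset as well, so the
// next upload resumes at the truncation point rather than behind it.
std::int64_t upload_start_after(const partition_manifest& m,
                                std::int64_t stale_fallback) {
    if (!m.segments.empty()) {
        return m.segments.back().committed_offset + 1;
    }
    return std::max(m.start_offset, stale_fallback);
}

int main() {
    partition_manifest m;
    m.start_offset = 8830; // everything below this was truncated away
    std::cout << upload_start_before(m, 4085) << " vs "
              << upload_start_after(m, 4085) << '\n'; // 4085 vs 8830
}
```

The offsets in main mirror the error log above: the buggy path restarts at 4085, below the translation range starting at 8830.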
andrwng changed the title from "Upload error when increasing segment size past existing retention limits" to "Upload error after all cloud segments get GCed" on Feb 17, 2023.
andrwng added a commit to andrwng/redpanda that referenced this issue on Feb 27, 2023 (same commit message as above; cherry picked from commit 6dd451d):

CONFLICT: updated SISettings construction, which takes no test_context in this branch.
andrwng added a commit to andrwng/redpanda that referenced this issue on Feb 27, 2023 (same commit message as above; cherry picked from commit 6dd451d):

CONFLICT: updated SISettings construction, which takes no test_context in this branch; backported tests/cloud_retention_test.
andrwng added a commit to andrwng/redpanda that referenced this issue on Apr 18, 2023 (same commit message as above; cherry picked from commit 6dd451d):

CONFLICT: removed tests/cloud_retention_test.
Version & Environment

Redpanda version (use rpk version): dev / ~v23.1.1

What went wrong?
I ran through a variant of many_partitions_test, but with a cluster of 4 brokers, 315 partitions, and non-infinite retention: 100 GiB split across partitions, i.e. roughly 325 MiB each (102,400 MiB ÷ 315 ≈ 325 MiB). The initial segment size was 4 KiB with initial local retention of 96 KiB per partition, and as warmup I wrote 1000 segments per partition. I then bumped the segment size to 512 MiB (note: above the retention limit!) and the local retention to ~500 MiB (note: below the segment size!). After some writing and reading, these errors started showing up on every upload for this partition (the test goes on to perform other restarts, decommissions, etc., but the errors appear before then).
What should have happened instead?

It isn't clear we can do anything better than error out if we've indeed truncated the manifest beyond what's local to a new leader. Perhaps there are safeguards we can install around increasing the segment size (e.g., disallowing a segment size larger than the retention size seems like a reasonable start).
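One possible shape for that safeguard, as a hedged sketch: a validation hook with invented names (this is not Redpanda's actual config-validation code) that rejects a segment-size increase past the local retention target. In the repro above, such a check would have rejected the jump to a 512 MiB segment size against the ~500 MiB retention target.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

struct topic_properties {
    std::optional<std::uint64_t> segment_size_bytes;
    std::optional<std::uint64_t> retention_local_target_bytes;
};

// Reject an alter-config request whose new segment size exceeds the local
// retention budget: otherwise a single open segment can outgrow retention,
// GC everything in cloud storage, and strand the archiver with no valid
// upload start offset.
void validate_segment_size(const topic_properties& current,
                           std::uint64_t new_segment_size) {
    auto retention = current.retention_local_target_bytes;
    if (retention && new_segment_size > *retention) {
        throw std::invalid_argument(
          "segment size " + std::to_string(new_segment_size)
          + " exceeds local retention " + std::to_string(*retention));
    }
}

int main() {
    topic_properties props;
    props.retention_local_target_bytes = 500ull * 1024 * 1024; // ~500 MiB
    try {
        validate_segment_size(props, 512ull * 1024 * 1024); // 512 MiB
    } catch (const std::invalid_argument& e) {
        std::cout << "rejected: " << e.what() << '\n';
    }
}
```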
Here's a tarball with logs from the cluster that exhibited these errors, along with the tweaked version of many_partitions_test that produced them:
https://drive.google.com/file/d/1P9Tym_oZsxzbuQyAMo4k6FxLg_KzcQOo/view?usp=share_link