Upload error after all cloud segments get GCed #8945

Closed
andrwng opened this issue Feb 17, 2023 · 1 comment · Fixed by #8968
Labels
area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

andrwng commented Feb 17, 2023

Version & Environment

Redpanda version (`rpk version`): dev / ~v23.1.1

What went wrong?

I ran a variant of many_partitions_test in which the cluster used 4 brokers, 315 partitions, and non-infinite retention (~325MiB per partition, i.e. 100GiB split across the partitions). The initial segment size was 4KiB with an initial local retention of 96KiB per partition, and as warmup I wrote 1000 segments per partition. I then bumped the segment size to 512MiB (note: above the retention limit!) and the local retention to ~500MiB (note: below the segment size!). After some more writing and reading, the errors below started popping up on every upload for this partition (the test goes on to perform other restarts, decommissions, etc., but the errors show up before then).
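
To make the misconfiguration concrete, here is a small arithmetic sketch (plain Python with illustrative variable names, not Redpanda's config API) of why the bumped settings let retention GC remove every closed segment, cloud segments included:

```python
MiB, GiB = 1024**2, 1024**3

# Values taken from the description above; names are illustrative only.
partitions = 315
retention_bytes = 100 * GiB // partitions   # ~325MiB of cloud retention per partition
segment_bytes = 512 * MiB                   # segment size after the bump
local_retention_bytes = 500 * MiB           # local retention target after the bump

# A single segment now exceeds both budgets, so once a segment rolls,
# retention GC is entitled to remove everything behind it -- including
# every segment already uploaded to cloud storage.
assert segment_bytes > retention_bytes
assert segment_bytes > local_retention_bytes
```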

On node ip-172-31-3-46
INFO  2023-02-16 22:50:08,526 [shard 14] storage - disk_log_impl.cc:1237 - Removing "/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:4066, committed_offset:4084, dirty_offset:4084}, compacted_segment=0, finished_self_compaction=0, generation={11}, reader={/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.log, (395741 bytes)}, writer=nullptr, cache={cache_size=0}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.base_index, offsets:{4066}, index:{header_bitflags:0, base_offset:{4066}, max_offset:{4084}, base_timestamp:{timestamp: 1676587578473}, max_timestamp:{timestamp: 1676587578473}, batch_timestamps_are_monotonic:1, index(2,2,2)}, step:32768, needs_persistence:0}})
...
On node ip-172-31-2-174
INFO  2023-02-16 22:51:27,867 [shard 13] cluster - ntp: {kafka/scale_000000/165} - archival_metadata_stm.cc:386 - truncate command replicated, truncated up to 8830, remote start_offset: 8830, last_offset: 8829
...
On node ip-172-31-3-46
INFO  2023-02-16 22:54:19,548 [shard 14] raft - [group_id:166, {kafka/scale_000000/165}] vote_stm.cc:279 - became the leader term: 2
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
... a lot more identical errors
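
For context on the error text itself, the following is a simplified model (not Redpanda's actual offset translator or archiver code) of the failing check: after the truncate command above moved the remote start offset to 8830, the new leader's archiver still tries to resume from its local offset 4085, which falls below the range the translation state covers.

```python
# Simplified, illustration-only model of the offset translation bound.
class TranslationState:
    def __init__(self, start_offset: int):
        # Translation state begins at the truncation point, e.g. 8830.
        self.start_offset = start_offset

    def translate(self, log_offset: int) -> int:
        if log_offset < self.start_offset:
            raise RuntimeError(
                f"log offset {log_offset} is outside the translation range "
                f"(starting at {self.start_offset})")
        return log_offset  # the real delta bookkeeping is omitted here

state = TranslationState(start_offset=8830)
try:
    state.translate(4085)   # the archiver's attempted upload start offset
except RuntimeError as e:
    print(e)                # mirrors the upload loop error above
```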

What should have happened instead?

It isn't clear that we can do anything better than error out if we've indeed truncated the manifest beyond what's local to a new leader. Perhaps there are safeguards we could install around increasing segment size (e.g. disallowing a segment size larger than the retention size seems like a reasonable start).
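
One possible shape for such a safeguard (purely hypothetical, using the usual topic property names rather than any existing Redpanda validation hook):

```python
from typing import Optional

# Hypothetical guard: reject a segment size that exceeds the retention budget.
def validate_segment_size(segment_bytes: int,
                          retention_bytes: Optional[int],
                          local_retention_bytes: Optional[int]) -> None:
    limits = {
        "retention.bytes": retention_bytes,
        "retention.local.target.bytes": local_retention_bytes,
    }
    for name, limit in limits.items():
        if limit is not None and segment_bytes > limit:
            raise ValueError(
                f"segment.bytes={segment_bytes} exceeds {name}={limit}; "
                "a single segment would outgrow the retention budget")
```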

Here's a tarball with logs from the cluster that exhibited these errors, and the tweaked version of many_partitions_test that ran with it.
https://drive.google.com/file/d/1P9Tym_oZsxzbuQyAMo4k6FxLg_KzcQOo/view?usp=share_link

andrwng added the kind/bug, area/cloud-storage, and sev/high labels on Feb 17, 2023

jcsp commented Feb 17, 2023

Theory from the chat just now: this could be an off-by-one error in the local retention code.

andrwng added a commit to andrwng/redpanda that referenced this issue Feb 17, 2023
We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945
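
A rough sketch of the behavior the commit message describes (illustrative Python, not the actual ntp_archiver code): the upload start offset should fall back to the manifest's recorded start offset even when every segment in the manifest has been truncated away, rather than being chosen only from the segments that happen to remain.

```python
# Illustrative only: pick where the archiver should resume uploading from.
def choose_upload_start(last_uploaded_offset, manifest_start_offset,
                        local_start_offset):
    if last_uploaded_offset is not None:
        # Manifest still has segments: resume right after the last one.
        return last_uploaded_offset + 1
    if manifest_start_offset is not None:
        # All cloud segments were GCed, but the manifest still records where
        # the remote log now starts; resume from there rather than from the
        # (possibly older) local log start.
        return max(manifest_start_offset, local_start_offset)
    # Nothing has been uploaded yet: start from the beginning of the local log.
    return local_start_offset
```
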
andrwng changed the title from "Upload error when increasing segment size past existing retention limits" to "Upload error after all cloud segments get GCed" on Feb 17, 2023
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 27, 2023
CONFLICT:
- updated SISettings construction which takes no test_context in this
  branch

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 27, 2023
CONFLICT:
- updated SISettings construction which takes no test_context in this
  branch
- backported tests/cloud_retention_test

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)
andrwng added a commit to andrwng/redpanda that referenced this issue Apr 18, 2023
CONFLICT:
- removed tests/cloud_retention_test

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)