Upload error after all cloud segments get GCed #8945

Closed
andrwng opened this issue Feb 17, 2023 · 1 comment · Fixed by #8968
Labels
area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

andrwng commented Feb 17, 2023

Version & Environment

Redpanda version (`rpk version`): dev / ~v23.1.1

What went wrong?

I ran a variant of many_partitions_test in which the cluster used 4 brokers, 315 partitions, and non-infinite retention (~325MiB per partition, i.e. 100GiB split across the partitions). The initial segment size was 4KiB with an initial local retention of 96KiB per partition, and as warmup I wrote 1000 segments per partition. I then bumped the segment size to 512MiB (note: above the retention limit!) and the local retention to ~500MiB (note: below the segment size!). After some more writing and reading, the errors below started popping up on every upload for this partition (the test goes on to perform other restarts, decommissions, etc., but the errors show up before then).
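
To make the misconfiguration concrete, here is a small arithmetic sketch (plain Python with illustrative variable names, not Redpanda's config API) of why the bumped settings let retention GC remove every closed segment, cloud segments included:

```python
MiB, GiB = 1024**2, 1024**3

# Values taken from the description above; names are illustrative only.
partitions = 315
retention_bytes = 100 * GiB // partitions   # ~325MiB of cloud retention per partition
segment_bytes = 512 * MiB                   # segment size after the bump
local_retention_bytes = 500 * MiB           # local retention target after the bump

# A single segment now exceeds both budgets, so once a segment rolls,
# retention GC is entitled to remove everything behind it -- including
# every segment already uploaded to cloud storage.
assert segment_bytes > retention_bytes
assert segment_bytes > local_retention_bytes
```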

On node ip-172-31-3-46
INFO  2023-02-16 22:50:08,526 [shard 14] storage - disk_log_impl.cc:1237 - Removing "/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:4066, committed_offset:4084, dirty_offset:4084}, compacted_segment=0, finished_self_compaction=0, generation={11}, reader={/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.log, (395741 bytes)}, writer=nullptr, cache={cache_size=0}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/scale_000000/165_25/4066-1-v1.base_index, offsets:{4066}, index:{header_bitflags:0, base_offset:{4066}, max_offset:{4084}, base_timestamp:{timestamp: 1676587578473}, max_timestamp:{timestamp: 1676587578473}, batch_timestamps_are_monotonic:1, index(2,2,2)}, step:32768, needs_persistence:0}})
...
On node ip-172-31-2-174
INFO  2023-02-16 22:51:27,867 [shard 13] cluster - ntp: {kafka/scale_000000/165} - archival_metadata_stm.cc:386 - truncate command replicated, truncated up to 8830, remote start_offset: 8830, last_offset: 8829
...
On node ip-172-31-3-46
INFO  2023-02-16 22:54:19,548 [shard 14] raft - [group_id:166, {kafka/scale_000000/165}] vote_stm.cc:279 - became the leader term: 2
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
... a lot more identical errors
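
For context on the error text itself, the following is a simplified model (not Redpanda's actual offset translator or archiver code) of the failing check: after the truncate command above moved the remote start offset to 8830, the new leader's archiver still tries to resume from its local offset 4085, which falls below the range the translation state covers.

```python
# Simplified, illustration-only model of the offset translation bound.
class TranslationState:
    def __init__(self, start_offset: int):
        # Translation state begins at the truncation point, e.g. 8830.
        self.start_offset = start_offset

    def translate(self, log_offset: int) -> int:
        if log_offset < self.start_offset:
            raise RuntimeError(
                f"log offset {log_offset} is outside the translation range "
                f"(starting at {self.start_offset})")
        return log_offset  # the real delta bookkeeping is omitted here

state = TranslationState(start_offset=8830)
try:
    state.translate(4085)   # the archiver's attempted upload start offset
except RuntimeError as e:
    print(e)                # mirrors the upload loop error above
```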

What should have happened instead?

It isn't clear that we can do anything better than error out if we've indeed truncated the manifest beyond what's local to a new leader. Perhaps there are safeguards we could install around increasing segment size (e.g. disallowing a segment size larger than the retention size seems like a reasonable start).
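
One possible shape for such a safeguard (purely hypothetical, using the usual topic property names rather than any existing Redpanda validation hook):

```python
from typing import Optional

# Hypothetical guard: reject a segment size that exceeds the retention budget.
def validate_segment_size(segment_bytes: int,
                          retention_bytes: Optional[int],
                          local_retention_bytes: Optional[int]) -> None:
    limits = {
        "retention.bytes": retention_bytes,
        "retention.local.target.bytes": local_retention_bytes,
    }
    for name, limit in limits.items():
        if limit is not None and segment_bytes > limit:
            raise ValueError(
                f"segment.bytes={segment_bytes} exceeds {name}={limit}; "
                "a single segment would outgrow the retention budget")
```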

Here's a tarball with logs from the cluster that exhibited these errors, and the tweaked version of many_partitions_test that ran with it.
https://drive.google.com/file/d/1P9Tym_oZsxzbuQyAMo4k6FxLg_KzcQOo/view?usp=share_link

andrwng added the kind/bug, area/cloud-storage, and sev/high labels on Feb 17, 2023

jcsp commented Feb 17, 2023

Theory from the chat just now: this could be an off-by-one error in the local retention code.

andrwng added a commit to andrwng/redpanda that referenced this issue Feb 17, 2023
We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945
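
A rough sketch of the behavior the commit message describes (illustrative Python, not the actual ntp_archiver code): the upload start offset should fall back to the manifest's recorded start offset even when every segment in the manifest has been truncated away, rather than being chosen only from the segments that happen to remain.

```python
# Illustrative only: pick where the archiver should resume uploading from.
def choose_upload_start(last_uploaded_offset, manifest_start_offset,
                        local_start_offset):
    if last_uploaded_offset is not None:
        # Manifest still has segments: resume right after the last one.
        return last_uploaded_offset + 1
    if manifest_start_offset is not None:
        # All cloud segments were GCed, but the manifest still records where
        # the remote log now starts; resume from there rather than from the
        # (possibly older) local log start.
        return max(manifest_start_offset, local_start_offset)
    # Nothing has been uploaded yet: start from the beginning of the local log.
    return local_start_offset
```
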
andrwng changed the title from "Upload error when increasing segment size past existing retention limits" to "Upload error after all cloud segments get GCed" on Feb 17, 2023
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 27, 2023
CONFLICT:
- updated SISettings construction which takes no test_context in this
  branch

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)
andrwng added a commit to andrwng/redpanda that referenced this issue Feb 27, 2023
CONFLICT:
- updated SISettings construction which takes no test_context in this
  branch
- backported tests/cloud_retention_test

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)
andrwng added a commit to andrwng/redpanda that referenced this issue Apr 18, 2023
CONFLICT:
- removed tests/cloud_retention_test

We previously predicated on whether there were existing segments to
choose an upload start offset. This wouldn't work in cases where the
manifest is entirely truncated away.

Without this, once attempting to upload after GCing all cloud segments,
we could end up with errors like:

```
ERROR 2023-02-16 22:54:19,549 [shard 14] archival - [fiber51 kafka/scale_000000/165] - ntp_archiver_service.cc:184 - upload loop error: std::runtime_error (ntp {kafka/scale_000000/165}: log offset 4085 is outside the translation range (starting at 8830))
```

Fixes redpanda-data#8945

(cherry picked from commit 6dd451d)