Skip to content

fix(buffers): disk buffers v2 reopen buffer after restart crash loop#25570

Open
blt wants to merge 2 commits into
blt/antithesis_vector_end-to-end_ack_scenariofrom
blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop
Open

fix(buffers): disk buffers v2 reopen buffer after restart crash loop#25570
blt wants to merge 2 commits into
blt/antithesis_vector_end-to-end_ack_scenariofrom
blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop

Conversation

@blt
Copy link
Copy Markdown
Contributor

@blt blt commented Jun 3, 2026

Summary

We have observed in Antithesis scenario vector_to_vector_e2e_disk that
the disk buffer can be driven to a state where Vector will crash loop
without recovery. This happens when the reader is set to resume from a
data file that no longer exists and will not be created.

Strictly, the disk buffer reader is meant to chase the writer. This
happens because the deletion of completed data files is not atomic. When
a reader finishes an acks a data file delete_completed_data_file:

  • unlinks the .dat file
  • advances the reader id in the ledger
  • fsyncs the ledger

If Vector crashes after the unlink but before the ledger is fsync'ed
this will cause the next instance of Vector to attempt to read a .dat
that will never appear, crashing said instance. Mostly. OS page cache
complicates this some, as the newly introduced
reopen_recovers_with_reader_resume_data_file_is_missing discusses.

Vector configuration

How did you test this PR?

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

Copy link
Copy Markdown
Contributor Author

blt commented Jun 3, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@datadog-vectordotdev
Copy link
Copy Markdown

datadog-vectordotdev Bot commented Jun 4, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fe305fa | Docs | Give us feedback!

@blt blt force-pushed the blt/antithesis-harness-first-scenario branch from 3614c5a to 5209ae6 Compare June 4, 2026 00:25
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from db4960b to e3cc97c Compare June 4, 2026 00:25
@blt blt changed the base branch from blt/antithesis-harness-first-scenario to graphite-base/25570 June 4, 2026 01:06
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from e3cc97c to f6542c0 Compare June 4, 2026 01:06
@blt blt changed the base branch from graphite-base/25570 to blt/antithesis_vector_end-to-end_ack_scenario June 4, 2026 01:06
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 3cce5aa to 84df4d7 Compare June 4, 2026 04:48
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from f6542c0 to 61e1f3e Compare June 4, 2026 04:48
of failing the buffer build, so the buffer always reopens and continues delivering.

Disk buffer (v2) durability was also hardened: the directory holding the buffer is now
fsync'd after a data file is created. Previously only file contents were synced, so a
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch 2 times, most recently from 6cf0d05 to 597b732 Compare June 4, 2026 13:17
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 84df4d7 to c066e34 Compare June 4, 2026 13:17
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 597b732 to b3c5011 Compare June 4, 2026 14:57
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch 2 times, most recently from 8ff610c to 4811fb8 Compare June 4, 2026 15:05
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from b3c5011 to 05d79f8 Compare June 4, 2026 15:05
blt added 2 commits June 4, 2026 15:19
We have observed in Antithesis scenario `vector_to_vector_e2e_disk` that
the disk buffer can be driven to a state where Vector will crash loop
without recovery. This happens when the reader is set to resume from a
data file that no longer exists and will not be created.

Strictly, the disk buffer reader is meant to chase the writer. This
happens because the deletion of completed data files is not atomic. When
a reader finishes an acks a data file `delete_completed_data_file`:

* unlinks the .dat file
* advances the reader id in the ledger
* fsyncs the ledger

If Vector crashes after the unlink but before the ledger is fsync'ed
this will cause the next instance of Vector to attempt to read a .dat
that will never appear, crashing said instance. Mostly. OS page cache
complicates this some, as the newly introduced
`reopen_recovers_with_reader_resume_data_file_is_missing` discusses.

REF SMPTNG-749
This commit fixes the crash-loop by carefully re-arranging when the
ledger is updated and ensuring that directory metadata is also
fsynced. This later is important as ledger still updates after the
unlink takes place. If a file is gone we can assume that it has been
read iff the ledger points to a new dat file, meaning that the data was
previously read entirely and readers may skip the 'stale' reference.

I have left this skip event as a warn! but considering how likely this
is in practice in the event of node termination, for example, it might
rightly be a debug.

Overflow of dat file indexes remains a concern but 1. I would like to
demonstrate this in an antithesis run and 2. that belongs in a separate
commit.
@blt blt force-pushed the blt/disk_buffers_v2_reopen_buffer_after_restart_crash_loop branch from 05d79f8 to fe305fa Compare June 4, 2026 15:20
@blt blt force-pushed the blt/antithesis_vector_end-to-end_ack_scenario branch from 4811fb8 to 85cfa4b Compare June 4, 2026 15:20
@github-actions github-actions Bot added the domain: vdev Anything related to the vdev tooling label Jun 4, 2026
@blt blt changed the title disk buffers v2 reopen buffer after restart crash loop fix: disk buffers v2 reopen buffer after restart crash loop Jun 4, 2026
@blt blt changed the title fix: disk buffers v2 reopen buffer after restart crash loop fix(buffers): disk buffers v2 reopen buffer after restart crash loop Jun 4, 2026
@blt blt marked this pull request as ready for review June 4, 2026 15:28
@blt blt requested a review from a team as a code owner June 4, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: vdev Anything related to the vdev tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants