Parquet crash testing unit testing hooks #3028

Merged
jewei1997 merged 10 commits into main from STO-378/parquet-crash-testing on Mar 9, 2026


@jewei1997 jewei1997 commented Mar 5, 2026

Describe your changes and provide context

This PR adds test-only fault-injection hooks to the parquet receipt store so we can simulate crashes at specific points in the write pipeline and validate recovery behavior. The hooks cover the key stages of persistence: after WAL write, before parquet flush, after parquet flush, after closing writers during file rotation, and after WAL clear during rotation.

It also adds a SimulateCrash() helper that intentionally abandons the store without the normal flush/finalization path, which lets the tests mimic abrupt process termination and then reopen the same store directory to verify recovery.

On top of that, this PR adds parquet receipt crash-recovery coverage that:

- verifies recovery at each hook point, including file-rotation scenarios
- runs randomized multi-crash stress tests to ensure WAL-committed blocks remain readable after reopen
- verifies concurrent readers can still read committed receipts and logs while writes are artificially slowed

The goal is to increase confidence in parquet receipt durability and crash recovery behavior without changing normal production behavior outside of tests.
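
The randomized multi-crash check in the list above can be illustrated with a small in-memory model. This is a sketch under assumptions, not the PR's test: `model`, `recoverModel`, and `stress` are hypothetical names, the WAL and flushed files are plain slices, and a crash is modeled as losing the in-memory buffer while keeping the WAL and flushed rows. Recovery replays WAL entries that are not already flushed, which also covers a crash between the flush and the WAL clear.

```go
package main

import "math/rand"

// model is a hypothetical in-memory stand-in for the store's state.
type model struct {
	wal    []uint64 // durable: survives a crash
	files  []uint64 // durable: rows already flushed
	buffer []uint64 // volatile: lost on crash
}

func (m *model) write(h uint64) {
	m.wal = append(m.wal, h)
	m.buffer = append(m.buffer, h)
}

func (m *model) flushData() {
	m.files = append(m.files, m.buffer...)
	m.buffer = nil
}

func (m *model) clearWAL() { m.wal = nil }

// recoverModel simulates a reopen after a crash: the buffer is gone, and WAL
// entries not yet present in the flushed files are replayed into a new buffer.
func recoverModel(m *model) *model {
	out := &model{files: append([]uint64(nil), m.files...)}
	seen := make(map[uint64]bool, len(m.files))
	for _, h := range m.files {
		seen[h] = true
	}
	for _, h := range m.wal {
		if !seen[h] {
			seen[h] = true
			out.wal = append(out.wal, h)
			out.buffer = append(out.buffer, h)
		}
	}
	return out
}

// stress runs random writes, flushes, and crashes, then reports whether every
// committed block is readable exactly once after the final recovery.
func stress(seed int64, steps int) bool {
	rng := rand.New(rand.NewSource(seed))
	m := &model{}
	var next uint64
	for i := 0; i < steps; i++ {
		switch rng.Intn(4) {
		case 0, 1: // commit a new block
			next++
			m.write(next)
		case 2: // clean flush followed by WAL clear
			m.flushData()
			m.clearWAL()
		case 3: // crash, sometimes between flush and WAL clear
			if rng.Intn(2) == 0 {
				m.flushData()
			}
			m = recoverModel(m)
		}
	}
	m = recoverModel(m)
	distinct := make(map[uint64]bool)
	for _, h := range append(append([]uint64(nil), m.files...), m.buffer...) {
		distinct[h] = true
	}
	total := len(m.files) + len(m.buffer)
	return total == int(next) && len(distinct) == total
}
```

Because the generator is seeded explicitly, any failing run of `stress` can be replayed by passing the same seed.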

Testing performed to validate your change

go test ./sei-db/ledger_db/receipt -run 'TestCrashRecoveryAtEachHookPoint|TestCrashRecoveryStress|TestSlowFlushWithConcurrentReads' -count=1

@github-actions

github-actions bot commented Mar 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Mar 9, 2026, 1:49 PM |

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.27%. Comparing base (6cb9631) to head (51fabfe).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3028      +/-   ##
==========================================
- Coverage   58.33%   58.27%   -0.06%     
==========================================
  Files        2079     2077       -2     
  Lines      171723   171262     -461     
==========================================
- Hits       100168    99810     -358     
+ Misses      62630    62563      -67     
+ Partials     8925     8889      -36     
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 74.24% <100.00%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-db/ledger_db/parquet/store.go | 69.66% <100.00%> (+4.98%) ⬆️ |

... and 37 files with indirect coverage changes


@jewei1997 jewei1997 marked this pull request as ready for review March 6, 2026 12:54
// file descriptors and locks so the test process can reopen the same directory.
func (s *Store) SimulateCrash() {
	if s.pruneStop != nil {
		close(s.pruneStop)

Set pruneStop to nil after closing it? Otherwise a later Close() will close the already-closed channel a second time, which panics.
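
One way to make the stop path idempotent, as the comment suggests, is to nil the field after closing it. The sketch below is an assumption about how that could look, not the change that was merged: `worker`, `newWorker`, and `stopPruner` are hypothetical names, and the mutex is added here so that `SimulateCrash` and `Close` can be called from different goroutines.

```go
package main

import "sync"

// worker is a hypothetical stand-in for a store with a background pruner.
type worker struct {
	mu        sync.Mutex
	pruneStop chan struct{}
}

func newWorker() *worker {
	return &worker{pruneStop: make(chan struct{})}
}

// stopPruner closes pruneStop at most once. Setting the field to nil
// afterwards turns every later call into a no-op instead of a panic.
func (w *worker) stopPruner() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pruneStop != nil {
		close(w.pruneStop)
		w.pruneStop = nil
	}
}

// SimulateCrash stops the pruner without any flush or finalization.
func (w *worker) SimulateCrash() { w.stopPruner() }

// Close is safe to call after SimulateCrash because stopPruner is idempotent.
func (w *worker) Close() error {
	w.stopPruner()
	return nil
}
```

With this pattern, the pruning goroutine should capture the channel value when it starts rather than re-reading the field, since receiving from a nil channel blocks forever. `sync.Once` is an alternative that avoids mutating the field.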

// be recoverable via WAL replay.
func TestCrashRecoveryStress(t *testing.T) {
	seed := int64(42)
	t.Logf("random seed: %d (change to reproduce a specific run)", seed)

It looks like the seed is always 42, not random?

@jewei1997 jewei1997 enabled auto-merge (squash) March 9, 2026 13:48
@jewei1997 jewei1997 merged commit aabf783 into main Mar 9, 2026
39 checks passed
@jewei1997 jewei1997 deleted the STO-378/parquet-crash-testing branch March 9, 2026 14:06
yzang2019 pushed a commit that referenced this pull request Mar 19, 2026