WAL utility tears itself down after the first error #3006

Merged
cody-littley merged 6 commits into main from cody-littley/wal-fail-behavior
Mar 3, 2026

@cody-littley
Contributor

Describe your changes and provide context

https://linear.app/seilabs/issue/STO-397/address-wal-feedback

Per feedback, the desired behavior of the WAL utility is that it should tear itself down after it encounters the first error.

Testing performed to validate your change

Unit test coverage.

@github-actions

github-actions bot commented Mar 3, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  Mar 3, 2026, 8:26 PM

@github-actions

github-actions bot commented Mar 3, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  Mar 3, 2026, 2:56 PM

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 58.97436% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.13%. Comparing base (fb21209) to head (8069893).
⚠️ Report is 1 commit behind head on main.

Files with missing lines   Patch %   Lines
sei-db/wal/wal.go          58.97%    25 Missing and 7 partials ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3006      +/-   ##
==========================================
- Coverage   58.26%   58.13%   -0.14%     
==========================================
  Files        2108     2113       +5     
  Lines      173664   174000     +336     
==========================================
- Hits       101181   101147      -34     
- Misses      63456    63798     +342     
- Partials     9027     9055      +28     
Flag Coverage Δ
sei-chain-pr 67.19% <58.97%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
sei-db/wal/wal.go 68.85% <58.97%> (-2.91%) ⬇️

... and 95 files with indirect coverage changes

@@ -341,13 +372,23 @@ func (walLog *WAL[T]) handleTruncate(req *truncateRequest) {
 		err = walLog.log.TruncateBack(req.index)
 	}
 	if err != nil {
-		req.errChan <- fmt.Errorf("failed to truncate: %w", err)
+		err = fmt.Errorf("failed to truncate: %w", err)
+		if strings.Contains(err.Error(), "out of range") {
Contributor

In case the tidwall/wal library changes its error message wording in the future, how about:

	if strings.Contains(err.Error(), "out of range") {
		req.errChan <- fmt.Errorf("failed to truncate: %w", err)
		return
	}
	walLog.reportFatalError(fmt.Errorf("failed to truncate: %w", err), req.errChan)
	return

Contributor Author

I wrapped the error inside the "out of range" block, but I'm not sure if that is what you were asking me to do. New code is below:

	if err != nil {
		err = fmt.Errorf("failed to truncate: %w", err)
		if strings.Contains(err.Error(), "out of range") {
			err = fmt.Errorf("out of range truncate error: %w", err)
			req.errChan <- err
			return
		}
		walLog.reportFatalError(err, req.errChan)
		return
	}
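
The concern in this thread is that matching on the text "out of range" depends on the library's exact wording. A minimal sketch of one alternative follows, assuming the underlying library exposes a sentinel error for this case (an assumption to verify against the library version in use). The name errOutOfRange below is a hypothetical stand-in defined locally, and this is not the code that was merged. Because the PR wraps with %w, errors.Is can still find the sentinel through the added context.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfRange is a hypothetical stand-in for a sentinel error exported by
// the underlying log library. It is defined here only to keep the sketch
// self-contained.
var errOutOfRange = errors.New("out of range")

// wrapTruncateErr adds context with %w, as the PR does, so the original
// error stays reachable through the wrap chain.
func wrapTruncateErr(err error) error {
	return fmt.Errorf("failed to truncate: %w", err)
}

// isOutOfRange reports whether err is, or wraps, the out-of-range sentinel.
// It compares error identity rather than message text.
func isOutOfRange(err error) bool {
	return errors.Is(err, errOutOfRange)
}

func main() {
	// A wrapped sentinel is still recognized.
	fmt.Println(isOutOfRange(wrapTruncateErr(errOutOfRange))) // true

	// An unrelated error is not, even if its text happens to look similar.
	fmt.Println(isOutOfRange(wrapTruncateErr(errors.New("index out of range")))) // false
}
```

With a check like this, the non-fatal branch would keep working if the message wording changed, as long as the sentinel value itself stays stable.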

// Store on heap so the pointer remains valid after this function returns.
p := new(error)
*p = err
walLog.asyncError.Store(p)
Contributor

Should we call walLog.cancel() in reportFatalError()?

Contributor Author

Explicitly calling cancel() is not required. When asyncError is set, the loop exits, and immediately after the loop exits the context is cancelled.

	for running && walLog.asyncError.Load() == nil {
		select {
		case <-walLog.ctx.Done():
			running = false
		case req := <-walLog.writeChan:
			walLog.handleWrite(req)
		case req := <-walLog.truncateChan:
			walLog.handleTruncate(req)
		case <-pruneChan:
			walLog.prune()
		case <-walLog.closeReqChan:
			running = false
		}
	}

	walLog.cancel()
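
The behavior described in this reply can be reproduced in isolation. The sketch below is a simplified model, not the WAL implementation: miniLoop, workChan, handled, and runDemo are hypothetical names, and a single work channel stands in for the write, truncate, prune, and close channels. It shows that once the error slot is set, the loop condition fails before the next select, and the context is cancelled right after the loop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// miniLoop is a hypothetical, stripped-down model of the event loop above.
type miniLoop struct {
	ctx        context.Context
	cancel     context.CancelFunc
	asyncError atomic.Pointer[error]
	workChan   chan func() error
	handled    int // number of work items taken off the channel
}

func newMiniLoop(buffer int) *miniLoop {
	ctx, cancel := context.WithCancel(context.Background())
	return &miniLoop{ctx: ctx, cancel: cancel, workChan: make(chan func() error, buffer)}
}

// run processes work until the context is done or an error has been recorded,
// then cancels the context, mirroring the structure quoted in the thread.
func (l *miniLoop) run() {
	running := true
	for running && l.asyncError.Load() == nil {
		select {
		case <-l.ctx.Done():
			running = false
		case work := <-l.workChan:
			l.handled++
			if err := work(); err != nil {
				p := new(error) // heap-allocated so the pointer outlives this call
				*p = err
				l.asyncError.Store(p)
			}
		}
	}
	l.cancel()
}

// runDemo queues three items, the second of which fails, and runs the loop.
func runDemo() *miniLoop {
	l := newMiniLoop(3)
	l.workChan <- func() error { return nil }
	l.workChan <- func() error { return errors.New("boom") }
	l.workChan <- func() error { return nil }
	l.run()
	return l
}

func main() {
	l := runDemo()
	fmt.Println(l.handled)             // 2: the third item is never processed
	fmt.Println(l.ctx.Err())           // context canceled
	fmt.Println(*l.asyncError.Load()) // boom
}
```

The third item stays in the channel, which is why a real teardown path also needs to drain pending requests and answer their callers.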

// Store on heap so the pointer remains valid after this function returns.
p := new(error)
*p = err
walLog.asyncError.Store(p)
Contributor

If we have multiple errors, will the later errors replace the previous one here? And are we OK with that?

Contributor Author

I went back and checked, and there was one edge case where this was possible: during drain(), while the system is being torn down. I fixed that case, so it should no longer be possible for asyncError to be set more than once.
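
The fix described here removes the second write at its source. A different way to get "first error wins" is to enforce it at the store itself with a compare-and-swap. The sketch below illustrates that alternative under stated assumptions and is not what the PR does: firstError, record, get, errFirst, and errSecond are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical example errors used by the demonstration below.
var (
	errFirst  = errors.New("write failed")
	errSecond = errors.New("drain failed")
)

// firstError keeps only the first error recorded into it.
type firstError struct {
	slot atomic.Pointer[error]
}

// record stores err only if the slot is still empty and reports whether this
// call was the one that set it. Later calls leave the first error untouched.
func (f *firstError) record(err error) bool {
	p := new(error)
	*p = err
	return f.slot.CompareAndSwap(nil, p)
}

// get returns the recorded error, or nil if none has been recorded.
func (f *firstError) get() error {
	if p := f.slot.Load(); p != nil {
		return *p
	}
	return nil
}

func main() {
	var f firstError
	fmt.Println(f.record(errFirst))  // true
	fmt.Println(f.record(errSecond)) // false
	fmt.Println(f.get())             // write failed
}
```

A plain Store, as in the quoted snippet, would let the later error overwrite the earlier one, so the guarantee in that design rests on there being exactly one call site that can run after a failure.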

@cody-littley cody-littley enabled auto-merge (squash) March 3, 2026 18:47
@cody-littley cody-littley merged commit ed4946d into main Mar 3, 2026
35 checks passed
@cody-littley cody-littley deleted the cody-littley/wal-fail-behavior branch March 3, 2026 20:41
yzang2019 pushed a commit that referenced this pull request Mar 19, 2026
Co-authored-by: Cody Littley <cody.littley@seinetwork.io>