Failure to fsync on XDR stream close #2202

Closed
graydon opened this issue Jul 18, 2019 · 0 comments · Fixed by #2204

graydon commented Jul 18, 2019

As noted in #2123 and #1395, we are seeing more instances of "missing buckets" than misconfiguration alone would explain (misconfiguration sometimes causes them, but not always).

On investigation with a node operator, we found a state that is best explained by our failure to fsync bucket writes:

  • Bucket went missing at moment of node crash (cloud provider killed host)
  • SQL database matched public network state (postgres WAL integrity maintained)
  • Once restored, subsequent buckets diverged, and the difference appeared to be only missing entries or updates from lower levels (i.e. some merge restarted against a zero-sized bucket)

Three fixes are needed for this. This bug tracks one of them: properly fsync() a bucket (or any XDR file) when we close it, so that we don't lose it even if the OS crashes.
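
To make the idea concrete, here is a minimal sketch of "fsync on close", written in Python for brevity rather than in the project's own code. The class name `DurableFileWriter` and its methods are hypothetical and only illustrate the pattern: drain the userspace buffer, ask the OS to push its cached pages to stable storage, and only then close the file.

```python
import os


class DurableFileWriter:
    """Hypothetical writer whose close() forces the data to stable storage."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "wb")

    def write(self, payload: bytes) -> None:
        # Data may sit in Python's buffer and the OS page cache after this.
        self._file.write(payload)

    def close(self) -> None:
        if self._file.closed:
            return  # closing twice is harmless
        try:
            self._file.flush()             # userspace buffer -> OS
            os.fsync(self._file.fileno())  # OS page cache -> storage device
        finally:
            self._file.close()

    def __enter__(self) -> "DurableFileWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()
```

Without the `os.fsync` call, a host crash shortly after close can leave a file that exists but is empty or truncated, which would be consistent with the zero-sized bucket described above. On POSIX systems, making a newly created file's directory entry durable may additionally require an fsync on the containing directory; that step is left out of this sketch.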

The other two fixes are better diagnostics (tracked with #2123) and verification of hashes while reading buckets (#2203).
