handle corrupt wal file #573

Closed
ebuchman opened this Issue Jul 7, 2017 · 13 comments

Contributor

ebuchman commented Jul 7, 2017

The node should still be able to start if arbitrary bytes have been truncated from the end of the file.

We should add a checksum and line length to each entry to help detect corruption.

aphyr commented Aug 12, 2017

To elaborate somewhat: we've discussed three theoretical issues in Tendermint's write-ahead log.

  1. WAL entries are not checksummed, which potentially allows bit-flips or truncation to go undetected in WAL records. These may be detected by the cryptographic hash embedded in each transaction.
  2. We're not sure whether WAL writes are fsynced before being acknowledged to the client, which might, in theory, allow committed operations to be rolled back.
  3. Depending on the filesystem, a write of a log entry may not be appended atomically. If the node crashes prior to an fsync, some sequence of bytes might be left in the WAL that is not recoverable.

Exploratory testing with process crashes has failed to produce WAL safety issues, but I haven't yet tested WAL truncation, node power failure, or filesystem-level snapshots. It might be worth testing with those failure modes to see what happens.

One simple approach to writing recoverable WALs is to write a fixed-size checksum, a fixed-size length header, and then the variable-length record. Errors can be detected whenever a record fails its checksum, or when not enough bytes are present in the file to complete a read. For more advanced WAL formats, you might consider taking a look at the SQLite WAL format, which incorporates checkpoints.
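
A minimal sketch of that fixed-header approach in Go (my own illustration, not the Tendermint format): each record is written as a 4-byte CRC-32 of the payload, a 4-byte big-endian length, then the payload itself.

    package wal

    import (
        "encoding/binary"
        "hash/crc32"
        "io"
    )

    // writeRecord appends one checksum- and length-prefixed record to w.
    // A reader can later detect truncation (short header or payload) and
    // bit-flips (checksum mismatch).
    func writeRecord(w io.Writer, data []byte) error {
        var header [8]byte
        binary.BigEndian.PutUint32(header[0:4], crc32.ChecksumIEEE(data)) // 4-byte CRC-32 of data
        binary.BigEndian.PutUint32(header[4:8], uint32(len(data)))        // 4-byte length of data
        if _, err := w.Write(header[:]); err != nil {
            return err
        }
        _, err := w.Write(data)
        return err
    }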

aphyr commented Aug 25, 2017

To follow up: the truncate-tendermint nemesis now lets us test the effects of arbitrary truncation of the Tendermint WAL file; whereas Merkleeyes crashes, Tendermint silently ignores corrupt entries at the end of the file. This is probably fine, though a log message might be helpful for administrators.

However, we are now relatively certain that Tendermint can lose acknowledged writes due to the lack of any fsync call on the write-ahead log. Saving an entry to the WAL calls writeLine to append data to the autofile group backing the WAL, then calls group.Flush() to flush that data to persistent storage. group.Flush() in turn proxies to headBuf.flush(), flushing the active bufio.BufferedWriter. However, BufferedWriter wraps a Writer, not another BufferedWriter, and the way it flushes is by calling io.Writer.Write() to clear the BufferedWriter's buffer.

The io.Writer we're wrapping here is AutoFile, whose Write method calls os.File.Write(), performing an unbuffered write to the operating system, where, I assume, it sits in the OS buffers awaiting sync. This means that Wal.Save does not, in fact, ensure the saved operation is synced to disk before returning.

There is a Sync function for AutoFile, but it's not invoked. I think all that has to be done here is to call head.Sync() after the headBuf.Flush() call in Group.Flush().
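
A minimal sketch of that change, assuming the group holds the active file as head (*os.File) and its buffered writer as headBuf (*bufio.Writer); the names follow the discussion here rather than the actual tmlibs autofile code.

    package autofile

    import (
        "bufio"
        "os"
    )

    // Group is a simplified stand-in for the autofile group backing the WAL.
    type Group struct {
        head    *os.File      // file currently being appended to
        headBuf *bufio.Writer // in-process buffer wrapping head
    }

    // Flush drains the in-process buffer and then fsyncs the underlying file,
    // so that saving a WAL entry only returns once it is on stable storage.
    func (g *Group) Flush() error {
        if err := g.headBuf.Flush(); err != nil {
            return err
        }
        return g.head.Sync()
    }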

Eventually, you'll want to have a thread that periodically syncs the file every n milliseconds or n writes, and have WAL writes be asynchronous, completing whenever the flushing thread has written them to disk. That way you can batch multiple WAL writes into a single fsync call.
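
A rough sketch of that batching idea, using a hypothetical syncer type rather than anything in tmlibs: writes are acknowledged only after the periodic fsync that covers them completes, so many WAL appends share a single fsync.

    package main

    import (
        "os"
        "sync"
        "time"
    )

    type syncer struct {
        mu      sync.Mutex
        f       *os.File
        waiters []chan error
    }

    // Write appends data and returns a channel that receives the result of
    // the fsync covering this write.
    func (s *syncer) Write(data []byte) <-chan error {
        done := make(chan error, 1)
        s.mu.Lock()
        defer s.mu.Unlock()
        if _, err := s.f.Write(data); err != nil {
            done <- err
            return done
        }
        s.waiters = append(s.waiters, done)
        return done
    }

    // run fsyncs every interval and notifies all writes covered by that sync.
    func (s *syncer) run(interval time.Duration) {
        for range time.Tick(interval) {
            s.mu.Lock()
            waiters := s.waiters
            s.waiters = nil
            err := s.f.Sync()
            s.mu.Unlock()
            for _, w := range waiters {
                w <- err
            }
        }
    }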

Contributor

ebuchman commented Aug 25, 2017

Thanks Kyle.

Note that we do periodically close the autofile and reopen it on the next write, i.e.

https://github.com/tendermint/tmlibs/blob/master/autofile/autofile.go#L72

and

https://github.com/tendermint/tmlibs/blob/master/autofile/autofile.go#L93

Currently this happens every second.
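
A simplified sketch of the behavior those links point to (my own stand-in, not the actual tmlibs AutoFile): the file is closed on a timer and lazily reopened by the next write. Closing a file does not imply an fsync.

    package autofile

    import (
        "os"
        "sync"
        "time"
    )

    // AutoFile is a simplified stand-in: a file handle that is closed
    // periodically and reopened on demand by the next Write.
    type AutoFile struct {
        mu   sync.Mutex
        path string
        file *os.File
    }

    // closeRoutine closes the file on every tick (e.g. every second).
    // close(2) does not force data to stable storage; it is not an fsync.
    func (af *AutoFile) closeRoutine(interval time.Duration) {
        for range time.Tick(interval) {
            af.mu.Lock()
            if af.file != nil {
                af.file.Close()
                af.file = nil
            }
            af.mu.Unlock()
        }
    }

    // Write reopens the file if it was closed, then appends b.
    func (af *AutoFile) Write(b []byte) (int, error) {
        af.mu.Lock()
        defer af.mu.Unlock()
        if af.file == nil {
            f, err := os.OpenFile(af.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0600)
            if err != nil {
                return 0, err
            }
            af.file = f
        }
        return af.file.Write(b)
    }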

aphyr commented Aug 25, 2017

Note that close(2) does not call fsync; closed files may not be persisted to disk.

Contributor

melekes commented Sep 20, 2017

A 4-byte checksum (uint32) and a 4-byte length (uint32) should be enough:

+----------------------+-------------------------+------+
| crc checksum of data | length of data in bytes | data |
+----------------------+-------------------------+------+
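
A small read-side sketch under that layout (again only an illustration, not the actual implementation): a short header, a short payload, or a checksum mismatch all mark the record, and everything after it, as corrupt.

    package wal

    import (
        "encoding/binary"
        "errors"
        "hash/crc32"
        "io"
    )

    // ErrCorrupt is returned for a truncated or checksum-failing record.
    var ErrCorrupt = errors.New("wal: corrupt or truncated record")

    // readRecord reads one [crc32][length][data] record from r.
    func readRecord(r io.Reader) ([]byte, error) {
        var header [8]byte
        if _, err := io.ReadFull(r, header[:]); err != nil {
            if err == io.EOF {
                return nil, io.EOF // clean end of the WAL
            }
            return nil, ErrCorrupt // header cut off mid-record
        }
        sum := binary.BigEndian.Uint32(header[0:4])
        length := binary.BigEndian.Uint32(header[4:8])
        data := make([]byte, length)
        if _, err := io.ReadFull(r, data); err != nil {
            return nil, ErrCorrupt // payload truncated
        }
        if crc32.ChecksumIEEE(data) != sum {
            return nil, ErrCorrupt // bit-flip or partial overwrite
        }
        return data, nil
    }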

Contributor

melekes commented Sep 21, 2017

From https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html:

Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant, since WAL file updates will not be forced out at all. Possible values are:

open_datasync (write WAL files with open() option O_DSYNC)

fdatasync (call fdatasync() at each commit)

fsync (call fsync() at each commit)

fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

open_sync (write WAL files with open() option O_SYNC)

The open_* options also use O_DIRECT if available. Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform, except that fdatasync is the default on Linux. The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed in Section 29.1. This parameter can only be set in the postgresql.conf file or on the server command line.
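
In Go terms, the difference between the open_sync and fsync styles above looks roughly like this (walPath is a hypothetical path; this is only a sketch of the two system-call patterns):

    package main

    import "os"

    // open_sync style: O_SYNC makes every Write durable before it returns.
    func openSync(walPath string) (*os.File, error) {
        return os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY|os.O_SYNC, 0600)
    }

    // fsync style: an ordinary write, made durable by an explicit Sync (fsync) at commit.
    func writeAndFsync(f *os.File, entry []byte) error {
        if _, err := f.Write(entry); err != nil {
            return err
        }
        return f.Sync()
    }
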
Contributor

melekes commented Sep 23, 2017

  1. #672
  2. tendermint/tmlibs#52
  3. should be fixed by (1)

What's left:

  • check (1) using Jepsen
  • (2) and (3) using ???
Contributor

ebuchman commented Nov 8, 2017

If the WAL is corrupted, we now detect it, but just abort. There are a few cases:

  • the corruption is in an older height. In this case, we don't care, because we only replay messages from the current height, so we can just note the error and move on.
  • the corruption is in the latest message/s (i.e. at the end of the WAL). In this case, we should note the error and just start the consensus state. Hopefully we're fine.
  • the corruption is not in the latest message, but is in a message in the latest height. In this case, we should note the error and continue trying to replay subsequent messages. We may have an issue since we will have missed a message, but it's probably better to play the additional messages than to just skip them and start the consensus.

@aphyr does that sound right to you? We should be able to test case 2 with the Jepsen truncation nemesis, but we probably need something else for cases 1/3.

Contributor

milosevic commented Nov 8, 2017

We shouldn't forget to spec this part also.

aphyr commented Nov 8, 2017

It sounds like no matter what happens to the WAL, you plan to continue operating? That... doesn't sound particularly safe to me. I think that if you are ever in a situation where you may have lost information, you should abort with a prominent error rather than start; users are not going to read the logs. Give them a special command to clean up the WAL as best you can, like --recover-as-much-as-possible.

melekes added a commit that referenced this issue Dec 12, 2017

Contributor

zramsay commented Dec 16, 2017

@zramsay zramsay closed this Dec 16, 2017

aphyr commented Dec 16, 2017

Contributor

ebuchman commented Dec 20, 2017

Yeah, we've been thinking through this. I think for now our solution is as suggested: fail if corruption is detected, and point the user to the corruption docs: https://github.com/tendermint/tendermint/blob/develop/docs/specification/corruption.rst#wal-corruption

They can then try to repair manually, or just delete all lines from the corrupted one onward and try to resync.

Thanks Kyle.

ebuchman pushed a commit that referenced this issue Jun 20, 2018

call fsync after flush (Refs #573)
short: flushing the bufio buffer is not enough to ensure data
consistency.

long:
Saving an entry to the WAL calls writeLine to append data to the
autofile group backing the WAL, then calls group.Flush() to flush that
data to persistent storage. group.Flush() in turn proxies to
headBuf.flush(), flushing the active bufio.BufferedWriter. However,
BufferedWriter wraps a Writer, not another BufferedWriter, and the way
it flushes is by calling io.Writer.Write() to clear the BufferedWriter's
buffer. The io.Writer we're wrapping here is AutoFile, whose Write
method calls os.File.Write(), performing an unbuffered write to the
operating system, where, I assume, it sits in the OS buffers awaiting
sync. This means that Wal.Save does not, in fact, ensure the saved
operation is synced to disk before returning.