rollback to a consistent state after fsync error #131
Conversation
invariant: non-active log files are always synced and safe. Introduce `pre_write` and `post_write` for `PipeLog` to acquire or release a `WriteContext`, which can be used to track the sync state. If `fsync` raises an error, the log will roll back to the last synced state and set errors for the writers in the current write group. But if any rolled-back bytes belong to writers from previous write groups, RaftEngine will panic, because that means data the user believes is safe has actually been lost. `new_log_file` is divided into 2 phases: first sync the old log file, then rotate to the new log file. The second phase is safe to retry if `sync_dir` fails, because the new log will overwrite the failed one. Signed-off-by: MrCroxx <mrcroxx@outlook.com>
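For readers following along, here is a minimal sketch of the `pre_write`/`post_write` shape described in this commit message. The trait name `PipeLogSketch` and the `WriteContext` fields are assumptions for illustration, not the crate's actual definitions.

```rust
use std::io::Result;

// Illustrative only: the real types live in the raft-engine sources.
struct WriteContext {
    /// Offset known to be durable (the last synced state).
    synced: u64,
    /// Offset written into the OS buffer but not yet fsync'ed.
    written: u64,
}

trait PipeLogSketch {
    /// Called before a write group starts; snapshots the current sync state.
    fn pre_write(&mut self) -> WriteContext;

    /// Called after the write group; fsyncs and, on failure, rolls the
    /// in-memory offsets back to `ctx.synced` before returning the error.
    fn post_write(&mut self, ctx: WriteContext) -> Result<()>;
}
```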
Force-pushed from 0609f78 to 4d22351
Codecov Report
@@ Coverage Diff @@
## master #131 +/- ##
==========================================
+ Coverage 96.69% 97.02% +0.33%
==========================================
Files 22 23 +1
Lines 5816 6194 +378
==========================================
+ Hits 5624 6010 +386
+ Misses 192 184 -8
Continue to review full report at Codecov.
On second thought, I think it's risky to silently roll back on a buffer write failure. Consider this case:
- write (r=1, index=1)
- write (r=1, index=2,3,4) fails, but index 2-3 is written to disk
- write (r=2, index=1) succeeds; this write overwrites (r=1, index=2), but (r=1, index=3) remains
- engine restarts, the data is inconsistent
This is unsolvable unless we tolerate logical errors when recovering tail records, but that's difficult to reason about.
I think it would be better to physically truncate the file when a buffer write fails. This truncate call could fail too, so we must keep track of outstanding dirty writes in a counter (e.g. written_dirty). The next buffer writes must attempt to do the truncate again.
A rough sketch of my thoughts:
buffered_write:
    if written_dirty > written {
        truncate(written)?;
    }
    let r = pwrite(...);
    if r.is_err() {
        if truncate(written).is_err() {
            written_dirty += len;
        }
    } else {
        written += len;
    }
    return r;

post_write:
    if let Err(e) = fsync() {
        // panic if needed
        if truncate(last_sync).is_err() {
            written_dirty = written;
        }
        written = last_sync;
        return e;
    }
    if needs_rotate {
        // Errors from these calls don't bubble up to the user.
        if ftruncate(old_file).is_ok() && open(new_file).is_ok() && fsync_dir().is_ok() {
            if pwrite(header).is_err() {
                written_dirty = header.len;
            } else {
                written = header.len;
            }
        }
    }
src/file_pipe_log.rs
Outdated
return Err(Error::Fsync(ctx.synced, e.to_string()));
}
if ctx.syncable == Syncable::NeedRotate {
    self.mut_queue(queue).rotate()?;
When `last_sync == 0` and `written == LOG_FILE_HEADER_LEN`, we should tolerate the fsync error.
The impl splits rotation into 2 parts: first call `truncate_and_sync` for the current log, then call `rotate` to create and move the writer to the new log file. So calling `rotate` here means the old log file has already been truncated and synced.
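As a side note, a minimal sketch of that two-phase rotation under the stated invariant; the types and method bodies below are illustrative stand-ins, not the actual code in src/file_pipe_log.rs.

```rust
use std::io::Result;

// Illustrative stand-ins for the real types in src/file_pipe_log.rs.
struct ActiveFile;
impl ActiveFile {
    /// Phase 1: truncate away any dirty tail and fsync, making the
    /// soon-to-be non-active file durable.
    fn truncate_and_sync(&mut self) -> Result<()> { Ok(()) }
}

struct Queue {
    active_file: ActiveFile,
}

impl Queue {
    /// Phase 2: create the next file, fsync the directory, and move the
    /// writer onto it. Retrying after a `sync_dir` failure is safe because
    /// the retried rotation overwrites the half-created file.
    fn rotate(&mut self) -> Result<()> { Ok(()) }

    /// Rotation is split into two phases so the invariant
    /// "non-active log files are always synced and safe" keeps holding.
    fn new_log_file(&mut self) -> Result<()> {
        self.active_file.truncate_and_sync()?;
        self.rotate()
    }
}
```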
@tabokie Agree with you on the case.
IMO, I think this method shares the same outcome as if the engine restarts at the moment the fsync error happens. If we keep
It's not needed, written_dirty is only there to block incoming writes (actually a bool variable suffices). If a dirty write isn't overwritten by other writes, then the disk state is still consistent. And I made a mistake with the case: ideally a log batch can only be written to disk atomically, so step 2 is unlikely to happen. So I guess my proposal can be reduced to only physically truncating the file on "write group" fsync failure, not on buffered write failure.
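To make the reduced proposal concrete, here is a rough sketch of a boolean dirty flag that blocks incoming writes until a pending truncate goes through. Field and method names are assumptions for illustration, not the crate's real API.

```rust
use std::io::{Error, ErrorKind, Result};

// Illustrative sketch only.
struct ActiveLog {
    written: u64,
    last_sync: u64,
    // Set when the truncate after an fsync failure did not go through;
    // while set, incoming writes are rejected so no dirty bytes can be
    // overwritten by later data.
    dirty: bool,
}

impl ActiveLog {
    fn buffered_write(&mut self, len: u64) -> Result<()> {
        if self.dirty {
            return Err(Error::new(ErrorKind::Other, "pending truncate after fsync failure"));
        }
        // ... pwrite the bytes here ...
        self.written += len;
        Ok(())
    }

    fn sync(
        &mut self,
        fsync: impl Fn() -> Result<()>,
        truncate: impl Fn(u64) -> Result<()>,
    ) -> Result<()> {
        if let Err(e) = fsync() {
            // Physically drop the un-synced tail; if that also fails, block
            // further writes (or panic, as later decided in this PR).
            self.dirty = truncate(self.last_sync).is_err();
            self.written = self.last_sync;
            return Err(e);
        }
        self.last_sync = self.written;
        Ok(())
    }
}
```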
Force-pushed from 7b85746 to 9b07cbd
Needs unit tests. I've added some IO failpoints in #139, every one of them needs to be tested. You can use `catch_unwind_silent` to test the panics.
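For reference, such a helper is usually a thin wrapper around `std::panic::catch_unwind` that temporarily silences the panic hook; the version below is a sketch of that common pattern and may differ from the project's own `catch_unwind_silent`.

```rust
use std::panic::{self, AssertUnwindSafe};

// Runs `f`, catching an expected panic without letting the default panic
// hook spam the test output. Restores the previous hook afterwards.
fn catch_unwind_silent<F, R>(f: F) -> std::thread::Result<R>
where
    F: FnOnce() -> R,
{
    let prev_hook = panic::take_hook();
    panic::set_hook(Box::new(|_| {}));
    let result = panic::catch_unwind(AssertUnwindSafe(f));
    panic::set_hook(prev_hook);
    result
}

// Usage: assert!(catch_unwind_silent(|| panic!("boom")).is_err());
```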
.append(LogQueue::Rewrite, log_batch.encoded_bytes(), sync)?;
let file_handle = self
    .pipe_log
    .append(LogQueue::Rewrite, log_batch.encoded_bytes())?;
You are not calling truncate after this failure.
Panics directly in the following cases:
- Truncate failed after buffer write error.
- Sync or rotate failed.
Signed-off-by: MrCroxx <mrcroxx@outlook.com>
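A small sketch of the first case above, i.e. panicking when the cleanup truncate itself fails after a buffer write error; names are illustrative, not the actual code.

```rust
use std::io::Result;

// After a failed buffered write, try to physically drop the partial bytes.
// If even the truncate fails, there is no consistent state left to fall
// back to, so panic instead of keeping a possibly corrupt tail.
fn handle_buffer_write_error(truncate: impl Fn(u64) -> Result<()>, written: u64) {
    if let Err(e) = truncate(written) {
        panic!("error when truncating after a failed buffer write: {}", e);
    }
}
```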
Force-pushed from eb64eac to 1d3fb70
Cargo.toml
Outdated
@@ -58,3 +58,9 @@ members = [ "stress" ]
name = "bench_recovery"
harness = false
required-features = ["failpoints"]

[[test]]
You should send another PR to move all existing failpoint tests to this folder.
OK. I'd like to move the io error test to this folder first in this PR, then send another one to move the existing tests.
Then add a mod.rs for the failpoints folder; listing individual tests in the main Cargo.toml doesn't look good.
@@ -160,9 +161,6 @@ impl<W: Seek + Write> ActiveFile<W> {
}

fn rotate(&mut self, fd: Arc<LogFd>, writer: W) -> Result<()> {
maybe rename it to reset
failpoints/io_error_test.rs
Outdated
// b0 (ctx); b1 success; b2 fail, truncate; b3 success
let mut hook = FailpointsHook::new();
hook.register_pre_append_action(2, || {
I don't particularly like this approach. Could you try using `cfg_callback`? Might be easier.
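For illustration, a rough sketch of what driving the injection through `fail::cfg_callback` could look like; the failpoint names are assumptions borrowed from the snippet discussed below, not a definitive setup.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Count writes from a callback and flip another failpoint on the third one,
// instead of wiring a custom hook into the engine.
fn inject_write_error_on_third_write() {
    let counter = AtomicU64::new(0);
    fail::cfg_callback("engine::write::pre", move || {
        if counter.fetch_add(1, Ordering::SeqCst) == 2 {
            fail::cfg("log_fd::write::err", "return").unwrap();
        }
    })
    .unwrap();
}
```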
and fix ci Signed-off-by: MrCroxx <mrcroxx@outlook.com>
Tests are failing.
failpoints/io_error_test.rs
Outdated
let timer = AtomicU64::new(0);
fail::cfg_callback("engine::write::pre", move || {
    match timer.fetch_add(1, std::sync::atomic::Ordering::SeqCst) {
        2 => fail::cfg("log_fd::write::post_err", "return").unwrap(),
What's the difference between `post_err` and `err`? If you want to test that the file is truncated, an engine restart is needed.
This is for checking whether the following appends correctly overwrite the previous ones.
Yep. It works fine in my env and I'm looking for the reason.
Force-pushed from 09ed081 to 790bdce
if manager.active_file.written >= manager.rotate_size {
    if let Err(e) = manager.rotate() {
        panic!(
It says this panic is not covered.
Just found where the problem lies, fixing.
tests/failpoints/test_io_error.rs
Outdated
fail::cfg("active_file::truncate::force", "return").unwrap(); | ||
fail::cfg("log_fd::truncate::err", "return").unwrap(); | ||
assert!(catch_unwind_silent(|| { | ||
write_tmp_engine(ReadableSize::kb(1024), ReadableSize::kb(4), 1024, 1, 4) |
`write_tmp_engine` seems like overkill.
stress --regions 1000 --write-sync true --time 600 [current] [master]
Make `PipeLog` try to roll back to a consistent state after fsync error.

According to PostgreSQL's fsync() surprise, `fsync` is not retriable. We must make the pipe log roll back to where it last synced.

What's changed?

Invariant: non-active log files are always synced and safe.

Updated: after discussion, the final decisions of this PR cover the following cases:
- `fsync` error.
- `write` error during buffer writing of the log file.

Introduce `pre_write` and `post_write` for `PipeLog` to acquire or release a `WriteContext`, which can be used to track the sync state.

If `fsync` raises an error, the log will roll back to the last synced state and set errors for the writers in the current write group. But if any rolled-back bytes belong to writers from previous write groups, RaftEngine will panic, because that means data the user believes is safe has actually been lost.

`new_log_file` is divided into 2 phases. First truncate and sync the old log file, then rotate to a new log file. The second phase is safe to retry if `sync_dir` fails, because the new log will overwrite the failed one.

TODO:

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
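To make the panic condition above concrete, here is a minimal sketch of the check an fsync-error path could perform; the function and parameter names are hypothetical, not the PR's actual code.

```rust
// `acked` is the offset already acknowledged to previous write groups,
// `last_sync` the offset the log is rolled back to after the fsync error.
fn on_fsync_error(acked: u64, last_sync: u64) {
    // Rolling back past data that earlier write groups were told is durable
    // means that data is lost; the only safe reaction is to panic.
    if acked > last_sync {
        panic!(
            "fsync failed: {} bytes acknowledged to previous write groups were rolled back",
            acked - last_sync
        );
    }
}
```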