rollback to a consistent state after fsync error #131
Conversation
invariant: non-active log files are always synced and safe. Introduce `pre_write` and `post_write` for `PipeLog` to acquire or release a `WriteContext`, which can be used to track the sync state. If `fsync` raises an error, the log will roll back to the last synced state and set errors for the writers in the current write group. But if any rolled-back bytes belong to writers from previous write groups, RaftEngine will panic, because that means data the user believes is safe has actually been lost. `new_log_file` is divided into 2 phases: first sync the old log file, then rotate to the new log file. The second phase is safe to retry if `sync_dir` fails, because the new log will overwrite the failed one. Signed-off-by: MrCroxx <mrcroxx@outlook.com>
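For readers following along, here is a minimal sketch of the `pre_write`/`post_write` shape described in this commit message. The trait name `PipeLogSketch` and the `WriteContext` fields are assumptions for illustration, not the crate's actual definitions.

```rust
use std::io::Result;

// Illustrative only: the real types live in the raft-engine sources.
struct WriteContext {
    /// Offset known to be durable (the last synced state).
    synced: u64,
    /// Offset written into the OS buffer but not yet fsync'ed.
    written: u64,
}

trait PipeLogSketch {
    /// Called before a write group starts; snapshots the current sync state.
    fn pre_write(&mut self) -> WriteContext;

    /// Called after the write group; fsyncs and, on failure, rolls the
    /// in-memory offsets back to `ctx.synced` before returning the error.
    fn post_write(&mut self, ctx: WriteContext) -> Result<()>;
}
```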
Force-pushed from 0609f78 to 4d22351
Codecov Report
@@ Coverage Diff @@
## master #131 +/- ##
==========================================
+ Coverage 96.69% 97.02% +0.33%
==========================================
Files 22 23 +1
Lines 5816 6194 +378
==========================================
+ Hits 5624 6010 +386
+ Misses 192 184 -8
Continue to review full report at Codecov.
On second thought, I think it's risky to silently roll back on a buffer write failure. Consider this case:
- write (r=1, index=1)
- write (r=1, index=2,3,4) fails, but index 2-3 is written to disk
- write (r=2, index=1) succeeds; this write overwrites (r=1, index=2), but (r=1, index=3) remains
- engine restarts, the data is inconsistent
This is unsolvable unless we tolerate logical errors when recovering tail records, but that's difficult to reason about.
I think it would be better to physically truncate the file when a buffer write fails. This truncate call could fail too, so we must keep track of outstanding dirty writes in a counter (e.g. written_dirty). The next buffer writes must attempt to do the truncate again.
A rough sketch of my thoughts:
buffered_write:
    if written_dirty > written {
        truncate(written)?;
    }
    let r = pwrite(...);
    if r.is_err() {
        if truncate(written).is_err() {
            written_dirty += len;
        }
    } else {
        written += len;
    }
    return r;

post_write:
    if let Err(e) = fsync() {
        // panic if needed
        if truncate(last_sync).is_err() {
            written_dirty = written;
        }
        written = last_sync;
        return e;
    }
    if needs_rotate {
        // Errors from these calls don't bubble up to the user.
        if ftruncate(old_file).is_ok() && open(new_file).is_ok() && fsync_dir().is_ok() {
            if pwrite(header).is_err() {
                written_dirty = header.len;
            } else {
                written = header.len;
            }
        }
    }
src/file_pipe_log.rs
Outdated
return Err(Error::Fsync(ctx.synced, e.to_string()));
}
if ctx.syncable == Syncable::NeedRotate {
    self.mut_queue(queue).rotate()?;
When `last_sync == 0` and `written == LOG_FILE_HEADER_LEN`, we should tolerate the fsync error.
The impl splits rotation into 2 parts: first call `truncate_and_sync` for the current log, then call `rotate` to create and move the writer to the new log file. So calling `rotate` here means the old log file has already been truncated and synced.
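As a side note, a minimal sketch of that two-phase rotation under the stated invariant; the types and method bodies below are illustrative stand-ins, not the actual code in src/file_pipe_log.rs.

```rust
use std::io::Result;

// Illustrative stand-ins for the real types in src/file_pipe_log.rs.
struct ActiveFile;
impl ActiveFile {
    /// Phase 1: truncate away any dirty tail and fsync, making the
    /// soon-to-be non-active file durable.
    fn truncate_and_sync(&mut self) -> Result<()> { Ok(()) }
}

struct Queue {
    active_file: ActiveFile,
}

impl Queue {
    /// Phase 2: create the next file, fsync the directory, and move the
    /// writer onto it. Retrying after a `sync_dir` failure is safe because
    /// the retried rotation overwrites the half-created file.
    fn rotate(&mut self) -> Result<()> { Ok(()) }

    /// Rotation is split into two phases so the invariant
    /// "non-active log files are always synced and safe" keeps holding.
    fn new_log_file(&mut self) -> Result<()> {
        self.active_file.truncate_and_sync()?;
        self.rotate()
    }
}
```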
@tabokie Agree with you on the case.
IMO, I think this method shares the same outcome as if the engine restarts at the moment the fsync error happens. If we keep
It's not needed, written_dirty is only there to block incoming writes (actually a bool variable suffices). If a dirty write isn't overwritten by other writes, then the disk state is still consistent. And I made a mistake with the case: ideally a log batch can only be written to disk atomically, so step 2 is unlikely to happen. So I guess my proposal can be reduced to only physically truncating the file on "write group" fsync failure, not on buffered write failure.
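To make the reduced proposal concrete, here is a rough sketch of a boolean dirty flag that blocks incoming writes until a pending truncate goes through. Field and method names are assumptions for illustration, not the crate's real API.

```rust
use std::io::{Error, ErrorKind, Result};

// Illustrative sketch only.
struct ActiveLog {
    written: u64,
    last_sync: u64,
    // Set when the truncate after an fsync failure did not go through;
    // while set, incoming writes are rejected so no dirty bytes can be
    // overwritten by later data.
    dirty: bool,
}

impl ActiveLog {
    fn buffered_write(&mut self, len: u64) -> Result<()> {
        if self.dirty {
            return Err(Error::new(ErrorKind::Other, "pending truncate after fsync failure"));
        }
        // ... pwrite the bytes here ...
        self.written += len;
        Ok(())
    }

    fn sync(
        &mut self,
        fsync: impl Fn() -> Result<()>,
        truncate: impl Fn(u64) -> Result<()>,
    ) -> Result<()> {
        if let Err(e) = fsync() {
            // Physically drop the un-synced tail; if that also fails, block
            // further writes (or panic, as later decided in this PR).
            self.dirty = truncate(self.last_sync).is_err();
            self.written = self.last_sync;
            return Err(e);
        }
        self.last_sync = self.written;
        Ok(())
    }
}
```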
Force-pushed from 7b85746 to 9b07cbd
Needs unit tests. I've added some IO failpoints in #139, every one of them needs to be tested. You can use `catch_unwind_silent` to test the panics.
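For reference, such a helper is usually a thin wrapper around `std::panic::catch_unwind` that temporarily silences the panic hook; the version below is a sketch of that common pattern and may differ from the project's own `catch_unwind_silent`.

```rust
use std::panic::{self, AssertUnwindSafe};

// Runs `f`, catching an expected panic without letting the default panic
// hook spam the test output. Restores the previous hook afterwards.
fn catch_unwind_silent<F, R>(f: F) -> std::thread::Result<R>
where
    F: FnOnce() -> R,
{
    let prev_hook = panic::take_hook();
    panic::set_hook(Box::new(|_| {}));
    let result = panic::catch_unwind(AssertUnwindSafe(f));
    panic::set_hook(prev_hook);
    result
}

// Usage: assert!(catch_unwind_silent(|| panic!("boom")).is_err());
```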
.append(LogQueue::Rewrite, log_batch.encoded_bytes(), sync)?;
let file_handle = self
    .pipe_log
    .append(LogQueue::Rewrite, log_batch.encoded_bytes())?;
You are not calling truncate after this failure.
Panics directly in the following cases:
- Truncate failed after buffer write error.
- Sync or rotate failed.
Signed-off-by: MrCroxx <mrcroxx@outlook.com>
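A small sketch of the first case above, i.e. panicking when the cleanup truncate itself fails after a buffer write error; names are illustrative, not the actual code.

```rust
use std::io::Result;

// After a failed buffered write, try to physically drop the partial bytes.
// If even the truncate fails, there is no consistent state left to fall
// back to, so panic instead of keeping a possibly corrupt tail.
fn handle_buffer_write_error(truncate: impl Fn(u64) -> Result<()>, written: u64) {
    if let Err(e) = truncate(written) {
        panic!("error when truncating after a failed buffer write: {}", e);
    }
}
```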
Force-pushed from eb64eac to 1d3fb70
Cargo.toml
Outdated
@@ -58,3 +58,9 @@ members = [ "stress" ]
name = "bench_recovery"
harness = false
required-features = ["failpoints"]

[[test]]
You should send another PR to move all existing failpoint tests to this folder.
OK. I'd like to move the io error test to this folder first in this PR, then send another one to move the existing tests.
Then add a mod.rs for the failpoints folder; listing individual tests in the main Cargo.toml doesn't look good.
@@ -160,9 +161,6 @@ impl<W: Seek + Write> ActiveFile<W> {
}

fn rotate(&mut self, fd: Arc<LogFd>, writer: W) -> Result<()> {
maybe rename it to reset
failpoints/io_error_test.rs
Outdated
// b0 (ctx); b1 success; b2 fail, truncate; b3 success
let mut hook = FailpointsHook::new();
hook.register_pre_append_action(2, || {
I don't particularly like this approach. Could you try using `cfg_callback`? Might be easier.
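For illustration, a rough sketch of what driving the injection through `fail::cfg_callback` could look like; the failpoint names are assumptions borrowed from the snippet discussed below, not a definitive setup.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Count writes from a callback and flip another failpoint on the third one,
// instead of wiring a custom hook into the engine.
fn inject_write_error_on_third_write() {
    let counter = AtomicU64::new(0);
    fail::cfg_callback("engine::write::pre", move || {
        if counter.fetch_add(1, Ordering::SeqCst) == 2 {
            fail::cfg("log_fd::write::err", "return").unwrap();
        }
    })
    .unwrap();
}
```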
and fix ci Signed-off-by: MrCroxx <mrcroxx@outlook.com>
Tests are failing.
failpoints/io_error_test.rs
Outdated
let timer = AtomicU64::new(0);
fail::cfg_callback("engine::write::pre", move || {
    match timer.fetch_add(1, std::sync::atomic::Ordering::SeqCst) {
        2 => fail::cfg("log_fd::write::post_err", "return").unwrap(),
What's the difference between `post_err` and `err`? If you want to test that the file is truncated, an engine restart is needed.
This is for checking whether the following appends correctly overwrite the previous ones.
Yep. It works fine in my env and I'm looking for the reason.
Force-pushed from 09ed081 to 790bdce
if manager.active_file.written >= manager.rotate_size {
    if let Err(e) = manager.rotate() {
        panic!(
It says this panic is not covered.
Just found where the problem lies, fixing.
tests/failpoints/test_io_error.rs
Outdated
fail::cfg("active_file::truncate::force", "return").unwrap(); | ||
fail::cfg("log_fd::truncate::err", "return").unwrap(); | ||
assert!(catch_unwind_silent(|| { | ||
write_tmp_engine(ReadableSize::kb(1024), ReadableSize::kb(4), 1024, 1, 4) |
`write_tmp_engine` seems like overkill.
stress --regions 1000 --write-sync true --time 600 [current] [master]
Make `PipeLog` try to roll back to a consistent state after fsync error.

According to PostgreSQL's fsync() surprise, `fsync` is not retriable. We must make the pipe log roll back to where it last synced.

What's changed?

Invariant: non-active log files are always synced and safe.

Updated: after discussion, the final decisions of this PR cover the following cases:
- `fsync` error.
- `write` error during buffer writing of the log file.

Introduce `pre_write` and `post_write` for `PipeLog` to acquire or release a `WriteContext`, which can be used to track the sync state.

If `fsync` raises an error, the log will roll back to the last synced state and set errors for the writers in the current write group. But if any rolled-back bytes belong to writers from previous write groups, RaftEngine will panic, because that means data the user believes is safe has actually been lost.

`new_log_file` is divided into 2 phases. First truncate and sync the old log file, then rotate to a new log file. The second phase is safe to retry if `sync_dir` fails, because the new log will overwrite the failed one.

TODO:

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
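To make the panic condition above concrete, here is a minimal sketch of the check an fsync-error path could perform; the function and parameter names are hypothetical, not the PR's actual code.

```rust
// `acked` is the offset already acknowledged to previous write groups,
// `last_sync` the offset the log is rolled back to after the fsync error.
fn on_fsync_error(acked: u64, last_sync: u64) {
    // Rolling back past data that earlier write groups were told is durable
    // means that data is lost; the only safe reaction is to panic.
    if acked > last_sync {
        panic!(
            "fsync failed: {} bytes acknowledged to previous write groups were rolled back",
            acked - last_sync
        );
    }
}
```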