RheaKV snapshot inconsistent if node panic during compress save. #604

Closed
yuyang0423 opened this issue Jun 8, 2021 · 5 comments · Fixed by #745

Comments

@yuyang0423
Contributor

Describe the bug

We see this error when the node comes back up after a panic.

2021-06-08 14:41:05.847 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeDescribeSignalHandler] loading.
2021-06-08 14:41:05.849 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.NodeMetricsSignalHandler] loading.
2021-06-08 14:41:05.850 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.ThreadPoolMetricsSignalHandler] loading.
2021-06-08 14:41:05.851 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.rhea.RheaKVDescribeSignalHandler] loading.
2021-06-08 14:41:05.853 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.JRaftSignalHandler - com.alipay.sofa.jraft.rhea.RheaKVMetricsSignalHandler] loading.
2021-06-08 14:41:05.857 [main] INFO  JRaftServiceLoader:275 - SPI service [com.alipay.sofa.jraft.util.timer.RaftTimerFactory - com.alipay.sofa.jraft.util.timer.DefaultRaftTimerFactory] loading.
2021-06-08 14:41:05.863 [main] INFO  NodeImpl:547 - The number of active nodes increment to 1.
2021-06-08 14:41:05.990 [main] INFO  FSMCallerImpl:201 - Starts FSMCaller successfully.
2021-06-08 14:41:06.012 [Rpc-netty-server-worker-10-thread-1] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-rpc-executor #0,5,main].
2021-06-08 14:41:06.006 [main] INFO  SnapshotExecutorImpl:259 - Loading snapshot, meta=last_included_index: 1366
last_included_term: 1
peers: "192.168.80.166:8082"
peers: "192.168.80.167:8082"
peers: "192.168.80.168:8082"
.
2021-06-08 14:41:06.058 [JRaft-FSMCaller-Disruptor-0] ERROR AbstractKVStoreSnapshotFile:104 - Fail to load snapshot, path=PATH_TO_SNAP/raft_data_region_-1_8082/snapshot/snapshot_1366, file list=[kv.zip], java.io.EOFException: Unexpected end of ZLIB input stream
  at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
  at java.util.zip.ZipInputStream.read(ZipInputStream.java:194)
  at java.io.FilterInputStream.read(FilterInputStream.java:107)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at com.alipay.sofa.jraft.rhea.util.ZipUtil.decompress(ZipUtil.java:86)
  at com.alipay.sofa.jraft.rhea.storage.AbstractKVStoreSnapshotFile.decompressSnapshot(AbstractKVStoreSnapshotFile.java:140)
  at com.alipay.sofa.jraft.rhea.storage.AbstractKVStoreSnapshotFile.load(AbstractKVStoreSnapshotFile.java:92)
  at com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine.onSnapshotLoad(KVStoreStateMachine.java:261)
  at com.alipay.sofa.jraft.core.FSMCallerImpl.doSnapshotLoad(FSMCallerImpl.java:652)
  at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:398)
  at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
  at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
  at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
  at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
  at java.lang.Thread.run(Thread.java:748)

Expected behavior

compress should not report success before the data has been synced to the OS.

Actual behavior

The snapshot KV data is lost.

Steps to reproduce

Panic the node while the kvStore snapshot file is being saved.

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

  • SOFAJRaft version: 1.3.7
  • JVM version (e.g. java -version):
  • OS version (e.g. uname -a):
  • Maven version:
  • IDE version:
@yuyang0423
Contributor Author

yuyang0423 commented Jun 8, 2021

This looks very similar to issue #480. Please check this code snippet from
com.alipay.sofa.jraft.rhea.util.ZipUtil::compress():

public static void compress(final String rootDir, final String sourceDir, final String outputFile,
                                final Checksum checksum) throws IOException {
        try (final FileOutputStream fos = new FileOutputStream(outputFile);
                final CheckedOutputStream cos = new CheckedOutputStream(fos, checksum);
                final ZipOutputStream zos = new ZipOutputStream(new BufferedOutputStream(cos))) {
            ZipUtil.compressDirectoryToZipFile(rootDir, sourceDir, zos);
            //
            // if panic happens here !!!!
            //
            zos.flush();
            fos.getFD().sync();
        }
    }

@killme2008
Contributor

This is fine: you can delete the corrupted snapshot and restart the node.

If the process panics inside the compress method, the SaveSnapshotClosure callback is never called, so the raft logs are not deleted from log storage.

@yuyang0423
Contributor Author

The node can't return to normal unless __raft_snapshot_meta and kv.zip are copied over from other nodes.
In any case, I think this is only a repair procedure; for a long-lived system it is better to avoid manual intervention.
I think the approach you mentioned in #480 is much better than the current one:
write to a temp file, then rename it to the final name.
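
To make the temp-file-then-rename idea concrete, here is a minimal sketch in plain JDK code. It is not the JRaft implementation: the class name `AtomicFileWriter`, the method `writeAtomically`, and the `.tmp` suffix are all made up for illustration, and it writes a byte array rather than a zip stream to keep the example short.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of SOFAJRaft.
public final class AtomicFileWriter {
    private AtomicFileWriter() {
    }

    /**
     * Writes data to a sibling temp file, syncs it to disk, then renames it
     * onto the target. A crash before the rename leaves only the temp file,
     * so a reader never sees a half-written target.
     */
    public static void writeAtomically(final Path target, final byte[] data) throws IOException {
        // Assumed naming scheme: the temp file sits next to the target so the
        // rename stays on the same file system.
        final Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try (final FileOutputStream fos = new FileOutputStream(temp.toFile())) {
            fos.write(data);
            fos.flush();
            // Force the file content to the storage device before renaming.
            fos.getFD().sync();
        }
        // Throws AtomicMoveNotSupportedException if the file system cannot
        // rename atomically. Whether an existing target is replaced is
        // implementation specific according to the Files.move javadoc.
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With this shape, the loader only ever needs to look at the final file name; a leftover `.tmp` file after a crash can simply be deleted on startup.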

@yuyang0423
Contributor Author

First fix: ensure the temp file is synced to disk. I am still testing the rename in LocalSnapshotStorage::close():

diff --git a/jraft-rheakv/rheakv-core/src/main/java/com/alipay/sofa/jraft/rhea/util/ZipUtil.java b/jraft-rheakv/rheakv-core/src/main/java/com/alipay/sofa/jraft/rhea/util/ZipUtil.java
index 08e3607..aa85322 100644
--- a/jraft-rheakv/rheakv-core/src/main/java/com/alipay/sofa/jraft/rhea/util/ZipUtil.java
+++ b/jraft-rheakv/rheakv-core/src/main/java/com/alipay/sofa/jraft/rhea/util/ZipUtil.java
@@ -34,6 +34,7 @@ import org.apache.commons.io.FileUtils;
 import org.apache.commons.io.IOUtils;
 import org.apache.commons.io.output.NullOutputStream;

+import com.alipay.sofa.jraft.util.Utils;
 import com.alipay.sofa.jraft.util.Requires;

 /**
@@ -51,6 +52,7 @@ public final class ZipUtil {
             zos.flush();
             fos.getFD().sync();
         }
+        Utils.fsync(new File(outputFile));
     }

     private static void compressDirectoryToZipFile(final String rootDir, final String sourceDir,

@yuyang0423
Contributor Author

It may be safer to use Utils.atomicMoveFile instead of File.renameTo in RocksRawKVStore and LocalSnapshotStorage.
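
For context on why this matters: File.renameTo only returns a boolean and its atomicity is platform dependent, while java.nio.file.Files.move with ATOMIC_MOVE either renames atomically or throws. The sketch below shows that pattern with plain JDK calls. It is not the body of Utils.atomicMoveFile; the class `SafeMove`, its method names, and the fallback policy are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of SOFAJRaft.
public final class SafeMove {
    private SafeMove() {
    }

    /** Moves source onto target, preferring an atomic rename. */
    public static void moveAtomically(final Path source, final Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (final AtomicMoveNotSupportedException e) {
            // Assumed fallback policy: a non-atomic replace is better than failing.
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        // Persist the directory entry so the rename itself survives a crash.
        syncDirectory(target.toAbsolutePath().getParent());
    }

    private static void syncDirectory(final Path dir) {
        if (dir == null) {
            return;
        }
        try (final FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        } catch (final IOException ignored) {
            // Best effort: some platforms do not allow opening a directory this way.
        }
    }
}
```

Unlike a renameTo call whose false return value is easy to ignore, a failed move here surfaces as an IOException that the snapshot writer has to handle.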
