Replication fails when receive side apparently hangs #136

Open
stingray-11 opened this issue Mar 13, 2019 · 3 comments
Labels
awaiting_feedback bug-candidate possible bug, needs investigation

stingray-11 commented Mar 13, 2019

I'm running zrepl 0.1.0rc3 on FreeBSD 12, replicating over an ssh connection to an Ubuntu Linux VM that also runs zrepl 0.1.0rc3. Replication fails about 95% of the time: sometimes right away, sometimes after running for a few minutes. A manual zfs send | ssh backup zfs recv replicated successfully, so I'm confident the problem is in zrepl and that it's probably not a Linux/FreeBSD incompatibility issue. The only thing I've noticed is that the transfer rate drops to 0 for several seconds before the failure. I thought the backup hard drive might be hanging for a few seconds on writes and tripping up zrepl, but I could not reproduce any write hangs with a dd test, so I think it's tripping up for some other reason. The error message is always the same: "unconsumed input stream".

I should also note that during the manual send/recv the transfer rate still went to 0 for a few seconds several times, but it would just recover and continue after that. So I'm not sure if there's a problem with my I/O or not, but if there is, zrepl times out almost immediately and fails.
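A dd run mostly reports average throughput, so short stalls can be easy to miss. One way to look for them more directly is to time each synced write individually. The following is a minimal sketch, assuming Python 3 is available on the receiving VM; the function name, chunk size, chunk count, and threshold are illustrative choices, not values taken from this report or from zrepl.

```python
import os
import time


def find_write_stalls(path, chunk_size=1 << 20, chunks=64, threshold_s=1.0):
    """Write `chunks` blocks of `chunk_size` bytes to `path`, syncing after
    each one, and return a list of (chunk_index, seconds) for every write
    that took at least `threshold_s` seconds."""
    # Random data, so that ZFS compression (if enabled) cannot shrink the writes.
    block = os.urandom(chunk_size)
    stalls = []
    with open(path, "wb") as f:
        for i in range(chunks):
            start = time.monotonic()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device so a stall shows up here
            elapsed = time.monotonic() - start
            if elapsed >= threshold_s:
                stalls.append((i, elapsed))
    return stalls
```

Calling `find_write_stalls("/backup/stall-test.bin")` (a hypothetical path on the backup pool) returns an empty list if no single synced write took a second or longer. An empty result would not rule out stalls elsewhere in the path, for example in the ssh connection.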

    Type: push
    Replication:
        Status: PermanentError
        Problem: multiple filesystems could not be replicated: multiple different errors
        Progress: [>--------------------------------------------------] 7.9 GiB / 615.7 GiB @ 0 B/s
          zfspool/oldmachines Ready (step 0/1, 7.4 GiB/454.7 GiB)
            step zfspool/oldmachines@zrepl_20190312_234215_000 (full) failed: handler error with unconsumed input stream: zfs exited with error: exit status 1
            stderr:
            cannot receive new filesystem stream: checksum mismatch or incomplete stream
          zfspool/proxmox     Completed (step 1/1, 18.6 KiB/26.6 KiB)
          zfspool/vms         Ready (step 0/1, 524.5 MiB/161.0 GiB)
            step zfspool/vms@zrepl_20190312_234215_000 (full) failed: handler error with unconsumed input stream: zfs exited with error: exit status 1
            stderr:
            cannot receive new filesystem stream: checksum mismatch or incomplete stream

This is the log.
Mar 12 22:16:52 fileserver zrepl[39020]: job=machines subsystem=repl msg="receive request failed (might also be error on sender)" step="zfspool/oldmachines@zrepl_20190312_234215_000 (full)" err="handler error with unconsumed input stream: zfs exited with error: exit status 1\nstderr:\ncannot receive new filesystem stream: checksum mismatch or incomplete stream\n" fs=zfspool/oldmachines invocation=5 errType=*streamrpc.RemoteEndpointError
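For anyone triaging many such lines, the key=value part of the log entry can be split into fields so that failures can be grouped by `step`, `fs`, or `errType`. This is a rough sketch that assumes the format seen in the single line above (bare or double-quoted values, backslash escapes inside quotes); it is not based on zrepl's logging code, and `parse_fields` is a hypothetical helper name.

```python
import re

# key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a run of non-whitespace characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def _unquote(value):
    """Strip surrounding quotes and resolve the escapes listed in _ESCAPES."""
    if len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        return value
    return re.sub(
        r"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), m.group(0)),
        value[1:-1],
    )


def parse_fields(line):
    """Return a dict of the key=value fields found in one log line.

    Text without '=' (such as the syslog timestamp and host prefix) is ignored.
    """
    return {key: _unquote(raw) for key, raw in _PAIR.findall(line)}
```

Applied to the line above, `parse_fields(line)["err"]` holds the multi-line error text with real newlines, which makes the `zfs` stderr output easier to read than the escaped form.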

@problame
Member

I merged the RPC rewrite today: #111
Further, there have been significant changes to the replication in #139
If possible, please try the binaries I posted here: #139 (comment) (They are builds of #139)

@problame
Member

Also, this issue has been encountered before in #104

@problame
Member

As can be seen from the commits above, work on this issue is progressing in branch master...problame/overlapping-dataset-hierarchy-improvements

Could you please verify that the issue was fixed or is at least mitigated?
To get binary builds, click on the latest commit with a checkmark and navigate to the CircleCI binary artifacts, e.g. https://circleci.com/gh/zrepl/zrepl/139#artifacts/containers/0

@problame problame added this to To DO in 0.1 via automation Mar 21, 2019
@problame problame moved this from Todo to Awaiting Feedback in 0.1 Mar 21, 2019