Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restic 0.14.0 (and current master branch) hangs indefinitely after approx. 3.8 million files on Solaris 11 x86_64 #4003

Closed
racingmars opened this issue Nov 3, 2022 · 5 comments · Fixed by pkg/xattr#70 or #4053
Labels

Comments

@racingmars
Copy link

racingmars commented Nov 3, 2022

Output of restic version

restic 0.14.0-dev (compiled manually) compiled with go1.19.2 on solaris/amd64

This issue originally occurred for me in the 0.14.0 release downloaded from the Restic web site.

My current build is from the git master branch as of earlier today, commit 92df039, built with go build -tags debug ./cmd/restic, using go version go1.19.2 solaris/amd64

How did you run restic exactly?

# DEBUG_LOG=debug.log /root/restic \
    -r sftp:mwilson@backupfs://restic -p /root/restic.pw  --verbose backup \
    /tank/athena /export/b /export/home/mwilson /export/media  \
    /export/owncloud /export/smb /export/software /tank/c /tank/postgres  \
    --exclude /export/home/mwilson/.cache

What backend/server/service did you use to store the repository?

sftp to a FreeBSD 13.1-RELEASE-p3 system. Repository originally created with restic 0.12.1 and has been working fine for about a year in this setup. I upgraded to 0.14.0 and upgraded the repository to v2 and the first attempt to backup hit this problem.

I have also tried using the REST/http server on the FreeBSD target server, but the behavior is the same. So appears to be a problem in reading the files on the Solaris box, not the backend implementation.

Expected behavior

I expect the backup to complete successfully as it did on the same system with restic 0.12.1.

Actual behavior

After approximately 3.8 million files, the restic process appears to hang indefinitely. I have let it run over 24 hours, and the timer keeps increasing but no further progress is made.

The output from restic was:

debug log file debug.log
debug enabled
open repository
repository e98316a6 opened (version 2, compression level auto)
lock repository
using parent snapshot 2354d7f9
load index files
start scan on [/tank/athena /export/b /export/home/mwilson /export/media /export/owncloud /export/smb /export/software /tank/c /tank/postgres]
start backup on [/tank/athena /export/b /export/home/mwilson /export/media /export/owncloud /export/smb /export/software /tank/c /tank/postgres]
scan finished in 1221.960s: 4741527 files, 11.174 TiB
[1:42:47] 3856780 files 10.423 TiB, total 4741527 files 11.174 TiB, 0 errors

The timer continues going up, but the file/size/etc counts freeze where they are.

Steps to reproduce the behavior

Unfortunately I don't know anything other than "use restic with a lot of files that you're backing up."

I have tried deleting the restic cache directory, have tried both the sftp and rest backend, and I've built with several different versions of Go (both from Solaris itself and cross-compiling from Linux).. Restic 0.12.1 was working fine for me, but I can't revert to that since I already upgraded the repository version with 0.14.0.

Do you have any idea what may have caused this?

I do not. However, I can provide some additional detail...

It doesn't appear to be related to the specific files it's trying to back up. In an earlier attempt to debug, I excluded the directory that had files the debug log was last referencing. That just causes it to freeze somewhere else with a slightly different file number count.

After restic stops making progress, here's the dump after I send a SIGABRT:

https://paste.sr.ht/blob/67bcbf3b097934f7222dcef8c036461539ba23dc

And here are the last several hundred lines from the debug log:

https://paste.sr.ht/blob/b8c24053c1c929d1f646049af226552aa089c251

Do you have an idea how to solve the issue?

No. Ideally I'd try creating a new repository from scratch and see if that works, but I don't have storage space locally to try that and the system I'm backing up to currently is at a remote location and a new full initial backup would be too big to practically do over the internet.

Did restic help you today? Did it make you happy in any way?

In general restic makes me very happy :-). This last week as I've been troubleshooting this and trying to figure out how to get my backups working again (I'm basically stuck since I can't downgrade to 0.12.1 that was working since the repo version has already been upgraded!) has, I must admit, been frustrating. Once I can get past this I'm sure I'll be loving restic again, though!

@MichaelEischer
Copy link
Member

Thanks for the detailed problem report! I think I've found the issue:

goroutine 104 [syscall, 7 minutes]:
runtime.syscall_sysvicall6(0xe1b760, 0x3, 0xc03f1b51f0, 0x800000, 0x0, 0x0, 0x0, 0x0)
        /home/mwilson/src/golang/1.19/go/src/runtime/syscall_solaris.go:40 +0x85 fp=0xc01172e798 sp=0xc01172e740 pc=0x468725
syscall.Open({0xc03f1b4fc0?, 0xc01172e8a0?}, 0x6?, 0x0)
        /home/mwilson/src/golang/1.19/go/src/syscall/zsyscall_solaris_amd64.go:744 +0xc5 fp=0xc01172e830 sp=0xc01172e798 pc=0x47e025
os.openFileNolog({0xc03f1b4fc0, 0x65}, 0x0, 0x0)
        /home/mwilson/src/golang/1.19/go/src/os/file_unix.go:216 +0xcf fp=0xc01172e880 sp=0xc01172e830 pc=0x4afbef
os.OpenFile({0xc03f1b4fc0, 0x65}, 0x0, 0xebbb00?)
        /home/mwilson/src/golang/1.19/go/src/os/file.go:337 +0x45 fp=0xc01172e8b8 sp=0xc01172e880 pc=0x4ad9e5
github.com/pkg/xattr.listxattr({0xc03f1b4fc0?, 0x4?}, {0x0, 0x0, 0x0})
        /home/mwilson/go/pkg/mod/github.com/pkg/xattr@v0.4.8/xattr_solaris.go:117 +0x5b fp=0xc01172e920 sp=0xc01172e8b8 pc=0x79c9fb
github.com/pkg/xattr.List.func1({0x0?, 0x1935b40?, 0x1d50ba7e?})
        /home/mwilson/go/pkg/mod/github.com/pkg/xattr@v0.4.8/xattr.go:207 +0x35 fp=0xc01172e958 sp=0xc01172e920 pc=0x79be55
github.com/pkg/xattr.list({0xc03f1b4fc0, 0x65}, 0xc01172ea30)
        /home/mwilson/go/pkg/mod/github.com/pkg/xattr@v0.4.8/xattr.go:232 +0x4d fp=0xc01172ea18 sp=0xc01172e958 pc=0x79beed
github.com/pkg/xattr.List({0xc03f1b4fc0?, 0x123fb20?})
        /home/mwilson/go/pkg/mod/github.com/pkg/xattr@v0.4.8/xattr.go:206 +0x48 fp=0xc01172ea58 sp=0xc01172ea18 pc=0x79bde8
github.com/restic/restic/internal/restic.Listxattr({0xc03f1b4fc0?, 0xe700010018?})
        /home/mwilson/src/restic/internal/restic/node_xattr.go:23 +0x1e fp=0xc01172ea90 sp=0xc01172ea58 pc=0x7a86de
github.com/restic/restic/internal/restic.(*Node).fillExtendedAttributes(0xc03fd84c80, {0xc03f1b4fc0, 0x65})

Can you test whether reverting commit dd30083c2b539e7d982b47275e394edd2d775432 solves the issue? (git revert dd30083c2b539e7d982b47275e394edd2d775432; git checkout --our go.mod go.sum; go add go.mod go.sum; go commit should fix the merge conflict)

@MichaelEischer
Copy link
Member

The last line printed by the archiver in the debug log is probably the file which caused restic to become stuck:
2022/11/03 15:41:10 archiver/archiver.go:469 archiver.(*Archiver).Save 104 /tank/athena/export/olddata/users/oldsystem/old-zeus/home/mwilson-old/mwilson-oldhome/.licq/licq_fifo other

To expand a bit on my previous comment, in #3628 (included in restic 0.14.0) we've added support to also backup xattrs on Solaris. Based on the filename I wonder whether trying to backup a file of type fifo just causes the xattr code to hang.

@racingmars
Copy link
Author

Hi @MichaelEischer, thank you for looking at this.

Your theory about the fifo type file appears to be correct. I have been able to reproduce this quickly based on that knowledge.

On the Solaris box, I can:

mkdir repo target
mkfifo target/myfifo
restic -r repo init
restic -r repo backup target

With the official 0.14.0 release, this hangs on the single file.

After reverting commit dd30083 as you suggested, the backup completes successfully.

I've kicked off my large backup with the reverted binary; I suspect it will complete successfully now in a couple of hours.

Thanks again!

@racingmars
Copy link
Author

...success! Backup completed successfully in 20 minutes, which is back to how quick it was pre-0.14.0. I'm guessing 0.14.0 was taking a couple hours because it was backing up the new xattr data for files for the first time and would return to being fast on future backups? I don't know... maybe the xattr calls themselves are just really slow?

Regardless, the immediate problem of not getting through the backup is solved!

@greatroar
Copy link
Contributor

pkg/xattr opens the file to get an fd, then uses openat on the fd to get to the attribute namespace. I think the first open should use O_NONBLOCK and/or O_NDELAY so as not to wait for the write end to be opened, but I'm not sure what that will do to other processes that use the FIFO.

greatroar added a commit to greatroar/xattr that referenced this issue Nov 5, 2022
On Solaris, files need to be opened to obtain or change their
attributes. But opening a FIFO file causes the calling thread to block
until the write end it opened. Files are now opened using O_NONBLOCK to
prevent this from happening.

Should fix restic/restic#4003.

Also fix comment on stringsFromByteSlice.
greatroar added a commit to greatroar/xattr that referenced this issue Nov 5, 2022
On Solaris, files need to be opened to obtain or change their
attributes. But opening a FIFO file causes the calling thread to block
until the write end is opened. Files are now opened using O_NONBLOCK to
prevent this from happening.

Should fix restic/restic#4003.

Also fix comment on stringsFromByteSlice.
greatroar added a commit to greatroar/xattr that referenced this issue Nov 5, 2022
On Solaris, files need to be opened to obtain or change their
attributes. But opening a FIFO file causes the calling thread to block
until the write end is opened. Files are now opened using O_NONBLOCK to
prevent this from happening.

Should fix restic/restic#4003.

Also fix comment on stringsFromByteSlice.
greatroar added a commit to greatroar/xattr that referenced this issue Nov 6, 2022
On Solaris, files need to be opened to obtain or change their
attributes. But opening a FIFO file causes the calling thread to block
until the write end is opened. Files are now opened using O_NONBLOCK to
prevent this from happening.

Should fix restic/restic#4003.

Also fix comment on stringsFromByteSlice.
kuba-- pushed a commit to pkg/xattr that referenced this issue Nov 20, 2022
* Fix FIFO handling on Solaris

On Solaris, files need to be opened to obtain or change their
attributes. But opening a FIFO file causes the calling thread to block
until the write end is opened. Files are now opened using O_NONBLOCK to
prevent this from happening.

Should fix restic/restic#4003.

Also fix comment on stringsFromByteSlice.

* (Solaris) Open O_NONBLOCK with unix.Open, not os.Open

Passing this flag to os.Open works, but this relies on undocumented
implementation details of package os.

* (Solaris) Fix removexattr/hanging test
greatroar added a commit to greatroar/restic that referenced this issue Nov 28, 2022
This version doesn't have a release tag yet, but it's 0.4.9 + one patch.

Fixes restic#4003.
greatroar added a commit to greatroar/restic that referenced this issue Nov 28, 2022
This version doesn't have a release tag yet, but it's 0.4.9 + one patch.

Fixes restic#4003.
greatroar added a commit to greatroar/restic that referenced this issue Nov 28, 2022
This version doesn't have a release tag yet, but it's 0.4.9 + one patch.

Fixes restic#4003.
greatroar added a commit to greatroar/restic that referenced this issue Nov 28, 2022
This version doesn't have a release tag yet, but it's 0.4.9 + one patch.

Fixes restic#4003.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants