Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock not found warnings on parallel restic runs #1523

Closed
timdeluxe opened this issue Jan 2, 2018 · 6 comments · Fixed by #4520
Closed

Lock not found warnings on parallel restic runs #1523

timdeluxe opened this issue Jan 2, 2018 · 6 comments · Fixed by #4520

Comments

@timdeluxe
Copy link

Output of restic version

restic 0.8.1
compiled with go1.9.2 on windows/amd64

How did you run restic exactly?

restic -o sftp.command="plink -s -l mysftpuser -P 555 -i C:/Users/Myself/mysftpuser_id_rsa.ppk myremotesftplocation.com sftp" --repo sftp:myremotesftplocation.com:/data -p C:\restic\backup_password.txt backup C:\Users\Myself\Documents\SomeDir

As described in the title i made two runs, the first run (started a second earlier) worked just fine:

password is correct
using parent snapshot a471a9e6
scan [C:\Users\Myself\Documents\SomeDir]
scanned 9 directories, 304 files in 0:00
[0:00] 100.00%  0B/s  1.628 GiB / 1.628 GiB  313 / 313 items  0 errors  ETA 0:00
duration: 0:00, 2076.29MiB/s
snapshot a2522615 saved

The second run obviously waited for the lock of the first one, but gave strange messages about that:

password is correct
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 252.908174ms: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 402.123616ms: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 909.577925ms: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 2.064778206s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 3.125126695s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 4.165555777s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 3.840376221s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 8.745851428s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 17.599426733s: file does not exist
Load(<lock/44ef36d35f>, 0, 0) returned error, retrying after 9.716109781s: file does not exist
using parent snapshot a2522615
scan [C:\Users\Myself\Documents\SomeDir]
scanned 9 directories, 304 files in 0:00
[0:01] 100.00%  1.628 GiB/s  1.628 GiB / 1.628 GiB  313 / 313 items  0 errors  ETA 0:00
duration: 0:01, 1497.99MiB/s
snapshot 8feaf923 saved

What backend/server/service did you use to store the repository?

Linux system (Debian) with SFTP

Expected behavior

More meaningful message like "waiting for parallel run to be finished" and best would be to include some kind of process id, starting time of the other process and a timeout (and how to solve that on a timeout - i dont know how the locking works...).

Actual behavior

Messages ("Load...") shown above

Steps to reproduce the behavior

Use restic on Windows. Use SFTP Backend with plink. Start two parallel runs or restic at the same time for a directory with some files/MBs (else the runs end too fast).

Do you have any idea what may have caused this?

Looks like the locking is already handled well (the second run waited for the first one to finish + x and then started itself), but the locking handled is presented to the user with not really self explaining messages.

Do you have an idea how to solve the issue?

Implement an error handler for locking-related things and in that case we have here, present a better message to the user.

Did restic help you or made you happy in any way?

The answer is 42! ...just kidding ;)
No restic in fact makes me very happy. You wont believe how many free backup solutions i searched and tried (for Linux + Windows). restic is the first one which fulfills my requirements AND works.
To be the killer-backup-app for Windows it would need a nice GUI and native-implemented VSS handling (and logging, and...), but i understand that the basics first need to work very stable and after that one could concentrate on nice extras like the ones i described.

@fd0
Copy link
Member

fd0 commented Jan 7, 2018

Huh, interesting, thanks for the report. What happens in the background is that restic creates a lock file (non-exclusive, there may be multiple clients running backup in parallel), then waiting for other locks to appear. During that time, there was a lock file present (maybe from the other process), but which was removed. The messages are just telling you that restic retries loading the lock file several times (just in case it appears after a while, some cloud backends may do that).

So, strictly speaking, restic isn't waiting for the other backup run to finish, but for the removed lock file to appear again. That's a rare timing issue, and I suspect not many users will run into this. We could disable retrying Load() for lock files, but I don't think it's a good idea.

I agree that the messages could be better, do you have an idea for that?

@fd0 fd0 added the state: need feedback waiting for feedback, e.g. from the submitter label Jan 7, 2018
@timdeluxe
Copy link
Author

Oh, thanks for the comprehensive explanation.

To give more background: I am planning to backup a bunch of Windows systems and wanted to test a few things first. Depending on how the backup is planned it might occur that two runs of restic are run at the same time and that is what i wanted to test.

Based on your explanation it might not be a good idea to mention something like "waiting for parallel run to be finished", which i proposed in the initial report already. But maybe something else, which describes your explanation in short - maybe "Short presence of lock file detected, pausing and checking". I don't know how the waiting time between the checks is calculated and how many loops are done. Maybe the message could be extended with that information, so the user knows what is and will happen?

But i think this is a very low priority issue, postpone it and do more important stuff earlier :)

@fd0 fd0 added type: feature enhancement improving existing features help: wanted and removed state: need feedback waiting for feedback, e.g. from the submitter labels Jan 9, 2018
@fd0
Copy link
Member

fd0 commented Jan 9, 2018

Thanks for the feedback

@MichaelEischer MichaelEischer changed the title Strange messages on parallel run on Windows (however both runs are succesful) Lock not found warnings on parallel restic runs Aug 20, 2022
@MichaelEischer
Copy link
Member

Ideally restic shouldn't retry loading locks, but instead list the lock folder again to check whether the lock file was just removed in the meantime. If it still exists, then it should retry loading the lock file.

@double-p
Copy link

double-p commented Aug 25, 2022

I see this "every night", too. Some 190 clients across 8 repositories and one or two show this (not always the same ofc).

Interestingly with always the exact same timings down to sub-nanoseconds? The same as Michael's output in #3652 , too .. sus, innit? :)

Load(<lock/20e347e327>, 0, 0) returned error, retrying after 552.330144ms: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 1.080381816s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 1.31013006s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 1.582392691s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 2.340488664s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 4.506218855s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 3.221479586s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 5.608623477s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 7.649837917s: The specified key does not exist.
Load(<lock/20e347e327>, 0, 0) returned error, retrying after 15.394871241s: The specified key does not exist.

This happens in backup runs, and the error output is just not ensuring me the backup is running properly.

This is restic 0.13.1 from releases on a variety of linux machines. The OpenBSD machines are not showing this (so far?).

I can roll+run a debug version if that helps.

@MichaelEischer
Copy link
Member

Interestingly with always the exact same timings down to sub-nanoseconds?

That sounds like the randomization didn't work as it should. But besides that the lock retries are sort of what is to be expected at the moment when many restic instances access the same repository at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants