Skip to content
This repository has been archived by the owner on Oct 15, 2022. It is now read-only.

Failed to symlink, File exists #389

Open
Skyfold opened this issue May 3, 2020 · 4 comments
Open

Failed to symlink, File exists #389

Skyfold opened this issue May 3, 2020 · 4 comments

Comments

@Skyfold
Copy link

Skyfold commented May 3, 2020

Describe the bug

I run lorri daemon in its own shell. It had been working correctly for quite some time before it crashed with this error log:

name = 'lorri'
operating_system = 'unix:Unknown'
crate_version = '1.0.0'
explanation = '''
Panic occurred in file 'src/build_loop.rs' at line 135
'''
cause = '''
Unrecoverable error:
Io {
    msg: "Failed to symlink /home/user/.cache/lorri/gc_roots/43b29393751a4b4155eb2df05ec2bf91/gc_root/shell_gc_root to /nix/var/nix/gcroots/per-user/user/43b29393751a4b4155eb2df05ec2bf91-shell_gc_root: File exists (os error 17)",
}'''
method = 'Panic'
backtrace = '''

   0: 0x55ace73cf8f9 - std::panicking::try::do_call::h701385363879df93
   1: 0x55ace75c8b4a - __rust_maybe_catch_panic
   2: 0x55ace73ed1a6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h7d93166c1161340f
   3: 0x55ace75a88bf - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h3e052558cbe55fd8
   4: 0x55ace75ae620 - std::sys::unix::thread::Thread::new::thread_start::h5262a6ff4b26522f
   5: 0x7fe132c17ef7 - start_thread
   6: 0x7fe13293722f - __GI___clone
   7:        0x0 - <unknown>'''

To Reproduce

I'm not completely sure the steps that lead to the error, but I did notice before it crashed that my shell was missing a library I had specified and lorri wasn't building a new environment, so I ran touch default.nix to refresh the shell. After restarting lorri daemon it built the new environment without crashing.

Metadata

The exact version of lorri I'm running is: /nix/store/slvajmdzd3v080fn9b9kqypqdb4bgwnk-lorri-1.0/bin/lorri. I setup lorri with services.lorri.enable = true; in my configuration.nix. I'm on nixos-19.09.2470.4a0df0ce263 (I got that from the channel's manifest file)

@curiousleo
Copy link
Collaborator

Thanks! Here's what I've found:

Bottom-up:

  • lorri attempted to create a symlink in /nix/var/nix/gcroots, but that symlink already existed
  • these symlinks are created in Roots::add, which is only called from Roots::create_roots

Top-down:

  • BuildLoop::forever calls BuildLoop::once calls BuidLoop::root_result calls Roots::create_roots

I can think of two explanations:

  • There were two lorri instances that raced to build the project; one succeeded, the other failed with the error in the description. There cannot be two daemons at the same time because only one process can own the daemon socket, but there could have been a lorri daemon and a lorri watch process, for instance.
  • There was a hash collision. Unlikely.

@Skyfold, is it possible that there was a lorri daemon and a lorri watch process running at the same time here?

In terms of fixes / mitigations, I can think of the following:

  • The code that creates the symlink could check if the symlink already exists and points to the right cache directory. If so, there is no need to create the symlink. If the symlink already exists but points somewhere else, that's an issue. I think lorri should probably fail with a helpful error in that case since it points to a larger problem with the cache. This would be a safe way to deal with a race.

@Skyfold
Copy link
Author

Skyfold commented May 4, 2020

I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).

Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

@curiousleo
Copy link
Collaborator

I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).

Thanks for the clarification.

Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?

#341 is about lorri duplicating builds. Crucially, lorri should never run multiple builds for the same project in parallel; #341 is about lorri enqueueing redundant builds and then running them sequentially.

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

👍

@Profpatsch
Copy link
Collaborator

Profpatsch commented May 6, 2020 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants