-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/stage symlinked files #102
Conversation
cmd/fs_handlers.go
Outdated
// Since Walk does not travel symlinks, there is no point | ||
// to stage links pointing to a directory. None of its files | ||
// will be walked and staged. | ||
log.Warningf("Not staging %v which point to a non regular file: %v %v", childPath, info.Mode(), resolvedPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that kinda inconsistent? We follow symlinks for files, but not directories? That doesn't really follow the principle of least surprise. Also, I would prefer to have an option to ignore symlinks altogether, which disables this behavior. Sometimes I don't want to follow symbolic links that might me take elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I am not proud of it. The problem is that filepath.Walk
by design does not follow symlinks to the directories. Fixing it would require a use of a different library.
Shall I dig in that direction?
I will work on ignore symlinks option as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing it would require a use of a different library.
Why that? When you encounter a (valid) symbolic link to a directory, you simply call filepath.Walk
on that resolved directory path. I agree that it might get a bit clumsy, but it will work. A better option might be ioutil.ReadDir()
and use that in a recursive function: When encountering a directory (either regular one or resolved from a symbolic link) one would call the function that uses ioutil.ReadDir()
.
Care must be taken to limit the recursion depth, since symlinks can result in loops. But that applies to filepath.Walk too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redone in series of patches. It is not as elegant as I wished. But it also simplified staging logic: no need to create dirs, internal Stage
creates all necessary parent folders. Also all traveling is in one place.
Side effect: empty dirs will not be created. Do we want them? It would be relatively quick fix.
cmd/fs_handlers.go
Outdated
log.Warningf("%v symlink points to non existing file or directory", childPath) | ||
return nil | ||
} | ||
if err.Error() == "EvalSymlinks: too many links" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just return the error instead of remapping the error to a log message that kinda says the same as the error? I mean for the logic it shouldn't matter if the file does not exist, is a cycle or failed for e.g. permission issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return a error, it will interrupt the rest of the staging. With a warning, we let user know that the request is impossible for only this link.
I tried both ways. It presents less surprises to a user if we batch stage as much as possible, and skip things which brig
does not support vs stopping at the first filesystem incompatible link.
Would you agree?
Another option is to process errors in the caller and decide which is critical and which is not there. This way, we can return a non zero exit code, so a user can do some cli
logic around to decide if it is mission critical or not.
I really need a second opinion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you agree?
Hmm, don't fully agree. Not so much about whether traversal should be stopped or not (that might be preference thing), but rather how the current error handling is implemented:
- You ignore the error when symlink is broken or results in a loop.
- You error out if the symlink could not be read for other reasons (permissions etc.)
I don't see a clear line between either cases, therefore I'd vote to handle them equally. In both cases it results in at least one file that could not be added. Handling equally would also rid you from having to add branches for specific error causes.
About whether to stop or not: Maybe that could be an option too (--ignore-errors
) which would apply to regular files and directories too (if they could not be read due to whatever reasons). It would not be the default, so we would error out to inform users that something is wrong. My thinking for the option is that there might be different use cases:
- "Live" usage where a user reads the log messages. Users might wish to ignore errors if they deem it safe.
- "Automated" usage by a script that just fires the command. Errors should normally lead to a script abort here.
Another option is to process errors in the caller and decide which is critical and which is not there. This way, we can return a non zero exit code, so a user can do some cli logic around to decide if it is mission critical or not.
Would probably require interactivity (i.e. ask on the cmdline), which I'm not a big fan of in cli tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You logic is more consistent. Error is a error. I will modify accordingly.
It is a bit wasteful for a single file stage, but it would allow to handle staging option in one place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the fix gets more complicated 😞
Also, did we agree on an option to follow links and/or to ignore errors?
cmd/fs_handlers.go
Outdated
} | ||
childPath = resolvedPath | ||
if info.Mode().IsDir() { | ||
toBeDereferenced = append(toBeDereferenced, stagePair{childPath, repoPath}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More a musing: I'm a little confused by the usage of this toBeDereferenced
slice. What would stop you from calling walk()
again directly?
Also after some more thinking I found another issue: This method can lead to adding the same file more than once. I mentioned earlier that symbolic links transform a file tree into a graph. This leads to the possibility of encountering the same node more than once while traversing (i.e. directory symlink points one directory level up: ..
).
So maybe it's actually worth looking at libraries like symwalk which claim to handle this stdlib weakness. Just this disclaimer is a bit embarassing:
CAVEAT: Note that symwalk.Walk does not terminate if there are any non-terminating loops in the file structure.
Maybe there are better libraries also :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the symwalk
code and that CAVEAT inspired me to write our own,
it is not very different, from our walk. The main change was to track depth
of recursion. Honestly, speaking I would say having that walk
in the library is somewhat overkill. It is a typical undergraduate problem, with 10 LOC solution.
Admittedly, I made it a bit more complicated. As you suggest, instead of collecting toBeDereferenced
, I could have jump to the walk
right away. It would be cleaner, and I will redo it accordingly.
Traveling same node more than once is a feature of the symlinks. So this is a necessary evil. I even tested our code dir1/dir2 --> ../dir1
, it creates quite a long unwraped tree (limited by the set depth).
if err != nil { | ||
return fmt.Errorf("Failed to resolve: %v: %v", childPath, err) | ||
} | ||
info, err = os.Stat(resolvedPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another evil thing: Symbolic links can resolve to more symbolic links. They're really a hostile concept to
implement.
Nevermind, I was thinking about os.Lstat()
- filepath.Evalsymlinks
should handle this.
I guess the
|
What we already have does it. I run some test and I see no obvious bugs in it.
I would disagree on this. I think this is the feature of symlinks: multiple links can point to the same My use case: having a template dir (with makefiles, styles, and common configs). I have quite a lot of link to this template dir within a working tree. |
This is in my TODO list. I was looking for a general approval of the symlink handling, before adding extra options. An embarrassing story: I spent 5 hours figuring out how to |
Hmm, you have a point here. I was more worried about such pathological (hah, "path") cases:
...in that case the current code would add an infinitely nested structure, each level with So both solutions lack something:
I really don't like the concept of symbolic links. Without supporting some sort of "link" in brig we can't really cover both cases and that is really out of scope. Maybe one way would be to "burn" a directory-symlink, so we can traverse it only once and won't the next time. Since I can't come up with a good solution here you have my go to proceed. Might play around at some point, so please leave a TODO with the problem described above.
Sorry to hear that 😁 The |
file.big link
Just small correction: we already have recursion depth protection with the current code. The limit is set to 255 in the current But since I collect long list of files We might even attempt it with directories, but this is scary. I am not sure our linker will survive it.
Sure thing I will leave the TODO. |
I am still working on proper link handling. |
As a side effect, it does not go into infinite recursion loops, when a link references direct or indirect link to itself.
still working on it. In particular, on deduplication: if several repo paths points to the same content via symlinks during staging, then we should just copy repo node to new repopaths. This is how I triggered bug #105 |
How are you implementing the linking? Are you using |
I am going to use |
If during staging there are symlinks pointing to the same file, then relevant repo path will have similar content and backend hashes. This is achieved by staging the first occurrence of the file and then copying the content via brig `Copy(src, dst)`. This also allow faster staging by reducing duplicate work of encrypting and compressing.
Status update. Looks like I completed reflinking of the symlinked files. Next is proper treatment of the symlinked directories, then implementation for options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept looks good, but I think there are some flaws...
cmd/fs_handlers.go
Outdated
return nil | ||
} | ||
} | ||
|
||
if info.Mode().IsRegular() { | ||
toBeStaged = append(toBeStaged, stagePair{childPath, repoPath}) | ||
list, ok := toBeStaged[childPath] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. You're defining equalness based on the path. This will work only in case of absolute symbolic links. Sadly, the more common form I saw are relative symbolic links. Those will lead to different paths for the same file (did I mention that I hate the concept?):
→ touch im-the-same-file-always
→ ln -s . sub-directory
→ cd sub-directory
→ pwd
/tmp/t/sub-directory/sub-directory
→ stat /tmp/t/sub-directory/sub-directory/im-the-same-file-always
File: /tmp/t/sub-directory/sub-directory/im-the-same-file-always
Device: 30h/48d Inode: 61479275 Links: 1
...
→ cd sub-directory
→ stat /tmp/t/sub-directory/sub-directory/sub-directory/im-the-same-file-always
File: /tmp/t/sub-directory/sub-directory/sub-directory/im-the-same-file-always
Device: 30h/48d Inode: 61479275 Links: 1
...
As you can see, the path changes since it's totally valid for the same file to have several paths (-> hardlinks). The only thing that stays constant is the combination of device/inode
number. You should use this as key to uniqueness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. You're defining equalness based on the path. This will work only in case of absolute symbolic links.
If we track only symbolic links to files, the path equality is good enough.
filepath.EvalSymlinks
resolves the link to its absolute path so we are covered.
Tracking links to directories is hard as you correctly pointed out. I am working on it in a new set of patches. I think EvalSymlinks
still would resolve to the absolute parent, but I need more tests for this case. Current, algorithm reject simple infinite loops due to depth control. But I want to have directory reflinking for the non abusive loop free symlinks.
Hopefully, a user is not in self destruction/harm mode, and does not do loop symlinks unless there is a good case for it. I myself cannot come up with a useful case for looped symlinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we track only symbolic links to files, the path equality is good enough.
filepath.EvalSymlinks resolves the link to its absolute path so we are covered.
True. Just tested, and it seems that filepath.EvalSymlinks
is clever enough to always return /tmp/t//im-the-same-file-always
. It would be interesting to see how it does that, will check. I still suggest to use dev/inode instead of path due to the added benefit of detecting hardlinks.
Hopefully, a user is not in self destruction/harm mode, and does not do loop symlinks unless there is a good case for it. I myself cannot come up with a useful case for looped symlinks.
There is probably none. It's "just" a security thing from my perspective.
cmd/fs_handlers.go
Outdated
} | ||
continue | ||
} | ||
ctl.Copy(firstToStage, repoPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're missing the error handling here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ups, that was left over of copy-pasting. Added a better error handler in 49ddee9
// first occurrence is staged | ||
if err := ctl.Stage(pairSet.local, repoPath); err != nil { | ||
fmt.Printf("failed to stage %s: %v\n", pairSet.local, err) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to stop all staging when there was errors? Probably also something for the error ignoring option...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the failed staging might be a minor thing: someone changed a local filesystem (deleted a file or set unreadable permissions). I think it is better to report than stop staging at this stage.
Besides, I have no idea how to send an abort message from the go routine :(
cmd/fs_handlers.go
Outdated
continue | ||
} | ||
ctl.Copy(firstToStage, repoPath) | ||
if err := ctl.Stage(pairSet.local, repoPath); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we staging again here? Shouldn't the copy alone suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, leftover of copy-pasting. Note to myself: do not push commits at midnight.
Should be fixed in 49ddee9.
When symlinked files are copied (reflinked) often they point from a non existing directory. This patch takes care of this problem.
Still working on hardlinks and dirs |
Do we really want to implement hardlinks reflinking? To track hardlinks we have to use OS dependent package x/sys. I would argue that this is an overkill and will be hard to maintain. The price we pay is some data duplication (I start to regret our decision on a random key choice for every file, instead of content based.) I personally never used hardlinks in real life. But I need a weighted @sahib opinion on it. One more thing: Other than this, I will be working on skip symlinks and continue on the minor staging errors switches/options. |
It's not an easy discussion. Here's my take. I care mostly for consistency. If we support symbolic links we should also support hardlinks. To be fair, using
The second point is inconsistent, but I would be okay to let it slip since it's a non-essential optimization and it's likely that most users don't even care or know. So I'm kinda okay with the current state of not explicitly supporting for the following reasons:
I use them quite often. For example parts of my music library are hardlinks and I use them to create "playlists" that can be easily copied to elsewhere without fear of tools that do not handle symbolic links right. But also many tools like ostree, git and many more are based on them. Personally, I see them more often than symbolic links.
Well, yes. It adds a bit of complexity, but not much. One could just have a "key" type that is either dev/inode on unix or path on windows.
👍 |
This done by tracking files inodes and converting them to a unique string in the form "DeviceID/Inode". This work only on systems where sys/Lstat is supported. Currently, it is various unixes and more specifically linux for this patch.
I was convinced that hardlinks are a nice feature (should I expect anything different from I have added hardlinks reflinking with 9f93d01. It is tested only on linux, and on other systems should fall back to just staging a hardlinked file independently (not sure it supported by Windows at all). Technically, it should work on any unix like system, not just linux. But I do not see the way to make a build flag similar to
if I replace |
Nice! Implementation looks okay to me. Will have deeper look once the options are also implemented.
Damn, my past catches up. 😄 But yes, not scanning hardlinks twice was an issue there.
If I understood you right you could have the following configuration (untested, but did something similar once): $ ls inode_*
inode_unix.go inode_other.go
$ head -1 inode_unix.go
// +build !windows
$ head -1 inode_other.go
// +build windows Under the assumption that pretty much everything beside windows is a unix or at least has inodes. Even if that does not apply I think it's better to keep it like that so users can complain when compiling for this platform. |
The previous version distinguished between linux and other, which is less general. Since inode discovery exists on many unixes not just linux.
Excellent suggestion, implemented with 38704e1 Now, back to the staging options... |
…, -c It mostly protects against broken link, or some other issues at the local (hard drive) file system.
I have added options:
@sahib I think this patch set is ready for the final review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall, just two minor comments.
cmd/fs_handlers.go
Outdated
if err := ctl.Mkdir(repoPath, true); err != nil { | ||
return e.Wrapf(err, "mkdir: %s", repoPath) | ||
if opt.dereference && info.Mode()&os.ModeSymlink != 0 { | ||
// TODO: `brig` does not have concept of symlink |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/TODO/NOTE/g
(Not the lack of symlinks is the problem, but the lack of a good available implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 2376ea1
cmd/fs_handlers.go
Outdated
// First occurrence is staged. | ||
// Stage creates all needed parent directories. | ||
if err := ctl.Stage(twinsSet.localPath, repoPath); err != nil { | ||
fmt.Printf("failed to stage '%s' as '%s': %v\n", twinsSet.localPath, repoPath, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common guideline is that one outputs warnings and errors to stderr
rather than stdout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stdout -> stderr with d89e161
Small personal note: I have to sort out some personal stuff and and will be slow to respond/develop in the next 1-2 weeks. |
Fixes #101
Also fixes exotic case when now to be staged files are detected (i.e. we asked to stage symlinked directory), in this case there were endless wait for
pbars