Support explicit relative paths using common /./ notation #4762

Open
eharris opened this issue Apr 10, 2024 · 11 comments

Comments

@eharris

eharris commented Apr 10, 2024

Output of restic version

restic 0.16.4 (v0.16.4-0-g3786536dc) compiled with go1.21.8 on linux/amd64

What should restic do differently? Which functionality do you think we should add?

Add support for explicit relative path designations (specifying the intended root of path) using common /some/path/./another/path notation to anchor the portion of the path to be stored.
Backups and exclude/includes should be able to be specified with the new "root" of the path starting after the /./ part. So the path above becomes /another/path for all storage/referencing purposes, including: parent use/detection, include/excludes, etc.
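
A minimal sketch of the proposed notation, for illustration only. It assumes that the first `/./` in a path marks the anchor and that everything after it is re-rooted at `/`. The function name `split_anchor` is hypothetical, and nothing here reflects how restic currently processes paths.

```python
from typing import Optional, Tuple


def split_anchor(path: str) -> Tuple[Optional[str], str]:
    """Split a path at the first '/./' marker.

    Returns (source_root, stored_path). If there is no marker, the
    source root is None and the path is stored unchanged.
    """
    head, sep, tail = path.partition("/./")
    if not sep:
        # No explicit anchor: keep the full path as given.
        return None, path
    # Everything after the marker becomes the stored path, rooted at '/'.
    return head, "/" + tail.strip("/")


print(split_anchor("/some/path/./another/path"))
# ('/some/path', '/another/path')
```

Under these assumptions a trailing marker, as in `.../argon/root/./`, maps the directory itself to `/`.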

This is a simpler, and already known method of accomplishing the indication of relative path intent.
(See also: #2714, #2246, #2993, #3131, #3200, and possibly others)

The forced absolute path design that restic currently uses to identify backups actually causes several problems, and there are several tickets (some of which are listed above) already trying to solve or abate them.

What are you trying to do? What problem would this solve?

In my current use case, I have backups that were done via zfs snapshots, and I'd like to "move"/convert them into restic instead.
However, since each snapshot appears within zfs under its own unique snapshot directory, restic resists being informed that those are actually the same files, just from another point in time, and clutters the snapshot listing with path information that is completely irrelevant, as well as breaking restic's parent and file-change detection and forcing all files to be completely rescanned. It may also result in restic being less efficient at storing those changes (this is just a guess).

Example command:

restic backup -x --time "2024-01-01 02:00:00" --host argon \
/mnt/backups/.zfs/snapshot/20240101-020000/argon/root/./ \
/mnt/backups/.zfs/snapshot/20240101-020000/argon/./boot \
/mnt/backups/.zfs/snapshot/20240101-020000/argon/./usr/ \
/mnt/backups/.zfs/snapshot/20240101-020000/argon/./var/

Admittedly, this can be worked around by using bind mounts to position the directories to be backed up in a manner that restic can no longer mistakenly infer that they are different sources, but that requires root privileges and is overly convoluted when this capability should (I think) be provided within restic. And the bind mount workaround still does not allow the paths to actually be stored/referenced at the actual root of the filesystem if that is what is desired.

Did restic help you today? Did it make you happy in any way?

I massively appreciate restic being able to reduce the overall storage needs for backups of often changing but otherwise relatively very similar files like mbox mail files, where one message deleted near the beginning of the file causes the entire rest of the file to be rewritten (and thus changed as far as zfs is concerned) even though the actual total data changed may have only been a few kb deleted.

@damoclark

Take a look at #4026 which proposes to change the absolute path model and is under consideration.

@eharris
Author

eharris commented Apr 11, 2024

@damoclark I did see that ticket while I was looking for workarounds, but that doesn't address the issue that it would still store useless (and wrong for the purposes of restore) path information. It also seems like a more significant undertaking, because it re-architects what restic uses to identify backups, so this proposal seems like a much less impactful change.

@damoclark

but that doesn't address the issue that it would still store useless (and wrong for the purposes of restore) path information.

It does address the issue you describe, using the -C semantics of tar, such that all backup snapshots are relative. The path information, while in your case might have no use, is not necessarily useless to everyone. In #4026, the path where the backup took place might still be recorded in the snapshot's metadata, but it would not be used in any restore unless, for instance, --inplace (#4575) were used.

Add support for explicit relative path designations (specifying the intended root of path) using common /some/path/./another/path notation to anchor the portion of the path to be stored.
Backups and exclude/includes should be able to be specified with the new "root" of the path starting after the /./

A novel approach, but I don't think it is a good idea encoding special meaning within paths using the current directory . period. It is not uncommon to build paths in shell scripts using command line args or shell variables for instance, that may include a period when invoked manually (e.g. ./path/to/somewhere). This may lead to unintended invocation of this special meaning. The -C semantics of tar are already a pretty well established solution to this exact problem, and are explicit.
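
To make the concern concrete, here is a small sketch of how a script could produce an embedded `/./` without intending to. The base directory `/srv/data` and the helper `naive_join` are hypothetical; the user argument is the `./path/to/somewhere` example from the comment above.

```python
import posixpath


def naive_join(base: str, user_arg: str) -> str:
    # Plain string concatenation, as a shell script might do with "$BASE/$ARG".
    return base.rstrip("/") + "/" + user_arg


joined = naive_join("/srv/data", "./path/to/somewhere")
print(joined)                      # /srv/data/./path/to/somewhere

# Under the proposed notation, this path would now carry an anchor marker.
has_marker = "/./" in joined
print(has_marker)                  # True

# Normalizing the path first removes the '.' component, and with it the marker.
normalized = posixpath.normpath(joined)
print(normalized)                  # /srv/data/path/to/somewhere
```

The last step also shows the other side of the trade-off: any tool that normalizes paths before passing them on would silently drop an intended marker.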

It also seems like a more significant undertaking, because it re-architects what restic uses to identify backups, so this proposal seems like a much less impactful change.

Unfortunately, I think re-architecting this part of restic is necessary. If done well, impact can be minimised. In fact, I actually think it will make restic easier and more intuitive to use. And yes, it will be a substantial undertaking. An interim solution might be worthwhile, but the challenge with half measures, such as the one you suggest, is that they might appear simple but risk breaking other things, such as parent matching. These ideas are discussed in #4026. Mucking around with the absolute paths under the current model is, in my view, risky.

I appreciate the need for a quick fix for your use case. I have similar requirements, although mine are for backing up LVM and APFS snapshots, rather than ZFS, and for storing backup archives.

I propose that the sooner the present model that relies upon absolute paths for identifying backup-sets is replaced, the sooner and easier a great number of existing issues and limitations of restic can be resolved. The evidence to me suggests it's worth the trouble.

In the meantime, there are some clever hacks you can use to work around your zfs issue, using symbolic links. They are barbaric, but they work if you are interested. :)

@eharris
Author

eharris commented Apr 11, 2024

The path information, while in your case might have no use, is not necessarily useless to everyone.

I never claimed it was never useful. I said in my case it wasn't useful since it was wrong in my specific situation. Which is why I'm looking for a way to override the default assumption that the path things are stored at now (backup/snapshot creation time) is meaningful in the future, which is often correct, but sometimes not.

It is not uncommon to build paths in shell scripts using command line args or shell variables for instance, that may include a period when invoked manually (e.g. ./path/to/somewhere). This may lead to unintended invocation of this special meaning.

You are correct that that could potentially happen, and if it is desired to protect against this "unintended" use, by all means, add a --use-relative flag that enables this behavior. I was just trying to avoid adding flags since there seems to be a cultural resistance in this project to the addition of flags because it is seen as complicating usage for "normal" users.

However it should be noted that if some script specifies a relative path and then some other part blindly prepends something to that path such that it ended up with an embedded /./, it is already probably not going to do what was intended.

The -C semantics of tar are already a pretty well established solution to this exact problem, and are explicit.

Perhaps. But those semantics do not actually handle the example I've given in this ticket. If you'll note, I've specified different relative roots in the paths to be backed up in my example, which -C does not handle, since you can only change to one directory.

Unfortunately, I think re-architecting this part of restic is necessary.

I don't know that I'd go so far as necessary, but I do agree that it seems like a good addition. But your proposal also does not completely solve my use case. Introducing labels does not alleviate the problem that useless and distracting "wrong" paths are being stored as an intrinsic part of the backup, and there is no mechanism to correct them.

But I also note that your proposed re-architecture ticket has been around for about 18 months without anyone showing any indication of being willing to take it on. In fact, you noted yourself in #4026 (comment) that it doesn't look like it will be implemented anytime soon. Whereas the change I've proposed is much simpler, and I might even be willing to take a stab at making a PR for it, assuming there was some indication from the maintainers that it might be accepted.

In the meantime, there are some clever hacks you can use to work around your zfs issue, using symbolic links. They are barbaric, but they work if you are interested. :)

I don't see how symbolic links could help, since restic resolves paths to absolute, which would undo any symbolic links. I already mentioned the only reasonable workaround I see for this in the OP, which is using bind mounts. I do agree that having to use bind mounts is pretty barbaric (as well as an incomplete solution). Can you explain how symlinks can help?

@damoclark

Hi Evan,

Which is why I'm looking for a way to override the default assumption that the path things are stored at now (backup/snapshot creation time) is meaningful in the future, which is often correct, but sometimes not.

I understand now what you are saying.

The -C semantics of tar are already a pretty well established solution to this exact problem, and are explicit.

Perhaps. But those semantics do not actually handle the example I've given in this ticket.

Actually, they do if I have understood your requirements correctly.

You want:

/mnt/backups/.zfs/snapshot/20240101-020000/argon/root/ to map to path /
/mnt/backups/.zfs/snapshot/20240101-020000/argon/boot/ to start at /boot
/mnt/backups/.zfs/snapshot/20240101-020000/argon/usr/ to map to /usr and
/mnt/backups/.zfs/snapshot/20240101-020000/argon/var/ to map to /var

Hope I understand this correctly.

If you'll note, I've specified different relative roots in the paths to be backed up in my example, which -C does not handle, since you can only change to one directory.

This is incorrect. -C can be provided multiple times using tar, and it is proposed to do the same for restic. So your requirements would be satisfied as follows:

restic backup -x --time "2024-01-01 02:00:00" --host argon \
-C /mnt/backups/.zfs/snapshot/20240101-020000/argon/root/  . \
-C /mnt/backups/.zfs/snapshot/20240101-020000/argon/  boot \
-C /mnt/backups/.zfs/snapshot/20240101-020000/argon/  usr \
-C /mnt/backups/.zfs/snapshot/20240101-020000/argon/  var

Again, this assumes I have interpreted your requirements correctly. :)

Unfortunately, I think re-architecting this part of restic is necessary.

I don't know that I'd go so far as necessary, but I do agree that it seems like a good addition.

Strong words, I know. :). My confidence in making such a bold statement is based on my observations in the forums and the github issues over an extended period of time. The current model confuses people, generates lots of questions, tricky feature requests, and workarounds, taking up maintainers' time that I think could be better spent.

That is why I think a change to the underlying model is necessary.

But your proposal also does not completely solve my use case. Introducing labels does not alleviate the problem that useless and distracting "wrong" paths are being stored as an intrinsic part of the backup, and there is no mechanism to correct them.

You need to read #4026 more thoroughly, as it's a discussion with evolving ideas based on contributions from others. I actually prefer to use 'name' rather than 'label': naming backup-sets.

But I also note that your proposed re-architecture ticket has been around for about 18 months without anyone showing any indication of being willing to take it on. In fact, you noted yourself in #4026 (comment) that it doesn't look like it will be implemented anytime soon.

Yes. As you can tell, I was disappointed by that revelation. But I'm not doing all the work and I fully respect that.

Whereas the change I've proposed is much simpler, and I might even be willing to take a stab at making a PR for it, assuming there was some indication from the maintainers that it might be accepted.

Go for it. I see you have referenced #3200. Have a good read, as it would be a great place to start. I was disappointed that Alex closed that PR. Also, have a good read of #4026 and Michael's comments in particular.

Cracking this nut, if it can be done in the short term, would benefit many (including me). My only concern is that some users could get themselves in a 'bit of a pickle' by 'fudging' the paths of a snapshot without decoupling its implicit meaning as a backup set identifier, especially if they don't know what they are doing, or make a mistake.

I don't see how symbolic links could help, since restic resolves paths to absolute, which would undo any symbolic links. I already mentioned the only reasonable workaround I see for this in the OP, which is using bind mounts. I do agree that having to use bind mounts is pretty barbaric (as well as an incomplete solution). Can you explain how symlinks can help?

I made a mistake here. You are correct that symbolic links won't help you with zfs. The symbolic links have helped me with APFS, because it's a single volume that I am backing up (Macintosh HD - Data) and I don't need to reassemble different mount points. Essentially, for each APFS snapshot, I create a symbolic link /Volumes/Time Machine -> /Volumes/.timemachine/<UUID>/<timestamp>.backup/<timestamp>.backup/Macintosh HD - Data and put this in a loop.

This way, the absolute path is always /Volumes/Time Machine for each time machine snapshot, and restic snapshot thereof.
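
The symlink step of that loop might look roughly like the sketch below. It is an assumption-laden illustration, not the author's actual script: the helper name `repoint_symlink` is hypothetical, the snapshot paths are passed in by the caller, and the restic invocation that would follow each re-pointing is deliberately left out.

```python
import os


def repoint_symlink(link: str, target: str) -> None:
    """Make `link` point at `target`, replacing any existing symlink."""
    # Create the new link under a temporary name, then rename it into
    # place, so that `link` never disappears between two snapshots.
    tmp = f"{link}.tmp-{os.getpid()}"
    os.symlink(target, tmp)
    os.replace(tmp, link)


# Hypothetical usage, one iteration per snapshot directory:
#   for snapshot_dir in snapshot_dirs:
#       repoint_symlink("/Volumes/Time Machine", snapshot_dir)
#       ... run the backup of "/Volumes/Time Machine" here ...
```

Because the link name stays constant, every backup run sees the same path regardless of which snapshot it currently points to.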

I think you are correct RE bind mounts for zfs.

D.

@eharris
Author

eharris commented Apr 12, 2024

This is incorrect. -C can be provided multiple times using tar, and it is [proposed] to do the same for restic.

I was not aware that -C could be specified multiple times for tar. But even if that is the case, I would guess that tar -C would also end up storing files more than once if any of the -C destinations ended up overlapping, whereas I think restic would not? tar has the ability to concatenate archives, whereas restic does not (yet) have the ability to merge backups, so it doesn't seem entirely similar.

However after already having had a cursory look at how restic processes paths to be backed up (as a single list of paths), it would seem significantly more complicated to have to change all the internals to be able to handle a more complicated data structure, whereas it would seem that my suggested method could still be done with the same simple list of paths, since both items (actual and wanted path) are communicated within each individual path.

@damoclark

I was not aware that -C could be specified multiple times for tar. But even if that is the case, ...

No 'ifs'. It's a fact. Try it for yourself. BSD & GNU tar alike.

I would guess that tar -C would also end up storing files more than once if any of the -C destinations ended up overlapping, whereas I think restic would not?

I have already done the research on this, in terms of how tar behaves and how restic could better implement it - it's in #4026. I'm not going to restate things here.

tar has the ability to concatenate archives, whereas restic does not (yet) have the ability to merge backups, so it doesn't seem entirely similar.

This is not a convincing argument to me. What is identical is the concept. That is what's important. Those who are already familiar with the concept will be able to easily apply it to restic. It's also a simple concept, so those who aren't will learn a new concept.

whereas it would seem that my suggested method could still be done with the same simple list of paths, since both items (actual and wanted path) are communicated within each individual path.

I strongly advise you not to pursue your proposed approach of using the . directory as a separator for translating paths. Don't get me wrong - it's a clever idea.

And naturally, it's your call as the PR initiator, and it's the Restic team's call as to whether they accept it. But...

Without turning into a computer science lecture, you are creating a semantic overload.

Encoding special meaning in the paths can only cause unintended grief for users down the road. I have already provided a concrete example in the way of concatenating user-provided path segments. So I'm not just speculating. Part of the issues with the current model of the paths being used to match parent snapshots is that the paths have been endowed with a special meaning beyond their literal meaning of being a path to 'what' is being backed up. The path can change, while the 'what' may not.

It is why, for instance, databases use arbitrary incremental numbers as primary keys (i.e. surrogate versus natural keys).

The . character is not just a separator for your path mapping, it is also a legitimate directory in and of itself. At some point, these two legitimate meanings will collide, and it will only cause issues for the user. : is no good either (as in docker volume mapping or PATH env), because it has special meaning in windows.

Furthermore, with the tar syntax -C it is not just a 1-1 mapping of paths for the creation of the tar archive VFS. You can do the following:

tar cv -C /home fred margot barry

say, if you only wanted to backup those three home directories, relative to /home.

The concept is: tar cv -C <set VFS root path here> <backup> <stuff> <relative> <to> <here> ... -C <now VFS root is here> ...
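
As a rough illustration of that concept, the sketch below pairs each target with the most recent `-C` directory. It is only a model of the argument semantics described above, under the assumption that every `-C` directory is absolute; the function name `parse_change_dir_args` is hypothetical, and restic has no such option today.

```python
from typing import List, Tuple


def parse_change_dir_args(args: List[str]) -> List[Tuple[str, str]]:
    """Pair each relative target with the most recent -C root."""
    root = "."  # before any -C, targets are relative to the current directory
    pairs = []
    it = iter(args)
    for arg in it:
        if arg == "-C":
            try:
                root = next(it)  # the directory following -C becomes the new root
            except StopIteration:
                raise ValueError("-C requires a directory argument") from None
        else:
            pairs.append((root, arg))
    return pairs


print(parse_change_dir_args(["-C", "/home", "fred", "margot", "barry"]))
# [('/home', 'fred'), ('/home', 'margot'), ('/home', 'barry')]
```

Each pair carries both pieces of information the `/./` notation encodes in a single string: where to read from, and the relative path to store.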

We teach this simple concept to users who don't already know it in the doco, and these semantics become very powerful, without the potential aforementioned problems.

This problem has already been solved - let's stand on the shoulders of giants. :)

However after already having had a cursory look at how restic processes paths to be backed up (as a single list of paths), it would seem significantly more complicated to have to change all the internals to be able to handle a more complicated data structure

I think you are arguing that your solution is much quicker and easier to implement, than #4026. If I have misunderstood, disregard my response below and help me understand.

I am not contesting that your short-term solution would be easier, notwithstanding the challenges of retrospectively enabling path mapping and/or relative paths under a model that is heavily reliant on absolute paths, and of doing so in a way that doesn't compound the issues of semantic overload. That is where I don't think your short-term solution is necessarily easier, and it will require careful work.

This is why I have been advocating for changing the model earlier, rather than later. By doing so, these problems (and many others) become much easier to solve. Or at least I think they will be. Again, I am guided by those who are very familiar with the code-base here.

Very happy to have intellectual discussion and debate on these ideas with you. But with sincerity, I would prefer it if you do the readings I suggested earlier first. This way, we have a common starting point for discussing new ideas going forward.

Damien.

@MichaelEischer
Member

This discussion completely mixes two only partially related issues. #4026 primarily addresses how to identify backup sets (something similar can already be achieved using backup --tag something-unique --group-by tags), but that is not what this issue seems to be about. The discussion here instead duplicates a lot of what has already been discussed in #2092 and #555, which primarily address where files end up within a snapshot.

/mnt/backups/.zfs/snapshot/20240101-020000/argon/root/./
/mnt/backups/.zfs/snapshot/20240101-020000/argon/./boot

That won't behave as you'd expect. The root subvolume normally contains an empty boot folder as mountpoint. As a consequence the resulting snapshot would contain /boot (from root) and /boot-1 (from ./boot). Restic has no idea whether files with colliding names should be overwritten or not (duplicate filenames will never be valid within a restic snapshot). As it's a backup program, the heuristic is to just keep everything. The incredibly ugly part there is that restic has to enumerate all files in root/./ to know whether /./boot will collide or not.
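
The "keep everything" heuristic can be sketched as follows. The `-1` suffix matches the `/boot-1` example above; the general numbering scheme and the function name `place` are assumptions for illustration, not restic's actual implementation.

```python
from typing import Set


def place(name: str, taken: Set[str]) -> str:
    """Return a name that does not collide with anything in `taken`."""
    candidate, n = name, 0
    while candidate in taken:
        # Name already used at this level: try name-1, name-2, ...
        n += 1
        candidate = f"{name}-{n}"
    taken.add(candidate)
    return candidate


# Top-level entries already contributed by the root volume,
# including the empty 'boot' mountpoint folder.
taken = {"boot", "etc", "home"}
print(place("boot", taken))  # boot-1
```

Note that `taken` has to be complete before `place` can give a correct answer, which is the enumeration cost mentioned above.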

You are correct that that could potentially happen, and if it is desired to protect against this "unintended" use, by all means, add a --use-relative flag that enables this behavior. I was just trying to avoid adding flags since there seems to be a cultural resistance in this project to the addition of flags because it is seen as complicating usage for "normal" users.

There is a resistance to blindly adding flags without first discussing whether it is actually necessary; things should just work without requiring users to assemble the right magic combination of options. More flags that interact with each other in surprising ways don't help anyone. Although, the file layout issue will likely require some additional options.

At some point, these two legitimate meanings will collide, and it will only cause issues for the user. : is no good either (as in docker volume mapping or PATH env), because it has special meaning in windows.

: is also a perfectly valid character in filenames on unix.

But I also note that your proposed re-architecture ticket has been around for about 18 months without anyone showing any indication of being willing to take it on. In fact, you noted yourself in #4026 (comment) that it doesn't look like it will be implemented anytime soon.

Yes. As you can tell, I was disappointed by that revelation. But I'm not doing all the work and I fully respect that.

I can only take a look at so many things at a time, which also means that I can't get involved in every discussion as otherwise I wouldn't get anything done.

it would seem significantly more complicated to have to change all the internals to be able to handle a more complicated data structure

The usual rule of thumb is that a more complicated implementation is ok if that simplifies the interface (all within certain limits obviously).

And naturally, it's your call as the PR initiator, and it's the Restic team's call as to whether they accept it. But...

I won't have time to have a closer look at this and the related issues before the development cycle for restic 0.18.0 starts. Probably anything you start working on now won't match the outcome of the corresponding discussions.

@damoclark

Yes. As you can tell, I was disappointed by that revelation. But I'm not doing all the work and I fully respect that.

I can only take a look at so many things at a time, which also means that I can't get involved in every discussion as otherwise I wouldn't get anything done.

Important things first.

My comment wasn't a criticism. It is okay that I am disappointed - you don't owe me, or anyone else anything. You have to make decisions and they can't always please everyone.

And I take your point RE discussion. I'm making my reply deliberately brief. :)

This discussion completely mixes two only partially related issues. #4026 primarily addresses how to identify backup sets (something similar can already be achieved using backup --tag something-unique --group-by tags)

Not entirely, Michael. In fact, this point was discussed at length in the issue. But let's leave that for a more appropriate time...

I won't have time to have a closer look at this and the related issues before the development cycle for restic 0.18.0 starts.

No problem. Appreciate the clarity.

/mnt/backups/.zfs/snapshot/20240101-020000/argon/root/./
/mnt/backups/.zfs/snapshot/20240101-020000/argon/./boot

That won't behave as you'd expect. The root subvolume normally contains an empty boot folder as mountpoint. As a consequence the resulting snapshot would contain /boot (from root) and /boot-1 (from ./boot).

This important point slipped by me, due to my lack of experience with ZFS. Thanks for this Michael. There are proposed ideas in #4026 and elsewhere to address this - at a later time.

The discussion here instead duplicates a lot of what has already been discussed in #2092 and #555, which primarily address where files end up within a snapshot.

Thanks for this. I totally missed #555, and especially that rsync already partially adopts what Evan was proposing by applying special meaning to the . directory in path translation. While this doesn't change my position, I do acknowledge that it has been implemented in a widely used technology. I have skimmed over that section of the rsync man page many times in the past and never paid attention to what it actually does. It may have baffled some people in the past when it was unintentionally applied.

The usual rule of thumb is that a more complicated implementation is ok if that simplifies the interface (all within certain limits obviously).

I share your philosophy.

@MichaelEischer
Member

This important point slipped by me, due to my lack of experience with ZFS. Thanks for this Michael. There are proposed ideas in #4026 and elsewhere to address this - at a later time.

That is actually unrelated to ZFS. Using Linux, you can only mount a volume over a folder. That is to mount a volume at some path you have to use mkdir -p /some/path && mount /dev/something /some/path. Therefore the root volume snapshot will contain an empty folder at /some/path.

@damoclark

That is actually unrelated to ZFS. Using Linux, you can only mount a volume over a folder. That is to mount a volume at some path you have to use mkdir -p /some/path && mount /dev/something /some/path. Therefore the root volume snapshot will contain an empty folder at /some/path.

Yes, I understand this. My misunderstanding was with how ZFS was arranging its mounted snapshots. What you describe is an important problem to solve for restic going forward, but it has been solved by others already. This provides some guidance on how to approach it within the model of restic. GNU and BSD tar approaches are a good place to start as discussed in #4026.

I have great respect for the work you and your team do on Restic. Backup technologies are high-stakes complicated beasts, especially when they are multi-platform (including the black sheep Windows juggernaut).

3 participants