-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Back up and restore Windows Alternate Data Streams #4614
base: master
Are you sure you want to change the base?
Conversation
Modify the Archiver to read the ADS from each file and add them as independent Nodes with the full ADS name as the name of the file. During restore the ADS files are restored using the ADS name and that automatically attaches them as ADS to the main file.
In the case where the main file gets restored after the ADS, it could overwrite and remove the ADS. Hence we try creating the file without the O_CREATE option first and if the file does not exist, then add the option and try again.
ADS are added as top level files in the repository. However, they must not be counted in the finished and total count, but their sizes must be counted.
The ADS is essentially a part of the main file. Hence during filtering, we need to derive the main file name from the ADS name for filtering to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat torn on whether I prefer the current approach of just mapping ADS into separate files and then merging them back together on restore, or whether I'd prefer them to be attributes of the corresponding directory / file. The mapping approach seems to be relatively simple to implement, and even allows the ADS to become visible on Unix system (as separate files, but I guess that's acceptable). But it somewhat complicates the restore steps. However, handling ADS as file attributes would require changes all across the place (even in prune) that are way more complex. So I guess the basic approach here is the right one.
I haven't looked at the windows API code yet, I guess there's no nice way to abstract it away ^^ ...
Would you like to prioritize this PR or #4611 ? I don't have time look at both (huge) PRs at the same time.
func preProcessTargets(filesys fs.FS, targets *[]string) { | ||
for _, target := range *targets { | ||
target = filesys.Clean(target) | ||
addADSStreams(target, targets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't we discover the ads stream just in time when a file/dir is backed up and thereby get rid of all this preprocessing? Shouldn't be too complex to add that to SaveDir
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the beginning, resolveRelativeTargets is called from scanner's Scan method and from archiver Snapshot method and the sorted cleanTargets obtained in that step are then passed to NewTree.
We need these ADS entries to be available before the Node creation in the tree.
We do add this in SaveDir as well, but it is also needed in resolveRelativeTargets.
@@ -336,6 +336,15 @@ func (res *Restorer) RestoreTo(ctx context.Context, dst string) error { | |||
return err | |||
} | |||
|
|||
func (res *Restorer) addFile(node *restic.Node, size uint64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we just pretend the ADS are also regular files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, users try to verify backups by relying on the file count as a metric. If ADS is treated as a regular file, the counts would be added, and it could cause confusion.
Users would have to use dir /R to list the ADS as well and get the count. When you try to get the count by using Right Click -> Properties, which is what most users will do, you will not see the ADS count.
Similarly, if you restore on non-windows systems, ADS files should not appear (discussing this on a separate comment), and the counts would differ.
Besides by definition I feel Alternate Data Streams are attachments to the main files and not equal in status to regular files.
The other option may be to count it and give an extra ADS count in the summary for windows, so that users can use it to correlate the difference in the results. It still feels like it would be confusing for most users by including ads count in main file count.
On the other hand, if we skip it in the counts in summary and an ADS is updated, the summary will show 0 changed, but data size added to the repo will be shown. This may not be too bad.
Any more considerations?
var flags int | ||
var f *os.File | ||
var err error | ||
// TODO optimize this. When GenericAttribute change is merged, we can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the GenericAttribute to know whether a file is an ADS or not; looking a the file name is enough.
This change is only correct, if the target directory is empty, otherwise the file can already exist and even have the wrong file size! So, I'm afraid we'll have to teach the filerestorer.go
how to properly handle the ordering constraints for ADS. (I'm not sure so far what's the best approach for that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this TODO is in reference to the main file which has the ADS attached.
GenericAtrribute could be TypeHasADS.
Without this the main file is oblivious to any ADS attached to it, the ADS themselves just exist as independent nodes.
And we want to know that the main file has ADS attached, because only in that case do we want to check for presence of the main file before using the O_CREATE option. Otherwise, we can directly use the O_CREATE option like the existing master code.
Regarding the "This change is only correct, if the target directory is empty, otherwise the file can already exist and even have the wrong file size", I'm not sure I understand. There may be some confusion.
Let's say we have a test file - D:\targetfile.txt. It has an ADS attached - D:\targetfile.txt:Stream1:$DATA.
Now this file is backed up along with its ADS and these are stored as 2 separate files in the restic repo.
During restore, this method OpenFile will be called twice, and the calls can come any order depending on the content:
Once, for file name - D:\targetfile.txt
Another time, for the ADS - D:\targetfile.txt:Stream1:$DATA
(For now, consider that the target restore directory did not already exist.)
Whichever happens first will find - os.IsNotExist(err) is true and will then call the code below resulting in the creation of the main file D:\targetfile.txt.
flags = os.O_CREATE | os.O_TRUNC | os.O_WRONLY
f, err = os.OpenFile(path, flags, 0600)
If it was called for D:\targetfile.txt first,
then D:\targetfile.txt will be created and the contents of the main file will be written to it.
Then the call comes for D:\targetfile.txt:Stream1:$DATA at some point, it will find that the file for the stream does not exist.
It will use the os.O_CREATE option and it creates the stream only and writes the data to the stream without any issue.
If it was called for D:\targetfile.txt:Stream1:$DATA first,
then it finds that the stream file does not exist and it uses the os.O_CREATE flag, and in that case, it creates an empty file D:\targetfile.txt and starts writing the ADS at path - D:\targetfile.txt:Stream1:$DATA.
Next when the call comes in for the main file targetfile.txt, it will find that os.IsNotExist(err) is false, because the main file was already created as a 0 size file and it will not use the os.O_CREATE option in this case. It will just continue to write the contents on the existing 0 size targetfile.txt.
Without this new check in place, when the second call comes in for targetfile.txt, it would replace the 0 size file which was written when the write for D:\targetfile.txt:Stream1:$DATA had started and ultimately the ADS is lost during restore.
Even if the target restore directory and all the files already existed before restore, the first time any file is opened for restore, createSize >= 0 and we initially open the file with os.O_TRUNC | os.O_WRONLY.
So, the file is truncated before the new content for restore is written. The file sizes after restore are correct.
Am I missing some scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another scenario that depends on the GenericAttribute.
If a file exists in target with say 10 ADS (streams) and it needs to be replaced with an older file with 5 ADS (streams) during restore, we need to know when restoring the main file, that the extra 5 streams need to be removed.
For this the main file needs to know what ADS (streams) are attached to it. Remaining streams will be removed during restore of main file.
So, the main file will have a GenericAttribute - TypeHasADS with value as -
Stream1|Stream2|Stream3 ...
However, fileswriter_windows.go is not aware of the Node and does not have access to the generic attributes.
Need to think of the best approach for this. One approach is to restore to a temp location first and then move the file to actual location, replacing the old file. This way the old streams are removed and we do not need to rely on the nodes.
Thanks for looking into these. Definitely prefer prioritizing #4611. |
Another reason to keep ADS as separate files is because technically they can be of any size. They're just like attaching a full blown file to another existing file which we refer to as the main file. Regarding restore of a file with ADS on non-windows, I think the technically correct expectation would be to ignore the streams on those systems since it is a windows specific function. Otherwise it will restore the ADS as separate files with name as filename:adsname:$DATA. On a side note, since unix filenames can have :, though not recommended for cross-platform use, this is another reason to use GenericAttributes to distinguish ADS files and ADS main files from normal files. |
Move SaveDir back to archiver and use hooks in the os specific archivers. Rename _unix to _notwin.
I actually think that non-Windows systems should just create those |
This has the advantage that the user restoring data could choose to ignore ADS. Would it also mean that there's the possibility that a file with the "regular" name filename:adsname:$DATA could overwrite an ADS when backing up? For example: document.docx has an ADS adsname:$DATA. If document.docx:adsname:$DATA is then backed up from a UNIX system, it "overwrites" the ADS data. Would a regular file filename:adsname:$DATA on a Unix filesystem be backed up as NTFS ADS attached to document.docx? Would a user expect it to? |
Yes that would result in the ads data being overwritten as it currently stands. I have made updates to this code, which use generic attributes to handle such cases and other cases and plan to rebase this after #4611 is merged. Even though this mechanism can be used to create ADS-like files in Unix systems, my concern is if it should. Most realworld usecases where ADS is useful is where tools are using them on windows. It will probably be safer and less intrusive to exclude the streams from being restored as files on the Unix systems to begin with and if there is a real requirement for it, it can be taken as a PR later, possibly as an optional flag. |
What does this PR change? What problem does it solve?
Restic did not back up Alternate Data Streams in Windows. Restic now backs up Alternate Data Streams (ADS) and restores them back to the main files.
The Alternate Data Streams are backed up like any other normal files, and the full name of the stream like 'file_name.txt:stream_name' is stored as the name of the file representing the ADS.
During restore, the ADS files are restored and attached to the original files as Alternate Data Streams. For progress and summary, the ADS are not counted in the "filesFinished" and "filesTotal" count, but the sizes of the ADS files are counted.
This feature does not require any update to the repository structure.
However, if #4611 is merged, it can be leveraged for making an optimization in this PR.
While restoring the ADS files, since the main file and ADS can be restored in any order, it is possible that the ADS can be restored before the main file, in which case it creates an empty main file and starts the ADS restore. After this when the main file is restored, if it is restored with the os.O_CREATE option, then it would replace the full file and ADS would be lost. Hence, we need to first try without the os.O_CREATE option and if we find that the file does not exist, then use the os.O_CREATE option. Presently this is being done for all files. With #4611 merged, we would be able to add a new generic attribute TypeADS to indicate if a file has ADS and only do the check without the os.O_CREATE option for those files.
Was the change previously discussed in an issue or on the forum?
Closes #1401
Checklist
changelog/unreleased/
that describes the changes for our users (see template).gofmt
on the code in all commits.