Support non utf symlink targets #3802
Seems like a sound design overall. Does need a changelog entry, though.
Let's say, you have a file named So, quoting filenames but not quoting Of course you can add more quirks and either add stuff to handle only non-unicode (leaving this as above) or also change this example to: Sorry, but this is still a mess! IMO the only clean solution is to deprecate I know, this breaks forward-compatibility but so does the repo v2 format which supports compression. I know that this adds some headache if you want to correct existing repos. But this could be done by adding this to a new repo version and supplying a migrate command.
Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes (not necessarily UTF-8, it'll be whatever the OS and/or the filesystem is configured to use), then we should handle it as such and store/restore a byte slice, e.g. keep the
@fd0 you are right. And I also strongly support a convention that valid UTF-8 (which is the standard case) should be stored as-is, as this makes reading the JSON tree (e.g. using However, all these arguments also apply to the
e698fe7 to 9f121f6
I've changed the PR to encode non-UTF-8 symlinks using base64, which is surprisingly simple as Go automatically base64-encodes Yes, these considerations also apply to the Just using either the old or the new filename encoding would also complicate things a lot, as that would require encoding a Node differently depending on the repository version.
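The base64 approach described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `node` type is a hypothetical stand-in, and the field name `linktarget_raw` is an assumption made for the example. It relies only on the documented behaviour of `encoding/json`, which encodes `[]byte` values as base64 strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// node is a hypothetical, stripped-down tree node used only for this sketch.
// The "linktarget_raw" field name is an assumption, not taken from the PR.
type node struct {
	LinkTarget    string `json:"linktarget"`
	LinkTargetRaw []byte `json:"linktarget_raw,omitempty"`
}

// encodeNode always sets the plain field so that older readers keep working,
// and adds the raw bytes only when the target is not valid UTF-8.
func encodeNode(target string) ([]byte, error) {
	n := node{LinkTarget: target}
	if !utf8.ValidString(target) {
		n.LinkTargetRaw = []byte(target) // marshalled as a base64 string
	}
	return json.Marshal(n)
}

// decodeTarget prefers the raw bytes when present and falls back to the
// plain field otherwise.
func decodeTarget(buf []byte) (string, error) {
	var n node
	if err := json.Unmarshal(buf, &n); err != nil {
		return "", err
	}
	if len(n.LinkTargetRaw) > 0 {
		return string(n.LinkTargetRaw), nil
	}
	return n.LinkTarget, nil
}

func main() {
	target := "caf\xe9" // Latin-1 encoded, not valid UTF-8
	buf, err := encodeNode(target)
	if err != nil {
		panic(err)
	}
	restored, err := decodeTarget(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("json: %s\nlossless: %v\n", buf, restored == target)
}
```

Valid UTF-8 targets stay readable in the JSON because the extra field is omitted for them; only the unusual case pays the cost of an opaque base64 value.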
Hmm, the cost-benefit calculation would probably be different if we clean up the [Edit] Although even that solution would mean that tree blobs in the current format will eventually have to be rewritten in the future format. So it will still have some costs attached to it. Adding support for chunked trees is actually possible in a rather simple way which does not require rewriting trees: if there is only a single chunk, then use the current I'm wondering whether it would be a good idea to just store
9f121f6 to dffa62e
@MichaelEischer May I suggest to also add a Of course that wouldn't work with already-written data or older program versions. But it would provide a migration path towards a clean solution by first making the
@aawsome I'd like to discuss the
So your suggestion is to only keep
dffa62e to 331ff40
I actually have two issues with
No. Actually I like the solution you implemented here with
331ff40 to 4e4e57f
4e4e57f to f12bbd9
LGTM.
@MichaelEischer Shouldn't this treatment of symlink targets be added to https://github.com/restic/restic/blob/master/doc/design.rst?
Another remark - and sorry if there are errors or things are not fully understandable, I don't use Windows and am no expert in its details (I just dug into how Rust handles filenames, and it is very pedantic about being correct)... Actually, I still don't know what's going on on Windows and IMO this also needs documentation. On Windows, filenames are not stored as byte sequences but as sequences of 16-bit words. This means we cannot talk about UTF-8 on Windows, at least when we are talking about how the OS stores filenames. I know that Golang does some internal conversion, so all filenames always look like byte sequences and you can treat them as UTF-8 if they are valid Unicode. But I don't know how non-Unicode filenames (if they are allowed) are converted from 16-bit words to an 8-bit byte sequence. Moreover, I don't even know if there are cases where a valid Windows filename cannot be transformed to a non-UTF8 byte sequence and what Golang would do in such a case. Can anyone provide information about this topic? Besides this, I also don't know whether this linktarget topic applies to Windows at all, or whether it would only arise for "standard" filenames....
Oh indeed. I'll open a PR later.
The Windows API methods suffixed with a [Edit]The conversion to UTF-8 isn't optional as otherwise it would become impossible to restore Windows backups on Linux and vice versa.[/Edit]
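To make the question about 16-bit filenames more concrete, here is a small sketch of what a UTF-16 to UTF-8 conversion looks like with Go's `unicode/utf16` package. It is only an illustration under the assumption that a lossy standard-library conversion is representative; it does not claim to be the exact code path the Go runtime or restic takes on Windows. The helper name `toUTF8` is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF8 is a hypothetical helper: it decodes UTF-16 code units into runes
// and returns them as a UTF-8 encoded Go string. utf16.Decode substitutes
// U+FFFD for any surrogate that is not part of a valid pair.
func toUTF8(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	// "a" followed by U+1F600, encoded as a valid surrogate pair.
	valid := []uint16{'a', 0xD83D, 0xDE00}
	// "a", an unpaired high surrogate, then "b": not well-formed UTF-16.
	lone := []uint16{'a', 0xD83D, 'b'}

	fmt.Printf("%q\n", toUTF8(valid))
	fmt.Printf("%q\n", toUTF8(lone))
}
```

With this kind of conversion, a name containing an unpaired surrogate cannot be mapped back to its original 16-bit sequence, which is the same class of loss the PR addresses for non-UTF-8 symlink targets.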
Ah great, thanks for the information!
I've opened #4422
What does this PR change? What problem does it solve?
Currently, symlinks whose target name is invalid UTF-8 are not backed up correctly. The invalid characters are replaced with \ufffd. This PR introduces an optional quoted field which is used for symlink targets which might not encode correctly. The old linktarget field in a Node still exists and continues to be set to stay backwards compatible with older clients. Thus, for old clients the symlink will behave just as before, whereas later restic versions which support the new field will use that one automatically instead. This allows for a backwards compatible addition of a correct symlink encoding. I've decided against binding the symlink encoding to a specific repository version as that would create all sorts of headaches when copying snapshots between repositories with different versions.
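The lossy behaviour described above can be reproduced with a short Go sketch. The `node` type here is a hypothetical one-field stand-in, with only the `linktarget` field name taken from the description; the replacement itself is documented behaviour of `encoding/json`, which coerces strings to valid UTF-8 when marshalling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node is a hypothetical, minimal stand-in for a tree node.
type node struct {
	LinkTarget string `json:"linktarget"`
}

// roundTrip marshals the target into JSON and reads it back, the way a
// backup followed by a restore would.
func roundTrip(target string) (string, error) {
	buf, err := json.Marshal(node{LinkTarget: target})
	if err != nil {
		return "", err
	}
	var out node
	if err := json.Unmarshal(buf, &out); err != nil {
		return "", err
	}
	return out.LinkTarget, nil
}

func main() {
	target := "caf\xe9" // Latin-1 encoded, not valid UTF-8
	got, err := roundTrip(target)
	if err != nil {
		panic(err)
	}
	// The invalid byte comes back as U+FFFD, so the original target is lost.
	fmt.Printf("%q -> %q (lossless: %v)\n", target, got, got == target)
}
```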
Was the change previously discussed in an issue or on the forum?
Fixes #3311
Checklist
changelog/unreleased/ that describes the changes for our users (see template).
gofmt on the code in all commits.