Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non utf symlink targets #3802

Merged
merged 2 commits into from Jul 22, 2023

Conversation

MichaelEischer
Copy link
Member

@MichaelEischer MichaelEischer commented Jun 17, 2022

What does this PR change? What problem does it solve?

Currently symlinks whose target name is invalid utf8 are not backed up correctly. The invalid characters are replaced with \ufffd. This PR introduces an optional quoted field which is used for symlink targets which might not encode correctly. The old linktarget field in a Node still exists and continues to be set to stay backwards compatible to older clients. Thus, for old clients the symlink will behave just as before, whereas later restic versions which support the new field will use that one automatically instead.

This allows for a backwards compatible addition of a correct symlink encoding. I've decided against binding the symlink encoding to a specific repository version as that would create all sorts of headaches when copying snapshots between repositories with different versions.

Was the change previously discussed in an issue or on the forum?

Fixes #3311

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

internal/restic/node.go Outdated Show resolved Hide resolved
internal/restic/node.go Outdated Show resolved Hide resolved
@greatroar
Copy link
Contributor

Seems like a sound design overall. Does need a changelog entry, though.

@aawsome
Copy link
Contributor

aawsome commented Jul 3, 2022

Let's say, you have a file names " linking to "a.
Then current restic generated the json:
{"nodes":[{"name":"\\\"","linktarget":"\"a"},{"name":"\\\"a"}]}

So, quoting filenames but not quoting linktarget not only broke non-unicode linktargets, but also introduced a mess in the tree format.

Of course you can add more quirks and either add stuff to handle only non-unicode (leaving this as above) or also change this example to:
{"nodes":[{"name":"\\\"","linktarget":"\"a", "linktarget_quoted":"\\\"a"},{"name":"\\\"a"}]}

Sorry, but this is still a mess!

IMO the only clean solution is to deprecate linktarget and add a new field like target to let this always look
{"nodes":[{"name":"\\\"", "target":"\\\"a"},{"name":"\\\"a"}]}
(or rethink the quoting altogether, but this you have already declined in #3808 ...)

I know, this breaks forward-compatibility but so does the repo v2 format which supports compression. I know that this adds some headache if you want to correct existing repos. But this could be done by adding this to a new repo version an supplying a migrate command.

@fd0
Copy link
Member

fd0 commented Jul 6, 2022

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes (not necessarily UTF-8, it'll be whatever the OS and/or the filesystem is configured to use), then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

@aawsome
Copy link
Contributor

aawsome commented Jul 6, 2022

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes, then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

@fd0 you are right. And I also strongly support a convention that valid UTF-8 (which is the standard case) should be stored as-is, as this makes reading the JSON tree (e.g. using restic cat) much easier for most files.

However, all these arguments also apply to the name field...

@MichaelEischer
Copy link
Member Author

MichaelEischer commented Jul 13, 2022

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes, then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

@fd0 you are right. And I also strongly support a convention that valid UTF-8 (which is the standard case) should be stored as-is, as this makes reading the JSON tree (e.g. using restic cat) much easier for most files.

However, all these arguments also apply to the name field...

I've changed the PR to encode non-utf8 symlinks using base64, which is surprisingly simple as go automatically base64-encodes []byte fields. The linktarget field is still always set to maintain backwards compatibility.

Yes, these considerations also apply to the name field, but changing that field would require two new fields (as they are interpreted differently) and result in quite a mess to correctly decode and encode it. And I don't think that's really worth it. Especially as unlike the linktarget field, we won't gain any new functionality. The only thing that changes is that we get rid of the somewhat non-standard filename encoding.

Just using either the old or the new filename encoding would also complicate things a lot, as that would require encoding a Node differently depending on the repository version.

@MichaelEischer
Copy link
Member Author

MichaelEischer commented Jul 13, 2022

Hmm, the cost-benefit calculation would probably be different if we clean up the name field along with other breaking changes. I'm thinking of adding support for chunked trees to the nodes, which won't be possible in a backwards compatible way. But that could be an opportunity to clean up other stuff.

[Edit]Although even that solution would mean that tree blobs in the current format will eventually have to be rewritten in the future format. So will still have some costs attached to it. Adding support for chunked trees is actually possible in a rather simple way why does not require rewriting trees: if there is only a single chunk, then use the current tree field, otherwise use treechunks.[/Edit]

I'm wondering whether it would be a good idea to just store name and linktarget always as base64 encoded. It won't make much of a difference in the filesize, at least after it's compressed. But it would simplify decoding.

@aawsome
Copy link
Contributor

aawsome commented Jul 9, 2023

@MichaelEischer May I suggest to also add a name_raw field which is filled if the (encoded) name field doesn't exactly reflect the filename? When created that way, program reading trees can always base64-decode a *_raw field when given and use the * (without _raw) field directly, if not.

Of course that wouldn't work with already-written data or older program versions. But it would provide a migration path towards a clean solution by first making the name_raw field optional, then required and later deprecating name fields which contain encoding.

@MichaelEischer
Copy link
Member Author

@aawsome I'd like to discuss the name field separately from linktarget. I've created issue #4408 for that. The main difference in my opinion is that the linktarget modifications add support for non-UTF8 symlink targets, whereas the suggested name changes are purely cosmetic (strconv.Quote is perfectly capable of handling non-UTF8).

Of course that wouldn't work with already-written data or older program versions. But it would provide a migration path towards a clean solution by first making the name_raw field optional, then required and later deprecating name fields which contain encoding.

So your suggestion is to only keep name_raw and linktarget_raw and deprecate both name and linktarget? If we make such a change, I'd like to coordinate it with #4006 and the support for chunked trees (likely won't happen this year) to minimize the resulting disruptions.

@aawsome
Copy link
Contributor

aawsome commented Jul 17, 2023

(strconv.Quote is perfectly capable of handling non-UTF8).

I actually have two issues with strconv.Quote:

  • I personally am not able to judge if it is perfectly capable of handling non-UTF8. I tried to find a specification of what it does and wasn't able so far. Even in https://github.com/restic/restic/blob/master/doc/design.rst we are not able to tell, what it exactly does. So it's kind of a black box we are using here. For a backup format this IMO is even worse than an open question about whether it's possible to handle non-UTF.
  • Its use is easy to oversee as for many common filenames it doesn't change anything. This leads uninformed consumers to just use that field - and maybe make mistakes. E.g. using \"abc\" as filename which in fact was "abc" before getting backup'ed. See quoted file names (non-unicode filenames) juergenhoetzel/pyrrhic#1 for example.

So your suggestion is to only keep name_raw and linktarget_raw and deprecate both name and linktarget?

No. Actually I like the solution you implemented here with linktarget and linktarget_raw very much. So I propose to change the repository tree-format step-by-step to also use name and name_raw in exactly the same way. I.e. deprecate the usage of strconv.Quote on the name field.
(an alternative could be introduce new fields like filename/filename_raw and deprecate name.)

@MichaelEischer
Copy link
Member Author

LGTM.

@MichaelEischer MichaelEischer merged commit 98fb56b into restic:master Jul 22, 2023
12 checks passed
@MichaelEischer MichaelEischer deleted the support-non-utf-symlinks branch July 22, 2023 22:15
@aawsome
Copy link
Contributor

aawsome commented Jul 23, 2023

@MichaelEischer Shouldn't this treatment of symlink targets be added to https://github.com/restic/restic/blob/master/doc/design.rst?

@aawsome
Copy link
Contributor

aawsome commented Jul 23, 2023

Another remark - and sorry if there are errors or things are not fully understandable, I don't use windows and am no expert in its details (just did dig into how Rust handles filenames and it is very pedantic about being correct)...

Actually, I still don't know what's going on on Windows and IMO this also needs documentation. On Windows, filenames are not stored as byte-sequences but as sequences of 16-bit words. This means, we cannot talk about utf8 on windows, at least when we are talking about how the OS stores filenames. I know that Golang does some internal conversion, so all filenames always look at byte-sequences and you can treat them as UTF8 if they are valid unicode. But I don't know how non-unicode filenames (if they are allowed) are converted from 16-bit words to an 8-bit byte sequence. Moreover, I don't even know if there are cases where a valid Windows filename cannot be transformed to a non-UTF8 byte sequence and what Golang would do in such a case. Can anyone provide information about this topic?

Besides this, I also don't know if this linktarget topic anyway applies to Windows or if this topic would only arise when applying it to "standard" filenames....

@MichaelEischer
Copy link
Member Author

MichaelEischer commented Jul 23, 2023

@MichaelEischer Shouldn't this treatment of symlink targets be added to https://github.com/restic/restic/blob/master/doc/design.rst?

Oh indeed. I'll open a PR later.

On Windows, filenames are not stored as byte-sequences but as sequences of 16-bit words.

The Windows API methods suffixed with a W use UTF-16 which is converted by Go into UTF-8 . On the filesystem NTFS stores all names as Unicode: https://learn.microsoft.com/en-us/windows/win32/intl/character-sets-used-in-file-names .

[Edit]The conversion to UTF-8 isn't optional as otherwise it would become impossible to restore windows backups on linux and vice versa.[/Edit]

@aawsome
Copy link
Contributor

aawsome commented Jul 24, 2023

The Windows API methods suffixed with a W use UTF-16 which is converted by Go into UTF-8 . On the filesystem NTFS stores all names as Unicode:

Ah great, thanks for the information!
So, on Windows there is no non-unicode topic at all and a conversion into UTF-8 is sufficient. Maybe this is also a good info for the design document?

@MichaelEischer
Copy link
Member Author

I've opened #4422

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Symlinks with paths that are invalid utf8 not backed up correctly
4 participants