Support non utf symlink targets #3802

MichaelEischer · 2022-06-17T17:43:29Z

What does this PR change? What problem does it solve?

Currently symlinks whose target name is invalid utf8 are not backed up correctly. The invalid characters are replaced with \ufffd. This PR introduces an optional quoted field which is used for symlink targets which might not encode correctly. The old linktarget field in a Node still exists and continues to be set to stay backwards compatible to older clients. Thus, for old clients the symlink will behave just as before, whereas later restic versions which support the new field will use that one automatically instead.

This allows for a backwards compatible addition of a correct symlink encoding. I've decided against binding the symlink encoding to a specific repository version as that would create all sorts of headaches when copying snapshots between repositories with different versions.

Was the change previously discussed in an issue or on the forum?

Fixes #3311

Checklist

I have read the contribution guidelines.
I have enabled maintainer edits.
I have added tests for all code changes.
I have added documentation for relevant changes (in the manual).
There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
I have run gofmt on the code in all commits.
All commit messages are formatted in the same style as the other commits in the repo.
I'm done! This pull request is ready for review.

internal/restic/node.go

internal/restic/node_test.go

internal/restic/node.go

greatroar · 2022-06-18T09:21:56Z

Seems like a sound design overall. Does need a changelog entry, though.

aawsome · 2022-07-03T03:47:58Z

Let's say, you have a file names " linking to "a.
Then current restic generated the json:
{"nodes":[{"name":"\\\"","linktarget":"\"a"},{"name":"\\\"a"}]}

So, quoting filenames but not quoting linktarget not only broke non-unicode linktargets, but also introduced a mess in the tree format.

Of course you can add more quirks and either add stuff to handle only non-unicode (leaving this as above) or also change this example to:
{"nodes":[{"name":"\\\"","linktarget":"\"a", "linktarget_quoted":"\\\"a"},{"name":"\\\"a"}]}

Sorry, but this is still a mess!

IMO the only clean solution is to deprecate linktarget and add a new field like target to let this always look
{"nodes":[{"name":"\\\"", "target":"\\\"a"},{"name":"\\\"a"}]}
(or rethink the quoting altogether, but this you have already declined in #3808 ...)

I know, this breaks forward-compatibility but so does the repo v2 format which supports compression. I know that this adds some headache if you want to correct existing repos. But this could be done by adding this to a new repo version an supplying a migrate command.

fd0 · 2022-07-06T06:01:58Z

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes (not necessarily UTF-8, it'll be whatever the OS and/or the filesystem is configured to use), then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

aawsome · 2022-07-06T10:08:34Z

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes, then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

@fd0 you are right. And I also strongly support a convention that valid UTF-8 (which is the standard case) should be stored as-is, as this makes reading the JSON tree (e.g. using restic cat) much easier for most files.

However, all these arguments also apply to the name field...

MichaelEischer · 2022-07-13T16:56:10Z

Just a comment from the sidelines: if the symlink target field is just an arbitrary sequence of bytes, then we should handle it as such and store/restore a byte slice, e.g. keep the linktarget field, but add a linktarget_raw field which is []byte and encodes as base64 in JSON.

@fd0 you are right. And I also strongly support a convention that valid UTF-8 (which is the standard case) should be stored as-is, as this makes reading the JSON tree (e.g. using restic cat) much easier for most files.

However, all these arguments also apply to the name field...

I've changed the PR to encode non-utf8 symlinks using base64, which is surprisingly simple as go automatically base64-encodes []byte fields. The linktarget field is still always set to maintain backwards compatibility.

Yes, these considerations also apply to the name field, but changing that field would require two new fields (as they are interpreted differently) and result in quite a mess to correctly decode and encode it. And I don't think that's really worth it. Especially as unlike the linktarget field, we won't gain any new functionality. The only thing that changes is that we get rid of the somewhat non-standard filename encoding.

Just using either the old or the new filename encoding would also complicate things a lot, as that would require encoding a Node differently depending on the repository version.

MichaelEischer · 2022-07-13T20:11:35Z

Hmm, the cost-benefit calculation would probably be different if we clean up the name field along with other breaking changes. I'm thinking of adding support for chunked trees to the nodes, which won't be possible in a backwards compatible way. But that could be an opportunity to clean up other stuff.

[Edit]Although even that solution would mean that tree blobs in the current format will eventually have to be rewritten in the future format. So will still have some costs attached to it. Adding support for chunked trees is actually possible in a rather simple way why does not require rewriting trees: if there is only a single chunk, then use the current tree field, otherwise use treechunks.[/Edit]

I'm wondering whether it would be a good idea to just store name and linktarget always as base64 encoded. It won't make much of a difference in the filesize, at least after it's compressed. But it would simplify decoding.

aawsome · 2023-07-09T19:29:55Z

@MichaelEischer May I suggest to also add a name_raw field which is filled if the (encoded) name field doesn't exactly reflect the filename? When created that way, program reading trees can always base64-decode a *_raw field when given and use the * (without _raw) field directly, if not.

Of course that wouldn't work with already-written data or older program versions. But it would provide a migration path towards a clean solution by first making the name_raw field optional, then required and later deprecating name fields which contain encoding.

MichaelEischer · 2023-07-16T16:25:03Z

@aawsome I'd like to discuss the name field separately from linktarget. I've created issue #4408 for that. The main difference in my opinion is that the linktarget modifications add support for non-UTF8 symlink targets, whereas the suggested name changes are purely cosmetic (strconv.Quote is perfectly capable of handling non-UTF8).

Of course that wouldn't work with already-written data or older program versions. But it would provide a migration path towards a clean solution by first making the name_raw field optional, then required and later deprecating name fields which contain encoding.

So your suggestion is to only keep name_raw and linktarget_raw and deprecate both name and linktarget? If we make such a change, I'd like to coordinate it with #4006 and the support for chunked trees (likely won't happen this year) to minimize the resulting disruptions.

aawsome · 2023-07-17T04:10:35Z

(strconv.Quote is perfectly capable of handling non-UTF8).

I actually have two issues with strconv.Quote:

I personally am not able to judge if it is perfectly capable of handling non-UTF8. I tried to find a specification of what it does and wasn't able so far. Even in https://github.com/restic/restic/blob/master/doc/design.rst we are not able to tell, what it exactly does. So it's kind of a black box we are using here. For a backup format this IMO is even worse than an open question about whether it's possible to handle non-UTF.
Its use is easy to oversee as for many common filenames it doesn't change anything. This leads uninformed consumers to just use that field - and maybe make mistakes. E.g. using \"abc\" as filename which in fact was "abc" before getting backup'ed. See quoted file names (non-unicode filenames) juergenhoetzel/pyrrhic#1 for example.

So your suggestion is to only keep name_raw and linktarget_raw and deprecate both name and linktarget?

No. Actually I like the solution you implemented here with linktarget and linktarget_raw very much. So I propose to change the repository tree-format step-by-step to also use name and name_raw in exactly the same way. I.e. deprecate the usage of strconv.Quote on the name field.
(an alternative could be introduce new fields like filename/filename_raw and deprecate name.)

MichaelEischer · 2023-07-22T22:09:48Z

LGTM.

aawsome · 2023-07-23T05:16:05Z

@MichaelEischer Shouldn't this treatment of symlink targets be added to https://github.com/restic/restic/blob/master/doc/design.rst?

aawsome · 2023-07-23T07:23:18Z

Another remark - and sorry if there are errors or things are not fully understandable, I don't use windows and am no expert in its details (just did dig into how Rust handles filenames and it is very pedantic about being correct)...

Actually, I still don't know what's going on on Windows and IMO this also needs documentation. On Windows, filenames are not stored as byte-sequences but as sequences of 16-bit words. This means, we cannot talk about utf8 on windows, at least when we are talking about how the OS stores filenames. I know that Golang does some internal conversion, so all filenames always look at byte-sequences and you can treat them as UTF8 if they are valid unicode. But I don't know how non-unicode filenames (if they are allowed) are converted from 16-bit words to an 8-bit byte sequence. Moreover, I don't even know if there are cases where a valid Windows filename cannot be transformed to a non-UTF8 byte sequence and what Golang would do in such a case. Can anyone provide information about this topic?

Besides this, I also don't know if this linktarget topic anyway applies to Windows or if this topic would only arise when applying it to "standard" filenames....

MichaelEischer · 2023-07-23T10:24:47Z

@MichaelEischer Shouldn't this treatment of symlink targets be added to https://github.com/restic/restic/blob/master/doc/design.rst?

Oh indeed. I'll open a PR later.

On Windows, filenames are not stored as byte-sequences but as sequences of 16-bit words.

The Windows API methods suffixed with a W use UTF-16 which is converted by Go into UTF-8 . On the filesystem NTFS stores all names as Unicode: https://learn.microsoft.com/en-us/windows/win32/intl/character-sets-used-in-file-names .

[Edit]The conversion to UTF-8 isn't optional as otherwise it would become impossible to restore windows backups on linux and vice versa.[/Edit]

aawsome · 2023-07-24T02:53:07Z

The Windows API methods suffixed with a W use UTF-16 which is converted by Go into UTF-8 . On the filesystem NTFS stores all names as Unicode:

Ah great, thanks for the information!
So, on Windows there is no non-unicode topic at all and a conversion into UTF-8 is sufficient. Maybe this is also a good info for the design document?

MichaelEischer · 2023-07-25T18:16:42Z

I've opened #4422

greatroar reviewed Jun 18, 2022

View reviewed changes

internal/restic/node.go Outdated Show resolved Hide resolved

greatroar reviewed Jun 18, 2022

View reviewed changes

internal/restic/node_test.go Outdated Show resolved Hide resolved

greatroar reviewed Jun 18, 2022

View reviewed changes

internal/restic/node.go Outdated Show resolved Hide resolved

aawsome mentioned this pull request Jul 8, 2022

Quoting of filenames in metadata (running restic check on a repo generated by rustic fails) rustic-rs/rustic#34

Closed

MichaelEischer force-pushed the support-non-utf-symlinks branch 2 times, most recently from e698fe7 to 9f121f6 Compare July 13, 2022 16:46

aawsome mentioned this pull request Oct 25, 2022

quoted file names (non-unicode filenames) juergenhoetzel/pyrrhic#1

Open

MichaelEischer force-pushed the support-non-utf-symlinks branch from 9f121f6 to dffa62e Compare July 9, 2023 18:58

MichaelEischer mentioned this pull request Jul 16, 2023

Cleanup support for non-ASCII filenames in tree blobs #4408

Open

MichaelEischer force-pushed the support-non-utf-symlinks branch from dffa62e to 331ff40 Compare July 16, 2023 16:38

aawsome mentioned this pull request Jul 17, 2023

non-unicode link targets are not supported rustic-rs/rustic#117

Closed

MichaelEischer force-pushed the support-non-utf-symlinks branch from 331ff40 to 4e4e57f Compare July 22, 2023 21:59

MichaelEischer added 2 commits July 23, 2023 00:05

Add support for non-utf8 symlink targets

6adb629

restic: check that Node.LinkTarget can handle non-utf8 characters

f12bbd9

MichaelEischer force-pushed the support-non-utf-symlinks branch from 4e4e57f to f12bbd9 Compare July 22, 2023 22:05

MichaelEischer merged commit 98fb56b into restic:master Jul 22, 2023
12 checks passed

MichaelEischer deleted the support-non-utf-symlinks branch July 22, 2023 22:15

aawsome mentioned this pull request Jul 23, 2023

handle non-unicode link-targets rustic-rs/rustic#785

Merged

MichaelEischer mentioned this pull request Jul 25, 2023

Document linktarget_raw field for non utf symlinks #4422

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support non utf symlink targets #3802

Support non utf symlink targets #3802

MichaelEischer commented Jun 17, 2022 •

edited

greatroar commented Jun 18, 2022

aawsome commented Jul 3, 2022 •

edited

fd0 commented Jul 6, 2022 •

edited

aawsome commented Jul 6, 2022 •

edited by fd0

MichaelEischer commented Jul 13, 2022 •

edited by fd0

MichaelEischer commented Jul 13, 2022 •

edited

aawsome commented Jul 9, 2023

MichaelEischer commented Jul 16, 2023

aawsome commented Jul 17, 2023

MichaelEischer commented Jul 22, 2023

aawsome commented Jul 23, 2023

aawsome commented Jul 23, 2023

MichaelEischer commented Jul 23, 2023 •

edited

aawsome commented Jul 24, 2023

MichaelEischer commented Jul 25, 2023

Support non utf symlink targets #3802

Support non utf symlink targets #3802

Conversation

MichaelEischer commented Jun 17, 2022 • edited

What does this PR change? What problem does it solve?

Was the change previously discussed in an issue or on the forum?

Checklist

greatroar commented Jun 18, 2022

aawsome commented Jul 3, 2022 • edited

fd0 commented Jul 6, 2022 • edited

aawsome commented Jul 6, 2022 • edited by fd0

MichaelEischer commented Jul 13, 2022 • edited by fd0

MichaelEischer commented Jul 13, 2022 • edited

aawsome commented Jul 9, 2023

MichaelEischer commented Jul 16, 2023

aawsome commented Jul 17, 2023

MichaelEischer commented Jul 22, 2023

aawsome commented Jul 23, 2023

aawsome commented Jul 23, 2023

MichaelEischer commented Jul 23, 2023 • edited

aawsome commented Jul 24, 2023

MichaelEischer commented Jul 25, 2023

MichaelEischer commented Jun 17, 2022 •

edited

aawsome commented Jul 3, 2022 •

edited

fd0 commented Jul 6, 2022 •

edited

aawsome commented Jul 6, 2022 •

edited by fd0

MichaelEischer commented Jul 13, 2022 •

edited by fd0

MichaelEischer commented Jul 13, 2022 •

edited

MichaelEischer commented Jul 23, 2023 •

edited