lib/fs: Add encoder filesystem (fixes #1734) #7828

rasa · 2021-07-19T05:58:13Z

Superseded by #7876

Purpose

A NewEncoderFilesystem ensures that paths that contain characters that are reserved on certain filesystems (such "*:<>?| as NTFS, exFAT, etc.) can be safety stored. It does this by transparently replacing the reserved characters with UNICODE characters in the private use area (\uf000-\uf07f). This conversion is compatible with Cygwin, Git-Bash, Msys2, Windows Subsystem for Linux (WSL), and other platforms.

Testing

All the tests pass. go vet is clean. go lint is clean. go fmt produces no output.

I have tested it with Linux, Android, macOS and iOS clients pushing changes to a Windows client. Directory listings showed the correct filenames in WSL, Cygwin, Msys2, and git-bash environments. I also created filenames containing reserved characters in WSL, Cygwin, Msys2 and git-bash, and these transferred successfully to Linux, Android, macOS, and iOS clients.

On my Android fuse filesystem, files with <>:"|?* are rejected, but files ending in dot, or space are accepted.

Documentation

If this PR is accepted, I will submit a PR to update the documentation.

Authorship

My name and email are already in the AUTHORS file.

Links

For reference, see:
https://cygwin.com/cygwin-ug-net/using-specialnames.html
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
https://en.wikipedia.org/wiki/Filename#In_Windows
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

For implementations, see
https://github.com/mirror/newlib-cygwin/blob/fb01286fab9b370c86323f84a46285cfbebfe4ff/winsup/cygwin/path.cc#L435
https://github.com/billziss-gh/winfsp/blob/6e3a8f70b2bd958960012447544d492fc6a2f1af/src/shared/ku/posix.c#L1250

tomasz1986 · 2021-07-19T14:19:38Z

I have a few questions regarding the behaviour of such files in Windows. I have not tested the actual PR, so some of them may be a non-issue.

What happens if one changes such a filename with Unicode characters in Windows later on? Is the file going to be renamed/moved/deleted on the other side with no problems? What happens if you copy or move the file on the Windows side to a different folder?
Are Windows CMD and PowerShell able to operate on such files with Unicode characters with no issues?
I assume that the testing was done using NTFS. Are there no issues when the files are located on (ex)FAT and ReFS partitions?

rasa · 2021-07-19T15:49:13Z

@tomasz1986 All good questions. Thanks for asking.

What happens if one changes such a filename with Unicode characters in Windows later on? Is the file going to be renamed/moved/deleted on the other side with no problems? What happens if you copy or move the file on the Windows side to a different folder?

I renamed to and from filenames with reserved characters on each platform, and saw no problems, other than on my Android client, which allows for spaces and dots at the end of a names, but doesn't like the reserved characters, for some reason, on its root file system. Moving files works as expected, with the above proviso. Deleting works as expected.

Are Windows CMD and PowerShell able to operate on such files with Unicode characters with no issues?

Yes. The only issue is that the 0xf0xx characters display funny in CMD, and File Explorer. Of course, if you installed a font that populated these 0xf0xx characters with glyphs, this display issue would be solved.

I assume that the testing was done using NTFS. Are there no issues when the files are located on (ex)FAT and ReFS partitions?

Yes, I've only tested on NTFS so far. According to https://en.m.wikipedia.org/wiki/Comparison_of_file_systems#Limits this should work on ReFS. It's not clear if it will work on exFAT, FAT32, or other filesystems. I will test exFAT and FAT32 (on Windows) and report back.

If it does work on exFAT, it would be trivial to enable this fix on any platform that can mount exFAT partitions.

BTW, I haven't tested on macOS, but will soon.

AudriusButkevicius · 2021-07-19T16:11:24Z

What happens if someone syncs foo and foo<unicode character for space>?

I suspect this will cause data loss, hence any form of translation is lossy.

Also, I'd expect (have not gone through the code in detail) that this causes ping pong, namely, I sync foo , and then after next scan that file disappears and a new file appears with unicode characters which then ends up a deletion and a new file on other devices?

rasa · 2021-07-19T17:34:22Z

@AudriusButkevicius :

What happens if someone syncs foo and foo<unicode character for space>?

That would be an issue, if <unicode char for space>, in this case 0xf020, was actually used in practice, but since it's in a Private Use Area, it's actually designed to be used in this (private) way. Note that this "fix" was originally implemented in Interix over 25 years ago.

I suspect this will cause data loss, hence any form of translation is lossy.

You're right. This PR should be fixed to flag received files containing 0xf0xx characters as invalid, ~~if the "Auto Fix Invalid Files" option has been set~~. Otherwise, chaos will ensue.

This could easily happen if someone turns on the "Auto Fix Invalid Files" switch, and then turns it off after syncing files. These 0xf0xx characters would then be synced as is, causing duplicate files to appear, but no data loss that I can see (as they really are the same file, in a sense).

Also, I'd expect (have not gone through the code in detail) that this causes ping pong, namely, I sync foo , and then after next scan that file disappears and a new file appears with unicode characters which then ends up a deletion and a new file on other devices?

I haven't seen that yet, but it's certainly possible. I will report any issues found.

For now, let's consider this PR a draft, until I test on exFAT, FAT32, macOS, and address syncing files with 0xf0xx characters issue noted above. Thanks for the feedback.

calmh · 2021-07-19T19:30:46Z

This is an interesting PR. It has some promise. It seems like names with private-space characters must be rejected on the disk side when this option is disabled (scanning error), and similarly rejected on the network side when the option enabled (as other invalid file names are today), as these can't be represented on the local filesystem any more.

This might mean we can no longer sync files with such names in a regular setup, something that might annoy existing users.

issue is that the 0xf0xx characters display funny in CMD, and File Explorer.

This might in fact be to everyone's advantage, as the user sees it, thinks "what the hell?" and fixes the filename to something portable. :) The display might be even worse if the Windows box isn't running UTF-8 at all.

(I also think the naming leaves something to be desired, but that can be discussed once it's technically sound.)

rasa · 2021-07-19T20:55:25Z

OK, so I tested it, and it works fine in Windows on exFAT and FAT32 partitions. It also works fine on macOS, on a APFS filesystem. The only issue (other than my Android not creating files with "*:<>?|) are files created with backslashes on Linux or Mac, do not migrate anywhere, but this has nothing to do with this PR.

I will now update this PR to flag received files that contain 0xf0xx characters as invalid. We should decide if we'll want to allow users to use the Private Use characters 0xf000-0xf07f for some other use. If so, we could add another config option to enable this (which would toggle off the "Auto Fix Invalid Files" config option.)

imsodin

Why do we need to reject items with private unicode chars when scanning without this option enabled? If it shares such items with a device using this option, that other device will just reject them.

lib/fs/fixfs.go

Per syncthing#7828 (comment)

rasa · 2021-07-22T14:32:17Z

@imsodin :

Why do we need to reject items with private unicode chars when scanning without this option enabled? If it shares such items with a device using this option, that other device will just reject them.

The problem comes when a user enables this option, shares some files, and then disables the option. Here's an example:

User enables the option on device A, as the underlying filesystem is exFAT
Device B sends device A a file named "B>A"
Device A saves the file as "B\uf003eA"
User disables the option on device A
Device A now sends the file "B\uf003eA" to device B
Device B is an older client, so it happily saves the file "B\uf003eA"
Device B now has both "B>A" and "B\uf003eA"

The quick fix is for device A to simply ignore all files containing \uf00xx characters when the option is disabled. This is effectively the old functionality, where the filesystem simply failed writing "B>A", leading to an out of sync error.

Of course, the duplicate file issue can still appear if the user downgrades the client on device A.

rasa · 2021-07-25T18:33:14Z

After seeing how the different filesystems handle filenames, how does this sound as a plan:

encodefs.go: defines EncodeFilesystem struct
encodefs_default.go: DefaultEncodeFilesystem: disallows \x00 and /: ext2, ext3, ext4, etc.
encodefs_windows_.go: WindowsEncodeFilesystem: disallows \x00, /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, encodes trailing dots and spaces: NTFS, exFAT, FAT32, reFS, etc. Contrary to here, \x7f is allowed. Also, encoding reserved names, NUL, CON, etc, is not required, as we always preface the path with \\?\. See here
encodefs_android_.go: AndroidEncodeFilesystem: disallows \x00 and /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, allows trailing dots and spaces: fuse on Android
encodefs_ios_.go: IOSEncodeFilesystem: disallows \x00, encodes / (?), \ (?) and :, encodes leading dots: mobiusSync on iOS
encodefs_plan9_.go : Plan9EncodeFilesystem: disallows \x00 and /, encodes \x01-\x1f, \x80-\x9f
See https://9fans.github.io/plan9port/man/man9/intro.html
encodefs_safe.go: SafeEncodeFilesystem: most of the above: disallows \x00 and /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, 0x7f-\x9f, encodes trailing dots and spaces, encodes Windows reserved names (NUL, CON, etc)
encodefs_custom.go: CustomEncodeFilesystem: user defined, for example, a "shell friendly" encoding would be: disallows \x00 and /, encodes \x00-\x1f, \x7f, backslash \, quotes ', double quotes ", leading dashes - and spaces, and trailing spaces

In the UI, we show:

Encoding system for reserved characters and filenames:

And an input field that accepts:

default
auto
windows
android
ios
plan9
safe
custom

The default option would be default as that is the current functionality, or possibly auto if it's reliable. Perhaps default should be named none, or disabled.

The auto option would use a combination of GOOS and gopsutil's PartitionStat.Fstype to determine the correct encoding type.

The custom option would need a few more input fields. Seems like overkill, but if a user is on some rare filesystem, this could help.

A NewEncoderFilesystem ensures that paths that contain characters that are reserved on filesystems such as NTFS can be safety stored. It does this by replacing the reserved characters with UNICODE characters in the private use area (\uf000-\uf07f). This conversion is compatible with Cygwin, Git-Bash, Msys2, Windows Subsystem for Linux (WSL), and other platforms. For reference, see: https://cygwin.com/cygwin-ug-net/using-specialnames.html http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx https://en.wikipedia.org/wiki/Filename#In_Windows https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file For implementations, see https://github.com/mirror/newlib-cygwin/blob/fb01286fab9b370c86323f84a46285cfbebfe4ff/winsup/cygwin/path.cc#L435 https://github.com/billziss-gh/winfsp/blob/6e3a8f70b2bd958960012447544d492fc6a2f1af/src/shared/ku/posix.c#L1250 Signed-off-by: Ross Smith II <ross@smithii.com>

rasa · 2021-07-27T17:37:23Z

I would mark this PR as draft, if I could, as it needs a lot more testing, though it would be safe to deploy as it defaults to the current functionality, unless the user changes an advanced setting.

The code works running on Windows, and syncing to and fro with Windows, Liunx, Android, MacOS, and iOS clients. But I am seeing errors when I restart Syncthing. It seems some code paths use a BasicFilesystem, instead of a EncoderFilesystem. I will delve into this, and see if I can pinpoint the issues.

Also, I need to create tests for the android, plan9, ios, and safe encoders. I also need to rewrite the windows encoder test to pass on non-Windows systems, as users can mount exFAT partitions on any OS.

And I need to test it on other OSs, and especially the auto encoder functionality.

All feedback is appreciated.

imsodin · 2021-07-27T19:46:04Z

Wow, you are really moving ahead quickly :)

That's also why

All feedback is appreciated.

is lagging behind. There's a lot to unpack in the implementation and while I think I get it conceptionally it still needs thinking about the consequences of the behaviour. It adds complexity both to the implementation and user scenarios. Currently things are simple: We do whatever works, and have a few conditions where we say "this doesn't work" if the result isn't desirable (mostly windows being windows :) ). I do think the mechanism quite nicely compartmentalizes all the complexity.

My recommendation: Don't spend too much time on polish and testing all the corner cases until "concept review" happened, i.e. there's a general consensus that the way chosen here is good to go.

See https://deepsource.io/gh/syncthing/syncthing/run/7eec2796-6846-4cdf-9fa7-ab7d7740e22e/go/CRT-A0004 and https://deepsource.io/gh/syncthing/syncthing/run/7eec2796-6846-4cdf-9fa7-ab7d7740e22e/go/SCC-U1000

See https://deepsource.io/gh/syncthing/syncthing/run/9e4454f6-c260-48eb-99f0-2d5624d0c413/go/SCC-U1000

…o fs/add-fix-fs

AudriusButkevicius · 2021-08-02T19:31:04Z

I think this is slowly getting out of hand, because something that was a few tens of lines has now got to nearly 2k lines.

As suggested, you should wait for feedback before investing more time, just in case you are investing your time into something that is going the wrong way.

calmh · 2021-08-03T07:11:46Z

I have significant concerns with this. It has expanded somewhat in scope since I looked. Some current thoughts, not necessarily things that are broken with the current approach but that need addressing one way or the other at some point.

What happens when the functionality is turned on or off on an existing setup in various scenarios. (Disasters aren't OK, here.)
- In relation to the above, I see that there is now auto detection. I'm not so sure we should have that.
There are a bunch of different encoders for different filesystem types now.
- It's not clear to me if that takes into account any differences that may be due to the filesystem running on an unexpected OS? (NTFS on Linux; HFS+ on Windows; etc). This might be a non-issue but it needs double checking. (People dual boot.)
- As above, auto detection may or may not be a great thing.
- What happens if I copy a file converted by the Windows variant to my Plan9 workstation and it gets picked up there? Some of the characters will be interpreted and others won't?
Various leakage... The EncoderFS takes care of paths in the outgoing direction and converts in the incoming direction for readdir and such. But:
- Are filesystem notifications captured and converted somewhere?
- What about our API, where filenames now don't match what's on disk any more? Does this confuse any of the external wrappers or tools in bad ways? Are we sure? If I'm a desktop integration thing I can no longer just look at a file on disk and ask for a rescan of it via the API? Do we want to have events talk about paths that don't actually exist on disk?

So there's a lot to unpack here. Given that this affects interoperability between systems in various ways the testing matrix quickly becomes an n-dimensional beast that we'll never be sure of.

I think for this to have a chance to live it needs to be short, sweet, and unambiguously correct in all the simple cases we can dream up. That means starting with the simplest thing that could possibly work. Lots of different encoders and autodetection and stuff is probably the antithesis of that.

rasa · 2021-08-03T14:55:54Z

Thank you all for the valuable feedback. Let's close this PR, as it was always just a proof-of-concept/draft PR. If I can develop something that actually works, that is worthy of review, and that incorporates the feedback above, I will create another PR.

To respond to @calmh's feedback:

What happens when the functionality is turned on or off on an existing setup in various scenarios.

The system where the functionality was turned off will then share the encoded files, and receiving systems that have more accommodating filesystems, such as ext4, will see duplicate files. This can be addressed by having the basicFS reject encoded files, which is effectively how things work now: files that can't be stored on the underlying filesystem are simply rejected.

It's not clear to me if that takes into account any differences that may be due to the filesystem running on an unexpected OS? (NTFS on Linux; HFS+ on Windows; etc).

The various encoders address the limitations imposed by the filesystem and the OS. For example, On the Android OS, on a exFAT/FAT32 filesystem, files with *?"<>| are rejected, but files ending with periods or spaces are allowed. On Windows, on exFAT/FAT32, the UI doesn't allow files ending with periods or spaces (even though the system allows them to be created using the \\?\ trick, as Syncthing already does).

What happens if I copy a file converted by the Windows variant to my Plan9 workstation and it gets picked up there? Some of the characters will be interpreted and others won't?

That won't happen, as the filename decoder is shared across all encoders: rune &^= 0xf000. See here.

Are filesystem notifications captured and converted somewhere?

Sorry, I can't answer that. See next comment.

What about our API, where filenames now don't match what's on disk any more? Does this confuse any of the external wrappers or tools in bad ways? Are we sure? If I'm a desktop integration thing I can no longer just look at a file on disk and ask for a rescan of it via the API? Do we want to have events talk about paths that don't actually exist on disk?

That's a good question. Forgive me for asking a follow up: If a user turns on encryption, will an external tool see the original unencoded filename? If not, then does it make sense for external wrappers/tools to be looking at our file store? If so, how is file-content encryption any different than file-name encoding? In a sense, this PR simply "encrypts" the filenames to hide their names from legacy-hobbled OSes, using an industry standard filename-encoding scheme (Microsoft in WSL, etc).

Given that this affects interoperability between systems in various ways the testing matrix quickly becomes an n-dimensional beast that we'll never be sure of.

Wow, that certainly sounds ominous. In response, I would say that Interix solved the interoperability issue of Windows failing to host Linux files 25+ years ago using this encoding hack, so it's a tried and tested solution, that numerous systems have since implemented, such as WSL. Implementing this logic in Syncthing will simply add it to the growing list of environments that support this hack.

I think for this to have a chance to live it needs to be short, sweet, and unambiguously correct in all the simple cases we can dream up. That means starting with the simplest thing that could possibly work. Lots of different encoders and autodetection and stuff is probably the antithesis of that.

Ok, stay tuned. Thanks to all again for your critical, yet non-discouraging, feedback.

rasa · 2021-08-08T13:58:09Z

See #7876

rasa force-pushed the fs/add-fix-fs branch from a9da7b2 to 3d8fbc6 Compare July 19, 2021 06:16

imsodin reviewed Jul 21, 2021

View reviewed changes

lib/fs/fixfs.go Outdated Show resolved Hide resolved

imsodin reviewed Jul 22, 2021

View reviewed changes

lib/fs/fixfs.go Outdated Show resolved Hide resolved

rasa added a commit to rasa/syncthing that referenced this pull request Jul 22, 2021

Unfix Stat's FileInfo result

458cdfc

Per syncthing#7828 (comment)

rasa force-pushed the fs/add-fix-fs branch from 458cdfc to 7425180 Compare July 27, 2021 05:31

rasa changed the title ~~lib/fs: Add fixing filesystem (fixes #1734)~~ lib/fs: Add encoder filesystem (fixes #1734) Jul 27, 2021

rasa added 4 commits July 26, 2021 23:28

Tidy up commands, standardize const chars

680f744

Rework decodedFileInfo to avoid panic

14f27d7

Fix encode tests on non-Windows systems

75bf853

Lowercase fstype in auto encoder

9d3a6cb

rasa mentioned this pull request Jul 27, 2021

Bug: Receive encrypted on Windows can have illegal file names. #7808

Open

rasa added 2 commits July 27, 2021 11:07

Display the correct encoder name

70024ca

go fmt encodefs.go

e6e6d1c

rasa added 6 commits July 27, 2021 13:00

go fmt encodefs.go (again)

0ff9e88

Quiet CRT-A0004 and SCC-U1000

5b87a66

See https://deepsource.io/gh/syncthing/syncthing/run/7eec2796-6846-4cdf-9fa7-ab7d7740e22e/go/CRT-A0004 and https://deepsource.io/gh/syncthing/syncthing/run/7eec2796-6846-4cdf-9fa7-ab7d7740e22e/go/SCC-U1000

Reserve "*:<>?| in Android encoder

3984cfa

Remove duplicate line

7d51052

Quiet SCC-U1000

ce75762

See https://deepsource.io/gh/syncthing/syncthing/run/9e4454f6-c260-48eb-99f0-2d5624d0c413/go/SCC-U1000

Increase mapping to allow for plan9 & safe

0615d89

Fat32 has unicode

1f15710

rasa mentioned this pull request Jul 30, 2021

name is invalid, must not start with period on iOS. MobiusSync/MobiusSync#26

Closed

rasa added 3 commits July 31, 2021 07:46

Add Glob() to encoderfs

3a63151

Merge branch 'fs/add-fix-fs' of https://github.com/rasa/syncthing int…

fade341

…o fs/add-fix-fs

Only fixup FileInfo if non-nil

5bd13b4

rasa closed this Aug 3, 2021

rasa deleted the fs/add-fix-fs branch February 22, 2022 14:59

st-review added the frozen-due-to-age Issues closed and untouched for a long time, together with being locked for discussion label Mar 10, 2023

syncthing locked and limited conversation to collaborators Mar 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib/fs: Add encoder filesystem (fixes #1734) #7828

lib/fs: Add encoder filesystem (fixes #1734) #7828

rasa commented Jul 19, 2021 •

edited

tomasz1986 commented Jul 19, 2021

rasa commented Jul 19, 2021

AudriusButkevicius commented Jul 19, 2021 •

edited

rasa commented Jul 19, 2021 •

edited

calmh commented Jul 19, 2021 •

edited

rasa commented Jul 19, 2021 •

edited

imsodin left a comment

rasa commented Jul 22, 2021 •

edited

rasa commented Jul 25, 2021 •

edited

rasa commented Jul 27, 2021 •

edited

imsodin commented Jul 27, 2021

AudriusButkevicius commented Aug 2, 2021

calmh commented Aug 3, 2021 •

edited

rasa commented Aug 3, 2021

rasa commented Aug 8, 2021

lib/fs: Add encoder filesystem (fixes #1734) #7828

lib/fs: Add encoder filesystem (fixes #1734) #7828

Conversation

rasa commented Jul 19, 2021 • edited

Purpose

Testing

Documentation

Authorship

Links

tomasz1986 commented Jul 19, 2021

rasa commented Jul 19, 2021

AudriusButkevicius commented Jul 19, 2021 • edited

rasa commented Jul 19, 2021 • edited

calmh commented Jul 19, 2021 • edited

rasa commented Jul 19, 2021 • edited

imsodin left a comment

Choose a reason for hiding this comment

rasa commented Jul 22, 2021 • edited

rasa commented Jul 25, 2021 • edited

rasa commented Jul 27, 2021 • edited

imsodin commented Jul 27, 2021

AudriusButkevicius commented Aug 2, 2021

calmh commented Aug 3, 2021 • edited

rasa commented Aug 3, 2021

rasa commented Aug 8, 2021

rasa commented Jul 19, 2021 •

edited

AudriusButkevicius commented Jul 19, 2021 •

edited

rasa commented Jul 19, 2021 •

edited

calmh commented Jul 19, 2021 •

edited

rasa commented Jul 19, 2021 •

edited

rasa commented Jul 22, 2021 •

edited

rasa commented Jul 25, 2021 •

edited

rasa commented Jul 27, 2021 •

edited

calmh commented Aug 3, 2021 •

edited