Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/fs: Add encoder filesystem (fixes #1734) #7828

Closed
wants to merge 17 commits into from

Conversation

rasa
Copy link
Member

@rasa rasa commented Jul 19, 2021

Superseded by #7876


Purpose

A NewEncoderFilesystem ensures that paths that contain characters that are reserved on certain filesystems (such "*:<>?| as NTFS, exFAT, etc.) can be safety stored. It does this by transparently replacing the reserved characters with UNICODE characters in the private use area (\uf000-\uf07f). This conversion is compatible with Cygwin, Git-Bash, Msys2, Windows Subsystem for Linux (WSL), and other platforms.

Testing

All the tests pass. go vet is clean. go lint is clean. go fmt produces no output.

I have tested it with Linux, Android, macOS and iOS clients pushing changes to a Windows client. Directory listings showed the correct filenames in WSL, Cygwin, Msys2, and git-bash environments. I also created filenames containing reserved characters in WSL, Cygwin, Msys2 and git-bash, and these transferred successfully to Linux, Android, macOS, and iOS clients.

On my Android fuse filesystem, files with <>:"|?* are rejected, but files ending in dot, or space are accepted.

Documentation

If this PR is accepted, I will submit a PR to update the documentation.

Authorship

My name and email are already in the AUTHORS file.

Links

For reference, see:
https://cygwin.com/cygwin-ug-net/using-specialnames.html
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
https://en.wikipedia.org/wiki/Filename#In_Windows
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

For implementations, see
https://github.com/mirror/newlib-cygwin/blob/fb01286fab9b370c86323f84a46285cfbebfe4ff/winsup/cygwin/path.cc#L435
https://github.com/billziss-gh/winfsp/blob/6e3a8f70b2bd958960012447544d492fc6a2f1af/src/shared/ku/posix.c#L1250

@tomasz1986
Copy link
Contributor

I have a few questions regarding the behaviour of such files in Windows. I have not tested the actual PR, so some of them may be a non-issue.

  1. What happens if one changes such a filename with Unicode characters in Windows later on? Is the file going to be renamed/moved/deleted on the other side with no problems? What happens if you copy or move the file on the Windows side to a different folder?

  2. Are Windows CMD and PowerShell able to operate on such files with Unicode characters with no issues?

  3. I assume that the testing was done using NTFS. Are there no issues when the files are located on (ex)FAT and ReFS partitions?

@rasa
Copy link
Member Author

rasa commented Jul 19, 2021

@tomasz1986 All good questions. Thanks for asking.

  1. What happens if one changes such a filename with Unicode characters in Windows later on? Is the file going to be renamed/moved/deleted on the other side with no problems? What happens if you copy or move the file on the Windows side to a different folder?

I renamed to and from filenames with reserved characters on each platform, and saw no problems, other than on my Android client, which allows for spaces and dots at the end of a names, but doesn't like the reserved characters, for some reason, on its root file system. Moving files works as expected, with the above proviso. Deleting works as expected.

  1. Are Windows CMD and PowerShell able to operate on such files with Unicode characters with no issues?

Yes. The only issue is that the 0xf0xx characters display funny in CMD, and File Explorer. Of course, if you installed a font that populated these 0xf0xx characters with glyphs, this display issue would be solved.

  1. I assume that the testing was done using NTFS. Are there no issues when the files are located on (ex)FAT and ReFS partitions?

Yes, I've only tested on NTFS so far. According to https://en.m.wikipedia.org/wiki/Comparison_of_file_systems#Limits this should work on ReFS. It's not clear if it will work on exFAT, FAT32, or other filesystems. I will test exFAT and FAT32 (on Windows) and report back.

If it does work on exFAT, it would be trivial to enable this fix on any platform that can mount exFAT partitions.

BTW, I haven't tested on macOS, but will soon.

@AudriusButkevicius
Copy link
Member

AudriusButkevicius commented Jul 19, 2021

What happens if someone syncs foo and foo<unicode character for space>?

I suspect this will cause data loss, hence any form of translation is lossy.

Also, I'd expect (have not gone through the code in detail) that this causes ping pong, namely, I sync foo , and then after next scan that file disappears and a new file appears with unicode characters which then ends up a deletion and a new file on other devices?

@rasa
Copy link
Member Author

rasa commented Jul 19, 2021

@AudriusButkevicius :

What happens if someone syncs foo and foo<unicode character for space>?

That would be an issue, if <unicode char for space>, in this case 0xf020, was actually used in practice, but since it's in a Private Use Area, it's actually designed to be used in this (private) way. Note that this "fix" was originally implemented in Interix over 25 years ago.

I suspect this will cause data loss, hence any form of translation is lossy.

You're right. This PR should be fixed to flag received files containing 0xf0xx characters as invalid, if the "Auto Fix Invalid Files" option has been set. Otherwise, chaos will ensue.

This could easily happen if someone turns on the "Auto Fix Invalid Files" switch, and then turns it off after syncing files. These 0xf0xx characters would then be synced as is, causing duplicate files to appear, but no data loss that I can see (as they really are the same file, in a sense).

Also, I'd expect (have not gone through the code in detail) that this causes ping pong, namely, I sync foo , and then after next scan that file disappears and a new file appears with unicode characters which then ends up a deletion and a new file on other devices?

I haven't seen that yet, but it's certainly possible. I will report any issues found.

For now, let's consider this PR a draft, until I test on exFAT, FAT32, macOS, and address syncing files with 0xf0xx characters issue noted above. Thanks for the feedback.

@calmh
Copy link
Member

calmh commented Jul 19, 2021

This is an interesting PR. It has some promise. It seems like names with private-space characters must be rejected on the disk side when this option is disabled (scanning error), and similarly rejected on the network side when the option enabled (as other invalid file names are today), as these can't be represented on the local filesystem any more.

This might mean we can no longer sync files with such names in a regular setup, something that might annoy existing users.

issue is that the 0xf0xx characters display funny in CMD, and File Explorer.

This might in fact be to everyone's advantage, as the user sees it, thinks "what the hell?" and fixes the filename to something portable. :) The display might be even worse if the Windows box isn't running UTF-8 at all.

(I also think the naming leaves something to be desired, but that can be discussed once it's technically sound.)

@rasa
Copy link
Member Author

rasa commented Jul 19, 2021

OK, so I tested it, and it works fine in Windows on exFAT and FAT32 partitions. It also works fine on macOS, on a APFS filesystem. The only issue (other than my Android not creating files with "*:<>?|) are files created with backslashes on Linux or Mac, do not migrate anywhere, but this has nothing to do with this PR.

I will now update this PR to flag received files that contain 0xf0xx characters as invalid. We should decide if we'll want to allow users to use the Private Use characters 0xf000-0xf07f for some other use. If so, we could add another config option to enable this (which would toggle off the "Auto Fix Invalid Files" config option.)

Copy link
Member

@imsodin imsodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to reject items with private unicode chars when scanning without this option enabled? If it shares such items with a device using this option, that other device will just reject them.

lib/fs/fixfs.go Outdated Show resolved Hide resolved
lib/fs/fixfs.go Outdated Show resolved Hide resolved
rasa added a commit to rasa/syncthing that referenced this pull request Jul 22, 2021
@rasa
Copy link
Member Author

rasa commented Jul 22, 2021

@imsodin :

Why do we need to reject items with private unicode chars when scanning without this option enabled? If it shares such items with a device using this option, that other device will just reject them.

The problem comes when a user enables this option, shares some files, and then disables the option. Here's an example:

  1. User enables the option on device A, as the underlying filesystem is exFAT
  2. Device B sends device A a file named "B>A"
  3. Device A saves the file as "B\uf003eA"
  4. User disables the option on device A
  5. Device A now sends the file "B\uf003eA" to device B
  6. Device B is an older client, so it happily saves the file "B\uf003eA"
  7. Device B now has both "B>A" and "B\uf003eA"

The quick fix is for device A to simply ignore all files containing \uf00xx characters when the option is disabled. This is effectively the old functionality, where the filesystem simply failed writing "B>A", leading to an out of sync error.

Of course, the duplicate file issue can still appear if the user downgrades the client on device A.

@rasa
Copy link
Member Author

rasa commented Jul 25, 2021

After seeing how the different filesystems handle filenames, how does this sound as a plan:

  1. encodefs.go: defines EncodeFilesystem struct
  2. encodefs_default.go: DefaultEncodeFilesystem: disallows \x00 and /: ext2, ext3, ext4, etc.
  3. encodefs_windows_.go: WindowsEncodeFilesystem: disallows \x00, /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, encodes trailing dots and spaces: NTFS, exFAT, FAT32, reFS, etc. Contrary to here, \x7f is allowed. Also, encoding reserved names, NUL, CON, etc, is not required, as we always preface the path with \\?\. See here
  4. encodefs_android_.go: AndroidEncodeFilesystem: disallows \x00 and /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, allows trailing dots and spaces: fuse on Android
  5. encodefs_ios_.go: IOSEncodeFilesystem: disallows \x00, encodes / (?), \ (?) and :, encodes leading dots: mobiusSync on iOS
  6. encodefs_plan9_.go : Plan9EncodeFilesystem: disallows \x00 and /, encodes \x01-\x1f, \x80-\x9f
    See https://9fans.github.io/plan9port/man/man9/intro.html
  7. encodefs_safe.go: SafeEncodeFilesystem: most of the above: disallows \x00 and /, encodes \x01-\x1f, <, >, :, ", |, ?, *, \, 0x7f-\x9f, encodes trailing dots and spaces, encodes Windows reserved names (NUL, CON, etc)
  8. encodefs_custom.go: CustomEncodeFilesystem: user defined, for example, a "shell friendly" encoding would be: disallows \x00 and /, encodes \x00-\x1f, \x7f, backslash \, quotes ', double quotes ", leading dashes - and spaces, and trailing spaces

In the UI, we show:

Encoding system for reserved characters and filenames:

And an input field that accepts:

default
auto
windows
android
ios
plan9
safe
custom

The default option would be default as that is the current functionality, or possibly auto if it's reliable. Perhaps default should be named none, or disabled.

The auto option would use a combination of GOOS and gopsutil's PartitionStat.Fstype to determine the correct encoding type.

The custom option would need a few more input fields. Seems like overkill, but if a user is on some rare filesystem, this could help.

A NewEncoderFilesystem ensures that paths that
contain characters that are reserved on
filesystems such as NTFS can be safety stored.
It does this by replacing the reserved
characters with UNICODE characters in
the private use area (\uf000-\uf07f). This
conversion is compatible with Cygwin, Git-Bash,
Msys2, Windows Subsystem for Linux (WSL), and
other platforms.

For reference, see:
https://cygwin.com/cygwin-ug-net/using-specialnames.html
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx
https://en.wikipedia.org/wiki/Filename#In_Windows
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

For implementations, see
https://github.com/mirror/newlib-cygwin/blob/fb01286fab9b370c86323f84a46285cfbebfe4ff/winsup/cygwin/path.cc#L435
https://github.com/billziss-gh/winfsp/blob/6e3a8f70b2bd958960012447544d492fc6a2f1af/src/shared/ku/posix.c#L1250

Signed-off-by: Ross Smith II <ross@smithii.com>
@rasa rasa changed the title lib/fs: Add fixing filesystem (fixes #1734) lib/fs: Add encoder filesystem (fixes #1734) Jul 27, 2021
@rasa
Copy link
Member Author

rasa commented Jul 27, 2021

I would mark this PR as draft, if I could, as it needs a lot more testing, though it would be safe to deploy as it defaults to the current functionality, unless the user changes an advanced setting.

The code works running on Windows, and syncing to and fro with Windows, Liunx, Android, MacOS, and iOS clients. But I am seeing errors when I restart Syncthing. It seems some code paths use a BasicFilesystem, instead of a EncoderFilesystem. I will delve into this, and see if I can pinpoint the issues.

Also, I need to create tests for the android, plan9, ios, and safe encoders. I also need to rewrite the windows encoder test to pass on non-Windows systems, as users can mount exFAT partitions on any OS.

And I need to test it on other OSs, and especially the auto encoder functionality.

All feedback is appreciated.

@imsodin
Copy link
Member

imsodin commented Jul 27, 2021

Wow, you are really moving ahead quickly :)

That's also why

All feedback is appreciated.

is lagging behind. There's a lot to unpack in the implementation and while I think I get it conceptionally it still needs thinking about the consequences of the behaviour. It adds complexity both to the implementation and user scenarios. Currently things are simple: We do whatever works, and have a few conditions where we say "this doesn't work" if the result isn't desirable (mostly windows being windows :) ). I do think the mechanism quite nicely compartmentalizes all the complexity.

My recommendation: Don't spend too much time on polish and testing all the corner cases until "concept review" happened, i.e. there's a general consensus that the way chosen here is good to go.

@AudriusButkevicius
Copy link
Member

I think this is slowly getting out of hand, because something that was a few tens of lines has now got to nearly 2k lines.

As suggested, you should wait for feedback before investing more time, just in case you are investing your time into something that is going the wrong way.

@calmh
Copy link
Member

calmh commented Aug 3, 2021

I have significant concerns with this. It has expanded somewhat in scope since I looked. Some current thoughts, not necessarily things that are broken with the current approach but that need addressing one way or the other at some point.

  • What happens when the functionality is turned on or off on an existing setup in various scenarios. (Disasters aren't OK, here.)
    • In relation to the above, I see that there is now auto detection. I'm not so sure we should have that.
  • There are a bunch of different encoders for different filesystem types now.
    • It's not clear to me if that takes into account any differences that may be due to the filesystem running on an unexpected OS? (NTFS on Linux; HFS+ on Windows; etc). This might be a non-issue but it needs double checking. (People dual boot.)
    • As above, auto detection may or may not be a great thing.
    • What happens if I copy a file converted by the Windows variant to my Plan9 workstation and it gets picked up there? Some of the characters will be interpreted and others won't?
  • Various leakage... The EncoderFS takes care of paths in the outgoing direction and converts in the incoming direction for readdir and such. But:
    • Are filesystem notifications captured and converted somewhere?
    • What about our API, where filenames now don't match what's on disk any more? Does this confuse any of the external wrappers or tools in bad ways? Are we sure? If I'm a desktop integration thing I can no longer just look at a file on disk and ask for a rescan of it via the API? Do we want to have events talk about paths that don't actually exist on disk?

So there's a lot to unpack here. Given that this affects interoperability between systems in various ways the testing matrix quickly becomes an n-dimensional beast that we'll never be sure of.

I think for this to have a chance to live it needs to be short, sweet, and unambiguously correct in all the simple cases we can dream up. That means starting with the simplest thing that could possibly work. Lots of different encoders and autodetection and stuff is probably the antithesis of that.

@rasa
Copy link
Member Author

rasa commented Aug 3, 2021

Thank you all for the valuable feedback. Let's close this PR, as it was always just a proof-of-concept/draft PR. If I can develop something that actually works, that is worthy of review, and that incorporates the feedback above, I will create another PR.

To respond to @calmh's feedback:

  • What happens when the functionality is turned on or off on an existing setup in various scenarios.

The system where the functionality was turned off will then share the encoded files, and receiving systems that have more accommodating filesystems, such as ext4, will see duplicate files. This can be addressed by having the basicFS reject encoded files, which is effectively how things work now: files that can't be stored on the underlying filesystem are simply rejected.

  • It's not clear to me if that takes into account any differences that may be due to the filesystem running on an unexpected OS? (NTFS on Linux; HFS+ on Windows; etc).

The various encoders address the limitations imposed by the filesystem and the OS. For example, On the Android OS, on a exFAT/FAT32 filesystem, files with *?"<>| are rejected, but files ending with periods or spaces are allowed. On Windows, on exFAT/FAT32, the UI doesn't allow files ending with periods or spaces (even though the system allows them to be created using the \\?\ trick, as Syncthing already does).

  • What happens if I copy a file converted by the Windows variant to my Plan9 workstation and it gets picked up there? Some of the characters will be interpreted and others won't?

That won't happen, as the filename decoder is shared across all encoders: rune &^= 0xf000. See here.

  • Are filesystem notifications captured and converted somewhere?

Sorry, I can't answer that. See next comment.

  • What about our API, where filenames now don't match what's on disk any more? Does this confuse any of the external wrappers or tools in bad ways? Are we sure? If I'm a desktop integration thing I can no longer just look at a file on disk and ask for a rescan of it via the API? Do we want to have events talk about paths that don't actually exist on disk?

That's a good question. Forgive me for asking a follow up: If a user turns on encryption, will an external tool see the original unencoded filename? If not, then does it make sense for external wrappers/tools to be looking at our file store? If so, how is file-content encryption any different than file-name encoding? In a sense, this PR simply "encrypts" the filenames to hide their names from legacy-hobbled OSes, using an industry standard filename-encoding scheme (Microsoft in WSL, etc).

Given that this affects interoperability between systems in various ways the testing matrix quickly becomes an n-dimensional beast that we'll never be sure of.

Wow, that certainly sounds ominous. In response, I would say that Interix solved the interoperability issue of Windows failing to host Linux files 25+ years ago using this encoding hack, so it's a tried and tested solution, that numerous systems have since implemented, such as WSL. Implementing this logic in Syncthing will simply add it to the growing list of environments that support this hack.

I think for this to have a chance to live it needs to be short, sweet, and unambiguously correct in all the simple cases we can dream up. That means starting with the simplest thing that could possibly work. Lots of different encoders and autodetection and stuff is probably the antithesis of that.

Ok, stay tuned. Thanks to all again for your critical, yet non-discouraging, feedback.

@rasa rasa closed this Aug 3, 2021
@rasa
Copy link
Member Author

rasa commented Aug 8, 2021

See #7876

@rasa rasa deleted the fs/add-fix-fs branch February 22, 2022 14:59
@st-review st-review added the frozen-due-to-age Issues closed and untouched for a long time, together with being locked for discussion label Mar 10, 2023
@syncthing syncthing locked and limited conversation to collaborators Mar 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Issues closed and untouched for a long time, together with being locked for discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants