Rename special/unsupported characters in filenames #1734

Open
Zillode opened this issue Apr 29, 2015 · 62 comments
Labels
enhancement New features or improvements of some kind, as opposed to a problem (bug)

Comments

@Zillode
Contributor

Zillode commented Apr 29, 2015

As mentioned in syncthing/syncthing-android#192, some filenames are not accepted by Windows hosts because they contain 'special characters' like colons or bars.

@schuft69 suggested adding a "Rename special characters to '_'" option to resolve this issue.

< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
@calmh
Member

calmh commented Apr 29, 2015

This needs to happen on the source then, or we need to keep the translation in the database somewhere, and hilariousness ensues when the database is reset or lost.

@neclepsio

Maybe there could be a warning on the source, inviting the user to rename the file to avoid problems.

@somasis

somasis commented May 4, 2015

I would think that with the substitution of characters it wouldn't really be required on each machine's database... would it not be possible to just check files incoming/outgoing against the substituted version, if they exist already on the host or don't, do what's appropriate?

@Ichimonji10

Ichimonji10 commented May 4, 2015

I would think that with the substitution of characters it wouldn't really be required on each machine's database... would it not be possible to just check files incoming/outgoing against the substituted version, if they exist already on the host or don't, do what's appropriate?

No. Consider a case where a *nix machine has two files in a directory, and their contents are unique:

$ echo 'first file' > 'foo:bar'
$ echo 'second file' > 'foo>bar'
$ ls -1
foo>bar
foo:bar
$ cat 'foo:bar'
first file
$ cat 'foo>bar'
second file

Simply renaming each file by substituting offending characters would not work. You would end up with two conflicting files, each named foo_bar.
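The collision can be reproduced without touching a filesystem. The following is a minimal Python sketch, assuming the forbidden set is exactly the nine characters listed at the top of this issue; the helper names `naive_rename` and `find_collisions` are hypothetical and not part of Syncthing.

```python
import re
from collections import defaultdict

# Characters listed in the issue description as rejected by Windows hosts.
FORBIDDEN = re.compile(r'[<>:"/\\|?*]')


def naive_rename(name: str) -> str:
    """Replace every forbidden character with an underscore."""
    return FORBIDDEN.sub("_", name)


def find_collisions(names):
    """Group original names by their substituted form and keep the clashes."""
    groups = defaultdict(list)
    for name in names:
        groups[naive_rename(name)].append(name)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


print(find_collisions(["foo:bar", "foo>bar", "notes.txt"]))
# {'foo_bar': ['foo:bar', 'foo>bar']}
```

Any many-to-one substitution has this property, so a rename scheme needs either a reversible encoding or an explicit conflict rule.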

Also, consider this scenario:

  1. A user has a *nix machine with a file on it named foo:bar.
  2. The user adds a Windows machine to the syncthing cluster. The file is copied to the Windows machine as foo_bar.
  3. The user adds a second *nix machine to the syncthing cluster. This second *nix machine receives a file named foo:bar from the first *nix machine and a file named foo_bar from the Windows machine.

Oops. One file magically became two files.

@neclepsio

Moreover, there are other forbidden names under Windows: for example, nul or com1 and several others, or names longer than 255 characters. And names cannot end with a dot or space. And Mac could be a problem too: I think it requires normalized Unicode, while on Linux the normalized and non-normalized versions could be two different files. (I'm just remembering and cannot verify right now, so there may be some imprecision.)
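The normalization point can be illustrated with the Python standard library alone. This is only a sketch of why two byte-wise different names can look identical; it says nothing about which form a given OS or Syncthing prefers.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (NFD form)

# The two strings render the same but are different sequences of code points,
# so a filesystem that compares names byte-wise treats them as two files.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```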

@calmh
Member

calmh commented May 4, 2015

We enforce unicode normalization; there's been some pain around that, for people having files with the "wrong" normalization for their OS. There we actually silently fix it, unless configured otherwise.

for example, nul or com1 and several others, or names longer than 255 characters. And names cannot end with a dot or space.

😱 I'd forgotten about the first of that, and didn't know about the second. We should probably handle that too (at least with a reject). Dammit, Windows...

But @Ichimonji10 above summarizes quite nicely why we probably won't be doing character substitutions anytime soon.
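A reject-only check along those lines might look like the sketch below. It is an illustration under assumptions: the reserved set is a small subset of the DOS device names, the 255 limit is applied to Python string length rather than to whatever unit the filesystem counts, and `windows_name_problems` is a hypothetical helper, not Syncthing code.

```python
import re

# Assumed subset of reserved device names; the thread mentions nul and com1.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
# The nine characters from the issue description plus ASCII control characters.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def windows_name_problems(name: str) -> list:
    """Return the reasons a single path component would be rejected."""
    problems = []
    if FORBIDDEN.search(name):
        problems.append("forbidden character")
    if name.endswith((" ", ".")):
        problems.append("trailing dot or space")
    # Device names are reserved with or without an extension, in any case.
    if name.split(".")[0].upper() in RESERVED:
        problems.append("reserved device name")
    if len(name) > 255:
        problems.append("longer than 255 characters")
    return problems


for candidate in ["foo:bar", "nul", "com1.txt", "report.", "ok.txt"]:
    print(candidate, windows_name_problems(candidate))
```

Because it only reports problems and never renames anything, a check like this sidesteps the collision scenario described earlier in the thread.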

@neclepsio

Just to add all the information in one place: you have to handle case sensitivity too.

Moreover, the problem is not only with Windows but also affects NTFS under other OSes. While on Linux, shares hosted on Windows return an error when trying to create an invalid filename, direct NTFS mounts do not return errors by default, and the files are accessible under Linux but not under Windows. So, you cannot rely on errors to know if a filename is legal.
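Since errors cannot be relied on, case clashes would have to be detected by comparing names directly. The sketch below uses Python's `str.casefold` as a stand-in for a case-insensitive comparison; that is an assumption, since the folding rules a real filesystem applies may differ. `case_collisions` is a hypothetical helper.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that differ only by case."""
    groups = defaultdict(list)
    for name in names:
        # casefold() approximates a case-insensitive comparison; a real
        # filesystem may fold differently.
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(case_collisions(["Readme.md", "README.md", "notes.txt"]))
# [['Readme.md', 'README.md']]
```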

@kozec
Contributor

kozec commented May 5, 2015

This is maybe an even crazier idea, but how about simple escaping? On Windows, invalid characters in incoming file names can be substituted, All*files? to All^afiles^q, for example, and the reverse substitution can be used for outgoing files. Substitution may be configurable per-repository and enabled on Windows by default, so this would help even with Samba mounts on *nixes.

@calmh
Member

calmh commented May 5, 2015

We could do escaping, but it would have to be to something the user would not enter herself (to avoid confusion), and it would be ugly. I.e. foo:bar -> st--foo%3abar. The point being that we should be able to recognize the escaping even if the database is gone and we're doing an initial scan. I have files with legitimate ^'s in them (well, legitimateness can be debated, but they are there) and probably the same goes for someone else and whatever other character we could think of.
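The st--foo%3abar idea can be sketched as a prefix plus percent-encoding. This is an illustration only: the prefix and the `%3a` form come from the example above, while escaping `%` itself and escaping names that already start with the prefix are assumptions added here so that the mapping stays reversible without a database. `escape_name` and `unescape_name` are hypothetical names.

```python
import re

PREFIX = "st--"
# The nine characters from the issue description, plus '%' so that decoding
# is unambiguous (assumption, not stated in the thread).
UNSAFE = re.compile(r'[<>:"/\\|?*%]')


def escape_name(name: str) -> str:
    """Escape a name only if it needs it or could be mistaken for an escaped one."""
    if not UNSAFE.search(name) and not name.startswith(PREFIX):
        return name
    return PREFIX + UNSAFE.sub(lambda m: "%{:02x}".format(ord(m.group())), name)


def unescape_name(name: str) -> str:
    """Recover the original name from the on-disk name alone."""
    if not name.startswith(PREFIX):
        return name
    body = name[len(PREFIX):]
    return re.sub(r"%([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), body)


print(escape_name("foo:bar"))            # st--foo%3abar
print(unescape_name("st--foo%3abar"))    # foo:bar
```

The decoder needs nothing but the name on disk, which is the property asked for above: the escaping is still recognizable during an initial scan with an empty database.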

The case sensitivityness is an open bug somewhere else.

And I don't think we should try to guess hidden rules when the OS and filesystem are fine with them...

@kozec
Contributor

kozec commented May 5, 2015

I have files with legitimate ^'s in them (well, legitimateness can be debated, but they are there) and probably the same goes for someone else and whatever other character we could think of.

True, but that can be "solved" by escaping the escape character, so incoming file^2 would become file^^2. Of course, if the user manages to create file^a manually on a Windows machine, Linux will receive file*. I can't argue it's not ugly, but it's not ambiguous.
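A rough Python sketch of this caret scheme follows. Only `*` to `^a`, `?` to `^q` and `^` to `^^` are taken from the comments above; the remaining letter codes are hypothetical, and `/` is left out because it is a path separator on both sides.

```python
# '*' -> '^a', '?' -> '^q' and '^' -> '^^' follow the examples in the thread;
# the other letter codes are made up for illustration.
CODES = {"*": "a", "?": "q", ":": "c", "<": "l", ">": "g",
         '"': "d", "|": "p", "\\": "b", "^": "^"}
DECODE = {code: char for char, code in CODES.items()}


def caret_escape(name: str) -> str:
    """Encode a network name into a name that is valid on Windows."""
    return "".join("^" + CODES[ch] if ch in CODES else ch for ch in name)


def caret_unescape(name: str) -> str:
    """Decode a local Windows name back into the network name."""
    out, i = [], 0
    while i < len(name):
        if name[i] == "^" and i + 1 < len(name) and name[i + 1] in DECODE:
            out.append(DECODE[name[i + 1]])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


print(caret_escape("All*files?"))   # All^afiles^q
print(caret_escape("file^2"))       # file^^2
print(caret_unescape("file^a"))     # file*
```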

There is also the thing that Cygwin does: they translate invalid characters to characters from the Unicode private use area. That may, in theory, create conflicts, but the chances are really slim and it looks better.
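A private-use-area mapping of that kind can be sketched in a few lines of Python. The offset of 0xF000 is modelled on the Cygwin scheme as described in its documentation; treat it, the character set and the helper names `to_pua` and `from_pua` as assumptions made for illustration.

```python
SPECIAL = '<>:"|?*'
PUA_OFFSET = 0xF000  # assumption: offset modelled on the Cygwin scheme


def to_pua(name: str) -> str:
    """Shift each special character into the Unicode private use area."""
    return "".join(chr(ord(ch) + PUA_OFFSET) if ch in SPECIAL else ch for ch in name)


def from_pua(name: str) -> str:
    """Shift private-use characters that correspond to a special character back."""
    return "".join(
        chr(ord(ch) - PUA_OFFSET)
        if 0xF000 <= ord(ch) <= 0xF0FF and chr(ord(ch) - PUA_OFFSET) in SPECIAL
        else ch
        for ch in name
    )


encoded = to_pua("foo:bar")
print(encoded == "foo\uf03abar")   # True: ':' is U+003A, so it becomes U+F03A
print(from_pua(encoded))           # foo:bar
```

The slim chance of conflict mentioned above shows up here as well: a name that legitimately contains U+F03A would be decoded to a colon.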

@calmh
Member

calmh commented May 5, 2015

To be honest, I think the much simpler solution is for people who sync across multiple OS:es to just stick to the lowest common denominator in file names or live with the errors. It's not that onerous, and people using more than one platform should be kind of aware of the issue. For cases where someone has just a Windows box and a NAS running Linux, they'll most likely create all their files from the Windows side and automatically stay within the limitations.

(The case sensitivity thing still needs to be handled better though.)

@neclepsio

neclepsio commented May 5, 2015 via email

@kozec
Contributor

kozec commented May 5, 2015

I think the much simpler solution is for people who sync across multiple OS:es to just stick to the lowest common denominator in file names or live with the errors.

There are two problems with this approach.

  1. Sometimes the user can't choose which characters are included.
  2. The user usually realizes that [:-)] is an invalid Windows filename only after he can't find it on the other side. And that may be an especially sweet realization if the source machine is already offline.

@calmh
Member

calmh commented May 5, 2015

I don't see point number one related to syncthing; that'd be a problem for the poor user on Windows with someone forcing invalid filenames on him no matter the delivery mechanism? Point number two implies the user is also a Linux user, who I'm assuming are more aware about things like filesystems and name limitations?

@calmh
Member

calmh commented May 5, 2015

(This is all to say that I think this should be solved in a clean way, not that we shouldn't solve it at all. But something that is really ugly or has bad side effects is probably not worth it IMHO.)

@kozec
Contributor

kozec commented May 5, 2015

Point number two implies the user is also a Linux user, who I'm assuming are more aware about things like filesystems and name limitation?

I can tell from experience that it doesn't work like that :D
But yeah, it is not the biggest issue in the known universe :)

@neclepsio

neclepsio commented May 5, 2015 via email

@kozec
Contributor

kozec commented May 5, 2015

Android is Linux, but I don't think the majority of Android users even know
what a filesystem is.

Now that you mention it... I'm not saying anything bad about their skill level, but this will concern Mac OS users as well...

@calmh
Member

calmh commented May 5, 2015

Mac OS could be a valid concern, when sharing files with Windows. Mac has an awesome historic "feature" around path separators and filenames too:

[Screenshot: screen shot 2015-05-05 at 14 21 34]

😮

Android I'm not so worried about - specifically, I don't think users there generate files with weird names too often?

@Ichimonji10

Android I'm not so worried about - specifically, I don't think users there generate files with weird names too often?

I don't know what the larger Android population does. But I personally sync photos between my Android phone and PC, which means that I end up with file names like:

IMG_20140725_184948.jpg
IMG_20140725_184948 (cropped).jpg
Tamalika Mukherjee in Balinese (aksara Bali).png

Also, that screenshot is terrifying.

@neclepsio

@Ichimonji10 I agree, but if you take a photo with Hangouts you also have something like:

IMG_20140725_184948:nopm:.jpg

@jorangreef

I think it would be good if Syncthing could tackle the problem. It's the kind of problem that needs to be tackled when building a sync solution for everyone that is easy to use. Most people would not mind if ":" is converted to "_" and those that do could change the default setting, whereas not syncing a file at all has bigger impact.

Dropbox replaces trailing spaces in filenames on any platform, when first detected. If you create a file "test " with trailing space on Mac, then as soon as Dropbox detects it (even if you aren't syncing with any other platforms), it will rename the file locally to "test" without the trailing space.

I am working on a new kind of sync app for Ronomon, and recently worked on replacing reserved characters with underscores.

These are the characters that would need to be replaced to support almost every platform:

  • " Double Quote
  • * Asterisk
  • : Colon
  • < Less Than
  • > Greater Than
  • ? Question Mark
  • | Pipe
  • NUL Character (Byte 0)
  • Control Characters (Bytes 1-31, Byte 127)
  • Leading Hyphens (cause problems with many command line tools on Linux and Mac)
  • Trailing Dots
  • Trailing Spaces
  • Parent directory alias (..)
  • Current directory alias (.)
  • Home directory alias (~)

These characters should be replaced with as many underscores.

And then these device names also need to be modified slightly because AUX and AUX.txt are invalid on Windows but AUX_.txt is fine:

  • $IDLE$
  • AUX
  • COM1
  • COM2
  • COM3
  • COM4
  • COM5
  • COM6
  • COM7
  • COM8
  • COM9
  • CONFIG$
  • CON
  • CLOCK$
  • KEYBD$
  • LPT1
  • LPT2
  • LPT3
  • LPT4
  • LPT5
  • LPT6
  • LPT7
  • LPT8
  • LPT9
  • LST
  • NUL
  • PRN
  • SCREEN$
  • $AttrDef
  • $BadClus
  • $Bitmap
  • $Boot
  • $LogFile
  • $MFT
  • $MFTMirr
  • pagefile.sys
  • $Secure
  • $UpCase
  • $Volume
  • $Extend

These reserved device names should be automatically appended with an underscore (e.g. "AUX_" or "aux_.txt").
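As an illustration of the underscore rule, here is a small Python sketch. It assumes that only the part of the name before the first dot is compared, case-insensitively, and it carries only a subset of the list above; `defuse_device_name` is a hypothetical helper name.

```python
# Assumed subset of the device names listed above.
RESERVED = (
    {"AUX", "CON", "NUL", "PRN", "CLOCK$"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def defuse_device_name(name: str) -> str:
    """Append an underscore to the stem if it is a reserved device name."""
    stem, dot, rest = name.partition(".")
    if stem.upper() in RESERVED:
        return stem + "_" + dot + rest
    return name


print(defuse_device_name("AUX"))            # AUX_
print(defuse_device_name("aux.txt"))        # aux_.txt
print(defuse_device_name("auxiliary.txt"))  # auxiliary.txt
```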

Please let me know if something is left out here.

If it would be helpful here, then these are some key ideas which might make the problem manageable and not so hard:

  1. Rename the files on any platform as soon as they are spotted. Renaming them only when they reach a platform where they are invalid only delays the rename and may surprise the user later.
  2. The file should be renamed across the cluster. i.e. There should be no special mapping to preserve invalid characters on platforms which allow them.
  3. When renaming the file locally when the file is first detected by the scanner, first check if another file already exists with the proposed rename. If it does, then add a "(Reserved Character Conflict 1)" label to the end of the filename but before the extension, and then try again or increment the conflict count until the destination is unique.
  4. Take care with hidden files (e.g. ".hidden:file") to not add the conflict label before the period. In this case there is no extension and the conflict label needs to be added on the right hand side of the period not the left hand side (Dropbox gets this wrong).
  5. Rare case: Make sure that after replacing reserved characters, the filename has not been inadvertently converted into an Apple Double file (which starts with "._"), if it was not previously an Apple Double file before the replacement. If it is now an Apple Double file, then convert the "._" prefix to ".-", i.e. use a dash instead of an underscore.
  6. Very rare case: Make sure that after replacing reserved characters, the filename has not been inadvertently converted into a ".DS_Store" file. If it was, then convert the "_" to "-", i.e. use a dash instead of an underscore.
  7. Very rare case: On Windows, certain short 8.3 filenames with no corresponding long filename and also containing a tilde, such as "SECURE~1.TXT", can cause rare conflicts with other files such as "SecureSocketsLayer.txt" and "SecureFTPServer.txt", depending on the order in which they are synced from different machines, if they are resolved by Windows to the same short 8.3 filename. These conflicts cannot be resolved in the usual manner by appending a conflict label to the filename. Instead, the tilde in these short 8.3 filenames should be automatically replaced with an underscore (e.g. "SECURE_1.TXT") when first detected in filenames on any platform, so that these short 8.3 filenames can continue to be synced.
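Points 3 and 4 of the list above can be sketched as follows. This is a minimal illustration, assuming the set of existing names in the directory is already known; it relies on `os.path.splitext` treating a leading dot as part of the name rather than as an extension separator. `with_conflict_label` and `unique_name` are hypothetical helpers.

```python
import os


def with_conflict_label(name: str, n: int) -> str:
    """Insert the conflict label after the stem and before the extension."""
    label = f" (Reserved Character Conflict {n})"
    # For a hidden file such as ".hidden_file", splitext returns an empty
    # extension, so the label lands to the right of the leading period.
    stem, ext = os.path.splitext(name)
    return stem + label + ext


def unique_name(proposed: str, existing: set) -> str:
    """Return the proposed name, or a labelled variant that is not yet taken."""
    if proposed not in existing:
        return proposed
    n = 1
    while with_conflict_label(proposed, n) in existing:
        n += 1
    return with_conflict_label(proposed, n)


print(unique_name("foo_bar.txt", {"foo_bar.txt"}))
# foo_bar (Reserved Character Conflict 1).txt
print(unique_name(".hidden_file", {".hidden_file"}))
# .hidden_file (Reserved Character Conflict 1)
```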

This should cover:

exFAT (http://en.wikipedia.org/wiki/ExFAT)
VFAT (http://en.wikipedia.org/wiki/File_Allocation_Table#VFAT)
NTFS (http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations)
HFS+ (http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations)

For non-case-preserving filesystems (FAT12, FAT16, FAT32), we may also need to replace "+,.;=[]!@" but I have not tested this yet.

Hope this helps.

@jkufner

jkufner commented May 7, 2015

Well, none of these characters are a problem on decent filesystems used on
Linux. There can be a problem on Android, since it uses FAT on SD cards.

Anyway, simply replacing these with '_' is a bad idea, because it causes
data loss. There should be a bidirectional mapping which allows keeping the
original filename on one computer and a modified one on another.


@calmh
Member

calmh commented May 7, 2015

I could possibly see a rename-on-sight as described above as an optional must-be-turned-on-manually feature. We currently do that silently for incorrect unicode normalization, and I could possibly see us doing it by default for trailing space (seldom intended I think, and apparently causes issues in cross compatibility), but the rest is not something we should do by default for sure.

@calmh calmh added enhancement New features or improvements of some kind, as opposed to a problem (bug) help-wanted labels May 7, 2015
@MarkusTeufelberger

Dropbox has a webservice to check all your files for potential conflicts (https://www.dropbox.com/help/145 and https://www.dropbox.com/bad_files_check) and doesn't rename them by default.

Maybe an empty "illegal filenames present, run the bad_filename_checker" file or something like that could be inserted instead of a file that cannot be created on a certain platform, so the file can be renamed properly. Alternatively name the file temporarily to its sha256 hash or whatever syncthing uses as UUID internally so the content can be synced and only the file name gets lost until it is properly named for your platform and can still be used to seed the contents.
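The temporary-name idea could look roughly like the sketch below. It hashes the file content with SHA-256 from the standard library; whether Syncthing would use this hash or some other internal identifier is left open above, so `placeholder_name` is purely hypothetical.

```python
import hashlib


def placeholder_name(content: bytes) -> str:
    """Derive a name from the content that is valid on every platform."""
    # A hex digest contains only 0-9 and a-f and is 64 characters long.
    return hashlib.sha256(content).hexdigest()


name = placeholder_name(b"first file\n")
print(len(name), name.isalnum())   # 64 True
```

A name like this loses the original filename entirely, so the original would still have to be recorded somewhere until the user picks a proper local name.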

@rdebath

rdebath commented May 23, 2015

This is not a simple problem. Normally "network path names" consist of a list of directory names and a file name. Each of these components can be any string of eight-bit bytes except for the NUL byte, which is used to terminate each component. The directory separator can be different on every host, but it is very unusual nowadays for the / character not to be a valid separator. (However, care should be taken with aliases such as the \ character on Windows so that a pathname cannot be constructed to go UP in the directory tree.)

The components can (in theory) be any string including . .. and the null string. The first two are a problem for most OSes (all?) including Unix (linux). The null string component may be a problem on Windows because \\ is special.

Windows has a very long list of problematic names; it would be very insensitive to export these limitations to other operating systems, especially as it includes case insensitivity.

In addition many of these limitations become security issues, for example translating overlong UTF-8 sequences onto a host that uses UTF-16 can manufacture NUL or \ characters from plain looking byte streams. What's more there's nothing wrong with these "non-UTF-8" file names on a Unix system so they should be transferred unmangled between Linux machines.
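The overlong-sequence risk can be shown with a strict decoder. The Python sketch below only illustrates the check a node would want before converting a name for a UTF-16 host; it does not imply that non-UTF-8 names should be rejected between Unix machines. `decode_strict` is a hypothetical helper.

```python
def decode_strict(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_nul = b"foo\xc0\x80bar"        # two-byte overlong encoding of U+0000
overlong_backslash = b"..\xc1\x9c.."    # two-byte overlong encoding of '\' (U+005C)

# A strict decoder refuses both instead of producing NUL or a backslash.
print(decode_strict(overlong_nul))                  # None
print(decode_strict(overlong_backslash))            # None
print(decode_strict("caf\u00e9".encode("utf-8")))   # café
```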

It would be nice if a unique "network path name" could be mapped to a unique local path name; this is mostly possible on Unix as it'd be simple enough to add a "someone tried to be evil" prefix to the unwanted file names. But the dumb "sort of case sensitive only not" behaviour of Windows defeats this.

Then, the case insensitivity of Windows is even worse than you think. It isn't the same everywhere. It depends on the localisation that Windows is running under and the version of Windows that you're running (e.g. Unicode characters that don't exist on an earlier version of Windows can't be "case smashed" there but can be on a later one).

Obviously, it's impossible to tell what workarounds may be needed on another peer, so all a particular node can do is look out for itself. The security impacts must be addressed, but it's probably impossible to fix case insensitivity. With luck you'd be able to manufacture a collision, make believe that the file has just been created on the problematic machine and force some sort of rename across the entire "swarm". Making it a unique rename will be the trick.

@biguenique

I personally don't find the idea of applying Windows' puzzling naming constraints to the other OSes very attractive. What if a user is syncing files for an application that relies on file names that would be illegal on Windows? And I don't think we can design an encoding that would be bi-directional (allowing us to restore the original name from the encoded one) without introducing a fair amount of confusion for the end user.

We may as well let the Windows nodes deal with the Windows problem, and avoid spreading it across other OSes. The Windows nodes should be able to keep track of global file names that are locally illegal. The files with illegal global names could be renamed locally, inserting a distinctive marker (like "!syncname" or whatever) in the name and replacing/removing any illegal characters or sequences (of course it would have to store the mapping of the local and global names somewhere). At least the files would actually be synced, the user would know there is a naming problem, and might have a cue of the original file name. When encountering such a file, the Windows node could simply look up the global name before proceeding with any network task. It could also keep track of file renaming, only changing the remote file name when the user manually removes the marker from the local name -- making it a valid file name for any OS anyway.

As for storage, instead of storing the local-global name mappings in the Syncthing database, why not simply store them in a local file? We already deal with ".stignore" and ".stfolder" at the root of a shared folder (we could even use the latter?). The mapping could as well be stored in a sibling system file, so the global name follows when the user moves folders around. Windows users are kinda used to see their folders cluttered with hidden system files such as "Desktop.ini" and "thumbs.db", so I don't see this approach as a big turn-off for them. And it would not affect in any way how the other nodes work. The Windows problem would remain a Windows problem.
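A sidecar mapping file of this kind could be sketched as below. Everything here is hypothetical: the file name `.stnamemap`, the JSON layout and the helper names are made up for illustration and are not existing Syncthing behaviour.

```python
import json
import os

MAP_FILE = ".stnamemap"  # hypothetical sidecar file, next to .stignore/.stfolder


def load_mapping(folder: str) -> dict:
    """Read the local-name -> global-name mapping, or return an empty one."""
    path = os.path.join(folder, MAP_FILE)
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def record_rename(folder: str, local_name: str, global_name: str) -> None:
    """Remember that a locally renamed file stands for a given global name."""
    mapping = load_mapping(folder)
    mapping[local_name] = global_name
    with open(os.path.join(folder, MAP_FILE), "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, ensure_ascii=False, indent=2)


def global_name_for(folder: str, local_name: str) -> str:
    """Look up the global name before any network task; default to the local name."""
    return load_mapping(folder).get(local_name, local_name)
```

For example, after `record_rename(folder, "foo_bar!syncname", "foo:bar")`, a lookup with `global_name_for(folder, "foo_bar!syncname")` yields `foo:bar`, while unmapped names are returned unchanged. If the sidecar file is lost, the mapping is lost with it, which is the same weakness raised earlier for a database-backed translation.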

@jorangreef

jorangreef commented Jul 8, 2015

Some of these are actually not just Windows-specific, i.e. leading dash in filenames on Linux and Mac which are technically allowed but a security vulnerability.

Replacing characters only on Windows using a mapping would be great, but there is no canonical mapping for this kind of thing (e.g. such as the canonical mapping that Unicode has in NFC or NFD) that users could use to know what Syncthing is doing, so it would break any cross-platform file tree comparison that any applications try and do and lead to data loss (e.g. a program running on Windows tries to access the same file on a Linux server and finds it missing).

@endolith

The files should be transferred in some way or other. Put them in a "holding area" where they can be renamed or a special folder for sync conflicts, etc. but they need to be transferred.

@bugith

bugith commented Feb 5, 2018

IIRC, leading space(s) are forbidden in Windows too (NTFS yes; for the other *fat* filesystems I don't know).
If pagefile.sys is really forbidden (only at the X: root, or anywhere?), maybe hiberfil.sys is too (not sure of the exact name).

@rdebath

rdebath commented Feb 10, 2018

Leading spaces are NOT forbidden in Windows.
From cmd.exe you can create them:

md " HI"
echo > " Test.txt"

Windows explorer will, however, be rather confused by them and does try (unsuccessfully) to forbid them.

Interestingly, cmd.exe does trim trailing spaces and dots, something like Explorer, and Explorer can do leading dots IFF you have a trailing dot (which gets deleted ...).

md " WWW . . . . . . . . ."

This test was done on NTFS. The "vfat" addition to the old 8+3 DOS filesystems can probably do the same, but I haven't tested.

@bugith

bugith commented Feb 10, 2018

You're right. "Forbidden" was in my mind instead of "broken" because of this mismatched behaviour between Explorer and cmd. I remember a weird HP MFC application called something like PrintToWeb that ran as a service, creating 2 folders in Windows users' profiles that broke NTBackup or ArcServe backups because of these leading or trailing spaces. I couldn't delete either of them, as the service recreated the other immediately, and deleting both in a single shot failed with an error. I was a newbie and it took me a while to understand I had to disable the service. It was on NT4, and maybe I also had to use (nt)fsutil to delete the one that started/ended with a space. Damned chimera, when these 2 MS/HP monsters fertilise each other to the users' death!

@calmh calmh removed this from the Unplanned (Contributions Welcome) milestone Feb 11, 2018
@Mannshoch

Maybe as a solution: set up a translation directory for each unsupported character and replace it with a UTF-8 symbol like the Roman numerals:
Roman Numeral One | Ⅰ | Ⅰ | Ⅰ
Roman Numeral Two | Ⅱ | Ⅱ | Ⅱ
Roman Numeral Three | Ⅲ | Ⅲ | Ⅲ
Roman Numeral Four | Ⅳ | Ⅳ | Ⅳ
Roman Numeral Five | Ⅴ | Ⅴ | Ⅴ
Roman Numeral Six | Ⅵ | Ⅵ | Ⅵ
Roman Numeral Seven | Ⅶ | Ⅶ | Ⅶ
Roman Numeral Eight | Ⅷ | Ⅷ | Ⅷ
Roman Numeral Nine | Ⅸ | Ⅸ | Ⅸ
Roman Numeral Ten | Ⅹ | Ⅹ | Ⅹ
Roman Numeral Eleven | Ⅺ | Ⅺ | Ⅺ
Roman Numeral Twelve | Ⅻ | Ⅻ | Ⅻ

@Ichimonji10

Ichimonji10 commented May 25, 2018

What happens when a directory with the translated name already exists? Does the translation not happen, because the translation directories can't be created? Does a translation directory get created, thus clobbering the user's files? What happens if the translation directory gets lost?

Let's make sure to not exchange one problem (can't sync certain weirdly named files) for a bunch of new code and a much more subtle set of behaviours and problems.

@Mannshoch

How many times have you ever seen a filename with a UTF-8 symbol in it? How many times have you ever seen a file containing the symbol "Ⅷ", and how often do you think you will find one filename with a : and another with an Ⅷ as the only difference?
What I suggest is to play with the possibilities you get if you use allowed UTF-8 symbols.

@AudriusButkevicius
Member

I think it's a terrible idea.

@Ferroin

Ferroin commented May 25, 2018

@Mannshoch There are three reasons you never see filenames with Unicode symbols in them (as opposed to Unicode characters that are based on actual linguistic graphemes):

  1. It's not portable, at all, period, end of story. You can't count on the platform properly handling Unicode, and you can't count on the font itself properly handling the symbols (in fact, you can't even realistically count on the font handling anything outside of the graphemes for the language the system is configured for and the basic ASCII set (and even the second part is a stretch at times)).
  2. You can't type them directly on any keyboard. As surprising as this may be, a lot of people do still use the keyboard for selecting files in quite a few cases, and not being able to type characters from a filename is a bad thing. In theory, people could use input methods to handle this (for example, I can do it on my Linux systems, provided I know the 32-bit codepoint), but it's unreasonable to expect people to use them at all.
  3. They're visually ambiguous in many cases. Taking your exact example, the only visible differences with the default font used by GitHub between Ⅷ and VIII are the spacing and the presence of serifs on the first one. With Arial (the default font on Windows), as well as most of the default fonts on Linux, there is no visible difference. This is very bad for UX, on multiple levels; it's why people replacing characters in domain names for phishing purposes works so well. This is technically a problem with the font itself, not the application, but that doesn't mean it should not be considered.
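The ambiguity in the third point is easy to demonstrate with Python's standard `unicodedata` module. This only illustrates that the two names are distinct strings that Unicode itself treats as compatibility-equivalent; it makes no claim about which normalization form Syncthing applies.

```python
import unicodedata

numeral = "\u2167"  # the single character used in the example above
ascii_form = "VIII"

print(unicodedata.name(numeral))                             # ROMAN NUMERAL EIGHT
print(numeral == ascii_form)                                 # False
# Compatibility normalization folds the symbol into four plain letters.
print(unicodedata.normalize("NFKC", numeral) == ascii_form)  # True
```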

Beyond all three of those issues (the first and third are direct problems for Syncthing, the second is an indirect issue), you also have to consider that this absolutely needs to be reversible. In other words, it has to be a 1:1 mapping, which isn't possible without running the risk of a name collision and thus potentially losing data. If we are going to handle losing data, it's better to just use þ, which is the standard, accepted, replacement character, and track the proper names in the index. That still doesn't properly solve any of the above listed issues though.

@endolith

I don't understand why you have no problem renaming files with "sync conflict" appended to the filename, but won't consider renaming files when there's a system filename incompatibility. You can escape the bad characters, put them in a special folder, etc., but don't just leave them unsynced and inaccessible.

@endolith

endolith commented May 25, 2018

You can use other tools to get back to the source device and rename the files. SSH or your feet work fine.

No, you cannot assume that either of these is possible. There are many use cases for Syncthing.

Option to rename and sync anyway may be reasonable, but disabled by default to avoid damage it may cause.

That would be great!

@Ichimonji10

Ichimonji10 commented May 25, 2018

I don't understand why [...]

Understanding the problems with the currently proposed solutions is necessary for making a good faith argument.

You can escape the bad characters

Can this actually be done? How would the problem listed in #1734 (comment) be solved?

put them in a special folder

Moving a file with weird characters in its name doesn't solve the problem of that file having weird characters in its name. Moving the file now means that some number of hosts need to track file name mappings in a database, and it opens up additional possibilities for naming conflicts, both of which are cans of bugs. There was one problem. Now there are three or more.

@Ferroin

Ferroin commented May 25, 2018

@endolith Your argument goes both ways. It's just as easy to argue that you shouldn't be creating files with names that have bad characters in them. Of the above listed problematic characters, the only ones that make any sense to realistically have in a file name that isn't liable to confuse people are " (which is trivial to replace with ' when you create the file), slashes / and \ (which I will admit are non-trivial to replace, though most sane people are in the habit of not using them in filenames), and possibly ?. Beyond what's listed above, almost everything on the top row of a keyboard is problematic on UNIX systems (except - and _), and yet people have done just fine over the years handling this fact. As far as names, find me a realistic case where any of them other than AUX and CON make sense as file names, and I might listen (both AUX and CON as special names need to die on Windows though, they have been the source of a number of bugs over the years involving improper handling of embedded paths).

@rdebath

rdebath commented May 25, 2018

First, it's not just CON, NUL, AUX, PRN etc. It's also Nul.txt and Con.txt, also nul.exe is a great one.
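A minimal sketch of the check this implies, in Python. The function name `is_windows_reserved` is hypothetical, and the set of device names is an assumption based on the commonly documented list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9); none of it is taken from Syncthing. The point is that the comparison has to ignore case and everything after the first dot:

```python
# Assumed list of reserved DOS device names; not taken from Syncthing.
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_windows_reserved(name: str) -> bool:
    """Return True if `name` collides with a reserved device name."""
    # Only the part before the first dot matters, so "nul.exe" counts too.
    stem = name.split(".", 1)[0]
    # Compare case-insensitively, ignoring trailing spaces in the stem.
    return stem.rstrip(" ").upper() in RESERVED_DEVICE_NAMES


for candidate in ["CON", "Nul.txt", "nul.exe", "console.txt"]:
    print(candidate, is_windows_reserved(candidate))
```

Only the last candidate, `console.txt`, passes this check as a usable name.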

Basically, the only reasonably reliable solution is to create a virtual filesystem layer on the client. Filenames that are white-listed are allowed to map directly to real filenames; everything else gets mapped to an '_' replacement or a random name, depending on whether there is a collision in some sort of name-remapping database. The mangled names never leave the machine.

You can then implement or configure various levels of paranoia, all the way from "just let the files collide" to "only allow lowercase, numbers and one dot for an extension for no more than thirty characters".
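The remapping layer described above could look roughly like the following Python sketch. Everything here is hypothetical: the `NameMapper` class, the whitelist (ASCII letters, digits, dot, underscore, hyphen, space), and the use of a `~N` counter suffix in place of the random name mentioned above, chosen only so the example is deterministic. An in-memory dict stands in for the remapping database.

```python
import re

# Assumed whitelist: ASCII letters, digits, dot, underscore, hyphen, space.
SAFE_NAME = re.compile(r"[A-Za-z0-9._ -]+")
UNSAFE_CHAR = re.compile(r"[^A-Za-z0-9._ -]")


class NameMapper:
    """Hypothetical local-only table between synced and on-disk names."""

    def __init__(self):
        self.to_local = {}     # original name -> on-disk name
        self.to_original = {}  # on-disk name -> original name

    def local_name(self, original: str) -> str:
        # Reuse an existing mapping so the result is stable across calls.
        if original in self.to_local:
            return self.to_local[original]
        if SAFE_NAME.fullmatch(original):
            candidate = original
        else:
            candidate = UNSAFE_CHAR.sub("_", original)
        # On collision, append a counter (a stand-in for a random name).
        base, n = candidate, 1
        while candidate in self.to_original:
            candidate = f"{base}~{n}"
            n += 1
        self.to_local[original] = candidate
        self.to_original[candidate] = original
        return candidate

    def original_name(self, local: str) -> str:
        # The mangled name never leaves the machine; peers see the original.
        return self.to_original[local]


mapper = NameMapper()
print(mapper.local_name("12:34:56.log"))
print(mapper.local_name("12?34?56.log"))
print(mapper.original_name("12_34_56.log~1"))
```

The second name collides with the first after replacement and therefore gets the suffix. As noted early in the thread, the mapping only survives as long as the table does, so a lost or reset database loses the original names.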

But now I'm probably repeating myself ... again.

@soredake

Still a problem on Android.

@Mannshoch

Mannshoch commented Jul 10, 2018

@Ferroin

It's not portable, at all, period, end of story. You can't count on the platform properly handling Unicode, and you can't count on the font itself properly handling the symbols (in fact, you can't even realistically count on the font handling anything outside of the graphemes for the language the system is configured for and the basic ASCII set (and even the second part is a stretch at times)).

You can count on handling Unicode. I have no problem with it on Windows 7 and 8.1 and also no problem with it on Linux. Maybe you could add a Unicode check and print an error if nextcloud is started on a non-Unicode system.

You can't type them directly on any keyboard. As surprising as this may be, a lot of people do still use the keyboard for selecting files in quite a few cases, and not being able to type characters from a filename is a bad thing. In theory, people could use input methods to handle this (for example, I can do it on my Linux systems, provided I know the 32-bit codepoint), but it's unreasonable to expect people to use them at all.

I think that could be ignored.
Speculatively, maybe you could write the old title to the file info. In Windows I'm able to use the file info to add more data to a file, which doesn't have to be stored inside the file, and I'm able to find the file with this info. In Linux, KDE is able to handle file info the same way. In a terminal this may be more of a problem, but I do not see it as a greater problem than a file that does not get synced.

They're visually ambiguous in many cases. Taking your exact example, the only visible differences with the default font used by GitHub between Ⅷ and VIII are the spacing and the presence of serifs on the first one. With Arial (the default font on Windows), as well as most of the default fonts on Linux, there is no visible difference. This is very bad for UX on multiple levels; it's why replacing characters in domain names for phishing purposes works so well. This is technically a problem with the font itself, not the application, but that doesn't mean it should not be considered.

OK. My choice of Unicode symbol was not that good. See below for a new idea.

Beyond all three of those issues (the first and third are direct problems for Syncthing, the second is an indirect issue), you also have to consider that this absolutely needs to be reversible. In other words, it has to be a 1:1 mapping, which isn't possible without running the risk of a name collision and thus potentially losing data. If we are going to handle losing data, it's better to just use �, which is the standard, accepted, replacement character, and track the proper names in the index. That still doesn't properly solve any of the above listed issues though.

If you like clocks, use them; with them you could also create an index that makes it reversible:
http://www.fileformat.info/info/unicode/char/search.htm?q=CLOCK+FACE&preview=entity
There are many more Unicode possibilities.
If you use a different set of question marks, you avoid some problems with identical names after renaming. You could also avoid the second argument you gave me, because clock icons are usually not part of any language and cannot easily be misunderstood.

@Ichimonji10

If you like clocks, use them; with them you could also create an index that makes it reversible:

Someone go tell the Apple messaging team that they've been one-upped. That bug merely crashed phones. This bug causes data loss.

@Ferroin

Ferroin commented Jul 10, 2018

@Mannshoch

You can count on handling Unicode. I have no problem with it on Windows 7 and 8.1 and also no problem with it on Linux. Maybe you could add a Unicode check and print an error if nextcloud is started on a non-Unicode system.

No, you can't. Not 100% correctly at least. Note that I'm not saying 'has to handle Unicode characters in filenames', I'm saying 'has to follow consistent predictable rules regarding handling of Unicode characters'. Case in point, mapping of Unicode code points to byte-wise length of filenames is drastically different on Linux, Windows, and macOS, and is functionally unpredictable on Linux and Windows (because the filesystem may not support the chosen characters, or may use an encoding which represents them in a way that takes up more or less space than the system encoding would).
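To make the length point concrete, here is a small Python sketch that measures one name three ways. The helper `name_lengths` is hypothetical, and the choice of UTF-8 and UTF-16 as the two on-disk encodings to compare is an assumption made for illustration; which unit a given filesystem actually counts depends on that filesystem.

```python
def name_lengths(name: str) -> dict:
    """Measure a filename in code points, UTF-8 bytes, and UTF-16 units."""
    return {
        "code_points": len(name),
        "utf8_bytes": len(name.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids counting a BOM.
        "utf16_units": len(name.encode("utf-16-le")) // 2,
    }


# U+1F550 is CLOCK FACE ONE OCLOCK, one of the proposed clock symbols.
for name in ["VIII", "\u2167", "\U0001F550"]:
    print(repr(name), name_lengths(name))
```

A single clock symbol is one code point but four UTF-8 bytes and two UTF-16 code units, so a name that fits a limit under one encoding can exceed it under another.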

I think that could be ignored.
Speculatively, maybe you could write the old title to the file info. In Windows I'm able to use the file info to add more data to a file, which doesn't have to be stored inside the file, and I'm able to find the file with this info. In Linux, KDE is able to handle file info the same way. In a terminal this may be more of a problem, but I do not see it as a greater problem than a file that does not get synced.

I didn't intend for this point to be an argument against Syncthing handling any of this, just to point out one of the reasons why Unicode symbols are a poor choice (and do note that I'm differentiating symbols from actual graphemes, so I'm talking about stuff like the range between U+1F300 and U+1F5FF, but not things in U+20000 to U+2FFFF). Syncthing can functionally ignore this, but it's worth remembering because it impacts the likelihood of such files existing.

If you like clocks, use them; with them you could also create an index that makes it reversible:
http://www.fileformat.info/info/unicode/char/search.htm?q=CLOCK+FACE&preview=entity
There are many more Unicode possibilities.
If you use a different set of question marks, you avoid some problems with identical names after renaming. You could also avoid the second argument you gave me, because clock icons are usually not part of any language and cannot easily be misunderstood.

Again, that has support issues. You're just changing what symbol is being used. If the font doesn't support this, you will at best see the numeric codepoint in hex inside a box (if it's a sanely designed font that does this, which is unfortunately rare), and at worst will get a ▯ (U+25AF, White Vertical Rectangle, used as the standard character on Windows for code points that don't have a character defined in the current font).

Note also that � is not a question mark, it just happens that most fonts have standardized on rendering it as a white question mark in a black diamond. It's codepoint U+FFFD, the officially defined replacement character, which is supposed to be used in cases like this.
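As a small illustration in Python, the standard `unicodedata` module reports the official names of the two code points mentioned above, and decoding invalid UTF-8 with `errors="replace"` shows U+FFFD being used for exactly this purpose. The byte string is a made-up example.

```python
import unicodedata

print(unicodedata.name("\ufffd"))  # official name of U+FFFD
print(unicodedata.name("\u25af"))  # official name of U+25AF

# 0xE9 is not valid UTF-8 on its own here, so the decoder substitutes U+FFFD.
decoded = b"caf\xe9.txt".decode("utf-8", errors="replace")
print(decoded == "caf\ufffd.txt")
```

The substitution is lossy: nothing in the decoded string records which byte was replaced, which is why the original names would still have to be tracked somewhere else.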

@AudriusButkevicius
Member

Sorry, but this has turned into a nonsense discussion. Nobody is going to rely on Unicode to fix this problem. We can use a URL-like escape sequence or something along those lines that can be converted in-line. Sure, we can blow past file length limits, but who cares at that point.
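A URL-like escape of this kind could be sketched in Python as follows. The function names `escape_name` and `unescape_name` are hypothetical, the forbidden set is the list of nine characters from the top of this issue, and escaping `%` itself is an added assumption that keeps the mapping reversible. This is an illustration only, not what Syncthing implements.

```python
import re

# The nine characters listed at the top of this issue.
FORBIDDEN = '<>:"/\\|?*'


def escape_name(name: str) -> str:
    """Replace each forbidden character (and '%') with %XX."""
    return "".join(
        f"%{ord(c):02X}" if c in FORBIDDEN or c == "%" else c
        for c in name
    )


def unescape_name(name: str) -> str:
    """Invert escape_name by decoding every %XX sequence."""
    return re.sub(
        r"%([0-9A-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


escaped = escape_name("12:34:56.log")
print(escaped)
print(unescape_name(escaped))
```

Because every literal `%` is escaped as `%25`, a name produced by `escape_name` always decodes back to the original without a database. The open question is a file created on the restricted side whose name already looks escaped, which is where a decoding rule or heuristic is still needed.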

@Ferroin

Ferroin commented Jul 10, 2018

File length limits shouldn't matter at all (I seriously doubt that people are using files with names close enough to the typical 255 character limit for this to matter). Path length limits on Windows may matter though (240 characters sounds like a lot, but people hit that regularly enough with older apps that you see pretty regular questions about it).
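A rough Python sketch of how escaping interacts with these limits, assuming a percent-style escape in which each escaped character grows from one character to three. The helpers `escaped_length` and `fits` are hypothetical, and the default limits are simply the 255 and 240 figures quoted above, passed as parameters so they can be changed.

```python
def escaped_length(name: str, forbidden: str = '<>:"/\\|?*%') -> int:
    """Length of `name` after each forbidden character becomes %XX."""
    return sum(3 if c in forbidden else 1 for c in name)


def fits(directory: str, name: str,
         name_limit: int = 255, path_limit: int = 240) -> bool:
    """Check the escaped name against both a name and a path limit."""
    n = escaped_length(name)
    # The +1 accounts for the separator between directory and name.
    return n <= name_limit and len(directory) + 1 + n <= path_limit


print(escaped_length("12:34:56"))
print(fits("C:\\Sync", "12:34:56.log"))
print(fits("C:\\Sync", "a" * 250 + ":::"))
```

The last name is 253 characters long before escaping and 259 after, so it fits the name limit on the source but not once escaped.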

@chrisfcarroll

Should this thread be closed, with a link to a new issue called "Write a Detailed Specification for How to Reliably Translate and Sync Filenames that are Illegal on Some Platforms"?

We can at least sketch out what that spec has to contain before the feature request can be taken seriously. Something like:

Section 1. Tables of O/S & filesystem Rules
Describing legal & illegal filenames per O/S(-version) and per filesystem(-version).

Section 2. An Encoding/Decoding Algorithm pair
To be used when writing to a filesystem which will reject the original name

Section 3. Description of one or more Rulesets for when Encoding will be applied
for example
Ruleset 1: Never
Ruleset 2: Whenever "Needed"

Section 4. Statement of Rules for when Decoding will be applied
for example
Ruleset 1: Never
Ruleset 2: Whenever the Database says an Encoding has been applied and we are syncing to an OS/fs that doesn't require the encoding
Ruleset 3: Whenever a (to be defined) Heuristic says an Encoding has been applied and we are syncing to an OS/fs that doesn't require the encoding

Section 5. A proof that integrity can be maintained, and under what conditions

Section 6. Proposal for the UI for users to choose their options

@endolith

@Ferroin

@endolith Your argument goes both ways.

No it doesn't. It's totally unreasonable to expect people to go through thousands of filenames and sanitize them before syncing them. It's irrelevant which characters are in filenames, the file should be transferred regardless. (For example, I've used Linux software that makes filenames with timestamps like 12:34:56, while colons are invalid in Windows.)

If you put files into a sync folder, expect them to be at the other end when you arrive, and they aren't, and there's nothing you can do from that end to make them sync, that's a major bug. It needs to put them in some kind of BAD_FILENAMES folder, mangle their names in a non-colliding way, or whatever, but it still needs to transfer the files somehow, so you can rename them from the problematic end and put them back in their normal place, or work with them locally and have them remain in the correct name at the original end.

@AudriusButkevicius
Member

I am locking this, this is just bikeshedding at this point.

@syncthing syncthing locked as too heated and limited conversation to collaborators Feb 17, 2019
@imsodin imsodin changed the title Rename special characters to '_' Rename special/unsupported characters in filenames Mar 22, 2024
@rasa
Member

rasa commented May 14, 2024

I have started drafting a proposed solution at #9539. Feedback is appreciated.

rasa added a commit to rasa/syncthing that referenced this issue Jun 30, 2024
rasa added a commit to rasa/syncthing that referenced this issue Jul 2, 2024