Rename special/unsupported characters in filenames #1734
Comments
This needs to happen on the source then, or we need to keep the translation in the database somewhere, and hilariousness ensues when the database is reset or lost. |
Maybe there could be a warning on the source, inviting to rename the file to avoid problems. |
I would think that with the substitution of characters it wouldn't really be required on each machine's database... would it not be possible to just check files incoming/outgoing against the substituted version, if they exist already on the host or don't, do what's appropriate? |
No. Consider a case where a *nix machine has two files in a directory, and their contents are unique:
$ echo 'first file' > 'foo:bar'
$ echo 'second file' > 'foo>bar'
$ ls -1
foo>bar
foo:bar
$ cat 'foo:bar'
first file
$ cat 'foo>bar'
second file
Simply renaming each file by substituting offending characters would not work. You would end up with two conflicting files, each named
Also, consider this scenario:
Oops. One file magically became two files. |
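The collision described above can be reproduced with a short sketch. The `naive_substitute` helper, its character set, and the choice of `_` as the replacement are assumptions made for illustration; this is not something Syncthing does.

```python
# Hypothetical illustration: replace characters that Windows rejects with "_".
RESERVED = set('<>:"/\\|?*')

def naive_substitute(name):
    """Replace every reserved character with an underscore."""
    return "".join("_" if ch in RESERVED else ch for ch in name)

names = ["foo:bar", "foo>bar"]
mapped = [naive_substitute(n) for n in names]

# Two distinct source names collapse into a single target name.
print(mapped)
print(len(set(names)), "names became", len(set(mapped)))
```

Any one-way substitution of this kind loses information, which is why the thread keeps coming back to either a stored mapping or a reversible encoding.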
Moreover, there are other forbidden names under Windows: for example, nul or com1 and several others, or names longer than 255 characters. And names cannot end with a dot or space. And Mac could be a problem too: I think it requires normalized Unicode, while on Linux the normalized and non-normalized versions could be two different files. (I'm going from memory and cannot verify right now, so there may be some inaccuracies.) |
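As a rough sketch of the Windows rules listed in this comment, a checker might look like the following. The function name is hypothetical, the device-name list is the commonly documented one and may not be exhaustive, and the length is counted in code points for simplicity rather than in whatever unit a given filesystem uses.

```python
# Commonly documented reserved device names (assumed list, possibly incomplete).
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def windows_name_problems(name):
    """Return a list of reasons why `name` may be rejected on Windows."""
    problems = []
    # The part before the first dot decides whether a name is a device name.
    stem = name.split(".", 1)[0]
    if stem.upper() in RESERVED_DEVICE_NAMES:
        problems.append("reserved device name")
    if name.endswith((" ", ".")):
        problems.append("ends with a dot or space")
    if len(name) > 255:
        problems.append("longer than 255 characters")
    return problems

for candidate in ["nul", "com1.txt", "report.", "ok.txt"]:
    print(repr(candidate), windows_name_problems(candidate))
```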
We enforce unicode normalization; there's been some pain around that, for people having files with the "wrong" normalization for their OS. There we actually silently fix it, unless configured otherwise.
😱 I'd forgotten about the first of that, and didn't know about the second. We should probably handle that too (at least with a reject). Dammit, Windows... But @Ichimonji10 above summarizes quite nicely why we probably won't be doing character substitutions anytime soon. |
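For readers unfamiliar with the normalization issue mentioned above, here is a minimal sketch using Python's standard `unicodedata` module. The example string is an assumption chosen for illustration; it only shows why two visually identical names can be different files on a filesystem that does not normalize.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point (NFC form)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent (NFD form)

# The two strings render the same but are not equal as raw strings.
print(composed == decomposed)

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```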
Just to add all the information in one place: you have to handle case sensitivity too. Moreover, the problem is not only with Windows but also affects NTFS under other OSs. While Windows shares accessed from Linux return an error when you try to create an invalid filename, direct NTFS mounts do not return errors by default, and the files are accessible under Linux but not under Windows. So, you cannot rely on errors to know whether a filename is legal. |
This is maybe an even crazier idea, but how about simple escaping? On Windows, invalid characters in incoming file names can be substituted, |
We could do escaping, but it would have to be to something the user would not enter herself (to avoid confusion), and it would be ugly. I.e. The case sensitivity issue is an open bug somewhere else. And I don't think we should try to guess hidden rules when the OS and filesystem are fine with them... |
True, but that can be "solved" by escaping the escape character, so incoming There is also the thing that Cygwin does: they translate invalid characters to characters from the Unicode private use area. That may, in theory, create conflicts, but chances are really slim and it looks better. |
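To illustrate the case sensitivity point, the following sketch groups names that would collide on a case-insensitive filesystem. The helper name is hypothetical, and `str.casefold()` is only an approximation: the actual case-mapping rules of a given Windows or NTFS installation may differ.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that become identical when case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one member are real collisions.
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["README", "readme", "notes.txt"]))
```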
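The "escape the escape character" idea can be sketched as follows. The `%XX` format, the reserved set, and both function names are assumptions for illustration, not what anyone in the thread specified. Because the escape character itself is always escaped, the mapping is reversible and two different names never produce the same escaped name.

```python
RESERVED = set('<>:"/\\|?*')
ESCAPE = "%"

def escape_name(name):
    """Encode reserved characters, and the escape character itself, as %XX."""
    out = []
    for ch in name:
        if ch in RESERVED or ch == ESCAPE:
            out.append(f"{ESCAPE}{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)

def unescape_name(escaped):
    """Reverse escape_name; assumes the input was produced by escape_name."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == ESCAPE:
            out.append(chr(int(escaped[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return "".join(out)

for original in ["foo:bar", "foo>bar", "100%"]:
    encoded = escape_name(original)
    print(original, "->", encoded, "->", unescape_name(encoded))
```

The cost, as noted in the thread, is that escaped names are ugly and longer than the originals.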
To be honest, I think the much simpler solution is for people who sync across multiple OS:es to just stick to the lowest common denominator in file names or live with the errors. It's not that onerous, and people using more than one platform should be kind of aware of the issue. For cases where someone has just a Windows box and a NAS running Linux, they'll most likely create all their files from the Windows side and automatically stay within the limitations. (The case sensitivity thing still needs to be handled better though.) |
You're right, but there is at least one case in which I'd like a solution.
I use it to backup my Android photos. If I take a photo with Hangouts app,
the file gets saved with a `:nopm:` suffix.
|
There are two problems with this approach.
|
I don't see point number one related to syncthing; that'd be a problem for the poor user on Windows with someone forcing invalid filenames on him no matter the delivery mechanism? Point number two implies the user is also a Linux user, who I'm assuming are more aware about things like filesystems and name limitations? |
(This is all to say that I think this should be solved in a clean way, not that we shouldn't solve it at all. But something that is really ugly or has bad side effects is probably not worth it IMHO.) |
I can tell from experience that it doesn't work like that :D |
Android is Linux, but I don't think the majority of Android users even know
what a filesystem is.
|
Now that you mention it... I'm not saying anything bad about their skill level, but this will concern macOS users as well... |
I don't know what the larger Android population does. But I personally sync photos between my Android phone and PC, which means that I end up with file names like:
Also, that screenshot is terrifying. |
@Ichimonji10 I agree, but if you take a photo with Hangouts you also have something like:
|
I think it would be good if Syncthing could tackle the problem. It's the kind of problem that needs to be tackled when building a sync solution for everyone that is easy to use. Most people would not mind if ":" is converted to "_" and those that do could change the default setting, whereas not syncing a file at all has a bigger impact. Dropbox replaces trailing spaces in filenames on any platform, when first detected. If you create a file "test " with a trailing space on Mac, then as soon as Dropbox detects it (even if you aren't syncing with any other platforms), it will rename the file locally to "test" without the trailing space. I am working on a new kind of sync app for Ronomon, and recently worked on replacing reserved characters with underscores. These are the characters that would need to be replaced to support almost every platform:
These characters should be replaced with as many underscores. And then these device names also need to be modified slightly because AUX and AUX.txt are invalid on Windows but AUX_.txt is fine:
These reserved device names should be automatically appended with an underscore (e.g. "AUX_" or "aux_.txt"). Please let me know if something is left out here. If it would be helpful here, then these are some key ideas which might make the problem manageable and not so hard:
This should cover: exFAT (http://en.wikipedia.org/wiki/ExFAT) For non-case-preserving filesystems (FAT12, FAT16, FAT32), we may also need to replace "+,.;=[]!@" but I have not tested this yet. Hope this helps. |
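The device-name rule described in this comment ("AUX" becomes "AUX_", "aux.txt" becomes "aux_.txt") could be sketched like this. The list of device names is the commonly documented one and is an assumption here, since the list from the original comment is not reproduced above; the function name is hypothetical.

```python
# Commonly documented reserved device names (assumed, possibly incomplete).
DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def suffix_device_name(name):
    """Append "_" to the part before the first dot if it is a device name."""
    stem, dot, rest = name.partition(".")
    if stem.upper() in DEVICE_NAMES:
        return f"{stem}_{dot}{rest}"
    return name

for example in ["AUX", "aux.txt", "auxiliary.txt"]:
    print(example, "->", suffix_device_name(example))
```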
Well, none of these characters are a problem on decent filesystems used on Anyway, simply replacing these with '_' is a bad idea, because it causes Joran Dirk Greef wrote, on 7.5.2015 10:42:
|
I could possibly see a rename-on-sight as described above as an optional must-be-turned-on-manually feature. We currently do that silently for incorrect unicode normalization, and I could possibly see us doing it by default for trailing space (seldom intended I think, and apparently causes issues in cross compatibility), but the rest is not something we should do by default for sure. |
Dropbox has a webservice to check all your files for potential conflicts (https://www.dropbox.com/help/145 and https://www.dropbox.com/bad_files_check) and doesn't rename them by default. Maybe an empty "illegal filenames present, run the bad_filename_checker" file or something like that could be inserted instead of a file that cannot be created on a certain platform, so the file can be renamed properly. Alternatively name the file temporarily to its sha256 hash or whatever syncthing uses as UUID internally so the content can be synced and only the file name gets lost until it is properly named for your platform and can still be used to seed the contents. |
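A hash-based placeholder name, in the spirit of the suggestion above, might be sketched as follows. Note the variation: this sketch hashes the original name rather than the file content, so that two files with identical content still get distinct placeholders. The `.unsynced-name` suffix and the function name are hypothetical.

```python
import hashlib

def placeholder_name(original_name):
    """Derive a filesystem-safe placeholder from the original file name."""
    digest = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
    # A hex digest only contains 0-9 and a-f, which every platform accepts.
    return f"{digest}.unsynced-name"

print(placeholder_name("foo:bar"))
print(placeholder_name("foo>bar"))
```

The original name is not recoverable from the placeholder, so it would still have to be recorded somewhere for the file to be renamed back.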
This is not a simple problem. Normally "network path names" consist of a list of directory names and a file name. Each of these components can be any string of eight-bit bytes except for the NUL byte which is used to terminate each component. The directory separator can be different on every host. But it is very unusual nowadays for the The components can (in theory) be any string including Windows has a very long list of problematic names; it would be very insensitive to export these limitations to other operating systems, especially as it includes case insensitivity. In addition many of these limitations become security issues, for example translating overlong UTF-8 sequences onto a host that uses UTF-16 can manufacture NUL or It would be nice if a unique "network path name" could be mapped to a unique local path name; this is mostly possible on Unix as it'd be simple enough to add a "someone tried to be evil" prefix to the unwanted file names. But the dumb "sort of case sensitive only not" behaviour of Windows defeats this. Then, the case insensitivity of Windows is even worse than you think. It isn't the same everywhere. It depends on the localisation that Windows is running under and the version of Windows that you're running (eg: Unicode characters that don't exist on an earlier version of Windows can't be "case smashed" but can be later). Obviously, it's impossible to tell what workarounds may be needed on another peer, so all a particular node can do is look out for itself. The security impacts must be addressed, but it's probably impossible to fix case insensitivity. With luck you'd be able to manufacture a collision, make believe that the file has just been created on the problematic machine and force some sort of rename across the entire "swarm". Making it a unique rename will be the trick. |
I personally don't find very attractive the idea to apply Windows' puzzling naming constraints on the other OSes. What if a user is syncing files for an application that relies on file names that would be illegal in Windows? And I don't think we can design an encoding that would be bi-directional (allowing to restore the original name from the encoded one) without inserting a fair amount of confusion for the end user. We may as well let the Windows nodes deal with the Windows problem, and avoid spreading it across other OSes. The Windows nodes should be able to keep track of global file names that are locally illegal. The files with illegal global names could be renamed locally, inserting a distinctive marker (like "!syncname" or whatever) in the name and replacing/removing any illegal characters or sequences (of course it would have to store the mapping of the local and global names somewhere). At least the files would actually be synced, the user would know there is a naming problem, and might have a cue of the original file name. When encountering such a file, the Windows node could simply look up the global name before proceeding with any network task. It could also keep track of file renaming, only changing the remote file name when the user manually removes the marker from the local name -- making it a valid file name for any OS anyway. As for storage, instead of storing the local-global name mappings in the Syncthing database, why not simply store them in a local file? We already deal with ".stignore" and ".stfolder" at the root of a shared folder (we could even use the latter?). The mapping could as well be stored in a sibling system file, so the global name follows when the user moves folders around. Windows users are kinda used to seeing their folders cluttered with hidden system files such as "Desktop.ini" and "thumbs.db", so I don't see this approach as a big turn-off for them. And it would not affect in any way how the other nodes work. 
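The overlong-sequence concern can be shown with a small sketch. It relies only on Python's built-in strict UTF-8 decoder, which rejects overlong encodings; the byte string below is the classic overlong two-byte form of NUL and is used purely as an illustration.

```python
# 0xC0 0x80 is an overlong (and therefore invalid) encoding of U+0000 (NUL).
overlong_nul = b"\xc0\x80"

try:
    overlong_nul.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    # A strict decoder refuses the sequence instead of producing a NUL.
    rejected = True

print("overlong NUL rejected:", rejected)
print("plain NUL decodes to:", repr(b"\x00".decode("utf-8")))
```

A lenient decoder that accepted such bytes would let a name pass a byte-level check for NUL and still yield one after conversion, which is the kind of security issue the comment warns about.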
The Windows problem would remain a Windows problem. |
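The local mapping file proposed above could be sketched roughly like this. The sidecar file name `.stnamemap`, the JSON format, and the function name are hypothetical; only the "!syncname" marker comes from the comment. The marker is simply appended to the end of the name here, which a real design would probably handle more carefully around file extensions.

```python
import json
from pathlib import Path

MAP_FILE = ".stnamemap"   # hypothetical sidecar file, not a real Syncthing file
MARKER = "!syncname"      # marker suggested in the comment above
ILLEGAL = set('<>:"/\\|?*')

def local_name_for(global_name, folder):
    """Return a locally legal name for global_name, recording the mapping."""
    map_path = Path(folder) / MAP_FILE
    mapping = json.loads(map_path.read_text()) if map_path.exists() else {}

    # Reuse an existing mapping so the same global name is stable locally.
    for local, known_global in mapping.items():
        if known_global == global_name:
            return local

    base = "".join("_" if ch in ILLEGAL else ch for ch in global_name)
    candidate = f"{base}{MARKER}"
    counter = 1
    # Add a counter when another global name already claimed this local name.
    while candidate in mapping:
        candidate = f"{base}{MARKER}{counter}"
        counter += 1

    mapping[candidate] = global_name
    map_path.write_text(json.dumps(mapping, indent=2))
    return candidate
```

Called in turn for "foo:bar" and "foo>bar" in the same folder, this yields two different local names while the global names stay untouched on the other nodes.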
Some of these are actually not just Windows-specific, i.e. leading dash in filenames on Linux and Mac which are technically allowed but a security vulnerability. Replacing characters only on Windows using a mapping would be great, but there is no canonical mapping for this kind of thing (e.g. such as the canonical mapping that Unicode has in NFC or NFD) that users could use to know what Syncthing is doing, so it would break any cross-platform file tree comparison that any applications try and do and lead to data loss (e.g. a program running on Windows tries to access the same file on a Linux server and finds it missing). |
The files should be transferred in some way or other. Put them in a "holding area" where they can be renamed or a special folder for sync conflicts, etc. but they need to be transferred. |
IIRC, leading space(s) is forbidden in windows too (ntfs yes, others *fat* FS I don't know). |
Leading spaces are NOT forbidden in Windows.
Windows explorer will, however, be rather confused by them and does try (unsuccessfully) to forbid them. Interestingly,
This test was done on NTFS. The "vfat" addition to the old 8+3 DOS filesystems can probably do the same, but I haven't tested. |
You're right. I had "forbidden" in mind instead of "broken" because of this mismatched behaviour between Explorer and cmd. I remember a weird HP MFC application called something like PrintToWeb that ran as a service creating 2 folders in Windows users' profiles that broke NTBackup or ArcServe backups because of these leading or trailing spaces. I couldn't delete either of them, as the service recreated the other one immediately, and deleting both in a single shot failed with an error. I was a newbie and it took me a while to understand I had to disable the service. It was on NT4, and maybe I also had to use (nt)fsutil to delete the one that started/ended with a space. A damned chimera, when these two MS/HP monsters fertilise each other to the users' death! |
Maybe as a solution: set up a translation directory for each unsupported character and replace it with a UTF-8 symbol like the Roman numerals: |
What happens when a directory with the translated name already exists? Does the translation not happen, because the translation directories can't be created? Does a translation directory get created, thus clobbering the user's files? What happens if the translation directory gets lost? Let's make sure to not exchange one problem (can't sync certain weirdly named files) for a bunch of new code and a much more subtle set of behaviours and problems. |
How many times have you ever seen a filename with a UTF-8 symbol in it? How many times have you ever seen a file containing the symbol "Ⅷ", and how many times do you think you will find one filename with a : and another with an Ⅷ as the only difference? |
I think it's a terrible idea. |
@Mannshoch There are three reasons you never see filenames with Unicode symbols in them (as opposed to Unicode characters that are based on actual linguistic graphemes):
Beyond all three of those issues (the first and third are direct problems for Syncthing, the second is an indirect issue), you also have to consider that this absolutely needs to be reversible. In other words, it has to be a 1:1 mapping, which isn't possible without running the risk of a name collision and thus potentially losing data. If we are going to handle losing data, it's better to just use �, which is the standard, accepted, replacement character, and track the proper names in the index. That still doesn't properly solve any of the above listed issues though. |
I don't understand why you have no problem renaming files with "sync conflict" appended to the filename, but won't consider renaming files when there's a system filename incompatibility. You can escape the bad characters, put them in a special folder, etc. but don't just leave them unsynced and inaccessible |
No, you cannot assume that either of these are possible. There are many use cases for syncthing.
That would be great! |
Understanding the problems with the currently proposed solutions is necessary for making a good faith argument.
Can this actually be done? How would the problem listed in #1734 (comment) be solved?
Moving a file with weird characters in its name doesn't solve the problem of that file having weird characters in its name. Moving the file now means that some number of hosts need to track file name mappings in a database, and it opens up additional possibilities for naming conflicts, both of which are cans of bugs. There was one problem. Now there are three or more. |
@endolith Your argument goes both ways. It's just as easy to argue that you shouldn't be creating files with names that have bad characters in them. Of the above listed problematic characters, the only ones that make any sense to realistically have in a file name that isn't liable to confuse people are |
First, it's not just Basically, the only reasonably reliable solution is to create a virtual filesystem layer on the client. Filenames that are white-listed are allowed to map directly to real filenames; everything else gets mapped to an '_' replacement or a random name, depending on whether there is a collision in some sort of name remapping database. The mangled names never leave the machine. You can then implement or configure various levels of paranoia, all the way from "just let the files collide" to "only allow lowercase, numbers and one dot for an extension for no more than thirty characters". But now I'm probably repeating myself ... again. |
Still a problem on android |
You can count on handling Unicode. I have no problem with it on Windows 7 and 8.1 and also no problem with it on Linux. Maybe you could add a Unicode check and print an error if nextcloud is started on a non-Unicode system.
I think that could be ignored.
OK. My choice of Unicode symbol was not that good. See below for a new idea.
If you like clocks, use them; with them you could also create an index which makes it reversible: |
Someone go tell the Apple messaging team that they've been one-upped. That bug merely crashed phones. This bug causes data loss. |
No, you can't. Not 100% correctly at least. Note that I'm not saying 'has to handle Unicode characters in filenames', I'm saying 'has to follow consistent predictable rules regarding handling of Unicode characters'. Case in point: mapping of Unicode code points to byte-wise length of filenames is drastically different on Linux, Windows, and macOS, and is functionally unpredictable on Linux and Windows (because the filesystem may not support the chosen characters, or may use an encoding which represents them in a way that takes up more or less space than the system encoding would).
I didn't intend for this point to be an argument against Syncthing handling any of this, just to point out one of the reasons why Unicode symbols rarely appear in filenames (and do note that I'm differentiating symbols from actual graphemes, so I'm talking about stuff like the range between U+1F300 and U+1F5FF, but not things in U+20000 to U+2FFFF). Syncthing can functionally ignore this, but it's worth remembering because it impacts the likelihood of such files existing.
Again, that has support issues. You're just changing what symbol is being used. If the font doesn't support this, you will at best see the numeric codepoint in hex inside a box (if it's a sanely designed font that does this, which is unfortunately rare), and at worst will get a ▯ (U+25AF, White Vertical Rectangle, used as the standard character on Windows for code points that don't have a character defined in the current font). Note also that � is not a question mark, it just happens that most fonts have standardized on rendering it as a white question mark in a black diamond. It's codepoint U+FFFD, the officially defined replacement character, which is supposed to be used in cases like this. |
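The point about filename length depending on the encoding can be illustrated with a short sketch. The sample name is an assumption chosen to mix ASCII, accented letters, and a symbol from the U+1F300 to U+1F5FF range; which unit a given filesystem actually counts is outside what this snippet shows.

```python
# "résumé-" followed by U+1F4C4 (a symbol in the U+1F300..U+1F5FF range) and ".txt"
name = "r\u00e9sum\u00e9-\U0001F4C4.txt"

code_points = len(name)
utf8_bytes = len(name.encode("utf-8"))
utf16_units = len(name.encode("utf-16-le")) // 2

# The same name has three different "lengths" depending on what is counted.
print("code points:", code_points)
print("UTF-8 bytes:", utf8_bytes)
print("UTF-16 code units:", utf16_units)
```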
Sorry, but this has turned into a nonsense discussion. Nobody is going to rely on Unicode to fix this problem. We can use a URL-like escape sequence or something along those lines that can be converted in-line. Sure, we can blow past file length limits, but who cares at that point. |
File length limits shouldn't matter at all (I seriously doubt that people are using files with names close enough to the typical 255 character limit for this to matter). Path length limits on Windows may matter though (240 characters sounds like a lot, but people hit that regularly enough with older apps that you see pretty regular questions about it). |
Should this thread be closed, with a link to a new issue called "Write a Detailed Specification for How to Reliably Translate and Sync Filenames that are Illegal on Some Platforms"? We can at least sketch out what that spec has to contain before the feature request can be taken seriously. Something like:
Section 1. Tables of O/S & filesystem Rules
Section 2. An Encoding/Decoding Algorithm pair
Section 3. Description of one or more Rulesets for when Encoding will be applied
Section 4. Statement of Rules for when Decoding will be applied
Section 5. A proof that integrity can be maintained, and under what conditions
Section 6. Proposal for the UI for users to choose their options |
No it doesn't. It's totally unreasonable to expect people to go through thousands of filenames and sanitize them before syncing them. It's irrelevant which characters are in filenames, the file should be transferred regardless. (For example, I've used Linux software that makes filenames with timestamps like If you put files into a sync folder, expect them to be at the other end when you arrive, and they aren't, and there's nothing you can do from that end to make them sync, that's a major bug. It needs to put them in some kind of BAD_FILENAMES folder, mangle their names in a non-colliding way, or whatever, but it still needs to transfer the files somehow, so you can rename them from the problematic end and put them back in their normal place, or work with them locally and have them remain in the correct name at the original end. |
I am locking this, this is just bikeshedding at this point. |
I have started drafting a proposed solution at #9539. Feedback is appreciated. |
See syncthing#9539 for more details.
As mentioned in syncthing/syncthing-android#192 , some filenames are not accepted by windows hosts because they contain 'special characters' like colons or bars.
@schuft69 suggested to add a "Rename special characters to '_'" option to resolve this issue.