new feature: VFS auto normalization #7072
Comments
I think this should be achievable relatively easily. All file / dir objects are ultimately made from the directory cache, so we only need to do the normalized comparison in the directory cache searching routine. We actually do this already for case sensitive / insensitive file systems: first we look for an exact match, which is a quick hash lookup, and then we look for a case insensitive match, which involves comparing every item (lines 710 to 732 in 2c50f26).
We could do something very similar for the normalization options too. I think we'd want to make a normalization function which does the (NFC, NFD) and/or (lower case) and just run through the loop once. We do exactly this in the sync routines so factoring this out would be sensible. Lines 58 to 71 in 2c50f26
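As an illustration of the lookup described above (exact match first, then a single pass through the directory with one normalizing function), here is a minimal Python sketch. It is not rclone's Go code: `normalize_name`, `find_entry`, and the dict-based directory cache are hypothetical names chosen for this example, and only NFC plus lower-casing is assumed.

```python
import unicodedata


def normalize_name(name: str, nfc: bool = True, fold_case: bool = True) -> str:
    """Hypothetical helper: reduce a file name to its comparison key."""
    if nfc:
        name = unicodedata.normalize("NFC", name)
    if fold_case:
        name = name.lower()
    return name


def find_entry(entries: dict, leaf: str, nfc: bool = True, fold_case: bool = True):
    """Look up leaf in a directory cache modelled as {name: item}."""
    # Fast path: exact match is a single hash lookup.
    if leaf in entries:
        return entries[leaf]
    # Slow path: compare the normalized form of every item, one pass only.
    want = normalize_name(leaf, nfc, fold_case)
    for name, item in entries.items():
        if normalize_name(name, nfc, fold_case) == want:
            return item  # first match wins
    return None
```

With a cache holding the NFC name `"caf\u00e9.txt"`, a request for the NFD spelling `"cafe\u0301.txt"` misses the fast path and is resolved by the slow path.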
We currently just use the first one in the case vs CASE comparison. That is the easy thing to do. To detect a collision would require scanning the whole directory which would take on average twice as long. Worth the extra CPU for an ERROR message?
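The tradeoff can be made concrete with a small Python sketch (again an illustration with hypothetical names, not rclone's implementation). `find_first` stops at the first normalized match, while `find_unique` has to look at every name in the directory before it can report a collision.

```python
import unicodedata


class AmbiguousNameError(Exception):
    """Raised when more than one name matches after normalization."""


def _key(name: str) -> str:
    # Assumed comparison key: NFC form, lower-cased.
    return unicodedata.normalize("NFC", name).lower()


def find_first(names, leaf):
    """Cheap variant: return the first match and stop scanning."""
    want = _key(leaf)
    for name in names:
        if _key(name) == want:
            return name
    return None


def find_unique(names, leaf):
    """Strict variant: scan the whole directory so collisions are detected."""
    want = _key(leaf)
    matches = [name for name in names if _key(name) == want]
    if len(matches) > 1:
        raise AmbiguousNameError(f"{leaf!r} matches {len(matches)} names")
    return matches[0] if matches else None
```

When a match exists, `find_first` examines about half the directory on average, whereas `find_unique` always examines all of it, which is where the "twice as long" estimate comes from.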
Assuming we are only using the normalize option to find matching file names and not actually returning them to the OS then we can get away with using NFC only and this would match this flag which only does NFC normalization.
One thing I am unclear on though - do you think we need to apply that normalization to the file names sent to the OS - is that your intention? So we receive an NFC file name from Google Drive (say) and send the macOS kernel an NFD file name according to the flag? That will be a bit more work (but not much). I think this might be what the FUSE-T option |
Yes, this is actually the most important thing and the whole purpose of this request: sending normalized names to the OS. The problem today is that remotes can contain both NFC and NFD names - this works on Linux but breaks on macOS.
This option converts names received from the OS (macOS uses NFD) and makes sure that the NFS mount receives NFC - so all NFS content is NFC. It also makes NFC content work with macOS APIs. On macOS it happens for example when user uses With
REMOTE -----------------------> MOUNT MOUNT -----------------------> REMOTE
REMOTE -----------------------> MOUNT MOUNT -----------------------> REMOTE It is difficult to say which one will behave better on macOS - I think that both would work. But with NFC/NFD and macOS I have learnt that only testing can tell. |
The key purpose of this enhancement is to fix Today it simply does not work - does not matter with macFUSE or FUSE-T. And is unfixable by |
I think what I'll do is try the smallest patch that could work first. That would fix the "... can't be found" errors I think but it wouldn't normalize the returns to the OS. This might mean we need the |
Cool. I am happy at all stages to test - I do not need binary build I can happily build myself from I hoped that fuse-t can handle it fully but does not seem to be possible. |
Note: With Lines 120 to 126 in fcb912a
It is a crude way to try to normalize to NFD, but it stopped working fully with the advent of APFS a few years ago and the changes in normalization across macOS. I also suspect that macFUSE does not handle it 100% correctly. I actually suspect it never worked 100% before either, but nowadays it is more apparent. It should still be there for users of old Apple computers - this is why |
Sorry to keep pestering... but can it be implemented one day? I am not skilled enough in go to give it a go. I do try to earn my karma points on the forum instead:) I do believe that it is missing piece of macOS I managed to convince author of |
You have lots of karma indeed :-) I put this on the 1.64 milestone which isn't a guarantee that it will make that release, but it will be on my radar at least! |
Just dropping in to say that I would be extremely thankful for having this fixed at some point. It would save a lot of trouble for me and many users in Slovakia, the Czech Republic and other countries where basically all words are accented, which makes using |
I took a shot at fixing this: nielash@af6c376 It is the "smallest patch that could work" approach as ncw suggested. So, it does not add a new
I attempted to split the difference on this by adding a new flag for this (disabled by default), On a slightly-related note, it seems to me that there is something odd about how |
This is a big one for us macOS users:) As soon as I have a moment I will start testing it. |
OK, I could not wait - I am very excited about this chance to fix this long-standing issue :) - so I quickly tried it with the test data (mixed NFC/NFD usage) I used in the past. And hurrah! It works, both in Finder and in the terminal, using a FUSE-T mount. I have not looked at "duplicates" yet. I am a bit swamped at the moment so this was only a quick try - I will take it for a proper ride later. How exactly does this patch work? Sorry, I am not a programmer and struggle to understand exactly how it is achieved. I see that it ditches |
That is encouraging to hear! 😃
Basically it takes the existing So the result is that if rclone gives mac an NFC file but mac converts it to NFD, rclone will not get confused and will be able to figure out which file mac is talking about. The most important part is here. And this calls Lines 381 to 390 in 91b54aa
This uses the |
Thank you for this explanation. When I digest it I will try to break it with my testing:) And of course I hope I will fail. |
Maybe you could land both normalisation and nfsmount fixes at the same time? At last we could have FUSE free mount life on macOS:) |
That's the dream! 😆 Aside from #7503 (comment) which I posted a fix for, are there other |
I have been using your fix few times and it serves its purpose - writing to NFS mount is possible. One issue I noticed was |
Ok. That sounds fairly easy, I'll try to take a look. |
Using below test set,
I can confirm that it works exactly like described including And first time ever all files and directories in this set I am impressed by simplicity of this solution - it gives me confidence that it actually works as intended. |
I think it is a perfectly acceptable approach. It should definitely be optional, as normalisation cases are IMO very rare - it actually takes some effort to even create them :) But they are possible, so for people who might have this issue this flag is a solution. |
There is one more problem with |
Thanks, I'll take a look. |
One other thought just occurred to me -- I wonder if we should just return here, at least when not using |
I think yes - it would be even safer. Given that without |
And one thought in my head too... Could we add this logic to rclone dedupe? so I could validate my remote not only for exact duplicates but also for potential ones based on case ( Obviously this is not something critical but rather nice to have functionality. |
I think it would actually be a little less safe... as the purpose of looking for a second match is so that it can return an error if it finds one. But the question is... is that error actually useful, and if so, is it worth the CPU tradeoff?
That's an interesting idea! I bet that would be possible. One other thought I had along these lines is whether something like an |
Error is only useful when using |
BIG yes - another nice to have feature:) I use
I tend to think that the type of user that would care about this error would be using |
Any rough idea what the performance impact is? Maybe we are worrying about a non-existent problem, or one that is negligible in real life. It is not that IMO in case of |
I suppose we could run some tests and find out. It would depend a lot on how many files you have -- with a small number you probably wouldn't notice it, but with a large number it could become noticeable. I worked on a unicode normalization issue for One thing that helps in this case though is that we only find ourselves in this loop if we already failed to find an exact match. For most users, this should be pretty rare -- it basically requires the sort of scenario like macOS where the OS is actively auto-converting filenames as opposed to merely enforcing case-insensitive uniqueness. And even then, we're just talking about the small subset of files that have special characters and came from some non-mac source. |
Come to think of it, this is another good use case for |
Yes this is what I do with my rclone remotes data - at least for now before I store it in the cloud I convert all to NFD - so I can live with crippled mount as it is today. Of course there are multiplatform scenarios etc. But And most people are not stupid - so if everything is properly documented people will do what is needed to make sure their macOS mount works fast. "My macOS mount is slow"... simple - run |
@nielash sorry for pestering:) do you think your improvements (normalisation and nfsmount writes on macOS) can be PR-ed?:) These are fixes I was waiting for years so now trying to make sure that they are not lost:) I have been using them the last few days and this is day and night compared to what it was before in terms of macOS I bet some bits can be still ironed but the best way is to get exposure to myriad of use cases we can't even imagine - simply make it live. |
Very glad to hear that 🙂 Yes, I can submit PRs. I was planning to also look at some of the other
🤣 |
IMO normalisation is fully working and is ready to be included in an rclone release. I even took the time to understand your code; it looks very clean and logical. As for nfsmount - I do appreciate good engineering, so I have to say that it is indeed maybe better to solve the known issues and make it a real feature, not just some half-baked thing. Most importantly, normalisation works perfectly with macFUSE/FUSE-T, so it immediately solves the old issue. nfsmount is a very nice addition but not critical. |
Before this change, the VFS layer did not properly handle unicode normalization, which caused problems particularly for users of macOS. While attempts were made to handle it with various `-o modules=iconv` combinations, this was an imperfect solution, as no one combination allowed both NFC and NFD content to simultaneously be both visible and editable via Finder. After this change, the VFS supports `--no-unicode-normalization` (default `false`) via the existing `--vfs-case-insensitive` logic, which is extended to apply to both case insensitivity and unicode normalization form. This change also adds an additional flag, `--vfs-block-norm-dupes`, to address a probably rare but potentially possible scenario where a directory contains multiple duplicate filenames after applying case and unicode normalization settings. In such a scenario, this flag (disabled by default) hides the duplicates. This comes with a performance tradeoff, as rclone will have to scan the entire directory for duplicates when listing a directory. For this reason, it is recommended to leave this disabled if not needed. However, macOS users may wish to consider using it, as otherwise, if a remote directory contains both NFC and NFD versions of the same filename, an odd situation will occur: both versions of the file will be visible in the mount, and both will appear to be editable, however, editing either version will actually result in only the NFD version getting edited under the hood. `--vfs-block-norm-dupes` prevents this confusion by detecting this scenario, hiding the duplicates, and logging an error, similar to how this is handled in `rclone sync`.
@kapitainsky I made a proof-of-concept for a Examples of what it does: Link |
I have managed to compile it and run few simple tests. All worked! I think that it is very cool small utility. When needed it is purely irreplaceable. Things like this make rclone truly pro utility. Thank you very much for doing it. |
The immediate goal of this enhancement is to fix macOS `rclone mount`, which at the moment does not work properly regardless of the mounting method used - macFUSE or FUSE-T. An optionally normalized VFS can bring other benefits as well, e.g. for some less common or future filesystems that rclone users may work with and the normalization constraints they impose.

The background issue and possible solutions have been discussed on the forum:
https://forum.rclone.org/t/possible-special-character-encoding-issue-on-macos/39048/30
At the moment many cloud storage destinations (and the rclone virtual crypt remote) allow storing both NFC and NFD encoded file and directory names. rclone also allows uploading and downloading both forms.
This generally works trouble-free on Linux and Windows but breaks rclone mount functionality on macOS: depending on the mount options, either only NFC or only NFD content is visible/accessible. This is an unfortunate consequence of Apple's design decision to differ from the rest of the world and use NFD exclusively in their APIs. We won't change it.
Experimentation and testing, including attempts to improve the situation with FUSE-T (which since v1.0.20 has an `-o nfc` option that should help with NFC paths), lead me to believe that the last missing piece of the puzzle is the lack of normalization in the rclone VFS.

I would like to propose adding optional normalization to the VFS.
New option:
--vfs-normalize OFF/NFC/NFD - with OFF being default
When set to NFC or NFD it would apply character normalization to the served content.
An edge case to watch is the situation where two names differ only by normalization and coexist in the same folder, e.g.:
one file is NFC and the other is NFD. The only solution I see for such a case is to use only one name and log an ERROR.
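To make this edge case concrete, here is a small Python sketch (illustrative only; the file name `café.txt` and the helper `differ_only_by_normalization` are made up for the example). It builds the NFC and NFD spellings of one name and shows that they are different strings that collapse to the same name once normalized.

```python
import unicodedata

# The same visible name in two Unicode normalization forms.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")  # precomposed U+00E9
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")  # "e" + combining U+0301


def differ_only_by_normalization(a: str, b: str) -> bool:
    """True when a and b are distinct strings with the same NFC form."""
    return a != b and unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both names can coexist on a remote that compares names byte for byte,
# but they are indistinguishable once a single form is enforced.
print(len(nfc_name), len(nfd_name))  # 8 code points vs 9 code points
print(differ_only_by_normalization(nfc_name, nfd_name))
```

A remote that stores names byte for byte will happily hold both entries in one folder; any layer that enforces a single form has to pick one of them.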