new feature: VFS auto normalization #7072

Closed
kapitainsky opened this issue Jun 21, 2023 · 38 comments · Fixed by #7620

Comments

@kapitainsky
Contributor

kapitainsky commented Jun 21, 2023

The immediate goal of this enhancement is to fix rclone mount on macOS, which at the moment does not work properly regardless of the mounting method used (macFUSE or FUSE-T). An optionally normalized VFS could bring other benefits as well, e.g. for less common or future filesystems that impose their own normalization constraints.

The background issue and possible solutions have been discussed on the forum:

https://forum.rclone.org/t/possible-special-character-encoding-issue-on-macos/39048/30

At the moment many cloud storage destinations (and rclone's virtual crypt remote) allow storing both NFC- and NFD-encoded file and directory names. rclone also allows uploading and downloading both forms.

This generally works trouble-free on Linux and Windows but breaks rclone mount on macOS: depending on mount options, either only NFC or only NFD content is visible and accessible. This is an unfortunate consequence of Apple's design decision to differ from the rest of the world and use NFD exclusively in their APIs. We won't change it.

Experimentation and testing, including attempts to improve the situation with FUSE-T (since v1.0.20 it has an -o nfc option which should help with NFC paths), lead me to believe that the last missing piece of the puzzle is the lack of normalization in rclone's VFS.

I would like to propose adding optional normalization to the VFS.

New option:

--vfs-normalize OFF/NFC/NFD - with OFF being default

When set to NFC or NFD it would apply character normalization to the served content.

Edge case to watch is situation where two names differ only by normalization and coexist in the same folder e.g.:

$ rclone lsf crypt:TEST

CONFLICTéééFILE.txt
CONFLICTéééFILE.txt

One file is NFC and the other is NFD. The only solution I see for such a case is to use only one name and log an ERROR.
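To make the difference concrete, here is a minimal Go sketch (not taken from rclone) that prints the UTF-8 bytes of "é" in both forms. The helper name `hexBytes` is made up for illustration; only the standard library is used.

```go
package main

import "fmt"

// hexBytes renders the UTF-8 bytes of s as space-separated hex.
// Hypothetical helper, for illustration only.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	nfc := "\u00e9"  // precomposed U+00E9
	nfd := "e\u0301" // "e" followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println("NFC:", hexBytes(nfc))
	fmt.Println("NFD:", hexBytes(nfd))
	fmt.Println("byte-equal:", nfc == nfd)
}
```

Both strings render identically on screen, yet they are different byte sequences, which is why a plain map lookup or string comparison treats them as two distinct names.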

@ncw ncw added this to the v1.64 milestone Jun 23, 2023
@ncw
Member

ncw commented Jun 23, 2023

I think this should be achievable relatively easily.

All file / dir objects are ultimately made from the directory cache. So we only need to do the normalized comparison in the directory cache searching routine.

We actually do this already for case sensitive / insensitive file systems.

So first we look for an exact match which is a quick hash look up, and then we look for a case insensitive match which involves comparing every item.

rclone/vfs/dir.go

Lines 710 to 732 in 2c50f26

func (d *Dir) stat(leaf string) (Node, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	err := d._readDir()
	if err != nil {
		return nil, err
	}
	item, ok := d.items[leaf]
	if !ok && d.vfs.Opt.CaseInsensitive {
		leafLower := strings.ToLower(leaf)
		for name, node := range d.items {
			if strings.ToLower(name) == leafLower {
				if ok {
					// duplicate case insensitive match is an error
					return nil, fmt.Errorf("duplicate filename %q detected with --vfs-case-insensitive set", leaf)
				}
				// found a case insensitive match
				ok = true
				item = node
			}
		}
	}

We could do something very similar for the normalization options too. I think we'd want to make a normalization function which does the (NFC, NFD) and/or (lower case) and just run through the loop once. We do exactly this in the sync routines so factoring this out would be sensible.

rclone/fs/march/march.go

Lines 58 to 71 in 2c50f26

	// Now create the matching transform
	// ..normalise the UTF8 first
	if !m.NoUnicodeNormalization {
		m.transforms = append(m.transforms, norm.NFC.String)
	}
	// ..if destination is caseInsensitive then make it lower case
	// case Insensitive  | src | dst | lower case compare |
	//                   | No  | No  | No                 |
	//                   | Yes | No  | No                 |
	//                   | No  | Yes | Yes                |
	//                   | Yes | Yes | Yes                |
	if m.Fdst.Features().CaseInsensitive || ci.IgnoreCaseSync {
		m.transforms = append(m.transforms, strings.ToLower)
	}

Edge case to watch is situation where two names differ only by normalization and coexist in the same folder

We currently just use the first one in the case vs CASE comparison. That is the easy thing to do. To detect a collision would require scanning the whole directory which would take on average twice as long.

Worth the extra CPU for an ERROR message?

New option:

--vfs-normalize OFF/NFC/NFD - with OFF being default

Assuming we are only using the normalize option to find matching file names and not actually returning them to the OS then we can get away with using NFC only and this would match this flag which only does NFC normalization.

  --no-unicode-normalization   Don't normalize unicode characters in filenames

One thing I am unclear on though - do you think we need to apply that normalization to the file names sent to the OS - is that your intention? So we receive an NFC file name from Google Drive (say) and send the macOS kernel an NFD file name according to the flag? That will be a bit more work (but not much). I think this might be what the FUSE-T option -o nfc does though so maybe isn't necessary?

@kapitainsky
Contributor Author

kapitainsky commented Jun 23, 2023

One thing I am unclear on though - do you think we need to apply that normalization to the file names sent to the OS - is that your intention?

Yes, this is actually the most important thing and the whole purpose of this: it is about sending normalized names to the OS. Today we have a problem that remotes can contain both NFC and NFD names. That works on Linux but breaks macOS.

I think this might be what the FUSE-T option -o nfc does though so maybe isn't necessary?

This option converts names received from the OS (macOS uses NFD) and makes sure that the NFS mount receives NFC, so all NFS content is NFC. It also makes NFC content work with macOS APIs.
It breaks when some content in the remote is already NFD: files are not visible and directories are not browsable.

On macOS this happens, for example, when a user runs rclone copy/sync: NFD names are saved in the remote.

With --vfs-normalize to make macOS mount work we could either:

  1. --vfs-normalize NFD - this is what macOS likes the most I think

REMOTE -----------------------> MOUNT
NFC -> RCLONE -> NFD -> FUSE -> NFD <---- rclone normalized NFC->NFD
NFD -> RCLONE -> NFD -> FUSE -> NFD

MOUNT -----------------------> REMOTE
NFC -> FUSE -> NFC -> RCLONE -> NFC
NFD -> FUSE -> NFD -> RCLONE -> NFD

  2. --vfs-normalize NFC and use -o nfc

REMOTE -----------------------> MOUNT
NFC -> RCLONE -> NFC -> FUSE -> NFD
NFD -> RCLONE -> NFC -> FUSE -> NFD <---- rclone normalized NFD->NFC

MOUNT -----------------------> REMOTE
NFC -> FUSE -> NFC -> RCLONE -> NFC
NFD -> FUSE -> NFC -> RCLONE -> NFC

It is difficult to say which one will behave better on macOS - I think that both would work. But with NFC/NFD and macOS I have learnt that only testing can tell.

@kapitainsky
Contributor Author

kapitainsky commented Jun 23, 2023

The key purpose of this enhancement is to fix rclone mount on macOS.

Today it simply does not work, whether with macFUSE or FUSE-T. And it is unfixable by the -o modules=iconv,from_code=UTF-8,to_code=UTF-8 or -o modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC workarounds. They only help in some cases and create new issues, for example files that cannot be opened in Finder (the macOS file browser).

image

@ncw
Member

ncw commented Jun 23, 2023

I think what I'll do is try the smallest patch that could work first. That would fix the "... can't be found" errors I think but it wouldn't normalize the returns to the OS. This might mean we need the -o nfc flag. If that doesn't work I can do the further parts..

@kapitainsky
Contributor Author

kapitainsky commented Jun 23, 2023

Cool. I am happy at all stages to test - I do not need binary build I can happily build myself from git clone. I have my test bed - crypt with carefully crafted mix of NFC/NFD files and folders:) so far it breaks everything.

I hoped that fuse-t can handle it fully but does not seem to be possible.

@kapitainsky
Contributor Author

kapitainsky commented Jun 23, 2023

Note:

With --vfs-normalize set to NFC or NFD iconv should not be applied:

rclone/cmd/cmount/mount.go

Lines 120 to 126 in fcb912a

	if runtime.GOOS == "darwin" {
		if !findOption("modules=iconv", options) {
			iconv := "modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC"
			options = append(options, "-o", iconv)
			fs.Debugf(nil, "Adding \"-o %s\" for macOS", iconv)
		}
	}

It is a crude way of trying to normalize to NFD, but it stopped working fully with the advent of APFS a few years ago and the accompanying changes in normalization across macOS. I also suspect macFUSE does not handle it 100% correctly. I actually suspect it never worked 100% before either, but nowadays it is more apparent.

It should still be there for users of old Apple computers, which is why --vfs-normalize should be optional.
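One way the suggestion above might look in code is sketched below. This is an assumption-laden illustration, not the rclone implementation: `hasOption` and `macOptions` are hypothetical helpers (the former loosely modelled on `findOption` from the excerpt), and `vfsNormalizes` stands for "the proposed VFS normalization is switched on".

```go
package main

import (
	"fmt"
	"strings"
)

// hasOption reports whether any mount option contains substr.
// Hypothetical helper, loosely modelled on findOption in the excerpt.
func hasOption(substr string, options []string) bool {
	for _, o := range options {
		if strings.Contains(o, substr) {
			return true
		}
	}
	return false
}

// macOptions adds the iconv module only on macOS, only when the user has
// not set one already, and only when the VFS is not normalizing names itself.
func macOptions(goos string, vfsNormalizes bool, options []string) []string {
	if goos != "darwin" || vfsNormalizes || hasOption("modules=iconv", options) {
		return options
	}
	return append(options, "-o", "modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC")
}

func main() {
	fmt.Println(macOptions("darwin", false, nil)) // legacy behaviour: iconv added
	fmt.Println(macOptions("darwin", true, nil))  // VFS normalizes: iconv skipped
}
```

Keeping the legacy branch means users on older macOS versions would see no change unless they opt in to normalization.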

@kapitainsky
Contributor Author

kapitainsky commented Jul 27, 2023

Sorry to keep pestering... but can it be implemented one day? I am not skilled enough in go to give it a go. I do try to earn my karma points on the forum instead:) I do believe that it is missing piece of macOS rclone mount to work predictably and consistently - and now it is disaster people using it do not realize.

I managed to convince the author of FUSE-T to make changes that help alleviate the problems going forward. But without an option to normalize the mount to NFD on the rclone side it is only a half measure.

@ncw
Member

ncw commented Jul 28, 2023

Sorry to keep pestering... but can it be implemented one day? I am not skilled enough in go to give it a go. I do try to earn my karma points on the forum instead:)

You have lots of karma indeed :-)

I put this on the 1.64 milestone which isn't a guarantee that it will make that release, but it will be on my radar at least!

@FurloSK

FurloSK commented Jan 30, 2024

Just dropping in to say that I would be extremely thankful for having this fixed at some point. It would save a lot of trouble for me and many users in Slovakia, Czech republic and other countries, where basically all words are accented, which makes using rclone very problematic.

@nielash
Collaborator

nielash commented Feb 5, 2024

I took a shot at fixing this: nielash@af6c376

It is the "smallest patch that could work" approach as ncw suggested. So, it does not add a new --vfs-normalize flag -- rather, it uses the existing --no-unicode-normalization (and note the double-negative -- false means normalize.) This seemed to be enough to fix it -- @kapitainsky I wonder if you agree?

Edge case to watch is situation where two names differ only by normalization and coexist in the same folder

We currently just use the first one in the case vs CASE comparison. That is the easy thing to do. To detect a collision would require scanning the whole directory which would take on average twice as long.

Worth the extra CPU for an ERROR message?

I attempted to split the difference on this by adding a new flag for this (disabled by default), --vfs-block-norm-dupes (better name suggestions welcome). ncw is right that there will be a performance hit, so I think most non-mac users will want to leave it off. However some mac users may want it, as otherwise the "edge case" @kapitainsky mentioned will have an odd effect: both versions of the file will be visible in the mount, and both will appear to be editable, however, editing either version will actually result in only the NFD version getting edited under the hood. So --vfs-block-norm-dupes prevents this confusion by detecting this scenario and hiding the duplicates + logging an error, similar to what march would do. Thoughts on this approach?
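The duplicate-hiding behaviour described above might be sketched like this. It is only an illustration under stated assumptions: `hideDupes` and `toyNFC` are hypothetical names, `toyNFC` is a tiny stand-in for `norm.NFC.String` that only composes "e" + U+0301, and the sketch assumes that "hiding the duplicates" means keeping the first entry seen and dropping later ones, which may not match the actual patch.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// toyNFC is a stand-in for norm.NFC.String. ASSUMPTION: it only composes
// "e" + U+0301 into U+00E9.
func toyNFC(s string) string {
	return strings.ReplaceAll(s, "e\u0301", "\u00e9")
}

// hideDupes scans the whole listing once. The first name for each folded
// key stays visible; later names with the same key are hidden and logged.
func hideDupes(names []string, fold func(string) string) (visible, hidden []string) {
	seen := make(map[string]string, len(names)) // folded key -> first original name
	for _, n := range names {
		key := fold(n)
		if first, ok := seen[key]; ok {
			log.Printf("ERROR: hiding %q: duplicate of %q after normalization", n, first)
			hidden = append(hidden, n)
			continue
		}
		seen[key] = n
		visible = append(visible, n)
	}
	return visible, hidden
}

func main() {
	listing := []string{"CONFLICTe\u0301FILE.txt", "CONFLICT\u00e9FILE.txt", "other.txt"}
	visible, hidden := hideDupes(listing, toyNFC)
	fmt.Println(len(visible), "visible,", len(hidden), "hidden")
}
```

The cost mentioned in the comment is visible here: every entry in the directory has to be folded on every listing, even when no duplicates exist.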

On a slightly-related note, it seems to me that there is something odd about how lib/encoder is behaving on mac. For example, if I mount a crypt remote with invalid UTF-8 paths and then run rclone lsl on the mount, I will get errors. However if I rclone sync that same remote to local and then rclone lsl the copied files, it works just fine. But then if I rclone check the same paths I synced, suddenly there are errors due to name encoding differences. Is it that the conversion only goes one way? Probably out of scope for this ticket, but thought I'd mention it.

@kapitainsky
Contributor Author

I took a shot at fixing this: nielash@af6c376

This is a big one for us macOS users:)

As soon as I have a moment I will start testing it.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

OK I could not wait - I am very excited about this chance to fix this long standing issue:) - so I quickly tried with my test data (NFC/NFD mix usage) I used in the past. And hurrah! It works. Both in Finder and terminal using FUSE-T mount.

Have not looked at "duplicates" yet. I am a bit swamped at the moment so it is only quick try - I will take it for a proper ride later.

How exactly this patch work? Sorry I am not a programmer and struggle to understand exactly how it is achieved. I see that it ditches iconv conversions - this is nice - was doing it myself for long as it was maybe something working for HFS+ but definitely for APFS was only trouble. But on its own it was not enough.

@nielash
Collaborator

nielash commented Feb 5, 2024

And hurrah! It works. Both in Finder and terminal using FUSE-T mount.

That is encouraging to hear! 😃

How exactly this patch work? Sorry I am not a programmer and struggle to understand exactly how it is achieved.

Basically it takes the existing --vfs-case-insensitive logic and extends it to apply to both case insensitivity and unicode normalization form. So, if mac references a file with its NFD name, rclone first looks to see if it has a file with that exact NFD name. If it does, great -- we're done. If it does not, instead of telling mac "sorry, I don't have that", it will look to see if it has a file that normalizes to the same NFC name as the one that mac is asking for. If it finds exactly one such file, it will consider this a match. If it finds more than one, it will error (this is possible for case sensitivity but should be pretty rare for NFC/NFD. For example, if mac asks for "hello" but rclone has both "HELLO" and "HeLlo".)

So the result is that if rclone gives mac an NFC file but mac converts it to NFD, rclone will not get confused and will be able to figure out which file mac is talking about.

The most important part is here. And this calls operations.ApplyTransforms to do the actual conversion:

func ApplyTransforms(ctx context.Context, s string) string {
	ci := fs.GetConfig(ctx)
	if !ci.NoUnicodeNormalization {
		s = norm.NFC.String(s)
	}
	if ci.IgnoreCaseSync {
		s = strings.ToLower(s)
	}
	return s
}

This uses the norm package from golang.org/x/text/unicode/norm (part of Go's extended x/text libraries rather than the standard library proper) instead of Mac's iconv implementation, which is known to be buggy.

@kapitainsky
Contributor Author

Thank you for this explanation. When I digest it I will try to break it with my testing:) And of course I hope I will fail.

@kapitainsky
Contributor Author

Maybe you could land both normalisation and nfsmount fixes at the same time? At last we could have FUSE free mount life on macOS:)

@nielash
Collaborator

nielash commented Feb 5, 2024

At last we could have FUSE free mount life on macOS:)

That's the dream! 😆

Aside from #7503 (comment) which I posted a fix for, are there other nfsmount issues?

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

Aside from #7503 (comment) which I posted a fix for, are there other nfsmount issues?

I have been using your fix few times and it serves its purpose - writing to NFS mount is possible.

One issue I noticed was nfsmount mount name. Firstly it can not be controlled with --volname and secondly it was rather random as I remember (definitely not using remote name). Not show stopper for my use so I just ignored it:)

@nielash
Collaborator

nielash commented Feb 5, 2024

Ok. That sounds fairly easy, I'll try to take a look.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

It is the "smallest patch that could work" approach as ncw suggested. So, it does not add a new --vfs-normalize flag -- rather, it uses the existing --no-unicode-normalization (and note the double-negative -- false means normalize.) This seemed to be enough to fix it -- @kapitainsky I wonder if you agree?

Using below test set,

$ rclone tree crypt:NFC_NFD_testing
/
├── NFCéééFILE.txt
├── NFCééééDIR
│   ├── NFCéééFILE.txt
│   └── NFDééééFILE.txt
├── NFDééééDIR
│   ├── NFCéééFILE.txt
│   └── NFDééééFILE.txt
├── NFDééééFILE.txt
└── TEST_DUPES
    ├── CONFLICTéééDIR
    │   ├── CONFLICTéééFILE.txt
    │   └── CONFLICTéééFILE.txt
    ├── CONFLICTéééFILE.txt
    ├── CONFLICTéééDIR
    │   ├── CONFLICTéééFILE.txt
    │   └── CONFLICTéééFILE.txt
    └── CONFLICTéééFILE.txt

5 directories, 12 files


$ rclone lsf -R crypt:NFC_NFD_testing | xargs -I'{}' bash -c 'hd <<< "${1}"' -- '{}' 
00000000  4e 46 43 c3 a9 c3 a9 c3  a9 46 49 4c 45 2e 74 78  |NFC......FILE.tx|
00000010  74 0a                                             |t.|
00000012
00000000  4e 46 43 c3 a9 c3 a9 c3  a9 c3 a9 44 49 52 2f 0a  |NFC........DIR/.|
00000010
00000000  4e 46 44 65 cc 81 65 cc  81 65 cc 81 65 cc 81 44  |NFDe..e..e..e..D|
00000010  49 52 2f 0a                                       |IR/.|
00000014
00000000  4e 46 44 65 cc 81 65 cc  81 65 cc 81 65 cc 81 46  |NFDe..e..e..e..F|
00000010  49 4c 45 2e 74 78 74 0a                           |ILE.txt.|
00000018
00000000  54 45 53 54 5f 44 55 50  45 53 2f 0a              |TEST_DUPES/.|
0000000c
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 65 cc 81 65 cc  81 65 cc 81 44 49 52 2f  |ICTe..e..e..DIR/|
00000020  0a                                                |.|
00000021
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 65 cc 81 65 cc  81 65 cc 81 46 49 4c 45  |ICTe..e..e..FILE|
00000020  2e 74 78 74 0a                                    |.txt.|
00000025
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 c3 a9 c3 a9 c3  a9 44 49 52 2f 0a        |ICT......DIR/.|
0000001e
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 c3 a9 c3 a9 c3  a9 46 49 4c 45 2e 74 78  |ICT......FILE.tx|
00000020  74 0a                                             |t.|
00000022
00000000  4e 46 44 65 cc 81 65 cc  81 65 cc 81 65 cc 81 44  |NFDe..e..e..e..D|
00000010  49 52 2f 4e 46 43 c3 a9  c3 a9 c3 a9 46 49 4c 45  |IR/NFC......FILE|
00000020  2e 74 78 74 0a                                    |.txt.|
00000025
00000000  4e 46 44 65 cc 81 65 cc  81 65 cc 81 65 cc 81 44  |NFDe..e..e..e..D|
00000010  49 52 2f 4e 46 44 65 cc  81 65 cc 81 65 cc 81 65  |IR/NFDe..e..e..e|
00000020  cc 81 46 49 4c 45 2e 74  78 74 0a                 |..FILE.txt.|
0000002b
00000000  4e 46 43 c3 a9 c3 a9 c3  a9 c3 a9 44 49 52 2f 4e  |NFC........DIR/N|
00000010  46 43 c3 a9 c3 a9 c3 a9  46 49 4c 45 2e 74 78 74  |FC......FILE.txt|
00000020  0a                                                |.|
00000021
00000000  4e 46 43 c3 a9 c3 a9 c3  a9 c3 a9 44 49 52 2f 4e  |NFC........DIR/N|
00000010  46 44 65 cc 81 65 cc 81  65 cc 81 65 cc 81 46 49  |FDe..e..e..e..FI|
00000020  4c 45 2e 74 78 74 0a                              |LE.txt.|
00000027
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 c3 a9 c3 a9 c3  a9 44 49 52 2f 43 4f 4e  |ICT......DIR/CON|
00000020  46 4c 49 43 54 65 cc 81  65 cc 81 65 cc 81 46 49  |FLICTe..e..e..FI|
00000030  4c 45 2e 74 78 74 0a                              |LE.txt.|
00000037
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 c3 a9 c3 a9 c3  a9 44 49 52 2f 43 4f 4e  |ICT......DIR/CON|
00000020  46 4c 49 43 54 c3 a9 c3  a9 c3 a9 46 49 4c 45 2e  |FLICT......FILE.|
00000030  74 78 74 0a                                       |txt.|
00000034
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 65 cc 81 65 cc  81 65 cc 81 44 49 52 2f  |ICTe..e..e..DIR/|
00000020  43 4f 4e 46 4c 49 43 54  65 cc 81 65 cc 81 65 cc  |CONFLICTe..e..e.|
00000030  81 46 49 4c 45 2e 74 78  74 0a                    |.FILE.txt.|
0000003a
00000000  54 45 53 54 5f 44 55 50  45 53 2f 43 4f 4e 46 4c  |TEST_DUPES/CONFL|
00000010  49 43 54 65 cc 81 65 cc  81 65 cc 81 44 49 52 2f  |ICTe..e..e..DIR/|
00000020  43 4f 4e 46 4c 49 43 54  c3 a9 c3 a9 c3 a9 46 49  |CONFLICT......FI|
00000030  4c 45 2e 74 78 74 0a                              |LE.txt.|
00000037



I can confirm that it works exactly like described including --vfs-block-norm-dupes flag behaviour.

And for the first time ever, all files and directories in this test set are visible in the rclone mount and, more importantly, accessible in Finder.

image

I am impressed by simplicity of this solution - it gives me confidence that it actually works as intended.
Myself I was thinking about some unnecessarily complex --vfs-normalize OFF/NFC/NFD approach.

@kapitainsky
Contributor Author

I attempted to split the difference on this by adding a new flag for this (disabled by default), --vfs-block-norm-dupes (better name suggestions welcome). ncw is right that there will be a performance hit, so I think most non-mac users will want to leave it off. However some mac users may want it, as otherwise the "edge case" @kapitainsky mentioned will have an odd effect: both versions of the file will be visible in the mount, and both will appear to be editable, however, editing either version will actually result in only the NFD version getting edited under the hood. So --vfs-block-norm-dupes prevents this confusion by detecting this scenario and hiding the duplicates + logging an error, similar to what march would do. Thoughts on this approach?

I think it is perfectly acceptable approach. Definitely it should be optional as normalisation cases are IMO very rare - it actually takes some effort to even create them:) But they are possible so for people who might have this issue this flag is a solution.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

Aside from #7503 (comment) which I posted a fix for, are there other nfsmount issues?

There is one more problem with rclone nfsmount. When you unmount (right click->unmount) - share folder is gone - but rclone nfsmount process is still running (confirmed with ps -ef | grep rclone) and has to be CTRL-C in terminal.

@nielash
Collaborator

nielash commented Feb 5, 2024

Thanks, I'll take a look.

@nielash
Collaborator

nielash commented Feb 5, 2024

One other thought just occurred to me -- I wonder if we should just return here, at least when not using --vfs-block-norm-dupes. In other words, as soon as we find the first match, use it and move on instead of continuing to look for a possible second match. If all we care about is the first match, it would be a lot faster.

@kapitainsky
Contributor Author

One other thought just occurred to me -- I wonder if we should just return here, at least when not using --vfs-block-norm-dupes. In other words, as soon as we find the first match, use it and move on instead of continuing to look for a possible second match. If all we care about is the first match, it would be a lot faster.

I think yes - it would be even safer. Given that without --vfs-block-norm-dupes I can be editing wrong file on my mount... This way I think we would have dupe missing - which again I think is better.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

And one thought in my head too... Could we add this logic to rclone dedupe? so I could validate my remote not only for exact duplicates but also for potential ones based on case (--by-case) and normalisation (--by-normalisation)?

Obviously this is not something critical but rather nice to have functionality.

@nielash
Collaborator

nielash commented Feb 5, 2024

I think yes - it would be even safer.

I think it would actually be a little less safe... as the purpose of looking for a second match is so that it can return an error if it finds one. But the question is... is that error actually useful, and if so, is it worth the CPU tradeoff?

And one thought in my head too... Could we add this logic to rclone dedupe? so I could validate my remote not only for exact duplicates but also for potential ones based on case (--by-case) and normalisation (--by-normalisation)?

That's an interesting idea! I bet that would be possible. One other thought I had along these lines is whether something like an rclone iconv command might be a useful thing. For example, to convert any NFC filenames to NFD, or vice versa.

@kapitainsky
Contributor Author

I think it would actually be a little less safe... as the purpose of looking for a second match is so that it can return an error if it finds one. But the question is... is that error actually useful, and if so, is it worth the CPU tradeoff?

Error is only useful when using --vfs-block-norm-dupes? Isn't it?

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

That's an interesting idea! I bet that would be possible. One other thought I had along these lines is whether something like an rclone iconv command might be a useful thing. For example, to convert any NFC filenames to NFD, or vice versa.

BIG yes - another nice to have feature:) I use brew install convmv for this:

convmv -r -f utf8 -t utf8 --nfc --notest .
convmv -r -f utf8 -t utf8 --nfd --notest .

@nielash
Collaborator

nielash commented Feb 5, 2024

Error is only useful when using --vfs-block-norm-dupes? Isn't it?

I tend to think that the type of user that would care about this error would be using --vfs-block-norm-dupes anyway, and the vast majority of users will not care about it and would rather have the performance boost. It is a very unlikely error to begin with -- you'd have to have multiple equal-folding filenames in the same directory to encounter it. And even if you did, "pick the first one and move on" might still be preferable to an error, for many users.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

Any rough idea what is performance impact? Because maybe we worry about non existing problem? Or negligible in real life.

It is not that rclone mount is used for mission critical applications... It is mostly used to lazy browse video collections or some docs.

IMO in case of rclone mount performance can be sacrificed for convenience.

@nielash
Collaborator

nielash commented Feb 5, 2024

I suppose we could run some tests and find out. It would depend a lot on how many files you have -- with a small number you probably wouldn't notice it, but with a large number it could become noticeable. I worked on a unicode normalization issue for bisync recently and was surprised at how much of a bottleneck all the conversions turned out to be -- I ended up reworking the code because of it.

One thing that helps in this case though is that we only find ourselves in this loop if we already failed to find an exact match. For most users, this should be pretty rare -- it basically requires the sort of scenario like macOS where the OS is actively auto-converting filenames as opposed to merely enforcing case-insensitive uniqueness. And even then, we're just talking about the small subset of files that have special characters and came from some non-mac source.

@nielash
Collaborator

nielash commented Feb 5, 2024

Come to think of it, this is another good use case for rclone convmv... the more files that are already in the correct format, the fewer checks you have to do each time.

@kapitainsky
Contributor Author

kapitainsky commented Feb 5, 2024

Yes this is what I do with my rclone remotes data - at least for now before I store it in the cloud I convert all to NFD - so I can live with crippled mount as it is today.

Of course there are multiplatform scenarios etc.

But rclone convmv would make it easy - as other systems do not face NFC/NFD issues.

And most people are not stupid - so if everything is properly documented people will do what is needed to make sure their macOS mount works fast.

"My macOS mount is slow"... simple - run rclone convmv and all will be like new:)

@kapitainsky
Contributor Author

kapitainsky commented Feb 7, 2024

@nielash sorry for pestering:) do you think your improvements (normalisation and nfsmount writes on macOS) can be PR-ed?:) These are fixes I was waiting for years so now trying to make sure that they are not lost:)

I have been using them the last few days and this is day and night compared to what it was before in terms of macOS rclone mount compatibility.

I bet some bits can be still ironed but the best way is to get exposure to myriad of use cases we can't even imagine - simply make it live. rclone nfsmount made to release in disastrous stage - effectively useless so compared to it your changes are gold.

@nielash
Collaborator

nielash commented Feb 7, 2024

I have been using them the last few days and this is day and night compared to what it was before in terms of macOS rclone mount compatibility.

Very glad to hear that 🙂

Yes, I can submit PRs. I was planning to also look at some of the other nfsmount issues you mentioned (--volname and unmounting), but as they are less critical maybe it's best to submit the most important changes first and fix the rest later. What do you think?

rclone nfsmount made to release in disastrous stage - effectively useless so compared to it your changes are gold.

🤣

@kapitainsky
Contributor Author

kapitainsky commented Feb 7, 2024

Yes, I can submit PRs. I was planning to also look at some of the other nfsmount issues you mentioned (--volname and unmounting), but as they are less critical maybe it's best to submit the most important changes first and fix the rest later. What do you think?

IMO normalisation is fully working and is ready to be included in an rclone release. I even took the time to understand your code :) it looks very clean and logical.

nfsmount - I do appreciate good engineering so have to say that indeed maybe better to solve known issues and make it real feature not just some half baked thing.

Most importantly normalisation works perfectly with macFUSE/FUSE-T so it immediately solves old issue. nfsmount is very nice addition but not critical.

nielash added a commit to nielash/rclone that referenced this issue Feb 8, 2024
Before this change, the VFS layer did not properly handle unicode normalization,
which caused problems particularly for users of macOS. While attempts were made
to handle it with various `-o modules=iconv` combinations, this was an imperfect
solution, as no one combination allowed both NFC and NFD content to
simultaneously be both visible and editable via Finder.

After this change, the VFS supports `--no-unicode-normalization` (default `false`)
via the existing `--vfs-case-insensitive` logic, which is extended to apply to both
case insensitivity and unicode normalization form.

This change also adds an additional flag, `--vfs-block-norm-dupes`, to address a
probably rare but potentially possible scenario where a directory contains
multiple duplicate filenames after applying case and unicode normalization
settings. In such a scenario, this flag (disabled by default) hides the
duplicates. This comes with a performance tradeoff, as rclone will have to scan
the entire directory for duplicates when listing a directory. For this reason,
it is recommended to leave this disabled if not needed. However, macOS users may
wish to consider using it, as otherwise, if a remote directory contains both NFC
and NFD versions of the same filename, an odd situation will occur: both
versions of the file will be visible in the mount, and both will appear to be
editable, however, editing either version will actually result in only the NFD
version getting edited under the hood. `--vfs-block-norm-dupes` prevents this
confusion by detecting this scenario, hiding the duplicates, and logging an
error, similar to how this is handled in `rclone sync`.
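
The duplicate check described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not rclone's code: `hide_normalized_dupes` is a hypothetical name, NFC is assumed as the comparison form, and keeping the first entry seen while hiding later ones is a guess about which entry survives. It does show why the flag has a cost: every name in the directory has to be normalized and compared before the listing can be returned.

```python
import logging
import unicodedata

log = logging.getLogger("vfs-sketch")


def hide_normalized_dupes(names, case_insensitive=False):
    """Return `names` without later entries that collide after normalization."""
    seen = {}     # normalized key -> first original name with that key
    visible = []
    for name in names:  # full pass over the directory listing
        key = unicodedata.normalize("NFC", name)
        if case_insensitive:
            key = key.casefold()
        if key in seen:
            # Assumption: keep the first entry and hide the rest.
            log.error("hiding %r: duplicate of %r after normalization",
                      name, seen[key])
            continue
        seen[key] = name
        visible.append(name)
    return visible


listing = ["cafe\u0301.txt", "caf\u00e9.txt", "notes.txt"]
print(len(hide_normalized_dupes(listing)))  # 2
```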
nielash added a commit to nielash/rclone that referenced this issue Feb 8, 2024
@nielash nielash linked a pull request Feb 8, 2024 that will close this issue
5 tasks
@nielash
Collaborator

nielash commented Feb 13, 2024

@kapitainsky I made a proof-of-concept for a rclone convmv command. Very much still a WIP, but if you are interested: nielash@1d6be93

Examples of what it does: Link

@kapitainsky
Contributor Author

@kapitainsky I made a proof-of-concept for a rclone convmv command. Very much still a WIP, but if you are interested: nielash@1d6be93

Examples of what it does: Link

I managed to compile it and run a few simple tests. All worked! I think it is a very cool small utility. When you need it, it is simply irreplaceable. Things like this make rclone a truly pro utility. Thank you very much for doing it.

nielash added a commit to nielash/rclone that referenced this issue Mar 4, 2024
nielash added a commit to nielash/rclone that referenced this issue Mar 5, 2024
nielash added a commit to nielash/rclone that referenced this issue Mar 6, 2024
@ncw ncw closed this as completed in #7620 Mar 6, 2024
ncw pushed a commit that referenced this issue Mar 6, 2024