Umlauts and Regular Expressions #40

Closed
kpprt opened this issue Jul 20, 2019 · 7 comments

@kpprt

kpprt commented Jul 20, 2019

I wish I didn't have to organize files with umlauts or other special characters in their names, but unfortunately my bank sends me PDF files with umlauts in their names.

Example file name:
Erträgnisaufstellung_2018.pdf

I tried the following regular expressions with no success:

  • Regex: '^Erträgnisaufstellung.pdf$'
  • Regex: '^Ertr.gnisaufstellung.pdf$'
  • Regex: '^Ertr(.)gnisaufstellung.pdf$'

The only workaround I found (not very accurate) is one of these:

  • Regex: '^Ertr(.{2})gnisaufstellung.pdf$'
  • Regex: '^Ertr(.*)gnisaufstellung.pdf$'
  • Regex: '^Ertr(.+)gnisaufstellung.pdf$'

OS: Mac
organize version: 1.5

Maybe it has something to do with Unicode encoding, but I'm not sure.

@tfeldmann
Owner

Hi, thank you for the detailed report. Unfortunately I'm not able to reproduce this. It works for me on both Windows 10 and macOS :/

  • Can you check whether your config.yaml is UTF-8 encoded? Or, even better, send the original file via email (the address is in my profile).
  • Can you try retyping the filename by hand and check again? Unicode has some identical-looking, confusable characters; maybe your ä is really an 𝚊̈ (https://unicode.org/cldr/utility/confusables.jsp?a=ä&r=None)
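
One way to check which characters a name really contains is to list its code points. This is only a minimal sketch: `describe` is a hypothetical helper name, and the two sample strings are assumptions chosen to show a precomposed and a decomposed "ä".

```python
import unicodedata


def describe(text):
    """Return one (character, code point, Unicode name) tuple per code point."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


precomposed = "\u00e4"   # a single code point
decomposed = "a\u0308"   # 'a' followed by a combining diaeresis

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, describe(text))
```

Pasting the filename and the string from the config into `describe` should make any difference visible even when both look identical on screen.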

@kpprt
Author

kpprt commented Jul 22, 2019

I think the encoding of the config file is correct and I tried both versions of the confusables again, but with no luck. I will send you an email with an example file and config.

@tfeldmann
Owner

Thank you for the minimal reproducer! I found the issue: both strings are Unicode, but they are composed differently:

>>> s1 = 'Erträgnisaufstellung'  # copied from config.yaml
>>> s2 = 'Erträgnisaufstellung'  # copied from filename
>>> s1.encode('utf-8')
b'Ertr\xc3\xa4gnisaufstellung'
>>> s2.encode('utf-8')
b'Ertra\xcc\x88gnisaufstellung'

To be honest, this sent me down a rabbit hole. It seems like the HFS+ filesystem stores filenames in a decomposed form (close to NFD), whereas in your config you wrote the precomposed form (NFC).
The form on NFS filesystems is not specified, and filenames on Samba shares and on Linux seem to be in NFC. APFS does not enforce a normalization as far as I know.
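
The difference between the two forms, and why the patterns from the report behaved the way they did, can be sketched as follows. The strings are written with explicit escapes so the composition cannot get lost when copying; the shortened patterns are assumptions for illustration.

```python
import re
import unicodedata

s1 = "Ertr\u00e4gnisaufstellung"    # precomposed (NFC), as typed in the config
s2 = "Ertra\u0308gnisaufstellung"   # decomposed (NFD), as in the filename

print(s1 == s2)                                  # False
print(len(s1), len(s2))                          # 20 21
print(unicodedata.normalize("NFC", s2) == s1)    # True
print(unicodedata.normalize("NFD", s1) == s2)    # True

# '.' matches exactly one code point, but the decomposed umlaut is two:
print(re.search("^Ertr.gnisaufstellung", s2))      # None
print(re.search("^Ertr.{2}gnisaufstellung", s2))   # a match object
```

This is consistent with the report: the literal `ä` and the single `.` fail against the decomposed name, while `.{2}`, `.*` and `.+` can span both code points.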

I guess using the NFKD form internally for all comparisons would behave as expected for most use cases. This means rolling my own Unicode-normalized glob implementation and normalizing the config before parsing. This is now on my to-do list but might take a while because it needs to be tested on different platforms. I'm really wondering why Python doesn't simplify these things for you :/
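
A rough sketch of the idea, not the actual implementation in organize: `normalized_search` is a hypothetical helper that brings the pattern and the filename into the same normalization form before matching. The sketch defaults to NFC (an assumption) so that a single `.` in a pattern still matches one umlaut; any form works for literal comparisons as long as both sides use the same one.

```python
import re
import unicodedata


def normalized_search(pattern, filename, form="NFC"):
    """Hypothetical helper: normalize both sides to one form, then match."""
    pattern = unicodedata.normalize(form, pattern)
    filename = unicodedata.normalize(form, filename)
    return re.search(pattern, filename)


# Precomposed pattern (as in a config) against a decomposed filename (as on HFS+).
pattern = "^Ertr\u00e4gnisaufstellung_\\d{4}\\.pdf$"
filename = "Ertra\u0308gnisaufstellung_2018.pdf"

print(re.search(pattern, filename))           # None: the forms differ
print(normalized_search(pattern, filename))   # a match object
```

Normalizing a whole regex string is a simplification: the ASCII metacharacters are not affected, but a pattern that places combining characters inside a character class could change meaning.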

@tfeldmann
Owner

In the meantime you can copy your filename into your config file and everything should work 👌

@kpprt
Author

kpprt commented Aug 12, 2019

Hi Thomas, sorry for the ultra late reply!

This is indeed a rabbit hole! Never heard of the confusables before, but it is good to know. I remember that I tried to copy the filename, but that did not work either.

Now that I have tested it again, it works. I guess this has something to do with these confusables. The original 'ä' in the PDF from my bank was probably a different one than the 'ä' I type on my Mac, and after I fiddled around with renaming the file and adjusting the config a couple of times, the original 'ä' probably disappeared.

Thanks for clearing that up and thanks for providing organize!

@tfeldmann
Owner

Hey thank you so much for the kind words and for the donation! It's very much appreciated!

@tfeldmann
Owner

This is now fixed and will be available in the next version 👍
