UTF8 Charset nfd/nfc Encoding #2479

Closed
@dezi wants to merge 1 commit

5 participants

@dezi

Background:

I am using an AirPort Extreme shared disk. XBMC will not scrape any TV shows that have non-ASCII UTF-8 characters in the show name. This is because Apple disks store path/file names in decomposed UTF-8 (NFD), which is not compatible with the installed scrapers.

This will affect many unhappy OSX/Mac and Airport Extreme users.

Read here for additional info:

http://stackoverflow.com/questions/12147410/different-utf-8-signature-for-same-diacritics-umlauts-2-binary-ways-to-write

and here:

http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding

I have implemented a relatively dirty workaround for this problem. It converts decomposed UTF-8 file names to composed UTF-8 file names for various scrapers.

This is a discussion commit only. I welcome any suggestions to enhance the solution.

Cheers
Dezi

@dezi

Oops, I left debugging stuff in the patch. I will rebase and change this.

@jmarshallnz
Owner

Perhaps it would be more useful to do the conversion in the file classes? You'll notice we already handle it in the utf8 to wide conversion functions for display in the UI. By handling it at the filesystem level, where we know the encoding, we don't have to handle it in places where we don't know the encoding.

What happens with NFS/SMB shares from AirportExpress to Linux machines? Does it transfer as 'normal' UTF8 at that point?

@Memphiz
Owner

just fyi - nfs doesn't specify encoding - it relies on what the underlying filesystem does ...

read comments on http://trac.xbmc.org/ticket/13644

@ghost

i believe you can use iconv to do the conversion. there should be an utf8-mac charset to use as the source. e.g. http://serverfault.com/questions/397420/converting-utf-8-nfd-filenames-to-utf-8-nfc-in-either-rsync-or-afpdw

@jmarshallnz
Owner

@Memphiz: what happens if Apple Express/OSX is serving over NFS to a Linux box? Do you get "corrupt" UTF8 filenames, or is Apple Express/OSX smart enough to serve stuff over NFS/SMB using UTF8 that others will understand?

@dezi

I looked through iconv and could not find UTF8-MAC anymore, so this is not an option. If you access drives with HFS+ directly or via the AFP protocol, Apple takes care that your filenames, however you specify them, are decomposed. No problem here. However, if you copy files from HFS+ to ext4, their names will stay decomposed. ext4 basically does not care: you can have two files in the same directory that look as if they had the same name, one precomposed and one decomposed. So the decomposed file cannot be accessed with a precomposed file name. It is a big mess, introduced by Apple. I would under no circumstances manipulate strings that are used to access files. Path strings should only be converted for queries somewhere else (scrapers) or for display, since XBMC renders those decomposed strings incorrectly.

@jmarshallnz
Owner

We use UTF-8-MAC on darwin for utf8 to foo conversion, so it's certainly there (see utils/CharsetConverter.cpp). Thus, things should be displayed correctly in the UI already, right? You could thus easily patch in an iconv conversion method. Indeed, you could use CCharsetConverter::utf8To() directly if you changed it to check for UTF8_SOURCE rather than "UTF-8".

@dezi

I missed the UTF-8-MAC implementation in CharsetConverter.cpp. So we do not need my routines. By the way, I searched the internet for half a day for routines that do the job and did not find anything suitable...

I have just checked on my Mac. Yes, the decomposed file names are displayed correctly, yet the scraper of course does not work on them. I use the very same AirPort Extreme disk on my Raspberry Pi, and there the file names show up incorrectly. I would say converting from NFD to NFC for display is not an OSX issue, but an issue on hosts where OSX drives are mounted. So we should always compose for display and searching.

As I mentioned, I copied files with those filenames down from my AirPort drive via AFP to ext4, and the filenames on ext4 stay decomposed. So I infected my ext4 drive as well. Basically you have to expect them everywhere. All my music originally comes from iTunes / OSX, with lots of accented characters in there, and I moved all of that down to my server on ext4.

Cheers
Dezi

@dezi

@jmarshallnz: I have looked into both CharsetConverter.cpp and the libiconv source. UTF-8-MAC in CharsetConverter is only a define used on Darwin build targets; it is not available anywhere else. Basically I cannot find a routine that will convert UTF-8-MAC to UTF-8. By the way, decomposed characters are also valid UTF-8, so if you convert Mac-style UTF-8 to ISO-Latin, it works everywhere as such, with no special facilities required.

Converting strings from UTF-8 to ISO-Latin and then back to UTF-8 would do the job, but is certainly not an option, because ISO-Latin only covers European languages and will fail on other content.

Basically we have two problems here:

  1. People using Mac shared disks (an AirPort Extreme NAS is a common example), or files copied from Mac shared disks to local filesystems, will experience unpredictable and hard-to-explain problems with scrapers (content not found).

  2. People using Mac Content in XBMC Gui other than on Mac will have character display problems in file lists and other places.

@jmarshallnz
Owner

utf8To() will work fine as long as you remove the check (or better, replace the check) that the "to" charset isn't UTF-8. Instead, compare it to UTF8_SOURCE. That will convert from UTF-8-MAC to UTF-8 then on darwin builds, which is a start.
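
For illustration, the iconv route discussed here might look roughly like the sketch below. It is not XBMC code: the helper name `Utf8MacToUtf8` is hypothetical, and it assumes the local iconv build recognises the "UTF-8-MAC" charset name, which this thread suggests is only the case on Darwin. Where the charset is unknown, `iconv_open` fails and the helper simply returns false.

```cpp
#include <cassert>
#include <iconv.h>
#include <string>

// Hypothetical helper (not part of XBMC): try to convert decomposed
// "UTF-8-MAC" input to plain UTF-8. Returns false if this iconv build
// does not know "UTF-8-MAC" or if the conversion fails; `out` is only
// meaningful when true is returned.
bool Utf8MacToUtf8(const std::string& in, std::string& out)
{
  iconv_t cd = iconv_open("UTF-8", "UTF-8-MAC"); // (tocode, fromcode)
  if (cd == (iconv_t)-1)
    return false;

  std::string input = in;                    // iconv() wants non-const pointers
  std::string buffer(in.size() * 4 + 4, '\0'); // generous output buffer

  char* src = &input[0];
  size_t srcLeft = input.size();
  char* dst = &buffer[0];
  size_t dstLeft = buffer.size();

  // Convert, then flush any character the converter is still holding back.
  size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
  if (rc != (size_t)-1)
    rc = iconv(cd, nullptr, nullptr, &dst, &dstLeft);
  iconv_close(cd);

  if (rc == (size_t)-1)
    return false;

  out.assign(buffer.data(), buffer.size() - dstLeft);
  return true;
}
```

A caller would have to treat a false return as "no conversion available" and fall back to the unmodified string, which is the situation reported for Linux builds in this thread.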

@jmarshallnz
Owner

Also, you could probably get away with utf8ToW -> wToUTF8 for the conversion.

@dezi

On Linux/Debian I tried the first variant without success, as far as I remember. I will double-check this. Converting to wide may also work, but since decomposed characters are well-formed UTF-8, I am afraid the conversion routine sees no need to convert from decomposed to composed while going to wide Unicode and back. I will check this as well and give feedback. Thanks for your support on this.
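
The concern about the wide round trip can be illustrated with a plain UTF-8 decoder and encoder. The sketch below is an assumption-laden stand-in, not the behaviour of XBMC's `utf8ToW`/`wToUTF8` or of iconv: `DecodeUtf8` and `EncodeUtf8` are hypothetical helpers that do no validation. It only shows that a conversion which maps bytes to code points and back keeps the base letter and the combining mark as two separate code points, so nothing gets composed along the way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into code points (sketch only, no validation).
std::vector<char32_t> DecodeUtf8(const std::string& s)
{
  std::vector<char32_t> out;
  size_t i = 0;
  while (i < s.size())
  {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    // Number of continuation bytes implied by the lead byte.
    const int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : (lead >= 0xC0) ? 1 : 0;
    char32_t cp = (extra == 0) ? lead : (lead & (0x3F >> extra));
    for (int k = 1; k <= extra && i + k < s.size(); ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    out.push_back(cp);
    i += extra + 1;
  }
  return out;
}

// Encode code points back to UTF-8, one code point at a time.
std::string EncodeUtf8(const std::vector<char32_t>& cps)
{
  std::string out;
  for (char32_t cp : cps)
  {
    if (cp < 0x80)
      out += static_cast<char>(cp);
    else if (cp < 0x800)
    {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Decoding the decomposed sequence `"a\xcc\x88"` yields the two code points U+0061 and U+0308, and re-encoding them reproduces the same three bytes. A converter that is expected to compose would need an explicit normalization step in between.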

@dezi

More interesting facts on this issue:

This is a log from an ssh session on my RPi. "Test" is a directory that resides on an ext4 partition.
The content was copied over from the AirPort Extreme NAS to this place. It's a directory called "Unsere Mütter, unsere Väter".

I tried to cd into it.

For the first cd, I typed the path in. Did not work.
For the second cd, I copied and pasted the path from the ll listing. Did not work.
For the third cd, I used file name completion. Worked.


pi@xberrydev ~ $ ll Test/
total 12
drwxr-xr-x 3 pi pi 4096 Mar 23 09:38 .
drwxr-xr-x 13 pi pi 4096 Mar 23 08:44 ..
drwx------ 2 pi pi 4096 Mar 23 08:45 Unsere Mütter, unsere Väter
pi@xberrydev ~ $ cd "Test/Unsere Mütter, unsere Väter"
-bash: cd: Test/Unsere Mütter, unsere Väter: No such file or directory
pi@xberrydev ~ $ cd "Test/Unsere Mütter, unsere Väter"
-bash: cd: Test/Unsere Mütter, unsere Väter: No such file or directory
pi@xberrydev ~ $ cd "Test/Unsere Mütter, unsere Väter"
pi@xberrydev ~/Test/Unsere Mütter, unsere Väter $


Conclusion: Linux / ext4 is picky about how you specify accented characters. You need to match the NFD/NFC form exactly, otherwise it will not work. AFP/HFS+ is not picky about this: you can enter paths in either NFC or NFD and it will work.
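
Why the names look identical but do not match can be seen at the byte level. The following is a small sketch under the assumption that the file system compares names byte for byte, as described for ext4 above. `HexDump` is a hypothetical helper, and the two constants spell "Mütter" once with a precomposed U+00FC and once with "u" followed by the combining diaeresis U+0308.

```cpp
#include <cassert>
#include <string>

// "Mütter" in NFC: the u-umlaut is the single code point U+00FC (0xC3 0xBC).
const std::string kNameNfc = "M\xc3\xbctter";
// "Mütter" in NFD: plain 'u' followed by combining U+0308 (0xCC 0x88).
const std::string kNameNfd = "Mu\xcc\x88tter";

// Render a string as space-separated hex bytes for inspection.
std::string HexDump(const std::string& s)
{
  static const char* kDigits = "0123456789abcdef";
  std::string out;
  for (unsigned char c : s)
  {
    if (!out.empty())
      out += ' ';
    out += kDigits[c >> 4];
    out += kDigits[c & 0x0F];
  }
  return out;
}
```

Both strings render the same on screen, yet `kNameNfc` is 7 bytes long and `kNameNfd` is 8, so a byte-wise comparison of the two fails. That matches the session above, where only the completed (byte-exact) path worked.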

@dezi

@jmarshallnz: I have double-checked both suggestions you made. Both conversions leave the NFD/NFC forms as they are. I am not able to compile the OSX version, so I cannot verify the conversion there. Anyhow, on OSX the path names are displayed correctly, but scraper searching fails. You can check from the log what was really searched, because strings are dumped URL-encoded. I believe that on Darwin the native display is simply able to render NFD correctly and no conversions are ever made. Maybe you can verify this. Thanks a lot, Dezi

@dezi

The NFD_NFC_tupels array is pretty large; we could move it to some static area later on. It is just presented as a working draft here. If someone is interested in how the table was generated, just let me know.

@jmarshallnz jmarshallnz was assigned
@MartijnKaijser

further comments on this PR?

@jmarshallnz
Owner

The only way it's really going to be solved is using ICU or similar, but that's a large change.

As an interim stop-gap, the current solution could be used if it were cleaned up for efficiency (the initial check might be optimised a little by jumping by utf8 characters perhaps, and the LUT could be done in O(1), as there's only a range of 40642 possibilities for the codewords), and applied for display and passing to 3rd parties (scrapers etc.) only.
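
One possible shape for a constant-time lookup is sketched below. It swaps in a hash map (average O(1)) rather than the directly indexed table the comment hints at, so it is an illustration of the idea and not the proposed implementation. `kNfdToNfc` and `ComposeSketch` are hypothetical names, and the map holds only a handful of pairs copied from the PR's table, using the same packing: base byte and two-byte combining mark as the key, the two precomposed UTF-8 bytes as the value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical excerpt of the PR's table as a hash map.
static const std::unordered_map<uint32_t, uint16_t> kNfdToNfc = {
  {0x41cc88, 0xc384}, // A + U+0308 -> U+00C4
  {0x4fcc88, 0xc396}, // O + U+0308 -> U+00D6
  {0x55cc88, 0xc39c}, // U + U+0308 -> U+00DC
  {0x61cc88, 0xc3a4}, // a + U+0308 -> U+00E4
  {0x6fcc88, 0xc3b6}, // o + U+0308 -> U+00F6
  {0x75cc88, 0xc3bc}, // u + U+0308 -> U+00FC
  {0x65cc81, 0xc3a9}, // e + U+0301 -> U+00E9
};

// Replace known "base + combining mark" byte triples with their
// precomposed two-byte form; everything else is copied unchanged.
std::string ComposeSketch(const std::string& src)
{
  std::string dest;
  dest.reserve(src.size());
  size_t i = 0;
  while (i < src.size())
  {
    if (i + 2 < src.size())
    {
      const uint32_t key =
          (static_cast<uint32_t>(static_cast<unsigned char>(src[i])) << 16) |
          (static_cast<uint32_t>(static_cast<unsigned char>(src[i + 1])) << 8) |
          static_cast<uint32_t>(static_cast<unsigned char>(src[i + 2]));
      const auto it = kNfdToNfc.find(key);
      if (it != kNfdToNfc.end())
      {
        dest += static_cast<char>(it->second >> 8);
        dest += static_cast<char>(it->second & 0xff);
        i += 3; // consumed the base byte and both mark bytes
        continue;
      }
    }
    dest += src[i++];
  }
  return dest;
}
```

Like the patch itself, this only handles a single combining mark after an ASCII base letter. Sequences with several marks, or non-Latin scripts, would still need a full normalization library such as ICU.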

I presume that if you have a file stored as NFD (on some HFS+ disk for example) and request it using NFC (i.e. the filename is stored in the database for example in NFC) then things go wrong, or is the disk/filesystem/OS smart enough to figure things out?

@dezi

My experience is: Apple screwed this up. I cannot copy files containing umlauts from a Samba volume in NFC to my iMac. Period. Too bad.

The Linux file system ext4 is able to store files either way, but needs the file name "literally" when accessing those files. That means: you have files with umlauts copied via Mac/OSX and Samba down to an ext4 file system. A dir listing on the Linux box shows "correct" file names. When you try to access the file with a typed-in file name, the file is not found. When you use file name completion in bash, it works. Bottom line: thank you, Apple developers. That is all I have to say.

This patch was a workaround because this thing drove me crazy. I generated the table programmatically by testing all possible diacritical characters in NFD and converting them to NFC. All characters that converted successfully are part of the list.

@MartijnKaijser

@Karlson2k
this is your area atm. comments?

@Karlson2k
Collaborator

I know that problem, it's not only mac/linux. If you create some .zip/.rar on mac and unpack it on win32 then you'll have decomposed chars on NTFS (or FAT32/exFAT).
Let's divide the problem into smaller pieces and solve them one-by-one.
1. Correctly display any kind of chars (composed/decomposed) in GUI on any platform
2. Get scrapers to work properly with decomposed chars
3. Access files with decomposed chars correctly on all platforms

1 and 2 can be solved together; for 3 we need to carefully inspect our code and remove unwanted charset conversions so that chars are stored in their original form.
I will dig deeply into 1 and 2 after the merge window.

It would be nice if someone could share a .zip/.rar with problem chars from mac.

@MartijnKaijser

@Karlson2k what to do with this PR?

@Karlson2k
Collaborator

@MartijnKaijser This PR does not cover all possible decomposed chars, the decomposed->composed conversion is not optimized (as @jmarshallnz suggested), and the conversion must not be in the scraper, as the UI does not support decomposed chars either.
The PR has not been updated for a long time, close it.

Commits on Mar 24, 2013
  1. @dezi

    UTF8 Charset nfd/nfc Encoding

    dezi authored
1  xbmc/addons/Scraper.cpp
@@ -501,6 +501,7 @@ std::vector<CScraperUrl> CScraper::FindMovie(XFILE::CCurlFile &fcurl, const CStd
sTitle.ToLower();
vector<CStdString> vcsIn(1);
+ g_charsetConverter.utf8ToUtf8NFC(sTitle);
g_charsetConverter.utf8To(SearchStringEncoding(), sTitle, vcsIn[0]);
CURL::Encode(vcsIn[0]);
if (!sYear.IsEmpty())
111 xbmc/utils/CharsetConverter.cpp
@@ -477,6 +477,117 @@ void CCharsetConverter::stringCharsetToUtf8(const CStdStringA& strSourceCharset,
iconv_close(iconvString);
}
+void CCharsetConverter::utf8ToUtf8NFC(CStdStringA& strSourceDest)
+{
+  // To keep stuff simple, we only check all decomposable
+  // simple letter character candidates.
+  // Sample: 0x41 0xcc 0x88 => 0xc3 0x84 (&Adieresis; or &Auml;)
+  int NFD_NFC_tupels[] =
+  {
+ 0x41cc80, 0xc380, 0x45cc80, 0xc388, 0x49cc80, 0xc38c, 0x4ecc80, 0xc7b8,
+ 0x4fcc80, 0xc392, 0x55cc80, 0xc399, 0x61cc80, 0xc3a0, 0x65cc80, 0xc3a8,
+ 0x69cc80, 0xc3ac, 0x6ecc80, 0xc7b9, 0x6fcc80, 0xc3b2, 0x75cc80, 0xc3b9,
+ 0x41cc81, 0xc381, 0x43cc81, 0xc486, 0x45cc81, 0xc389, 0x47cc81, 0xc7b4,
+ 0x49cc81, 0xc38d, 0x4ccc81, 0xc4b9, 0x4ecc81, 0xc583, 0x4fcc81, 0xc393,
+ 0x52cc81, 0xc594, 0x53cc81, 0xc59a, 0x55cc81, 0xc39a, 0x59cc81, 0xc39d,
+ 0x5acc81, 0xc5b9, 0x61cc81, 0xc3a1, 0x63cc81, 0xc487, 0x65cc81, 0xc3a9,
+ 0x67cc81, 0xc7b5, 0x69cc81, 0xc3ad, 0x6ccc81, 0xc4ba, 0x6ecc81, 0xc584,
+ 0x6fcc81, 0xc3b3, 0x72cc81, 0xc595, 0x73cc81, 0xc59b, 0x75cc81, 0xc3ba,
+ 0x79cc81, 0xc3bd, 0x7acc81, 0xc5ba, 0x41cc82, 0xc382, 0x43cc82, 0xc488,
+ 0x45cc82, 0xc38a, 0x47cc82, 0xc49c, 0x48cc82, 0xc4a4, 0x49cc82, 0xc38e,
+ 0x4acc82, 0xc4b4, 0x4fcc82, 0xc394, 0x53cc82, 0xc59c, 0x55cc82, 0xc39b,
+ 0x57cc82, 0xc5b4, 0x59cc82, 0xc5b6, 0x61cc82, 0xc3a2, 0x63cc82, 0xc489,
+ 0x65cc82, 0xc3aa, 0x67cc82, 0xc49d, 0x68cc82, 0xc4a5, 0x69cc82, 0xc3ae,
+ 0x6acc82, 0xc4b5, 0x6fcc82, 0xc3b4, 0x73cc82, 0xc59d, 0x75cc82, 0xc3bb,
+ 0x77cc82, 0xc5b5, 0x79cc82, 0xc5b7, 0x41cc83, 0xc383, 0x49cc83, 0xc4a8,
+ 0x4ecc83, 0xc391, 0x4fcc83, 0xc395, 0x55cc83, 0xc5a8, 0x61cc83, 0xc3a3,
+ 0x69cc83, 0xc4a9, 0x6ecc83, 0xc3b1, 0x6fcc83, 0xc3b5, 0x75cc83, 0xc5a9,
+ 0x41cc84, 0xc480, 0x45cc84, 0xc492, 0x49cc84, 0xc4aa, 0x4fcc84, 0xc58c,
+ 0x55cc84, 0xc5aa, 0x59cc84, 0xc8b2, 0x61cc84, 0xc481, 0x65cc84, 0xc493,
+ 0x69cc84, 0xc4ab, 0x6fcc84, 0xc58d, 0x75cc84, 0xc5ab, 0x79cc84, 0xc8b3,
+ 0x41cc86, 0xc482, 0x45cc86, 0xc494, 0x47cc86, 0xc49e, 0x49cc86, 0xc4ac,
+ 0x4fcc86, 0xc58e, 0x55cc86, 0xc5ac, 0x61cc86, 0xc483, 0x65cc86, 0xc495,
+ 0x67cc86, 0xc49f, 0x69cc86, 0xc4ad, 0x6fcc86, 0xc58f, 0x75cc86, 0xc5ad,
+ 0x41cc87, 0xc8a6, 0x43cc87, 0xc48a, 0x45cc87, 0xc496, 0x47cc87, 0xc4a0,
+ 0x49cc87, 0xc4b0, 0x4fcc87, 0xc8ae, 0x5acc87, 0xc5bb, 0x61cc87, 0xc8a7,
+ 0x63cc87, 0xc48b, 0x65cc87, 0xc497, 0x67cc87, 0xc4a1, 0x6fcc87, 0xc8af,
+ 0x7acc87, 0xc5bc, 0x41cc88, 0xc384, 0x45cc88, 0xc38b, 0x49cc88, 0xc38f,
+ 0x4fcc88, 0xc396, 0x55cc88, 0xc39c, 0x59cc88, 0xc5b8, 0x61cc88, 0xc3a4,
+ 0x65cc88, 0xc3ab, 0x69cc88, 0xc3af, 0x6fcc88, 0xc3b6, 0x75cc88, 0xc3bc,
+ 0x79cc88, 0xc3bf, 0x41cc8a, 0xc385, 0x55cc8a, 0xc5ae, 0x61cc8a, 0xc3a5,
+ 0x75cc8a, 0xc5af, 0x4fcc8b, 0xc590, 0x55cc8b, 0xc5b0, 0x6fcc8b, 0xc591,
+ 0x75cc8b, 0xc5b1, 0x41cc8c, 0xc78d, 0x43cc8c, 0xc48c, 0x44cc8c, 0xc48e,
+ 0x45cc8c, 0xc49a, 0x47cc8c, 0xc7a6, 0x48cc8c, 0xc89e, 0x49cc8c, 0xc78f,
+ 0x4bcc8c, 0xc7a8, 0x4ccc8c, 0xc4bd, 0x4ecc8c, 0xc587, 0x4fcc8c, 0xc791,
+ 0x52cc8c, 0xc598, 0x53cc8c, 0xc5a0, 0x54cc8c, 0xc5a4, 0x55cc8c, 0xc793,
+ 0x5acc8c, 0xc5bd, 0x61cc8c, 0xc78e, 0x63cc8c, 0xc48d, 0x64cc8c, 0xc48f,
+ 0x65cc8c, 0xc49b, 0x67cc8c, 0xc7a7, 0x68cc8c, 0xc89f, 0x69cc8c, 0xc790,
+ 0x6acc8c, 0xc7b0, 0x6bcc8c, 0xc7a9, 0x6ccc8c, 0xc4be, 0x6ecc8c, 0xc588,
+ 0x6fcc8c, 0xc792, 0x72cc8c, 0xc599, 0x73cc8c, 0xc5a1, 0x74cc8c, 0xc5a5,
+ 0x75cc8c, 0xc794, 0x7acc8c, 0xc5be, 0x41cc8f, 0xc880, 0x45cc8f, 0xc884,
+ 0x49cc8f, 0xc888, 0x4fcc8f, 0xc88c, 0x52cc8f, 0xc890, 0x55cc8f, 0xc894,
+ 0x61cc8f, 0xc881, 0x65cc8f, 0xc885, 0x69cc8f, 0xc889, 0x6fcc8f, 0xc88d,
+ 0x72cc8f, 0xc891, 0x75cc8f, 0xc895, 0x41cc91, 0xc882, 0x45cc91, 0xc886,
+ 0x49cc91, 0xc88a, 0x4fcc91, 0xc88e, 0x52cc91, 0xc892, 0x55cc91, 0xc896,
+ 0x61cc91, 0xc883, 0x65cc91, 0xc887, 0x69cc91, 0xc88b, 0x6fcc91, 0xc88f,
+ 0x72cc91, 0xc893, 0x75cc91, 0xc897, 0x4fcc9b, 0xc6a0, 0x55cc9b, 0xc6af,
+ 0x6fcc9b, 0xc6a1, 0x75cc9b, 0xc6b0, 0x53cca6, 0xc898, 0x54cca6, 0xc89a,
+ 0x73cca6, 0xc899, 0x74cca6, 0xc89b, 0x43cca7, 0xc387, 0x45cca7, 0xc8a8,
+ 0x47cca7, 0xc4a2, 0x4bcca7, 0xc4b6, 0x4ccca7, 0xc4bb, 0x4ecca7, 0xc585,
+ 0x52cca7, 0xc596, 0x53cca7, 0xc59e, 0x54cca7, 0xc5a2, 0x63cca7, 0xc3a7,
+ 0x65cca7, 0xc8a9, 0x67cca7, 0xc4a3, 0x6bcca7, 0xc4b7, 0x6ccca7, 0xc4bc,
+ 0x6ecca7, 0xc586, 0x72cca7, 0xc597, 0x73cca7, 0xc59f, 0x74cca7, 0xc5a3,
+ 0x41cca8, 0xc484, 0x45cca8, 0xc498, 0x49cca8, 0xc4ae, 0x4fcca8, 0xc7aa,
+ 0x55cca8, 0xc5b2, 0x61cca8, 0xc485, 0x65cca8, 0xc499, 0x69cca8, 0xc4af,
+ 0x6fcca8, 0xc7ab, 0x75cca8, 0xc5b3, 0x41cd80, 0xc380, 0x45cd80, 0xc388,
+ 0x49cd80, 0xc38c, 0x4ecd80, 0xc7b8, 0x4fcd80, 0xc392, 0x55cd80, 0xc399,
+ 0x61cd80, 0xc3a0, 0x65cd80, 0xc3a8, 0x69cd80, 0xc3ac, 0x6ecd80, 0xc7b9,
+ 0x6fcd80, 0xc3b2, 0x75cd80, 0xc3b9, 0x41cd81, 0xc381, 0x43cd81, 0xc486,
+ 0x45cd81, 0xc389, 0x47cd81, 0xc7b4, 0x49cd81, 0xc38d, 0x4ccd81, 0xc4b9,
+ 0x4ecd81, 0xc583, 0x4fcd81, 0xc393, 0x52cd81, 0xc594, 0x53cd81, 0xc59a,
+ 0x55cd81, 0xc39a, 0x59cd81, 0xc39d, 0x5acd81, 0xc5b9, 0x61cd81, 0xc3a1,
+ 0x63cd81, 0xc487, 0x65cd81, 0xc3a9, 0x67cd81, 0xc7b5, 0x69cd81, 0xc3ad,
+ 0x6ccd81, 0xc4ba, 0x6ecd81, 0xc584, 0x6fcd81, 0xc3b3, 0x72cd81, 0xc595,
+ 0x73cd81, 0xc59b, 0x75cd81, 0xc3ba, 0x79cd81, 0xc3bd, 0x7acd81, 0xc5ba,
+ 0x55cd84, 0xc797, 0x75cd84, 0xc798, 0
+ };
+
+  CStdStringA strDest;
+
+  strDest.reserve(strSourceDest.length());
+
+  int i = 0;
+  // Look for a base byte followed by a two-byte combining mark (lead byte 0xcc or 0xcd).
+  for (; i < (int)strSourceDest.size() - 2; ++i)
+  {
+    int kar = (unsigned char)strSourceDest[i];
+    int kar1 = (unsigned char)strSourceDest[i+1];
+    int kar2 = (unsigned char)strSourceDest[i+2];
+
+    if (((kar1 == 0xcc) || (kar1 == 0xcd)) && (kar2 >= 0x80))
+    {
+      // Pack the three NFD bytes into one key for the table lookup.
+      int nfd = (kar << 16) | (kar1 << 8) | kar2;
+      bool skip = false;
+
+      for (int j = 0; NFD_NFC_tupels[j]; j+=2)
+      {
+        if (NFD_NFC_tupels[j] == nfd)
+        {
+          // Emit the two-byte precomposed form and skip the combining mark.
+          strDest += NFD_NFC_tupels[j+1] >> 8;
+          strDest += NFD_NFC_tupels[j+1] & 0xff;
+          skip = true;
+          i+=2;
+          break;
+        }
+      }
+      if(skip) continue;
+    }
+    strDest += kar;
+  }
+  // Copy the trailing bytes, which are too few to hold a base plus combining mark.
+  for (; i < (int)strSourceDest.size(); ++i)
+    strDest += strSourceDest[i];
+  strSourceDest = strDest;
+}
+
void CCharsetConverter::utf8To(const CStdStringA& strDestCharset, const CStdStringA& strSource, CStdStringA& strDest)
{
if (strDestCharset == "UTF-8")
1  xbmc/utils/CharsetConverter.h
@@ -46,6 +46,7 @@ class CCharsetConverter
void utf8ToStringCharset(CStdStringA& strSourceDest);
void utf8ToSystem(CStdStringA& strSourceDest);
+ void utf8ToUtf8NFC(CStdStringA& strSourceDest);
void utf8To(const CStdStringA& strDestCharset, const CStdStringA& strSource, CStdStringA& strDest);
void utf8To(const CStdStringA& strDestCharset, const CStdStringA& strSource, CStdString16& strDest);