Full Unicode support on Windows #16

mgeier · 2015-04-13T17:52:32Z

Thanks for the idea with sys.getfilesystemencoding(), I stole it from you in https://github.com/bastibe/PySoundFile/pull/119!

However, this doesn't seem to be enough to support all Unicode filenames on Windows, there are Unicode code points that cannot be encoded using the mbcs encoding, see https://github.com/bastibe/PySoundFile/pull/129.

I think this issue would be a good benchmark: libsndfile/libsndfile#74

The text was updated successfully, but these errors were encountered:

vokimon · 2015-12-03T18:06:45Z

Please, could you elaborate more about the problem? I have no windows machine to test in and i have no clue about either what your expected behaviour and the actual behaviour are.

Just to clarify, the policy here is:

you provide bytes, then the module will use them as they were already encoded using file system coding.
- bad encoded bytes could end up with I/O errors or garbage names
If you provide unicode, the module will try to encode it to the file system encoding.
- if the system default is unable to encode some unicode point, weird things happen, maybe here some transliteration policy may help, maybe you refer to this. But it is not clear from the report.

mgeier · 2015-12-04T09:10:03Z

Both the problem and the solution are described in detail in https://github.com/bastibe/PySoundFile/pull/129.

If you provide bytes, everything is OK (It's the user's problem to get the encoding right).
If you provide unicode, getfilesystemencoding() is not enough on Windows. You'll have to use sf_wchar_open() to support all Unicode characters.

vokimon · 2015-12-07T11:52:35Z

I take @batisbe's rant: Why Windows!! :-)

There is a problem here, you are right. But the test code you point to misses to highlight it because it is just using decode and encode wrongly and fails because of that.

So for further discussion it would be very helpful to have:

Code, expectation and result
Which python version are you using, in order to know the actual type of the literals bytes or unicode, and to know which getfilesystemencoding behavior are we getting as it changed in 3.2 according documentation.
Which windows version you getting the result on, as they have different charset handling
Which is the default codepage (run chcp) in the test system.

And remember:

unicode -> encode -> bytes
bytes -> decode -> unicode
Also unicode != utf-8 or utf-16, something in utf-8 or utf-16 would be a bytes. unicode is abstract representation, utf-8 and utf-16 are encodings (into bytes) able to represent the abstract representation.

That said, let's take a look to the actual problem:
python-wavefile is using sf_open like it accepted bytes encoded using the filesystem encoding for the filename. But that's not true. In Windows, every 'A' suffixed API function takes the bytes like they where encoding the local codepage. Not to the filesystem encoding which is ucs2 or utf-16 in modern windows (NT/2000 and beyond) but the code page which is locale dependant and more restrictive than the filesystem itself. In modern windows 'A' suffixed API's transliterate the codepage encoded string to UTF-16 and calls the 'W' suffixed version of the API.

Why the test case you describe should work? If the locale codepage is compatible with the unicode chars you are using on the filename, getfilesystemencoding() will return mbcs which is not an specific encoding at all but a way of saying 'whichever windows codepage is related to the current locale'. If the locale is set to Korean, then we are encoding unicode into korean and then windows API is transcoding korean into utf-16.

When would we get the actual error? When we are trying to use a filename which is not representable with the locale codepage. Just because the system itself is able to do so but we are using a middle man, the codepage, that makes it fail. And yes in order to be able to do that we should use the wchar version.

A test case would be a name taking chars from different codepages: cyrilic, korean... so for sure it will fail whichever your codepage.

Let me write a test using just korean that should not fail in your computer and a test that should fail in every computer. If we got that, lets do the fix.

vokimon · 2015-12-07T13:42:16Z

I just commited two unit tests that should break in windows, i would like to have a red before proceeding with the fix, since i can't test it myself:

test_write_unicodeFilename_koreanCodepage: It uses Korean code page (the one we were getting problems, i guess it should work in a Korean computer)
test_write_unicodeFilename_multipleCodePage: It uses a mix of Cyrilic and Korean, my guest is that it should fail whichever the code page is in Windows.

Could you run them in a Korean windows and paste the output so that we can confirm we have a failing case? Would be nice having it run in both Py2 and Py3. And give, please the information I asked above (windows version, locale codepage)

mgeier · 2015-12-07T14:16:57Z

I'm normally testing this stuff on virtualbox using an image from http://modern.ie. The Korean computer I was talking about was from a co-worker of mine, I currently don't have access to it.

I tried the file names from your new tests, and they all work in PySoundFile (with the above-mentioned virtualbox image). The characters couldn't be displayed in the command prompt and in the IPython terminal either. But it worked when pasting the names into a .py file and saving it as UTF 8 (with Notepad). I don't have the time to test this on python-wavefile, sorry.

But I'm quite confident that if you use my solution it will just work.

I'm not sure if your tests are sensible, since you are using your library for both writing and reading. If there is an error in the name conversion, but the same error happens in both writing and reading, you'll not be able to detect it. This makes this stuff really unpleasant to test ...

BTW, the file name in your tests (查找問題.wav) is definitely not Korean, I assume it's Chinese.

vokimon · 2015-12-07T15:07:08Z

It's important how you encode the file, but it is also important how do you declare the data types and how do you use encoding and decoding afterwards. Please could you run the tests, as is, in a Windows machine so that we could have a failing test? I am preparing the commit with your solution but i would like to have a failing test first.

If symbols are Chinese then both cases should fail on a non Chinese windows computer. And then it is normal it failed in a Korean locale, because it was transcoded from unicode to unicode through the Korean encoding.

For the tests i am relying on Python standard library os.listdir function. https://docs.python.org/2/library/os.html#os.listdir https://docs.python.org/3/library/os.html#os.listdir Using an unicode parameter, the listing is retrieved as unicode as well. If some bad transcoding happens we will see it.

Anyway, that's why i want a failing test, so that we can show up the failing case. Could you run it?

I just updated the test to have one with just korean and the other with both Chinese and Russian.

mgeier · 2015-12-07T15:29:26Z

As I said, I don't have time for this, sorry.

I missed the os.listdir thing before, that looks good!

vokimon · 2016-03-07T15:44:56Z

As i cannot test the code my self i publish the fix proposal in the unicode_for_windows branch. I will appreciate anyone interested that could help me testing the fix (and the tests making them fail without the fix).
Commit 161e15a

vokimon · 2023-10-28T16:54:03Z

The new way libsndfile handles filenames in windows using utf8 should solve this and other issues. Or will pop others. Any way this concrete issue is no more relevant. File new ones with the new libsndfile behaviour if you hit them.

https://github.com/libsndfile/libsndfile/blob/master/CHANGELOG.md#changed

vokimon added the bug label Dec 7, 2015

vokimon added this to the python-wavefile-1.5 milestone Dec 7, 2015

vokimon self-assigned this Dec 7, 2015

vokimon added the feedback requested label Dec 7, 2015

vokimon modified the milestones: python-wavefile-1.6, python-wavefile-1.5 Mar 10, 2016

vokimon closed this as completed Oct 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full Unicode support on Windows #16

Full Unicode support on Windows #16

mgeier commented Apr 13, 2015

vokimon commented Dec 3, 2015

mgeier commented Dec 4, 2015

vokimon commented Dec 7, 2015

vokimon commented Dec 7, 2015

mgeier commented Dec 7, 2015

vokimon commented Dec 7, 2015

mgeier commented Dec 7, 2015

vokimon commented Mar 7, 2016

vokimon commented Oct 28, 2023

Full Unicode support on Windows #16

Full Unicode support on Windows #16

Comments

mgeier commented Apr 13, 2015

vokimon commented Dec 3, 2015

mgeier commented Dec 4, 2015

vokimon commented Dec 7, 2015

vokimon commented Dec 7, 2015

mgeier commented Dec 7, 2015

vokimon commented Dec 7, 2015

mgeier commented Dec 7, 2015

vokimon commented Mar 7, 2016

vokimon commented Oct 28, 2023