Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full Unicode support on Windows #16

Closed
mgeier opened this issue Apr 13, 2015 · 9 comments
Closed

Full Unicode support on Windows #16

mgeier opened this issue Apr 13, 2015 · 9 comments

Comments

@mgeier
Copy link

mgeier commented Apr 13, 2015

Thanks for the idea with sys.getfilesystemencoding(), I stole it from you in https://github.com/bastibe/PySoundFile/pull/119!

However, this doesn't seem to be enough to support all Unicode filenames on Windows, there are Unicode code points that cannot be encoded using the mbcs encoding, see https://github.com/bastibe/PySoundFile/pull/129.

I think this issue would be a good benchmark: libsndfile/libsndfile#74

@vokimon
Copy link
Owner

vokimon commented Dec 3, 2015

Please, could you elaborate more about the problem? I have no windows machine to test in and i have no clue about either what your expected behaviour and the actual behaviour are.

Just to clarify, the policy here is:

  • you provide bytes, then the module will use them as they were already encoded using file system coding.
    • bad encoded bytes could end up with I/O errors or garbage names
  • If you provide unicode, the module will try to encode it to the file system encoding.
    • if the system default is unable to encode some unicode point, weird things happen, maybe here some transliteration policy may help, maybe you refer to this. But it is not clear from the report.

@mgeier
Copy link
Author

mgeier commented Dec 4, 2015

Both the problem and the solution are described in detail in https://github.com/bastibe/PySoundFile/pull/129.

If you provide bytes, everything is OK (It's the user's problem to get the encoding right).
If you provide unicode, getfilesystemencoding() is not enough on Windows. You'll have to use sf_wchar_open() to support all Unicode characters.

@vokimon
Copy link
Owner

vokimon commented Dec 7, 2015

I take @batisbe's rant: Why Windows!! :-)

There is a problem here, you are right. But the test code you point to misses to highlight it because it is just using decode and encode wrongly and fails because of that.

So for further discussion it would be very helpful to have:

  • Code, expectation and result
  • Which python version are you using, in order to know the actual type of the literals bytes or unicode, and to know which getfilesystemencoding behavior are we getting as it changed in 3.2 according documentation.
  • Which windows version you getting the result on, as they have different charset handling
  • Which is the default codepage (run chcp) in the test system.

And remember:

  • unicode -> encode -> bytes
  • bytes -> decode -> unicode
  • Also unicode != utf-8 or utf-16, something in utf-8 or utf-16 would be a bytes. unicode is abstract representation, utf-8 and utf-16 are encodings (into bytes) able to represent the abstract representation.

That said, let's take a look to the actual problem:
python-wavefile is using sf_open like it accepted bytes encoded using the filesystem encoding for the filename. But that's not true. In Windows, every 'A' suffixed API function takes the bytes like they where encoding the local codepage. Not to the filesystem encoding which is ucs2 or utf-16 in modern windows (NT/2000 and beyond) but the code page which is locale dependant and more restrictive than the filesystem itself. In modern windows 'A' suffixed API's transliterate the codepage encoded string to UTF-16 and calls the 'W' suffixed version of the API.

Why the test case you describe should work? If the locale codepage is compatible with the unicode chars you are using on the filename, getfilesystemencoding() will return mbcs which is not an specific encoding at all but a way of saying 'whichever windows codepage is related to the current locale'. If the locale is set to Korean, then we are encoding unicode into korean and then windows API is transcoding korean into utf-16.

When would we get the actual error? When we are trying to use a filename which is not representable with the locale codepage. Just because the system itself is able to do so but we are using a middle man, the codepage, that makes it fail. And yes in order to be able to do that we should use the wchar version.

A test case would be a name taking chars from different codepages: cyrilic, korean... so for sure it will fail whichever your codepage.

Let me write a test using just korean that should not fail in your computer and a test that should fail in every computer. If we got that, lets do the fix.

@vokimon vokimon added the bug label Dec 7, 2015
@vokimon
Copy link
Owner

vokimon commented Dec 7, 2015

I just commited two unit tests that should break in windows, i would like to have a red before proceeding with the fix, since i can't test it myself:

  • test_write_unicodeFilename_koreanCodepage: It uses Korean code page (the one we were getting problems, i guess it should work in a Korean computer)
  • test_write_unicodeFilename_multipleCodePage: It uses a mix of Cyrilic and Korean, my guest is that it should fail whichever the code page is in Windows.

Could you run them in a Korean windows and paste the output so that we can confirm we have a failing case? Would be nice having it run in both Py2 and Py3. And give, please the information I asked above (windows version, locale codepage)

@vokimon vokimon added this to the python-wavefile-1.5 milestone Dec 7, 2015
@vokimon vokimon self-assigned this Dec 7, 2015
@mgeier
Copy link
Author

mgeier commented Dec 7, 2015

I'm normally testing this stuff on virtualbox using an image from http://modern.ie. The Korean computer I was talking about was from a co-worker of mine, I currently don't have access to it.

I tried the file names from your new tests, and they all work in PySoundFile (with the above-mentioned virtualbox image). The characters couldn't be displayed in the command prompt and in the IPython terminal either. But it worked when pasting the names into a .py file and saving it as UTF 8 (with Notepad). I don't have the time to test this on python-wavefile, sorry.

But I'm quite confident that if you use my solution it will just work.

I'm not sure if your tests are sensible, since you are using your library for both writing and reading. If there is an error in the name conversion, but the same error happens in both writing and reading, you'll not be able to detect it. This makes this stuff really unpleasant to test ...

BTW, the file name in your tests (查找問題.wav) is definitely not Korean, I assume it's Chinese.

@vokimon
Copy link
Owner

vokimon commented Dec 7, 2015

It's important how you encode the file, but it is also important how do you declare the data types and how do you use encoding and decoding afterwards. Please could you run the tests, as is, in a Windows machine so that we could have a failing test? I am preparing the commit with your solution but i would like to have a failing test first.

If symbols are Chinese then both cases should fail on a non Chinese windows computer. And then it is normal it failed in a Korean locale, because it was transcoded from unicode to unicode through the Korean encoding.

For the tests i am relying on Python standard library os.listdir function. https://docs.python.org/2/library/os.html#os.listdir https://docs.python.org/3/library/os.html#os.listdir Using an unicode parameter, the listing is retrieved as unicode as well. If some bad transcoding happens we will see it.

Anyway, that's why i want a failing test, so that we can show up the failing case. Could you run it?

I just updated the test to have one with just korean and the other with both Chinese and Russian.

@mgeier
Copy link
Author

mgeier commented Dec 7, 2015

As I said, I don't have time for this, sorry.

I missed the os.listdir thing before, that looks good!

@vokimon
Copy link
Owner

vokimon commented Mar 7, 2016

As i cannot test the code my self i publish the fix proposal in the unicode_for_windows branch. I will appreciate anyone interested that could help me testing the fix (and the tests making them fail without the fix).
Commit 161e15a

@vokimon
Copy link
Owner

vokimon commented Oct 28, 2023

The new way libsndfile handles filenames in windows using utf8 should solve this and other issues. Or will pop others. Any way this concrete issue is no more relevant. File new ones with the new libsndfile behaviour if you hit them.

https://github.com/libsndfile/libsndfile/blob/master/CHANGELOG.md#changed

@vokimon vokimon closed this as completed Oct 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants