Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode problems on parse() function #22

Closed
DaWy opened this issue Mar 11, 2016 · 32 comments
Closed

Encode problems on parse() function #22

DaWy opened this issue Mar 11, 2016 · 32 comments

Comments

@DaWy
Copy link

DaWy commented Mar 11, 2016

Hi,

Having problems with files with quotes and specials chars like:

/home/dmartin/mounted/Multimedia/01-Incomming/2015-11-17_Instal·lació DUES TEXTIL/20151110_180916.mp4

Throws up:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 63: ordinal not in range(128)

Full Traceback:

Traceback (most recent call last):
  File "OrdenaIncomming.py", line 240, in <module>
    main()
  File "OrdenaIncomming.py", line 158, in main
    mi = MediaInfo.parse(os.path.join(root, arxiu))
  File "/usr/local/lib/python2.7/dist-packages/pymediainfo/__init__.py", line 92, in parse
    lib.MediaInfoA_Open(handle, filename.encode("utf8"), 0)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 63: ordinal not in range(128)
@sbraz
Copy link
Owner

sbraz commented Mar 12, 2016

Theoretically, using unicode types would work with Python 2 (u"/home/dmartin/mounted/Multimedia/01-Incomming/2015-11-17_Instal·lació DUES TEXTIL/20151110_180916.mp4"), however it seems to cause the mediainfo library to return in an empty result. I'll look into it further to try and isolate the issue.
In the meantime, you could just use Python 3 :)

@miigotu
Copy link
Contributor

miigotu commented Apr 13, 2016

@DaWy I would check two things. Either that string is already bytes, or the file system encoding is not utf-8.

if isinstance(filename, unicode):
    filename = filename.encode(sys.getfilesystemencoding())
else:
    raise Exception("You shoul be using unicode throughout your module until it leaves for filesystem or network, etc")
lib.MediaInfoA_Open(handle, filename, 0)

If it is already bytes and you tell python to encode it to utf-8, it has to first decode it to unicode before encoding to utf-8, and ascii is the default encoding in python2.

Im this you already have a byte string there and that is the cause, look further back at where filename comes from and make sure it is converted to unicode on entry to your program.

If this string is coming via cli argument, then it is already in the proper byte encoding.

@sbraz
Copy link
Owner

sbraz commented Apr 13, 2016

@miigotu you probably meant to ping me, not DaWy? The main issue here is that I was not able to make the ctypes code work with Python 2 and accentuated files although it worked with Python 3.

@miigotu
Copy link
Contributor

miigotu commented Apr 13, 2016

I would suggest using six in that case

I'm guessing the library still has some issues with py2 since parse causes a segfault for some of my users still? =P

@sbraz
Copy link
Owner

sbraz commented Apr 13, 2016

IMO developers should be aware of str/unicode. I dropped the dependency on six a while back as there is very little specific code.
As for segfaults, they may occur when using Werkzeug's debugger but that's all I've heard of, please open another bug if you think it is relevant :)

@miigotu
Copy link
Contributor

miigotu commented Apr 13, 2016

I would, but can't reproduce it myself. It works ok when the user does mediainfo on the file on the command line, but crashes the entire app with pymediainfo.
SiCKRAGE/SiCKRAGE#1445

When I have more info I will let you know.

@apwelsh
Copy link

apwelsh commented Apr 28, 2016

On my install, I solved this issue by changing the source from:

lib.MediaInfoA_Open(handle, filename.encode("utf8"))

to:
lib.MediaInfoA_Open(handle, filename.encode("ascii"))

This is caused when the character is in the upper range (ASCII CODE) and being encoded to UTF-8 translates those into a two byte value. It also works if I drop the encode call completely.

@sbraz
Copy link
Owner

sbraz commented Apr 28, 2016

I don't really understand how this works for you: unicode strings can't be encoded as ASCII:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
I'll wait for feedback from the upstream bug.

@apwelsh
Copy link

apwelsh commented Apr 28, 2016

Ascii stings, stores in a wide char are not automatically unicode, at best, it marches up to utf16 version of unicode. The filesystem directory is returning ascii, (or perhaps utf16) but not utf 8. In utf 8 you need two bytes for any characters above 127. When certain ascii characyers are sent to the encode function (128-255) you will get an encode error if the source is not ascii, or utf16.

That being said, i just came across a file that dis not work, and removing encode completely from the argument solved all the encoding errors in my several thousand file scan. So i guess on some operating systems, just remove the encode feature.

Armand
--Sent from iPhone

On Apr 28, 2016, at 12:29 AM, Louis Sautier notifications@github.com wrote:

I don't really understand how this works for you: unicode strings can't be encoded as ASCII:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

I'll wait for feedback from the upstream bug.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub

@DaWy
Copy link
Author

DaWy commented Apr 28, 2016

I ended doing the same as @apwelsh, changing encoding to UTF-8. Sorry for the late feedback.

@sbraz
Copy link
Owner

sbraz commented Apr 28, 2016

I am confused: you changed the filename.encode line to lib.MediaInfoA_Open(handle, filename.encode("ascii"))? This is the same as lib.MediaInfoA_Open(handle, filename) for Python 2 and if the file name contains non-ASCII chars, it will fail.

@miigotu
Copy link
Contributor

miigotu commented Apr 28, 2016

Or use lib.MediaInfoW_Open and not encode at all...

@sbraz
Copy link
Owner

sbraz commented Apr 28, 2016

You mean MediaInfo_Open?

@sbraz
Copy link
Owner

sbraz commented Apr 28, 2016

@miigotu @DaWy @apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.

@apwelsh
Copy link

apwelsh commented Apr 28, 2016

I will test with an accented filename and report back. I am on Mac though so my results may differ

Armand
--Sent from iPhone

On Apr 28, 2016, at 6:26 AM, Louis Sautier notifications@github.com wrote:

@miigotu @DaWy @apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@apwelsh
Copy link

apwelsh commented Apr 29, 2016

So, I renamed a file to include à in the filename, and it worked fine after removing the .encode() statement from the open command.

Perhaps because the à is already in utf-8 format by format. Of course, this behavior is according to an OS X system in the US locale.

To ensure I was truly in Unicode, I renamed the same file to include 😇 as one of the filename characters, and it processed it correctly too.

I am stumped since the documentation on all these libraries is very poor. and I am too rusty on my C++ to spend the time working through the logic when it is working now.

On Apr 28, 2016, at 6:26 AM, Louis Sautier notifications@github.com wrote:

@miigotu https://github.com/miigotu @DaWy https://github.com/DaWy @apwelsh https://github.com/apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79 https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

MediaInfoA_Open requires acsii.

@apwelsh
Copy link

apwelsh commented Apr 29, 2016

That much is understood. But utf-8 is ascii compatible, in that it will generate 8 bit characters w/o any nulls, as opposed to utf-16 which generates a lot of nulls, and thus breaks ascii systems. Without examining the source and the full data flow, we cannot know for sure why on a mac, unicode works fine when passed to the ansi version of open. My suspicion is that mediainfo library relies on the OS for openning the file, and the OS libraries are allowing utf-8 encoded filenames to be resolved correctly. This is why i was clear to point out that I am on a U.S. Locale Mac. Because, this configuration likely results in utf-8 which works fine as-is, but of course will break if you try to encode utf-8 to ascii but will work fine if you leave it alone.

Armand
--Sent from iPhone

On Apr 28, 2016, at 10:50 PM, miigotu notifications@github.com wrote:

MediaInfoA_Open requires acsii.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

Utf-8 is not ASCII compatible.

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

If you simply assure it is left as Unicode and use MediaInfo_Open() it will work on all systems as long as you cast it with types

@apwelsh
Copy link

apwelsh commented Apr 29, 2016

https://tools.ietf.org/html/rfc3629

A null terminated Char array will correctly store a UTF-8 encoded string, it cannot correctly store any other unicode string, because of the dependence on Null characters in the byte array. In this way, a function that receives char* can handle utf-8 encode unicode without prematurely terminating the text parsing methods. This is what UTF-8 was invented for. Because UTF-16 would result in a early termination when parsing a null terminated string if the data was not correctly encoded.
Surely, if the data is processed in mediainfo directly, it would be very likely to cause problems. In the case of a Mac, using utf-8, this is not the case, which is evidence that the open function merely passes the char* or wchar* off to the OS fopen() method accordingly, and is handled correctly.

This is not a statement about best practice, nor what the code ought to do, it is a statement of fact, based on observations on a system, using unicode... Including emoticons, on a filename and verifying the passing results.

Surely, the correct solution is to use the right method for each platform, the same way MediaInfo_DLL.py does. There is no, one solution works on all platforms.

But as i have proven already, UTF-8 is compatible with ASCII (within reason) but not the other way around.

Armand
--Sent from iPhone

On Apr 29, 2016, at 10:08 AM, miigotu notifications@github.com wrote:

Utf-8 is not ASCII compatible.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

null termination has nothing to do with the fact that the mediainfo method MediaInfoA cannot handle mbcs, it expects 1 byte per character.

@apwelsh
Copy link

apwelsh commented Apr 29, 2016

That depends on what MediaInfoA_Open does with the byte array.

And ignore me as you will, on a Mac, it works just fine. And again, this is not about what should be done for cross platform use, it is a statement about observed effects on this improper use.

Yes, the code should use MediaInfo_Open for windows, dos, os2, ce. For darwin (mac) and linux, they should use MediaInfoA_Open
-- as per the actual MediaInfoDLL source.

This has to do with the source, which is C++, and receives the argument String for Open method.
As such they both are unicode based. The difference in which method to use is based on which OS you are in, as the windows OS represent the data differently than others.
Given your adamant insistence on the matter of Unicode, It is a good bet you are on a Windows system which implements Unicode only in wchar, even though char is unicode compatible. And thus the reason for the two methods in the API.

I have taken the time to look up the source code, and refresh my memory of C++ (which i hate with a passion) to put this matter to rest. When running on any DOS, OS2 or Windows variant, you must use wchar as argument to enable unicode, as these systems ignore unicode on char. When using every other operating system, use char, because this is a String which is of type char, and unicode is always in play.

Armand
--Sent from iPhone

On Apr 29, 2016, at 1:11 PM, miigotu notifications@github.com wrote:

null termination has nothing to do with the fact that the mediainfo method MediaInfoA cannot handle mbcs


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@sbraz
Copy link
Owner

sbraz commented Apr 29, 2016

@apwelsh MediaInfoDLL still does not work on Python 2 with accentuated file names on Linux, see the upstream issue which I opened. It does not matter if I use the wchar or the char variant, it fails.
I have cleaned up the code and decided to use the wchar_t functions everywhere. I don't understand why MediaInfoDLL does not use them for Linux as they are functional for me both on old and recent Linux versions, as well as Windows 7. Can you try it on Mac OS?

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

@apwelsh I think you dont realize how much I know what i'm talking about. And no, I use linux, and I program in languages numbering in the upper teens.

@sbraz thats my advice what you just said, it can work on linux with accents if you pass everything just right (unless there is a MAJOR bug upstream).

import ctypes
lib.MediaInfo_Open(handle, ctypes.c_wchar_p(unicode(filename))

lib.MediaInfoA_x is not unicode aware and unicode works fine directly in windows as well as linux
https://mediaarea.net/en-us/MediaInfo/Support/SDK/Quick_Start#Unicode

I wouldnt have an app with 70,000 users translated in like 20 languages if I didnt understand character encoding.

@apwelsh
Copy link

apwelsh commented Apr 29, 2016

When using char variant on open, did you pass the filename without using the encode method against the filename? To fix the issue on my i removed the encode statement. This mirrors the MediaInfoDLL.py behavior.

If you read my last comment, the wchar argumented calls are for Microsoft operating systems. On all other systems, it should be char. This too is documented in the MediaInfoDLL.py sample code.

I do have a Linux VM and can test but it will take a day to get my system prepped with all the access it needs for this since only use it for very specialized AWS development purposes (Java, JS, but no python)

Armand
--Sent from iPhone

On Apr 29, 2016, at 3:13 PM, Louis Sautier notifications@github.com wrote:

@apwelsh MediaInfoDLL still does not work on Python 2 with accentuated file names on Linux, see the upstream issue which I opened. It does not matter if I use the wchar or the char variant, it fails.
I have cleaned up the code and decided to use the wchar_t functions everywhere. I don't understand why MediaInfoDLL does not use them for Linux as they are functional for me both on old and recent Linux versions, as well as Windows 7. Can you try it on Mac OS?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@miigotu
Copy link
Contributor

miigotu commented Apr 29, 2016

All modern operating systems use multibyte encoding. Windows was the LAST one to adopt it.

@apwelsh
Copy link

apwelsh commented Apr 30, 2016

Yes. It does. But on wchar, for backward compatibility with the old style codepage from DOS, Windows treats char as codepage mapped and wchar as native unicode. All other operating systems support unicode with char.
The char vs wchar is not an ascii vs unicode issue.. Its a data alignment issue. Only on MS systems is it tied to unicode. The wchar vs char data alignment issue is important to understand, because the OS libraries dictate how data is stored and passed around.

If you take care to define the call (like you added to the functions recently) then python seems to convert the filename to work with wchar. Without it, a seg fault still occurs.

But still remove the .encode("utf8") conversion regardless of the function you use so that mac dont throw a sev fault, and so that the UnicodeDecodeError does not occur. The encode() call is what is causing this error. If you do both, it will work on the mac.

Armand
--Sent from iPhone

On Apr 29, 2016, at 4:57 PM, miigotu notifications@github.com wrote:

All modern operating systems use multibyte encoding. Windows was the LAST one to adopt it.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@sbraz
Copy link
Owner

sbraz commented Apr 30, 2016

OK, so if the current code passes tests on Mac OS, Windows and Linux, I think it's time for me to make a new release. This issue will still remain open though as I still run into it on Linux with Python 2 (it works on Windows).

@apwelsh
Copy link

apwelsh commented Apr 30, 2016

If the final version uses:

    lib.MediaInfo_Open.argtypes = [c_void_p, c_wchar_p]
    lib.MediaInfo_Open.restype = c_size_t

and:
lib.MediaInfo_Open(handle, filename)

Then, yes. It will work. I think the definition was the missing piece in our first round of tests to make it work on Macs.

On Apr 29, 2016, at 6:13 PM, Louis Sautier notifications@github.com wrote:

OK, so if the current code passes tests on Mac OS, Windows and Linux, I think it's time for me to make a new release. This issue will still remain open though as I still run into it on Linux with Python 2 (it works on Windows).


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

@sbraz
Copy link
Owner

sbraz commented Apr 30, 2016

It does, indeed :) https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86
I just don't understand why accentuated file names fail on Linux and Python 2 but I'll leave it to upstream as this looks like either a bug in ctypes or in libmediainfo.

@apwelsh
Copy link

apwelsh commented Apr 30, 2016

I’m still setting up my linux VM to run some tests. I’ll see what I can find.

On Apr 29, 2016, at 6:19 PM, Louis Sautier notifications@github.com wrote:

It does, indeed :) https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86 https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86
I just don't understand why accentuated file names fail on Linux and Python 2 but I'll leave it to upstream as this looks like either a bug in ctypes or in libmediainfo.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub #22 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants