Encode problems on parse() function #22
Comments
Theoretically, using unicode types would work with Python 2 ( |
@DaWy I would check two things: either that string is already bytes, or the file system encoding is not UTF-8.
    if isinstance(filename, unicode):
        filename = filename.encode(sys.getfilesystemencoding())
    else:
        raise Exception("You should be using unicode throughout your module until it leaves for filesystem or network, etc")
    lib.MediaInfoA_Open(handle, filename, 0)
If it is already bytes and you tell Python to encode it to UTF-8, it has to first decode it to unicode before encoding to UTF-8, and ASCII is the default encoding in Python 2. I'm guessing you already have a byte string there and that is the cause. Look further back at where filename comes from and make sure it is converted to unicode on entry to your program. If this string is coming via a CLI argument, then it is already in the proper byte encoding. |
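A minimal sketch of the check described above, written for Python 3, where `str` plays the role of Python 2's `unicode`. The helper name `to_fs_bytes` is hypothetical and not part of pymediainfo; `os.fsencode` is the standard-library call that encodes text with the file system encoding. Unlike the snippet above, this version passes bytes through unchanged instead of raising.

```python
import os
import sys


def to_fs_bytes(filename):
    """Return filename as bytes in the file system encoding (hypothetical helper)."""
    if isinstance(filename, bytes):
        # Already encoded: leave it alone so it is never encoded twice.
        return filename
    if isinstance(filename, str):
        # Text is encoded exactly once, with the file system encoding.
        return os.fsencode(filename)
    raise TypeError("expected str or bytes, got %s" % type(filename).__name__)


# Show which encoding the interpreter uses for file names on this system.
print(sys.getfilesystemencoding())
print(to_fs_bytes("20151110_180916.mp4"))
```

The point of the type check is that the encoding step happens once, at the boundary where the name leaves the program.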
@miigotu you probably meant to ping me, not DaWy? The main issue here is that I was not able to make the ctypes code work with Python 2 and accentuated files although it worked with Python 3. |
I would suggest using six in that case. I'm guessing the library still has some issues with py2, since parse still causes a segfault for some of my users? =P |
IMO developers should be aware of str/unicode. I dropped the dependency on six a while back as there is very little specific code. |
I would, but can't reproduce it myself. It works ok when the user does mediainfo on the file on the command line, but crashes the entire app with pymediainfo. When I have more info I will let you know. |
On my install, I solved this issue by changing the source from:
to: This is caused when the character is in the upper range (ASCII CODE) and being encoded to UTF-8 translates those into a two byte value. It also works if I drop the encode call completely. |
I don't really understand how this works for you: unicode strings can't be encoded as ASCII: |
ASCII strings stored in a wide char are not automatically Unicode; at best, they match up to the UTF-16 version of Unicode. The filesystem directory is returning ASCII (or perhaps UTF-16), but not UTF-8. In UTF-8 you need two bytes for any characters above 127. When certain ASCII characters (128-255) are sent to the encode function, you will get an encode error if the source is not ASCII or UTF-16. That being said, I just came across a file that did not work, and removing encode completely from the argument solved all the encoding errors in my several-thousand-file scan. So I guess on some operating systems, just remove the encode call. Armand
|
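To make the byte counts in this comment concrete, here is a small Python 3 sketch. The helper names and sample characters are chosen for illustration only. It shows that a character just above code point 127 takes two bytes in UTF-8 (code points from 2048 upward take three or four), and that such a character cannot be encoded as ASCII at all.

```python
def utf8_length(char):
    """Number of bytes the character occupies once encoded as UTF-8."""
    return len(char.encode("utf-8"))


def encodes_as_ascii(text):
    """True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


for sample in ("a", "à", "·", "😇"):
    print(repr(sample), utf8_length(sample), encodes_as_ascii(sample))
```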
I ended up doing the same as @apwelsh, changing encoding to UTF-8. Sorry for the late feedback. |
I am confused: you changed the filename.encode line to |
Or use lib.MediaInfoW_Open and not encode at all... |
You mean |
@miigotu @DaWy @apwelsh if any of you managed to make the mediainfo library open a file containing accents (like in this test https://github.com/sbraz/pymediainfo/blob/master/tests/test.py#L79) with Python 2, I'm all ears. |
I will test with an accented filename and report back. I am on Mac though so my results may differ Armand
|
So, I renamed a file to include à in the filename, and it worked fine after removing the .encode() statement from the open command. Perhaps that is because the à is already in UTF-8 format by default. Of course, this behavior is for an OS X system in the US locale. To ensure I was truly in Unicode, I renamed the same file to include 😇 as one of the filename characters, and it processed it correctly too. I am stumped, since the documentation on all these libraries is very poor, and I am too rusty on my C++ to spend the time working through the logic when it is working now.
|
MediaInfoA_Open requires ASCII. |
That much is understood. But UTF-8 is ASCII compatible, in that it will generate 8-bit characters without any nulls, as opposed to UTF-16, which generates a lot of nulls and thus breaks ASCII systems. Without examining the source and the full data flow, we cannot know for sure why, on a Mac, Unicode works fine when passed to the ANSI version of open. My suspicion is that the mediainfo library relies on the OS for opening the file, and the OS libraries are allowing UTF-8 encoded filenames to be resolved correctly. This is why I was clear to point out that I am on a U.S. locale Mac: this configuration likely results in UTF-8, which works fine as-is, but will of course break if you try to encode UTF-8 to ASCII. Armand
|
Utf-8 is not ASCII compatible. |
If you simply ensure it is left as Unicode and use MediaInfo_Open(), it will work on all systems as long as you cast it with ctypes |
https://tools.ietf.org/html/rfc3629 A null-terminated char array will correctly store a UTF-8 encoded string; it cannot correctly store any other Unicode encoding, because of the dependence on null characters in the byte array. In this way, a function that receives char* can handle UTF-8 encoded Unicode without prematurely terminating the text parsing methods. This is what UTF-8 was invented for, because UTF-16 would result in an early termination when parsing a null-terminated string if the data was not correctly encoded. This is not a statement about best practice, nor about what the code ought to do. It is a statement of fact, based on observations on a system using Unicode (including emoticons) in a filename and verifying the passing results. Surely, the correct solution is to use the right method for each platform, the same way MediaInfo_DLL.py does. There is no one solution that works on all platforms. But as I have proven already, UTF-8 is compatible with ASCII (within reason), but not the other way around. Armand
|
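The null-byte argument above can be checked with a short Python 3 sketch. The file name is an arbitrary example and `contains_null` is a hypothetical helper. It only demonstrates the byte-level property of the two encodings, not what MediaInfoA_Open does with the bytes it receives.

```python
def contains_null(data):
    """True if the byte string holds a zero byte, which would end a C string early."""
    return b"\x00" in data


name = "à.mp4"
utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# UTF-8 never emits a zero byte for a non-NUL character, so a char* reader sees the whole name.
print(utf8, contains_null(utf8))
# UTF-16 pads every character in this name with a zero byte.
print(utf16, contains_null(utf16))
# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("abc".encode("utf-8") == "abc".encode("ascii"))
```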
null termination has nothing to do with the fact that the mediainfo method MediaInfoA cannot handle mbcs, it expects 1 byte per character. |
That depends on what MediaInfoA_Open does with the byte array. And ignore me as you will, on a Mac it works just fine. Again, this is not about what should be done for cross-platform use; it is a statement about the observed effects of this improper use. Yes, the code should use MediaInfo_Open for Windows, DOS, OS/2 and CE. For Darwin (Mac) and Linux, it should use MediaInfoA_Open. This has to do with the source, which is C++ and receives a String argument for the Open method. I have taken the time to look up the source code and refresh my memory of C++ (which I hate with a passion) to put this matter to rest. When running on any DOS, OS/2 or Windows variant, you must use wchar as the argument to enable Unicode, as these systems ignore Unicode on char. On every other operating system, use char, because this is a String which is of type char, and Unicode is always in play. Armand
|
@apwelsh MediaInfoDLL still does not work on Python 2 with accentuated file names on Linux, see the upstream issue which I opened. It does not matter if I use the wchar or the char variant, it fails. |
@apwelsh I think you don't realize how much I know what I'm talking about. And no, I use Linux, and I program in languages numbering in the upper teens. @sbraz that's my advice on what you just said: it can work on Linux with accents if you pass everything just right (unless there is a MAJOR bug upstream).
I wouldn't have an app with 70,000 users translated into about 20 languages if I didn't understand character encoding. |
When using the char variant of open, did you pass the filename without calling the encode method on it? To fix the issue on my system, I removed the encode statement. This mirrors the MediaInfoDLL.py behavior. If you read my last comment, the calls that take wchar arguments are for Microsoft operating systems. On all other systems, it should be char. This too is documented in the MediaInfoDLL.py sample code. I do have a Linux VM and can test, but it will take a day to get my system prepped with all the access it needs for this, since I only use it for very specialized AWS development purposes (Java, JS, but no Python). Armand
|
All modern operating systems use multibyte encoding. Windows was the LAST one to adopt it. |
Yes, it does, but on wchar. For backward compatibility with the old-style codepages from DOS, Windows treats char as codepage-mapped and wchar as native Unicode. All other operating systems support Unicode with char. If you take care to define the call (as you recently added to the functions), then Python seems to convert the filename to work with wchar. Without it, a segfault still occurs. But still remove the .encode("utf8") conversion regardless of the function you use, so that the Mac doesn't throw a segfault and so that the UnicodeDecodeError does not occur. The encode() call is what is causing this error. If you do both, it will work on the Mac. Armand
|
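As a rough illustration of the per-platform split discussed in the last few comments, the sketch below picks a ctypes argument type from the platform name. The function `filename_argument` and its `platform` parameter are hypothetical and are not pymediainfo's or MediaInfoDLL.py's actual code; no MediaInfo library is loaded here. In real use, one would presumably also declare `argtypes` on the loaded function, which is the "definition" the comment above refers to.

```python
import ctypes
import os
import sys


def filename_argument(filename, platform=sys.platform):
    """Build a ctypes value for a hypothetical Open call (illustrative only)."""
    if platform.startswith("win"):
        # Wide-character variant: hand the text over as wchar_t*.
        return ctypes.c_wchar_p(filename)
    # char variant: hand over bytes in the file system encoding.
    return ctypes.c_char_p(os.fsencode(filename))


arg = filename_argument("20151110_180916.mp4")
print(type(arg).__name__, arg.value)
```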
OK, so if the current code passes tests on Mac OS, Windows and Linux, I think it's time for me to make a new release. This issue will still remain open though as I still run into it on Linux with Python 2 (it works on Windows). |
If the final version uses:
and: Then, yes. It will work. I think the definition was the missing piece in our first round of tests to make it work on Macs.
|
It does, indeed :) https://github.com/sbraz/pymediainfo/blob/master/pymediainfo/__init__.py#L86 |
I’m still setting up my linux VM to run some tests. I’ll see what I can find.
|
Hi,
Having problems with files with quotes and special chars like:
/home/dmartin/mounted/Multimedia/01-Incomming/2015-11-17_Instal·lació DUES TEXTIL/20151110_180916.mp4
Throws up:
Full Traceback: