Problem with the CP950 codec #19
Comments
For clarification, Python's CP950 codec does not support the user-defined area, which Microsoft includes in their version. That is what is causing the failure. For codes not in the user-defined area, things are fine. The full mapping can be seen in this document, which I am using to add my own form of support to extract-msg: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt Edit: |
Thanks for the in-depth overview. Am I correctly understanding that the issue only occurs when RTFDE is attempting to parse CP950 encoded text which uses characters in the Unicode private use area? |
I think it may be more correct to say it only happens when the RTF uses data from the Big5 EUDC area, but yes that's the issue. |
OK, if it's just Big5 EUDC characters and it's a result of an upstream bug then I don't think I will delay the next release for it. But, if folks find this issue in the future and can provide test files and/or pull requests to address the upstream issue I would not be against implementing an appropriate fix. |
Yeah, no worries. I went and added some form of support into the upcoming version of extract-msg myself, and I plan to test against a file that previously failed while identifying itself as Big5 to see if that resolved it. I've done so by adding the codec under the name "windows-950", in the event you want to use that when you get around to it. I'll be doing what I can to ensure support is properly maintained on my end. The only failures I get are on an extremely low number of files, part of a new set of private test data I got. I think I managed to actually devise a small test file based off of them that threw the same error, so I'll find it and upload it for you. I wouldn't consider this particularly urgent at the moment. |
Alright, here is a test file that can raise the error. Word successfully parses this, showing "Hello" followed by a space then a single character, though it's either one it doesn't know how to render or is meant to look like a box.
{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fnil\fcharset2 Symbol;}
{\f5\fswiss\fcharset136 New MingLiu;}
{\f6\fswiss\fcharset0 "Century Gothic";}
{\f7\fswiss\fcharset0 "Arial";}
{\f8\fswiss "Calibri Light";}
{\f9\fswiss\fcharset0 "Courier New";}}
{\colortbl\red0\green0\blue0;\red5\green99\blue193;}
\uc1\pard\plain\deftab360 \f5\fs24
Hello \'84\'68} |
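To make the failing input concrete, here is a minimal sketch of how the `\'HH` escapes in the test file turn into raw bytes, and how the font's `\fcharset` value points at a codec. The `FCHARSET_TO_CODEC` table and the `hex_escapes_to_bytes` helper are hypothetical names for illustration, not RTFDE's actual code, and the table is only a small subset of the charset values.

```python
import re

# Small, illustrative subset of RTF \fcharset values and the Python codec
# usually associated with each one (136 is the Big5 charset used by \f5).
FCHARSET_TO_CODEC = {0: "cp1252", 128: "cp932", 134: "cp936", 136: "cp950"}

def hex_escapes_to_bytes(rtf_text: str) -> bytes:
    """Collect every \\'HH escape in the text into one bytes object."""
    return bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf_text))

raw = hex_escapes_to_bytes(r"Hello \'84\'68")
print(raw.hex(), FCHARSET_TO_CODEC[136])  # 8468 cp950
```

Those two bytes, decoded with the codec chosen for `\fcharset136`, are what trigger the error.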
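Decided to do a test and modified RTFDE to use my implementation of CP950, and that fixes all the failing files, so implementing the Microsoft version of the codec is all that is needed to fix the issue. |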
Great, I'd happily review a pull request before the release.
…On Sun, Jul 9, 2023, 5:04 AM Destiny Peterson ***@***.***> wrote:
Decided to do a test and modified RTFDE to use my implementation of CP950
and that fixes all the failing files, so implementing the Microsoft version
of the codec is all that is needed to fix the issue.
|
I've implemented your fix in dev but tests don't seem to pass when run on my machine. I am using a test file created using the code you provided in this thread. You can find the test here. Based on the way the code is implemented, it looks like if the windows-950 codec is not available by default on the operating system in question, it will still fail in the way it does for me. So, this might be a dependency issue I can address by updating the way the install is done. But, I don't know what package you have that is pulling in the windows-950 codec. I get the below error when running these tests.
|
Ah, it seems like there was a bit of a miscommunication. My modification will only cause the issue to be fixed if extract-msg is imported when the test is run. It doesn't matter when it is imported, so long as it is imported before the fonts are checked for a file. The encoding is added by extract-msg (specifically the next version of extract-msg on the next-release branch), so that's why. The alternative is to basically just have the encoding defined on both modules. There isn't really an ideal solution, my solution just makes it so no one else will notice the issue while using extract-msg. |
Ahhh, in that case I'll remove it and allow extract-msg to handle it
downstream. If enough others have the same issue then I can look into a
full fix at this level.
…On Sat, Jul 22, 2023, 8:10 PM Destiny Peterson ***@***.***> wrote:
Ah, it seems like there was a bit of a miscommunication. My modification
will only cause the issue to be fixed if extract-msg is imported when the
test is run. It doesn't matter when it is imported, so long as it is
imported before the fonts are checked for a file.
The encoding is added by extract-msg (specifically the next version of
extract-msg on the next-release branch), so that's why. The alternative is
to basically just have the encoding defined on both modules. There isn't
really an ideal solution, my solution just makes it so no one else will
notice the issue while using extract-msg.
|
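I seem to be getting this same error when giving `.deencapsulate()` a cp1252 encoded RTF file with an invalid character.
If I write the binary to a `.rtf` and open it in a text editor, the first line appears to say that it is cp1252, which is consistent with the related data.
The character itself seems to be the `\'83\'c0` sequence.
The character does not show even in Outlook reading the native .msg.
So, what I think is happening here is that I have an email that was erroneously converted to cp1252, but RTFDE is treating it as cp950 for some reason.
I would expect RTFDE to use the encoding listed in the file, though for this case I could still see it raising a similar error. It would be cool if it also had an option to write a placeholder for an invalid character instead of raising an exception.
If listing the encoding in the first line of RTF like that is standard practice (I have no idea), it would also be super cool to have an option to send RTFDE a binary rtf and have it return both the text and the encoding.
I have no idea how hard or feasible changes would be, but at least wanted to document. Thanks for maintaining! |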
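As a rough illustration of the two ideas in this comment (reading the code page named in the header, and substituting a placeholder rather than raising), here is a small sketch. `header_codepage` and `lenient_decode` are hypothetical helpers, not part of RTFDE's API, and the header code page is only a document default: a font's `\fcharset` can select a different codec for a given run of text.

```python
import re

def header_codepage(rtf: bytes):
    """Return the codec named by \\ansicpgN in the RTF header, or None."""
    match = re.search(rb"\\ansicpg(\d+)", rtf[:1024])
    return f"cp{int(match.group(1))}" if match else None

def lenient_decode(raw: bytes, codec: str) -> str:
    """Decode, substituting U+FFFD for undecodable bytes instead of raising."""
    return raw.decode(codec, errors="replace")

header = rb"{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl"
print(header_codepage(header))  # cp1252
print(ascii(lenient_decode(b"\x83\xc0", "cp950")))
```

The `errors="replace"` handler is standard Python behavior for `bytes.decode`; whether RTFDE should expose something like it is a separate design question.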
Thank you. Could you provide the full font table object from the header?
(The one starting with `fonttbl`). That will identify which encoding \f4 is
using and help track down what might be the issue here. If you can provide
the entire header or full message that would be even more helpful.
…On Wed, Nov 15, 2023, 1:17 PM Gregory Wiedeman ***@***.***> wrote:
I seem to be getting this same error when giving .deencapsulate() a
cp1252 encoded RTF file with an invalid character.
body = mail.rtfBody
deencapsultor = RTFDE.DeEncapsulator(body)
deencapsultor.deencapsulate()
> Traceback (most recent call last):
File "/mailbagit/test.py", line 41, in <module>
deencapsultor.deencapsulate()
File "/usr/local/lib/python3.11/site-packages/RTFDE/deencapsulate.py", line 119, in deencapsulate
Decoder.update_children(self.full_tree)
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in update_children
obj.children = [i for i in self.iterate_on_children(children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in <listcomp>
obj.children = [i for i in self.iterate_on_children(children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
item.children = [i for i in self.iterate_on_children(item.children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
item.children = [i for i in self.iterate_on_children(item.children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
item.children = [i for i in self.iterate_on_children(item.children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
item.children = [i for i in self.iterate_on_children(item.children)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 731, in iterate_on_children
decoded_hex = decode_hex_char(base_bytes, current_codec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 623, in decode_hex_char
decoded = item.decode(codec)
^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'cp950' codec can't decode byte 0x83 in position 0: illegal multibyte sequence
chardetect thinks it's ASCII, but it is wrong since it's only one character
body = mail.rtfBody
print (chardet.detect(body))
> {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
If I write the binary to a .rtf,
with open("/data/msg_bugs/issues/test-default.rtf", "wb") as f:
f.write(body)
And open it in a text editor, the first line appears to say that it is
cp1252 which is consistent with the related data.
{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl
The character itself seems to be 83\:
\htmlrtf{\f4\fs20\htmlrtf0 \'83\'c0 Please consider our environment before printing this e-mail.\htmlrtf\f0}\htmlrtf0 \par
The character does not show even in Outlook reading the native .msg. I get
a
So, what I think is happening here is that I have an email that was
erroneously converted to cp1252, but RTFDE is treating it as cp950 for
some reason.
I would expect RTFDE to use the encoding listed in the file, though for
this case I could still see it raising a similar error. It would be cool if
it also had an option to write a placeholder for an invalid character
instead of raising an exception.
If listing the encoding in the first line of RTF like that is standard
practice (I have no idea), it would also be super cool to have an option to
send RTFDE a binary rtf and it return both the text and the encoding.
I have no idea how hard or feasible changes would be, but at least wanted
to document. Thanks for maintaining!
|
This would be consistent with Outlook reading the data there as Microsoft CP950, specifically for a private use character. Being unable to render it does not necessarily mean the data is invalid, just that your Outlook is likely unable to figure out how to interpret the character as it was encoded. I confirmed the consistency by running the two bytes through extract-msg's windows-950 encoding and got the matching unicode character. As such, I would expect the f4 part of the font table to use |
Sure, below is the header (from what I can tell). Also happy to share the .msg and/or exported .rtf directly.
Thanks for maintaining. |
Unfortunately, the way that Python handles the big5/cp950 codec does not match the information I have been able to find on it, leading to some documents raising encoding errors. It's a very deep problem and I can see it existing even in Python 3.11. My initial impression was that it has to do with a naming conflict in encodings leading to cp950 not being the same one the RTF docs are referencing, but big5 has the same issue, so I'm not sure. I spent a while going through the CPython source code to see if I could figure out what was going wrong, but the relevant sections are confusing to me and don't appear to have any helpful documentation.
I did end up finding an old issue bringing this up though, so it's a documented issue: python/cpython#72879
This issue currently affects the dev branch, but I suspect it also affects the current release. I got some new test files and this is causing several of them to fail decoding because of multibyte sequences.
Basically, multibyte sequences encoded in cp950 are inaccurately parsed. I found an entry on Wikipedia which explains how to appropriately parse them, and Outlook's handling confirms this method to be the one Microsoft uses. The Wikipedia page is here, however I will also leave a screenshot of the relevant section which explains how to map to the correct Unicode character value:
[Screenshot: the relevant section of the Wikipedia page, showing how to map these codes to the correct Unicode character value]
In RTF, the sequences are written, for whatever reason, not as Unicode characters, but as a series of `\'HH` controls. For example, I saw `\'84\'68` in one of my files. Using the method listed on Wikipedia, I translated that to U+F0B7, and then managed to confirm that was the same way that Outlook translated it, as the deencapsulated RTF had that Unicode character in UTF-8.
(Unfortunately, I suspect this issue may actually be the root of an issue created before on extract-msg, and proper support for this encoding will require extract-msg to handle it properly as well.)
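The arithmetic behind that translation can be sketched as follows. The range table is my reading of the Wikipedia mapping for the Big5 user-defined areas (an assumption, and it leaves out the partial 0xC6A1 to 0xC8FE range); `eudc_to_unicode` is a hypothetical helper name. Each Big5 row holds 157 trail bytes, so the private use code point is the range base plus a row offset plus a column offset.

```python
# Big5 rows have 157 trail bytes: 0x40-0x7E (63) followed by 0xA1-0xFE (94).
ROW_SIZE = 157

# (first lead byte, last lead byte, first private use code point); assumed ranges
EUDC_RANGES = [
    (0xFA, 0xFE, 0xE000),
    (0x8E, 0xA0, 0xE311),
    (0x81, 0x8D, 0xEEB8),
]

def eudc_to_unicode(lead: int, trail: int) -> str:
    """Map a Big5 EUDC byte pair to its private use character."""
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63
    else:
        raise ValueError(f"invalid Big5 trail byte: {trail:#04x}")
    for first, last, base in EUDC_RANGES:
        if first <= lead <= last:
            return chr(base + (lead - first) * ROW_SIZE + col)
    raise ValueError(f"lead byte {lead:#04x} is outside the EUDC ranges")

# (0x84 - 0x81) * 157 + (0x68 - 0x40) = 511, and 0xEEB8 + 511 = 0xF0B7
print(f"U+{ord(eudc_to_unicode(0x84, 0x68)):04X}")  # U+F0B7
```

The result for `\'84\'68` agrees with the U+F0B7 value reported above.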