Problem with the CP950 codec #19

TheElementalOfDestruction · 2023-06-20T20:44:33Z

Unfortunately, the way that Python handles the big5/cp950 codec does not match the information I have been able to find on it, which leads to some documents raising decoding errors. It's a very deep problem, and I can see it existing even in Python 3.11. My initial impression was that a naming conflict in the encodings meant that Python's cp950 is not the same one the RTF docs are referencing, but big5 has the same issue, so I'm not sure. I spent a while going through the CPython source code to see if I could figure out what was going wrong, but the relevant sections are confusing to me and don't appear to have any helpful documentation.

I did end up finding an old issue bringing this up, though, so it's a documented issue: python/cpython#72879

This issue currently affects the dev branch, but I suspect it also affects the current release. I got some new test files, and several of them fail to decode because of multibyte sequences.

Basically, multibyte sequences encoded in cp950 are parsed inaccurately. I found an entry on Wikipedia which explains how to parse them appropriately, and Outlook's handling confirms this method to be the one Microsoft uses. The Wikipedia page is here; however, I will also leave a screenshot of the relevant section, which explains how to map to the correct Unicode character value:

In RTF, the sequences are written, for whatever reason, not as Unicode characters but as a series of \'HH controls. For example, I saw \'84\'68 in one of my files. Using the method listed on Wikipedia, I translated that to U+F0B7, and then managed to confirm that Outlook translated it the same way, as the deencapsulated RTF had that Unicode character in UTF-8.
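The arithmetic behind that translation can be sketched as follows. This is a minimal illustration, assuming the EUDC ranges for Microsoft's code page 950 as they are listed on Wikipedia's Big5 page; the function names and the `EUDC_RANGES` table are made up for this example and are not taken from extract-msg or RTFDE.

```python
# Assumed EUDC ranges for Windows code page 950 (per Wikipedia's Big5 page):
# (first lead byte, last lead byte, first trail byte, first Unicode code point)
EUDC_RANGES = [
    (0xFA, 0xFE, 0x40, 0xE000),
    (0x8E, 0xA0, 0x40, 0xE311),
    (0x81, 0x8D, 0x40, 0xEEB8),
    (0xC6, 0xC8, 0xA1, 0xF6B1),
]


def trail_index(trail):
    """Position of a trail byte inside a 157-cell Big5 row, or None if invalid."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63  # 63 cells precede 0xA1 in each row
    return None


def eudc_to_code_point(lead, trail):
    """Map an EUDC byte pair to a private use code point, or None if not EUDC."""
    idx = trail_index(trail)
    if idx is None:
        return None
    for first, last, start_trail, base in EUDC_RANGES:
        if first <= lead <= last:
            offset = (lead - first) * 157 + idx - trail_index(start_trail)
            return base + offset if offset >= 0 else None
    return None


# \'84\'68 from the RTF: 3 full rows past 0x81, then cell 40 in the row.
print(hex(eudc_to_code_point(0x84, 0x68)))
```

For `\'84\'68` this gives 0xEEB8 + 3 * 157 + 40, which is the U+F0B7 value mentioned above.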

(Unfortunately, I suspect this issue may actually be the root of an issue created before on extract-msg, and proper support for this encoding will require extract-msg to handle it properly as well.)

TheElementalOfDestruction · 2023-06-21T03:39:49Z

For clarification, the CP950 codec does not support the user defined area, which Microsoft includes in their version. That is what is causing the failure. For codes not in the user defined area, things are fine. The full mapping can be seen in this document, which I am using to add my own form of support to extract-msg: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
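The difference is easy to reproduce with the built-in codec alone. The snippet below is a small sketch; `try_decode` is a helper invented for this illustration, and the two byte pairs are a standard Big5 code and the EUDC code from the first comment.

```python
def try_decode(raw, codec="cp950"):
    """Decode raw bytes, returning a short failure note instead of raising."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


print(try_decode(b"\xa4\xa4"))  # a standard Big5 code, decodes normally
print(try_decode(b"\x84\x68"))  # an EUDC code, rejected by Python's cp950
```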

Edit: ~~After inspecting the reference, it appears to be missing much of the lower mapping, starting with a high byte of 0x9D, so I'm having to use the Wikipedia page for populating the missing ones.~~ Never mind, I was misunderstanding the file.

seamustuohy · 2023-06-21T15:06:15Z

Thanks for the in-depth overview. Am I correctly understanding that the issue only occurs when RTFDE is attempting to parse CP950 encoded text which uses characters in the Unicode private use area?

TheElementalOfDestruction · 2023-06-21T15:13:59Z

I think it may be more correct to say it only happens when the RTF uses data from the Big5 EUDC area, but yes that's the issue.

seamustuohy · 2023-06-21T15:17:54Z

OK, if it's just Big5 EUDC characters and it's a result of an upstream bug then I don't think I will delay the next release for it. But, if folks find this issue in the future and can provide test files and/or pull requests to address the upstream issue I would not be against implementing an appropriate fix.

TheElementalOfDestruction · 2023-06-21T16:20:46Z

Yeah, no worries. I went and added some form of support into the upcoming version of extract-msg myself, and I plan to test against a file that previously failed while identifying itself as Big5 to see if that resolved it. I've done so by adding the codec under the name "windows-950", in the event you want to use that when you get around to it. I'll be doing what I can to ensure support is properly maintained on my end.
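For anyone wondering what registering a codec under that name can look like, here is a rough sketch using `codecs.register`. It is not the extract-msg implementation: the EUDC ranges are assumed from Wikipedia's Big5 page, every helper name is hypothetical, and the encoder simply delegates to Python's cp950, so it cannot encode EUDC characters.

```python
import codecs

# Assumed EUDC ranges: (first lead, last lead, first trail, first code point).
_EUDC_RANGES = [
    (0xFA, 0xFE, 0x40, 0xE000),
    (0x8E, 0xA0, 0x40, 0xE311),
    (0x81, 0x8D, 0x40, 0xEEB8),
    (0xC6, 0xC8, 0xA1, 0xF6B1),
]


def _trail_index(trail):
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    return None


def _eudc(lead, trail):
    idx = _trail_index(trail)
    if idx is None:
        return None
    for first, last, start_trail, base in _EUDC_RANGES:
        if first <= lead <= last:
            offset = (lead - first) * 157 + idx - _trail_index(start_trail)
            return base + offset if offset >= 0 else None
    return None


def decode(data, errors="strict"):
    """Decode Big5 bytes, handling EUDC pairs before falling back to cp950."""
    data = bytes(data)
    out, i = [], 0
    while i < len(data):
        if data[i] < 0x80:  # plain ASCII byte
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        cp = _eudc(pair[0], pair[1]) if len(pair) == 2 else None
        out.append(chr(cp) if cp is not None else pair.decode("cp950", errors))
        i += len(pair)
    return "".join(out), len(data)


def encode(text, errors="strict"):
    # Simplification: no EUDC support when encoding.
    return codecs.encode(text, "cp950", errors), len(text)


def _search(name):
    # Lookup names are normalized; newer Pythons turn hyphens into underscores.
    if name.replace("-", "_") == "windows_950":
        return codecs.CodecInfo(encode, decode, name="windows-950")
    return None


codecs.register(_search)
print(repr(b"Hello \x84\x68".decode("windows-950")))
```

Because the registration is process-wide, any code that later asks for "windows-950" will find it, regardless of which module registered it.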

The only failures I get are on an extremely low number of files, part of a new set of private test data I got. I think I managed to actually devise a small test file based off of them that threw the same error, so I'll find it and upload it for you. I wouldn't consider this particularly urgent at the moment.

TheElementalOfDestruction · 2023-06-21T20:31:13Z

Alright, here is a test file that can raise the error. Word successfully parses this, showing "Hello" followed by a space then a single character, though it's either one it doesn't know how to render or is meant to look like a box.

{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fnil\fcharset2 Symbol;}
{\f5\fswiss\fcharset136 New MingLiu;}
{\f6\fswiss\fcharset0 "Century Gothic";}
{\f7\fswiss\fcharset0 "Arial";}
{\f8\fswiss "Calibri Light";}
{\f9\fswiss\fcharset0 "Courier New";}}
{\colortbl\red0\green0\blue0;\red5\green99\blue193;}
\uc1\pard\plain\deftab360 \f5\fs24
Hello \'84\'68}

TheElementalOfDestruction · 2023-07-09T02:04:14Z

Decided to do a test and modified RTFDE to use my implementation of CP950 and that fixes all the failing files, so implementing the Microsoft version of the codec is all that is needed to fix the issue.

seamustuohy · 2023-07-14T05:28:24Z

Great, I'd happily review a pull request before the release.

seamustuohy · 2023-07-22T11:26:12Z

I've implemented your fix in dev but tests don't seem to pass when run on my machine. I am using a test file created using the code you provided in this thread. You can find the test here.

Based on the way the code is implemented, it looks like it will still fail the way it does for me if the windows-950 codec is not available by default on the operating system in question. So this might be a dependency issue I can address by updating the way the install is done. But I don't know which package you have that is pulling in the windows-950 codec.

I get the below error when running these tests.

======================================================================
ERROR: test_windows_950_codec (tests.deencapsulate.test_de_encapsulate.TestTextDecoding)
https://github.com/seamustuohy/RTFDE/issues/19
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/user/code/RTFDE/tests/deencapsulate/test_de_encapsulate.py", line 363, in test_windows_950_codec
    rtf_obj.deencapsulate()
  File "/home/user/code/RTFDE/RTFDE/deencapsulate.py", line 119, in deencapsulate
    Decoder.update_children(self.full_tree)
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 682, in update_children
    obj.children = [i for i in self.iterate_on_children(children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 682, in <listcomp>
    obj.children = [i for i in self.iterate_on_children(children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 755, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 755, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 742, in iterate_on_children
    decoded_hex = decode_hex_char(base_bytes, current_codec)
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 634, in decode_hex_char
    decoded = item.decode(codec)
UnicodeDecodeError: 'cp950' codec can't decode byte 0x84 in position 0: illegal multibyte sequence

TheElementalOfDestruction · 2023-07-22T19:10:37Z

Ah, it seems like there was a bit of a miscommunication. My modification will only cause the issue to be fixed if extract-msg is imported when the test is run. It doesn't matter when it is imported, so long as it is imported before the fonts are checked for a file.

The encoding is added by extract-msg (specifically the next version of extract-msg on the next-release branch), so that's why. The alternative is to basically just have the encoding defined on both modules. There isn't really an ideal solution, my solution just makes it so no one else will notice the issue while using extract-msg.
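The "use it only if something has registered it" behaviour can be sketched roughly like this. It is an assumption about the general shape of such a check, not the actual RTFDE or extract-msg code; `pick_big5_codec` is a hypothetical helper.

```python
import codecs


def pick_big5_codec():
    """Prefer a registered 'windows-950' codec, else fall back to cp950."""
    try:
        codecs.lookup("windows-950")
        return "windows-950"
    except LookupError:
        # Nothing (such as extract-msg) has registered the codec in this process.
        return "cp950"


print(pick_big5_codec())
```

In a process where nothing registers the codec, this returns "cp950" and EUDC bytes still fail, which matches the traceback above.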

seamustuohy · 2023-07-22T19:14:35Z

Ahhh, in that case I'll remove it and allow extract-msg to handle it downstream. If enough others have the same issue then I can look into a full fix at this level.

gwiedeman · 2023-11-15T18:17:36Z

I seem to be getting this same error when giving .deencapsulate() a cp1252 encoded RTF file with an invalid character.

body = mail.rtfBody
deencapsultor = RTFDE.DeEncapsulator(body)
deencapsultor.deencapsulate()
> Traceback (most recent call last):
  File "/mailbagit/test.py", line 41, in <module>
    deencapsultor.deencapsulate()
  File "/usr/local/lib/python3.11/site-packages/RTFDE/deencapsulate.py", line 119, in deencapsulate
    Decoder.update_children(self.full_tree)
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in update_children
    obj.children = [i for i in self.iterate_on_children(children)]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in <listcomp>
    obj.children = [i for i in self.iterate_on_children(children)]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 731, in iterate_on_children
    decoded_hex = decode_hex_char(base_bytes, current_codec)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 623, in decode_hex_char
    decoded = item.decode(codec)
              ^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'cp950' codec can't decode byte 0x83 in position 0: illegal multibyte sequence

chardetect thinks it's ascii, but it is wrong, since it's only one character:

body = mail.rtfBody
print (chardet.detect(body))
> {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
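One likely reason for the ascii result is that the RTF body itself contains only ASCII bytes: the non-ASCII data is written as `\'HH` escapes. A small sketch, using the line quoted further down in this comment, shows how the real bytes can be recovered. The regex is deliberately naive (it does not account for an escaped backslash before the quote), and `hex_escapes_to_bytes` is a name made up for this example.

```python
import re

HEX_ESCAPE = re.compile(rb"\\'([0-9a-fA-F]{2})")


def hex_escapes_to_bytes(rtf):
    """Collect the bytes written as \\'HH escapes in a chunk of RTF."""
    return b"".join(bytes.fromhex(h.decode("ascii")) for h in HEX_ESCAPE.findall(rtf))


line = (rb"\htmlrtf{\f4\fs20\htmlrtf0 \'83\'c0 Please consider our environment "
        rb"before printing this e-mail.\htmlrtf\f0}\htmlrtf0 \par")

print(line.isascii())              # the RTF text itself is pure ASCII
print(hex_escapes_to_bytes(line))  # the two bytes that actually need decoding
```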

If I write the binary to a .rtf,

with open("/data/msg_bugs/issues/test-default.rtf", "wb") as f:
	f.write(body)

And open it in a text editor, the first line appears to say that it is cp1252 which is consistent with the related data.

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl

The character itself seems to be 83\:

\htmlrtf{\f4\fs20\htmlrtf0 \'83\'c0 Please consider our environment before printing this e-mail.\htmlrtf\f0}\htmlrtf0 \par

The character does not show even in Outlook reading the native .msg. I get a 

So, what I think is happening here is that I have an email that was erroneously converted to cp1252, but RTFDE is treating it as cp950 for some reason.

I would expect RTFDE to use the encoding listed in the file, though for this case I could still see it raising a similar error. It would be cool if it also had an option to write a replacement character for an invalid character instead of raising an exception.
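At the level of plain Python decoding, that kind of fallback is available through the `errors` argument of `bytes.decode`. The sketch below only illustrates the idea; `decode_lenient` is a hypothetical helper and not an existing RTFDE option.

```python
def decode_lenient(raw, codec):
    """Decode strictly first; on failure, substitute U+FFFD for bad bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return raw.decode(codec, errors="replace")


print(repr(decode_lenient(b"\x83\xc0", "cp950")))
```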

If listing the encoding in the first line of RTF like that is standard practice (I have no idea), it would also be super cool to have an option to send RTFDE a binary rtf and it return both the text and the encoding.
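Reading the declared code page out of a binary RTF header is straightforward with a regular expression. This is a minimal sketch with a made-up helper name; note that `\ansicpg` only gives the document default, and the font in effect may select a different charset, which could be what is happening with `\f4` here.

```python
import re


def declared_code_page(rtf):
    """Return the number from the first \\ansicpgN control word, or None."""
    match = re.search(rb"\\ansicpg(\d+)", rtf)
    return int(match.group(1)) if match else None


header = rb"{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl"
print(declared_code_page(header))
```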

I have no idea how hard or feasible changes would be, but at least wanted to document. Thanks for maintaining!

seamustuohy · 2023-11-15T18:25:31Z

Thank you. Could you provide the full font table object from the header? (The one starting with `fonttbl`). That will identify which encoding \f4 is using and help track down what might be the issue here. If you can provide the entire header or full message that would be even more helpful.

TheElementalOfDestruction · 2023-11-15T20:09:54Z

> The character does not show even in Outlook reading the native .msg. I get a

This would be consistent with Outlook reading the data there as Microsoft CP950, specifically for a private use character. Being unable to render it does not necessarily mean the data is invalid, just that your Outlook is likely unable to figure out how to interpret the character as it was encoded.

I confirmed the consistency by running the two bytes through extract-msg's windows-950 encoding and got the matching Unicode character. As such, I would expect the \f4 entry of the font table to use \fcharset136.

gwiedeman · 2023-11-15T20:14:23Z

Sure, below is the header (from what I can tell). Also happy to share the .msg and/or exported .rtf directly.

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss\fcharset136 MingLiu;}}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;}
\uc1\pard\plain\deftab360 \f0\fs20

Thanks for maintaining.
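The font table above can be checked mechanically. The sketch below maps each font number to a Python codec name through its `\fcharset` value; the regex and the small `CHARSET_TO_CODEC` table are simplifications written for this example (only a few East Asian charsets are listed), not RTFDE's actual lookup.

```python
import re

# A few \fcharset values and the Windows code pages they correspond to.
CHARSET_TO_CODEC = {128: "cp932", 129: "cp949", 134: "cp936", 136: "cp950"}

FONT_ENTRY = re.compile(rb"\{\\f(\d+)[^{}]*?\\fcharset(\d+)")


def font_codecs(rtf):
    """Map font numbers to codec names; None means 'use the document default'."""
    return {int(num): CHARSET_TO_CODEC.get(int(cs))
            for num, cs in FONT_ENTRY.findall(rtf)}


header = rb"""{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss\fcharset136 MingLiu;}}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;}
\uc1\pard\plain\deftab360 \f0\fs20"""

print(font_codecs(header))
```

With this header, `\f4` resolves to cp950 even though the document default is `\ansicpg1252`, which would explain why the `\'83\'c0` bytes are decoded as cp950.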

TheElementalOfDestruction mentioned this issue Jul 14, 2023

Update to try to use windows-950 when extract-msg is imported #20

Merged

TheElementalOfDestruction mentioned this issue Jul 31, 2023

Cannot decode using CP950 (and possibly others) due to the Python implementation differing from the Microsoft implementation TeamMsgExtractor/msg-extractor#373

Closed
