Problem with the CP950 codec #19

Open
TheElementalOfDestruction opened this issue Jun 20, 2023 · 15 comments

@TheElementalOfDestruction

Unfortunately, the way that Python handles the big5/cp950 codec does not match the information I have been able to find on it, leading to some documents raising encoding errors. It's a very deep problem and I can see it existing even in Python 3.11. My initial impression was that it has to do with a naming conflict in encodings leading to cp950 not being the same one as the RTF docs are referencing, but big5 has the same issue, so I'm not sure. I spent a while trying to go through the CPython source code to see if I could figure out what was going wrong, but the relevant sections are confusing to me and don't appear to have any helpful documentation.

I did end up finding an old issue bringing this up though, so it's a documented issue: python/cpython#72879

This issue currently affects the dev branch, but I suspect it also affects the current release. I got some new test files and this is causing several of them to fail decoding because of multibyte sequences.

Basically, multibyte sequences encoded in cp950 are inaccurately parsed. I found an entry on Wikipedia which explains how to appropriately parse them, and Outlook's handling confirms this method to be the one Microsoft uses. The Wikipedia page is here; however, I will also leave a screenshot of the relevant section, which explains how to map to the correct Unicode character value:
[Screenshot: relevant section of the Wikipedia page]

In RTF, the sequences are written, for whatever reason, not as Unicode characters, but as a series of \'HH controls. For example, I saw \'84\'68 in one of my files. Using the method listed on Wikipedia, I translated that to U+F0B7, and then managed to confirm that was the same way that Outlook translated it, as the deencapsulated RTF had that Unicode character in UTF-8.
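As a rough illustration of that translation, here is a minimal sketch of the arithmetic. The range table is an assumption on my part (it is how I recall the Big5 EUDC ranges being listed on Wikipedia) and the function names are made up for this example; the only thing it has been checked against is the \'84\'68 case above.

```python
import re

# Each lead byte covers 157 trail bytes: 63 in 0x40-0x7E plus 94 in 0xA1-0xFE.
TRAILS_PER_LEAD = 157

# (first lead byte, last lead byte, first private-use code point).
# Assumed table; the 0xC6A1-0xC8FE block is deliberately left out.
EUDC_RANGES = (
    (0x81, 0x8D, 0xEEB8),
    (0x8E, 0xA0, 0xE311),
    (0xFA, 0xFE, 0xE000),
)


def eudc_to_unicode(lead, trail):
    """Map one EUDC lead/trail byte pair to a private-use character."""
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        offset = trail - 0xA1 + 63
    else:
        raise ValueError(f"invalid trail byte 0x{trail:02X}")
    for first, last, base in EUDC_RANGES:
        if first <= lead <= last:
            return chr(base + (lead - first) * TRAILS_PER_LEAD + offset)
    raise ValueError(f"lead byte 0x{lead:02X} is outside the assumed EUDC ranges")


def hex_escapes_to_bytes(rtf_text):
    """Collect the bytes written as \\'HH controls in a piece of RTF."""
    return bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf_text))


raw = hex_escapes_to_bytes(r"Hello \'84\'68}")
try:
    raw.decode("cp950")
except UnicodeDecodeError as exc:
    print("built-in cp950:", exc)
print(f"U+{ord(eudc_to_unicode(raw[0], raw[1])):04X}")  # U+F0B7
```

The lead byte 0x84 is three rows past 0x81 and the trail byte 0x68 is 40 positions past 0x40, so the offset is 3 * 157 + 40 = 511, and 0xEEB8 + 511 = 0xF0B7.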

(Unfortunately, I suspect this issue may actually be the root of an issue created before on extract-msg, and proper support for this encoding will require extract-msg to handle it properly as well.)

@TheElementalOfDestruction
Author

For clarification, the CP950 codec does not support the user defined area, which Microsoft includes in their version. That is what is causing the failure. For codes not in the user defined area, things are fine. The full mapping can be seen in this document, which I am using to add my own form of support to extract-msg: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt

Edit: After inspecting the reference, it appeared to be missing much of the lower mapping, starting with a high byte of 0x9D, so I was going to use the Wikipedia page to populate the missing ones. Never mind, I was misunderstanding the file.

@seamustuohy
Owner

Thanks for the in-depth overview. Am I correctly understanding that the issue only occurs when RTFDE is attempting to parse CP950 encoded text which uses characters in the Unicode private use area?

@TheElementalOfDestruction
Author

I think it may be more correct to say it only happens when the RTF uses data from the Big5 EUDC area, but yes that's the issue.

@seamustuohy
Owner

OK, if it's just Big5 EUDC characters and it's a result of an upstream bug then I don't think I will delay the next release for it. But, if folks find this issue in the future and can provide test files and/or pull requests to address the upstream issue I would not be against implementing an appropriate fix.

@TheElementalOfDestruction
Author

Yeah, no worries. I went and added some form of support into the upcoming version of extract-msg myself, and I plan to test against a file that previously failed while identifying itself as Big5 to see if that resolved it. I've done so by adding the codec under the name "windows-950", in the event you want to use that when you get around to it. I'll be doing what I can to ensure support is properly maintained on my end.

The only failures I get are on an extremely low number of files, part of a new set of private test data I got. I think I managed to actually devise a small test file based off of them that threw the same error, so I'll find it and upload it for you. I wouldn't consider this particularly urgent at the moment.
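For anyone wondering what adding a codec under the name "windows-950" can look like, below is a minimal sketch using `codecs.register`. It is not the extract-msg implementation; the helper names and the range table are assumptions for illustration, the 0xC6A1-0xC8FE block is not handled, and encoding simply delegates to the built-in cp950.

```python
import codecs

# (first lead byte, last lead byte, first private-use code point); assumed table.
_EUDC_RANGES = ((0x81, 0x8D, 0xEEB8), (0x8E, 0xA0, 0xE311), (0xFA, 0xFE, 0xE000))


def _eudc_char(lead, trail):
    """Return the private-use character for an EUDC pair, or None."""
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        offset = trail - 0xA1 + 63
    else:
        return None
    for first, last, base in _EUDC_RANGES:
        if first <= lead <= last:
            return chr(base + (lead - first) * 157 + offset)
    return None


def _decode(data, errors="strict"):
    raw = bytes(data)  # bytes.decode() may hand the codec a memoryview
    out = []
    i = 0
    while i < len(raw):
        if raw[i] < 0x80:  # plain ASCII byte
            out.append(chr(raw[i]))
            i += 1
            continue
        pair = raw[i:i + 2]
        char = _eudc_char(pair[0], pair[1]) if len(pair) == 2 else None
        if char is None:
            # Not in the EUDC area: let the built-in codec handle the pair.
            char = pair.decode("cp950", errors)
        out.append(char)
        i += len(pair)
    return "".join(out), len(raw)


def _encode(text, errors="strict"):
    # Sketch only: private-use characters are not encoded back to EUDC bytes.
    return text.encode("cp950", errors), len(text)


def _search(name):
    # codecs.lookup() normalizes the name; newer Pythons turn "-" into "_".
    if name.replace("-", "_") == "windows_950":
        return codecs.CodecInfo(name="windows-950", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(ascii(b"Hello \x84\x68".decode("windows-950")))  # 'Hello \uf0b7'
```

Because the registration happens at import time, any module that imports this code makes the name available to every later `bytes.decode("windows-950")` call in the same process.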

@TheElementalOfDestruction
Author

Alright, here is a test file that can raise the error. Word successfully parses this, showing "Hello" followed by a space then a single character, though it's either one it doesn't know how to render or is meant to look like a box.

{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fnil\fcharset2 Symbol;}
{\f5\fswiss\fcharset136 New MingLiu;}
{\f6\fswiss\fcharset0 "Century Gothic";}
{\f7\fswiss\fcharset0 "Arial";}
{\f8\fswiss "Calibri Light";}
{\f9\fswiss\fcharset0 "Courier New";}}
{\colortbl\red0\green0\blue0;\red5\green99\blue193;}
\uc1\pard\plain\deftab360 \f5\fs24
Hello \'84\'68}

@TheElementalOfDestruction
Author

Decided to do a test and modified RTFDE to use my implementation of CP950 and that fixes all the failing files, so implementing the Microsoft version of the codec is all that is needed to fix the issue.

@seamustuohy
Owner

seamustuohy commented Jul 14, 2023 via email

@seamustuohy
Owner

I've implemented your fix in dev but tests don't seem to pass when run on my machine. I am using a test file created using the code you provided in this thread. You can find the test here.

Based on the way the code is implemented, it looks like if the windows-950 codec is not available by default on the operating system in question, it will still fail in the way it does for me. So, this might be a dependency issue I can address by updating the way the install is done. But I don't know what package you have that is pulling in the windows-950 codec.

I get the below error when running these tests.

======================================================================
ERROR: test_windows_950_codec (tests.deencapsulate.test_de_encapsulate.TestTextDecoding)
https://github.com/seamustuohy/RTFDE/issues/19
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/user/code/RTFDE/tests/deencapsulate/test_de_encapsulate.py", line 363, in test_windows_950_codec
    rtf_obj.deencapsulate()
  File "/home/user/code/RTFDE/RTFDE/deencapsulate.py", line 119, in deencapsulate
    Decoder.update_children(self.full_tree)
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 682, in update_children
    obj.children = [i for i in self.iterate_on_children(children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 682, in <listcomp>
    obj.children = [i for i in self.iterate_on_children(children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 755, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 755, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 742, in iterate_on_children
    decoded_hex = decode_hex_char(base_bytes, current_codec)
  File "/home/user/code/RTFDE/RTFDE/text_extraction.py", line 634, in decode_hex_char
    decoded = item.decode(codec)
UnicodeDecodeError: 'cp950' codec can't decode byte 0x84 in position 0: illegal multibyte sequence

@TheElementalOfDestruction
Author

Ah, it seems like there was a bit of a miscommunication. My modification will only cause the issue to be fixed if extract-msg is imported when the test is run. It doesn't matter when it is imported, so long as it is imported before the fonts are checked for a file.

The encoding is added by extract-msg (specifically the next version of extract-msg on the next-release branch), so that's why. The alternative is to basically just have the encoding defined on both modules. There isn't really an ideal solution, my solution just makes it so no one else will notice the issue while using extract-msg.
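Since the codec only exists once something has registered it, one defensive option is to check for it at runtime and fall back to the built-in one. A minimal sketch follows; `pick_codec` is a hypothetical helper, not part of RTFDE or extract-msg.

```python
import codecs


def pick_codec(preferred="windows-950", fallback="cp950"):
    """Return `preferred` if some module has registered it, else `fallback`."""
    try:
        codecs.lookup(preferred)
    except LookupError:
        return fallback
    return preferred


# Prints "windows-950" only if something (for example an earlier import)
# registered that name in this process; otherwise prints "cp950".
print(pick_codec())
```

With the fallback, EUDC sequences would still fail to decode, but the failure would come from the built-in cp950 codec rather than from an unknown codec name.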

@seamustuohy
Owner

seamustuohy commented Jul 22, 2023 via email

@gwiedeman

I seem to be getting this same error when giving .deencapsulate() a cp1252 encoded RTF file with an invalid character.

body = mail.rtfBody
deencapsultor = RTFDE.DeEncapsulator(body)
deencapsultor.deencapsulate()
> Traceback (most recent call last):
  File "/mailbagit/test.py", line 41, in <module>
    deencapsultor.deencapsulate()
  File "/usr/local/lib/python3.11/site-packages/RTFDE/deencapsulate.py", line 119, in deencapsulate
    Decoder.update_children(self.full_tree)
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in update_children
    obj.children = [i for i in self.iterate_on_children(children)]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 671, in <listcomp>
    obj.children = [i for i in self.iterate_on_children(children)]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in iterate_on_children
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 744, in <listcomp>
    item.children = [i for i in self.iterate_on_children(item.children)]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 731, in iterate_on_children
    decoded_hex = decode_hex_char(base_bytes, current_codec)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/RTFDE/text_extraction.py", line 623, in decode_hex_char
    decoded = item.decode(codec)
              ^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'cp950' codec can't decode byte 0x83 in position 0: illegal multibyte sequence

chardetect thinks it's ASCII, but it is wrong, since it's only one character

body = mail.rtfBody
print (chardet.detect(body))
> {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

If I write the binary to a .rtf,

with open("/data/msg_bugs/issues/test-default.rtf", "wb") as f:
	f.write(body)

And open it in a text editor, the first line appears to say that it is cp1252 which is consistent with the related data.

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl

The character itself seems to be \'83\'c0:

\htmlrtf{\f4\fs20\htmlrtf0 \'83\'c0 Please consider our environment before printing this e-mail.\htmlrtf\f0}\htmlrtf0 \par

The character does not show even in Outlook reading the native .msg. I get a

So, what I think is happening here is that I have an email that was erroneously converted to cp1252, but RTFDE is treating it as cp950 for some reason.

I would expect RTFDE to use the encoding listed in the file, though for this case I could still see it raising a similar error. It would be cool if it also had an option to write a placeholder for an invalid character instead of raising an exception.

If listing the encoding in the first line of RTF like that is standard practice (I have no idea), it would also be super cool to have an option to send RTFDE a binary RTF and have it return both the text and the encoding.

I have no idea how hard or feasible changes would be, but at least wanted to document. Thanks for maintaining!
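To make the two requests above concrete, here is a small sketch of what they could look like outside of RTFDE. Both helpers are hypothetical and use only the standard library; note that `\ansicpg` gives only the document default, so a font's own charset can still select a different code page for a run of text.

```python
import re


def declared_ansi_codepage(rtf_bytes):
    """Return the number after \\ansicpg in the RTF header, or None."""
    match = re.search(rb"\\ansicpg(\d+)", rtf_bytes)
    return int(match.group(1)) if match else None


def decode_lenient(raw, codec):
    """Decode, substituting U+FFFD for undecodable bytes instead of raising."""
    return raw.decode(codec, errors="replace")


header = rb"{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl"
print(declared_ansi_codepage(header))  # 1252
print(ascii(decode_lenient(b"\x83\xc0", "cp950")))
```

The second call shows the trade-off: no exception is raised, but the original bytes cannot be recovered from the replacement characters afterwards.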

@seamustuohy
Owner

seamustuohy commented Nov 15, 2023 via email

@TheElementalOfDestruction
Author

The character does not show even in Outlook reading the native .msg. I get a

This would be consistent with Outlook reading the data there as Microsoft CP950, specifically for a private use character. Being unable to render it does not necessarily mean the data is invalid, just that your Outlook is likely unable to figure out how to interpret the character as it was encoded.

I confirmed the consistency by running the two bytes through extract-msg's windows-950 encoding and got the matching Unicode character. As such, I would expect the f4 part of the font table to use fcharset136.

@gwiedeman

Sure, below is the header (from what I can tell). Also happy to share the .msg and/or exported .rtf directly.

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss\fcharset136 MingLiu;}}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;}
\uc1\pard\plain\deftab360 \f0\fs20

Thanks for maintaining.
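The header above does show \f4 declared with \fcharset136, which would explain why text under \f4 is decoded as cp950 even though the document default is \ansicpg1252. A minimal sketch of reading that relationship out of a font table follows; the charset-to-codec table is a small assumed subset, the function name is made up, and this is not how RTFDE itself parses fonts.

```python
import re

# Assumed subset of \fcharset values and the Python codecs they imply.
FCHARSET_TO_CODEC = {
    128: "cp932",  # Shift JIS
    129: "cp949",  # Hangul
    134: "cp936",  # GB2312
    136: "cp950",  # Big5
}

HEADER = r"""{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss\fcharset136 MingLiu;}}"""


def font_codecs(rtf_header, default="cp1252"):
    """Map each font number to a codec name based on its \\fcharset."""
    result = {}
    for number, controls in re.findall(r"\{\\f(\d+)([^{};]*)", rtf_header):
        charset = re.search(r"\\fcharset(\d+)", controls)
        value = int(charset.group(1)) if charset else None
        # Charsets not in the table (including 0 and 2) fall back to the default.
        result[int(number)] = FCHARSET_TO_CODEC.get(value, default)
    return result


print(font_codecs(HEADER)[4])  # cp950
```

Falling back to the default for the Symbol charset is a simplification; a real implementation would need a dedicated mapping for it.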
