eml_analyzer --text throws cyrillic characters instead of german umlauts #17

Closed
mrchang0815 opened this issue Jul 25, 2023 · 2 comments · Fixed by #18

@mrchang0815

Hi,

I'm facing an issue when parsing outgoing mails:

Extract from the .eml file (saved via "Save As" in Thunderbird):
"...
This is a cryptographically signed message in MIME format.

--------------ms090908070501060903060609
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

ä

..."

If the plain text contains, e.g., an "ä", the output character becomes the Cyrillic "д", independent of the chosen output format ("--text" or "--text --format json").

For comparison, an extract from an .eml file that gets parsed correctly:

"...
--eTqZtiOboXMORarM2jeks2PNUJpOw=_O7X
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
...
"

Any idea?

I have attached a test file.

Regards,
MrChang.
test.zip

@wahlflo wahlflo self-assigned this Aug 2, 2023
@wahlflo wahlflo linked a pull request Aug 4, 2023 that will close this issue
@wahlflo
Owner

wahlflo commented Aug 4, 2023

Hi,
I can confirm the bug on Linux; on Windows it works fine. It looks like the return value of the method Message.get_payload from the Python standard library depends on the OS.

Input: Dies ist ein dämlicher Test.
Output on Windows: b'Dies ist ein d\xc3\xa4mlicher Test.'
Output on Linux: b'Dies ist ein d\xe4mlicher Test.'
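
To make the two byte strings easier to read, here is a small sketch in plain Python (no eml-analyzer code involved). It shows that the Windows output is the UTF-8 encoding of the input, while the Linux output matches its Latin-1 encoding. The last step is an assumption, since the thread does not say how the bytes are decoded afterwards: if the single byte 0xE4 is later interpreted with a Cyrillic code page such as Windows-1251, it comes out as the "д" from the original report.

```python
text = "Dies ist ein dämlicher Test."

# 'ä' as UTF-8 is two bytes: 0xC3 0xA4 (matches the Windows output).
utf8_bytes = text.encode("utf-8")

# 'ä' as Latin-1 is one byte: 0xE4 (matches the Linux output).
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)
print(latin1_bytes)

# Assumption: a later step decodes the bytes with a Cyrillic code page.
# In Windows-1251 the byte 0xE4 is the letter 'д'.
print(latin1_bytes.decode("cp1251"))
```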

I added a unit test that checks this behaviour on Ubuntu and Windows: #18
To find and fix the root cause, one needs to analyze the Message.get_payload method from the Python standard library.

@wahlflo wahlflo added the bug label Aug 4, 2023
@wahlflo
Owner

wahlflo commented Aug 4, 2023

So @mrchang0815,

The difference between Windows and Linux arose because the default encoding used when reading the file differs between the two. I have now set it explicitly to utf-8.
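
As an illustration of the platform difference, and not the actual eml-analyzer code, the sketch below writes the test sentence as UTF-8 to a temporary file and reads it back twice: once with an explicit `encoding="utf-8"` and once with `cp1252`, which is assumed here as a stand-in for a typical Windows default.

```python
import os
import tempfile

raw = "Dies ist ein dämlicher Test.".encode("utf-8")

# A temporary file stands in for the .eml file on disk.
with tempfile.NamedTemporaryFile(delete=False, suffix=".eml") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Explicit encoding: the result no longer depends on the platform.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()

    # Assumed Windows-like default: the two UTF-8 bytes of 'ä'
    # are read as two separate characters.
    with open(path, encoding="cp1252") as f:
        as_cp1252 = f.read()
finally:
    os.remove(path)

print(explicit)
print(as_cp1252)
```

One possible reading of the earlier outputs, offered as a guess: text read as cp1252 keeps each UTF-8 byte as its own character, so turning it back into bytes may still yield 0xC3 0xA4, whereas text read as UTF-8 holds a single "ä" (U+00E4).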

I had a look at the email library; it looks like the value 8bit of the Content-Transfer-Encoding header is not taken into account.

I will create an issue for the standard library. In the meantime, I have implemented a fix in eml-analyzer.
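
The actual fix lives in the linked pull request and is not reproduced here. Purely as a sketch of one possible workaround, the hypothetical helper `part_text` below (not an eml-analyzer or standard-library function) skips transfer decoding for 8bit parts that were parsed from text and otherwise decodes the payload bytes with the part's declared charset. It assumes a non-multipart part.

```python
import email


def part_text(part):
    """Hypothetical helper: return the text of a non-multipart part."""
    cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
    if cte == "8bit":
        # 8bit means the body is not transfer-encoded, so a payload that
        # is already a str can be used as it is.
        payload = part.get_payload()
        if isinstance(payload, str):
            return payload
    # Otherwise undo base64 / quoted-printable and decode the bytes
    # with the declared charset (falling back to UTF-8).
    raw = part.get_payload(decode=True)
    charset = part.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


source = (
    "Content-Type: text/plain; charset=UTF-8; format=flowed\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "ä\n"
)
msg = email.message_from_string(source)
print(repr(part_text(msg)))
```

Another option would be to read the file in binary mode and parse it with `email.message_from_bytes`, so that no text decoding happens before the parser sees the data.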
