email module get_content() yields invalid UTF8 when CTE is 8bit #105285

@dougmccasland

Python 3.10.6
module email — An email and MIME handling package v3.11.3

Consider this simple message:

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8-bit
MIME-Version: 1.0
From: Dmcc <foobar1@gmail.com>
To: Dmcc <foobar2@gmail.com>
Subject: test msg of 8bit CTE and UTF8

there is the hötel 

Notice the o-umlaut in the word "hötel"; it is encoded in UTF-8. I put this message in a file called msg.eml, then ran this:

#!/usr/bin/env python3

import email
from email.policy import default    

with open("msg.eml", "r") as f:
    msg = email.message_from_file(f, policy=default)
print('CTE: ', msg['content-transfer-encoding'])
body = msg.get_content()
print('body:', body)

The output:

CTE:  8-bit
body: there is the h�tel
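
For reference, here is a small sketch of what the replacement character in that output means at the byte level. It assumes the garbling comes from a lone non-UTF-8 byte being decoded with errors="replace"; the report does not show which bytes the email module actually produced, so the `lone` value is purely illustrative.

```python
# UTF-8 encodes "ö" (U+00F6) as the two-byte sequence C3 B6.
utf8_bytes = "ö".encode("utf-8")
print(utf8_bytes)  # b'\xc3\xb6'

# Illustrative assumption: a single 0xF6 byte (the Latin-1 form of "ö")
# is not valid UTF-8, so decoding with errors="replace" yields U+FFFD.
lone = b"h\xf6tel"
decoded = lone.decode("utf-8", errors="replace")
print(ascii(decoded))  # 'h\ufffdtel'
```

U+FFFD is the character displayed as "�" in the output above.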

I expect the output to be valid UTF-8, since the CTE is 8bit. This problem also happens with the older get_payload() and with any of the "_from" functions, such as email.message_from_bytes().
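
To check the message_from_bytes() case without depending on how msg.eml is opened or on the locale, a file-free variant of the script might look like the sketch below. The variable name `raw` and the use of ascii() for printing are my own choices, not part of the original script; the message text is copied from above and encoded explicitly as UTF-8.

```python
import email
from email.policy import default

# The same message as above, built in memory; the body is explicitly UTF-8.
raw = (
    "Content-Type: text/plain; charset=UTF-8; format=flowed\n"
    "Content-Transfer-Encoding: 8-bit\n"
    "MIME-Version: 1.0\n"
    "From: Dmcc <foobar1@gmail.com>\n"
    "To: Dmcc <foobar2@gmail.com>\n"
    "Subject: test msg of 8bit CTE and UTF8\n"
    "\n"
    "there is the hötel \n"
).encode("utf-8")

msg = email.message_from_bytes(raw, policy=default)
body = msg.get_content()

print("CTE: ", msg["content-transfer-encoding"])
# Confirm the input really contains the UTF-8 bytes for o-umlaut.
print("raw has UTF-8 o-umlaut:", b"\xc3\xb6" in raw)
# ascii() shows escapes, so the result does not depend on the terminal encoding.
print("body:", ascii(body))
```

Printing with ascii() makes it easy to tell a correctly decoded '\xf6' apart from a replacement character '\ufffd', whatever the terminal does with non-ASCII output.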
