Encoding parameter not used for html5lib #702

rvanlaar · 2023-08-08T12:32:35Z

Describe the Bug

When pisa.CreatePDF is given an encoded bytestring, the encoding argument is not passed on to html5lib, and the bytes end up being decoded as cp1252.

Minimal Example to Reproduce

from xhtml2pdf import pisa

# Define your data
source_html = "<html><body><span>•</span></body></html>"
output_filename = "test.pdf"

# Utility function
def convert_html_to_pdf(source_html, output_filename):
    # Encode to bytes up front; this is the input that triggers the bug
    source = source_html.encode("utf-8")
    with open(output_filename, "w+b") as result_file:
        pisa.CreatePDF(
            source,
            encoding="utf-8",
            dest=result_file)

# Main program
if __name__ == "__main__":
    convert_html_to_pdf(source_html, output_filename)

Expected Behavior

test.pdf contains only: •

Actual Behavior

test.pdf contains: â€¢

Additional Information

Example of why this happens (the UTF-8 bytes of the bullet are decoded as cp1252):

assert "•".encode("utf-8").decode("cp1252") == "â€¢"
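The same effect can be shown step by step with only the standard library. This is a minimal sketch, not taken from xhtml2pdf; the variable names are illustrative. It also shows that the damage is reversible in this case, because every byte maps to a cp1252 character.

```python
# Sketch: reproduce the mojibake outside xhtml2pdf, standard library only.
bullet = "\u2022"  # the bullet character used in the report

utf8_bytes = bullet.encode("utf-8")        # three bytes: E2 80 A2
misdecoded = utf8_bytes.decode("cp1252")   # each byte read as one cp1252 character
correct = utf8_bytes.decode("utf-8")       # decoded with the intended encoding

print(repr(misdecoded))
print(repr(correct))

# Undo the wrong decoding: back to the original bytes, then decode as UTF-8.
repaired = misdecoded.encode("cp1252").decode("utf-8")
```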

Seems related to: #468

A possible fix within xhtml2pdf:

diff --git a/xhtml2pdf/parser.py b/xhtml2pdf/parser.py
index 4d2188d..b3ccaf5 100644
--- a/xhtml2pdf/parser.py
+++ b/xhtml2pdf/parser.py
@@ -767,6 +767,7 @@ def pisaParser(src, context, default_css="", xhtml=False, encoding="utf8", xml_o
         src = pisaTempFile(src, capacity=context.capacity)
         # To pass the encoding used to convert the text_type src to binary_type
         # on to html5lib's parser to ensure proper decoding
+    if encoding:
         parser_kwargs['transport_encoding'] = encoding
 
     # # Test for the restrictions of html5lib

A workaround is to pass source_html as a str instead of encoding it first.
xhtml2pdf then does the encoding itself and passes that encoding on to html5lib.
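A minimal sketch of that workaround, assuming the caller may hold either str or bytes. The helper name ensure_text is hypothetical and not part of xhtml2pdf; only pisa.CreatePDF and its encoding and dest arguments come from the example above.

```python
from typing import Union


def ensure_text(source: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Hypothetical helper: return source as str, decoding bytes if needed."""
    if isinstance(source, bytes):
        return source.decode(encoding)
    return source


def convert_html_to_pdf(source, output_filename, encoding="utf-8"):
    # Imported lazily so ensure_text can be used without xhtml2pdf installed.
    from xhtml2pdf import pisa

    with open(output_filename, "w+b") as result_file:
        # Pass text, not bytes, so xhtml2pdf performs the encoding itself.
        status = pisa.CreatePDF(
            ensure_text(source, encoding),
            encoding=encoding,
            dest=result_file)
    return status.err
```

This only sidesteps the problem on the caller's side; byte input would still need a fix like the patch above.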

System Information

OS version: Ubuntu 22.04
Python version: 3.10.12
XHTML2PDF version: 0.2.11

rvanlaar added the "bug" label on Aug 8, 2023
