Encoding parameter not used for html5lib #702

Open
rvanlaar opened this issue Aug 8, 2023 · 0 comments
Labels
bug Something isn't working

rvanlaar commented Aug 8, 2023

Describe the Bug

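When pisa.CreatePDF is given a UTF-8 encoded bytestring together with encoding="utf-8", the encoding parameter is not passed on to html5lib, and the text in the resulting PDF appears as if the bytes had been decoded as cp1252.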

Minimal Example to Reproduce

from xhtml2pdf import pisa

# Define your data
source_html = "<html><body><span>•</span></body></html>"
output_filename = "test.pdf"

# Utility function
def convert_html_to_pdf(source_html, output_filename):
    # Encode to UTF-8 bytes and pass the matching encoding explicitly
    source = source_html.encode("utf-8")
    with open(output_filename, "w+b") as result_file:
        pisa.CreatePDF(
            source,
            encoding="utf-8",
            dest=result_file)

# Main program
if __name__ == "__main__":
    convert_html_to_pdf(source_html, output_filename)

Expected Behavior

test.pdf contains only: •

Actual Behavior

test.pdf contains: â€¢

Additional Information

Example why this happens:

assert "•".encode("utf-8").decode("cp1252") == "â€¢"
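
To spell out the byte-level steps behind that assertion, here is a minimal sketch using only the Python standard library. It does not call xhtml2pdf or html5lib; it only illustrates how UTF-8 bytes read as cp1252 produce the three-character sequence, and that the mix-up is reversible. The variable names are illustrative and not taken from either library.

```python
# The bullet character U+2022 takes three bytes in UTF-8.
utf8_bytes = "•".encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\xa2'

# Reading those bytes as cp1252 maps each byte to its own character:
# 0xE2 -> 'â', 0x80 -> '€', 0xA2 -> '¢'
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # â€¢

# Reversing the wrong decode recovers the original text, which shows
# the bytes themselves were intact and only the decoding was wrong.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)  # •
```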

Seems related to: #468

A possible fix within xhtml2pdf:

diff --git a/xhtml2pdf/parser.py b/xhtml2pdf/parser.py
index 4d2188d..b3ccaf5 100644
--- a/xhtml2pdf/parser.py
+++ b/xhtml2pdf/parser.py
@@ -767,6 +767,7 @@ def pisaParser(src, context, default_css="", xhtml=False, encoding="utf8", xml_o
         src = pisaTempFile(src, capacity=context.capacity)
         # To pass the encoding used to convert the text_type src to binary_type
         # on to html5lib's parser to ensure proper decoding
+    if encoding:
         parser_kwargs['transport_encoding'] = encoding
 
     # # Test for the restrictions of html5lib

A workaround is to pass source_html as a str instead of encoding it first.
xhtml2pdf then does the encoding itself and passes the matching encoding on to html5lib.
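
A rough sketch of that workaround follows, assuming the caller may hold either str or bytes. The helper names as_text and create_pdf_from_html are hypothetical and not part of xhtml2pdf; the only library call used is pisa.CreatePDF with a str source and a dest file object, as in the reproduction above.

```python
def as_text(source, encoding="utf-8"):
    """Return source as str, decoding bytes with the given encoding.

    Hypothetical helper, not part of xhtml2pdf.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source).decode(encoding)
    return source


def create_pdf_from_html(source, output_filename, encoding="utf-8"):
    """Convert HTML to PDF, always handing xhtml2pdf a str.

    Hypothetical wrapper: returns True when pisa reports no errors.
    """
    # Imported here so as_text stays usable without xhtml2pdf installed.
    from xhtml2pdf import pisa

    with open(output_filename, "w+b") as result_file:
        # Passing str avoids the bytes path where the encoding is dropped.
        status = pisa.CreatePDF(as_text(source, encoding), dest=result_file)
    return not status.err
```

With this in place, create_pdf_from_html(source_html.encode("utf-8"), "test.pdf") would decode the bytes before calling pisa, so the reported bytes-plus-encoding code path is never exercised.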

System Information

OS version: Ubuntu 22.04
Python version: 3.10.12
XHTML2PDF version: 0.2.11

@rvanlaar rvanlaar added the bug Something isn't working label Aug 8, 2023