generic/_utils.py: function create_string_object not working with the bytearray type #2434

sbourlon · 2024-02-02T06:24:59Z

Hello pypdf team,

While trying to to get the fields of my PDF with the function PdfReader.get_fields(), my code received an exception from the function create_string_object (in pypdf/generic/_utils.py, line 113) because it received a bytearray instead of a str or bytes.

By looking at the traceback, the error occurs when the function def decrypt_object(self, obj: PdfObject) -> PdfObject detects that the object to decrypt is either of type ByteStringObject or TextStringObject, before calling create_string_object.

The documentation about the bytearray type states:

bytearray objects are a mutable counterpart to bytes objects.

As bytearray objects are mutable, they support the mutable sequence operations in addition to the common bytes and bytearray operations described in Bytes and Bytearray Operations.

source: https://docs.python.org/3/library/stdtypes.html#bytearray

So it seems like the function create_string_object could accept bytearray objects and could treat them as bytes instead of raising an exception.

After applying this fix, I was able to read the fields of my PDF.

diff --git a/pypdf/generic/_utils.py b/pypdf/generic/_utils.py
index e6da5cf..edc9153 100644
--- a/pypdf/generic/_utils.py
+++ b/pypdf/generic/_utils.py
@@ -129,7 +129,7 @@ def create_string_object(
     """
     if isinstance(string, str):
         return TextStringObject(string)
-    elif isinstance(string, bytes):
+    elif isinstance(string, bytes | bytearray):
         if isinstance(forced_encoding, (list, dict)):
             out = ""
             for x in string:

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.5.0-15-generic-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('cryptography', '41.0.7'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

pdf = PdfReader(file)
fields = pdf.get_fields()

I can't provide my PDF file because it contains personal information.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/home/stefan/src/source/pdf.py", line 235, in get_pdf_fields
    fields = self.reader.get_fields()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_reader.py", line 577, in get_fields
    field = f.get_object()
            ^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/generic/_base.py", line 312, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_reader.py", line 1417, in get_object
    retval = self._encryption.decrypt_object(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 850, in decrypt_object
    return cf.decrypt_object(obj)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 104, in decrypt_object
    obj[key] = self.decrypt_object(value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 97, in decrypt_object
    obj = create_string_object(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/generic/_utils.py", line 163, in create_string_object
    raise TypeError(
TypeError: ('create_string_object should have str or unicode arg: %s', <class 'bytearray'>)

The text was updated successfully, but these errors were encountered:

stefan6419846 · 2024-02-02T07:10:09Z

Thanks for the report. While this patch might work, this will break mypy as decrypt should always receive and return bytes. Are you able to pinpoint where the bytearray is being generated initially? Having a quick look at the changelog of cryptography, there does not seem to be a related change.

sbourlon · 2024-02-03T05:21:47Z

Hello Stefan,

(I'm glad to meet another Stefan)

Thank you for your responsiveness.

The error seems to happen with objects of type TextStringObject (from _base.py) because its property original_bytes sometimes returns a bytearray though the function encode_pdfdocencoding(unicode_string: str) -> bytes.

mypy is currently accepting the bytearray to bytes promotion but there is an expirement to remove it.

Sources:

If the function encode_pdfdocencoding(unicode_string: str) -> bytes could only return immutable bytes (bytes type), which seems to be the intent according to its function signature, its return value could be casted to bytes, like so:

diff --git a/pypdf/generic/_base.py b/pypdf/generic/_base.py
index 5a27572..813b1df 100644
--- a/pypdf/generic/_base.py
+++ b/pypdf/generic/_base.py
@@ -650,4 +650,4 @@ def encode_pdfdocencoding(unicode_string: str) -> bytes:
             raise UnicodeEncodeError(
                 "pdfdocencoding", c, -1, -1, "does not exist in translation table"
             )
-    return retval
+    return bytes(retval)

Detailed explanation:

The function PdfReader.get_object, in the "# otherwise, decrypt here..." section, decrypts the object using the function Encryption.decrypt_object(obj: PdfObject, ...)
The function Encryption.decrypt_object(obj: PdfObject):
Checks that the object is encrypted with the function Encryption._is_encryption_object(obj), which simply checks that the object is one of the following types: ByteStringObject, TextStringObject, StreamObject, ArrayObject, DictionaryObject.
Decrypts the object with the function CryptFilter.decrypt_object(obj: PdfObject)
The function CryptFilter.decrypt_object(obj: PdfObject):
Calls the object's property obj.original_bytes, which in case of TextStringObject returns a bytearray
Calls the function _utils.py.create_string_object(string: Union[str, bytes], ...) which accepts only str or bytes

stefan6419846 · 2024-02-03T09:29:35Z

I am still not completely sure why bytes are generated in one case but a bytearray in another? But fixing this at the central location appears to be more suitable.

Do you want to submit a corresponding PR?

sbourlon · 2024-02-05T21:04:34Z

Because of the method get_original_bytes(self) -> bytes of the class TextStringObject

    def get_original_bytes(self) -> bytes:
        # We're a text string object, but the library is trying to get our raw
        # bytes.  This can happen if we auto-detected this string as text, but
        # we were wrong.  It's pretty common.  Return the original bytes that
        # would have been used to create this object, based upon the autodetect
        # method.
        if self.autodetect_utf16:
            return codecs.BOM_UTF16_BE + self.encode("utf-16be") # <-- returns bytes
        elif self.autodetect_pdfdocencoding:
            return encode_pdfdocencoding(self) # <-- returns bytearray
        else:
            raise Exception("no information about original bytes")

I opened the PR #2440 which cast the return of encode_pdfdocencoding into bytes.

sbourlon mentioned this issue Feb 5, 2024

BUG: encode_pdfdocencoding() always returns bytes #2440

Merged

stefan6419846 added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF generic The generic submodule is affected labels Feb 15, 2024

stefan6419846 closed this as completed in efe7790 Feb 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

generic/_utils.py: function create_string_object not working with the bytearray type #2434

generic/_utils.py: function create_string_object not working with the bytearray type #2434

sbourlon commented Feb 2, 2024

stefan6419846 commented Feb 2, 2024

sbourlon commented Feb 3, 2024

stefan6419846 commented Feb 3, 2024

sbourlon commented Feb 5, 2024

generic/_utils.py: function create_string_object not working with the bytearray type #2434

generic/_utils.py: function create_string_object not working with the bytearray type #2434

Comments

sbourlon commented Feb 2, 2024

Environment

Code + PDF

Traceback

stefan6419846 commented Feb 2, 2024

sbourlon commented Feb 3, 2024

stefan6419846 commented Feb 3, 2024

sbourlon commented Feb 5, 2024