Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generic/_utils.py: function create_string_object not working with the bytearray type #2434

Closed
sbourlon opened this issue Feb 2, 2024 · 4 comments
Labels
generic The generic submodule is affected is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@sbourlon
Copy link
Contributor

sbourlon commented Feb 2, 2024

Hello pypdf team,

While trying to to get the fields of my PDF with the function PdfReader.get_fields(), my code received an exception from the function create_string_object (in pypdf/generic/_utils.py, line 113) because it received a bytearray instead of a str or bytes.

By looking at the traceback, the error occurs when the function def decrypt_object(self, obj: PdfObject) -> PdfObject detects that the object to decrypt is either of type ByteStringObject or TextStringObject, before calling create_string_object.

The documentation about the bytearray type states:

bytearray objects are a mutable counterpart to bytes objects.

As bytearray objects are mutable, they support the mutable sequence operations in addition to the common bytes and bytearray operations described in Bytes and Bytearray Operations.

source: https://docs.python.org/3/library/stdtypes.html#bytearray

So it seems like the function create_string_object could accept bytearray objects and could treat them as bytes instead of raising an exception.

After applying this fix, I was able to read the fields of my PDF.

diff --git a/pypdf/generic/_utils.py b/pypdf/generic/_utils.py
index e6da5cf..edc9153 100644
--- a/pypdf/generic/_utils.py
+++ b/pypdf/generic/_utils.py
@@ -129,7 +129,7 @@ def create_string_object(
     """
     if isinstance(string, str):
         return TextStringObject(string)
-    elif isinstance(string, bytes):
+    elif isinstance(string, bytes | bytearray):
         if isinstance(forced_encoding, (list, dict)):
             out = ""
             for x in string:

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.5.0-15-generic-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.4, crypt_provider=('cryptography', '41.0.7'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

pdf = PdfReader(file)
fields = pdf.get_fields()

I can't provide my PDF file because it contains personal information.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/home/stefan/src/source/pdf.py", line 235, in get_pdf_fields
    fields = self.reader.get_fields()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_reader.py", line 577, in get_fields
    field = f.get_object()
            ^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/generic/_base.py", line 312, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_reader.py", line 1417, in get_object
    retval = self._encryption.decrypt_object(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 850, in decrypt_object
    return cf.decrypt_object(obj)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 104, in decrypt_object
    obj[key] = self.decrypt_object(value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/_encryption.py", line 97, in decrypt_object
    obj = create_string_object(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/.venv/lib/python3.11/site-packages/pypdf/generic/_utils.py", line 163, in create_string_object
    raise TypeError(
TypeError: ('create_string_object should have str or unicode arg: %s', <class 'bytearray'>)
@stefan6419846
Copy link
Collaborator

Thanks for the report. While this patch might work, this will break mypy as decrypt should always receive and return bytes. Are you able to pinpoint where the bytearray is being generated initially? Having a quick look at the changelog of cryptography, there does not seem to be a related change.

@sbourlon
Copy link
Contributor Author

sbourlon commented Feb 3, 2024

Hello Stefan,

(I'm glad to meet another Stefan)

Thank you for your responsiveness.

The error seems to happen with objects of type TextStringObject (from _base.py) because its property original_bytes sometimes returns a bytearray though the function encode_pdfdocencoding(unicode_string: str) -> bytes.

mypy is currently accepting the bytearray to bytes promotion but there is an expirement to remove it.

Sources:

If the function encode_pdfdocencoding(unicode_string: str) -> bytes could only return immutable bytes (bytes type), which seems to be the intent according to its function signature, its return value could be casted to bytes, like so:

diff --git a/pypdf/generic/_base.py b/pypdf/generic/_base.py
index 5a27572..813b1df 100644
--- a/pypdf/generic/_base.py
+++ b/pypdf/generic/_base.py
@@ -650,4 +650,4 @@ def encode_pdfdocencoding(unicode_string: str) -> bytes:
             raise UnicodeEncodeError(
                 "pdfdocencoding", c, -1, -1, "does not exist in translation table"
             )
-    return retval
+    return bytes(retval)

Detailed explanation:

  1. The function PdfReader.get_object, in the "# otherwise, decrypt here..." section, decrypts the object using the function Encryption.decrypt_object(obj: PdfObject, ...)
  2. The function Encryption.decrypt_object(obj: PdfObject):
  3. Checks that the object is encrypted with the function Encryption._is_encryption_object(obj), which simply checks that the object is one of the following types: ByteStringObject, TextStringObject, StreamObject, ArrayObject, DictionaryObject.
  4. Decrypts the object with the function CryptFilter.decrypt_object(obj: PdfObject)
  5. The function CryptFilter.decrypt_object(obj: PdfObject):
  6. Calls the object's property obj.original_bytes, which in case of TextStringObject returns a bytearray
  7. Calls the function _utils.py.create_string_object(string: Union[str, bytes], ...) which accepts only str or bytes

@stefan6419846
Copy link
Collaborator

I am still not completely sure why bytes are generated in one case but a bytearray in another? But fixing this at the central location appears to be more suitable.

Do you want to submit a corresponding PR?

@sbourlon
Copy link
Contributor Author

sbourlon commented Feb 5, 2024

Because of the method get_original_bytes(self) -> bytes of the class TextStringObject

    def get_original_bytes(self) -> bytes:
        # We're a text string object, but the library is trying to get our raw
        # bytes.  This can happen if we auto-detected this string as text, but
        # we were wrong.  It's pretty common.  Return the original bytes that
        # would have been used to create this object, based upon the autodetect
        # method.
        if self.autodetect_utf16:
            return codecs.BOM_UTF16_BE + self.encode("utf-16be") # <-- returns bytes
        elif self.autodetect_pdfdocencoding:
            return encode_pdfdocencoding(self) # <-- returns bytearray
        else:
            raise Exception("no information about original bytes")

I opened the PR #2440 which cast the return of encode_pdfdocencoding into bytes.

@stefan6419846 stefan6419846 added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF generic The generic submodule is affected labels Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
generic The generic submodule is affected is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
Projects
None yet
Development

No branches or pull requests

2 participants