Regular PDF detected as encrypted and decryption with empty string fails #245

Closed
cycomanic opened this issue Jan 22, 2016 · 5 comments
Labels
is-bug: From a user's perspective, this is a bug (a violation of the expected behavior with a compliant PDF)
needs-pdf: The issue needs a PDF file to show the problem
workflow-encryption: From a user's perspective, encryption is the affected feature/workflow

Comments

@cycomanic

Hi,
I have a problem with a regular PDF that somehow gets detected as encrypted. I tried the method mentioned in #51, i.e.

inpdf = pyPdf.PdfFileReader(<your file>)
if inpdf.isEncrypted:
    inpdf.decrypt('')

but then I get:

----> 1 inpdf.decrypt(b'')

/home/jschrod/Downloads/Python/PyPDF2/build/lib/PyPDF2/pdf.py in decrypt(self, password)
   1971         self._override_encryption = True
   1972         try:
-> 1973             return self._decrypt(password)
   1974         finally:
   1975             self._override_encryption = False

/home/jschrod/Downloads/Python/PyPDF2/build/lib/PyPDF2/pdf.py in _decrypt(self, password)
   1977     def _decrypt(self, password):
   1978         encrypt = self.trailer['/Encrypt'].getObject()
-> 1979         if encrypt['/Filter'] != '/Standard':
   1980             raise NotImplementedError("only Standard PDF encryption handler is available")
   1981         if not (encrypt['/V'] in (1, 2)):

TypeError: 'NullObject' object has no attribute '__getitem__'

I seem to get around the issue if I do

inpdf._override_encryption=True
inpdf._flatten()

after which a

inpdf.getPage(X)

succeeds and lets me access the PDF normally, seemingly without issues, which suggests that it is not really encrypted.

Cheers
Jochen
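
For anyone hitting the same traceback, here is a minimal sketch of how the cases could be told apart before resorting to private attributes. It assumes a PyPDF2 1.x-style reader exposing `isEncrypted` and `decrypt()` as in the snippet above (later releases rename the property to `is_encrypted`). The helper name `probe_encryption` and its return strings are made up for illustration and are not part of the library.

```python
def probe_encryption(reader):
    """Classify how a PyPDF2 1.x-style reader reacts to the empty password.

    `reader` is assumed to expose `isEncrypted` and `decrypt(password)`.
    The returned labels are hypothetical, chosen for this sketch only.
    """
    if not reader.isEncrypted:
        return "not encrypted"
    try:
        result = reader.decrypt("")
    except TypeError:
        # Same failure as in the traceback above: /Encrypt resolved to a
        # NullObject, so there is no encryption dictionary to index into.
        return "null /Encrypt entry"
    except NotImplementedError:
        # Raised by decrypt() for handlers or versions it does not support.
        return "unsupported encryption"
    # decrypt() returns 0 when the password does not match.
    return "opened with empty password" if result else "password required"
```

Only when the result is "null /Encrypt entry" does the override-and-flatten workaround from this post look relevant; the other outcomes point to a genuinely encrypted file.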

@michi88

michi88 commented Feb 12, 2016

Thanks for the workaround @cycomanic. I'm having the same issue.

@mstamy2 added the workflow-encryption label Aug 2, 2016
@fsiordia

fsiordia commented Jul 2, 2018

Thank you so much @cycomanic!
That seems to work for me!

@MartinThoma added the is-bug label Jun 26, 2022
@MartinThoma
Member

@fsiordia @michi88 @cycomanic Do you still encounter the same issue with the latest PyPDF2 version? Could you upload an example PDF file?

@Thomas-Boi

Thomas-Boi commented Jun 28, 2022

I have a similar problem that couldn't be solved by the workarounds described above.

I run this in both Windows and AWS Linux environments, using Python 3.9.

I'm trying to get the content of this PDF: fdo-fundingapplication-demandedefinancement.pdf

This is what I'm trying to do:

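import requests
from PyPDF2 import PdfReader

# url and pdf_path are defined elsewhere in the script
response = requests.get(url)
with open(pdf_path, "wb+") as file:
    file.write(response.content)

    # rewind so the reader starts from the beginning
    file.seek(0)

    reader = PdfReader(file)
    text = "".join([page.extract_text() for page in reader.pages])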

However, I run into this error when I run the above code.

Traceback (most recent call last):
  File "D:\project\scripts\get_page_hashes.py", line 104, in get_page_hashes
    text = "".join([page.extract_text() for page in reader.pages])
  File "D:\project\scripts\get_page_hashes.py", line 104, in <listcomp>
    text = "".join([page.extract_text() for page in reader.pages])
  File "D:\project\venv\lib\site-packages\PyPDF2\_page.py", line 1483, in __iter__
    for i in range(len(self)):
  File "D:\project\venv\lib\site-packages\PyPDF2\_page.py", line 1465, in __len__
    return self.length_function()
  File "D:\project\venv\lib\site-packages\PyPDF2\_reader.py", line 373, in _get_num_pages
    return self.trailer[TK.ROOT]["/Pages"]["/Count"]  # type: ignore
  File "D:\project\venv\lib\site-packages\PyPDF2\generic.py", line 650, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "D:\project\venv\lib\site-packages\PyPDF2\generic.py", line 221, in get_object
    obj = self.pdf.get_object(self)
  File "D:\project\venv\lib\site-packages\PyPDF2\_reader.py", line 1077, in get_object
    raise PdfReadError("File has not been decrypted")
PyPDF2.errors.PdfReadError: File has not been decrypted

Decrypting with an empty string yields the same error. I also tried the override-and-flatten trick:

if reader.is_encrypted:
    reader._override_encryption = True
    reader._flatten()

However, the code now yielded this error:

Traceback (most recent call last):
  File "D:\project\scripts\get_page_hashes.py", line 101, in get_page_hashes
    reader._flatten()
  File "D:\project\venv\lib\site-packages\PyPDF2\_reader.py", line 952, in _flatten
    pages = catalog["/Pages"].get_object()  # type: ignore
  File "D:\project\venv\lib\site-packages\PyPDF2\generic.py", line 650, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "D:\project\venv\lib\site-packages\PyPDF2\generic.py", line 221, in get_object
    obj = self.pdf.get_object(self)
  File "D:\project\venv\lib\site-packages\PyPDF2\_reader.py", line 1044, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "D:\project\venv\lib\site-packages\PyPDF2\_reader.py", line 995, in _get_object_from_stream
    objnum = NumberObject.read_from_stream(stream_data)
  File "D:\project\venv\lib\site-packages\PyPDF2\generic.py", line 355, in read_from_stream
    num = read_until_regex(stream, NumberObject.NumberPattern)
  File "D:\project\venv\lib\site-packages\PyPDF2\_utils.py", line 131, in read_until_regex
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

EDIT:
My problem has been resolved by using pikepdf. I used the linked step before I read the file and it seems to work.
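
The link to the pikepdf step did not survive in this thread, so the following is only a guess at what such a step might look like: open the downloaded file with pikepdf and save a fresh copy, then hand that copy to PyPDF2. `pikepdf.open` and `Pdf.save` are real pikepdf calls. To my understanding, `save` writes the copy without encryption unless an `encryption` argument is passed. The helper names `resaved_path` and `resave_with_pikepdf` are hypothetical.

```python
from pathlib import Path


def resaved_path(src):
    """Return a sibling path for the rewritten copy, e.g. form.pdf -> form.resaved.pdf."""
    src = Path(src)
    return src.with_name(src.stem + ".resaved" + src.suffix)


def resave_with_pikepdf(src):
    """Rewrite a PDF through pikepdf and return the path of the new copy.

    Assumption: pikepdf (a third-party package) is installed and can open
    the file without a password.
    """
    import pikepdf  # imported lazily so the path helper works without it

    dst = resaved_path(src)
    with pikepdf.open(src) as pdf:
        pdf.save(dst)
    return dst
```

The returned path could then be passed to `PdfReader` in place of the original download.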

@MartinThoma added the needs-pdf label Jul 10, 2022
@MartinThoma
Member

I believe this was solved by #1015.

Without an example PDF I'm not able to verify that, so I'm closing this issue for now.

If anybody still encounters it with the latest version of PyPDF2, please let me know.
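
When reporting back, it helps to state which release is actually installed. A small sketch using the standard library's `importlib.metadata` (Python 3.8+); the function name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package="PyPDF2"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("PyPDF2 version:", installed_version())
```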

6 participants