Interactive PDFs are not working #2661

Closed
SalomonKisters opened this issue May 21, 2024 · 4 comments

Comments

@SalomonKisters

I was trying to extract text from a PDF using pypdf via LlamaIndex. The PDF is interactive and is linked below. The non-interactive version was exported from the original PDF using the print function in Edge.
The non-interactive version works, but the interactive version does not.

Environment

Which environment were you using when you encountered the problem?

I was using the LlamaIndex SimpleDirectoryReader, which uses pypdf under the hood.

Code + PDF

This is a minimal, complete example that shows the issue:

Extracting text from the PDF using pypdf produces this message: "Ignoring wrong pointing object {id} {gen} (offset {xref_entry[id]})"

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
SCION-book-non-interactive.pdf
SCION-book.pdf

Traceback

This is the complete traceback I see:

DEBUG - > [SimpleDirectoryReader] Total files added: 1
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
Failed to load file /tmp/tmpwdnck6xk/SCION-book.pdf with error: RetryError[<Future at 0x7f324a133990 state=finished raised AttributeError>]. Skipping...

@SalomonKisters SalomonKisters changed the title from "Interactive PDFs are nto working" to "Interactive PDFs are not working" on May 21, 2024
@stefan6419846
Collaborator

Please provide some more details. What error are you getting? If there is one, how can it be reproduced without depending on external tools or libraries? In your case, the output seems to come from logger_warning only, which usually indicates a specification breach but should not really affect the results.
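To tell such warnings apart from real failures, they can be collected with a logging handler. The following is only a sketch: it assumes pypdf emits these messages through Python's standard logging module under logger names starting with "pypdf", and the names WarningCollector and attach_collector are made up for illustration.

```python
import logging
import re

# Matches the warning text quoted in the report above.
_WRONG_POINTER = re.compile(
    r"Ignoring wrong pointing object (\d+) (\d+) \(offset (\d+)\)"
)


class WarningCollector(logging.Handler):
    """Hypothetical helper: stores WARNING-or-higher messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

    def wrong_pointers(self):
        """Return (object id, generation, offset) for each matching warning."""
        found = []
        for message in self.messages:
            match = _WRONG_POINTER.search(message)
            if match:
                found.append(tuple(int(group) for group in match.groups()))
        return found


def attach_collector(logger_name="pypdf"):
    """Attach a collector to the given logger (assumed name: "pypdf")."""
    collector = WarningCollector()
    logging.getLogger(logger_name).addHandler(collector)
    return collector
```

Calling attach_collector() before opening the file and inspecting collector.messages afterwards shows which objects were flagged, independently of whether an exception is raised later.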

@SalomonKisters
Author

SalomonKisters commented May 21, 2024

This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.

import PyPDF2

def read_pdf_text(file_path):
    # Open the PDF file
    with open(file_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)
        
        # Initialize an empty string to hold the text
        text = ""
        
        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
        
        return text


file_path = 'SCION-book.pdf'
pdf_text = read_pdf_text(file_path)
print(pdf_text)

Running this script gives the following error:

Traceback (most recent call last):
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 21, in <module>
    pdf_text = read_pdf_text(file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 15, in read_pdf_text
    text += page.extract_text()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1353, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
    ~~~^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_reader.py", line 1260, in get_object
    retval = read_object(self.stream, self)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 1080, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'\xd5' @3358736: b'\\`Vi\x9f\x10\xd5\x02\'\xd3h\xa5p:\xf1\x82\x0b\x19\xcdj\xd5"\x0b\xcc*\rL\xac\x85Z\xe0t"\xad$TP\xfc?\xc0jRwf\xff\x0f\xb0\xab\xc0B\xa4\xb0\xd3i\xb4\x128\x9dy\xc1\x04\xd8Ru\xd6`\xdbP\x9a\xa8T\xa5\xc0\xe94ZI\x1cvL'

I managed to work around the error by first reading the file with pdfrw and rewriting it, but I do not feel that is the best solution.
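The exact pdfrw code is not shown in this thread. A minimal sketch of what such a read-and-rewrite step could look like, assuming the third-party pdfrw package is installed, is given here; the function names and the "-rewritten" suffix are hypothetical, and whether the rewritten copy is readable depends on how pdfrw handles the broken references in the file.

```python
from pathlib import Path


def rewritten_path(src: Path) -> Path:
    """Return a sibling path for the rewritten copy (hypothetical naming)."""
    return src.with_name(f"{src.stem}-rewritten{src.suffix}")


def rewrite_pdf(src: Path) -> Path:
    """Read `src` with pdfrw and write it back out to a new file."""
    # Imported lazily: pdfrw is a third-party package (pip install pdfrw).
    from pdfrw import PdfReader, PdfWriter

    dst = rewritten_path(src)
    # Parse the original file and serialize its object tree to a new file.
    PdfWriter(str(dst), trailer=PdfReader(str(src))).write()
    return dst
```

The rewritten file, for example rewrite_pdf(Path("SCION-book.pdf")), would then be passed to the text extraction script instead of the original.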

@stefan6419846
Collaborator

PyPDF2 is no longer supported and you should definitely switch to pypdf. Nevertheless, there still seems to be an issue with at least page 193:

Ignoring wrong pointing object 5610 0 (offset 3356523)
Ignoring wrong pointing object 5615 0 (offset 3354947)
Object 5615 0 not defined.
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 22, in <module>
    pdf_text = read_pdf_text(file_path)
  File "/home/stefan/tmp/pypdf/run.py", line 16, in read_pdf_text
    text += page.extract_text()
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 1604, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
AttributeError: 'NoneType' object has no attribute 'get_object'

The corresponding page cannot be rendered properly either and looks odd, so this seems to be a general issue with your PDF file:

[Screenshot: ksnip_20240521-164222]
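If only individual pages are damaged, the rest of the text can still be recovered by extracting page by page and recording the pages that fail. This is a sketch rather than an official pypdf recipe: extract_text_tolerant is a hypothetical helper, the page numbers it reports are 1-based by assumption, and the broad except is deliberate because malformed pages can fail in different ways.

```python
from typing import Iterable, List, Tuple


def extract_text_tolerant(pages: Iterable) -> Tuple[str, List[Tuple[int, str]]]:
    """Join page texts and record pages whose extraction raises.

    `pages` is any iterable of objects with an `extract_text()` method,
    such as `pypdf.PdfReader(...).pages`.
    """
    chunks: List[str] = []
    failures: List[Tuple[int, str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            chunks.append(page.extract_text() or "")
        except Exception as exc:  # broad on purpose, see note above
            failures.append((number, f"{type(exc).__name__}: {exc}"))
    return "\n".join(chunks), failures


def read_pdf_text(file_path: str):
    """Extract text with pypdf, skipping pages that cannot be read."""
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return extract_text_tolerant(reader.pages)
```

With text, failures = read_pdf_text("SCION-book.pdf"), the failures list names the pages that need a closer look, while the remaining pages are still extracted.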

@SalomonKisters
Author

Ah okay, fair enough. I can try it with pypdf, although LlamaIndex is still using PyPDF2, so I don't know if that will help in my case.
Anyway, thanks for the quick help!
