Interactive PDFs are not working #2661

SalomonKisters · 2024-05-21T11:06:10Z

I was trying to extract text from a PDF using pypdf via LlamaIndex. The PDF is interactive and is linked below. The non-interactive version is a copy of the original PDF exported with the print function in Edge.
Extraction works for the non-interactive version but fails for the interactive one.

Environment

Which environment were you using when you encountered the problem?

I was using the LlamaIndex SimpleDirectoryReader, which uses pypdf under the hood.

Code + PDF

This is a minimal, complete example that shows the issue:

Extracting text from the PDF with pypdf produces this message: "Ignoring wrong pointing object {id} {gen} (offset {xref_entry[id]})"

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
SCION-book-non-interactive.pdf
SCION-book.pdf

Traceback

This is the complete traceback I see:

DEBUG - > [SimpleDirectoryReader] Total files added: 1
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
Failed to load file /tmp/tmpwdnck6xk/SCION-book.pdf with error: RetryError[<Future at 0x7f324a133990 state=finished raised AttributeError>]. Skipping...

stefan6419846 · 2024-05-21T14:27:06Z

Please provide some more details. What error are you getting? If there is one, how can it be reproduced without depending on external tools or libraries? In your case, the output seems to come only from logger_warning, which usually indicates a specification violation but should not really affect the results.
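
To tell warnings apart from an actual exception, the logged messages can be collected separately from the extraction result. The following is a minimal sketch, assuming pypdf emits its warnings through the standard `logging` module under logger names that start with `pypdf` (for example `pypdf._reader`). `ListHandler` and `capture_warnings` are hypothetical helper names, not part of pypdf.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_warnings(func, logger_name="pypdf"):
    """Run func() and return (result, warnings logged under logger_name).

    Assumption: the library logs under child loggers of `logger_name`,
    so records propagate up to the handler attached here.
    """
    logger = logging.getLogger(logger_name)
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        result = func()
    finally:
        # Always detach the handler, even if func() raised.
        logger.removeHandler(handler)
    return result, handler.messages
```

If `func` raises, the exception still propagates, so a run that only logs warnings and a run that fails remain distinguishable.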

SalomonKisters · 2024-05-21T14:35:22Z

This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.

import PyPDF2

def read_pdf_text(file_path):
    # Open the PDF file
    with open(file_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)
        
        # Initialize an empty string to hold the text
        text = ""
        
        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
        
        return text


file_path = 'SCION-book.pdf'
pdf_text = read_pdf_text(file_path)
print(pdf_text)

Running this script produces the following error:

Traceback (most recent call last):
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 21, in <module>
    pdf_text = read_pdf_text(file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 15, in read_pdf_text
    text += page.extract_text()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1353, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
    ~~~^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_reader.py", line 1260, in get_object
    retval = read_object(self.stream, self)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 1080, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'\xd5' @3358736: b'\\`Vi\x9f\x10\xd5\x02\'\xd3h\xa5p:\xf1\x82\x0b\x19\xcdj\xd5"\x0b\xcc*\rL\xac\x85Z\xe0t"\xad$TP\xfc?\xc0jRwf\xff\x0f\xb0\xab\xc0B\xa4\xb0\xd3i\xb4\x128\x9dy\xc1\x04\xd8Ru\xd6`\xdbP\x9a\xa8T\xa5\xc0\xe94ZI\x1cvL'

I managed to work around the error by first reading the file with pdfrw and rewriting it, but I do not think that is the best solution.
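
The exact pdfrw code used for this workaround is not shown in the thread. A minimal sketch of a read-and-rewrite pass, following the `PdfReader` / `PdfWriter(..., trailer=...)` pattern from pdfrw's own examples, might look like this. The helper names and the `-rewritten` suffix are illustrative assumptions.

```python
from pathlib import Path


def default_output_path(src):
    """Return '<name>-rewritten.pdf' next to the source file (hypothetical naming)."""
    src = Path(src)
    return src.with_name(src.stem + "-rewritten" + src.suffix)


def rewrite_pdf(src, dst=None):
    """Read src with pdfrw and write it back out as a new file."""
    # Imported lazily so the path helper works without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    dst = Path(dst) if dst is not None else default_output_path(src)
    trailer = PdfReader(str(src))
    # The writer serializes the parsed objects into a fresh file.
    PdfWriter(str(dst), trailer=trailer).write()
    return dst
```

Calling `rewrite_pdf("SCION-book.pdf")` would write `SCION-book-rewritten.pdf`, which can then be passed to the extraction script. Whether the rewritten file preserves the interactive features is not something this thread establishes.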

stefan6419846 · 2024-05-21T14:44:59Z

PyPDF2 is no longer supported, and you should definitely switch to pypdf. Nevertheless, there still seems to be an issue with at least page 193:

Ignoring wrong pointing object 5610 0 (offset 3356523)
Ignoring wrong pointing object 5615 0 (offset 3354947)
Object 5615 0 not defined.
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 22, in <module>
    pdf_text = read_pdf_text(file_path)
  File "/home/stefan/tmp/pypdf/run.py", line 16, in read_pdf_text
    text += page.extract_text()
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 1604, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
AttributeError: 'NoneType' object has no attribute 'get_object'

The corresponding page also cannot be rendered properly and looks odd, so this seems to be a general issue with your PDF file.
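
When only a few pages are damaged, one option is to extract page by page and record which pages fail instead of aborting on the first exception. The following is a sketch under that assumption. `extract_pages_tolerantly` and `read_pdf_text_tolerant` are hypothetical helpers, and only `PdfReader`, `reader.pages` and `page.extract_text()` are taken from pypdf.

```python
def extract_pages_tolerantly(pages):
    """Return (texts, failures) for an iterable of objects with extract_text().

    texts[i] is the text of page i, or "" if extraction raised.
    failures is a list of (zero_based_index, exception) pairs.
    """
    texts, failures = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text() or "")
        except Exception as exc:  # broad on purpose: damaged files fail in many ways
            texts.append("")
            failures.append((index, exc))
    return texts, failures


def read_pdf_text_tolerant(file_path):
    """Open file_path with pypdf and extract whatever can be extracted."""
    from pypdf import PdfReader  # lazy import; requires pypdf to be installed

    reader = PdfReader(file_path)
    return extract_pages_tolerantly(reader.pages)
```

The indices in `failures` are zero-based, so a failure on the page numbered 193 in a viewer would be reported as index 192, assuming the viewer counts from 1.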

SalomonKisters · 2024-05-21T14:53:37Z

Ah okay, fair enough. I can try it with pypdf, although LlamaIndex is still using PyPDF2, so I'm not sure that will help in my case.
Anyways, thanks for the quick help!
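
Which of the two libraries a given environment actually has installed can be checked with the standard library. This is a small sketch: `installed_versions` is a hypothetical helper, and the distribution names `pypdf` and `PyPDF2` are assumed to match the names used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("pypdf", "PyPDF2")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found
```

Printing `installed_versions()` in the same environment that runs the reader shows which packages are installed. It does not show which one the reader imports.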

SalomonKisters changed the title ~~Interactive PDFs are nto working~~ Interactive PDFs are not working May 21, 2024

SalomonKisters closed this as completed May 21, 2024
