Interactive PDFs are not working #2661
Comments
Please provide some more details. What error are you getting, if any? How can it be reproduced without depending on external tools or libraries? In your case, the output seems to be coming from
This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.
With this script I get the following error:
I have managed to work around the error by first reading the file with pdfrw and re-writing it, but I do not feel that is the best solution.
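For anyone who wants to try the same workaround, here is a rough sketch of what a pdfrw round trip could look like. It is not the script used above: the helper names `rewritten_path` and `rewrite_pdf` and the `.rewritten.pdf` suffix are made up for illustration, and whether the rewritten file loads cleanly depends on the PDF.

```python
from pathlib import Path


def rewritten_path(src):
    """Hypothetical naming helper: SCION-book.pdf -> SCION-book.rewritten.pdf."""
    src = Path(src)
    return src.with_name(src.stem + ".rewritten.pdf")


def rewrite_pdf(src):
    """Read a PDF with pdfrw and write it back out to a new file.

    Re-serializing the file makes pdfrw write a fresh cross-reference table,
    which is presumably why the workaround helps here.
    """
    # pdfrw is a third-party package; it is imported lazily so the naming
    # helper above can be used without it.
    from pdfrw import PdfReader, PdfWriter

    dst = rewritten_path(src)
    trailer = PdfReader(str(src))
    PdfWriter(str(dst), trailer=trailer).write()
    return dst
```

Calling `rewrite_pdf("SCION-book.pdf")` would then produce `SCION-book.rewritten.pdf` next to the original, and that copy can be passed to the text extraction step.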
PyPDF2 is not supported any more and you should definitely switch to pypdf. Nevertheless, it seems like there still is some issue with at least page 193:
The corresponding page cannot be rendered properly either and looks odd, so this seems to be a general issue with your PDF file.
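To narrow a failure down to individual pages without any dependency besides pypdf, something along these lines may help. It is only a sketch: `find_failing_pages` and `check_file` are names invented here, the 1-based page numbering is my own choice, and the broad `except` is there only to locate the problematic pages.

```python
def find_failing_pages(pages):
    """Return (page_number, error) pairs for pages whose extract_text() raises.

    `pages` is any iterable of objects with an extract_text() method,
    for example the `pages` attribute of a pypdf PdfReader.
    """
    failures = []
    for number, page in enumerate(pages, start=1):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: we only want to find bad pages
            failures.append((number, f"{type(exc).__name__}: {exc}"))
    return failures


def check_file(path):
    """Run the per-page check on a PDF file using pypdf."""
    # pypdf is a third-party package; imported lazily.
    from pypdf import PdfReader

    return find_failing_pages(PdfReader(path).pages)
```

Running `check_file("SCION-book.pdf")` and `check_file("SCION-book-non-interactive.pdf")` side by side should show whether only the interactive file has pages that fail.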
Ah okay, fair enough. I can try it with pypdf, although LlamaIndex is still using PyPDF2, so idk if that will help in my case.
I was trying to extract text from a PDF using pypdf via LlamaIndex. The PDF is interactive and I have linked it below. The non-interactive version is a copy of the original PDF exported using the print function in Edge.
I noticed the non-interactive version works, but the interactive version does not.
Environment
Which environment were you using when you encountered the problem?
I was using the LlamaIndex SimpleDirectoryReader, which uses pypdf under the hood.
Code + PDF
This is a minimal, complete example that shows the issue:
Extracting text from the PDF using pypdf produces this warning: "Ignoring wrong pointing object {id} {gen} (offset {xref_entry[id]})"
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
SCION-book-non-interactive.pdf
SCION-book.pdf
Traceback
This is the complete traceback I see:
DEBUG - > [SimpleDirectoryReader] Total files added: 1
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
WARNING - Object 5615 0 not defined.
DEBUG - open file: /tmp/tmpwdnck6xk/SCION-book.pdf
WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING - Ignoring wrong pointing object 5615 0 (offset 3354947)
Failed to load file /tmp/tmpwdnck6xk/SCION-book.pdf with error: RetryError[<Future at 0x7f324a133990 state=finished raised AttributeError>]. Skipping...
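Since the same warnings repeat each time the file is opened, it can help to collapse the log before reading it. A minimal sketch, assuming the log lines have exactly the format shown above (the function name `summarize_warnings` is made up here):

```python
import re
from collections import Counter

# Matches lines such as:
#   WARNING - Ignoring wrong pointing object 5610 0 (offset 3356523)
WARNING_RE = re.compile(
    r"Ignoring wrong pointing object (\d+) (\d+) \(offset (\d+)\)"
)


def summarize_warnings(log_text):
    """Count how often each (object id, generation, offset) triple is reported."""
    counts = Counter()
    for match in WARNING_RE.finditer(log_text):
        obj_id, gen, offset = (int(group) for group in match.groups())
        counts[(obj_id, gen, offset)] += 1
    return counts
```

Applied to the traceback above, this leaves only two distinct entries, objects 5610 and 5615, each reported once per time the file was opened.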