index out of bounds in pypdf._text_extraction.handle_tj #2320

Closed
rgwood-rely opened this issue Nov 29, 2023 · 9 comments · Fixed by #2342
Labels
is-robustness-issue From a user's perspective, this is about robustness

Comments

@rgwood-rely
Contributor

While decoding a PDF, the second of these two lines raises an exception:

if orientation in orientations:
    if isinstance(operands[0], str):

In this case len(operands) == 0, so indexing operands[0] raises an IndexError.

I suggest changing the first line to:

if orientation in orientations and len(operands) > 0:
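
To illustrate the proposed guard in isolation, here is a minimal sketch. It is not pypdf's actual handle_tj; the function name append_tj_text and its parameters are hypothetical, and only the shape of the condition is taken from the lines above.

```python
from typing import Any, List, Sequence


def append_tj_text(
    text: str,
    operands: List[Any],
    orientation: int,
    orientations: Sequence[int],
) -> str:
    """Hypothetical, simplified stand-in for the guarded branch.

    Appends the first operand to ``text`` when it is a string, and leaves
    ``text`` unchanged when the operand list is empty.
    """
    # The extra length check keeps operands[0] from raising IndexError
    # when a Tj operator arrives without any operands.
    if orientation in orientations and len(operands) > 0:
        if isinstance(operands[0], str):
            text += operands[0]
    return text
```

With this shape, an empty operands list simply falls through and the text collected so far is returned unchanged.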

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.1.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('cryptography', '3.3.2'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

# sorry; PDF is confidential

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

Traceback

This is the complete traceback I see:

<our software>
    page_text = page_obj.extract_text()
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2279, in extract_text
    return self._extract_text(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2115, in _extract_text
    process_operation(b"Tj", operands)
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2075, in process_operation
    text, rtl_dir = handle_tj(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_text_extraction/__init__.py", line 220, in handle_tj
    if isinstance(operands[0], str):
IndexError: list index out of range

@pubpub-zz
Collaborator

pubpub-zz commented Nov 29, 2023

Without the PDF we cannot analyse anything. If you agree, please email it privately to @MartinThoma (info@martin-thoma.de).
We will not disclose it.

@rgwood-rely
Contributor Author

rgwood-rely commented Nov 30, 2023

Is there a way to cut specific pages from a PDF? I tried:

qpdf input.pdf --pages . 1-7 -- output.pdf

But the error no longer occurred with the resulting PDF.

@pubpub-zz
Collaborator

pubpub-zz commented Nov 30, 2023

Try this:

import pypdf

w = pypdf.PdfWriter()
# pages holds the zero-based indices of the pages to copy
w.append("pdf_to_read.pdf", pages=[0, 2, 5])
w.write("output.pdf")
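
To decide which page indices to pass to append, it can help to first find the pages on which text extraction fails. The following is a rough sketch under that assumption; find_failing_pages and the PageLike protocol are hypothetical helpers, not part of pypdf, and they only rely on each page object offering an extract_text() method.

```python
from typing import Iterable, List, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def find_failing_pages(pages: Iterable[PageLike]) -> List[int]:
    """Return the zero-based indices of pages whose extraction raises IndexError."""
    failing: List[int] = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except IndexError:
            # Remember the page and keep scanning the rest of the document.
            failing.append(index)
    return failing
```

Assuming the file from the snippet above, calling find_failing_pages(pypdf.PdfReader("pdf_to_read.pdf").pages) should give a list of indices that can be passed on as the pages to extract.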

@rgwood-rely
Contributor Author

That worked!

@rgwood-rely
Contributor Author

Out of an abundance of caution, I sent a redacted PDF page to the above email address instead of attaching it here.

@rgwood-rely
Contributor Author

Link to code position:

if isinstance(operands[0], str):

@stefan6419846
Collaborator

Have you considered submitting a corresponding PR for this (the offending line was already part of your initial traceback)? I cannot debug this without a PDF file, but it looks like we could return early here when the operands list is empty.

@rgwood-rely
Contributor Author

Hi Stefan. I've sent a PDF to info@martin-thoma.de as requested and, yes, this should be an easy fix. I will try to create a PR when I find a moment.

@rgwood-rely
Contributor Author

PR: rgwood-rely:rgwood/2320_fix_index_out_of_bounds_in_handle_tj


4 participants