index out of bounds in pypdf._text_extraction.handle_tj #2320

rgwood-rely · 2023-11-29T02:53:53Z

When extracting text from a PDF, the second of these two lines fails:

if orientation in orientations:
    if isinstance(operands[0], str):

Here len(operands) == 0, so operands[0] raises an IndexError.

The first line should be changed to:

if orientation in orientations and len(operands) > 0:
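For illustration, here is a minimal sketch of how the extra length check behaves. The function first_operand_is_str is a hypothetical, simplified stand-in and not pypdf's actual handle_tj; it only mirrors the two conditions quoted above.

```python
def first_operand_is_str(operands, orientation, orientations):
    """Hypothetical stand-in for the check inside handle_tj (not pypdf's code)."""
    # The len() guard short-circuits before operands[0] is evaluated,
    # so an empty operands list no longer raises IndexError.
    if orientation in orientations and len(operands) > 0:
        return isinstance(operands[0], str)
    return False


# An empty operands list is now skipped instead of raising.
print(first_operand_is_str([], 0, (0, 90, 180, 270)))
print(first_operand_is_str(["abc"], 0, (0, 90, 180, 270)))
```

Whether an empty operands list should be skipped silently or logged is a design choice this sketch does not settle.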

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.1.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('cryptography', '3.3.2'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

# sorry; PDF is confidential

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

Traceback

This is the complete traceback I see:

<our software>
    page_text = page_obj.extract_text()
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2279, in extract_text
    return self._extract_text(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2115, in _extract_text
    process_operation(b"Tj", operands)
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_page.py", line 2075, in process_operation
    text, rtl_dir = handle_tj(
  File "/Users/rgwood/mambaforge/envs/e_16730_w3/lib/python3.9/site-packages/pypdf/_text_extraction/__init__.py", line 220, in handle_tj
    if isinstance(operands[0], str):
IndexError: list index out of range

pubpub-zz · 2023-11-29T06:14:20Z

Without the PDF we cannot analyse anything. If you agree, please email it privately to @MartinThoma (info@martin-thoma.de).
We will not disclose it.

rgwood-rely · 2023-11-30T17:52:55Z

Is there a way to cut specific pages from a pdf? I tried:

qpdf input.pdf --pages . 1-7 -- output.pdf

But the error was not present in the resultant pdf.

pubpub-zz · 2023-11-30T18:14:44Z

Try:

import pypdf

w = pypdf.PdfWriter()
# pages takes the list of zero-based page indices to extract
w.append("pdf_to_read.pdf", pages=[0, 2, 5])
w.write("output.pdf")
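If it is not known which pages trigger the error, a rough sketch like the following could locate them first and then write only those pages out. The helper names failing_page_indices and write_failing_pages are made up for this example; the sketch assumes the failure always surfaces as an IndexError from extract_text(), as in the traceback above.

```python
def failing_page_indices(pages):
    """Return zero-based indices of pages whose extract_text() raises IndexError."""
    failing = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except IndexError:
            failing.append(index)
    return failing


def write_failing_pages(src, dst):
    """Copy only the failing pages of src into dst and return their indices."""
    # Imported here so the helper above stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    indices = failing_page_indices(PdfReader(src).pages)
    if indices:
        writer = PdfWriter()
        writer.append(src, pages=indices)
        writer.write(dst)
    return indices
```

Calling write_failing_pages("pdf_to_read.pdf", "output.pdf") would then produce a smaller file to share, provided the error still reproduces after the pages are copied.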

rgwood-rely · 2023-11-30T22:41:58Z

That worked!

rgwood-rely · 2023-12-01T00:31:19Z

I sent a redacted pdf page to the above email instead of attaching here out of an abundance of caution.

rgwood-rely · 2023-12-11T17:01:43Z

Link to code position:

pypdf/pypdf/_text_extraction/__init__.py

Line 220 in 38795f5

if isinstance(operands[0], str):

stefan6419846 · 2023-12-11T17:14:47Z

Have you considered submitting a corresponding PR for this (the offending line has already been part of your initial traceback)? I cannot debug this without a PDF file, but it looks like we can have an early return due to an empty operands list here.

rgwood-rely · 2023-12-13T23:30:25Z

Hi Stefan. I've sent a PDF to info@martin-thoma.de as requested and yes this should be an easy fix. Will try and create a PR if I find a moment.

rgwood-rely · 2023-12-13T23:56:33Z

PR: rgwood-rely:rgwood/2320_fix_index_out_of_bounds_in_handle_tj

Closes #2320

rgwood-rely mentioned this issue Dec 14, 2023

ROB: Out-of-bounds issue in handle_tj (text extraction) #2342

Merged

MartinThoma added the is-robustness-issue label ("From a user's perspective, this is about robustness") Dec 14, 2023

MartinThoma closed this as completed in #2342 Dec 14, 2023

MartinThoma pushed a commit that referenced this issue Dec 14, 2023

ROB: Out-of-bounds issue in handle_tj (text extraction) (#2342)

40bc577

Closes #2320
